KicktippAi experiment analysis
gpt-5.4-nano (none) vs gpt-5.5 (none)
Task: repeated-match-slice
Primary metric: avg_kicktipp_points
Runs: 2
Pairings: 10
Compact head to head
Significant, p-value 0.0234
gpt-5.4-nano (none): 18.9000 avg points
gpt-5.5 (none): 16.2000 avg points
Prediction distribution
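gpt-5.4-nano (none), n=150
| Prediction | Count |
| 2:1 | 53 |
| 1:1 | 37 |
| 1:2 | 29 |
| 1:3 | 12 |
| 0:2 | 7 |
| 2:0 | 5 |
| 1:0 | 3 |
| 1:4 | 2 |
| 3:1 | 2 |
gpt-5.5 (none), n=150
| Prediction | Count |
| 1:1 | 57 |
| 2:1 | 56 |
| 1:2 | 18 |
| 1:3 | 13 |
| 2:0 | 4 |
| 1:4 | 2 |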
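The prediction counts above can be collapsed into tendencies (home win, draw, away win) to compare how the two models spread their tips. The following is a minimal sketch, assuming each score is written as home:away; that orientation is not stated in the report, and the helper names `tendency` and `tendency_counts` are hypothetical.

```python
from collections import Counter

# Prediction counts copied from the "Prediction distribution" section.
DISTRIBUTIONS = {
    "gpt-5.4-nano (none)": {
        "2:1": 53, "1:1": 37, "1:2": 29, "1:3": 12, "0:2": 7,
        "2:0": 5, "1:0": 3, "1:4": 2, "3:1": 2,
    },
    "gpt-5.5 (none)": {
        "1:1": 57, "2:1": 56, "1:2": 18, "1:3": 13, "2:0": 4, "1:4": 2,
    },
}


def tendency(score: str) -> str:
    """Classify a score string, ASSUMING the format is home:away."""
    home, away = (int(goals) for goals in score.split(":"))
    if home > away:
        return "home"
    if home < away:
        return "away"
    return "draw"


def tendency_counts(distribution: dict) -> dict:
    """Sum prediction counts per tendency."""
    totals = Counter()
    for score, count in distribution.items():
        totals[tendency(score)] += count
    return dict(totals)


for model, distribution in DISTRIBUTIONS.items():
    counts = tendency_counts(distribution)
    n = sum(counts.values())
    shares = {name: round(count / n, 3) for name, count in counts.items()}
    print(model, n, counts, shares)
```

Under that assumption, gpt-5.5 (none) tips a draw in 57 of 150 predictions, compared with 37 of 150 for gpt-5.4-nano (none).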
Matches
15 fixtures
| # | Model | n | Avg points | Predictions (points each) × count |
| 1 | gpt-5.4-nano (none) | 10 | 2.0000 | 1:3 (2pt) ×8, 1:4 (2pt) ×2 |
| 1 | gpt-5.5 (none) | 10 | 2.0000 | 1:3 (2pt) ×8, 1:4 (2pt) ×2 |
| 2 | gpt-5.4-nano (none) | 10 | 2.1000 | 2:1 (2pt) ×9, 3:1 (3pt) ×1 |
| 2 | gpt-5.5 (none) | 10 | 2.0000 | 2:1 (2pt) ×10 |
| 3 | gpt-5.4-nano (none) | 10 | 0.0000 | 1:1 (0pt) ×10 |
| 3 | gpt-5.5 (none) | 10 | 0.0000 | 1:1 (0pt) ×10 |
| 4 | gpt-5.4-nano (none) | 10 | 0.0000 | 0:2 (0pt) ×5, 1:3 (0pt) ×3, 1:2 (0pt) ×2 |
| 4 | gpt-5.5 (none) | 10 | 0.0000 | 1:2 (0pt) ×5, 1:3 (0pt) ×5 |
| 5 | gpt-5.4-nano (none) | 10 | 0.0000 | 1:2 (0pt) ×7, 0:2 (0pt) ×2, 1:3 (0pt) ×1 |
| 5 | gpt-5.5 (none) | 10 | 0.0000 | 1:2 (0pt) ×10 |
| 6 | gpt-5.4-nano (none) | 10 | 2.0000 | 1:1 (2pt) ×10 |
| 6 | gpt-5.5 (none) | 10 | 2.0000 | 1:1 (2pt) ×10 |
| 7 | gpt-5.4-nano (none) | 10 | 3.0000 | 2:1 (3pt) ×10 |
| 7 | gpt-5.5 (none) | 10 | 3.0000 | 2:1 (3pt) ×10 |
| 8 | gpt-5.4-nano (none) | 10 | 0.0000 | 2:0 (0pt) ×5, 2:1 (0pt) ×4, 3:1 (0pt) ×1 |
| 8 | gpt-5.5 (none) | 10 | 0.0000 | 2:1 (0pt) ×10 |
| 9 | gpt-5.4-nano (none) | 10 | 0.0000 | 1:2 (0pt) ×10 |
| 9 | gpt-5.5 (none) | 10 | 2.0000 | 1:1 (2pt) ×10 |
| 10 | gpt-5.4-nano (none) | 10 | 4.0000 | 1:2 (4pt) ×10 |
| 10 | gpt-5.5 (none) | 10 | 1.2000 | 1:1 (0pt) ×7, 1:2 (4pt) ×3 |
| 11 | gpt-5.4-nano (none) | 10 | 2.0000 | 2:1 (2pt) ×10 |
| 11 | gpt-5.5 (none) | 10 | 2.0000 | 2:1 (2pt) ×10 |
| 12 | gpt-5.4-nano (none) | 10 | 2.0000 | 2:1 (2pt) ×10 |
| 12 | gpt-5.5 (none) | 10 | 2.0000 | 2:1 (2pt) ×7, 2:0 (2pt) ×3 |
| 13 | gpt-5.4-nano (none) | 10 | 0.6000 | 2:1 (0pt) ×7, 1:1 (2pt) ×3 |
| 13 | gpt-5.5 (none) | 10 | 0.0000 | 2:1 (0pt) ×9, 2:0 (0pt) ×1 |
| 14 | gpt-5.4-nano (none) | 10 | 1.2000 | 1:1 (0pt) ×4, 1:0 (2pt) ×3, 2:1 (2pt) ×3 |
| 14 | gpt-5.5 (none) | 10 | 0.0000 | 1:1 (0pt) ×10 |
| 15 | gpt-5.4-nano (none) | 10 | 0.0000 | 1:1 (0pt) ×10 |
| 15 | gpt-5.5 (none) | 10 | 0.0000 | 1:1 (0pt) ×10 |
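The 0pt, 2pt, 3pt and 4pt labels in the Matches section are consistent with a common Kicktipp scheme: 4 points for the exact result, 3 for the correct goal difference of a non-draw, 2 for the correct tendency only, and 0 otherwise. The report does not state its scoring rule, and Kicktipp communities can configure their own, so the sketch below is an assumption. The function name `kicktipp_points` and its parameters are hypothetical.

```python
def kicktipp_points(pred, actual, tendency_pts=2, diff_pts=3, exact_pts=4):
    """Score one prediction under an ASSUMED 2/3/4 Kicktipp-style rule.

    pred and actual are (home_goals, away_goals) tuples.
    """
    if pred == actual:
        return exact_pts

    pred_diff = pred[0] - pred[1]
    actual_diff = actual[0] - actual[1]

    def sign(x):
        return (x > 0) - (x < 0)

    # A different sign of the goal difference means the wrong tendency.
    if sign(pred_diff) != sign(actual_diff):
        return 0

    # Goal-difference points apply only to non-draws in this assumed scheme;
    # a draw tipped with the wrong score earns tendency points.
    if actual_diff != 0 and pred_diff == actual_diff:
        return diff_pts
    return tendency_pts


# Illustrative calls with made-up results (not taken from the report):
print(kicktipp_points((1, 2), (1, 2)))  # exact result
print(kicktipp_points((2, 1), (1, 0)))  # same goal difference
print(kicktipp_points((2, 1), (3, 0)))  # tendency only
print(kicktipp_points((1, 1), (2, 1)))  # wrong tendency
```

Under this scheme a fixture where 2:1 earns 3pt would be a one-goal home win with a different exact score, which matches how the point labels vary between fixtures above.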
Summary
Dataset: match-predictions/bundesliga-2025-26/pes-squad/repeated-match-slices/all-matchdays-after-20251202t230000z/random-15x10-seed-20260517-after-20251203
Task type: repeated-match-slice
Primary metric: avg_kicktipp_points
Alpha: 0.0500
Dataset metadata
| Field | Value |
| Competition | bundesliga-2025-26 |
| Community | pes-squad |
| Season | 2025/2026 |
| Slice | random-15x10-seed-20260517-after-20251203 |
| Source Pool | all-matchdays-after-20251202t230000z |
| Matches | 15 |
| Repetitions | 10 |
| Predictions | 150 |
| Sample Size | 150 |
| Sample Method | repeated-match-slice |
| Sample Seed | 20260517 |
| Scope | repeated-match-slice |
| Slice Kind | repeated-match-slice |
| Source Dataset | match-predictions/bundesliga-2025-26/pes-squad |
| Starts After | 2025-12-03T00:00:00 Europe/Berlin (+01) |
| Rank | Run | Model | Primary metric |
| 1 | gpt-5.4-nano (none) | gpt-5.4-nano (none) | 18.9000 |
| 2 | gpt-5.5 (none) | gpt-5.5 (none) | 16.2000 |
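The primary metric values in this table equal the sum of the per-fixture average points listed under Matches. That suggests avg_kicktipp_points is the average total over the 15 fixtures per repetition, although the report does not define the metric explicitly. A small check of that reading, using only values from this report (the variable names are hypothetical):

```python
# Per-fixture average points, in the order the fixtures appear under "Matches".
PER_FIXTURE_AVG = {
    "gpt-5.4-nano (none)": [2.0, 2.1, 0.0, 0.0, 0.0, 2.0, 3.0, 0.0,
                            0.0, 4.0, 2.0, 2.0, 0.6, 1.2, 0.0],
    "gpt-5.5 (none)": [2.0, 2.0, 0.0, 0.0, 0.0, 2.0, 3.0, 0.0,
                       2.0, 1.2, 2.0, 2.0, 0.0, 0.0, 0.0],
}

# Summing per-fixture averages gives the average total per repetition.
totals = {model: round(sum(values), 4) for model, values in PER_FIXTURE_AVG.items()}
delta = round(totals["gpt-5.4-nano (none)"] - totals["gpt-5.5 (none)"], 4)

print(totals)
print(delta)
```

The totals come out at 18.9 and 16.2, with a delta of 2.7, matching the leaderboard and the reported mean difference.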
Better run: gpt-5.4-nano (none)
Other run: gpt-5.5 (none)
avg_kicktipp_points delta: 2.7000
Wilcoxon p-value: 0.0234
Mean difference: 2.7000
Median difference: 3.5000
Per-item W/T/L: 8/0/2
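For readers unfamiliar with the test named above, here is a minimal pure-Python sketch of a two-sided exact Wilcoxon signed-rank test on paired differences. The per-pairing differences behind this report are not included in it, so the example input is invented toy data. The report's tool may also use a different variant (for example a normal approximation or another tie treatment), so this sketch is not expected to reproduce 0.0234. The function name `wilcoxon_signed_rank_exact` is hypothetical.

```python
from itertools import product


def wilcoxon_signed_rank_exact(diffs):
    """Return (W+, two-sided exact p-value) for paired differences.

    Zero differences are dropped; tied absolute values get average ranks.
    Enumerates all 2**n sign assignments, so it is only suitable for small n.
    """
    d = [x for x in diffs if x != 0]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))

    # Assign average ranks to runs of equal absolute differences.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    total = sum(ranks)
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    w_observed = min(w_plus, total - w_plus)

    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= w_observed:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Toy differences (NOT the report's data): 8 positive, 2 negative pairings.
toy_diffs = [3, 4, 5, 6, 7, 8, 9, 10, -1, -2]
w_plus, p_value = wilcoxon_signed_rank_exact(toy_diffs)
print(w_plus, p_value)
```

Because the test ranks the magnitudes of the differences, two small losses among eight larger wins can still give a p-value below the report's alpha of 0.05.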
Effect size confidence intervals
| Statistic | Point estimate | Low | High |
| Mean difference | 2.7000 | 1.1000 | 4.5000 |
| Median difference | 3.5000 | 2.0000 | 7.0000 |