KicktippAi experiment analysis
o3 (medium) vs gpt-5.5 (medium)
Task: repeated-match-slice
Primary metric: avg_kicktipp_points
Runs: 2
Pairings: 10
Compact head to head
Significant (p-value: 0.0020)
| Model | avg points |
| --- | --- |
| o3 (medium) | 19.8000 |
| gpt-5.5 (medium) | 15.0000 |
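Prediction distribution
o3 (medium), n=150
| Predicted score | Count |
| --- | --- |
| 2:1 | 40 |
| 1:2 | 38 |
| 2:0 | 16 |
| 3:1 | 11 |
| 0:1 | 9 |
| 0:2 | 9 |
| 1:1 | 8 |
| 1:3 | 8 |
| 0:3 | 6 |
| 1:0 | 2 |
| 1:4 | 2 |
| 3:2 | 1 |

gpt-5.5 (medium), n=150
| Predicted score | Count |
| --- | --- |
| 1:1 | 55 |
| 2:1 | 47 |
| 1:2 | 17 |
| 2:0 | 12 |
| 0:3 | 10 |
| 0:2 | 5 |
| 1:3 | 3 |
| 3:1 | 1 |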
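
The two distributions above are easier to compare once collapsed into tendencies (home win, draw, away win). The following is a minimal Python sketch of that aggregation. It assumes each predicted score is written as home:away, which the report does not state explicitly; the names `O3`, `GPT55` and `tendency_counts` are illustrative and not part of KicktippAi.

```python
from collections import Counter

# Predicted score -> count, transcribed from the distribution above.
O3 = {"2:1": 40, "1:2": 38, "2:0": 16, "3:1": 11, "0:1": 9, "0:2": 9,
      "1:1": 8, "1:3": 8, "0:3": 6, "1:0": 2, "1:4": 2, "3:2": 1}
GPT55 = {"1:1": 55, "2:1": 47, "1:2": 17, "2:0": 12, "0:3": 10,
         "0:2": 5, "1:3": 3, "3:1": 1}


def tendency_counts(distribution):
    """Collapse score counts into home win / draw / away win counts.

    Assumption: scores are formatted as "home:away".
    """
    counts = Counter({"home win": 0, "draw": 0, "away win": 0})
    for score, n in distribution.items():
        home, away = (int(goals) for goals in score.split(":"))
        if home > away:
            counts["home win"] += n
        elif home < away:
            counts["away win"] += n
        else:
            counts["draw"] += n
    return dict(counts)


for name, dist in (("o3 (medium)", O3), ("gpt-5.5 (medium)", GPT55)):
    print(name, tendency_counts(dist))
```

Under that assumption, o3 (medium) predicts a draw in 8 of 150 cases, while gpt-5.5 (medium) predicts a draw in 55 of 150, all of them 1:1.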
Matches
15 fixtures, listed in page order; each model has n=10 predictions per fixture. Predictions are given as score (points) × count.
| Fixture | Model | avg points | Predictions |
| --- | --- | --- | --- |
| 1 | o3 (medium) | 2.0000 | 0:3 (2pt) × 6, 1:3 (2pt) × 2, 1:4 (2pt) × 2 |
| 1 | gpt-5.5 (medium) | 2.0000 | 0:3 (2pt) × 10 |
| 2 | o3 (medium) | 2.0000 | 2:1 (2pt) × 9, 3:2 (2pt) × 1 |
| 2 | gpt-5.5 (medium) | 2.0000 | 2:1 (2pt) × 10 |
| 3 | o3 (medium) | 0.0000 | 1:2 (0pt) × 8, 0:1 (0pt) × 1, 1:1 (0pt) × 1 |
| 3 | gpt-5.5 (medium) | 0.0000 | 1:1 (0pt) × 8, 1:2 (0pt) × 2 |
| 4 | o3 (medium) | 0.0000 | 1:3 (0pt) × 6, 0:2 (0pt) × 2, 1:2 (0pt) × 2 |
| 4 | gpt-5.5 (medium) | 0.0000 | 1:2 (0pt) × 7, 1:3 (0pt) × 3 |
| 5 | o3 (medium) | 0.0000 | 0:2 (0pt) × 7, 1:2 (0pt) × 2, 0:1 (0pt) × 1 |
| 5 | gpt-5.5 (medium) | 0.0000 | 0:2 (0pt) × 5, 1:2 (0pt) × 5 |
| 6 | o3 (medium) | 0.4000 | 1:2 (0pt) × 7, 1:1 (2pt) × 2, 0:1 (0pt) × 1 |
| 6 | gpt-5.5 (medium) | 1.8000 | 1:1 (2pt) × 9, 1:2 (0pt) × 1 |
| 7 | o3 (medium) | 2.7000 | 2:1 (3pt) × 7, 3:1 (2pt) × 3 |
| 7 | gpt-5.5 (medium) | 3.0000 | 2:1 (3pt) × 10 |
| 8 | o3 (medium) | 0.0000 | 2:1 (0pt) × 6, 3:1 (0pt) × 4 |
| 8 | gpt-5.5 (medium) | 0.0000 | 2:1 (0pt) × 10 |
| 9 | o3 (medium) | 0.2000 | 1:2 (0pt) × 4, 0:1 (0pt) × 3, 2:1 (0pt) × 2, 1:1 (2pt) × 1 |
| 9 | gpt-5.5 (medium) | 1.8000 | 1:1 (2pt) × 9, 1:2 (0pt) × 1 |
| 10 | o3 (medium) | 3.2000 | 1:2 (4pt) × 8, 1:1 (0pt) × 2 |
| 10 | gpt-5.5 (medium) | 0.4000 | 1:1 (0pt) × 9, 1:2 (4pt) × 1 |
| 11 | o3 (medium) | 2.0000 | 2:1 (2pt) × 8, 3:1 (2pt) × 2 |
| 11 | gpt-5.5 (medium) | 2.0000 | 2:1 (2pt) × 10 |
| 12 | o3 (medium) | 2.0000 | 2:0 (2pt) × 8, 3:1 (2pt) × 2 |
| 12 | gpt-5.5 (medium) | 2.0000 | 2:1 (2pt) × 7, 2:0 (2pt) × 2, 3:1 (2pt) × 1 |
| 13 | o3 (medium) | 0.0000 | 2:0 (0pt) × 8, 1:0 (0pt) × 1, 2:1 (0pt) × 1 |
| 13 | gpt-5.5 (medium) | 0.0000 | 2:0 (0pt) × 10 |
| 14 | o3 (medium) | 1.6000 | 2:1 (2pt) × 7, 1:1 (0pt) × 2, 1:0 (2pt) × 1 |
| 14 | gpt-5.5 (medium) | 0.0000 | 1:1 (0pt) × 10 |
| 15 | o3 (medium) | 3.7000 | 1:2 (4pt) × 7, 0:1 (3pt) × 3 |
| 15 | gpt-5.5 (medium) | 0.0000 | 1:1 (0pt) × 10 |
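
The headline avg_kicktipp_points values can be cross-checked against the fixture cards above: weight each predicted score's points by its count, sum over all 15 fixtures, and divide by the 10 repetitions. The sketch below does this in Python. The (points, count) pairs are transcribed by hand from the cards in page order, and the names `O3_CARDS`, `GPT55_CARDS` and `avg_points_per_repetition` are illustrative, not part of KicktippAi.

```python
# One inner list per fixture, in page order; each pair is (points, count).
O3_CARDS = [
    [(2, 6), (2, 2), (2, 2)], [(2, 9), (2, 1)], [(0, 8), (0, 1), (0, 1)],
    [(0, 6), (0, 2), (0, 2)], [(0, 7), (0, 2), (0, 1)], [(0, 7), (2, 2), (0, 1)],
    [(3, 7), (2, 3)], [(0, 6), (0, 4)], [(0, 4), (0, 3), (0, 2), (2, 1)],
    [(4, 8), (0, 2)], [(2, 8), (2, 2)], [(2, 8), (2, 2)],
    [(0, 8), (0, 1), (0, 1)], [(2, 7), (0, 2), (2, 1)], [(4, 7), (3, 3)],
]
GPT55_CARDS = [
    [(2, 10)], [(2, 10)], [(0, 8), (0, 2)],
    [(0, 7), (0, 3)], [(0, 5), (0, 5)], [(2, 9), (0, 1)],
    [(3, 10)], [(0, 10)], [(2, 9), (0, 1)],
    [(0, 9), (4, 1)], [(2, 10)], [(2, 7), (2, 2), (2, 1)],
    [(0, 10)], [(0, 10)], [(0, 10)],
]


def total_points(cards):
    """Total Kicktipp points over all fixtures and repetitions."""
    return sum(points * count for fixture in cards for points, count in fixture)


def avg_points_per_repetition(cards, repetitions=10):
    """Average points per repetition of the 15-fixture slice."""
    # Every fixture must contain exactly one prediction per repetition.
    assert all(sum(c for _, c in fixture) == repetitions for fixture in cards)
    return total_points(cards) / repetitions


print(f"o3 (medium):      {avg_points_per_repetition(O3_CARDS):.4f}")
print(f"gpt-5.5 (medium): {avg_points_per_repetition(GPT55_CARDS):.4f}")
```

The totals are 198 and 150 points over 10 repetitions, which reproduces the reported 19.8000 and 15.0000.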
Summary
Dataset: match-predictions/bundesliga-2025-26/pes-squad/repeated-match-slices/all-matchdays-after-20251202t230000z/random-15x10-seed-20260517-after-20251203
Task type: repeated-match-slice
Primary metric: avg_kicktipp_points
Alpha: 0.0500
Dataset metadata
| Field | Value |
| --- | --- |
| Competition | bundesliga-2025-26 |
| Community | pes-squad |
| Season | 2025/2026 |
| Slice | random-15x10-seed-20260517-after-20251203 |
| Source Pool | all-matchdays-after-20251202t230000z |
| Matches | 15 |
| Repetitions | 10 |
| Predictions | 150 |
| Sample Size | 150 |
| Sample Method | repeated-match-slice |
| Sample Seed | 20260517 |
| Scope | repeated-match-slice |
| Slice Kind | repeated-match-slice |
| Source Dataset | match-predictions/bundesliga-2025-26/pes-squad |
| Starts After | 2025-12-03T00:00:00 Europe/Berlin (+01) |
| Rank | Run | Model | Primary metric |
| --- | --- | --- | --- |
| 1 | o3 (medium) | o3 (medium) | 19.8000 |
| 2 | gpt-5.5 (medium) | gpt-5.5 (medium) | 15.0000 |
Better run: o3 (medium)
Other run: gpt-5.5 (medium)
avg_kicktipp_points delta: 4.8000
Wilcoxon p-value: 0.0020
Mean difference: 4.8000
Median difference: 6.0000
Per-item W/T/L: 10/0/0
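
The reported p-value is consistent with the 10/0/0 record. When all 10 paired differences favor the same run, the positive-rank sum takes its maximum value, and only 1 of the 2^10 equally likely sign assignments reaches it, so a two-sided exact test gives 2/1024 ≈ 0.0020. The Python sketch below enumerates that null distribution. It assumes an exact two-sided Wilcoxon signed-rank test with no zero differences; the report does not say which variant was used, and `exact_two_sided_p` is an illustrative helper, not the tool's own code.

```python
from itertools import product


def exact_two_sided_p(ranks, observed_w_plus):
    """Exact two-sided signed-rank p-value by enumerating all sign patterns.

    ranks: ranks of the absolute paired differences.
    observed_w_plus: observed sum of ranks belonging to positive differences.
    """
    at_least = at_most = total = 0
    for signs in product((0, 1), repeat=len(ranks)):
        w_plus = sum(rank for rank, positive in zip(ranks, signs) if positive)
        at_least += w_plus >= observed_w_plus
        at_most += w_plus <= observed_w_plus
        total += 1
    return min(1.0, 2 * min(at_least, at_most) / total)


# 10 pairings, all won by the same run: every rank counts toward W+.
# Ranks 1..10 assume distinct absolute differences; with ties the ranks
# change, but an all-positive sample still sits alone at the maximum.
ranks = list(range(1, 11))
p_value = exact_two_sided_p(ranks, observed_w_plus=sum(ranks))
print(f"{p_value:.4f}")  # 0.0020
```

With only 10 pairings this is the smallest two-sided p-value the exact test can produce, so it reflects the clean sweep rather than the size of the 4.8-point gap.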
Effect size confidence intervals
| Statistic | Point estimate | Low | High |
| --- | --- | --- | --- |
| Mean difference | 4.8000 | 3.5000 | 6.2000 |
| Median difference | 6.0000 | 5.0000 | 9.0000 |