A Friedman test is applied across all paired runs; pairwise Wilcoxon signed-rank tests use the Holm correction, and bootstrap confidence intervals are computed for the paired differences.
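The report does not say how the bootstrap intervals are computed. As a minimal sketch, assuming a percentile bootstrap over the per-item paired differences, the idea looks roughly like this. The function name `paired_bootstrap_ci`, its parameters, and the sample scores are all hypothetical and are not taken from the report.

```python
import random
from statistics import mean


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (a - b).

    Assumption: the report's exact bootstrap variant is unknown; this is the
    plain percentile method applied to per-item differences.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    boot_means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return mean(diffs), boot_means[lo_idx], boot_means[hi_idx]


# Made-up per-item points for two hypothetical runs (not values from the report).
run_a = [4, 2, 0, 3, 4, 2, 2, 0, 3, 4]
run_b = [2, 2, 0, 0, 3, 2, 0, 0, 2, 3]
point, lower, upper = paired_bootstrap_ci(run_a, run_b)
print(f"mean diff {point:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Resampling the differences rather than the two score lists independently keeps each item's pairing intact, which is what makes the interval a statement about the paired difference.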
Dataset metadata
repeated-match-slice dataset for 150 item(s) on random-15x10-seed-20260517-after-20251203
| Field | Value |
| --- | --- |
| Competition | bundesliga-2025-26 |
| Community | pes-squad |
| Season | 2025/2026 |
| Slice | random-15x10-seed-20260517-after-20251203 |
| Source Pool | all-matchdays-after-20251202t230000z |
| Matches | 15 |
| Repetitions | 10 |
| Predictions | 150 |
| Sample Size | 150 |
| Sample Method | repeated-match-slice |
| Sample Seed | 20260517 |
| Scope | repeated-match-slice |
| Slice Kind | repeated-match-slice |
| Source Dataset | match-predictions/bundesliga-2025-26/pes-squad |
| Starts After | 2025-12-03T00:00:00 Europe/Berlin (+01) |
Run ranking
| Rank | Run | Model | Primary metric |
| --- | --- | --- | --- |
| 1 | o3 (medium) | o3 (medium) | 19.8000 |
| 2 | gpt-5.5 (high) | gpt-5.5 (high) | 19.5000 |
| 3 | gpt-5.4-nano (none) | gpt-5.4-nano (none) | 18.9000 |
| 4 | gpt-5.5 (none) | gpt-5.5 (none) | 16.2000 |
| 5 | gpt-5.5 (medium) | gpt-5.5 (medium) | 15.0000 |
Multi-run comparison
Friedman p-value 0.0001
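| Run A | Run B | Difference in primary metric (A - B) | p-value | Holm-adjusted p-value | Significant | Win/Tie/Loss |
| --- | --- | --- | --- | --- | --- | --- |
| o3 (medium) | gpt-5.5 (high) | 0.3000 | 0.4102 | 1.0000 | no | 8/0/2 |
| o3 (medium) | gpt-5.4-nano (none) | 0.9000 | 0.4062 | 1.0000 | no | 5/1/4 |
| o3 (medium) | gpt-5.5 (none) | 3.6000 | 0.0195 | 0.1172 | no | 9/0/1 |
| o3 (medium) | gpt-5.5 (medium) | 4.8000 | 0.0020 | 0.0195 | yes | 10/0/0 |
| gpt-5.5 (high) | gpt-5.4-nano (none) | 0.6000 | 0.4297 | 1.0000 | no | 5/1/4 |
| gpt-5.5 (high) | gpt-5.5 (none) | 3.3000 | 0.0020 | 0.0195 | yes | 10/0/0 |
| gpt-5.5 (high) | gpt-5.5 (medium) | 4.5000 | 0.0020 | 0.0195 | yes | 10/0/0 |
| gpt-5.4-nano (none) | gpt-5.5 (none) | 2.7000 | 0.0234 | 0.1172 | no | 8/0/2 |
| gpt-5.4-nano (none) | gpt-5.5 (medium) | 3.9000 | 0.0039 | 0.0273 | yes | 9/0/1 |
| gpt-5.5 (none) | gpt-5.5 (medium) | 1.2000 | 0.3125 | 1.0000 | no | 4/5/1 |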
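To make the Holm step concrete, the following sketch applies a standard Holm step-down adjustment to the ten unadjusted pairwise p-values listed above, in the order they appear. The function name `holm_adjust` and the 0.05 threshold are assumptions; the report does not state its significance level. Because the inputs are the published values rounded to four decimals, the adjusted values can differ slightly from the report (for example 0.0200 instead of 0.0195), but the yes/no decisions come out the same.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted value never drops below an earlier one.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Unadjusted p-values of the ten pairwise comparisons, as rounded in the report.
raw_p = [0.4102, 0.4062, 0.0195, 0.0020, 0.4297,
         0.0020, 0.0020, 0.0234, 0.0039, 0.3125]

adjusted = holm_adjust(raw_p)
significant = [p < 0.05 for p in adjusted]  # 0.05 is an assumed threshold

for raw, adj, sig in zip(raw_p, adjusted, significant):
    print(f"{raw:.4f} -> {adj:.4f} {'yes' if sig else 'no'}")
```

Note that two comparisons with unadjusted p-values below 0.05 (0.0195 and 0.0234) are no longer significant after the adjustment, which is the expected cost of controlling the family-wise error rate over ten tests.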
Per-item win/tie/loss counts compare paired Kicktipp points for the listed run ordering on each prepared dataset item.
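A small sketch of how such a win/tie/loss string could be derived from two paired lists of points. The helper name `win_tie_loss` and the sample points are hypothetical; the counts are taken from the perspective of the first-listed run.

```python
def win_tie_loss(points_a, points_b):
    """Count items where run A scores more, the same, or fewer points than run B."""
    if len(points_a) != len(points_b):
        raise ValueError("paired samples must have equal length")
    pairs = list(zip(points_a, points_b))
    wins = sum(a > b for a, b in pairs)
    ties = sum(a == b for a, b in pairs)
    losses = sum(a < b for a, b in pairs)
    return f"{wins}/{ties}/{losses}"


# Made-up paired points for illustration only (not values from the report).
print(win_tie_loss([4, 2, 0, 3], [2, 2, 0, 4]))
```

Swapping the two arguments swaps the win and loss counts, which is why the listed run ordering matters when reading the column.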