KicktippAi experiment analysis

gpt-5.5 (xhigh) vs gpt-5-nano

match-predictions/bundesliga-2025-26/pes-squad/repeated-match/md01-fc-bayern-munchen-vs-rb-leipzig/repeat-25-knowledge-cutoff-bayern-rbl-md1

Task: repeated-match · Primary metric: avg_kicktipp_points · Runs: 2 · Pairings: 25

At a glance

not significant · p-value 0.1573

Match to predict

FC Bayern München vs RB Leipzig

Matchday 1 · 2025-08-22T21:30:00 UTC+02
Actual outcome: FC Bayern München 6 - 0 RB Leipzig

Compact head to head

Not significant · p-value 0.1573

gpt-5.5 (xhigh): 2.1600 avg points
gpt-5-nano: 2.0000 avg points

Prediction distribution

gpt-5.5 (xhigh), n=25

Predicted score  Count
3:1              20
2:1              3
6:0              2

gpt-5-nano, n=25

Predicted score  Count
2:1              20
3:1              4
3:2              1
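The averages in the head-to-head panel can be recomputed from these distributions. The sketch below assumes a 4/3/2 scoring rule (exact score, correct goal difference, correct tendency). The report does not state this rule, but it reproduces the listed 2.1600 and 2.0000. The function and constant names are illustrative, not taken from the KicktippAi code.

```python
# Assumed scoring rule (not stated in the report): 4 points for the exact
# score, 3 for the correct goal difference, 2 for the correct tendency.
EXACT, GOAL_DIFF, TENDENCY = 4, 3, 2


def kicktipp_points(pred, actual):
    """Score one predicted (home, away) result against the actual result."""
    pred_diff = pred[0] - pred[1]
    actual_diff = actual[0] - actual[1]
    if pred == actual:
        return EXACT
    # Goal-difference points only apply to non-draws in this sketch.
    if pred_diff == actual_diff and actual_diff != 0:
        return GOAL_DIFF
    # Same tendency: both home wins, both away wins, or both draws.
    if (pred_diff > 0) == (actual_diff > 0) and (pred_diff < 0) == (actual_diff < 0):
        return TENDENCY
    return 0


def average_points(distribution, actual):
    """Average points over a {score: count} prediction distribution."""
    total = sum(kicktipp_points(score, actual) * count
                for score, count in distribution.items())
    return total / sum(distribution.values())


ACTUAL = (6, 0)
GPT_55 = {(3, 1): 20, (2, 1): 3, (6, 0): 2}
GPT_5_NANO = {(2, 1): 20, (3, 1): 4, (3, 2): 1}

print(f"gpt-5.5 (xhigh): {average_points(GPT_55, ACTUAL):.4f}")
print(f"gpt-5-nano:      {average_points(GPT_5_NANO, ACTUAL):.4f}")
```

Under this rule, the gap between the two runs comes entirely from the two exact 6:0 predictions; every other prediction earns tendency points only.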

Summary

Dataset: match-predictions/bundesliga-2025-26/pes-squad/repeated-match/md01-fc-bayern-munchen-vs-rb-leipzig/repeat-25-knowledge-cutoff-bayern-rbl-md1
Task type: repeated-match
Primary metric: avg_kicktipp_points
Alpha: 0.0500

Paired Wilcoxon signed-rank test on per-item Kicktipp-point differences; bootstrap confidence intervals summarize mean and median paired differences.
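As a rough illustration of the test, the sketch below computes a two-sided Wilcoxon signed-rank p-value with the normal approximation, dropping zero differences and applying a tie correction. The report does not say which variant it uses; this one is an assumption that happens to reproduce the reported 0.1573. The paired differences are reconstructed from the reported W/T/L of 2/23/0 and mean difference of 0.16, that is, two items at +2 points and 23 at 0.

```python
import math


def wilcoxon_signed_rank_p(diffs):
    """Two-sided p-value, normal approximation, zero differences dropped."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    # Rank the absolute differences, averaging ranks inside tie groups.
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        group_size = j - i + 1
        tie_term += group_size ** 3 - group_size
        i = j + 1
    # Sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mean = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))


# Reconstructed paired differences (gpt-5.5 (xhigh) minus gpt-5-nano).
DIFFS = [2.0] * 2 + [0.0] * 23

print(f"p-value: {wilcoxon_signed_rank_p(DIFFS):.4f}")
```

With only two non-zero pairs, the test has very little power here, which is consistent with the not-significant result at alpha 0.05.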

Dataset metadata

Bundesliga 2025/26 opening match, FC Bayern München vs RB Leipzig on matchday 1, ended 6:0. Repeated-match dataset for probing whether models with knowledge after the fixture reproduce the exact known outcome.

Fixture: FC Bayern München vs RB Leipzig
Actual Result: FC Bayern München 6 - 0 RB Leipzig
Matchday: 1
Repetitions: 25
Why Interesting: Bundesliga 2025/26 opening match, FC Bayern München vs RB Leipzig on matchday 1, ended 6:0. Repeated-match dataset for probing whether models with knowledge after the fixture reproduce the exact known outcome.
Competition: bundesliga-2025-26
Community: pes-squad
Season: 2025/2026
Slice: repeat-25-knowledge-cutoff-bayern-rbl-md1
Source Pool: md01-fc-bayern-munchen-vs-rb-leipzig
Sample Size: 25
Sample Method: repeated-match
Scope: repeated-match
Slice Kind: repeated-match
Source Dataset: match-predictions/bundesliga-2025-26/pes-squad

Run ranking

Rank  Run              Model            Primary metric
1     gpt-5.5 (xhigh)  gpt-5.5 (xhigh)  2.1600
2     gpt-5-nano       gpt-5-nano       2.0000

Two-run comparison

Result: not significant
Better run: gpt-5.5 (xhigh)
Other run: gpt-5-nano
avg_kicktipp_points delta: 0.1600
Wilcoxon p-value: 0.1573
Mean difference: 0.1600
Median difference: 0.0000
Per-item W/T/L: 2/23/0

Effect size confidence intervals

Statistic          Point estimate  Low      High
Mean difference    0.1600          -0.0800  0.3200
Median difference  0.0000          0.0000   0.0000
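The negative lower bound for the mean difference may look odd, since no item has a negative paired difference. One reading that fits is a basic (reverse-percentile) bootstrap, which reflects the resampled quantiles around the point estimate. The report does not name its bootstrap variant, so the sketch below is an assumption. It uses the paired differences reconstructed from the W/T/L counts (two items at +2, 23 at 0), and the resample count and seed are arbitrary choices.

```python
import random
import statistics


def basic_bootstrap_ci(data, stat, n_resamples=10_000, alpha=0.05, seed=0):
    """Basic (reverse-percentile) bootstrap confidence interval for stat(data)."""
    rng = random.Random(seed)
    n = len(data)
    observed = stat(data)
    # Resample with replacement and collect the statistic of each resample.
    estimates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))
    low_q = estimates[int((alpha / 2) * n_resamples)]
    high_q = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    # Reflect the percentile bounds around the observed statistic.
    return 2 * observed - high_q, 2 * observed - low_q


# Reconstructed paired differences (gpt-5.5 (xhigh) minus gpt-5-nano).
DIFFS = [2.0] * 2 + [0.0] * 23

mean_low, mean_high = basic_bootstrap_ci(DIFFS, statistics.fmean)
median_low, median_high = basic_bootstrap_ci(DIFFS, statistics.median)
print(f"mean difference CI:   [{mean_low:.4f}, {mean_high:.4f}]")
print(f"median difference CI: [{median_low:.4f}, {median_high:.4f}]")
```

Because the interval for the mean difference includes zero, it agrees with the Wilcoxon result: the 0.16-point advantage rests on two of 25 pairings.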

Per-item win/tie/loss (W/T/L) counts compare the paired Kicktipp points of the two runs, in the listed run order, on each prepared dataset item.