KicktippAi experiment analysis

gpt-5.5 (xhigh) vs gpt-5-nano

match-predictions/bundesliga-2025-26/pes-squad/repeated-match/md01-fc-bayern-munchen-vs-rb-leipzig/repeat-25-knowledge-cutoff-bayern-rbl-md1

Task: repeated-match · Primary metric: avg_kicktipp_points · Runs: 2 · Pairings: 25

At a glance

not significant · p-value 0.1573

Match to predict

FC Bayern München vs RB Leipzig

Matchday 1 · 2025-08-22T21:30:00 UTC+02

Actual outcome FC Bayern München 6 - 0 RB Leipzig

Compact head to head

gpt-5.5 (xhigh)

2.1600 avg points

gpt-5-nano

2.0000 avg points

Prediction distribution

gpt-5.5 (xhigh) n=25

3:1 20

2:1 3

6:0 2

gpt-5-nano n=25

2:1 20

3:1 4

3:2 1
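
The reported averages can be reproduced from these counts under an assumed scoring rule. The sketch below assumes the common Kicktipp scheme of 2 points for the correct tendency, 3 for the correct goal difference, and 4 for the exact result. The report does not state the community's actual rule, and `kicktipp_points` is a hypothetical helper, not part of KicktippAi.

```python
from collections import Counter

def kicktipp_points(pred, actual, tendency=2, goal_diff=3, exact=4):
    """Score one prediction; the 2/3/4 point values are an assumed rule."""
    (ph, pa), (ah, aa) = pred, actual
    if (ph, pa) == (ah, aa):
        return exact
    pd, ad = ph - pa, ah - aa
    # Assumed: the goal-difference tier does not apply to draws.
    if pd == ad and ad != 0:
        return goal_diff
    # Same sign of the goal difference means same tendency
    # (home win, draw, or away win).
    if (pd > 0) == (ad > 0) and (pd < 0) == (ad < 0):
        return tendency
    return 0

actual = (6, 0)  # FC Bayern München 6 - 0 RB Leipzig
distributions = {
    "gpt-5.5 (xhigh)": Counter({(3, 1): 20, (2, 1): 3, (6, 0): 2}),
    "gpt-5-nano": Counter({(2, 1): 20, (3, 1): 4, (3, 2): 1}),
}

averages = {}
for run, counts in distributions.items():
    total = sum(kicktipp_points(p, actual) * n for p, n in counts.items())
    averages[run] = total / sum(counts.values())
    print(f"{run}: {averages[run]:.4f} avg points")
```

Under this assumed rule, every prediction earns the 2 tendency points, and only the two exact 6:0 predictions earn 4, which matches the 2.1600 and 2.0000 averages in the report.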

Summary

Dataset: match-predictions/bundesliga-2025-26/pes-squad/repeated-match/md01-fc-bayern-munchen-vs-rb-leipzig/repeat-25-knowledge-cutoff-bayern-rbl-md1

Task type: repeated-match

Primary metric: avg_kicktipp_points

Alpha: 0.0500

Paired Wilcoxon signed-rank test on per-item Kicktipp-point differences; bootstrap confidence intervals summarize mean and median paired differences.
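
As an illustration of the paired test, the sketch below computes a two-sided Wilcoxon signed-rank p-value for this dataset's per-item differences. The report does not say which variant of the test it uses. This version drops zero differences, assigns average ranks to ties, and applies the normal approximation without continuity correction; all of these are assumptions. The difference vector (two items at +2, 23 items at 0) is inferred from the W/T/L count of 2/23/0 and a mean difference of 0.1600.

```python
import math

def wilcoxon_signed_rank_p(diffs):
    """Two-sided p-value via the normal approximation.

    Assumed choices: zeros dropped, average ranks for ties,
    tie-corrected variance, no continuity correction.
    """
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)
    if n == 0:
        return 1.0
    # Average ranks of |d|: tied values share the mean of their positions.
    ordered = sorted(abs(d) for d in nonzero)
    rank_of = {}
    i = 0
    while i < n:
        j = i
        while j < n and ordered[j] == ordered[i]:
            j += 1
        rank_of[ordered[i]] = (i + 1 + j) / 2
        i = j
    ranks = [rank_of[abs(d)] for d in nonzero]
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)
    mean = sum(ranks) / 2                          # E[W+] under H0
    sd = math.sqrt(sum(r * r for r in ranks) / 4)  # tie-corrected
    z = (w_plus - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))

# Inferred per-item differences: gpt-5.5 (xhigh) minus gpt-5-nano.
diffs = [2, 2] + [0] * 23
p_value = wilcoxon_signed_rank_p(diffs)
print(f"p-value: {p_value:.4f}")
```

This variant reproduces the reported p-value of 0.1573. With only two non-zero differences the normal approximation is coarse, so the p-value is best read with caution.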

Dataset metadata

Bundesliga 2025/26 opening match, FC Bayern München vs RB Leipzig on matchday 1, ended 6:0. Repeated-match dataset for probing whether models with knowledge after the fixture reproduce the exact known outcome.

Field	Value
Fixture	FC Bayern München vs RB Leipzig
Actual Result	FC Bayern München 6 - 0 RB Leipzig
Matchday	1
Repetitions	25
Why Interesting	Bundesliga 2025/26 opening match, FC Bayern München vs RB Leipzig on matchday 1, ended 6:0. Repeated-match dataset for probing whether models with knowledge after the fixture reproduce the exact known outcome.
Competition	bundesliga-2025-26
Community	pes-squad
Season	2025/2026
Slice	repeat-25-knowledge-cutoff-bayern-rbl-md1
Source Pool	md01-fc-bayern-munchen-vs-rb-leipzig
Sample Size	25
Sample Method	repeated-match
Scope	repeated-match
Slice Kind	repeated-match
Source Dataset	match-predictions/bundesliga-2025-26/pes-squad

Run ranking

Rank	Run	Model	Primary metric
1	gpt-5.5 (xhigh)	gpt-5.5 (xhigh)	2.1600
2	gpt-5-nano	gpt-5-nano	2.0000

Two-run comparison

not significant

Better run: gpt-5.5 (xhigh)

Other run: gpt-5-nano

avg_kicktipp_points delta: 0.1600

Wilcoxon p-value: 0.1573

Mean difference: 0.1600

Median difference: 0.0000

Per-item W/T/L: 2/23/0

Effect size confidence intervals

Statistic	Point estimate	Low	High
Mean difference	0.1600	-0.0800	0.3200
Median difference	0.0000	0.0000	0.0000
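
To illustrate how intervals like these can arise, the sketch below bootstraps the mean and median of the paired differences. The report does not state the bootstrap variant, resample count, or seed. This sketch assumes 10,000 resamples, a fixed seed, and a basic (reverse-percentile) interval, chosen because a plain percentile interval cannot fall below zero when no per-item difference is negative. The difference vector is inferred from the W/T/L count of 2/23/0.

```python
import random
import statistics

# Inferred per-item differences: gpt-5.5 (xhigh) minus gpt-5-nano.
diffs = [2, 2] + [0] * 23
rng = random.Random(0)  # fixed seed, arbitrary choice

def bootstrap_stats(data, stat, n_resamples=10_000):
    """Sorted statistic values over resamples drawn with replacement."""
    return sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )

def percentile_bounds(sorted_stats, alpha=0.05):
    """Empirical alpha/2 and 1 - alpha/2 quantiles of the resampled statistic."""
    lo = sorted_stats[int(alpha / 2 * len(sorted_stats))]
    hi = sorted_stats[int((1 - alpha / 2) * len(sorted_stats)) - 1]
    return lo, hi

def basic_interval(data, stat):
    """Basic bootstrap: reflect the percentile bounds around the estimate."""
    observed = stat(data)
    lo, hi = percentile_bounds(bootstrap_stats(data, stat))
    return 2 * observed - hi, 2 * observed - lo

mean_ci = basic_interval(diffs, statistics.fmean)
median_ci = basic_interval(diffs, statistics.median)
print(f"mean difference CI:   {mean_ci[0]:.4f} to {mean_ci[1]:.4f}")
print(f"median difference CI: {median_ci[0]:.4f} to {median_ci[1]:.4f}")
```

Under these assumptions the mean interval comes out at about -0.08 to 0.32 and the median interval collapses to zero, in line with the table. That agreement suggests, but does not confirm, that the report uses a basic bootstrap.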

Per-item win/tie/loss (W/T/L) counts compare the two runs' paired Kicktipp points on each prepared dataset item, counted in the listed run order (better run first).