KicktippAi experiment analysis
gpt-5.5 (xhigh) vs gpt-5-nano vs gpt-5.5 (none)
match-predictions/bundesliga-2025-26/pes-squad/repeated-match/md01-fc-bayern-munchen-vs-rb-leipzig/repeat-25-knowledge-cutoff-bayern-rbl-md1
At a glance
Prediction distribution
100x low follow-up
Exact 6:0: 5 / 100
A later gpt-5.5 (low) run repeats the same source match, hosted prompt route, and exact pre-kickoff evaluation time on a 100x repeated-match dataset. It is published as a separate single-run page because this report is a paired 25x comparison.
Follow-up report: gpt-5.5 (low) 100x knowledge cutoff follow-up. Companion writeup: knowledge-cutoff-bayern-rbl-repeated-match.md.
Summary
A Friedman test is applied across all paired runs. Pairwise Wilcoxon signed-rank tests use Holm correction, with bootstrap confidence intervals for the paired differences.
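As a minimal sketch of the Holm step-down correction named above, the snippet below adjusts the three raw pairwise p-values listed in this report's multi-run comparison (0.1573, 0.1573, 1.0000). The function name `holm_adjust` is hypothetical and is not taken from the KicktippAi code base; the report does not state which library it uses.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Visit the hypotheses from smallest to largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Raw pairwise p-values as listed in the multi-run comparison.
raw = [0.1573, 0.1573, 1.0000]
print([round(p, 4) for p in holm_adjust(raw)])
```

With three comparisons, the smallest raw p-value is multiplied by 3, which is consistent with the adjusted value of 0.4719 shown for both gpt-5.5 (xhigh) pairs.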
Dataset metadata
Bundesliga 2025/26 opening match, FC Bayern München vs RB Leipzig on matchday 1, ended 6:0. Repeated-match dataset for probing whether models with knowledge after the fixture reproduce the exact known outcome.
| Field | Value |
|---|---|
| Fixture | FC Bayern München vs RB Leipzig |
| Actual Result | FC Bayern München 6 - 0 RB Leipzig |
| Matchday | 1 |
| Repetitions | 25 |
| Why Interesting | Bundesliga 2025/26 opening match, FC Bayern München vs RB Leipzig on matchday 1, ended 6:0. Repeated-match dataset for probing whether models with knowledge after the fixture reproduce the exact known outcome. |
| Competition | bundesliga-2025-26 |
| Community | pes-squad |
| Season | 2025/2026 |
| Slice | repeat-25-knowledge-cutoff-bayern-rbl-md1 |
| Source Pool | md01-fc-bayern-munchen-vs-rb-leipzig |
| Sample Size | 25 |
| Sample Method | repeated-match |
| Scope | repeated-match |
| Slice Kind | repeated-match |
| Source Dataset | match-predictions/bundesliga-2025-26/pes-squad |
Run ranking
| Rank | Run | Model | Primary metric |
|---|---|---|---|
| 1 | gpt-5.5 (xhigh) | gpt-5.5 (xhigh) | 2.1600 |
| 2 | gpt-5-nano | gpt-5-nano | 2.0000 |
| 3 | gpt-5.5 (none) | gpt-5.5 (none) | 2.0000 |
Multi-run comparison
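Friedman p-value: 0.1353
| Run A | Run B | Mean difference | p-value | Holm-adjusted p-value | Significant | Win/tie/loss |
|---|---|---|---|---|---|---|
| gpt-5.5 (xhigh) | gpt-5-nano | 0.1600 | 0.1573 | 0.4719 | no | 2/23/0 |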
| gpt-5.5 (xhigh) | gpt-5.5 (none) | 0.1600 | 0.1573 | 0.4719 | no | 2/23/0 |
| gpt-5-nano | gpt-5.5 (none) | 0.0000 | 1.0000 | 1.0000 | no | 0/25/0 |
Per-item win/tie/loss counts compare paired Kicktipp points for the listed run ordering on each prepared dataset item.
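The counting described above can be sketched as follows. The per-item points are hypothetical: the report does not list them, so the lists below are invented only to be consistent with the published aggregates (means of 2.16 and 2.00, a 2/23/0 split). The function names are likewise assumptions.

```python
def win_tie_loss(points_a, points_b):
    """Count items where run A scores more, the same, or fewer points than run B."""
    if len(points_a) != len(points_b):
        raise ValueError("paired runs must cover the same items")
    wins = sum(a > b for a, b in zip(points_a, points_b))
    ties = sum(a == b for a, b in zip(points_a, points_b))
    losses = sum(a < b for a, b in zip(points_a, points_b))
    return wins, ties, losses


def mean_paired_difference(points_a, points_b):
    """Average of the per-item differences (run A minus run B)."""
    return sum(a - b for a, b in zip(points_a, points_b)) / len(points_a)


# Hypothetical per-item Kicktipp points for 25 repetitions (not from the report).
run_a = [4, 4] + [2] * 23   # two items with a better score, 23 identical
run_b = [2] * 25

print(win_tie_loss(run_a, run_b))
print(round(mean_paired_difference(run_a, run_b), 4))
```

Because the order of the two runs matters, swapping the arguments turns wins into losses and flips the sign of the mean difference.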