KicktippAi experiment analysis
match-predictions/bundesliga-2025-26/ehonda-ai-arena/community-to-date/through-md28/community-to-date-md28
Summary
Community standings
| Rank | Participant | Kicktipp Points | p-value vs baseline |
|---|---|---|---|
| 1 | o3 | 346 | |
| 2 | gpt-5 | 336 | |
| 3 | gpt-5-nano | 331 | |
| 4 | o4-mini | 329 | |
| 5 | gpt-5-mini | 328 | |
Multi-run comparison
Friedman p-value: 0.7401

| Run A | Run B | Points difference | p-value | Adjusted p-value | Significant | Win/Tie/Loss |
|---|---|---|---|---|---|---|
| o3 | gpt-5 | 10.0000 | 0.7767 | 1.0000 | no | 29/199/24 |
| o3 | gpt-5-nano | 15.0000 | 0.4962 | 1.0000 | no | 42/175/35 |
| o3 | o4-mini | 17.0000 | 0.4686 | 1.0000 | no | 33/195/24 |
| o3 | gpt-5-mini | 18.0000 | 0.4106 | 1.0000 | no | 35/191/26 |
| gpt-5 | gpt-5-nano | 5.0000 | 0.6785 | 1.0000 | no | 36/183/33 |
| gpt-5 | o4-mini | 7.0000 | 0.6530 | 1.0000 | no | 28/200/24 |
| gpt-5 | gpt-5-mini | 8.0000 | 0.7128 | 1.0000 | no | 34/191/27 |
| gpt-5-nano | o4-mini | 2.0000 | 0.8791 | 1.0000 | no | 32/186/34 |
| gpt-5-nano | gpt-5-mini | 3.0000 | 0.7823 | 1.0000 | no | 36/177/39 |
| o4-mini | gpt-5-mini | 1.0000 | 0.9403 | 1.0000 | no | 23/205/24 |
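
Every pairwise row reports 1.0000 in the column after the raw p-value, which would be consistent with a multiple-comparison adjustment over the 10 pairs. The report does not state which correction was used, so the following is only a sketch under that assumption: it applies two common corrections (Bonferroni and Holm) to the raw p-values listed above and shows that both cap every value at 1.0. The function names are illustrative, not taken from the analysis tooling.

```python
# Raw pairwise p-values copied from the comparison rows above (10 pairs).
raw_p = [0.7767, 0.4962, 0.4686, 0.4106, 0.6785,
         0.6530, 0.7128, 0.8791, 0.7823, 0.9403]


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment, returned in the original order."""
    m = len(p_values)
    # Indices sorted from smallest to largest raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease along the order.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


bonf = bonferroni(raw_p)
holm_adj = holm(raw_p)
print(bonf)
print(holm_adj)
```

Because the smallest raw p-value (0.4106) already exceeds 0.1, multiplying by 10 saturates at 1.0 under either correction, so the 1.0000 values alone do not reveal which method the report applied.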
Per-item win/tie/loss counts compare paired Kicktipp points for the listed run ordering on each prepared dataset item.
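
As a rough illustration of how such counts could be derived, the sketch below compares two runs item by item. The function name `win_tie_loss` and the sample point lists are hypothetical and are not taken from the experiment data or its tooling.

```python
def win_tie_loss(points_a, points_b):
    """Count items where run A scores more, the same, or fewer points than run B."""
    if len(points_a) != len(points_b):
        raise ValueError("both runs must cover the same dataset items")
    wins = sum(a > b for a, b in zip(points_a, points_b))
    ties = sum(a == b for a, b in zip(points_a, points_b))
    losses = sum(a < b for a, b in zip(points_a, points_b))
    return wins, ties, losses


# Hypothetical Kicktipp points per item for two runs (illustrative only).
run_a = [4, 0, 2, 3, 0, 2]
run_b = [2, 0, 2, 4, 0, 0]

counts = win_tie_loss(run_a, run_b)
swapped = win_tie_loss(run_b, run_a)
print(counts)   # wins/ties/losses from run A's perspective
print(swapped)  # same pairs with the run ordering reversed
```

Reversing the run ordering swaps wins and losses while leaving ties unchanged, which is why the listed ordering matters when reading the counts. In the rows above, each win/tie/loss triple sums to 252 items.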