Measuring LLM hallucinations in a stats-rich domain (Cricket - T20)
Abstract
We benchmark models on single-match cricket facts generated directly from CricSheet T20 YAML scorecards. Each item is either numeric (integer) or multiple-choice (options baked from the source). Models must emit strict JSON: `{"number": int}`, `{"choice": "<option>"}`, or `{"no_answer": true}`. On 100 items per model, GPT-5 achieves the lowest hallucination rate among the non-search models when it answers; a search-enabled model (gpt-4o-search-preview) attains both high coverage and high accuracy. This supports the thesis that in dense stat domains (like cricket), models cannot realistically memorize the long tail; retrieval is required for high accuracy.
Source: GitHub. Attribution: Data derived from CricSheet.
Methods (brief)
- Source of truth: Each QA item is derived from exactly one CricSheet YAML; we store the file path with the item. No cross-match aggregation.
- Question types (10):
  - toss_winner (choice: 2 teams)
  - toss_decision (choice: bat/field)
  - match_winner (choice: 2 teams)
  - victory_margin_runs (number)
  - victory_margin_wkts (number)
  - team_total (number, per team)
  - top_scorer_name (choice: all players in match; ties allowed via `gold_set`)
  - top_scorer_runs (number for a fixed top batter)
  - top_wicket_taker_name (choice: all players; ties allowed via `gold_set`)
  - total_match_runs (number)
- I/O contract: Single prompt per item. Model returns `{"choice":"<one of options>"}` for names/teams, `{"number": <int>}` for numeric, or `{"no_answer": true}` if unsure.
- Scoring (see the sketch after this list):
  - Answered = valid JSON with a valid field (a choice among the options, or an integer).
  - Correct = choice is in `gold_set`, or number equals `gold`.
  - Hallucination = answered but incorrect.
  - Errors / invalid JSON / invalid options = unanswered.
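To make the item derivation and scoring rules concrete, here is a minimal Python sketch. It assumes the classic CricSheet YAML layout (top-level `info` and `innings`, per-delivery `runs.total`) and PyYAML; the item fields and helper names are illustrative, not the benchmark's actual schema.

```python
# Sketch: deriving a single-match item from one CricSheet YAML and scoring a reply.
# Assumes the classic CricSheet YAML layout; names here are illustrative.
import json
import yaml

def team_totals(path):
    """Sum runs per batting team from one CricSheet YAML scorecard."""
    with open(path) as f:
        match = yaml.safe_load(f)
    totals = {}
    for innings in match["innings"]:          # list of {"1st innings": {...}} dicts
        for _, data in innings.items():
            runs = sum(d["runs"]["total"]
                       for ball in data["deliveries"]
                       for d in ball.values())
            totals[data["team"]] = totals.get(data["team"], 0) + runs
    return totals

def make_team_total_item(path, team):
    """Numeric item: gold answer is the team's total in this one match."""
    return {"type": "team_total", "source": path,
            "prompt": f"How many runs did {team} score in this match?",
            "gold": team_totals(path)[team]}

def score(item, raw_reply, options=None):
    """Apply the scoring rules: correct / hallucination / unanswered."""
    try:
        reply = json.loads(raw_reply)
    except json.JSONDecodeError:
        return "unanswered"                    # invalid JSON
    if reply.get("no_answer") is True:
        return "unanswered"                    # explicit abstention
    if isinstance(reply.get("number"), int):
        return "correct" if reply["number"] == item["gold"] else "hallucination"
    if options and reply.get("choice") in options:
        gold_set = item.get("gold_set", {item.get("gold")})
        return "correct" if reply["choice"] in gold_set else "hallucination"
    return "unanswered"                        # invalid field or option
```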
Results (N=100 per model)
| Model | Answer rate | Accuracy (overall) | Accuracy (when answered) | Hallucination rate (when answered) | Wrong / 100 prompts |
|---|---|---|---|---|---|
| gpt-4o-search-preview | 0.96 | 0.88 | 0.9082 | 0.0918 | 9.00 |
| gpt-5 | 0.35 | 0.27 | 0.7714 | 0.2286 | 8.00 |
| gpt-4o-mini | 0.37 | 0.14 | 0.3784 | 0.6216 | 23.00 |
| gpt-5-mini | 0.05 | 0.02 | 0.4000 | 0.6000 | 3.00 |
Notes:
- Wrong per 100 prompts = `answer_rate * hallucination_rate_when_answered * 100` (see the sketch below).
- gpt-5-mini’s low overall wrong count is driven by high abstention (very low coverage).
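As a quick check of the derived columns, a tiny helper (function and argument names are mine, for illustration) that recomputes them from raw counts:

```python
# Sketch: deriving the table's columns from raw answered/correct counts.
def metrics(n_prompts, n_answered, n_correct):
    answer_rate = n_answered / n_prompts
    acc_overall = n_correct / n_prompts
    acc_answered = n_correct / n_answered if n_answered else 0.0
    halluc_answered = 1.0 - acc_answered
    wrong_per_100 = answer_rate * halluc_answered * 100
    return answer_rate, acc_overall, acc_answered, halluc_answered, wrong_per_100

# e.g. gpt-5: 35 answered, 27 correct out of 100 prompts
# -> approximately (0.35, 0.27, 0.7714, 0.2286, 8.0)
print(metrics(100, 35, 27))
```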
Discussion and Thesis
- Dense stats are not memorized. With many “nearby” entities and numbers, parametric memory is insufficient. Hallucinations cluster around plausible near misses (e.g., wrong victory margin, teammate names).
- Retrieval wins. gpt-4o-search-preview (with search) reaches both high coverage and high faithfulness, confirming that some form of RAG or built-in search is the practical path to high accuracy.
- Operational guidance. When the data you need doesn’t fit in context and must be searched for:
  - Critical domains: prefer a conservative model (e.g., gpt-5-mini or similar behavior) plus retrieval; abstain when evidence is missing.
  - General Q&A: a stronger model (e.g., GPT-5) plus retrieval balances coverage and reliability.
- Abstention as a knob. Lower hallucinations per 100 prompts can be achieved either by higher precision or by answering less; choose based on risk tolerance.
Limitations
- Domain is cricket only; replicate on other dense stat domains (baseball, finance, claim databases).
- Sample size per model is 100; increase N for tighter intervals and per-type CIs.
- Answers marked wrong for the search-enabled model aren’t necessarily hallucinations; in some cases the CricSheet data simply disagrees with the other sources the search tool retrieved.
Appendix A: Prompt Contract (single prompt, model decides type)
System: Output valid JSON only. If unsure, return {"no_answer": true}.
User: <item.prompt>
If options are shown, choose from them.
Return ONLY one of:
{"choice":"<one of options>"}
{"number": <integer>}
{"no_answer": true}
Attribution: Thanks to CricSheet for high-quality structured scorecards: https://cricsheet.org