LLM Social Reasoning with Learning Benchmark (Feb/Mar 2026)
210/210 games completed · Role mix per model: Town 120 · Detective 30 · Mafia 60
Outcomes
| # | Model | Outcome Score (range) | Win Rate (range) | Town (score, win rate) | Detective (score, win rate) | Mafia (score, win rate) |
|---|---|---|---|---|---|---|
| 1 | openai/gpt-5.2 | 0.49 (0.42..0.55) | 56% (49..63%) | 0.33, 40% | 0.27, 53% | 0.90, 90% |
| 2 | moonshotai/kimi-k2.5 | 0.38 (0.32..0.44) | 43% (36..50%) | 0.30, 35% | 0.14, 27% | 0.65, 67% |
| 3 | google/gemini-3.1-pro-preview | 0.37 (0.31..0.43) | 45% (38..52%) | 0.29, 37% | 0.13, 27% | 0.66, 70% |
| 4 | z-ai/glm-5 | 0.36 (0.30..0.42) | 41% (35..48%) | 0.28, 33% | 0.16, 27% | 0.62, 63% |
| 5 | deepseek/deepseek-v3.2 | 0.33 (0.27..0.40) | 39% (33..46%) | 0.26, 29% | 0.20, 37% | 0.56, 60% |
| 6 | x-ai/grok-4.1-fast | 0.32 (0.26..0.38) | 41% (35..48%) | 0.26, 33% | 0.15, 30% | 0.53, 63% |
| 7 | minimax/minimax-m2.5 | 0.27 (0.22..0.32) | 35% (29..42%) | 0.21, 27% | 0.18, 33% | 0.44, 53% |
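The per-role columns can be tied back to the overall win rate as a role-mix-weighted average. The following is a minimal check, assuming the role mix of 120 Town, 30 Detective, and 60 Mafia seats per model stated in the header and using the rounded gpt-5.2 win rates from the table. The function and variable names are illustrative, not part of the benchmark.

```python
# Seats each model played per role, taken from the role mix in the header.
ROLE_SEATS = {"town": 120, "detective": 30, "mafia": 60}

def weighted_win_rate(per_role_win_rate, role_seats=ROLE_SEATS):
    """Average per-role win rates, weighting each role by seats played."""
    total = sum(role_seats.values())
    return sum(per_role_win_rate[r] * n for r, n in role_seats.items()) / total

# Rounded per-role win rates for openai/gpt-5.2, copied from the table.
gpt52 = {"town": 0.40, "detective": 0.53, "mafia": 0.90}
overall = weighted_win_rate(gpt52)
print(f"{overall:.1%}")  # 56.1%
```

The result agrees with the 56% reported for gpt-5.2, up to the rounding of the displayed per-role values.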
Model vs Model
Average outcome-point advantage after subtracting each role's baseline expectation.
| A \ B | deepseek/deepseek-v3.2 | google/gemini-3.1-pro-preview | minimax/minimax-m2.5 | moonshotai/kimi-k2.5 | openai/gpt-5.2 | x-ai/grok-4.1-fast | z-ai/glm-5 |
|---|---|---|---|---|---|---|---|
| deepseek/deepseek-v3.2 | - | | | | | | |
| google/gemini-3.1-pro-preview | | - | | | | | |
| minimax/minimax-m2.5 | | | - | | | | |
| moonshotai/kimi-k2.5 | | | | - | | | |
| openai/gpt-5.2 | | | | | - | | |
| x-ai/grok-4.1-fast | | | | | | - | |
| z-ai/glm-5 | | | | | | | - |
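One plausible reading of "average outcome-point advantage after subtracting each role's baseline expectation" is sketched below. This is an assumption, not the benchmark's documented method: the row layout `(model_a, model_b, role_of_a, outcome_of_a)`, the use of the per-role mean as the baseline, and the function name are all hypothetical, and the toy data is invented.

```python
from collections import defaultdict

def pairwise_advantage(rows):
    """rows: iterable of (model_a, model_b, role_of_a, outcome_of_a)."""
    rows = list(rows)
    # Assumed baseline: mean outcome over every row played in that role.
    by_role = defaultdict(list)
    for _, _, role, outcome in rows:
        by_role[role].append(outcome)
    baseline = {role: sum(v) / len(v) for role, v in by_role.items()}

    # Average the baseline-adjusted outcome for each ordered (A, B) pair.
    residuals = defaultdict(list)
    for a, b, role, outcome in rows:
        residuals[(a, b)].append(outcome - baseline[role])
    return {pair: sum(v) / len(v) for pair, v in residuals.items()}

# Toy rows, invented for illustration (not benchmark results).
toy = [
    ("A", "B", "mafia", 1.0),
    ("A", "B", "town", 1.0),
    ("B", "A", "mafia", 0.0),
    ("B", "A", "town", 0.0),
]
adv = pairwise_advantage(toy)
print(adv[("A", "B")], adv[("B", "A")])  # 0.5 -0.5
```

Subtracting the role baseline first keeps a model from looking strong against an opponent simply because it happened to draw the easier role in their shared games.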
How Models Play
| Model | Vote-Elim Alignment | Elim Capture | Influence Shift (+) | Predicted Vote Accuracy | Survival Depth | Town Vote Misfire Rate | Town Vote Mafia Rate | Belief Brier (Mafia) | Belief Rank AUC | Last Elim Calibration |
|---|---|---|---|---|---|---|---|---|---|---|
| openai/gpt-5.2 | 0.87 (0.81..0.92) | 0.90 (0.84..0.94) | 0.13 (0.07..0.21) | 0.86 (0.83..0.88) | 0.58 (0.52..0.64) | 0.58 (0.50..0.65) | 0.42 (0.35..0.50) | 0.26 (0.25..0.27) | 0.57 (0.53..0.61) | 0.21 (0.19..0.23) |
| moonshotai/kimi-k2.5 | 0.65 (0.58..0.72) | 0.67 (0.60..0.74) | 0.08 (0.00..0.23) | 0.83 (0.81..0.85) | 0.71 (0.65..0.76) | 0.61 (0.54..0.69) | 0.39 (0.31..0.47) | 0.25 (0.24..0.27) | 0.53 (0.49..0.58) | 0.32 (0.29..0.36) |
| google/gemini-3.1-pro-preview | 0.57 (0.49..0.65) | 0.61 (0.54..0.69) | 0.08 (0.04..0.13) | 0.87 (0.84..0.89) | 0.66 (0.60..0.73) | 0.58 (0.51..0.65) | 0.42 (0.35..0.50) | 0.25 (0.24..0.27) | 0.58 (0.54..0.61) | 0.23 (0.19..0.26) |
| z-ai/glm-5 | 0.75 (0.68..0.81) | 0.77 (0.71..0.83) | 0.09 (0.05..0.14) | 0.82 (0.80..0.85) | 0.67 (0.61..0.72) | 0.57 (0.50..0.65) | 0.43 (0.35..0.51) | 0.26 (0.24..0.27) | 0.53 (0.49..0.58) | 0.26 (0.23..0.28) |
| deepseek/deepseek-v3.2 | 0.59 (0.52..0.66) | 0.62 (0.55..0.70) | 0.04 (0.01..0.08) | 0.86 (0.84..0.88) | 0.68 (0.62..0.74) | 0.62 (0.54..0.69) | 0.38 (0.31..0.45) | 0.27 (0.25..0.28) | 0.53 (0.49..0.57) | 0.28 (0.25..0.31) |
| x-ai/grok-4.1-fast | 0.55 (0.47..0.62) | 0.58 (0.50..0.66) | 0.07 (0.02..0.14) | 0.85 (0.83..0.87) | 0.66 (0.59..0.72) | 0.58 (0.52..0.66) | 0.42 (0.34..0.49) | 0.24 (0.23..0.26) | 0.58 (0.53..0.63) | 0.32 (0.28..0.35) |
| minimax/minimax-m2.5 | 0.56 (0.48..0.63) | 0.59 (0.51..0.67) | 0.07 (0.03..0.13) | 0.75 (0.71..0.79) | 0.60 (0.54..0.67) | 0.70 (0.63..0.77) | 0.30 (0.23..0.37) | 0.28 (0.26..0.29) | 0.50 (0.46..0.54) | 0.29 (0.26..0.32) |
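The page does not define its belief metrics, so the sketch below uses the standard textbook definitions of a Brier score and a rank AUC as a guess at what Belief Brier (Mafia) and Belief Rank AUC measure. The input format (one probability per other player that the player is Mafia) and the toy numbers are assumptions made for illustration.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def rank_auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy beliefs about six other players (invented); 1 means actually Mafia.
labels = [1, 1, 0, 0, 0, 0]
beliefs = [0.7, 0.3, 0.4, 0.2, 0.2, 0.1]
print(round(brier_score(beliefs, labels), 3))  # 0.138
print(round(rank_auc(beliefs, labels), 3))     # 0.875

# Reference point: a constant 0.5 belief scores a Brier of 0.25 and an AUC of 0.5.
print(brier_score([0.5] * 6, labels), rank_auc([0.5] * 6, labels))  # 0.25 0.5
```

Under these definitions a lower Brier score is better, while a higher rank AUC is better.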
What Predicts Winning
Feature Analysis
Grouped CV · 5 folds · seed 42 · 08/03/2026, 11:44:47
Role-only baseline
Log loss 0.668 · AUC 0.641 · Brier 0.237
Best feature set
Social + operational + survival · ΔLL +0.247 · ΔAUC +0.257
Includes: Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival
Data quality
Weak rows 0/1470 (0.0%)
Mean analysis weight 0.961
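The grouped 5-fold cross-validation with seed 42 named in the summary can be sketched as follows. The page does not say what the grouping key is; grouping by game (so that all seven rows from one game land in the same fold) is an assumption, and `grouped_folds` is a hypothetical helper rather than the benchmark's code.

```python
import random

def grouped_folds(group_ids, n_folds=5, seed=42):
    """Assign every row to a fold so that rows sharing a group never split."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)  # seeded for reproducibility
    fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}
    return [fold_of_group[g] for g in group_ids]

# 210 hypothetical game ids with 7 rows each, matching the 1470 rows reported.
game_ids = [game for game in range(210) for _ in range(7)]
folds = grouped_folds(game_ids)

# Each fold receives 42 whole games, i.e. 294 rows.
print([folds.count(k) for k in range(5)])  # [294, 294, 294, 294, 294]
```

Keeping a game's rows together matters because players in the same game share an outcome history; splitting them across train and test folds would leak information.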
Feature-set ranking (delta vs role-only baseline)
| Feature set | What it includes | ΔAUC | ΔLog Loss | AUC | Log Loss |
|---|---|---|---|---|---|
| Social + operational + survival | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival | +0.257 | +0.247 | 0.897 | 0.420 |
| Social all | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity | +0.257 | +0.242 | 0.897 | 0.425 |
| Social + operational | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution | +0.256 | +0.241 | 0.897 | 0.426 |
| Belief + conversion | Belief, Influence, Voting | +0.241 | +0.199 | 0.882 | 0.468 |
| Social quality (no opportunity) | Voting, Influence, Detective actions, Mafia tactics | +0.235 | +0.193 | 0.876 | 0.475 |
| Belief inference only | Belief | +0.178 | +0.125 | 0.819 | 0.543 |
| Role only | Role baseline | +0.000 | +0.000 | 0.641 | 0.668 |
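The delta columns can be reproduced from the AUC and Log Loss columns. The sign convention below is inferred from the table rather than stated on the page: log loss falls from 0.668 to 0.420 while ΔLog Loss is reported as positive, so both deltas appear to be signed so that positive means better than the role-only baseline. The helper name is illustrative.

```python
BASELINE = {"auc": 0.641, "log_loss": 0.668}  # values from the "Role only" row

def deltas(auc, log_loss, baseline=BASELINE):
    """Return (ΔAUC, ΔLog Loss), both signed so that positive means better."""
    return auc - baseline["auc"], baseline["log_loss"] - log_loss

# Best feature set: Social + operational + survival.
d_auc, d_ll = deltas(auc=0.897, log_loss=0.420)
print(f"{d_auc:+.3f} {d_ll:+.3f}")  # +0.256 +0.248
```

The table reports +0.257 and +0.247 for this row; the 0.001 differences are consistent with the deltas having been computed from unrounded values.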
Top single features
| Feature | ΔAUC | ΔLog Loss |
|---|---|---|
| belief_brier_mafia | +0.182 | +0.099 |
| belief_rank_auc | +0.174 | +0.095 |
| town_vote_misfire_rate | +0.155 | +0.081 |
| town_vote_mafia_rate | +0.155 | +0.081 |
| vote_elim_alignment_rate | +0.130 | +0.067 |
Validity by role
| Role | Rows | Weak rows | Weak % | Mean wt |
|---|---|---|---|---|
| detective | 210 | 0 | 0.0% | 0.952 |
| mafia | 420 | 0 | 0.0% | 0.984 |
| town | 840 | 0 | 0.0% | 0.952 |
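As a consistency check, the per-role rows above can be combined to recover the totals reported under Data quality. The snippet assumes the overall mean analysis weight is a row-weighted average of the per-role means, which is the natural reading but is not stated on the page.

```python
# (rows, mean weight) per role, copied from the validity table.
ROLE_ROWS = {
    "detective": (210, 0.952),
    "mafia": (420, 0.984),
    "town": (840, 0.952),
}

total_rows = sum(n for n, _ in ROLE_ROWS.values())
overall_weight = sum(n * w for n, w in ROLE_ROWS.values()) / total_rows
print(total_rows, round(overall_weight, 3))  # 1470 0.961
```

Both figures match the summary: 1470 rows (210 games with 7 seats each) and a mean analysis weight of 0.961.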