LLM Social Reasoning Benchmark (Feb/Mar 2026)
210/210 games completed · Role mix per model: T 120 · D 30 · M 60
Outcomes
| # | Model | Outcome Score (interval) | Win Rate (interval) | Town (score / win rate) | Detective (score / win rate) | Mafia (score / win rate) |
|---|---|---|---|---|---|---|
| 1 | openai/gpt-5.2 | 0.47 (0.42..0.53) | 56% (49..63%) | 0.33 / 45% | 0.17 / 30% | 0.91 / 92% |
| 2 | google/gemini-3.1-pro-preview | 0.41 (0.35..0.47) | 47% (40..53%) | 0.31 / 33% | 0.21 / 43% | 0.72 / 75% |
| 3 | z-ai/glm-5 | 0.39 (0.33..0.45) | 47% (40..53%) | 0.29 / 35% | 0.19 / 37% | 0.70 / 75% |
| 4 | deepseek/deepseek-v3.2 | 0.37 (0.31..0.44) | 43% (36..50%) | 0.29 / 33% | 0.15 / 33% | 0.65 / 68% |
| 5 | moonshotai/kimi-k2.5 | 0.36 (0.30..0.43) | 42% (35..49%) | 0.27 / 30% | 0.20 / 40% | 0.64 / 67% |
| 6 | minimax/minimax-m2.5 | 0.27 (0.21..0.33) | 32% (26..39%) | 0.24 / 28% | 0.05 / 13% | 0.44 / 50% |
| 7 | x-ai/grok-4.1-fast | 0.19 (0.14..0.24) | 30% (25..37%) | 0.13 / 23% | 0.14 / 30% | 0.35 / 47% |
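
The overall Outcome Score and Win Rate appear to be role-weighted averages of the per-role figures, using the stated role mix (T 120 · D 30 · M 60). This is inferred from the numbers, not from a documented formula. A minimal Python sketch of that check for openai/gpt-5.2 (per-role score 0.33, 0.17, 0.91 and win rate 45%, 30%, 92%); the names `ROLE_GAMES` and `weighted_overall` are illustrative:

```python
# Games per role for each model, taken from the "Role mix" line.
ROLE_GAMES = {"town": 120, "detective": 30, "mafia": 60}

def weighted_overall(per_role):
    """Average per-role values, weighted by how many games were played in each role."""
    total = sum(ROLE_GAMES.values())
    return sum(ROLE_GAMES[role] * per_role[role] for role in ROLE_GAMES) / total

# Per-role figures for openai/gpt-5.2 as shown in the Outcomes table.
gpt_score = weighted_overall({"town": 0.33, "detective": 0.17, "mafia": 0.91})
gpt_win = weighted_overall({"town": 0.45, "detective": 0.30, "mafia": 0.92})

print(round(gpt_score, 2), round(gpt_win, 2))
```

Because the per-role cells are themselves rounded, a recomputed overall value can differ from the displayed one by about one unit in the last digit for some models.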
Model vs Model
Average outcome-point advantage after subtracting each role's baseline expectation.
| A \\ B | deepseek/deepseek-v3.2 | google/gemini-3.1-pro-preview | minimax/minimax-m2.5 | moonshotai/kimi-k2.5 | openai/gpt-5.2 | x-ai/grok-4.1-fast | z-ai/glm-5 |
|---|---|---|---|---|---|---|---|
| deepseek/deepseek-v3.2 | - | | | | | | |
| google/gemini-3.1-pro-preview | | - | | | | | |
| minimax/minimax-m2.5 | | | - | | | | |
| moonshotai/kimi-k2.5 | | | | - | | | |
| openai/gpt-5.2 | | | | | - | | |
| x-ai/grok-4.1-fast | | | | | | - | |
| z-ai/glm-5 | | | | | | | - |
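
One way to read "advantage after subtracting each role's baseline expectation" is sketched below in Python. It assumes the baseline is the mean outcome of everyone who played that role, and that each record is a `(model_a, model_b, role_a, outcome_a)` tuple. Both the record layout and the function name `pairwise_advantage` are hypothetical; the benchmark's actual computation is not shown on this page.

```python
from collections import defaultdict
from statistics import mean

def pairwise_advantage(rows):
    """Mean role-adjusted outcome of model A in games that also contained model B.

    rows: iterable of (model_a, model_b, role_a, outcome_a) tuples (assumed layout).
    """
    rows = list(rows)

    # Baseline expectation per role: the average outcome over all records in that role.
    by_role = defaultdict(list)
    for _, _, role, outcome in rows:
        by_role[role].append(outcome)
    baseline = {role: mean(values) for role, values in by_role.items()}

    # Residual outcome (actual minus role baseline), averaged per ordered pair (A, B).
    by_pair = defaultdict(list)
    for model_a, model_b, role, outcome in rows:
        if model_a != model_b:  # the diagonal is left undefined
            by_pair[(model_a, model_b)].append(outcome - baseline[role])
    return {pair: mean(values) for pair, values in by_pair.items()}

# Toy records with made-up outcomes, only to show the mechanics.
toy_rows = [
    ("model_x", "model_y", "mafia", 1.0),
    ("model_y", "model_x", "mafia", 0.0),
    ("model_x", "model_y", "town", 0.0),
    ("model_y", "model_x", "town", 0.5),
]
advantage = pairwise_advantage(toy_rows)
print(advantage)
```

Subtracting the role baseline first keeps a model from looking strong against an opponent merely because it happened to draw the Mafia role more often in those games.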
How Models Play
| Model | Vote-Elim Alignment | Elim Capture | Influence Shift (+) | Predicted Vote Accuracy | Survival Depth | Town Vote Misfire Rate | Town Vote Mafia Rate | Belief Brier (Mafia) | Belief Rank AUC | Last Elim Calibration |
|---|---|---|---|---|---|---|---|---|---|---|
| openai/gpt-5.2 | 0.81 (0.74..0.88) | 0.87 (0.81..0.93) | 0.01 (0.00..0.04) | 0.81 (0.78..0.84) | 0.38 (0.34..0.41) | 0.70 (0.62..0.78) | 0.30 (0.23..0.37) | 0.27 (0.26..0.27) | 0.53 (0.50..0.56) | 0.18 (0.15..0.21) |
| google/gemini-3.1-pro-preview | 0.68 (0.61..0.74) | 0.72 (0.65..0.79) | 0.06 (0.02..0.11) | 0.85 (0.82..0.87) | 0.80 (0.74..0.84) | 0.54 (0.47..0.60) | 0.46 (0.40..0.54) | 0.26 (0.24..0.28) | 0.54 (0.49..0.58) | 0.22 (0.20..0.25) |
| z-ai/glm-5 | 0.70 (0.63..0.77) | 0.74 (0.67..0.81) | 0.11 (0.05..0.19) | 0.80 (0.77..0.82) | 0.73 (0.67..0.79) | 0.65 (0.58..0.72) | 0.35 (0.28..0.41) | 0.26 (0.25..0.28) | 0.49 (0.44..0.54) | 0.21 (0.19..0.24) |
| deepseek/deepseek-v3.2 | 0.69 (0.62..0.75) | 0.73 (0.66..0.80) | 0.01 (0.00..0.03) | 0.82 (0.79..0.84) | 0.75 (0.69..0.80) | 0.59 (0.52..0.66) | 0.41 (0.34..0.48) | 0.25 (0.24..0.26) | 0.55 (0.50..0.59) | 0.28 (0.25..0.30) |
| moonshotai/kimi-k2.5 | 0.71 (0.63..0.77) | 0.76 (0.69..0.83) | 0.03 (0.00..0.11) | 0.79 (0.76..0.82) | 0.68 (0.62..0.73) | 0.59 (0.53..0.66) | 0.41 (0.34..0.48) | 0.24 (0.23..0.26) | 0.54 (0.49..0.58) | 0.30 (0.27..0.34) |
| minimax/minimax-m2.5 | 0.63 (0.56..0.71) | 0.68 (0.61..0.75) | 0.09 (0.03..0.19) | 0.65 (0.60..0.68) | 0.72 (0.66..0.78) | 0.57 (0.50..0.64) | 0.43 (0.36..0.50) | 0.27 (0.25..0.28) | 0.53 (0.49..0.57) | 0.25 (0.22..0.28) |
| x-ai/grok-4.1-fast | 0.25 (0.18..0.32) | 0.27 (0.19..0.35) | 0.02 (0.00..0.08) | 0.79 (0.77..0.81) | 0.41 (0.35..0.47) | 0.68 (0.59..0.76) | 0.32 (0.25..0.40) | 0.26 (0.24..0.27) | 0.49 (0.45..0.53) | 0.24 (0.19..0.28) |
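
The two belief columns can be read against the textbook definitions, sketched below in Python. The sketch assumes Belief Brier (Mafia) is the mean squared error between a stated probability that a player is Mafia and the 0/1 truth, and that Belief Rank AUC is the usual pairwise ranking AUC. Whether the benchmark computes them exactly this way is not stated; the function names and toy numbers are illustrative.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def rank_auc(probs, labels):
    """Probability that a random positive is ranked above a random negative (ties count half)."""
    positives = [p for p, y in zip(probs, labels) if y == 1]
    negatives = [p for p, y in zip(probs, labels) if y == 0]
    if not positives or not negatives:
        return None  # undefined without both classes
    wins = sum(
        1.0 if pos > neg else 0.5 if pos == neg else 0.0
        for pos in positives
        for neg in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy beliefs about four players; 1 marks an actual Mafia member.
beliefs = [0.8, 0.5, 0.25, 0.5]
is_mafia = [1, 0, 0, 1]

print(brier_score(beliefs, is_mafia), rank_auc(beliefs, is_mafia))
```

Under these definitions, always answering 0.5 yields a Brier score of 0.25, and an AUC of 0.5 corresponds to chance-level ranking, which gives two reference points for the values in the table.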
What Predicts Winning
Feature Analysis
Grouped CV · 5 folds · seed 42 · 04/03/2026, 11:27:53
Role-only baseline
Log loss 0.666 · AUC 0.650 · Brier 0.236
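
A grouped 5-fold evaluation of a role-only baseline could look like the Python sketch below. It assumes the grouping unit is the game, so that rows from one game never appear in both training and test folds, and that the baseline predicts each role's training-fold win rate. The row layout `(game_id, role, won)`, the function names, and the toy data are hypothetical; the page does not say how folds were formed.

```python
import math
import random
from collections import defaultdict

def grouped_folds(group_ids, n_folds=5, seed=42):
    """Assign every row to a fold so that all rows sharing a group id land in the same fold."""
    groups = sorted(set(group_ids))
    random.Random(seed).shuffle(groups)
    fold_of_group = {group: i % n_folds for i, group in enumerate(groups)}
    return [fold_of_group[group] for group in group_ids]

def role_only_log_loss(rows, n_folds=5, seed=42):
    """Out-of-fold log loss when the only predictor is the player's role."""
    folds = grouped_folds([game for game, _, _ in rows], n_folds, seed)
    losses = []
    for k in range(n_folds):
        wins, counts = defaultdict(float), defaultdict(int)
        for (_, role, won), fold in zip(rows, folds):
            if fold != k:  # fit on the other folds
                wins[role] += won
                counts[role] += 1
        for (_, role, won), fold in zip(rows, folds):
            if fold == k:  # score on the held-out fold
                # Laplace smoothing keeps the probability strictly inside (0, 1).
                p = (wins[role] + 1) / (counts[role] + 2)
                losses.append(-math.log(p if won else 1 - p))
    return sum(losses) / len(losses)

# Toy data: 20 games, one mafia row and one town row each, alternating winners.
toy_rows = []
for game in range(20):
    mafia_won = game % 2
    toy_rows.append((game, "mafia", mafia_won))
    toy_rows.append((game, "town", 1 - mafia_won))

toy_folds = grouped_folds([game for game, _, _ in toy_rows])
toy_loss = role_only_log_loss(toy_rows)
print(toy_loss)
```

Grouping matters here because players in the same game share an outcome; splitting a game across folds would leak the result into training.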
Best feature set
Social + operational + survival · ΔLL +0.223 · ΔAUC +0.232
Includes: Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival
Data quality
Weak rows 0/1470 (0.0%)
Mean analysis weight 0.959
Feature-set ranking (delta vs role-only baseline)
| Feature set | What it includes | ΔAUC | ΔLog Loss | AUC | Log Loss |
|---|---|---|---|---|---|
| Social + operational + survival | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival | +0.232 | +0.223 | 0.882 | 0.442 |
| Social + operational | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution | +0.231 | +0.218 | 0.881 | 0.448 |
| Social all | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity | +0.231 | +0.218 | 0.881 | 0.448 |
| Belief + conversion | Belief, Influence, Voting | +0.218 | +0.182 | 0.868 | 0.484 |
| Social quality (no opportunity) | Voting, Influence, Detective actions, Mafia tactics | +0.202 | +0.160 | 0.852 | 0.506 |
| Belief inference only | Belief | +0.160 | +0.121 | 0.810 | 0.545 |
| Role only | Role baseline | +0.000 | +0.000 | 0.650 | 0.666 |
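
The delta columns follow a sign convention in which positive means better than the role-only baseline: ΔAUC is the set's AUC minus the baseline AUC, while ΔLog Loss is the baseline log loss minus the set's log loss. This reading is inferred from the table values. A short Python check, with illustrative names:

```python
# Role-only baseline metrics from the ranking table.
BASELINE = {"auc": 0.650, "log_loss": 0.666}

def deltas(auc, log_loss):
    """Return (delta_auc, delta_log_loss), both oriented so that positive is an improvement."""
    delta_auc = auc - BASELINE["auc"]                  # higher AUC is better
    delta_log_loss = BASELINE["log_loss"] - log_loss   # lower log loss is better
    return round(delta_auc, 3), round(delta_log_loss, 3)

# "Belief + conversion" row: AUC 0.868, log loss 0.484.
belief_conversion = deltas(0.868, 0.484)
print(belief_conversion)
```

Recomputing from the displayed three-decimal values can differ from the reported delta by 0.001 (as in the top row's log loss), presumably because the deltas were taken before rounding.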
Top single features
| Feature | ΔAUC | ΔLog Loss |
|---|---|---|
| belief_brier_mafia | +0.165 | +0.091 |
| belief_rank_auc | +0.146 | +0.085 |
| town_vote_misfire_rate | +0.133 | +0.065 |
| town_vote_mafia_rate | +0.133 | +0.065 |
| vote_elim_alignment_rate | +0.106 | +0.055 |
Validity by role
| Role | Rows | Weak rows | Weak % | Mean wt |
|---|---|---|---|---|
| detective | 210 | 0 | 0.0% | 0.951 |
| mafia | 420 | 0 | 0.0% | 0.985 |
| town | 840 | 0 | 0.0% | 0.949 |