LLM Social Reasoning Benchmark (Feb/Mar 2026)


210/210 games completed · Role mix per model: Town 120 · Detective 30 · Mafia 60

Outcomes

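| # | Model | Outcome Score (interval) | Win Rate (interval) | Town (score / win rate) | Detective (score / win rate) | Mafia (score / win rate) |
|---|---|---|---|---|---|---|
| 1 | openai/gpt-5.2 | 0.47 (0.42..0.53) | 56% (49..63%) | 0.33 / 45% | 0.17 / 30% | 0.91 / 92% |
| 2 | google/gemini-3.1-pro-preview | 0.41 (0.35..0.47) | 47% (40..53%) | 0.31 / 33% | 0.21 / 43% | 0.72 / 75% |
| 3 | z-ai/glm-5 | 0.39 (0.33..0.45) | 47% (40..53%) | 0.29 / 35% | 0.19 / 37% | 0.70 / 75% |
| 4 | deepseek/deepseek-v3.2 | 0.37 (0.31..0.44) | 43% (36..50%) | 0.29 / 33% | 0.15 / 33% | 0.65 / 68% |
| 5 | moonshotai/kimi-k2.5 | 0.36 (0.30..0.43) | 42% (35..49%) | 0.27 / 30% | 0.20 / 40% | 0.64 / 67% |
| 6 | minimax/minimax-m2.5 | 0.27 (0.21..0.33) | 32% (26..39%) | 0.24 / 28% | 0.05 / 13% | 0.44 / 50% |
| 7 | x-ai/grok-4.1-fast | 0.19 (0.14..0.24) | 30% (25..37%) | 0.13 / 23% | 0.14 / 30% | 0.35 / 47% |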
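The overall columns appear to be role-weighted averages of the per-role columns, using the role mix of 120 Town, 30 Detective and 60 Mafia games per model. That relationship is an inference from the numbers, not something the page states. A minimal sketch that checks it for the top row:

```python
# Role mix per model, taken from the summary line: Town 120, Detective 30, Mafia 60.
ROLE_GAMES = {"town": 120, "detective": 30, "mafia": 60}


def role_weighted_mean(per_role: dict) -> float:
    """Average per-role values, weighting each role by its number of games."""
    total = sum(ROLE_GAMES.values())
    return sum(per_role[role] * n for role, n in ROLE_GAMES.items()) / total


# Per-role values for openai/gpt-5.2, copied from the Outcomes table.
gpt52_score = {"town": 0.33, "detective": 0.17, "mafia": 0.91}
gpt52_win = {"town": 0.45, "detective": 0.30, "mafia": 0.92}

overall_score = role_weighted_mean(gpt52_score)
overall_win = role_weighted_mean(gpt52_win)
print(f"{overall_score:.2f} {overall_win:.0%}")  # prints "0.47 56%"
```

Because the per-role cells are themselves rounded, other rows may differ from the published overall figure by a point in the last digit.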

Model vs Model

Support: n = 210 per matchup

Average outcome-point advantage after subtracting each role's baseline expectation.
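One way to read this definition is sketched below. The record layout, the use of the pooled per-role mean as the "baseline expectation", and the restriction to games that contain both models are all assumptions made for illustration; the page does not specify any of them.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-seat records: (game_id, model, role, outcome_points).
records = [
    ("g1", "model_a", "mafia", 1.0),
    ("g1", "model_b", "town", 0.0),
    ("g2", "model_a", "town", 1.0),
    ("g2", "model_b", "mafia", 0.0),
    ("g3", "model_a", "town", 0.0),
    ("g3", "model_b", "town", 0.0),
]


def role_baselines(rows):
    """Assumed baseline: mean outcome points per role over all records."""
    by_role = defaultdict(list)
    for _, _, role, points in rows:
        by_role[role].append(points)
    return {role: mean(values) for role, values in by_role.items()}


def advantage(rows, a, b):
    """Mean baseline-adjusted points of model `a` in games that also include `b`."""
    base = role_baselines(rows)
    games_with_b = {game for game, model, _, _ in rows if model == b}
    adjusted = [
        points - base[role]
        for game, model, role, points in rows
        if model == a and game in games_with_b
    ]
    return mean(adjusted)


adv_ab = advantage(records, "model_a", "model_b")
adv_ba = advantage(records, "model_b", "model_a")
print(f"A vs B: {adv_ab:+.3f}, B vs A: {adv_ba:+.3f}")
```

Subtracting a per-role baseline keeps a model from looking strong against an opponent simply because it happened to draw the easier role in their shared games.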


[Matrix A \ B over the seven models: deepseek/deepseek-v3.2, google/gemini-3.1-pro-preview, minimax/minimax-m2.5, moonshotai/kimi-k2.5, openai/gpt-5.2, x-ai/grok-4.1-fast, z-ai/glm-5. Diagonal cells are "-"; no off-diagonal values appear in this text.]

How Models Play

| Model | Vote-Elim Alignment | Elim Capture | Influence Shift (+) | Predicted Vote Accuracy | Survival Depth | Town Vote Misfire Rate | Town Vote Mafia Rate | Belief Brier (Mafia) | Belief Rank AUC | Last Elim Calibration |
|---|---|---|---|---|---|---|---|---|---|---|
| openai/gpt-5.2 | 0.81 (0.74..0.88) | 0.87 (0.81..0.93) | 0.01 (0.00..0.04) | 0.81 (0.78..0.84) | 0.38 (0.34..0.41) | 0.70 (0.62..0.78) | 0.30 (0.23..0.37) | 0.27 (0.26..0.27) | 0.53 (0.50..0.56) | 0.18 (0.15..0.21) |
| google/gemini-3.1-pro-preview | 0.68 (0.61..0.74) | 0.72 (0.65..0.79) | 0.06 (0.02..0.11) | 0.85 (0.82..0.87) | 0.80 (0.74..0.84) | 0.54 (0.47..0.60) | 0.46 (0.40..0.54) | 0.26 (0.24..0.28) | 0.54 (0.49..0.58) | 0.22 (0.20..0.25) |
| z-ai/glm-5 | 0.70 (0.63..0.77) | 0.74 (0.67..0.81) | 0.11 (0.05..0.19) | 0.80 (0.77..0.82) | 0.73 (0.67..0.79) | 0.65 (0.58..0.72) | 0.35 (0.28..0.41) | 0.26 (0.25..0.28) | 0.49 (0.44..0.54) | 0.21 (0.19..0.24) |
| deepseek/deepseek-v3.2 | 0.69 (0.62..0.75) | 0.73 (0.66..0.80) | 0.01 (0.00..0.03) | 0.82 (0.79..0.84) | 0.75 (0.69..0.80) | 0.59 (0.52..0.66) | 0.41 (0.34..0.48) | 0.25 (0.24..0.26) | 0.55 (0.50..0.59) | 0.28 (0.25..0.30) |
| moonshotai/kimi-k2.5 | 0.71 (0.63..0.77) | 0.76 (0.69..0.83) | 0.03 (0.00..0.11) | 0.79 (0.76..0.82) | 0.68 (0.62..0.73) | 0.59 (0.53..0.66) | 0.41 (0.34..0.48) | 0.24 (0.23..0.26) | 0.54 (0.49..0.58) | 0.30 (0.27..0.34) |
| minimax/minimax-m2.5 | 0.63 (0.56..0.71) | 0.68 (0.61..0.75) | 0.09 (0.03..0.19) | 0.65 (0.60..0.68) | 0.72 (0.66..0.78) | 0.57 (0.50..0.64) | 0.43 (0.36..0.50) | 0.27 (0.25..0.28) | 0.53 (0.49..0.57) | 0.25 (0.22..0.28) |
| x-ai/grok-4.1-fast | 0.25 (0.18..0.32) | 0.27 (0.19..0.35) | 0.02 (0.00..0.08) | 0.79 (0.77..0.81) | 0.41 (0.35..0.47) | 0.68 (0.59..0.76) | 0.32 (0.25..0.40) | 0.26 (0.24..0.27) | 0.49 (0.45..0.53) | 0.24 (0.19..0.28) |
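The two belief columns use standard scoring rules, though the page does not say how beliefs are elicited or aggregated. As a sketch under the assumption that each belief is a probability that a given player is Mafia, scored against the true role, the Brier score and rank AUC can be computed like this (the sample values are made up):

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels (lower is better)."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def rank_auc(probs, labels):
    """Probability that a random positive is ranked above a random negative (ties count half)."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical beliefs about four players and whether each is actually Mafia.
beliefs = [0.9, 0.2, 0.6, 0.4]
is_mafia = [1, 0, 0, 1]

print(brier_score(beliefs, is_mafia))  # approximately 0.1925
print(rank_auc(beliefs, is_mafia))     # 0.75
```

For orientation, a rank AUC of 0.5 corresponds to chance-level ordering, and a constant prediction of 0.5 always yields a Brier score of 0.25.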

What Predicts Winning

Feature Analysis

Grouped CV · 5 folds · seed 42 · 04/03/2026, 11:27:53

Role-only baseline: Log loss 0.666 · AUC 0.650 · Brier 0.236
Best feature set: Social + operational + survival · ΔLL +0.223 · ΔAUC +0.232
Includes: Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival
Data quality: Weak rows 0/1470 (0.0%) · Mean analysis weight 0.959
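Grouped cross-validation keeps all rows from one group in the same fold, so correlated rows never sit on both sides of a split. The sketch below assumes the grouping key is the game (210 games with 7 seats each would give the 1470 rows reported); the page does not name the key, and the helper is illustrative rather than the pipeline's own code.

```python
import random


def grouped_kfold(groups, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs so that no group spans both sides."""
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)  # seeded shuffle for reproducible folds
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for k in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train, test


# Hypothetical layout: 210 games, 7 rows (seats) per game.
groups = [f"game_{g}" for g in range(210) for _ in range(7)]
folds = list(grouped_kfold(groups))

for train, test in folds:
    train_games = {groups[i] for i in train}
    test_games = {groups[i] for i in test}
    assert train_games.isdisjoint(test_games)
    print(len(train), len(test))
```

Splitting rows at random instead would let seats from the same game leak across folds, which tends to inflate AUC and understate log loss.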
Feature-set ranking (delta vs role-only baseline)

| Feature set | What it includes | ΔAUC | ΔLog Loss | AUC | Log Loss |
|---|---|---|---|---|---|
| Social + operational + survival | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival | +0.232 | +0.223 | 0.882 | 0.442 |
| Social + operational | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution | +0.231 | +0.218 | 0.881 | 0.448 |
| Social all | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity | +0.231 | +0.218 | 0.881 | 0.448 |
| Belief + conversion | Belief, Influence, Voting | +0.218 | +0.182 | 0.868 | 0.484 |
| Social quality (no opportunity) | Voting, Influence, Detective actions, Mafia tactics | +0.202 | +0.160 | 0.852 | 0.506 |
| Belief inference only | Belief | +0.160 | +0.121 | 0.810 | 0.545 |
| Role only | Role baseline | +0.000 | +0.000 | 0.650 | 0.666 |
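The delta columns are consistent with ΔAUC = AUC minus the baseline AUC and ΔLog Loss = baseline log loss minus log loss, so a positive value is an improvement in both. That sign convention is inferred from the figures rather than stated. A short check against the published rows:

```python
BASELINE_AUC, BASELINE_LL = 0.650, 0.666

# (feature set, AUC, log loss, reported delta AUC, reported delta log loss)
rows = [
    ("Social + operational + survival", 0.882, 0.442, 0.232, 0.223),
    ("Social + operational", 0.881, 0.448, 0.231, 0.218),
    ("Social all", 0.881, 0.448, 0.231, 0.218),
    ("Belief + conversion", 0.868, 0.484, 0.218, 0.182),
    ("Social quality (no opportunity)", 0.852, 0.506, 0.202, 0.160),
    ("Belief inference only", 0.810, 0.545, 0.160, 0.121),
]


def deltas(auc, log_loss):
    """Improvement over the role-only baseline; positive means better."""
    return auc - BASELINE_AUC, BASELINE_LL - log_loss


for name, auc, ll, reported_auc, reported_ll in rows:
    d_auc, d_ll = deltas(auc, ll)
    # Allow 0.0015 of slack because the published values are rounded to three decimals.
    assert abs(d_auc - reported_auc) < 0.0015, name
    assert abs(d_ll - reported_ll) < 0.0015, name
    print(f"{name}: dAUC {d_auc:+.3f}, dLogLoss {d_ll:+.3f}")
```

The only row that does not match exactly is the first, where 0.666 - 0.442 gives 0.224 against a reported +0.223, which is what rounding of the underlying values would produce.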
Top single features

| Feature | ΔAUC | ΔLog Loss |
|---|---|---|
| belief_brier_mafia | +0.165 | +0.091 |
| belief_rank_auc | +0.146 | +0.085 |
| town_vote_misfire_rate | +0.133 | +0.065 |
| town_vote_mafia_rate | +0.133 | +0.065 |
| vote_elim_alignment_rate | +0.106 | +0.055 |
Validity by role

| Role | Rows | Weak rows | Weak % | Mean wt |
|---|---|---|---|---|
| detective | 210 | 0 | 0.0% | 0.951 |
| mafia | 420 | 0 | 0.0% | 0.985 |
| town | 840 | 0 | 0.0% | 0.949 |
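The row counts line up with the role mix per model multiplied by the seven models, and the row-weighted average of the per-role weights lands close to the overall mean analysis weight of 0.959. The sketch below checks both; treating the overall weight as a row-weighted mean is an assumption.

```python
N_MODELS = 7
ROLE_GAMES = {"detective": 30, "mafia": 60, "town": 120}  # games per model
ROWS = {"detective": 210, "mafia": 420, "town": 840}      # rows in the validity table
MEAN_WT = {"detective": 0.951, "mafia": 0.985, "town": 0.949}

# Each role's row count should be games-per-model times the number of models.
for role, games in ROLE_GAMES.items():
    assert games * N_MODELS == ROWS[role], role

total_rows = sum(ROWS.values())
pooled_wt = sum(ROWS[r] * MEAN_WT[r] for r in ROWS) / total_rows
print(total_rows, round(pooled_wt, 4))
```

The pooled figure differs from 0.959 only in the third decimal, which is within what rounding of the per-role weights allows.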