LLM Social Reasoning Benchmark (Feb/Mar 2026)


210/210 games completed · Role mix per model: Town 120 · Detective 30 · Mafia 60

#	Model	Outcome Score	Win Rate	Town (score / win rate)	Detective (score / win rate)	Mafia (score / win rate)
1	openai/gpt-5.2	0.47 (0.42..0.53)	56% (49..63%)	0.33 / 45%	0.17 / 30%	0.91 / 92%
2	google/gemini-3.1-pro-preview	0.41 (0.35..0.47)	47% (40..53%)	0.31 / 33%	0.21 / 43%	0.72 / 75%
3	z-ai/glm-5	0.39 (0.33..0.45)	47% (40..53%)	0.29 / 35%	0.19 / 37%	0.70 / 75%
4	deepseek/deepseek-v3.2	0.37 (0.31..0.44)	43% (36..50%)	0.29 / 33%	0.15 / 33%	0.65 / 68%
5	moonshotai/kimi-k2.5	0.36 (0.30..0.43)	42% (35..49%)	0.27 / 30%	0.20 / 40%	0.64 / 67%
6	minimax/minimax-m2.5	0.27 (0.21..0.33)	32% (26..39%)	0.24 / 28%	0.05 / 13%	0.44 / 50%
7	x-ai/grok-4.1-fast	0.19 (0.14..0.24)	30% (25..37%)	0.13 / 23%	0.14 / 30%	0.35 / 47%
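
The overall Outcome Score and Win Rate columns appear to be role-weighted averages of the per-role columns, using the role mix of 120 town, 30 detective and 60 mafia games per model. The following Python sketch checks this for the top and bottom rows. It assumes that each role cell holds an outcome score followed by a win rate; the names `ROLE_GAMES`, `ROWS` and `role_weighted` are illustrative and not part of the benchmark.

```python
# Role mix per model, taken from the header line (T 120 · D 30 · M 60).
ROLE_GAMES = {"town": 120, "detective": 30, "mafia": 60}

# Per-role (outcome score, win rate) pairs copied from the leaderboard.
# Reading each role cell as "score then win rate" is an assumption.
ROWS = {
    "openai/gpt-5.2": {
        "town": (0.33, 0.45), "detective": (0.17, 0.30), "mafia": (0.91, 0.92),
    },
    "x-ai/grok-4.1-fast": {
        "town": (0.13, 0.23), "detective": (0.14, 0.30), "mafia": (0.35, 0.47),
    },
}


def role_weighted(per_role, index):
    """Average a per-role value (0 = score, 1 = win rate) weighted by games played."""
    total_games = sum(ROLE_GAMES.values())
    weighted = sum(ROLE_GAMES[role] * per_role[role][index] for role in ROLE_GAMES)
    return weighted / total_games


for model, per_role in ROWS.items():
    score = role_weighted(per_role, 0)
    win_rate = role_weighted(per_role, 1)
    print(f"{model}: score {score:.3f}, win rate {win_rate:.1%}")
```

Both rows land within about one rounding step of the reported overall values, which supports this reading of the role columns but does not prove it.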

Support: n = 210 per matchup

Average outcome-point advantage after subtracting each role's baseline expectation.


Model	Vote-Elim Alignment	Elim Capture	Influence Shift (+)	Predicted Vote Accuracy	Survival Depth	Town Vote Misfire Rate	Town Vote Mafia Rate	Belief Brier (Mafia)	Belief Rank AUC	Last Elim Calibration
openai/gpt-5.2	0.81 (0.74..0.88)	0.87 (0.81..0.93)	0.01 (0.00..0.04)	0.81 (0.78..0.84)	0.38 (0.34..0.41)	0.70 (0.62..0.78)	0.30 (0.23..0.37)	0.27 (0.26..0.27)	0.53 (0.50..0.56)	0.18 (0.15..0.21)
google/gemini-3.1-pro-preview	0.68 (0.61..0.74)	0.72 (0.65..0.79)	0.06 (0.02..0.11)	0.85 (0.82..0.87)	0.80 (0.74..0.84)	0.54 (0.47..0.60)	0.46 (0.40..0.54)	0.26 (0.24..0.28)	0.54 (0.49..0.58)	0.22 (0.20..0.25)
z-ai/glm-5	0.70 (0.63..0.77)	0.74 (0.67..0.81)	0.11 (0.05..0.19)	0.80 (0.77..0.82)	0.73 (0.67..0.79)	0.65 (0.58..0.72)	0.35 (0.28..0.41)	0.26 (0.25..0.28)	0.49 (0.44..0.54)	0.21 (0.19..0.24)
deepseek/deepseek-v3.2	0.69 (0.62..0.75)	0.73 (0.66..0.80)	0.01 (0.00..0.03)	0.82 (0.79..0.84)	0.75 (0.69..0.80)	0.59 (0.52..0.66)	0.41 (0.34..0.48)	0.25 (0.24..0.26)	0.55 (0.50..0.59)	0.28 (0.25..0.30)
moonshotai/kimi-k2.5	0.71 (0.63..0.77)	0.76 (0.69..0.83)	0.03 (0.00..0.11)	0.79 (0.76..0.82)	0.68 (0.62..0.73)	0.59 (0.53..0.66)	0.41 (0.34..0.48)	0.24 (0.23..0.26)	0.54 (0.49..0.58)	0.30 (0.27..0.34)
minimax/minimax-m2.5	0.63 (0.56..0.71)	0.68 (0.61..0.75)	0.09 (0.03..0.19)	0.65 (0.60..0.68)	0.72 (0.66..0.78)	0.57 (0.50..0.64)	0.43 (0.36..0.50)	0.27 (0.25..0.28)	0.53 (0.49..0.57)	0.25 (0.22..0.28)
x-ai/grok-4.1-fast	0.25 (0.18..0.32)	0.27 (0.19..0.35)	0.02 (0.00..0.08)	0.79 (0.77..0.81)	0.41 (0.35..0.47)	0.68 (0.59..0.76)	0.32 (0.25..0.40)	0.26 (0.24..0.27)	0.49 (0.45..0.53)	0.24 (0.19..0.28)

Feature Analysis

Grouped CV · 5 folds · seed 42 · 04/03/2026, 11:27:53
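
Grouped cross-validation keeps all rows from one group in the same fold, so correlated rows never sit on both sides of a train/test split. The source does not say what the grouping variable is. The Python sketch below assumes the group is the game, with 210 games of 7 player rows each (1470 rows), and uses a hypothetical helper `grouped_folds`. It illustrates the technique only and is not the benchmark's own code.

```python
import random
from collections import defaultdict


def grouped_folds(group_ids, n_folds=5, seed=42):
    """Assign whole groups to folds so that no group is split across folds."""
    unique_groups = sorted(set(group_ids))
    random.Random(seed).shuffle(unique_groups)
    # Deal the shuffled groups round-robin into the folds.
    fold_of_group = {g: i % n_folds for i, g in enumerate(unique_groups)}
    rows_by_fold = defaultdict(list)
    for row_index, group in enumerate(group_ids):
        rows_by_fold[fold_of_group[group]].append(row_index)
    return [rows_by_fold[i] for i in range(n_folds)]


# Assumed layout: 210 games x 7 player rows = 1470 rows.
game_of_row = [game for game in range(210) for _ in range(7)]
folds = grouped_folds(game_of_row)

for test_rows in folds:
    test_games = {game_of_row[r] for r in test_rows}
    train_games = set(game_of_row) - test_games
    # Every row of a test game is in the test fold, so the game sets are disjoint.
    print(len(test_rows), "test rows,", len(test_games), "test games,",
          len(train_games), "train games")
```

Under these assumptions each fold holds 42 games (294 rows), and the remaining 168 games form the training set for that fold.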

Role-only baseline

Log loss 0.666 · AUC 0.650 · Brier 0.236
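
For reference, the three metrics can be written out for a binary target. The Python sketch below uses their standard definitions and evaluates them on an uninformative constant prediction of 0.5. The toy labels are illustrative only, and treating the target as a binary outcome such as win or loss is an assumption.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    clipped = [min(max(p, eps), 1 - eps) for p in p_pred]
    total = sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, clipped))
    return -total / len(y_true)


def brier(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, p_pred)) / len(y_true)


def auc(y_true, p_pred):
    """Chance that a random positive is scored above a random negative (ties count half)."""
    pos = [p for t, p in zip(y_true, p_pred) if t == 1]
    neg = [p for t, p in zip(y_true, p_pred) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


y = [1, 0, 1, 0]          # toy labels, illustrative only
p = [0.5, 0.5, 0.5, 0.5]  # constant, uninformative prediction
print(round(log_loss(y, p), 3), brier(y, p), auc(y, p))  # 0.693 0.25 0.5
```

Against these coin-flip reference points (log loss 0.693, Brier 0.25, AUC 0.5), the role-only baseline is only modestly informative.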

Best feature set

Social + operational + survival · ΔLL +0.223 · ΔAUC +0.232

Includes: Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival

Data quality

Weak rows 0/1470 (0.0%)

Mean analysis weight 0.959

Feature-set ranking (delta vs role-only baseline)

Feature set	What it includes	ΔAUC	ΔLog Loss	AUC	Log Loss
Social + operational + survival	Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival	+0.232	+0.223	0.882	0.442
Social + operational	Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution	+0.231	+0.218	0.881	0.448
Social all	Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity	+0.231	+0.218	0.881	0.448
Belief + conversion	Belief, Influence, Voting	+0.218	+0.182	0.868	0.484
Social quality (no opportunity)	Voting, Influence, Detective actions, Mafia tactics	+0.202	+0.160	0.852	0.506
Belief inference only	Belief	+0.160	+0.121	0.810	0.545
Role only	Role baseline	+0.000	+0.000	0.650	0.666
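
The sign convention of the delta columns can be recovered from the table itself: ΔAUC matches AUC minus the baseline AUC, while ΔLog Loss matches the baseline log loss minus the feature set's log loss, so a positive value means an improvement in both columns. The Python sketch below checks this against the reported figures. The names `ROWS` and `max_discrepancy` are illustrative.

```python
BASE_AUC, BASE_LL = 0.650, 0.666  # role-only baseline

# (feature set, reported ΔAUC, reported ΔLog Loss, AUC, log loss), copied from the table.
ROWS = [
    ("Social + operational + survival", 0.232, 0.223, 0.882, 0.442),
    ("Social + operational",            0.231, 0.218, 0.881, 0.448),
    ("Social all",                      0.231, 0.218, 0.881, 0.448),
    ("Belief + conversion",             0.218, 0.182, 0.868, 0.484),
    ("Social quality (no opportunity)", 0.202, 0.160, 0.852, 0.506),
    ("Belief inference only",           0.160, 0.121, 0.810, 0.545),
    ("Role only",                       0.000, 0.000, 0.650, 0.666),
]


def max_discrepancy(rows):
    """Largest gap between a reported delta and the delta recomputed from the table."""
    worst = 0.0
    for _name, d_auc, d_ll, auc, ll in rows:
        worst = max(worst, abs((auc - BASE_AUC) - d_auc))  # higher AUC is better
        worst = max(worst, abs((BASE_LL - ll) - d_ll))     # lower log loss is better
    return worst


print(f"max discrepancy: {max_discrepancy(ROWS):.4f}")
```

The recomputed deltas agree with the reported ones to within about 0.001, consistent with the deltas having been computed before rounding.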

Top single features

Validity by role

Role	Rows	Weak %	Mean weight
detective	210	0.0%	0.951
mafia	420	0.0%	0.985
town	840	0.0%	0.949