LLM Social Reasoning with Learning Benchmark (Feb/Mar 2026)


210/210 games completed · Role mix per model: Town 120 · Detective 30 · Mafia 60

#	Model	Outcome Score [range]	Win Rate [range]	Town (score / win rate)	Detective (score / win rate)	Mafia (score / win rate)
1	openai/gpt-5.2	0.49 [0.42..0.55]	56% [49..63%]	0.33 / 40%	0.27 / 53%	0.90 / 90%
2	moonshotai/kimi-k2.5	0.38 [0.32..0.44]	43% [36..50%]	0.30 / 35%	0.14 / 27%	0.65 / 67%
3	google/gemini-3.1-pro-preview	0.37 [0.31..0.43]	45% [38..52%]	0.29 / 37%	0.13 / 27%	0.66 / 70%
4	z-ai/glm-5	0.36 [0.30..0.42]	41% [35..48%]	0.28 / 33%	0.16 / 27%	0.62 / 63%
5	deepseek/deepseek-v3.2	0.33 [0.27..0.40]	39% [33..46%]	0.26 / 29%	0.20 / 37%	0.56 / 60%
6	x-ai/grok-4.1-fast	0.32 [0.26..0.38]	41% [35..48%]	0.26 / 33%	0.15 / 30%	0.53 / 63%
7	minimax/minimax-m2.5	0.27 [0.22..0.32]	35% [29..42%]	0.21 / 27%	0.18 / 33%	0.44 / 53%
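
The page does not say how the ranges next to each score were computed. As a rough plausibility check, the sketch below computes a Wilson score interval for a win rate over 210 games. The choice of a 95% Wilson interval is an assumption, and the count of 118 wins is a hypothetical value chosen to match the reported 56%.

```python
import math

def wilson_interval(wins: int, games: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z of about 1.96 gives ~95%)."""
    if games <= 0:
        raise ValueError("games must be positive")
    p = wins / games
    denom = 1 + z * z / games
    center = (p + z * z / (2 * games)) / denom
    half = z * math.sqrt(p * (1 - p) / games + z * z / (4 * games * games)) / denom
    return center - half, center + half

# Hypothetical count: 118 wins out of 210 games is roughly 56%.
low, high = wilson_interval(118, 210)
print(f"{low:.2f}..{high:.2f}")
```

Under these assumptions the interval falls close to the 49..63% range shown for the top-ranked model, but that agreement does not establish which method the dashboard actually uses.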

Pairwise matchup matrix (not reproduced here): average outcome-point advantage after subtracting each role's baseline expectation. Support: n = 210 per matchup.
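
One way to read "advantage after subtracting each role's baseline expectation" is sketched below. The record fields (`model`, `role`, `outcome`, `opponents`) are hypothetical, and taking the baseline to be the mean outcome for a role across all records is an assumption, since the page does not define it.

```python
from collections import defaultdict
from statistics import mean

def role_baselines(games):
    """Mean outcome per role across all records (assumed baseline definition)."""
    by_role = defaultdict(list)
    for g in games:
        by_role[g["role"]].append(g["outcome"])
    return {role: mean(vals) for role, vals in by_role.items()}

def matchup_advantage(games, model, opponent):
    """Average of (outcome - role baseline) over records where `model` faced `opponent`."""
    base = role_baselines(games)
    diffs = [
        g["outcome"] - base[g["role"]]
        for g in games
        if g["model"] == model and opponent in g["opponents"]
    ]
    return mean(diffs) if diffs else float("nan")

# Toy records; field names and values are illustrative only.
games = [
    {"model": "a", "role": "mafia", "outcome": 1.0, "opponents": {"b"}},
    {"model": "b", "role": "mafia", "outcome": 0.0, "opponents": {"a"}},
    {"model": "a", "role": "town", "outcome": 0.5, "opponents": {"b"}},
    {"model": "b", "role": "town", "outcome": 0.5, "opponents": {"a"}},
]
print(matchup_advantage(games, "a", "b"))
```

Subtracting the role baseline keeps a model from looking strong merely because it drew an easier role in a given matchup.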


Model	Vote-Elim Alignment	Elim Capture	Influence Shift (+)	Predicted Vote Accuracy	Survival Depth	Town Vote Misfire Rate	Town Vote Mafia Rate	Belief Brier (Mafia)	Belief Rank AUC	Last Elim Calibration
openai/gpt-5.2	0.87 [0.81..0.92]	0.90 [0.84..0.94]	0.13 [0.07..0.21]	0.86 [0.83..0.88]	0.58 [0.52..0.64]	0.58 [0.50..0.65]	0.42 [0.35..0.50]	0.26 [0.25..0.27]	0.57 [0.53..0.61]	0.21 [0.19..0.23]
moonshotai/kimi-k2.5	0.65 [0.58..0.72]	0.67 [0.60..0.74]	0.08 [0.00..0.23]	0.83 [0.81..0.85]	0.71 [0.65..0.76]	0.61 [0.54..0.69]	0.39 [0.31..0.47]	0.25 [0.24..0.27]	0.53 [0.49..0.58]	0.32 [0.29..0.36]
google/gemini-3.1-pro-preview	0.57 [0.49..0.65]	0.61 [0.54..0.69]	0.08 [0.04..0.13]	0.87 [0.84..0.89]	0.66 [0.60..0.73]	0.58 [0.51..0.65]	0.42 [0.35..0.50]	0.25 [0.24..0.27]	0.58 [0.54..0.61]	0.23 [0.19..0.26]
z-ai/glm-5	0.75 [0.68..0.81]	0.77 [0.71..0.83]	0.09 [0.05..0.14]	0.82 [0.80..0.85]	0.67 [0.61..0.72]	0.57 [0.50..0.65]	0.43 [0.35..0.51]	0.26 [0.24..0.27]	0.53 [0.49..0.58]	0.26 [0.23..0.28]
deepseek/deepseek-v3.2	0.59 [0.52..0.66]	0.62 [0.55..0.70]	0.04 [0.01..0.08]	0.86 [0.84..0.88]	0.68 [0.62..0.74]	0.62 [0.54..0.69]	0.38 [0.31..0.45]	0.27 [0.25..0.28]	0.53 [0.49..0.57]	0.28 [0.25..0.31]
x-ai/grok-4.1-fast	0.55 [0.47..0.62]	0.58 [0.50..0.66]	0.07 [0.02..0.14]	0.85 [0.83..0.87]	0.66 [0.59..0.72]	0.58 [0.52..0.66]	0.42 [0.34..0.49]	0.24 [0.23..0.26]	0.58 [0.53..0.63]	0.32 [0.28..0.35]
minimax/minimax-m2.5	0.56 [0.48..0.63]	0.59 [0.51..0.67]	0.07 [0.03..0.13]	0.75 [0.71..0.79]	0.60 [0.54..0.67]	0.70 [0.63..0.77]	0.30 [0.23..0.37]	0.28 [0.26..0.29]	0.50 [0.46..0.54]	0.29 [0.26..0.32]

Feature Analysis

Grouped cross-validation · 5 folds · seed 42 · run 2026-03-08 11:44:47
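
Grouped cross-validation keeps all rows that share a group in the same fold, so correlated rows never straddle the train/test split. The sketch below is a minimal stdlib version with 5 folds and seed 42. Using the game ID as the grouping key is an assumption (the page does not name the key), and `grouped_kfold` is a hypothetical helper, not the benchmark's own code.

```python
import random

def grouped_kfold(groups, n_splits=5, seed=42):
    """Yield (train_idx, test_idx) pairs; no group appears on both sides of a split."""
    unique = sorted(set(groups))
    if len(unique) < n_splits:
        raise ValueError("need at least as many groups as folds")
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Assign shuffled groups to folds round-robin.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for k in range(n_splits):
        test = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        yield train, test

# 210 hypothetical game IDs with 7 rows each, giving 1470 rows in total.
groups = [game_id for game_id in range(210) for _ in range(7)]
folds = list(grouped_kfold(groups))
print([len(test) for _, test in folds])
```

Seven rows per game is consistent with the 1470 rows and 210 games reported on this page, but the actual row layout is not stated.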

Role-only baseline

Log loss 0.668 · AUC 0.641 · Brier 0.237
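
For reference, the three metrics can be computed as in the sketch below, which uses their standard definitions for binary outcomes. The function names and toy data are illustrative and are not taken from the benchmark code.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)

def roc_auc(y_true, p_pred):
    """Chance that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [p for y, p in zip(y_true, p_pred) if y == 1]
    neg = [p for y, p in zip(y_true, p_pred) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))

# An uninformed predictor that always outputs 0.5.
y = [1, 0, 1, 0]
p = [0.5, 0.5, 0.5, 0.5]
print(log_loss(y, p), brier_score(y, p), roc_auc(y, p))
```

A constant 0.5 predictor scores log loss ln 2 (about 0.693), Brier 0.25 and AUC 0.5, so the role-only baseline above is only modestly better than an uninformed guess.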

Best feature set

Social + operational + survival · ΔLog Loss +0.247 · ΔAUC +0.257

Includes: Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival

Data quality

Weak rows 0/1470 (0.0%)

Mean analysis weight 0.961

Feature-set ranking (delta vs role-only baseline; ΔLog Loss is the reduction in log loss, so positive is better for both deltas)

Feature set	What it includes	ΔAUC	ΔLog Loss	AUC	Log Loss
Social + operational + survival	Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival	+0.257	+0.247	0.897	0.420
Social all	Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity	+0.257	+0.242	0.897	0.425
Social + operational	Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution	+0.256	+0.241	0.897	0.426
Belief + conversion	Belief, Influence, Voting	+0.241	+0.199	0.882	0.468
Social quality (no opportunity)	Voting, Influence, Detective actions, Mafia tactics	+0.235	+0.193	0.876	0.475
Belief inference only	Belief	+0.178	+0.125	0.819	0.543
Role only	Role baseline	+0.000	+0.000	0.641	0.668
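
The deltas in the table can be recomputed from the AUC and Log Loss columns, which also makes the sign convention explicit: ΔAUC is model minus baseline, while ΔLog Loss is baseline minus model. The sketch below uses the rounded values from the table, so results can differ from the reported deltas by up to about 0.001. The variable and function names are illustrative.

```python
BASELINE = {"auc": 0.641, "log_loss": 0.668}

# (AUC, log loss) per feature set, copied from the table above.
rows = {
    "Social + operational + survival": (0.897, 0.420),
    "Social all": (0.897, 0.425),
    "Social + operational": (0.897, 0.426),
    "Belief + conversion": (0.882, 0.468),
    "Social quality (no opportunity)": (0.876, 0.475),
    "Belief inference only": (0.819, 0.543),
}

def deltas(auc, log_loss, base=BASELINE):
    """Return (delta AUC, delta log loss); both are positive when the model improves."""
    return round(auc - base["auc"], 3), round(base["log_loss"] - log_loss, 3)

for name, (auc, ll) in rows.items():
    d_auc, d_ll = deltas(auc, ll)
    print(f"{name}: dAUC {d_auc:+.3f}, dLogLoss {d_ll:+.3f}")
```

The small mismatches against the reported deltas (for example 0.248 here versus +0.247 in the table) are what one would expect if the dashboard computes deltas from unrounded values.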


Validity by role

Role	Rows	Weak %	Mean weight
detective	210	0.0%	0.952
mafia	420	0.0%	0.984
town	840	0.0%	0.952
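
The per-role figures can be cross-checked against the totals reported earlier on the page. The sketch below assumes that the overall mean analysis weight is the row-weighted average of the per-role means, and that row counts equal the per-model role mix times the seven models on the leaderboard. Both are assumptions that happen to reproduce the reported numbers.

```python
# (rows, mean weight) per role, copied from the table above.
rows_by_role = {
    "detective": (210, 0.952),
    "mafia": (420, 0.984),
    "town": (840, 0.952),
}

total_rows = sum(n for n, _ in rows_by_role.values())
overall_weight = sum(n * w for n, w in rows_by_role.values()) / total_rows

# Role mix per model from the page header, times the 7 models on the leaderboard.
role_mix = {"town": 120, "detective": 30, "mafia": 60}
n_models = 7
expected_rows = {role: count * n_models for role, count in role_mix.items()}

print(total_rows, round(overall_weight, 3), expected_rows)
```

With these inputs the total comes to 1470 rows and the weighted mean rounds to 0.961, in line with the Data quality figures.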