LLM Social Reasoning with Learning Benchmark (Feb/Mar 2026)

210/210 games completed · Role mix per model: Town (T) 120 · Detective (D) 30 · Mafia (M) 60

Outcomes

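| # | Model | Outcome Score | Win Rate | Town (score · win rate) | Detective (score · win rate) | Mafia (score · win rate) |
|---|---|---|---|---|---|---|
| 1 | openai/gpt-5.2 | 0.49 (0.42..0.55) | 56% (49..63%) | 0.33 · 40% | 0.27 · 53% | 0.90 · 90% |
| 2 | moonshotai/kimi-k2.5 | 0.38 (0.32..0.44) | 43% (36..50%) | 0.30 · 35% | 0.14 · 27% | 0.65 · 67% |
| 3 | google/gemini-3.1-pro-preview | 0.37 (0.31..0.43) | 45% (38..52%) | 0.29 · 37% | 0.13 · 27% | 0.66 · 70% |
| 4 | z-ai/glm-5 | 0.36 (0.30..0.42) | 41% (35..48%) | 0.28 · 33% | 0.16 · 27% | 0.62 · 63% |
| 5 | deepseek/deepseek-v3.2 | 0.33 (0.27..0.40) | 39% (33..46%) | 0.26 · 29% | 0.20 · 37% | 0.56 · 60% |
| 6 | x-ai/grok-4.1-fast | 0.32 (0.26..0.38) | 41% (35..48%) | 0.26 · 33% | 0.15 · 30% | 0.53 · 63% |
| 7 | minimax/minimax-m2.5 | 0.27 (0.22..0.32) | 35% (29..42%) | 0.21 · 27% | 0.18 · 33% | 0.44 · 53% |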
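
The overall figures appear to be the game-count-weighted mean of the per-role figures under the stated role mix (T 120 · D 30 · M 60). This is an inference from the numbers, not something the page states. A minimal Python sketch, using the openai/gpt-5.2 row and a hypothetical helper name `role_weighted_mean`:

```python
# Games per role for each model, from the "Role mix/model" line.
ROLE_GAMES = {"town": 120, "detective": 30, "mafia": 60}

def role_weighted_mean(per_role):
    """Game-count-weighted mean of a per-role metric (assumed aggregation)."""
    total = sum(ROLE_GAMES.values())
    return sum(per_role[role] * n for role, n in ROLE_GAMES.items()) / total

# Per-role values for openai/gpt-5.2, copied from the Outcomes table.
gpt52_win = {"town": 0.40, "detective": 0.53, "mafia": 0.90}
gpt52_score = {"town": 0.33, "detective": 0.27, "mafia": 0.90}

overall_win = role_weighted_mean(gpt52_win)
overall_score = role_weighted_mean(gpt52_score)
print(f"win rate ~ {overall_win:.3f}, outcome score ~ {overall_score:.3f}")
```

The results land close to the reported 56% and 0.49; the small gap in the score is within what rounding of the per-role inputs to two decimals can produce.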

Model vs Model

Support: n = 210 per matchup

Average outcome-point advantage after subtracting each role's baseline expectation.
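
One way to read this definition: subtract from each game result the mean outcome for the role that was played, then average the residuals. The sketch below assumes the baseline is the per-role mean over all records and computes a per-model advantage on toy data; the record layout and the function name `role_adjusted_advantage` are hypothetical, and the page's per-matchup version would additionally restrict records to games involving both models A and B.

```python
from collections import defaultdict

def role_adjusted_advantage(records):
    """records: iterable of (model, role, outcome_points).

    Returns {model: mean(outcome_points - role baseline)}, where the role
    baseline is the mean outcome for that role over all records (assumption).
    """
    by_role = defaultdict(list)
    for _, role, points in records:
        by_role[role].append(points)
    baseline = {role: sum(v) / len(v) for role, v in by_role.items()}

    residuals = defaultdict(list)
    for model, role, points in records:
        residuals[model].append(points - baseline[role])
    return {model: sum(v) / len(v) for model, v in residuals.items()}

# Toy records, not benchmark data.
toy = [
    ("A", "mafia", 1.0), ("B", "mafia", 0.0),
    ("A", "town", 0.0), ("B", "town", 1.0),
    ("A", "town", 1.0), ("B", "town", 0.0),
]
advantage = role_adjusted_advantage(toy)
print(advantage)
```

Subtracting the role baseline keeps a model from looking strong simply because some roles score higher on average than others.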

[A \ B matrix. Rows and columns: deepseek/deepseek-v3.2, google/gemini-3.1-pro-preview, minimax/minimax-m2.5, moonshotai/kimi-k2.5, openai/gpt-5.2, x-ai/grok-4.1-fast, z-ai/glm-5. Diagonal cells are "-"; no off-diagonal values appear in the text.]

How Models Play

| Model | Vote-Elim Alignment | Elim Capture | Influence Shift (+) | Predicted Vote Accuracy | Survival Depth | Town Vote Misfire Rate | Town Vote Mafia Rate | Belief Brier (Mafia) | Belief Rank AUC | Last Elim Calibration |
|---|---|---|---|---|---|---|---|---|---|---|
| openai/gpt-5.2 | 0.87 (0.81..0.92) | 0.90 (0.84..0.94) | 0.13 (0.07..0.21) | 0.86 (0.83..0.88) | 0.58 (0.52..0.64) | 0.58 (0.50..0.65) | 0.42 (0.35..0.50) | 0.26 (0.25..0.27) | 0.57 (0.53..0.61) | 0.21 (0.19..0.23) |
| moonshotai/kimi-k2.5 | 0.65 (0.58..0.72) | 0.67 (0.60..0.74) | 0.08 (0.00..0.23) | 0.83 (0.81..0.85) | 0.71 (0.65..0.76) | 0.61 (0.54..0.69) | 0.39 (0.31..0.47) | 0.25 (0.24..0.27) | 0.53 (0.49..0.58) | 0.32 (0.29..0.36) |
| google/gemini-3.1-pro-preview | 0.57 (0.49..0.65) | 0.61 (0.54..0.69) | 0.08 (0.04..0.13) | 0.87 (0.84..0.89) | 0.66 (0.60..0.73) | 0.58 (0.51..0.65) | 0.42 (0.35..0.50) | 0.25 (0.24..0.27) | 0.58 (0.54..0.61) | 0.23 (0.19..0.26) |
| z-ai/glm-5 | 0.75 (0.68..0.81) | 0.77 (0.71..0.83) | 0.09 (0.05..0.14) | 0.82 (0.80..0.85) | 0.67 (0.61..0.72) | 0.57 (0.50..0.65) | 0.43 (0.35..0.51) | 0.26 (0.24..0.27) | 0.53 (0.49..0.58) | 0.26 (0.23..0.28) |
| deepseek/deepseek-v3.2 | 0.59 (0.52..0.66) | 0.62 (0.55..0.70) | 0.04 (0.01..0.08) | 0.86 (0.84..0.88) | 0.68 (0.62..0.74) | 0.62 (0.54..0.69) | 0.38 (0.31..0.45) | 0.27 (0.25..0.28) | 0.53 (0.49..0.57) | 0.28 (0.25..0.31) |
| x-ai/grok-4.1-fast | 0.55 (0.47..0.62) | 0.58 (0.50..0.66) | 0.07 (0.02..0.14) | 0.85 (0.83..0.87) | 0.66 (0.59..0.72) | 0.58 (0.52..0.66) | 0.42 (0.34..0.49) | 0.24 (0.23..0.26) | 0.58 (0.53..0.63) | 0.32 (0.28..0.35) |
| minimax/minimax-m2.5 | 0.56 (0.48..0.63) | 0.59 (0.51..0.67) | 0.07 (0.03..0.13) | 0.75 (0.71..0.79) | 0.60 (0.54..0.67) | 0.70 (0.63..0.77) | 0.30 (0.23..0.37) | 0.28 (0.26..0.29) | 0.50 (0.46..0.54) | 0.29 (0.26..0.32) |
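
The page does not define the two belief metrics. Under the standard definitions they could be computed as below: the Brier score is the mean squared error between a stated probability that a player is Mafia and the 0/1 truth, and rank AUC is the fraction of (Mafia, non-Mafia) pairs in which the Mafia player received the higher probability. The function names and toy inputs are illustrative, and the benchmark's exact aggregation may differ.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def rank_auc(probs, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in pos
        for b in neg
    )
    return wins / (len(pos) * len(neg))

# Toy beliefs about four players (not benchmark data); 1 = actually Mafia.
beliefs = [0.7, 0.5, 0.4, 0.1]
is_mafia = [1, 0, 1, 0]

brier = brier_score(beliefs, is_mafia)
auc = rank_auc(beliefs, is_mafia)
print(f"Brier {brier:.4f}, rank AUC {auc:.2f}")
```

For reference, under these definitions a constant belief of 0.5 scores a Brier of exactly 0.25, and an uninformative ranking scores an AUC of 0.5. Lower Brier is better; higher AUC is better.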

What Predicts Winning

Feature Analysis

Grouped CV · 5 folds · seed 42 · 08/03/2026, 11:44:47

Role-only baseline: Log loss 0.668 · AUC 0.641 · Brier 0.237
Best feature set: Social + operational + survival · ΔLL +0.247 · ΔAUC +0.257
Includes: Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival
Data quality: Weak rows 0/1470 (0.0%) · Mean analysis weight 0.961
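
"Grouped CV · 5 folds · seed 42" suggests that rows sharing a group never straddle the train/test boundary. The page does not name the grouping key; the sketch below assumes it is the game, with 7 player rows per game (210 × 7 = 1470, matching the row count above). `grouped_folds` is a hypothetical helper, not the benchmark's code.

```python
import random

def grouped_folds(groups, n_folds=5, seed=42):
    """Assign every distinct group to exactly one test fold.

    Returns a list of (train_indices, test_indices) pairs.
    """
    unique = sorted(set(groups))
    random.Random(seed).shuffle(unique)
    fold_of = {g: i % n_folds for i, g in enumerate(unique)}
    folds = []
    for k in range(n_folds):
        test = [i for i, g in enumerate(groups) if fold_of[g] == k]
        train = [i for i, g in enumerate(groups) if fold_of[g] != k]
        folds.append((train, test))
    return folds

# Assumed layout: 210 games, 7 player rows each.
groups = [game_id for game_id in range(210) for _ in range(7)]
folds = grouped_folds(groups)

for train, test in folds:
    # No game contributes rows to both sides of a split.
    assert not {groups[i] for i in train} & {groups[i] for i in test}
print([len(test) for _, test in folds])
```

Grouping this way avoids leakage: rows from one game share an outcome, so splitting them across train and test would inflate the held-out scores.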
Feature-set ranking (delta vs role-only baseline)

| Feature set | What it includes | ΔAUC | ΔLog Loss | AUC | Log Loss |
|---|---|---|---|---|---|
| Social + operational + survival | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution, Survival | +0.257 | +0.247 | 0.897 | 0.420 |
| Social all | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity | +0.257 | +0.242 | 0.897 | 0.425 |
| Social + operational | Voting, Influence, Belief, Detective actions, Mafia tactics, Opportunity, Execution | +0.256 | +0.241 | 0.897 | 0.426 |
| Belief + conversion | Belief, Influence, Voting | +0.241 | +0.199 | 0.882 | 0.468 |
| Social quality (no opportunity) | Voting, Influence, Detective actions, Mafia tactics | +0.235 | +0.193 | 0.876 | 0.475 |
| Belief inference only | Belief | +0.178 | +0.125 | 0.819 | 0.543 |
| Role only | Role baseline | +0.000 | +0.000 | 0.641 | 0.668 |
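
The delta columns can be reproduced from the absolute columns, assuming ΔAUC is the model's AUC minus the baseline's and ΔLog Loss is the baseline's log loss minus the model's, so that positive means better in both. That sign convention is inferred from the numbers rather than stated on the page. A short check:

```python
BASELINE_AUC = 0.641
BASELINE_LOG_LOSS = 0.668

# (AUC, log loss, reported ΔAUC, reported ΔLog Loss), copied from the table.
FEATURE_SETS = {
    "Social + operational + survival": (0.897, 0.420, 0.257, 0.247),
    "Social all": (0.897, 0.425, 0.257, 0.242),
    "Social + operational": (0.897, 0.426, 0.256, 0.241),
    "Belief + conversion": (0.882, 0.468, 0.241, 0.199),
    "Social quality (no opportunity)": (0.876, 0.475, 0.235, 0.193),
    "Belief inference only": (0.819, 0.543, 0.178, 0.125),
}

def deltas(auc, log_loss):
    """Improvement over the role-only baseline; positive is better."""
    return auc - BASELINE_AUC, BASELINE_LOG_LOSS - log_loss

max_gap = 0.0
for name, (auc, ll, rep_d_auc, rep_d_ll) in FEATURE_SETS.items():
    d_auc, d_ll = deltas(auc, ll)
    max_gap = max(max_gap, abs(d_auc - rep_d_auc), abs(d_ll - rep_d_ll))
    print(f"{name}: dAUC {d_auc:+.3f}, dLL {d_ll:+.3f}")
```

The recomputed values agree with the reported deltas to within about 0.001, which is what rounding the absolute columns to three decimals would produce.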
Top single features

| Feature | ΔAUC | ΔLog Loss |
|---|---|---|
| belief_brier_mafia | +0.182 | +0.099 |
| belief_rank_auc | +0.174 | +0.095 |
| town_vote_misfire_rate | +0.155 | +0.081 |
| town_vote_mafia_rate | +0.155 | +0.081 |
| vote_elim_alignment_rate | +0.130 | +0.067 |
Validity by role

| Role | Rows | Weak rows | Weak % | Mean wt |
|---|---|---|---|---|
| detective | 210 | 0 | 0.0% | 0.952 |
| mafia | 420 | 0 | 0.0% | 0.984 |
| town | 840 | 0 | 0.0% | 0.952 |