Simulation Studies
Every page below is one experiment. We simulate data from a known model, run the estimators on it, and report what they recover. Because the truth is known, both recovery and failure are measurable.

| Page | Environment | Size | Estimators | What it shows |
|---|---|---|---|---|
| | Keep-or-replace mileage model (Rust 1987). | 20 states x 2 actions | All. | The canonical benchmark. Who recovers the cost parameters, and at what compute cost. |
| | Walk to a goal on a grid. | 64 states x 5 actions | All, IRL focus. | What happens where the data rarely goes. |
| | Small random MDP, linear reward. | 8 states x 2 actions | All. | An easy problem every correct estimator must pass. |
| | The same generator, hardened three ways. | 300 states; 24-state collinear cell | Structural family. | Runtime at scale, inference near discount one, and broken identification. |
| | The same generator at large scale. | 3000 states x 2 actions | Ten estimators across families. | How compute costs separate as the state space grows. |
| | A reward that multiplies two features the estimators do not model. | 24 states x 3 actions | All. | What an omitted interaction costs: a small behavioral miss, a larger counterfactual one. |
| | Estimation under correct and misspecified rewards. | Varies | MPEC, neural MPEC, GLADIUS. | How this family degrades under reward misspecification. |

The findings in one line: almost every estimator matches the choice probabilities; the differences show up in parameter recovery, in counterfactuals, and in compute cost.
Reading the tables
All numbers come from a saved results file written by the run script. Crashes and timeouts stay in the table with their error message.
Policy TV is the total variation distance between the estimated choice probabilities and the true ones. Lower is better.
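As a rough illustration of the metric, the sketch below computes a per-state total variation distance (half the L1 distance between two action distributions) and averages it over states. The function name `policy_tv`, the uniform state weighting, and the toy probabilities are assumptions made for this example; the run script may weight states differently.

```python
def policy_tv(p_hat, p_true, weights=None):
    """Average total variation distance between two policies.

    p_hat, p_true: one list of action probabilities per state.
    weights: optional per-state weights; uniform weighting is assumed
    here because the page does not say how states are weighted.
    """
    n_states = len(p_true)
    if weights is None:
        weights = [1.0 / n_states] * n_states
    total = 0.0
    for w, row_hat, row_true in zip(weights, p_hat, p_true):
        # Per-state TV is half the L1 distance between the distributions.
        tv_state = 0.5 * sum(abs(a - b) for a, b in zip(row_hat, row_true))
        total += w * tv_state
    return total


# Toy example: the estimate is off by 0.1 in the first state only.
p_true = [[0.9, 0.1], [0.5, 0.5]]
p_hat = [[0.8, 0.2], [0.5, 0.5]]
tv = policy_tv(p_hat, p_true)
print(round(tv, 6))
```

A value of zero means the estimated policy reproduces the true choice probabilities exactly; the maximum possible value is one.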
Regret measures welfare lost when the recovered model is used in a changed environment. Type A shifts a payoff. Type B changes the dynamics. Type C penalizes an action. Structural estimators re-solve the model and adapt. Behavioral estimators keep their old policy, so their Type C regret is large.
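The contrast between the two families under a Type C change can be sketched on a toy problem. Everything below is hypothetical and chosen for illustration: a two-state keep-or-replace MDP with deterministic transitions, a hard-max policy instead of the stochastic choice model the experiments likely use, a structural estimator that is assumed to have recovered the reward exactly, and regret averaged uniformly over states.

```python
GAMMA = 0.9
N_STATES, N_ACTIONS = 2, 2
KEEP, REPLACE = 0, 1


def next_state(s, a):
    # Hypothetical dynamics: keeping ages the unit, replacing resets it.
    return 0 if a == REPLACE else min(s + 1, N_STATES - 1)


def reward(s, a, penalty=0.0):
    # Hypothetical payoffs: running cost grows with the state,
    # replacement has a fixed cost plus an optional Type C penalty.
    return -1.0 - penalty if a == REPLACE else -2.0 * s


def q_value(s, a, v, penalty):
    return reward(s, a, penalty) + GAMMA * v[next_state(s, a)]


def solve(penalty, sweeps=2000):
    """Value iteration; returns the greedy policy and its values."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [max(q_value(s, a, v, penalty) for a in range(N_ACTIONS))
             for s in range(N_STATES)]
    policy = [max(range(N_ACTIONS), key=lambda a: q_value(s, a, v, penalty))
              for s in range(N_STATES)]
    return policy, v


def evaluate(policy, penalty, sweeps=2000):
    """Iterative policy evaluation in the (possibly changed) environment."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        v = [q_value(s, policy[s], v, penalty) for s in range(N_STATES)]
    return v


def regret(policy, penalty):
    # Welfare lost relative to the optimum, averaged over states.
    _, v_star = solve(penalty)
    v_pi = evaluate(policy, penalty)
    return sum(a - b for a, b in zip(v_star, v_pi)) / N_STATES


PENALTY = 3.0                               # Type C: penalize replacing
old_policy, _ = solve(0.0)                  # behavior before the change
resolved_policy, _ = solve(PENALTY)         # structural: re-solve the model

behavioral_regret = regret(old_policy, PENALTY)
structural_regret = regret(resolved_policy, PENALTY)
print(behavioral_regret, structural_regret)
```

In this toy setup the re-solved policy stops replacing once the penalty applies, so its regret is zero, while the frozen policy keeps paying the penalty and incurs positive regret.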
Parameter recovery is reported only for structural estimators. IRL methods recover a reward that produces the same behavior but in a different parameterization, so comparing their parameters to the truth is not meaningful.
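One simple source of this non-uniqueness can be shown directly. The sketch below assumes logit choice probabilities and ignores dynamics entirely, both simplifications made for this example rather than taken from the experiments. Under those assumptions, adding a state-specific constant to every action's reward changes the reward values but leaves behavior unchanged.

```python
import math


def choice_probs(rewards):
    # Logit (softmax) choice probabilities for one state,
    # with the maximum subtracted for numerical stability.
    m = max(rewards)
    exps = [math.exp(r - m) for r in rewards]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical "true" reward: one list of action rewards per state.
true_reward = {0: [0.0, -1.0], 1: [-2.0, -1.0]}

# A different reward: each state's row is shifted by its own constant.
shift = {0: 5.0, 1: -3.0}
alt_reward = {s: [r + shift[s] for r in row] for s, row in true_reward.items()}

# Largest difference in choice probabilities across states and actions.
max_gap = max(
    abs(p - q)
    for s in true_reward
    for p, q in zip(choice_probs(true_reward[s]), choice_probs(alt_reward[s]))
)
print(max_gap)
```

The two rewards differ in every entry, yet the choice probabilities agree up to floating-point error, which is why a parameter-by-parameter comparison against the truth says little about such an estimate.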
The estimators are documented in the catalog.