Abstract MDP 3: High Dimensional Case
This page asks what survives when the state space reaches a few thousand states. Every tabular structural solver consumes a dense transition tensor whose memory and per-iteration cost grow with the square of the state count. An optimizer like MPEC additionally carries one variable per state. Rather than assert where that breaks, the feasibility probes below run every candidate except behavioral cloning, which is trivially cheap, once per scale under a hard time budget and report what happened. The measured answer is more interesting than the folklore. At 3000 states the entire classical family still completes, so the main table benchmarks it alongside the approximation-based estimators. The probes at the larger scale show where the costs actually separate.
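The quadratic growth is easy to put numbers on. The sketch below is back-of-the-envelope arithmetic only, assuming a dense tensor of shape (actions, states, states) stored as 8-byte floats with the two actions used on this page. The helper name `dense_transition_bytes` is hypothetical, and the library's actual storage layout and dtype are not stated here.

```python
def dense_transition_bytes(num_states, num_actions, bytes_per_entry=8):
    """Bytes for a dense (actions, states, states) transition tensor.

    Assumes float64 entries (8 bytes); halve the result for float32.
    """
    return num_actions * num_states * num_states * bytes_per_entry


for n in (3000, 8000):
    gb = dense_transition_bytes(n, num_actions=2) / 1e9
    print(f"{n} states: {gb:.3f} GB")
# 3000 states: 0.144 GB
# 8000 states: 1.024 GB
```

Moving from 3000 to 8000 states multiplies the tensor by (8000/3000)^2, roughly 7.1, before any solver work is done on it.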
The data-generating process
Same Garnet generator as the previous abstract pages, scaled up. One MDP is drawn from the seed and held fixed: each state-action pair reaches a uniform random subset of \(b\) states with Dirichlet weights, plus a small self-loop mass \(\ell\):
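A sketch of one way to write that kernel, assuming the self-loop mass enters as a convex mixture and writing \(\mathcal{N}(s,a)\) for the sampled successor set. The mixing form, the symbol \(\mathcal{N}\), and the Dirichlet concentration \(\alpha\) are notational assumptions, not taken from the generator:

\[
P(s' \mid s, a) \;=\; \ell \,\mathbf{1}\{s' = s\} \;+\; (1-\ell) \sum_{j \in \mathcal{N}(s,a)} w_{s,a,j}\, \mathbf{1}\{s' = j\},
\qquad |\mathcal{N}(s,a)| = b, \quad w_{s,a,\cdot} \sim \mathrm{Dirichlet}(\alpha \mathbf{1}_b).
\]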
The reward is linear in polynomial features of the normalized state index \(x_s = s/(S-1)\), with action \(0\) a zeroed outside option and, for \(a \geq 1\),
Behavior solves the soft Bellman equation with logit shocks (scale \(\sigma = 1\)):
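In standard notation, with \(\beta\) the discount factor and \(P\) the transition kernel, the soft Bellman system and the implied logit choice probabilities can be written as follows. This is the textbook form with the Euler-constant offset dropped, and the symbols \(Q\), \(V\), and \(\pi\) are assumptions about notation:

\[
Q(s,a) = r(s,a) + \beta \sum_{s'} P(s' \mid s,a)\, V(s'), \qquad
V(s) = \sigma \log \sum_{a'} \exp\!\big(Q(s,a')/\sigma\big), \qquad
\pi(a \mid s) = \exp\!\big((Q(s,a) - V(s))/\sigma\big).
\]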
Three reward parameters generate behavior over three thousand states. The structure, not the state count, carries the information. That is what the feature-based estimators exploit.
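The data-generating process described above can be sketched at toy scale in NumPy. This is a minimal illustration, not the generator used for the benchmark: the function names, the monomial feature basis \(x, x^2, x^3\), the symmetric Dirichlet weights, and the convex self-loop mixing are all assumptions made for the example.

```python
import numpy as np


def toy_garnet(num_states=50, num_actions=2, branching=4, self_loop=0.05, seed=0):
    """Hypothetical stand-in for the generator: sparse Dirichlet transitions."""
    rng = np.random.default_rng(seed)
    P = np.zeros((num_actions, num_states, num_states))
    for a in range(num_actions):
        for s in range(num_states):
            succ = rng.choice(num_states, size=branching, replace=False)
            P[a, s, succ] = (1.0 - self_loop) * rng.dirichlet(np.ones(branching))
            P[a, s, s] += self_loop  # small extra mass on staying put
    return P


def poly_reward(theta, num_states, num_actions):
    """Reward linear in assumed monomial features; action 0 is zeroed."""
    x = np.arange(num_states) / (num_states - 1)
    phi = np.stack([x ** (k + 1) for k in range(len(theta))], axis=1)
    r = np.zeros((num_states, num_actions))
    r[:, 1:] = (phi @ np.asarray(theta))[:, None]
    return r


def soft_value_iteration(P, r, beta=0.95, sigma=1.0, tol=1e-10, max_iter=10_000):
    """Iterate the soft Bellman operator; return values and logit policy."""
    V = np.zeros(r.shape[0])
    for _ in range(max_iter):
        Q = r + beta * np.einsum("ast,t->sa", P, V)
        m = Q.max(axis=1)  # shift for a stable log-sum-exp
        V_new = m + sigma * np.log(np.exp((Q - m[:, None]) / sigma).sum(axis=1))
        done = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if done:
            break
    Q = r + beta * np.einsum("ast,t->sa", P, V)
    z = np.exp((Q - Q.max(axis=1, keepdims=True)) / sigma)
    return V, z / z.sum(axis=1, keepdims=True)


P = toy_garnet()
r = poly_reward([-0.1532, 0.9127, -0.0559], num_states=50, num_actions=2)
V, policy = soft_value_iteration(P, r)
```

The three entries of theta are the only free quantities in the reward, whatever the state count, which is the point the paragraph above makes about structure carrying the information.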
A 3000-state Garnet MDP with stochastic sparse transitions (branching 8) and a 3-feature linear reward: random_mdp(num_states=3000, num_actions=2, num_features=3, branching=8, discount_factor=0.95, seed=707). The panel (500 individuals x 60 periods) covers at most 30000 state visits, so most states are rarely or never observed. Estimators reach them only through the reward features, not through memorized choices. 500 x 60 observations, 3 replications, seed 707. True theta [-0.1532, 0.9127, -0.0559]. Design rank 3/3, condition number 3.90e+01, action-contrast rank 3/3 (the rank that identification from choices actually uses). Generated 2026-06-12 with econirl 0.0.4.
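The design diagnostics quoted above can be illustrated with a short NumPy sketch. It assumes a monomial basis \(x, x^2, x^3\) on the normalized state index, which is not confirmed by this page, so the condition number it produces should not be expected to match the reported 3.90e+01. The function names are hypothetical.

```python
import numpy as np


def feature_tensor(num_states, num_actions, degree=3):
    """Assumed features: monomials of x_s for actions a >= 1, zeros for a = 0."""
    x = np.arange(num_states) / (num_states - 1)
    powers = np.stack([x ** (k + 1) for k in range(degree)], axis=1)
    phi = np.zeros((num_states, num_actions, degree))
    phi[:, 1:, :] = powers[:, None, :]
    return phi


def design_diagnostics(phi):
    """Rank and conditioning of the stacked design and of action contrasts."""
    S, A, K = phi.shape
    design = phi.reshape(S * A, K)
    # Choices only identify differences relative to the outside option.
    contrast = (phi[:, 1:, :] - phi[:, :1, :]).reshape(S * (A - 1), K)
    return {
        "design_rank": int(np.linalg.matrix_rank(design)),
        "condition_number": float(np.linalg.cond(design)),
        "contrast_rank": int(np.linalg.matrix_rank(contrast)),
    }


diag = design_diagnostics(feature_tensor(num_states=3000, num_actions=2))
print(diag)
```

The contrast rank is the one to watch: a feature that takes the same value for every action drops out of the choice probabilities, so it can be full rank in the stacked design and still be unidentified from choices.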

Results
| Estimator | Family | Ran | Conv | Recovered params | Param RMSE | Policy TV | Regret base | Regret A | Regret B | Regret C | Time (s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NFXP-SA | structural | 3/3 | 3/3 | [-0.096, 0.991, -0.126] | 0.1269 | 0.0031 | 0.0007 | 0.0007 | 0.0007 | 0.0000 | 46.9 |
| NFXP-NK | structural | 3/3 | 3/3 | [-0.096, 0.991, -0.126] | 0.1269 | 0.0031 | 0.0007 | 0.0007 | 0.0007 | 0.0000 | 27.9 |
| CCP | structural | 3/3 | 3/3 | [-0.108, 0.967, -0.113] | 0.1298 | 0.0136 | 0.0008 | 0.0008 | 0.0008 | 0.0000 | 1.8 |
| MPEC | structural | 3/3 | 3/3 | [-0.096, 0.991, -0.126] | 0.1269 | 0.0031 | 0.0007 | 0.0007 | 0.0007 | 0.0000 | 103.1 |
| UFXP | structural | 3/3 | 3/3 | [-0.075, 0.875, -0.123] | 0.1455 | 0.0088 | 0.0053 | 0.0054 | 0.0055 | 0.0000 | 1.1 |
| TD-CCP | structural | 3/3 | 3/3 | [-0.028, 1.117, -0.207] | 0.1694 | 0.0075 | 0.0036 | 0.0037 | 0.0036 | 0.0000 | 3.7 |
| NNES | structural | 3/3 | 3/3 | [-0.097, 0.990, -0.125] | 0.1260 | 0.0031 | 0.0007 | 0.0007 | 0.0007 | 0.0000 | 19.8 |
| GLADIUS | behavioral | 3/3 | 3/3 | [0.040, 1.079, -0.244] | - | 0.0221 | 0.0112 | 0.0114 | 0.0112 | 0.0000 | 15.6 |
| Deep-MCE-IRL | behavioral | 3/3 | 3/3 | [0.070, -0.038, 0.045] | - | 0.0516 | 0.2016 | 0.2014 | 0.2058 | 0.0000 | 59.8 |
| BC | behavioral | 3/3 | 3/3 | different parameterization (6000 values) | - | 0.1069 | 0.7059 | 0.7172 | 0.7922 | 93.5663 | 0.1 |

Param RMSE covers the structural family only, which shares the parameterization of the true model. Policy TV is the total-variation distance between estimated and true choice probabilities; lower is better. Conv is the estimator’s own convergence flag; a cautious flag can read False while the recovered policy is accurate. Regret base is the welfare lost in the observed environment. Regret A, B, and C are the welfare lost after a change to the environment: Type A shifts a payoff, Type B changes the transitions, and Type C penalizes an action. Estimators with a recovered reward re-solve under the change and adapt; those without one keep their old policy.
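As a concrete reading of the Policy TV column, the sketch below computes a total-variation distance between two policies stored as (states, actions) arrays. How the benchmark aggregates across states is not stated here, so the uniform average (with optional state weights) is an assumption, and `policy_tv` is a hypothetical helper name.

```python
import numpy as np


def policy_tv(p_est, p_true, state_weights=None):
    """Average per-state total-variation distance between two policies.

    Both inputs are (num_states, num_actions) arrays of choice probabilities.
    Uniform averaging over states is an assumption of this sketch.
    """
    p_est = np.asarray(p_est, dtype=float)
    p_true = np.asarray(p_true, dtype=float)
    per_state = 0.5 * np.abs(p_est - p_true).sum(axis=1)
    if state_weights is None:
        return float(per_state.mean())
    w = np.asarray(state_weights, dtype=float)
    return float(per_state @ (w / w.sum()))


p_true = [[0.5, 0.5], [0.9, 0.1]]
p_est = [[0.6, 0.4], [0.9, 0.1]]
tv = policy_tv(p_est, p_true)  # first state differs by 0.1, second matches
```

With two actions the per-state distance is just the absolute gap in the probability of either action, so a value of 0.0031 means the estimated and true choice probabilities differ by about a third of a percentage point on average under this reading.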
Behavioral cloning is the control group. It is nearly free and matches the data where the data exists, but it carries no reward. It can say nothing at unvisited states or under the counterfactual interventions. The gap between its regret and the reward-recovering estimators’ regret is the value of estimating structure at this scale.
Feasibility probes
Each estimator is fit once per scale, using the same generator and panel configuration as the main cell with only the state count varying. Every fit runs in its own subprocess, and the probes run before the main benchmark so that the roster is decided empirically.
| Estimator | States | Outcome | Time (s) | Detail |
|---|---|---|---|---|
| CCP | 3000 | completed | 3.1 | |
| Deep-MCE-IRL | 3000 | completed | 59.8 | |
| GLADIUS | 3000 | completed | 13.7 | |
| MPEC | 3000 | completed | 100.1 | |
| NFXP-NK | 3000 | completed | 30.8 | |
| NFXP-SA | 3000 | completed | 45.6 | |
| NNES | 3000 | completed | 20.3 | |
| TD-CCP | 3000 | completed | 5.1 | |
| UFXP | 3000 | completed | 1.5 | |
| CCP | 8000 | completed | 11.5 | |
| Deep-MCE-IRL | 8000 | completed | 511.9 | |
| GLADIUS | 8000 | completed | 22.1 | |
| MPEC | 8000 | completed | 480.1 | |
| NFXP-NK | 8000 | completed | 263.9 | |
| NFXP-SA | 8000 | completed | 321.4 | |
| NNES | 8000 | completed | 45.3 | |
| TD-CCP | 8000 | completed | 7.9 | |
| UFXP | 8000 | completed | 8.1 | |

Each probe is a single fit in its own subprocess with a hard 900-second budget; timeout means the fit was killed at the budget, with no number invented for it.
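The probe protocol described here can be sketched with the standard library alone. This is an illustration of the pattern, not the benchmark script: `run_probe` and the "failed" outcome are hypothetical, and the command passed in is whatever launches one fit.

```python
import subprocess
import sys
import time


def run_probe(cmd, budget_s=900.0):
    """Run one fit in its own subprocess under a hard wall-clock budget."""
    start = time.perf_counter()
    try:
        # subprocess.run kills the child and re-raises when the timeout expires.
        subprocess.run(cmd, check=True, timeout=budget_s, capture_output=True)
        outcome = "completed"
    except subprocess.TimeoutExpired:
        outcome = "timeout"
    except subprocess.CalledProcessError:
        outcome = "failed"  # hypothetical extra outcome for a crashed fit
    elapsed = time.perf_counter() - start
    # Report no time for a killed fit rather than inventing a number.
    time_s = None if outcome == "timeout" else round(elapsed, 1)
    return {"outcome": outcome, "time_s": time_s}


quick = run_probe([sys.executable, "-c", "pass"], budget_s=60.0)
slow = run_probe([sys.executable, "-c", "import time; time.sleep(5)"], budget_s=0.5)
```

Running each fit in a separate process means a fit that exhausts memory or hangs cannot take the other probes down with it, and the budget is enforced from outside the estimator's own code.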
Notes per estimator
MPEC. Constrained MLE with one optimizer variable per state plus the parameters, 3003 variables here. The SQP solver handles that joint problem and matches the nested-solver MLE at roughly 100 seconds per fit.
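For orientation, the joint problem has the following textbook shape, with the value function treated as a decision variable alongside the reward parameters. This is a generic sketch of constrained maximum likelihood under the soft Bellman equation; the notation (\(V\), \(r_\theta\), \(\pi_{\theta,V}\), \(\beta\)) is assumed rather than taken from the implementation:

\[
\max_{\theta \in \mathbb{R}^{3},\; V \in \mathbb{R}^{S}} \;\sum_{i,t} \log \pi_{\theta,V}(a_{it} \mid s_{it})
\quad \text{s.t.} \quad
V(s) = \sigma \log \sum_{a} \exp\!\Big( \big( r_\theta(s,a) + \beta \sum_{s'} P(s' \mid s,a)\, V(s') \big) / \sigma \Big) \;\; \text{for all } s .
\]

With \(S = 3000\) this gives \(3000 + 3 = 3003\) variables and one equality constraint per state.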
UFXP. Unnested fixed point (Bray; Oguz and Bray 2026) with optimal weighting, built for exactly this regime. One factorization before the parameter search, no fixed point inside any optimizer, and the fastest accurate structural fit on the page.
Reproduce
python scripts/sim_abstract_mdp_3.py # run + write JSON
python scripts/sim_abstract_mdp_3.py --page # regenerate this page
python scripts/sim_abstract_mdp_3.py --verify # re-derive the table from JSON
Raw facts: validation/results/sim_abstract_mdp_3.json.
Not shown on this page: SEES, because a spline value basis with basis_dim near the state count is its own scaling wall at thousands of states (its results are on the harder abstract MDP page); and MCE-IRL, MaxEnt-IRL, AIRL, IQ-Learn, f-IRL and the other IRL methods, because their exact inner solvers face the same dense-tensor cost the probes document for the classical family (the IRL comparison lives on the bus engine and gridworld pages).