Abstract MDP 3: High Dimensional Case

This page asks what survives when the state space reaches a few thousand states. Every tabular structural solver consumes a dense transition tensor whose memory and per-iteration cost grow with the square of the state count. An optimizer like MPEC additionally carries one variable per state. Rather than assert where that breaks, the feasibility probes below run every candidate except behavioral cloning, which is trivially cheap, once per scale under a hard time budget and report what happened. The measured answer is more interesting than the folklore. At 3000 states the entire classical family still completes, so the main table benchmarks it alongside the approximation-based estimators. The probes at the larger scale show where the costs actually separate.

The data-generating process

Same Garnet generator as the previous abstract pages, scaled up. One MDP is drawn from the seed and held fixed: each state-action pair reaches a uniform random subset of \(b\) states with Dirichlet weights, plus a small self-loop mass \(\ell\):

\[ P(s' \mid s, a) \;=\; (1-\ell)\, D_{s,a}(s') \;+\; \ell\, \mathbf{1}\{s'=s\}, \qquad D_{s,a} \sim \mathrm{Dirichlet}(\mathbf{1}_b),\quad b = 8,\ \ell = 0.05 . \]

The reward is linear in polynomial features of the normalized state index \(x_s = s/(S-1)\), with action \(0\) a zeroed outside option and, for \(a \geq 1\),

\[ u_\theta(s,a) = \theta^\top \varphi(s,a), \qquad \varphi(s,a) = \bigl(1,\ x_s,\ x_s^{2} + a\bigr), \qquad \theta \sim \mathcal{N}(0,\ 0.25\, I_3). \]

Behavior solves the soft Bellman equation with logit shocks (scale \(\sigma = 1\)):

\[ V(s) = \log \sum_{a} \exp\Bigl(u_\theta(s,a) + \beta\, \mathbb{E}\bigl[V(s') \mid s,a\bigr]\Bigr), \qquad \pi^*(a \mid s) \propto \exp\Bigl(u_\theta(s,a) + \beta\, \mathbb{E}\bigl[V(s') \mid s,a\bigr]\Bigr). \]

Three reward parameters generate behavior over three thousand states. The structure, not the state count, carries the information. That is what the feature-based estimators exploit.

A 3000-state Garnet MDP with stochastic sparse transitions (branching 8) and a 3-feature linear reward: random_mdp(num_states=3000, num_actions=2, num_features=3, branching=8, discount_factor=0.95, seed=707). The panel (500 individuals x 60 periods) covers at most 30000 state visits, so most states are rarely or never observed. Estimators reach them only through the reward features, not through memorized choices. 500 x 60 observations, 3 replications, seed 707. True theta [-0.1532, 0.9127, -0.0559]. Design rank 3/3, condition number 3.90e+01, action-contrast rank 3/3 (the rank that identification from choices actually uses). Generated 2026-06-12 with econirl 0.0.4.

Simulated trajectories and the optimal value function for 3000 states

Results

Estimator	Family	Ran	Conv	Recovered params	Param RMSE	Policy TV	Regret base	Regret A	Regret B	Regret C	Time (s)
NFXP-SA	structural	3/3	3/3	[-0.096, 0.991, -0.126]	0.1269	0.0031	0.0007	0.0007	0.0007	0.0000	46.9
NFXP-NK	structural	3/3	3/3	[-0.096, 0.991, -0.126]	0.1269	0.0031	0.0007	0.0007	0.0007	0.0000	27.9
CCP	structural	3/3	3/3	[-0.108, 0.967, -0.113]	0.1298	0.0136	0.0008	0.0008	0.0008	0.0000	1.8
MPEC	structural	3/3	3/3	[-0.096, 0.991, -0.126]	0.1269	0.0031	0.0007	0.0007	0.0007	0.0000	103.1
UFXP	structural	3/3	3/3	[-0.075, 0.875, -0.123]	0.1455	0.0088	0.0053	0.0054	0.0055	0.0000	1.1
TD-CCP	structural	3/3	3/3	[-0.028, 1.117, -0.207]	0.1694	0.0075	0.0036	0.0037	0.0036	0.0000	3.7
NNES	structural	3/3	3/3	[-0.097, 0.990, -0.125]	0.1260	0.0031	0.0007	0.0007	0.0007	0.0000	19.8
GLADIUS	behavioral	3/3	3/3	[0.040, 1.079, -0.244]	-	0.0221	0.0112	0.0114	0.0112	0.0000	15.6
Deep-MCE-IRL	behavioral	3/3	3/3	[0.070, -0.038, 0.045]	-	0.0516	0.2016	0.2014	0.2058	0.0000	59.8
BC	behavioral	3/3	3/3	different parameterization (6000 values)	-	0.1069	0.7059	0.7172	0.7922	93.5663	0.1

Param RMSE covers the structural family only, which shares the parameterization of the true model. Policy TV is the distance between estimated and true choice probabilities, lower is better. Conv is the estimator’s own convergence flag. A cautious flag can read False while the recovered policy is accurate. Regret base is welfare lost in the observed environment. Types A, B, and C are welfare lost after a change. Type A shifts a payoff, Type B changes the transitions, Type C penalizes an action. Estimators with a recovered reward re-solve it and adapt. Those without one keep their old policy.

Behavioral cloning is the control group. It is nearly free and matches the data where the data exists, but it carries no reward. It can say nothing at unvisited states or under the counterfactual interventions. The gap between its regret and the reward-recovering estimators’ regret is the value of estimating structure at this scale.

Feasibility probes

Single fits per estimator and scale (same generator and panel configuration as the main cell, state count varying), one subprocess per fit, run before the main benchmark to decide the roster empirically.

Estimator	States	Outcome	Time (s)
CCP	3000	completed	3.1
Deep-MCE-IRL	3000	completed	59.8
GLADIUS	3000	completed	13.7
MPEC	3000	completed	100.1
NFXP-NK	3000	completed	30.8
NFXP-SA	3000	completed	45.6
NNES	3000	completed	20.3
TD-CCP	3000	completed	5.1
UFXP	3000	completed	1.5
CCP	8000	completed	11.5
Deep-MCE-IRL	8000	completed	511.9
GLADIUS	8000	completed	22.1
MPEC	8000	completed	480.1
NFXP-NK	8000	completed	263.9
NFXP-SA	8000	completed	321.4
NNES	8000	completed	45.3
TD-CCP	8000	completed	7.9
UFXP	8000	completed	8.1

Each probe is a single fit in its own subprocess with a hard 900-second budget; timeout means the fit was killed at the budget, with no number invented for it.

Notes per estimator

MPEC. Constrained MLE with one optimizer variable per state plus the parameters, 3003 variables here. The SQP solver handles that joint problem and matches the nested-solver MLE at roughly 100 seconds per fit.

UFXP. Unnested fixed point (Bray; Oguz and Bray 2026) with optimal weighting, built for exactly this regime. One factorization before the parameter search, no fixed point inside any optimizer, and the fastest accurate structural fit on the page.

Reproduce

python scripts/sim_abstract_mdp_3.py                 # run + write JSON
python scripts/sim_abstract_mdp_3.py --page          # regenerate this page
python scripts/sim_abstract_mdp_3.py --verify        # re-derive the table from JSON

Raw facts: validation/results/sim_abstract_mdp_3.json.

Not shown on this page: SEES (a spline value basis with basis_dim near the state count is its own scaling wall at thousands of states. Its showing is on the harder abstract MDP page); MCE-IRL, MaxEnt-IRL, AIRL, IQ-Learn, f-IRL and the other IRL methods (their exact inner solvers face the same dense-tensor cost the probes document for the classical family. The IRL comparison lives on the bus engine and gridworld pages).