Abstract MDP 2

The sanity-check page showed every estimator recovering an easy problem. This page hardens the problem along three separate axes and watches the structural family specifically. What happens to runtime as the state space grows? What happens to inference as the discount factor approaches one? What happens to the parameters when the reward features are collinear? Each axis gets its own cell, run on the same engine and reported from the same raw records as every other page.

The data-generating process

The first two cells draw one Garnet-style MDP from the seed and hold it fixed. Each state-action pair reaches a uniform random subset of \(b\) states with Dirichlet weights, mixed with a small self-loop mass \(\ell\):

\[ P(s' \mid s, a) \;=\; (1-\ell)\, D_{s,a}(s') \;+\; \ell\, \mathbf{1}\{s'=s\}, \qquad D_{s,a} \sim \mathrm{Dirichlet}(\mathbf{1}_b),\quad b = 5,\ \ell = 0.05 . \]

The reward is linear in polynomial features of the normalized state index \(x_s = s/(S-1)\). Action \(0\) is a zeroed outside option, the identification anchor. For \(a \geq 1\),

\[ u_\theta(s,a) = \theta^\top \varphi(s,a), \qquad \varphi(s,a) = \bigl(1,\ x_s,\ x_s^{2} + a\bigr), \qquad \theta \sim \mathcal{N}(0,\ 0.25\, I_3). \]

The agent discounts at \(\beta\) and faces i.i.d. logit taste shocks (scale \(\sigma = 1\)), so behavior solves the soft Bellman equation

\[ V(s) = \log \sum_{a} \exp\Bigl(u_\theta(s,a) + \beta\, \mathbb{E}\bigl[V(s') \mid s,a\bigr]\Bigr), \qquad \pi^*(a \mid s) \propto \exp\Bigl(u_\theta(s,a) + \beta\, \mathbb{E}\bigl[V(s') \mid s,a\bigr]\Bigr), \]

and the data are \(N\) independent agents simulated for \(T\) periods from \(\pi^*\) and \(P\). The third cell swaps in a small handcrafted MDP whose features are deliberately collinear. Its construction is described in that cell.

300 states, discount 0.95

A 300-state Garnet MDP with stochastic sparse transitions (branching 5) and a 3-feature linear reward: random_mdp(num_states=300, num_actions=2, num_features=3, branching=5, discount_factor=0.95, seed=505). 500 x 60 observations, 3 replications, seed 505. True theta [0.3863, -1.5993, 0.1317]. Design rank 3/3, condition number 3.88e+01, action-contrast rank 3/3 (the rank that identification from choices actually uses). Generated 2026-06-12 with econirl 0.0.4.

The first cell is about cost at scale. All estimators face the same 300-state problem, and the runtime column is the result. The two NFXP rows are the same estimator with two inner solvers, Rust’s original successive approximation against the Newton-Kantorovich polyalgorithm. The refinement’s value is measured on this page rather than asserted.

Simulated trajectories and the optimal value function for 300 states, discount 0.95

Results

Estimator	Family	Ran	Conv	Recovered params	Param RMSE	Policy TV	Regret base	Regret A	Regret B	Time (s)
NFXP-SA	structural	3/3	3/3	[0.431, -1.614, 0.107]	0.0452	0.0035	0.0008	0.0008	0.0008	4.4
NFXP-NK	structural	3/3	3/3	[0.431, -1.614, 0.107]	0.0452	0.0035	0.0008	0.0008	0.0008	3.5
CCP	structural	3/3	3/3	[0.429, -1.617, 0.111]	0.0432	0.0037	0.0008	0.0008	0.0008	2.7
MPEC	structural	3/3	3/3	[0.431, -1.614, 0.107]	0.0452	0.0035	0.0008	0.0008	0.0008	1.1
NNES	structural	3/3	3/3	[0.431, -1.614, 0.107]	0.0452	0.0035	0.0008	0.0008	0.0008	17.0
SEES	structural	3/3	0/3	[0.543, -1.431, -0.046]	0.1692	0.0040	0.0009	0.0008	0.0009	2.8
TD-CCP	structural	3/3	3/3	[0.342, -1.787, 0.246]	0.1416	0.0057	0.0024	0.0024	0.0023	2.9
UFXP	structural	3/3	3/3	[0.422, -1.601, 0.111]	0.0440	0.0031	0.0006	0.0006	0.0006	0.1
MCE-IRL	behavioral	3/3	0/3	[0.431, -1.614, 0.107]	-	0.0035	0.0008	0.0008	0.0008	5.9

Param RMSE covers the structural family only, which shares the parameterization of the true model. Policy TV is the distance between estimated and true choice probabilities, lower is better. Conv is the estimator’s own convergence flag. A cautious flag can read False while the recovered policy is accurate. Regret base is welfare lost in the observed environment. Types A, B, and C are welfare lost after a change. Type A shifts a payoff, Type B changes the transitions, Type C penalizes an action. Estimators with a recovered reward re-solve it and adapt. Those without one keep their old policy.

The two NFXP rows land within a second of each other, so the textbook solver gap does not bite at 300 states. The high-dimension page is where it starts to. The approximation-based members (SEES, TD-CCP) give up some parameter precision relative to the exact family while staying close on behavior.

Same MDP, discount 0.99

The identical 300-state MDP with the discount factor moved from 0.95 to 0.99, where continuation values dominate flow payoffs and the inner fixed point becomes a slow contraction. Structural family only, 10 replications, standard errors requested from every estimator. 500 x 60 observations, 10 replications, seed 505. True theta [0.3863, -1.5993, 0.1317]. Design rank 3/3, condition number 3.88e+01, action-contrast rank 3/3 (the rank that identification from choices actually uses). Generated 2026-06-12 with econirl 0.0.4.

The second cell moves the discount factor to 0.99 and asks whether the reported uncertainty is usable. The parameter table shows bias, the spread across replications, RMSE, coverage of the nominal 95% intervals, and how often each estimator produced finite standard errors. NFXP-SA runs 2 of 10 replications as a runtime spot-check. Its inference is the same MLE as NFXP-NK, which runs all 10.

Results

Estimator	Family	Ran	Conv	Recovered params	Param RMSE	Policy TV	Regret base	Regret A	Regret B	Time (s)
NFXP-SA	structural	2/2	2/2	[0.403, -1.594, 0.121]	0.0329	0.0027	-0.0004	-0.0004	-0.0005	4.2
NFXP-NK	structural	10/10	10/10	[0.434, -1.555, 0.080]	0.0840	0.0040	0.0025	0.0024	0.0024	4.3
CCP	structural	10/10	10/10	[0.433, -1.555, 0.081]	0.0840	0.0042	0.0025	0.0025	0.0024	3.3
MPEC	structural	10/10	10/10	[0.434, -1.555, 0.080]	0.0840	0.0040	0.0025	0.0024	0.0024	0.6
NNES	structural	10/10	10/10	[0.434, -1.555, 0.080]	0.0840	0.0040	0.0025	0.0024	0.0024	16.9
SEES	structural	10/10	0/10	[0.830, -1.161, -0.304]	0.4474	0.0209	0.1145	0.1125	0.1145	2.1
TD-CCP	structural	10/10	10/10	[0.397, -1.671, 0.159]	0.1189	0.0052	0.0065	0.0065	0.0061	2.7
UFXP	structural	10/10	10/10	[0.426, -1.541, 0.083]	0.0857	0.0040	0.0028	0.0027	0.0026	0.1

Param RMSE covers the structural family only, which shares the parameterization of the true model. Policy TV is the distance between estimated and true choice probabilities, lower is better. Conv is the estimator’s own convergence flag. A cautious flag can read False while the recovered policy is accurate. Regret base is welfare lost in the observed environment. Types A, B, and C are welfare lost after a change. Type A shifts a payoff, Type B changes the transitions, Type C penalizes an action. Estimators with a recovered reward re-solve it and adapt. Those without one keep their old policy.

Parameter recovery

Estimator	Parameter	True	Mean est	Bias	Emp. SE	RMSE	95% coverage	SE avail
NFXP-SA	theta_0	0.386	0.403	+0.016	0.004	0.017	1.00 +/- 0.00	100% (2 reps)
NFXP-SA	theta_1	-1.599	-1.594	+0.005	0.073	0.052	1.00 +/- 0.00	100% (2 reps)
NFXP-SA	theta_2	0.132	0.121	-0.011	0.019	0.017	1.00 +/- 0.00	100% (2 reps)
NFXP-NK	theta_0	0.386	0.434	+0.048	0.109	0.114	0.90 +/- 0.09	100% (10 reps)
NFXP-NK	theta_1	-1.599	-1.555	+0.045	0.128	0.130	0.90 +/- 0.09	100% (10 reps)
NFXP-NK	theta_2	0.132	0.080	-0.052	0.129	0.133	0.90 +/- 0.09	100% (10 reps)
CCP	theta_0	0.386	0.433	+0.047	0.110	0.114	-	10% (10 reps)
CCP	theta_1	-1.599	-1.555	+0.044	0.132	0.133	-	10% (10 reps)
CCP	theta_2	0.132	0.081	-0.051	0.132	0.135	-	10% (10 reps)
MPEC	theta_0	0.386	0.434	+0.048	0.109	0.114	0.90 +/- 0.09	100% (10 reps)
MPEC	theta_1	-1.599	-1.555	+0.045	0.128	0.130	0.90 +/- 0.09	100% (10 reps)
MPEC	theta_2	0.132	0.080	-0.052	0.129	0.133	0.90 +/- 0.09	100% (10 reps)
NNES	theta_0	0.386	0.434	+0.048	0.109	0.114	0.90 +/- 0.09	100% (10 reps)
NNES	theta_1	-1.599	-1.555	+0.045	0.129	0.130	0.90 +/- 0.09	100% (10 reps)
NNES	theta_2	0.132	0.080	-0.052	0.130	0.133	0.90 +/- 0.09	100% (10 reps)
SEES	theta_0	0.386	0.830	+0.443	0.200	0.482	0.20 +/- 0.13	100% (10 reps)
SEES	theta_1	-1.599	-1.161	+0.439	0.325	0.536	0.40 +/- 0.15	100% (10 reps)
SEES	theta_2	0.132	-0.304	-0.436	0.280	0.510	0.40 +/- 0.15	100% (10 reps)
TD-CCP	theta_0	0.386	0.397	+0.011	0.122	0.116	1.00 +/- 0.00	100% (10 reps)
TD-CCP	theta_1	-1.599	-1.671	-0.072	0.146	0.156	1.00 +/- 0.00	100% (10 reps)
TD-CCP	theta_2	0.132	0.159	+0.027	0.149	0.144	1.00 +/- 0.00	100% (10 reps)
UFXP	theta_0	0.386	0.426	+0.039	0.111	0.112	0.90 +/- 0.09	100% (10 reps)
UFXP	theta_1	-1.599	-1.541	+0.059	0.134	0.140	0.90 +/- 0.09	100% (10 reps)
UFXP	theta_2	0.132	0.083	-0.049	0.133	0.135	0.90 +/- 0.09	100% (10 reps)

Coverage is the share of replications whose 95% interval contains the truth, shown with its Monte Carlo standard error. It is computed only where every replication produced a finite standard error. SE avail is the share of replications with finite standard errors.

The SE avail column is the headline. One estimator routinely fails to deliver usable standard errors here while recovering good point estimates. Without that column, the blank coverage entries would read as a formatting gap rather than an inference failure.

Collinear features (24 states)

A small MDP whose third reward feature is exactly twice the second (design rank 2 of 3). The likelihood identifies only the combination theta_1 + 2 theta_2, so no estimator can recover the individual coordinates. 500 x 80 observations, 20 replications, seed 606. True theta [-0.5, 1.0, 0.3]. Design rank 2/3, condition number 3.07e+16, action-contrast rank 2/3 (the rank that identification from choices actually uses). Generated 2026-06-12 with econirl 0.0.4.

The last cell breaks identification on purpose. The third feature is exactly twice the second, so only the combination theta_1 + 2 theta_2 is identified, and the design diagnostics above flag it. The parameter columns are omitted. Every estimator still matches behavior, which is what partial identification looks like in practice.

Results

Estimator	Family	Ran	Conv	Policy TV	Regret base	Regret A	Regret B	Time (s)
NFXP-NK	structural	20/20	20/20	0.0032	0.0007	0.0007	0.0008	3.9
CCP	structural	20/20	20/20	0.0032	0.0007	0.0007	0.0008	2.0
MPEC	structural	20/20	20/20	0.0032	0.0007	0.0007	0.0008	0.5
UFXP	structural	20/20	0/20	0.0032	0.0007	0.0007	0.0008	0.1

Policy TV is the distance between estimated and true choice probabilities, lower is better. Conv is the estimator’s own convergence flag. A cautious flag can read False while the recovered policy is accurate. Regret base is welfare lost in the observed environment. Types A, B, and C are welfare lost after a change. Type A shifts a payoff, Type B changes the transitions, Type C penalizes an action. Estimators with a recovered reward re-solve it and adapt. Those without one keep their old policy.

Notes per estimator

NFXP-SA. Rust’s original successive-approximation inner loop. Same maximum-likelihood answer as NFXP-NK, different runtime.

CCP. Its standard errors come from the outer Hessian and can fail to be finite even when the point estimate is fine. The SE avail column makes that visible.

UFXP. Unnested fixed point (Bray; Oguz and Bray 2026) with optimal weighting. As efficient as maximum likelihood, with standard errors, so it enters the coverage table on equal terms.

MCE-IRL. Behavioral reference. Its converged flag is a conservative gradient-norm check, so read it next to Policy TV.

Reproduce

python scripts/sim_abstract_mdp_2.py                 # run + write JSON
python scripts/sim_abstract_mdp_2.py --page          # regenerate this page
python scripts/sim_abstract_mdp_2.py --verify        # re-derive the table from JSON

Raw facts: validation/results/sim_abstract_mdp_2.json.

Not shown on this page: MaxEnt-IRL, IQ-Learn, AIRL, f-IRL, GLADIUS, Deep-MCE-IRL, BC (this page’s question is structural. Parameter recovery, inference quality, and identification as the problem hardens. The IRL family is compared on the bus engine and gridworld pages. MCE-IRL stays here as the behavioral reference); GAIL, GCL, DeepMaxEnt-IRL, Bayesian-IRL (research code or too slow; not benchmarked in this study).