Abstract MDP 1

The simplest abstract problem. A small but non-trivial random MDP with an action-dependent linear reward, easy enough that a correct estimator must recover it. It is the sanity check that the whole roster works before the harder regimes. Every estimator on the uniform estimate interface runs here. The table reports the exact recovered parameters, the recovery error, the policy distance from the truth, and the counterfactual regret.

Environment: random_mdp(num_states=8, num_actions=2, num_features=2, branching=3, discount_factor=0.9, seed=0). 300 x 50 observations, 3 replications. True theta [-0.1437, 0.7872]. Generated 2026-06-12 with econirl 0.0.4.

The data-generating process

One Garnet-style MDP is drawn from the seed and held fixed. Each state-action pair reaches a uniform random subset of \(b\) states with Dirichlet weights, mixed with a small self-loop mass \(\ell\):

\[ P(s' \mid s, a) \;=\; (1-\ell)\, D_{s,a}(s') \;+\; \ell\, \mathbf{1}\{s'=s\}, \qquad D_{s,a} \sim \mathrm{Dirichlet}(\mathbf{1}_b),\quad b = 3,\ \ell = 0.05 . \]

The reward is linear in features of the normalized state index \(x_s = s/(S-1)\). Action \(0\) is a zeroed outside option, the identification anchor. For action \(1\),

\[ u_\theta(s,a) = \theta^\top \varphi(s,a), \qquad \varphi(s,1) = \bigl(1,\ x_s + 1\bigr), \qquad \theta \sim \mathcal{N}(0,\ 0.25\, I_2). \]

The agent discounts at \(\beta = 0.9\) and faces i.i.d. logit taste shocks (scale \(\sigma = 1\)), so behavior solves the soft Bellman equation

\[ V(s) = \log \sum_{a} \exp\Bigl(u_\theta(s,a) + \beta\, \mathbb{E}\bigl[V(s') \mid s,a\bigr]\Bigr), \qquad \pi^*(a \mid s) \propto \exp\Bigl(u_\theta(s,a) + \beta\, \mathbb{E}\bigl[V(s') \mid s,a\bigr]\Bigr), \]

and the data are \(N\) independent agents simulated for \(T\) periods from \(\pi^*\) and the transition law. The figure shows what that produces. State paths mix across the whole space, and the optimal value function varies smoothly in the state index.

Simulated trajectories and the optimal value function

Results

Estimator

Family

Ran

Conv

Recovered params

Param RMSE

Policy TV

Regret base

Regret A

Regret B

Regret C

Time (s)

NFXP

structural

3/3

3/3

[-0.154, 0.797]

0.0251

0.0025

0.0002

0.0002

0.0002

0.0000

4.1

CCP

structural

3/3

3/3

[-0.154, 0.797]

0.0250

0.0025

0.0002

0.0002

0.0002

0.0000

2.1

MPEC

structural

3/3

3/3

[-0.154, 0.797]

0.0251

0.0025

0.0002

0.0002

0.0002

0.0000

0.3

NNES

structural

3/3

3/3

[-0.154, 0.797]

0.0251

0.0025

0.0002

0.0002

0.0002

0.0000

11.0

SEES

structural

3/3

1/3

[-0.154, 0.797]

0.0254

0.0025

0.0002

0.0002

0.0002

0.0000

0.9

TD-CCP

structural

3/3

3/3

[-0.155, 0.794]

0.0215

0.0021

0.0002

0.0002

0.0002

0.0000

3.1

UFXP

structural

3/3

3/3

[-0.155, 0.796]

0.0248

0.0025

0.0002

0.0002

0.0002

0.0000

0.1

MCE-IRL

behavioral

3/3

0/3

[-0.154, 0.797]

-

0.0025

0.0002

0.0002

0.0002

0.0000

5.3

MaxEnt-IRL

behavioral

3/3

3/3

[-0.348, 0.923]

-

0.0101

0.0056

0.0062

0.0016

0.0000

13.2

IQ-Learn

behavioral

3/3

3/3

[-0.219, 0.773]

-

0.0370

0.0091

0.0101

0.0096

0.0000

1.2

GLADIUS

behavioral

3/3

3/3

[-0.380, 0.915]

-

0.0165

0.0062

0.0066

0.0044

0.0000

15.9

AIRL

behavioral

3/3

1/3

[0.147, 0.522]

-

0.0324

0.0513

0.0554

0.0297

0.0000

95.4

f-IRL

behavioral

3/3

3/3

different parameterization (16 values)

-

0.0091

0.0034

0.0462

0.0736

61.4508

22.0

BC

behavioral

3/3

3/3

different parameterization (16 values)

-

0.0088

0.0026

0.0423

0.0676

61.4560

0.2

Param RMSE covers the structural family only, which shares the parameterization of the true model. Policy TV is the distance between estimated and true choice probabilities, lower is better. Conv is the estimator’s own convergence flag. A cautious flag can read False while the recovered policy is accurate. Regret base is welfare lost in the observed environment. Types A, B, and C are welfare lost after a change. Type A shifts a payoff, Type B changes the transitions, Type C penalizes an action. Estimators with a recovered reward re-solve it and adapt. Those without one keep their old policy.

Configs are modest quick-run defaults, not tuned.

Reproduce

python scripts/quick_all_estimators.py --replications 3   # run + write JSON
python scripts/quick_all_estimators.py --page          # regenerate this page
python scripts/quick_all_estimators.py --verify        # re-derive the table from JSON

Raw facts: validation/results/quick_all_estimators.json.

Excluded from this run: MCE-IRL-NN (uses the sklearn .fit interface, not the uniform .estimate path); GAIL (known slow (~9 min/fit); not a quick run); DeepMaxEnt-IRL (known slow (~7 min/fit); not a quick run); Bayesian-IRL (known slow (~16 min/fit); not a quick run).