Gridworld navigation

Gridworld navigation is the home turf of the maximum-entropy IRL tradition of Ziebart’s MaxEnt and its descendants, so this page weights the roster toward IRL methods. NFXP, CCP, MPEC, and UFXP run as the structural contrast. The environment also supplies a stress the bus engine does not. Every trajectory starts at the same corner and walks toward the goal, so states off that path are visited rarely or never. Methods that invert state-by-state choice frequencies feel that thinness. Methods that share strength through features or networks do not.

The data-generating process

States are cells of an \(N \times N\) grid indexed \(s = \mathrm{row} \cdot N + \mathrm{col}\), with five actions (left, right, up, down, stay), deterministic moves, and an absorbing goal at the bottom-right corner. The reward has three parts: a per-step penalty, a terminal bonus when the chosen move reaches the goal, and a shaping term in the Manhattan distance \(d(s)\) to the goal:

\[ u_\theta(s, a) = \theta_{\mathrm{step}}\, \mathbf{1}\{s \neq s_{\mathrm{goal}}\} + \theta_{\mathrm{goal}}\, \mathbf{1}\{s'(s, a) = s_{\mathrm{goal}}\} - \theta_{\mathrm{dist}}\, \frac{d(s)}{2N}, \]

with \(\theta_{\mathrm{step}} = -0.1\), \(\theta_{\mathrm{goal}} = 10\), \(\theta_{\mathrm{dist}} = 0.1\). The agent discounts at \(\beta\) and faces i.i.d. logit taste shocks (scale \(\sigma = 1\)), so behavior solves the soft Bellman equation

\[ V(s) = \log \sum_{a} \exp\Bigl(u_\theta(s,a) + \beta\, \mathbb{E}\bigl[V(s') \mid s,a\bigr]\Bigr), \qquad \pi^*(a \mid s) \propto \exp\Bigl(u_\theta(s,a) + \beta\, \mathbb{E}\bigl[V(s') \mid s,a\bigr]\Bigr), \]

and every trajectory starts at the top-left corner (state 0). The figure shows the resulting paths climbing the state index toward the absorbing goal and the value function rising with proximity to it. The horizon is deliberately short (20 periods) because the goal is absorbing: once there, an agent generates no further information.

GridworldEnvironment(grid_size=8, step_penalty=-0.1, terminal_reward=10.0, distance_weight=0.1, discount_factor=0.95). Transitions are deterministic; 64 states, 5 actions (left, right, up, down, stay). 500 x 20 observations, 3 replications, seed 7. True theta [-0.1, 10.0, 0.1]. Design rank 3/3, condition number 7.45e+00, action-contrast rank 1/3 (the rank that identification from choices actually uses). Generated 2026-06-12 with econirl 0.0.4.

Simulated trajectories and the optimal value function for Gridworld 8x8

Results

Estimator	Family	Ran	Conv	Policy TV	Time (s)
MaxEnt-IRL	behavioral	3/3	0/3	0.2567	22.5
MCE-IRL	behavioral	3/3	0/3	0.3632	251.1
Deep-MCE-IRL	behavioral	3/3	3/3	0.3986	11.1
AIRL	behavioral	3/3	0/3	0.6152	113.7
IQ-Learn	behavioral	3/3	3/3	0.6165	2.1
f-IRL	behavioral	3/3	3/3	0.0509	28.4
GLADIUS	behavioral	3/3	3/3	0.2208	25.6
BC	behavioral	3/3	3/3	0.1298	0.2
NFXP	structural	3/3	3/3	0.0004	12.0
CCP	structural	3/3	3/3	0.0081	3.1
MPEC	structural	3/3	3/3	0.0004	0.9
UFXP	structural	3/3	0/3	0.0388	0.1

Policy TV is the distance between estimated and true choice probabilities, lower is better. Conv is the estimator’s own convergence flag. A cautious flag can read False while the recovered policy is accurate.

The structural rows match behavior almost perfectly, but their parameters are not separately identified here, so the parameter columns are omitted. The raw feature design has full rank, while the action-contrast design has rank 1. The step-penalty and distance features take the same value for every action at a state, so they cancel out of every choice probability. The regret columns are omitted for the same reason, because transferring an unidentified reward is not a meaningful exercise. The lesson is to check the rank of the action-differenced features before estimating.

Notes per estimator

MCE-IRL. Two of its three reward directions, the state-only step and distance features, are unidentified here. Its gradient ascent can drift along them, and in one replication of three the policy collapsed. Read the per-rep records, not just the mean.

f-IRL. The strongest behavioral score on this page. It recovers a tabular reward, one value per state-action pair, which does not depend on the deficient feature basis at all.

Reproduce

python scripts/sim_taxi_gridworld.py                 # run + write JSON
python scripts/sim_taxi_gridworld.py --page          # regenerate this page
python scripts/sim_taxi_gridworld.py --verify        # re-derive the table from JSON

Raw facts: validation/results/sim_taxi_gridworld.json.

Not shown on this page: SEES (its spline value basis is built for an ordered 1-D state index. A 2-D grid breaks that geometry, so running it here would be misspecification by construction); NNES, TD-CCP (the structural contrast is carried by NFXP/CCP/MPEC/UFXP here. The full structural roster runs on the bus engine and abstract MDP pages); MMP, GAIL, GCL, DeepMaxEnt-IRL, Bayesian-IRL (research code or too slow; not benchmarked in this study).