Simulation Study

IQ-Learn runs on three synthetic cells covering low-dimensional tabular, high-dimensional neural, and state-only reward settings. Each cell has known transitions, policy, value, Q function, and counterfactual oracle objects, so every recovery claim is checked against the truth. The primary cell is canonical_low_action.

The full result generator is run.py. It writes the machine-readable results file iq_learn.json.

cd /path/to/econirl
PYTHONPATH=src:. python validation/estimators/iq_learn/run.py

Primary Cell: canonical_low_action

Design

Quantity	Value
States	21
Actions	3
Individuals	2,000
Periods per individual	80
Observations	160,000
Q type	tabular
Divergence	chi2
Alpha	1.0

Fit Summary

Quantity	Value
Converged	True
Log-likelihood	-174923.515625
Iterations	173
Estimation time	3.49 seconds
Expert state coverage	1.0
Expert state-action coverage	1.0

Recovery Metrics

Metric	Value	Gate	Status
Policy TV	0.04068339984836971	at most 0.05	pass
Raw Bellman reward NRMSE	0.3809617636095332	at most 0.1	fail
Projected reward NRMSE	0.27739328652373035	at most 0.1	fail
Value NRMSE	0.4855298533917329	at most 0.1	fail
Q NRMSE	0.48137314674423415	at most 0.1	fail
Type A counterfactual regret	0.011518059257921435	at most 0.05	pass
Type B counterfactual regret	0.02155897051624924	at most 0.05	pass
Type C counterfactual regret	0.009886935172236239	at most 0.05	pass

The estimator passes imitation and counterfactual regret checks on the primary cell. Reward, value, and Q recovery fail. Low regret on this cell reflects that the Q-induced policy happens to produce near-oracle welfare under the applied interventions, not that the reward or value objects are structurally accurate.
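One possible mechanism behind a passing policy gate alongside a failing Q gate is that a softmax policy is unchanged when the same constant is added to every action's Q value within a state, so close policy agreement does not pin down Q levels. The sketch below illustrates that point only; it is not the code in run.py and not a diagnosis of the reported failures. The softmax choice rule, the metric definitions (policy TV as the mean over states of half the L1 distance, NRMSE as RMSE divided by the standard deviation of the truth), and all function names are assumptions made for illustration.

```python
import numpy as np


def softmax_policy(q):
    """Row-wise softmax over actions (assumed choice rule)."""
    z = q - q.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)


def policy_tv(p_hat, p_true):
    """Assumed definition: total variation distance averaged over states."""
    return float(np.mean(0.5 * np.abs(p_hat - p_true).sum(axis=1)))


def nrmse(est, truth):
    """Assumed definition: RMSE scaled by the standard deviation of the truth."""
    rmse = np.sqrt(np.mean((est - truth) ** 2))
    return float(rmse / np.std(truth))


rng = np.random.default_rng(0)
q_true = rng.normal(size=(21, 3))            # 21 states, 3 actions, as in the primary cell
shift = rng.normal(scale=2.0, size=(21, 1))  # one constant per state, shared by all actions
q_hat = q_true + shift                       # hypothetical estimate: right gaps, wrong levels

tv = policy_tv(softmax_policy(q_hat), softmax_policy(q_true))
q_err = nrmse(q_hat, q_true)
print(f"policy TV = {tv:.3g}, Q NRMSE = {q_err:.3g}")
```

In this toy example the policy TV is zero up to floating-point error while the Q NRMSE is large, which is why a policy or regret pass is not read as evidence of structural recovery.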

Stress Cell: canonical_high_action

Quantity	Value
States	81
Actions	3
Individuals	2,000
Periods per individual	80
Observations	160,000
Q type	neural

Metric	Value	Gate	Status
Policy TV	0.069342286794494	at most 0.05	fail
Raw Bellman reward NRMSE	0.9969176297594339	at most 0.1	fail
Projected reward NRMSE	0.7863419580279364	at most 0.1	fail
Value NRMSE	1.7525372602832876	at most 0.1	fail
Q NRMSE	1.4222683291851073	at most 0.1	fail
Type A counterfactual regret	0.09202234513797602	at most 0.05	fail
Type B counterfactual regret	0.38998822309194975	at most 0.05	fail
Type C counterfactual regret	0.026453418082599357	at most 0.05	pass

The imitation gate (Policy TV), every structural recovery gate, and two of the three regret gates fail on the high-dimensional neural cell. Only Type C counterfactual regret passes.

Negative Control: canonical_low_state_only

Quantity	Value
States	21
Actions	3
Individuals	500
Periods per individual	80
Observations	40,000
Q type	tabular

Metric	Value	Gate	Status
Policy TV	0.03664439528766237	at most 0.05	pass
Raw Bellman reward NRMSE	0.7500170275582363	at most 0.1	fail
Projected reward NRMSE	0.28081766497887156	at most 0.1	fail
Value NRMSE	0.5703591477554589	at most 0.1	fail
Q NRMSE	0.5518449193241758	at most 0.1	fail
Type A counterfactual regret	0.034859046266863335	at most 0.05	pass
Type B counterfactual regret	0.05649933858379137	at most 0.05	fail
Type C counterfactual regret	0.020117752969335607	at most 0.05	pass

The state-only cell passes the imitation check and two of the three regret checks (Type B regret, at about 0.056, narrowly misses the 0.05 gate) but fails structural reward, value, and Q recovery.
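The pass/fail logic in the table above can be reproduced with a simple "at most" comparison. The following is a minimal sketch under assumptions: the dictionary keys and the evaluate_gates helper are hypothetical names chosen for illustration and are not taken from run.py or iq_learn.json. The metric values are copied from the canonical_low_state_only table and rounded to four decimals.

```python
# Metric values from the canonical_low_state_only table (rounded to 4 decimals).
# Key names are hypothetical and may not match the keys in iq_learn.json.
METRICS = {
    "policy_tv": 0.0366,
    "raw_bellman_reward_nrmse": 0.7500,
    "projected_reward_nrmse": 0.2808,
    "value_nrmse": 0.5704,
    "q_nrmse": 0.5518,
    "type_a_regret": 0.0349,
    "type_b_regret": 0.0565,
    "type_c_regret": 0.0201,
}

# "At most" thresholds from the Gate column.
GATES = {
    "policy_tv": 0.05,
    "raw_bellman_reward_nrmse": 0.1,
    "projected_reward_nrmse": 0.1,
    "value_nrmse": 0.1,
    "q_nrmse": 0.1,
    "type_a_regret": 0.05,
    "type_b_regret": 0.05,
    "type_c_regret": 0.05,
}


def evaluate_gates(metrics, gates):
    """Return 'pass' when a metric is at most its gate, otherwise 'fail'."""
    return {
        name: "pass" if metrics[name] <= gates[name] else "fail"
        for name in gates
    }


status = evaluate_gates(METRICS, GATES)
for name, result in status.items():
    print(f"{name:26s} {METRICS[name]:.4f}  at most {GATES[name]:.2f}  {result}")
```

Keeping the thresholds in one mapping makes it easy to see that the four structural metrics share the 0.1 gate while policy TV and the regret metrics share the 0.05 gate.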

Sparse-Support Guard

The sparse-support guard uses a tiny panel with one observed state and one observed state-action pair (state coverage 0.333, state-action coverage 0.167). Even when all non-coverage metrics are set to pass, the run is not counterfactual-valid because support gates fail. The guard prevents future changes from treating small policy or regret numbers as sufficient when the expert panel does not cover the relevant state-action space.

PYTHONPATH=src:. python validation/estimators/iq_learn/sparse_support_guard.py

Results: iq_learn_sparse_support_guard.json.
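The coverage figures quoted for the guard can be reproduced with a small sketch. It assumes that coverage means the fraction of states (or state-action pairs) visited at least once, and that the tiny panel has 3 states and 2 actions, which is one configuration consistent with the reported 0.333 and 0.167. The function name, the panel contents, and the full-coverage threshold are hypothetical; sparse_support_guard.py may define them differently.

```python
import numpy as np


def support_coverage(states, actions, n_states, n_actions):
    """Fraction of states and of state-action pairs observed at least once."""
    counts = np.zeros((n_states, n_actions), dtype=int)
    np.add.at(counts, (states, actions), 1)  # tally visits per (state, action) cell
    state_cov = float((counts.sum(axis=1) > 0).mean())
    state_action_cov = float((counts > 0).mean())
    return state_cov, state_action_cov


# Hypothetical tiny panel: every observation is state 0 with action 1.
states = np.array([0, 0, 0, 0])
actions = np.array([1, 1, 1, 1])
state_cov, sa_cov = support_coverage(states, actions, n_states=3, n_actions=2)

# Assumed rule: counterfactual validity needs the support gates AND the other gates.
MIN_COVERAGE = 1.0             # hypothetical threshold
other_metrics_pass = True      # the guard forces the non-coverage metrics to pass
support_pass = state_cov >= MIN_COVERAGE and sa_cov >= MIN_COVERAGE
counterfactual_valid = other_metrics_pass and support_pass

print(f"state coverage = {state_cov:.3f}, state-action coverage = {sa_cov:.3f}")
print(f"counterfactual valid: {counterfactual_valid}")
```

Under these assumptions the run stays invalid even though every other metric passes, which is the behavior the guard is meant to lock in.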

Simulation Studies

IQ-Learn appears on both cross-estimator simulation-study pages: the bus engine and the taxi gridworld pages, where it is compared against the full structural and IRL rosters.