Simulation Study
IQ-Learn runs on three synthetic cells covering low-dimensional tabular,
high-dimensional neural, and state-only reward settings. Each cell has known
transitions, policy, value, Q function, and counterfactual oracle objects, so
every recovery claim is checked against the truth. The primary cell is
canonical_low_action.
The full result generator is run.py. It writes the machine-readable results file iq_learn.json.
cd /path/to/econirl
PYTHONPATH=src:. python validation/estimators/iq_learn/run.py
Primary Cell: canonical_low_action
Design
| Quantity | Value |
|---|---|
| States | 21 |
| Actions | 3 |
| Individuals | 2,000 |
| Periods per individual | 80 |
| Observations | 160,000 |
| Q type | tabular |
| Divergence | chi2 |
| Alpha | 1.0 |
Fit Summary
| Quantity | Value |
|---|---|
| Converged | True |
| Log-likelihood | -174923.515625 |
| Iterations | 173 |
| Estimation time | 3.49 seconds |
| Expert state coverage | 1.0 |
| Expert state-action coverage | 1.0 |
Recovery Metrics
| Metric | Value | Gate | Status |
|---|---|---|---|
| Policy TV | 0.04068339984836971 | at most 0.05 | pass |
| Raw Bellman reward NRMSE | 0.3809617636095332 | at most 0.1 | fail |
| Projected reward NRMSE | 0.27739328652373035 | at most 0.1 | fail |
| Value NRMSE | 0.4855298533917329 | at most 0.1 | fail |
| Q NRMSE | 0.48137314674423415 | at most 0.1 | fail |
| Type A counterfactual regret | 0.011518059257921435 | at most 0.05 | pass |
| Type B counterfactual regret | 0.02155897051624924 | at most 0.05 | pass |
| Type C counterfactual regret | 0.009886935172236239 | at most 0.05 | pass |
The estimator passes imitation and counterfactual regret checks on the primary cell. Reward, value, and Q recovery fail. Low regret on this cell reflects that the Q-induced policy happens to produce near-oracle welfare under the applied interventions, not that the reward or value objects are structurally accurate.
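The pass/fail column can be reproduced from the reported values and the "at most" thresholds. The following is a minimal sketch, not the project's gate logic: the dictionary keys and the `evaluate_gates` helper are hypothetical names chosen for illustration, and the numbers are copied from the primary-cell table.

```python
# Primary-cell metrics copied from the Recovery Metrics table.
# The key names are illustrative, not the schema of iq_learn.json.
metrics = {
    "policy_tv": 0.04068339984836971,
    "raw_bellman_reward_nrmse": 0.3809617636095332,
    "projected_reward_nrmse": 0.27739328652373035,
    "value_nrmse": 0.4855298533917329,
    "q_nrmse": 0.48137314674423415,
    "type_a_regret": 0.011518059257921435,
    "type_b_regret": 0.02155897051624924,
    "type_c_regret": 0.009886935172236239,
}

# "At most" thresholds from the Gate column.
gates = {
    "policy_tv": 0.05,
    "raw_bellman_reward_nrmse": 0.1,
    "projected_reward_nrmse": 0.1,
    "value_nrmse": 0.1,
    "q_nrmse": 0.1,
    "type_a_regret": 0.05,
    "type_b_regret": 0.05,
    "type_c_regret": 0.05,
}


def evaluate_gates(metrics, gates):
    """Return {name: 'pass' | 'fail'} for upper-bound ('at most') gates."""
    return {
        name: "pass" if metrics[name] <= limit else "fail"
        for name, limit in gates.items()
    }


status = evaluate_gates(metrics, gates)
failed = sorted(name for name, result in status.items() if result == "fail")
print(failed)
```

Listing the failed gates separately makes the split visible: the imitation and regret gates pass, while the four structural recovery gates fail.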
Stress Cell: canonical_high_action
| Quantity | Value |
|---|---|
| States | 81 |
| Actions | 3 |
| Individuals | 2,000 |
| Periods per individual | 80 |
| Observations | 160,000 |
| Q type | neural |

| Metric | Value | Gate | Status |
|---|---|---|---|
| Policy TV | 0.069342286794494 | at most 0.05 | fail |
| Raw Bellman reward NRMSE | 0.9969176297594339 | at most 0.1 | fail |
| Projected reward NRMSE | 0.7863419580279364 | at most 0.1 | fail |
| Value NRMSE | 1.7525372602832876 | at most 0.1 | fail |
| Q NRMSE | 1.4222683291851073 | at most 0.1 | fail |
| Type A regret | 0.09202234513797602 | at most 0.05 | fail |
| Type B regret | 0.38998822309194975 | at most 0.05 | fail |
| Type C regret | 0.026453418082599357 | at most 0.05 | pass |
The imitation gate (Policy TV), all structural recovery gates, and two of the three regret gates fail on the high-dimensional neural cell. Only Type C regret passes.
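For readers unfamiliar with the metric names, the sketch below shows one common way a policy total-variation distance and a normalized RMSE can be computed. It is an illustration under stated assumptions, not the definitions used by run.py: the unweighted average over states in `policy_tv` and the range normalization in `nrmse` are both assumptions, and the toy arrays are made up.

```python
import math


def policy_tv(pi_hat, pi_true):
    """Average over states of 0.5 * sum_a |pi_hat(a|s) - pi_true(a|s)|.

    Assumption: states are weighted equally; the real metric may weight
    states by visitation frequency instead.
    """
    per_state = [
        0.5 * sum(abs(p - q) for p, q in zip(row_hat, row_true))
        for row_hat, row_true in zip(pi_hat, pi_true)
    ]
    return sum(per_state) / len(per_state)


def nrmse(estimate, truth):
    """RMSE divided by the range of the true values.

    Assumption: range normalization; dividing by the standard deviation
    or the mean absolute value of the truth are other common choices.
    """
    mse = sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)
    return math.sqrt(mse) / (max(truth) - min(truth))


# Toy two-state, two-action policies (made-up numbers).
pi_true = [[0.5, 0.5], [0.2, 0.8]]
pi_hat = [[0.6, 0.4], [0.2, 0.8]]
tv = policy_tv(pi_hat, pi_true)

# A toy value vector estimated with a constant offset of 0.5.
truth = [0.0, 1.0, 2.0]
estimate = [0.5, 1.5, 2.5]
err = nrmse(estimate, truth)
print(tv, err)
```

Whatever normalization is used, an NRMSE above 1, as reported for the value and Q functions on this cell, means the typical error is larger than the normalizing scale of the true object.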
Negative Control: canonical_low_state_only
| Quantity | Value |
|---|---|
| States | 21 |
| Actions | 3 |
| Individuals | 500 |
| Periods per individual | 80 |
| Observations | 40,000 |
| Q type | tabular |

| Metric | Value | Gate | Status |
|---|---|---|---|
| Policy TV | 0.03664439528766237 | at most 0.05 | pass |
| Raw Bellman reward NRMSE | 0.7500170275582363 | at most 0.1 | fail |
| Projected reward NRMSE | 0.28081766497887156 | at most 0.1 | fail |
| Value NRMSE | 0.5703591477554589 | at most 0.1 | fail |
| Q NRMSE | 0.5518449193241758 | at most 0.1 | fail |
| Type A regret | 0.034859046266863335 | at most 0.05 | pass |
| Type B regret | 0.05649933858379137 | at most 0.05 | fail |
| Type C regret | 0.020117752969335607 | at most 0.05 | pass |
The state-only cell passes the imitation check and two of the three regret checks; Type B regret (0.056) narrowly misses its 0.05 gate. Structural reward, value, and Q recovery all fail.
Sparse-Support Guard
The sparse-support guard uses a tiny panel with one observed state and one observed state-action pair (state coverage 0.333, state-action coverage 0.167). Even when all non-coverage metrics are set to pass, the run is not counterfactual-valid because support gates fail. The guard prevents future changes from treating small policy or regret numbers as sufficient when the expert panel does not cover the relevant state-action space.
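The coverage figures and the guard's decision rule can be illustrated with a small sketch. Everything named here is hypothetical: `support_coverage`, the `min_coverage` threshold of 1.0, and the 3-state, 2-action panel are assumptions chosen so that one observed state and one observed state-action pair reproduce the reported 0.333 and 0.167. The actual guard script may define these differently.

```python
def support_coverage(panel, n_states, n_actions):
    """Fraction of states and of state-action pairs observed in the panel.

    `panel` is an iterable of (state, action) tuples.
    """
    pairs = set(panel)
    states = {state for state, _ in pairs}
    state_cov = len(states) / n_states
    state_action_cov = len(pairs) / (n_states * n_actions)
    return state_cov, state_action_cov


# Hypothetical tiny panel: every observation is (state 0, action 1).
panel = [(0, 1), (0, 1), (0, 1), (0, 1)]
state_cov, sa_cov = support_coverage(panel, n_states=3, n_actions=2)

# Assumed rule: counterfactual validity needs both the metric gates and
# the support gates to pass. The threshold value is an assumption.
min_coverage = 1.0
all_metric_gates_pass = True  # forced to pass, as in the guard scenario
counterfactual_valid = (
    all_metric_gates_pass
    and state_cov >= min_coverage
    and sa_cov >= min_coverage
)
print(round(state_cov, 3), round(sa_cov, 3), counterfactual_valid)
```

Keeping the support check as a separate conjunct means that small policy or regret numbers alone cannot mark a run as counterfactual-valid.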
PYTHONPATH=src:. python validation/estimators/iq_learn/sparse_support_guard.py
Results: iq_learn_sparse_support_guard.json.
Simulation Studies
IQ-Learn appears on both cross-estimator simulation-study pages: the bus engine and the taxi gridworld pages, where it is compared against the full structural and IRL rosters.