# Simulation Study

IQ-Learn is run on three synthetic cells covering low-dimensional tabular,
high-dimensional neural, and state-only reward settings. Each cell has a known
transition model, policy, value function, Q function, and counterfactual
oracles, so every recovery claim is checked against the truth. The primary cell
is `canonical_low_action`.

The full result generator is
[`run.py`](https://github.com/rawatpranjal/EconIRL/blob/main/validation/estimators/iq_learn/run.py).
It writes the machine-readable results file
[`iq_learn.json`](https://github.com/rawatpranjal/EconIRL/blob/main/validation/results/iq_learn.json).

```bash
cd /path/to/econirl
PYTHONPATH=src:. python validation/estimators/iq_learn/run.py
```
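Once the generator has run, the results file can be inspected without assuming anything about its layout. The sketch below only lists the top-level keys and their types; the helper names `load_results` and `top_level_summary` are hypothetical and not part of the repository, and the relative path assumes the script is run from the repository root.

```python
import json
from pathlib import Path


def load_results(path):
    """Parse a JSON results file and return the decoded object."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def top_level_summary(results):
    """Map each top-level key to the type name of its value.

    No schema is assumed: a non-dict root is reported under "<root>".
    """
    if isinstance(results, dict):
        return {key: type(value).__name__ for key, value in results.items()}
    return {"<root>": type(results).__name__}


# Path taken from the results link above, relative to the repository root.
results_path = Path("validation/results/iq_learn.json")
if results_path.exists():
    for key, kind in top_level_summary(load_results(results_path)).items():
        print(f"{key}: {kind}")
```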

## Primary Cell: canonical_low_action

### Design

| Quantity | Value |
| --- | --- |
| States | 21 |
| Actions | 3 |
| Individuals | 2,000 |
| Periods per individual | 80 |
| Observations | 160,000 |
| Q type | tabular |
| Divergence | chi2 |
| Alpha | 1.0 |

### Fit Summary

| Quantity | Value |
| --- | --- |
| Converged | True |
| Log-likelihood | -174923.515625 |
| Iterations | 173 |
| Estimation time | 3.49 seconds |
| Expert state coverage | 1.0 |
| Expert state-action coverage | 1.0 |

### Recovery Metrics

| Metric | Value | Gate | Status |
| --- | --- | --- | --- |
| Policy TV | 0.04068339984836971 | at most 0.05 | pass |
| Raw Bellman reward NRMSE | 0.3809617636095332 | at most 0.1 | fail |
| Projected reward NRMSE | 0.27739328652373035 | at most 0.1 | fail |
| Value NRMSE | 0.4855298533917329 | at most 0.1 | fail |
| Q NRMSE | 0.48137314674423415 | at most 0.1 | fail |
| Type A counterfactual regret | 0.011518059257921435 | at most 0.05 | pass |
| Type B counterfactual regret | 0.02155897051624924 | at most 0.05 | pass |
| Type C counterfactual regret | 0.009886935172236239 | at most 0.05 | pass |

The estimator passes the imitation gate (Policy TV) and all three
counterfactual regret gates on the primary cell. Reward, value, and Q recovery
all fail, with NRMSE between roughly 0.28 and 0.49 against a gate of 0.1. The
low regret shows only that the Q-induced policy yields near-oracle welfare
under the applied interventions. It does not show that the recovered reward or
value objects are structurally accurate.
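The gap between imitation and structural recovery can be illustrated with a toy example. The sketch below is not the repository's metric code: it assumes Policy TV is the total variation distance averaged uniformly over states and that NRMSE normalizes RMSE by the standard deviation of the truth, neither of which is stated on this page. It shows one generic way a policy metric can be perfect while a Q metric is poor, since a softmax policy is unchanged by a per-state shift in Q. It is not a diagnosis of why the gates above fail.

```python
import math


def softmax(row):
    """Softmax over one state's action values (max-subtracted for stability)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def policy_tv(pi_hat, pi_true):
    """Assumed definition: per-state total variation, averaged uniformly."""
    per_state = [
        0.5 * sum(abs(a - b) for a, b in zip(row_hat, row_true))
        for row_hat, row_true in zip(pi_hat, pi_true)
    ]
    return sum(per_state) / len(per_state)


def nrmse(estimate, truth):
    """Assumed definition: RMSE divided by the standard deviation of the truth."""
    flat_e = [x for row in estimate for x in row]
    flat_t = [x for row in truth for x in row]
    n = len(flat_t)
    rmse = math.sqrt(sum((e - t) ** 2 for e, t in zip(flat_e, flat_t)) / n)
    mean_t = sum(flat_t) / n
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in flat_t) / n)
    return rmse / std_t


# Toy Q table: 2 states x 3 actions (illustrative numbers only).
q_true = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
# Shift each state's Q values by a constant; the softmax policy is unchanged.
shifts = [2.0, -3.0]
q_hat = [[q + s for q in row] for row, s in zip(q_true, shifts)]

tv = policy_tv([softmax(r) for r in q_hat], [softmax(r) for r in q_true])
q_err = nrmse(q_hat, q_true)
print(f"policy TV = {tv:.4f}, Q NRMSE = {q_err:.4f}")
```

In this toy case the policy metric would pass a 0.05 gate while the Q metric would fail a 0.1 gate by a wide margin.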

## Stress Cell: canonical_high_action

| Quantity | Value |
| --- | --- |
| States | 81 |
| Actions | 3 |
| Individuals | 2,000 |
| Periods per individual | 80 |
| Observations | 160,000 |
| Q type | neural |

| Metric | Value | Gate | Status |
| --- | --- | --- | --- |
| Policy TV | 0.069342286794494 | at most 0.05 | fail |
| Raw Bellman reward NRMSE | 0.9969176297594339 | at most 0.1 | fail |
| Projected reward NRMSE | 0.7863419580279364 | at most 0.1 | fail |
| Value NRMSE | 1.7525372602832876 | at most 0.1 | fail |
| Q NRMSE | 1.4222683291851073 | at most 0.1 | fail |
| Type A regret | 0.09202234513797602 | at most 0.05 | fail |
| Type B regret | 0.38998822309194975 | at most 0.05 | fail |
| Type C regret | 0.026453418082599357 | at most 0.05 | pass |

The imitation gate (Policy TV), every structural recovery gate, and two of the
three regret gates (Type A and Type B) fail on the high-dimensional neural
cell. Only Type C regret passes.

## Negative Control: canonical_low_state_only

| Quantity | Value |
| --- | --- |
| States | 21 |
| Actions | 3 |
| Individuals | 500 |
| Periods per individual | 80 |
| Observations | 40,000 |
| Q type | tabular |

| Metric | Value | Gate | Status |
| --- | --- | --- | --- |
| Policy TV | 0.03664439528766237 | at most 0.05 | pass |
| Raw Bellman reward NRMSE | 0.7500170275582363 | at most 0.1 | fail |
| Projected reward NRMSE | 0.28081766497887156 | at most 0.1 | fail |
| Value NRMSE | 0.5703591477554589 | at most 0.1 | fail |
| Q NRMSE | 0.5518449193241758 | at most 0.1 | fail |
| Type A regret | 0.034859046266863335 | at most 0.05 | pass |
| Type B regret | 0.05649933858379137 | at most 0.05 | fail |
| Type C regret | 0.020117752969335607 | at most 0.05 | pass |

The state-only cell passes the imitation gate and the Type A and Type C regret
gates. Type B regret (0.056) narrowly misses its 0.05 gate, and structural
reward, value, and Q recovery all fail.
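The Status column in each table follows from comparing the reported value with its "at most" gate. A minimal sketch of that comparison, using the values from the `canonical_low_state_only` table above, is shown here. The helper `gate_status` is hypothetical, and treating "at most" as an inclusive `<=` is an assumption about the boundary case.

```python
# (metric, reported value, gate) copied from the canonical_low_state_only table.
GATE_ROWS = [
    ("Policy TV", 0.03664439528766237, 0.05),
    ("Raw Bellman reward NRMSE", 0.7500170275582363, 0.1),
    ("Projected reward NRMSE", 0.28081766497887156, 0.1),
    ("Value NRMSE", 0.5703591477554589, 0.1),
    ("Q NRMSE", 0.5518449193241758, 0.1),
    ("Type A regret", 0.034859046266863335, 0.05),
    ("Type B regret", 0.05649933858379137, 0.05),
    ("Type C regret", 0.020117752969335607, 0.05),
]


def gate_status(value, gate):
    """Return "pass" when value is at most the gate (assumed inclusive)."""
    return "pass" if value <= gate else "fail"


statuses = {name: gate_status(value, gate) for name, value, gate in GATE_ROWS}
for name, value, gate in GATE_ROWS:
    print(f"{name:<26} {value:.4f}  at most {gate}  {statuses[name]}")
```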

## Sparse-Support Guard

The sparse-support guard uses a tiny panel in which only one state and one
state-action pair are observed (state coverage 0.333, state-action coverage
0.167). Even when all non-coverage metrics are set to pass, the run is not
counterfactual-valid because the support gates fail. The guard prevents future
changes from treating small policy or regret numbers as sufficient when the
expert panel does not cover the relevant state-action space.
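The guard logic can be sketched as follows. This is an illustration, not the contents of `sparse_support_guard.py`: the function names, the `min_coverage` threshold of 1.0, and the 3-state, 2-action space are all assumptions. The space size is inferred from the reported ratios (1/3 and 1/6) rather than stated in the source.

```python
def support_coverage(observations, n_states, n_actions):
    """Share of states and of state-action pairs seen at least once.

    `observations` is an iterable of (state, action) tuples.
    """
    seen_pairs = set(observations)
    seen_states = {state for state, _ in seen_pairs}
    state_cov = len(seen_states) / n_states
    sa_cov = len(seen_pairs) / (n_states * n_actions)
    return state_cov, sa_cov


def counterfactual_valid(metric_statuses, state_cov, sa_cov, min_coverage=1.0):
    """Valid only if every metric passes AND both support gates pass.

    min_coverage is a hypothetical threshold chosen for this sketch.
    """
    metrics_ok = all(status == "pass" for status in metric_statuses.values())
    support_ok = state_cov >= min_coverage and sa_cov >= min_coverage
    return metrics_ok and support_ok


# Tiny panel: the same single (state, action) pair observed repeatedly.
panel = [(0, 1), (0, 1), (0, 1)]
state_cov, sa_cov = support_coverage(panel, n_states=3, n_actions=2)

# Force every non-coverage metric to pass, as the guard does.
forced_pass = {"Policy TV": "pass", "Type A regret": "pass"}
valid = counterfactual_valid(forced_pass, state_cov, sa_cov)
print(round(state_cov, 3), round(sa_cov, 3), valid)  # 0.333 0.167 False
```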

```bash
PYTHONPATH=src:. python validation/estimators/iq_learn/sparse_support_guard.py
```

Results: [`iq_learn_sparse_support_guard.json`](https://github.com/rawatpranjal/EconIRL/blob/main/validation/results/iq_learn_sparse_support_guard.json).

## Cross-Estimator Simulation Studies

IQ-Learn appears on both cross-estimator simulation-study pages:
the [bus engine](../../simulation_studies/rust_bus.md) and the
[taxi gridworld](../../simulation_studies/taxi_gridworld.md) pages,
where it is compared against the full structural and IRL rosters.