Pre-Estimation Checks

IQ-Learn needs the same general data-quality checks as other estimators, plus coverage checks that are specific to its Q-based reward recovery.

| Check | Why it matters for IQ-Learn |
| --- | --- |
| Expert state coverage | The objective only scores expert (s, a) pairs; states not in the panel receive no direct Q signal. |
| Expert state-action coverage | Off-support state-action pairs get no gradient; their implied reward is extrapolated, not fitted. |
| Q parameterization and divergence | With a tabular Q and the simple divergence, the objective has no upper bound; the chi-squared divergence is required for bounded optimization. |
| Feature rank (linear head) | A rank-deficient feature matrix leaves directions of theta undetermined. |
| Feature condition number | Ill-conditioning inflates the variance of the linear Q solve. |
| Transition row sums | Transitions must be row-stochastic in the (n_actions, n_states, n_states) orientation for the inverse Bellman reward to be valid. |
| Discount and scale | A misspecified beta or sigma shifts the implied reward by a constant factor. |
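The feature and transition checks listed above can be sketched with plain NumPy. This is a minimal illustration, not the library's own implementation: the function names `check_features` and `check_transitions` and the tolerance `atol` are hypothetical, and the feature matrix is assumed to have one row per observation and one column per feature.

```python
import numpy as np


def check_features(features):
    """Rank and 2-norm condition number of an (n_rows, n_features) matrix.

    Hypothetical helper for illustration; not part of the library.
    """
    features = np.asarray(features, dtype=float)
    n_features = features.shape[1]
    rank = int(np.linalg.matrix_rank(features))
    return {
        "rank": rank,
        "n_features": n_features,
        # Full column rank means every direction of theta is identified.
        "full_rank": rank == n_features,
        # Ratio of largest to smallest singular value.
        "condition_number": float(np.linalg.cond(features)),
    }


def check_transitions(transitions, atol=1e-8):
    """Check that transitions[a, s, :] is a probability vector for all a, s.

    The tolerance `atol` is an assumed value, not one taken from the library.
    """
    transitions = np.asarray(transitions, dtype=float)
    if transitions.ndim != 3 or transitions.shape[1] != transitions.shape[2]:
        raise ValueError("expected shape (n_actions, n_states, n_states)")
    # Sum over next states: the last axis in this orientation.
    row_sums = transitions.sum(axis=2)
    nonnegative = bool((transitions >= 0).all())
    return {
        "nonnegative": nonnegative,
        "max_row_sum_error": float(np.abs(row_sums - 1.0).max()),
        "row_stochastic": nonnegative and bool(np.allclose(row_sums, 1.0, atol=atol)),
    }
```

Summing over the last axis is what ties the check to the (n_actions, n_states, n_states) orientation; an array stored as (n_states, n_actions, n_states) can still pass the shape test when the two counts are equal, so the orientation should be confirmed separately.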

Coverage Gates

IQ-Learn output is suitable for reward and counterfactual diagnostics only when:

  • expert_state_coverage == 1.0 (every state in the MDP was visited),

  • expert_state_action_coverage >= 0.95 (at least 95 percent of state-action pairs were visited).

Below these thresholds the Q table and implied reward are valid only on support; off-support values are extrapolation.
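As a rough sketch of how the two gates could be evaluated from a raw expert panel, the snippet below counts visited cells in an (n_states, n_actions) grid. The function name `expert_coverage` and its return layout are hypothetical, and the state-action denominator is assumed to be all n_states * n_actions pairs; the estimator's own summary may define it differently.

```python
import numpy as np


def expert_coverage(states, actions, n_states, n_actions):
    """Coverage of an expert panel given integer state and action indices.

    Hypothetical helper; assumes every (state, action) cell is feasible.
    """
    states = np.asarray(states, dtype=int)
    actions = np.asarray(actions, dtype=int)
    # Mark each (s, a) cell that appears at least once in the panel.
    visited = np.zeros((n_states, n_actions), dtype=bool)
    visited[states, actions] = True
    state_cov = float(visited.any(axis=1).mean())
    state_action_cov = float(visited.mean())
    return {
        "expert_state_coverage": state_cov,
        "expert_state_action_coverage": state_action_cov,
        # Gates as stated in this section: all states, and >= 95% of pairs.
        "passes_gates": state_cov == 1.0 and state_action_cov >= 0.95,
        "visited": visited,
    }
```

The `visited` mask is returned alongside the two ratios because it identifies which cells are on support, which is the information needed when the gates fail.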

Canonical Simulation Checks

Values from the primary synthetic cell run (see Simulation Study):

| Check | Value | Status |
| --- | --- | --- |
| Feature rank | 4 / 4 | pass |
| Feature condition number | 4.51 | pass |
| Observed states | 21 / 21 | pass |
| State-action coverage | 1.000 | pass |
| Minimum action share | 0.325 | pass |
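The minimum action share is not defined in this section. One plausible reading, assumed here, is the smallest empirical frequency of any action in the expert panel. Under that assumption it could be computed as follows; `minimum_action_share` is a hypothetical name.

```python
import numpy as np


def minimum_action_share(actions, n_actions):
    """Smallest empirical share of any action (assumed definition)."""
    actions = np.asarray(actions, dtype=int)
    if actions.size == 0:
        raise ValueError("empty action panel")
    # minlength keeps never-chosen actions in the count with a share of zero.
    counts = np.bincount(actions, minlength=n_actions)
    return float((counts / counts.sum()).min())
```

A share of zero means at least one action never appears in the panel, which also implies that state-action coverage is below 1.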

Common Risk Patterns

Sparse expert panels are the main risk. When the expert panel covers only a subset of states, the Q table on unvisited states is unconstrained and the inverse Bellman reward at those cells is unreliable. The linear Q head mitigates this by constraining Q to propagate through features, but it requires a well-specified feature matrix. Always check both coverage figures from summary.metadata before interpreting the reward output.
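To make the off-support problem concrete, the sketch below computes an inverse Bellman reward from a Q table and blanks out cells the expert never visited. It assumes the soft (log-sum-exp) value function with scale sigma, which is a common IQ-Learn form but is not confirmed by this section; the function name and the `visited` argument are hypothetical.

```python
import numpy as np


def inverse_bellman_reward(q, transitions, beta, sigma=1.0, visited=None):
    """r(s, a) = Q(s, a) - beta * E[V(s') | s, a], with a soft value V.

    q:           (n_states, n_actions) Q table
    transitions: (n_actions, n_states, n_states), row-stochastic
    visited:     optional (n_states, n_actions) boolean support mask
    Assumes V(s) = sigma * logsumexp(Q(s, .) / sigma).
    """
    q = np.asarray(q, dtype=float)
    transitions = np.asarray(transitions, dtype=float)
    # Numerically stable log-sum-exp over actions.
    scaled = q / sigma
    row_max = scaled.max(axis=1, keepdims=True)
    value = sigma * (row_max[:, 0] + np.log(np.exp(scaled - row_max).sum(axis=1)))
    # (n_actions, n_states) expected next value, transposed to (n_states, n_actions).
    expected_next = (transitions @ value).T
    reward = q - beta * expected_next
    if visited is not None:
        # Off-support cells are extrapolation; report them as NaN.
        reward = np.where(np.asarray(visited, dtype=bool), reward, np.nan)
    return reward
```

Reporting off-support cells as NaN rather than as numbers is a design choice for this sketch: it keeps extrapolated rewards from being read as fitted ones in downstream tables.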

The sparse-support guard in sparse_support_guard.py demonstrates that a small policy total-variation (TV) distance and low counterfactual regret are not sufficient evidence when coverage is below the gates; support must pass first.
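The "support first" ordering can be illustrated with a small gate function. This is a hypothetical sketch, not the contents of sparse_support_guard.py: the function name, argument names, and verdict strings are invented for illustration, and only the two coverage thresholds come from this section.

```python
def interpret_diagnostics(
    expert_state_coverage,
    expert_state_action_coverage,
    policy_tv,
    counterfactual_regret,
):
    """Refuse to read fit metrics until both coverage gates pass."""
    support_ok = (
        expert_state_coverage == 1.0
        and expert_state_action_coverage >= 0.95
    )
    if not support_ok:
        # Good-looking metrics are deliberately withheld in this branch.
        return {
            "support_ok": False,
            "verdict": "inconclusive: coverage below gates",
        }
    return {
        "support_ok": True,
        "verdict": "coverage gates passed: metrics may be interpreted",
        "policy_tv": policy_tv,
        "counterfactual_regret": counterfactual_regret,
    }
```

The function sets no pass or fail threshold on the metrics themselves; it only enforces that coverage is checked before they are reported at all.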