# Under the Hood

## Model

The data are (state, action, next-state) triples $(s, a, s')$ from a stationary
infinite-horizon dynamic discrete choice model. The discount factor $\beta$,
the transition kernels $F_a(s' \mid s)$ (stored with orientation $(A, S, S)$,
action first), and the scale $\sigma$ of the i.i.d. logit taste shocks are
assumed known. Instead of parameterizing a reward function directly, IQ-Learn
parameterizes the soft Q-function $Q(s, a)$ and derives the policy and implied
reward from it.
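
As a minimal sketch of the $(A, S, S)$ orientation, the snippet below reorders a transition array that is laid out state-first and checks that the result is a valid set of kernels. The helper `validate_transitions` and the index order `F[a, s, s']` are assumptions made for illustration, not part of the package API.

```python
import numpy as np

def validate_transitions(F, atol=1e-8):
    """Check that F has shape (A, S, S) and that each F[a, s, :] is a distribution."""
    F = np.asarray(F, dtype=float)
    if F.ndim != 3 or F.shape[1] != F.shape[2]:
        raise ValueError(f"expected shape (A, S, S), got {F.shape}")
    if (F < 0).any() or not np.allclose(F.sum(axis=2), 1.0, atol=atol):
        raise ValueError("each F[a, s, :] must be non-negative and sum to one")
    return F

# Hypothetical input laid out as F_sas[s, a, s'] with S = 3 states and A = 2 actions.
F_sas = np.full((3, 2, 3), 1.0 / 3.0)

# Move the action axis to the front to obtain the (A, S, S) orientation.
F = validate_transitions(np.transpose(F_sas, (1, 0, 2)))
```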

## Soft Value and Policy

Given $Q$, the soft value function and the implied policy are:

$$
V(s) = \sigma \log \sum_a \exp\!\bigl(Q(s, a) / \sigma\bigr),
\qquad
\pi(a \mid s) = \exp\!\bigl((Q(s, a) - V(s)) / \sigma\bigr).
$$

These satisfy $Q(s, a) - V(s) = \sigma \log \pi(a \mid s)$.
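
The two formulas can be sketched in NumPy as follows. This is an illustration rather than the package's code: the function names are hypothetical, `Q` is assumed to be an array of shape `(S, A)`, and the row maximum is subtracted before exponentiating so that large entries do not overflow.

```python
import numpy as np

def soft_value(Q, sigma=1.0):
    """V(s) = sigma * log sum_a exp(Q(s, a) / sigma), computed stably."""
    z = Q / sigma
    z_max = z.max(axis=1, keepdims=True)   # subtract the row max to avoid overflow
    return sigma * (z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1)))

def soft_policy(Q, sigma=1.0):
    """pi(a | s) = exp((Q(s, a) - V(s)) / sigma); each row sums to one."""
    return np.exp((Q - soft_value(Q, sigma)[:, None]) / sigma)

# Made-up Q table; the last row would overflow a naive exp(Q / sigma).
Q = np.array([[1.0, 2.0], [0.5, 0.5], [2000.0, 1999.0]])
sigma = 2.0
V = soft_value(Q, sigma)
pi = soft_policy(Q, sigma)

# Numerical check of the identity Q(s, a) - V(s) = sigma * log pi(a | s).
identity_holds = np.allclose(Q - V[:, None], sigma * np.log(pi))
```

`scipy.special.logsumexp` performs the same stabilized reduction and could replace the hand-written version.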

## Inverse Bellman Reward

The temporal difference of $Q$ under the transitions defines the reward
implied by $Q$ through the inverse Bellman operator:

$$
r_{\mathrm{IB}}(s, a)
= Q(s, a) - \beta \sum_{s'} F_a(s' \mid s)\, V(s').
$$

If $Q$ solves the soft Bellman equation for some reward, $r_{\mathrm{IB}}$
recovers that reward exactly. In practice $r_{\mathrm{IB}}$ is a diagnostic:
$Q$ is fitted only on expert support, so the implied reward can be unreliable
off support.
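
The recovery property can be checked numerically. The sketch below builds a toy model with made-up numbers, solves for $Q$ by soft value iteration from a known reward, and then applies the inverse Bellman operator. The function names and the index order `F[a, s, s']` are assumptions for illustration.

```python
import numpy as np

def soft_value(Q, sigma):
    """V(s) = sigma * logsumexp(Q(s, :) / sigma), computed stably."""
    z = Q / sigma
    z_max = z.max(axis=1, keepdims=True)
    return sigma * (z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1)))

def inverse_bellman_reward(Q, F, beta, sigma):
    """r_IB(s, a) = Q(s, a) - beta * sum_{s'} F[a, s, s'] * V(s')."""
    V = soft_value(Q, sigma)
    expected_V = np.einsum("ast,t->sa", F, V)   # E[V(s') | s, a], shape (S, A)
    return Q - beta * expected_V

# Toy model; every number here is invented for the example.
rng = np.random.default_rng(0)
S, A, beta, sigma = 4, 2, 0.9, 1.0
F = rng.random((A, S, S))
F /= F.sum(axis=2, keepdims=True)               # normalize each F[a, s, :]
r_true = rng.normal(size=(S, A))

# Soft value iteration: Q <- r + beta * E[V(s')], a contraction with modulus beta.
Q = np.zeros((S, A))
for _ in range(500):
    Q = r_true + beta * np.einsum("ast,t->sa", F, soft_value(Q, sigma))

r_ib = inverse_bellman_reward(Q, F, beta, sigma)
max_abs_error = np.abs(r_ib - r_true).max()     # close to zero at the fixed point
```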

## Chi-Squared Objective

IQ-Learn minimizes over $Q$:

$$
\mathcal{L}(Q)
= -\mathbb{E}_\rho\bigl[Q(s, a) - V(s)\bigr]
+ \frac{1}{4\alpha}\,\mathbb{E}_\rho\bigl[r_{\mathrm{IB}}(s, a)^2\bigr],
$$

where $\mathbb{E}_\rho$ averages over expert $(s, a)$ pairs and $\alpha > 0$
sets the penalty strength (a smaller $\alpha$ penalizes more heavily). Since
$Q(s, a) - V(s) = \sigma \log \pi(a \mid s)$, the first term is the behavioral
cloning negative log-likelihood scaled by $\sigma$. The second term penalizes
large implied rewards on expert support, which bounds the objective and
prevents $Q$ from drifting to numerical overflow in the free tabular
parameterization.
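
A minimal NumPy sketch of the two terms, evaluated on a toy problem with invented numbers. The names `chi2_terms`, `s_idx`, and `a_idx` are hypothetical. The last two lines illustrate why the penalty matters: adding a constant to every entry of $Q$ leaves the behavioral cloning term unchanged, so only the penalty reacts to the level of $Q$.

```python
import numpy as np

def soft_value(Q, sigma):
    """V(s) = sigma * logsumexp(Q(s, :) / sigma), computed stably."""
    z = Q / sigma
    z_max = z.max(axis=1, keepdims=True)
    return sigma * (z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1)))

def chi2_terms(Q, F, s_idx, a_idx, beta, sigma, alpha):
    """Return (behavioral cloning term, quadratic penalty) over expert pairs."""
    V = soft_value(Q, sigma)
    r_ib = Q - beta * np.einsum("ast,t->sa", F, V)      # inverse Bellman reward
    bc = -np.mean(Q[s_idx, a_idx] - V[s_idx])            # -E_rho[Q - V]
    penalty = np.mean(r_ib[s_idx, a_idx] ** 2) / (4.0 * alpha)
    return bc, penalty

# Toy inputs, made up for illustration.
S, A, beta, sigma, alpha = 3, 2, 0.9, 1.0, 0.5
F = np.full((A, S, S), 1.0 / S)      # uniform transitions, shape (A, S, S)
s_idx = np.array([0, 0, 1, 2])       # expert states
a_idx = np.array([0, 1, 0, 0])       # expert actions

Q0 = np.zeros((S, A))
bc0, pen0 = chi2_terms(Q0, F, s_idx, a_idx, beta, sigma, alpha)
loss0 = bc0 + pen0

# Shifting Q by a constant leaves the first term unchanged but moves the penalty.
bc1, pen1 = chi2_terms(Q0 + 5.0, F, s_idx, a_idx, beta, sigma, alpha)
```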

The simple (TV-distance) objective replaces the quadratic penalty with
$-\mathbb{E}_\rho[r_{\mathrm{IB}}(s, a)] + (1 - \beta)\,\mathbb{E}_{s_0}[V(s_0)]$,
which is unbounded on a free tabular $Q$ and should not be used there.

## Q Parameterizations

Three parameterizations are supported:

- **Tabular**: $Q$ is a free $(S \times A)$ matrix. No structure is imposed;
  the recovered reward is per-cell and does not propagate to unvisited
  state-action pairs. Optimized by L-BFGS-B.
- **Linear**: $Q(s, a) = \varphi(s, a)^\top \theta$ where $\varphi$ is the
  feature matrix from the utility specification. This constrains $Q$ to the
  structural feature space and propagates through features to unvisited pairs.
  Optimized by L-BFGS-B.
- **Neural**: $Q(s, \cdot) = f_\psi(s)$ for a small feedforward network.
  Optimized by Adam.
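
The difference between the tabular and linear cases can be sketched with array shapes alone. The feature tensor layout `(S, A, K)` and all values below are assumptions for illustration. The example pretends that the data touch a single state-action pair and shows that the resulting gradient step in $\theta$ still moves $Q$ at every pair.

```python
import numpy as np

S, A, K = 5, 2, 3
rng = np.random.default_rng(1)
phi = rng.normal(size=(S, A, K))      # hypothetical features phi(s, a), shape (S, A, K)

# Tabular: one free parameter per cell.
Q_tabular = np.zeros((S, A))
n_tabular_params = Q_tabular.size     # S * A

# Linear: K parameters shared across all cells.
theta = np.array([0.5, -1.0, 2.0])
Q_linear = phi @ theta                # shape (S, A)

# Chain rule for the linear case: dL/dtheta_k = sum_{s,a} dL/dQ(s, a) * phi(s, a, k).
grad_Q = np.zeros((S, A))
grad_Q[0, 1] = 1.0                    # pretend only the pair (0, 1) appears in the data
grad_theta = np.einsum("sa,sak->k", grad_Q, phi)

# A step along grad_theta changes Q at every pair, including unobserved ones.
delta_Q = phi @ grad_theta
```

In the tabular case the same gradient would update only the cell `(0, 1)`, which is why the tabular reward does not propagate to unvisited pairs.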

## Pseudocode

```
compute expert (s, a) pairs from the panel
compute expert_state_coverage and expert_state_action_coverage
initialize Q (zeros for tabular/linear, random for neural)
minimize L(Q) over the chosen parameterization
extract policy: pi(a|s) = softmax(Q(s,:) / sigma)
extract value: V(s) = sigma * logsumexp(Q(s,:) / sigma)
extract implied reward: r_IB(s,a) = Q(s,a) - beta * sum_{s'} F_a(s'|s) V(s')
report policy, value, r_IB, and coverage fractions
```
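
The optimization and extraction steps of the pseudocode can be sketched for the tabular case as below. This is an illustrative sketch, not the code in `econirl.estimation.iq_learn`: the toy data are invented, the function names are hypothetical, and SciPy's L-BFGS-B is left to approximate gradients by finite differences, which is only reasonable for a problem this small.

```python
import numpy as np
from scipy.optimize import minimize

def soft_value(Q, sigma):
    """V(s) = sigma * logsumexp(Q(s, :) / sigma), computed stably."""
    z = Q / sigma
    z_max = z.max(axis=1, keepdims=True)
    return sigma * (z_max[:, 0] + np.log(np.exp(z - z_max).sum(axis=1)))

def chi2_loss(q_flat, F, s_idx, a_idx, beta, sigma, alpha):
    """Chi-squared IQ-Learn loss for a flattened tabular Q."""
    n_actions, n_states, _ = F.shape
    Q = q_flat.reshape(n_states, n_actions)
    V = soft_value(Q, sigma)
    r_ib = Q - beta * np.einsum("ast,t->sa", F, V)
    bc = -np.mean(Q[s_idx, a_idx] - V[s_idx])
    return bc + np.mean(r_ib[s_idx, a_idx] ** 2) / (4.0 * alpha)

# Toy problem; all numbers are made up.
rng = np.random.default_rng(0)
S, A, beta, sigma, alpha = 3, 2, 0.9, 1.0, 1.0
F = rng.random((A, S, S))
F /= F.sum(axis=2, keepdims=True)
s_idx = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])   # expert states
a_idx = np.array([0, 0, 1, 0, 0, 1, 0, 0, 0, 1])   # expert actions

# Initialize Q at zeros and minimize L(Q) with L-BFGS-B.
args = (F, s_idx, a_idx, beta, sigma, alpha)
q_init = np.zeros(S * A)
loss_init = chi2_loss(q_init, *args)
res = minimize(chi2_loss, q_init, args=args, method="L-BFGS-B")

# Extract value, policy, and implied reward from the fitted Q.
Q_hat = res.x.reshape(S, A)
V_hat = soft_value(Q_hat, sigma)
pi_hat = np.exp((Q_hat - V_hat[:, None]) / sigma)
r_ib_hat = Q_hat - beta * np.einsum("ast,t->sa", F, V_hat)
```

For larger state spaces an analytic or automatically differentiated gradient would normally be passed through the `jac` argument of `minimize`.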

## Implementation Notes

The implementation lives in `econirl.estimation.iq_learn`. Coverage fractions
are computed before optimization and exposed through
`summary.metadata["expert_state_coverage"]` and
`summary.metadata["expert_state_action_coverage"]`. Standard errors are not
computed; the returned standard error array is filled with NaN. The raw
inverse Bellman implied reward and its projection onto the utility feature
basis are both stored in `summary.metadata` for diagnostic use.
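
The text does not spell out how the coverage fractions are defined. One plausible reading, sketched below, is the share of states and the share of state-action pairs that appear at least once in the expert data. Both this definition and the helper name `coverage_fractions` are assumptions; the package may compute them differently.

```python
import numpy as np

def coverage_fractions(s_idx, a_idx, n_states, n_actions):
    """Fraction of states and of (state, action) pairs visited by the expert."""
    visited = np.zeros((n_states, n_actions), dtype=bool)
    visited[s_idx, a_idx] = True                      # mark every observed pair
    state_coverage = visited.any(axis=1).mean()       # states with at least one visit
    state_action_coverage = visited.mean()            # visited cells over all cells
    return float(state_coverage), float(state_action_coverage)

# Made-up expert pairs on a 4-state, 2-action problem; state 2 is never visited.
s_idx = np.array([0, 0, 1, 3])
a_idx = np.array([0, 1, 0, 1])
state_cov, state_action_cov = coverage_fractions(s_idx, a_idx, n_states=4, n_actions=2)
```

Low values of either fraction signal that much of the fitted $Q$, and hence of $r_{\mathrm{IB}}$, rests on pairs the expert never visited.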