Under the Hood

Model

The data are state, action, next-state triples \((s, a, s')\) from a stationary infinite-horizon dynamic discrete choice model. A discount factor \(\beta\), transition kernels \(F_a(s' \mid s)\) with orientation \((A, S, S)\), and i.i.d. logit taste shocks with scale \(\sigma\) are assumed known. Instead of parameterizing a reward function directly, IQ-Learn parameterizes the soft Q-function \(Q(s, a)\) and derives the policy and implied reward from it.

Soft Value and Policy

Given \(Q\), the soft value function and the implied policy are:

\[ V(s) = \sigma \log \sum_a \exp\!\bigl(Q(s, a) / \sigma\bigr), \qquad \pi(a \mid s) = \exp\!\bigl((Q(s, a) - V(s)) / \sigma\bigr). \]

These satisfy \(Q(s, a) - V(s) = \sigma \log \pi(a \mid s)\).
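
A minimal NumPy sketch of these two formulas, assuming \(Q\) is stored as an array of shape (S, A). The function name soft_value_and_policy is illustrative and not part of the library; the row maximum is subtracted before exponentiating purely for numerical stability.

```python
import numpy as np

def soft_value_and_policy(Q, sigma=1.0):
    """Q: array of shape (S, A). Returns V with shape (S,) and pi with shape (S, A)."""
    z = Q / sigma
    m = z.max(axis=1, keepdims=True)            # stabilizer for the log-sum-exp
    V = sigma * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))
    pi = np.exp((Q - V[:, None]) / sigma)       # rows sum to one by construction
    return V, pi

# Illustrative values only (not from any dataset).
Q = np.array([[0.0, 1.0], [2.0, -1.0], [0.5, 0.5]])
V, pi = soft_value_and_policy(Q, sigma=0.5)
```

Because \(V\) is the log-normalizer of \(Q / \sigma\), each row of pi sums to one, and Q - V[:, None] equals sigma * np.log(pi) up to floating-point error.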

Inverse Bellman Reward

The temporal difference of \(Q\) under the transitions defines the reward implied by \(Q\) through the inverse Bellman operator:

\[ r_{\mathrm{IB}}(s, a) = Q(s, a) - \beta \sum_{s'} F_a(s' \mid s)\, V(s'). \]

If \(Q\) is a true soft Bellman Q-function, \(r_{\mathrm{IB}}\) recovers the underlying reward. In practice it is a diagnostic: it is fitted on expert support and can be unreliable off support.
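
The recovery claim can be checked numerically. The sketch below assumes \(F\) is an array of shape (A, S, S) with F[a, s, s'] = \(F_a(s' \mid s)\); the helper names, the random kernel, and the random reward are all made up for illustration. It solves the soft Bellman equation for a known reward by fixed-point iteration and then applies the inverse Bellman operator to the result.

```python
import numpy as np

def soft_value(Q, sigma=1.0):
    # V(s) = sigma * log sum_a exp(Q(s, a) / sigma), computed stably.
    z = Q / sigma
    m = z.max(axis=1, keepdims=True)
    return sigma * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))

def inverse_bellman_reward(Q, F, beta, sigma=1.0):
    """Q: (S, A); F: (A, S, S). Returns r_IB with shape (S, A)."""
    V = soft_value(Q, sigma)
    EV = np.einsum("ast,t->sa", F, V)   # EV[s, a] = sum_{s'} F_a(s'|s) V(s')
    return Q - beta * EV

# Synthetic problem (assumed values, for illustration only).
rng = np.random.default_rng(0)
S, A, beta, sigma = 5, 3, 0.9, 1.0
F = rng.random((A, S, S))
F /= F.sum(axis=2, keepdims=True)       # each F[a, s, :] is a distribution
r_true = rng.normal(size=(S, A))

# Soft Bellman fixed point: Q = r + beta * E[V(s') | s, a].
Q = np.zeros((S, A))
for _ in range(1000):
    Q = r_true + beta * np.einsum("ast,t->sa", F, soft_value(Q, sigma))

r_ib = inverse_bellman_reward(Q, F, beta, sigma)
```

Here r_ib matches r_true to numerical precision because \(Q\) was constructed as an exact soft Bellman solution on every cell. A \(Q\) fitted from data carries no such guarantee on unvisited cells.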

Chi-Squared Objective

IQ-Learn minimizes over \(Q\):

\[ \mathcal{L}(Q) = -\mathbb{E}_\rho\bigl[Q(s, a) - V(s)\bigr] + \frac{1}{4\alpha}\,\mathbb{E}_\rho\bigl[r_{\mathrm{IB}}(s, a)^2\bigr], \]

where \(\mathbb{E}_\rho\) averages over expert \((s, a)\) pairs and \(\alpha > 0\) sets the penalty strength (a smaller \(\alpha\) penalizes more heavily). Since \(Q(s, a) - V(s) = \sigma \log \pi(a \mid s)\), the first term is the behavioral-cloning negative log-likelihood scaled by \(\sigma\). The second term penalizes large implied rewards on expert support, which bounds the objective and keeps a free tabular \(Q\) from drifting to numerical overflow.
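
As a rough illustration, the objective can be evaluated on a panel of expert pairs as follows. This is a sketch under assumptions: the expert data are given as integer index arrays s_idx and a_idx, \(F\) has shape (A, S, S), and the function name chi2_objective is hypothetical rather than the library's.

```python
import numpy as np

def chi2_objective(Q, F, s_idx, a_idx, beta, sigma=1.0, alpha=1.0):
    """Chi-squared IQ-Learn loss on expert pairs (s_idx[i], a_idx[i])."""
    z = Q / sigma
    m = z.max(axis=1, keepdims=True)
    V = sigma * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))
    r_ib = Q - beta * np.einsum("ast,t->sa", F, V)
    bc_term = -np.mean(Q[s_idx, a_idx] - V[s_idx])              # -E[sigma * log pi]
    penalty = np.mean(r_ib[s_idx, a_idx] ** 2) / (4.0 * alpha)  # E[r_IB^2] / (4 alpha)
    return bc_term + penalty

# Toy inputs (assumed): uniform transitions and four expert observations.
S, A, beta, sigma, alpha = 4, 2, 0.9, 1.0, 1.0
F = np.full((A, S, S), 1.0 / S)
s_idx = np.array([0, 1, 1, 3])
a_idx = np.array([1, 0, 1, 1])
loss_at_zero = chi2_objective(np.zeros((S, A)), F, s_idx, a_idx, beta, sigma, alpha)
```

At \(Q = 0\) the policy is uniform, so the first term equals \(\sigma \log A\) and the penalty equals \((\beta \sigma \log A)^2 / (4\alpha)\), which makes a convenient sanity check for any implementation.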

The simple (TV-distance) objective replaces the quadratic penalty with \(-\mathbb{E}_\rho[r_{\mathrm{IB}}(s, a)] + (1 - \beta)\,\mathbb{E}_{s_0}[V(s_0)]\), which is unbounded on a free tabular \(Q\) and should not be used there.

Q Parameterizations

Three parameterizations are supported:

Tabular: \(Q\) is a free \((S \times A)\) matrix. No structure is imposed; the recovered reward is per-cell and does not propagate to unvisited state-action pairs. Optimized by L-BFGS-B.
Linear: \(Q(s, a) = \varphi(s, a)^\top \theta\) where \(\varphi\) is the feature matrix from the utility specification. This constrains \(Q\) to the structural feature space, so what is learned on visited pairs carries over to unvisited pairs through the shared features. Optimized by L-BFGS-B.
Neural: \(Q(s, \cdot) = f_\psi(s)\) for a small feedforward network. Optimized by Adam.
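
To make the difference between the tabular and linear cases concrete, here is a small sketch of the linear map. The feature array layout (S, A, K) and the name linear_q are assumptions for illustration; the library's own feature container may be organized differently.

```python
import numpy as np

def linear_q(features, theta):
    """features: (S, A, K) array holding phi(s, a); theta: (K,). Returns Q of shape (S, A)."""
    return features @ theta

# Made-up features and coefficients.
S, A, K = 4, 2, 3
rng = np.random.default_rng(0)
features = rng.normal(size=(S, A, K))
theta = np.array([1.0, -0.5, 2.0])
Q = linear_q(features, theta)
```

The tabular case has S * A free parameters (8 here), while the linear case has only K (3 here). Every cell of Q, visited or not, is determined by theta, which is how the linear form extends to unvisited pairs.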

Pseudocode

compute expert (s, a) pairs from the panel
compute expert_state_coverage and expert_state_action_coverage
initialize Q (zeros for tabular/linear, random for neural)
minimize L(Q) over the chosen parameterization
extract policy: pi(a|s) = softmax(Q(s,:) / sigma)
extract value: V(s) = sigma * logsumexp(Q(s,:) / sigma)
extract implied reward: r_IB(s,a) = Q(s,a) - beta * sum_{s'} F_a(s'|s) V(s')
report policy, value, r_IB, and coverage fractions
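
The steps above can be sketched end to end for the tabular case. This is an illustrative reimplementation, not the code in the library: fit_tabular_iq is a hypothetical name, the expert panel is simulated, gradients are left to SciPy's finite differences, and the coverage fractions are assumed to mean the share of states (or state-action cells) that appear in the data.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

def fit_tabular_iq(s_idx, a_idx, F, beta, sigma=1.0, alpha=1.0):
    """Hypothetical helper: fit a free (S, A) table Q with the chi-squared loss."""
    A, S, _ = F.shape

    def unpack(x):
        Q = x.reshape(S, A)
        V = sigma * logsumexp(Q / sigma, axis=1)
        r_ib = Q - beta * np.einsum("ast,t->sa", F, V)
        return Q, V, r_ib

    def loss(x):
        Q, V, r_ib = unpack(x)
        bc_term = -np.mean(Q[s_idx, a_idx] - V[s_idx])
        penalty = np.mean(r_ib[s_idx, a_idx] ** 2) / (4.0 * alpha)
        return bc_term + penalty

    # Assumed definition of coverage: share of states / cells seen in the data.
    state_cov = np.unique(s_idx).size / S
    sa_cov = np.unique(s_idx * A + a_idx).size / (S * A)

    # Zero initialization, as in the tabular branch of the pseudocode.
    res = minimize(loss, np.zeros(S * A), method="L-BFGS-B")
    Q, V, r_ib = unpack(res.x)
    policy = np.exp((Q - V[:, None]) / sigma)
    return {"Q": Q, "policy": policy, "value": V, "r_ib": r_ib, "loss": res.fun,
            "expert_state_coverage": state_cov,
            "expert_state_action_coverage": sa_cov}

# Simulated expert panel: action 1 is chosen about 80% of the time in every state.
rng = np.random.default_rng(0)
S, A, beta = 4, 2, 0.9
F = rng.random((A, S, S))
F /= F.sum(axis=2, keepdims=True)
s_idx = rng.integers(0, S, size=500)
a_idx = (rng.random(500) < 0.8).astype(int)
out = fit_tabular_iq(s_idx, a_idx, F, beta)
```

For a problem this small, finite-difference gradients are adequate. A larger state space would call for analytic or automatic gradients.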

Implementation Notes

The implementation lives in econirl.estimation.iq_learn. Coverage fractions are computed before optimization and exposed through summary.metadata["expert_state_coverage"] and summary.metadata["expert_state_action_coverage"]. Standard errors are not computed; the returned standard-error array is filled with NaN. The raw inverse Bellman implied reward and its projection onto the utility feature basis are both stored in summary.metadata for diagnostic use.
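
One natural reading of the projection step is an ordinary least-squares fit of \(r_{\mathrm{IB}}\) on the utility features. The sketch below shows that reading only. Whether the library weights cells by expert visitation or restricts the fit to expert support is not stated here, and the name project_reward and the (S, A, K) feature layout are assumptions.

```python
import numpy as np

def project_reward(r_ib, features):
    """r_ib: (S, A); features: (S, A, K). Returns theta (K,) and the fitted reward (S, A)."""
    S, A, K = features.shape
    X = features.reshape(S * A, K)      # one row per (s, a) cell
    y = r_ib.reshape(S * A)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta, (X @ theta).reshape(S, A)

# Synthetic check: a reward that lies exactly in the feature span is recovered.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 3, 2))
theta_true = np.array([0.7, -1.2])
r_ib = features @ theta_true
theta_hat, r_fit = project_reward(r_ib, features)
```

When the implied reward is not exactly linear in the features, the residual r_ib - r_fit indicates how far the fitted \(Q\) departs from the structural specification.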