Under the Hood
Model
The data are state, action, next-state triples \((s, a, s')\) from a stationary infinite-horizon dynamic discrete choice model. A discount factor \(\beta\), transition kernels \(F_a(s' \mid s)\) stored as an array of shape \((A, S, S)\) (action, current state, next state), and i.i.d. logit taste shocks with scale \(\sigma\) are assumed known. Instead of parameterizing a reward function directly, IQ-Learn parameterizes the soft Q-function \(Q(s, a)\) and derives the policy and implied reward from it.
Soft Value and Policy
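Given \(Q\), the soft value function and the implied policy are:
\[
V(s) = \sigma \log \sum_{a} \exp\!\big(Q(s, a)/\sigma\big), \qquad \pi(a \mid s) = \frac{\exp\big(Q(s, a)/\sigma\big)}{\sum_{a'} \exp\big(Q(s, a')/\sigma\big)}.
\]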
These satisfy \(Q(s, a) - V(s) = \sigma \log \pi(a \mid s)\).
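As a minimal sketch, the log-sum-exp value and softmax policy can be computed with NumPy and SciPy, and the identity above checked numerically. The function name `soft_value_and_policy` and the random `Q` are illustrative assumptions, not part of econirl.

```python
import numpy as np
from scipy.special import logsumexp, softmax

def soft_value_and_policy(Q, sigma):
    """Return V of shape (S,) and pi of shape (S, A) for a Q of shape (S, A)."""
    V = sigma * logsumexp(Q / sigma, axis=1)   # V(s) = sigma * log sum_a exp(Q(s,a)/sigma)
    pi = softmax(Q / sigma, axis=1)            # pi(a|s) proportional to exp(Q(s,a)/sigma)
    return V, pi

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))   # illustrative values: 4 states, 3 actions
sigma = 0.5
V, pi = soft_value_and_policy(Q, sigma)

# Largest violation of Q(s,a) - V(s) = sigma * log pi(a|s); should be at rounding level.
identity_gap = np.max(np.abs((Q - V[:, None]) - sigma * np.log(pi)))
```

Using `logsumexp` rather than a hand-written `log(sum(exp(...)))` keeps the computation stable when entries of `Q / sigma` are large.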
Inverse Bellman Reward
The inverse Bellman operator maps \(Q\) to the reward it implies, namely the temporal difference of \(Q\) under the known transitions:
\[
r_{\mathrm{IB}}(s, a) = Q(s, a) - \beta \sum_{s'} F_a(s' \mid s)\, V(s').
\]
If \(Q\) is a true soft Bellman Q-function, \(r_{\mathrm{IB}}\) recovers the underlying reward. In practice it is a diagnostic: it is fitted on expert support and can be unreliable off support.
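The recovery property can be illustrated with a small round trip: solve the soft Bellman equation for a known reward by fixed-point iteration, then apply the inverse Bellman operator to the resulting \(Q\). This is a sketch under stated assumptions: the helper names, the random reward, and the random transition array `F` (indexed `[a, s, s']`) are all hypothetical.

```python
import numpy as np
from scipy.special import logsumexp

def inverse_bellman_reward(Q, F, beta, sigma):
    """Q: (S, A); F: (A, S, S) with F[a, s, t] = F_a(t | s). Returns r_IB, shape (S, A)."""
    V = sigma * logsumexp(Q / sigma, axis=1)   # soft value, shape (S,)
    EV = np.einsum("ast,t->sa", F, V)          # E[V(s') | s, a], shape (S, A)
    return Q - beta * EV

def solve_soft_q(r, F, beta, sigma, n_iter=500):
    """Iterate Q <- r + beta * E[V(s')] until (numerically) converged."""
    Q = np.zeros_like(r)
    for _ in range(n_iter):
        V = sigma * logsumexp(Q / sigma, axis=1)
        Q = r + beta * np.einsum("ast,t->sa", F, V)
    return Q

rng = np.random.default_rng(0)
S, A, beta, sigma = 5, 2, 0.9, 1.0
F = rng.dirichlet(np.ones(S), size=(A, S))   # each F[a, s, :] sums to one
r = rng.normal(size=(S, A))                  # illustrative "true" reward

Q_star = solve_soft_q(r, F, beta, sigma)
r_rec = inverse_bellman_reward(Q_star, F, beta, sigma)
max_err = np.max(np.abs(r_rec - r))          # close to zero when Q_star is a fixed point
```

The round trip is exact only because `Q_star` solves the soft Bellman equation everywhere; a \(Q\) fitted on expert data carries no such guarantee off support.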
Chi-Squared Objective
IQ-Learn minimizes over \(Q\):
where \(\mathbb{E}_\rho\) averages over expert \((s, a)\) pairs. Since \(Q(s, a) - V(s) = \sigma \log \pi(a \mid s)\), the first term is behavioral cloning scaled by \(\sigma\). The second term penalizes large implied rewards on expert support, which bounds the objective and prevents a free tabular \(Q\) from drifting toward numerical overflow.
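One form consistent with this description is sketched below. The penalty weight \(1/(4\alpha)\) with \(\alpha > 0\) is an assumption borrowed from the usual chi-squared regularizer; this section does not state the weight the implementation uses.
\[
L(Q) = -\,\mathbb{E}_\rho\big[Q(s, a) - V(s)\big] + \frac{1}{4\alpha}\,\mathbb{E}_\rho\big[r_{\mathrm{IB}}(s, a)^2\big].
\]
Under this form the first term equals \(-\sigma\,\mathbb{E}_\rho[\log \pi(a \mid s)]\), and the second grows quadratically in \(r_{\mathrm{IB}}\), which is what keeps the minimization from running off to infinity.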
The simple (TV-distance) objective replaces the quadratic penalty with \(-\mathbb{E}_\rho[r_{\mathrm{IB}}(s, a)] + (1 - \beta)\,\mathbb{E}_{s_0}[V(s_0)]\), which is unbounded on a free tabular \(Q\) and should not be used there.
Q Parameterizations
Three parameterizations are supported:
Tabular: \(Q\) is a free \((S \times A)\) matrix. No structure is imposed; the recovered reward is per-cell and does not propagate to unvisited state-action pairs. Optimized by L-BFGS-B.
Linear: \(Q(s, a) = \varphi(s, a)^\top \theta\) where \(\varphi\) is the feature matrix from the utility specification. This constrains \(Q\) to the structural feature space, so information from visited pairs propagates through shared features to unvisited pairs. Optimized by L-BFGS-B.
Neural: \(Q(s, \cdot) = f_\psi(s)\) for a small feedforward network. Optimized by Adam.
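The difference between the tabular and linear parameterizations comes down to how a flat parameter vector is mapped to an \((S \times A)\) array. The sketch below assumes a hypothetical feature array `phi` of shape `(S, A, K)`; the actual feature layout in econirl may differ, and the neural case is omitted.

```python
import numpy as np

S, A, K = 5, 2, 3
rng = np.random.default_rng(0)
phi = rng.normal(size=(S, A, K))   # hypothetical features phi(s, a) in R^K

def q_tabular(params):
    """Tabular: S * A free parameters, one per cell."""
    return params.reshape(S, A)

def q_linear(theta):
    """Linear: K parameters shared across all cells via the features."""
    return phi @ theta             # (S, A, K) @ (K,) -> (S, A)

Q_tab = q_tabular(np.zeros(S * A))
Q_lin = q_linear(np.ones(K))

# Changing one tabular parameter moves exactly one cell of Q ...
bumped = np.zeros(S * A)
bumped[0] = 1.0
cells_moved_tabular = int(np.count_nonzero(q_tabular(bumped) - Q_tab))

# ... while changing one linear coefficient moves every cell with a nonzero feature.
theta_bumped = np.ones(K)
theta_bumped[0] += 1.0
cells_moved_linear = int(np.count_nonzero(q_linear(theta_bumped) - Q_lin))
```

This is why a tabular fit says nothing about unvisited cells, while a linear fit extrapolates to them through `phi`.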
Pseudocode
compute expert (s, a) pairs from the panel
compute expert_state_coverage and expert_state_action_coverage
initialize Q (zeros for tabular/linear, random for neural)
minimize L(Q) over the chosen parameterization
extract policy: pi(a|s) = softmax(Q(s,:) / sigma)
extract value: V(s) = sigma * logsumexp(Q(s,:) / sigma)
extract implied reward: r_IB(s,a) = Q(s,a) - beta * sum_{s'} F_a(s'|s) V(s')
report policy, value, r_IB, and coverage fractions
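The steps above can be sketched end to end for the tabular case. This is not the econirl implementation: the synthetic expert data, the random transitions, and in particular the penalty weight `1 / (4 * alpha)` are assumptions made for illustration, and gradients are left to SciPy's finite differences.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp, softmax

S, A, beta, sigma = 4, 2, 0.9, 1.0
alpha = 0.5                                   # ASSUMED penalty weight 1 / (4 * alpha)
rng = np.random.default_rng(0)
F = rng.dirichlet(np.ones(S), size=(A, S))    # F[a, s, s'], rows sum to one

# Synthetic expert (s, a) pairs standing in for a real panel.
pi_true = softmax(rng.normal(size=(S, A)), axis=1)
s_obs = rng.integers(0, S, size=500)
a_obs = (rng.random(500) < pi_true[s_obs, 1]).astype(int)

def unpack(x):
    """Map a flat parameter vector to Q, V, and the implied reward r_IB."""
    Q = x.reshape(S, A)
    V = sigma * logsumexp(Q / sigma, axis=1)
    r_ib = Q - beta * np.einsum("ast,t->sa", F, V)
    return Q, V, r_ib

def loss(x):
    """Behavioral-cloning term plus quadratic penalty on r_IB over expert pairs."""
    Q, V, r_ib = unpack(x)
    bc = -(Q[s_obs, a_obs] - V[s_obs]).mean()
    penalty = (r_ib[s_obs, a_obs] ** 2).mean() / (4 * alpha)
    return bc + penalty

x0 = np.zeros(S * A)                          # zero initialization for tabular Q
res = minimize(loss, x0, method="L-BFGS-B")

Q_hat, V_hat, r_hat = unpack(res.x)
pi_hat = softmax(Q_hat / sigma, axis=1)       # extracted policy
```

On a problem this small the finite-difference gradient is adequate; a larger state space would call for an analytic or automatically differentiated gradient.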
Implementation Notes
The implementation lives in econirl.estimation.iq_learn. Coverage fractions
are computed before optimization and exposed through
summary.metadata["expert_state_coverage"] and
summary.metadata["expert_state_action_coverage"]. Standard errors are not
computed; the returned standard error array is NaN. The raw Bellman implied
reward and its projection into the utility feature basis are both stored in
summary.metadata for diagnostic use.
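For intuition about the coverage diagnostics, the sketch below computes two such fractions from arrays of observed states and actions. It assumes coverage means the share of states, and of state-action pairs, that appear at least once in the expert data; the function `coverage_fractions` is hypothetical and the definition used in econirl may differ.

```python
import numpy as np

def coverage_fractions(s_obs, a_obs, n_states, n_actions):
    """Share of states and of (state, action) cells visited at least once."""
    visited = np.zeros((n_states, n_actions), dtype=bool)
    visited[s_obs, a_obs] = True
    state_cov = visited.any(axis=1).mean()   # states with any observed action
    sa_cov = visited.mean()                  # observed (s, a) cells
    return float(state_cov), float(sa_cov)

# Toy data: state 2 is never visited, and only 4 of the 8 cells appear.
s_obs = np.array([0, 0, 1, 3, 3])
a_obs = np.array([0, 1, 0, 1, 1])
state_cov, sa_cov = coverage_fractions(s_obs, a_obs, n_states=4, n_actions=2)
```

Low values of either fraction are a signal that the implied reward is being read off cells the expert never visited.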