Context
Most IRL estimators frame reward recovery as a min-max problem: choose a reward function and a policy, then alternate between improving each. IQ-Learn collapses this adversarial game into a single concave optimization over a soft Q-function. The Q-function determines both the policy (through the softmax mapping) and a Bellman-implied reward (through the inverse Bellman operator), so there is no discriminator to train and no inner loop to stabilize.
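The two mappings described above can be sketched for a small tabular problem. This is a minimal illustration, not the package's code: the function names, the array layout (Q of shape (S, A), transitions P of shape (S, A, S)), and the unit entropy temperature are all assumptions made for the example.

```python
import numpy as np

def soft_value(Q):
    """Soft value V(s) = log sum_a exp(Q[s, a]), computed stably."""
    m = Q.max(axis=1)
    return m + np.log(np.exp(Q - m[:, None]).sum(axis=1))

def softmax_policy(Q):
    """Policy implied by Q: pi(a | s) = exp(Q[s, a] - V(s))."""
    return np.exp(Q - soft_value(Q)[:, None])

def implied_reward(Q, P, gamma):
    """Inverse soft Bellman operator: r(s, a) = Q(s, a) - gamma * E[V(s') | s, a]."""
    return Q - gamma * (P @ soft_value(Q))

# Hypothetical toy problem: 3 states, 2 actions, random Q and transitions.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
Q = rng.normal(size=(n_states, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # shape (S, A, S)

pi = softmax_policy(Q)           # each row is a probability vector over actions
r = implied_reward(Q, P, gamma)  # reward that makes Q soft-Bellman consistent
```

Both outputs are deterministic functions of Q, which is why a single optimization over Q can replace the separate reward and policy updates.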
Source Ideas
The construction comes from Garg et al. (2021). The key observation is that in the maximum-entropy framework the optimal policy is always the softmax of Q, so it suffices to optimize over Q alone. The chi-squared divergence objective adds a quadratic penalty on the temporal-difference term, which bounds the objective and keeps Q from growing until it overflows numerically. The simple (TV-distance) objective is also implemented, but it is unbounded above when Q is an unconstrained tabular array and should not be used in that setting.
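The effect of the quadratic penalty can be sketched as follows. This is one common way to write the offline objective, assuming sampled expert transitions (s, a, s'), a batch of initial states s0, and a penalty weight of 1 / (4 * alpha). The function names and signature are hypothetical, and the package's own implementation may differ.

```python
import numpy as np

def soft_value(Q):
    """Soft value V(s) = log sum_a exp(Q[s, a]), computed stably."""
    m = Q.max(axis=1)
    return m + np.log(np.exp(Q - m[:, None]).sum(axis=1))

def iq_objective(Q, s, a, s_next, s0, gamma, alpha=None):
    """Objective to maximize over Q (illustrative form, not the package API).

    alpha=None gives the simple objective; a positive alpha subtracts the
    chi-squared penalty mean(td**2) / (4 * alpha).
    """
    V = soft_value(Q)
    td = Q[s, a] - gamma * V[s_next]  # sampled Bellman-implied reward
    value = td.mean() - (1.0 - gamma) * V[s0].mean()
    if alpha is None:
        return value
    return value - (td ** 2).mean() / (4.0 * alpha)

# Hypothetical synthetic batch: 4 states, 2 actions, 50 transitions.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 2))
s = rng.integers(0, 4, size=50)
a = rng.integers(0, 2, size=50)
s_next = rng.integers(0, 4, size=50)
s0 = np.zeros(10, dtype=int)

simple = iq_objective(Q, s, a, s_next, s0, gamma=0.9)
chi2 = iq_objective(Q, s, a, s_next, s0, gamma=0.9, alpha=0.5)
```

With the penalty, each transition contributes td - td**2 / (4 * alpha), which is at most alpha (attained at td = 2 * alpha), so the expert term cannot grow without limit as Q grows.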
Where IQ-Learn Fits
IQ-Learn sits with AIRL, f-IRL, GLADIUS, and MCE-IRL in the IRL family. Like the others, it does not impose a structural Bellman fixed point as a hard constraint. Unlike AIRL and f-IRL, it avoids adversarial training; unlike MCE-IRL and GLADIUS, it starts from a parameterized Q directly rather than from a parameterized reward function.
Garg et al. (2021) show that IQ-Learn generalizes behavioral cloning (BC): setting the divergence penalty weight to zero recovers a pure log-likelihood maximizer over the Q-induced logit choice probabilities. This connection makes IQ-Learn a natural bridge between BC and Bellman-aware IRL, and the alpha hyperparameter controls where on that path the estimator sits.
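The behavioral-cloning end of that path can be sketched as the mean log-likelihood of the observed actions under the Q-induced logit probabilities. The function name and the tabular layout are assumptions made for the illustration, not the package's API.

```python
import numpy as np

def bc_log_likelihood(Q, s, a):
    """Mean log pi(a | s) under the softmax (logit) policy induced by Q."""
    m = Q.max(axis=1, keepdims=True)
    log_pi = Q - m - np.log(np.exp(Q - m).sum(axis=1, keepdims=True))
    return log_pi[s, a].mean()

# Hypothetical demonstrations: observed states and chosen actions.
s = np.array([0, 1, 2, 0])
a = np.array([1, 0, 1, 1])

# A flat Q gives a uniform policy over the two actions.
Q_flat = np.zeros((3, 2))
ll_flat = bc_log_likelihood(Q_flat, s, a)

# Raising Q on the observed actions raises the likelihood.
Q_fit = Q_flat.copy()
Q_fit[s, a] += 2.0
ll_fit = bc_log_likelihood(Q_fit, s, a)
```

This likelihood depends only on differences in Q across actions within a state, so on its own it carries no Bellman structure; the transition-dependent terms of the IQ-Learn objective are what tie Q across states.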
Package Position
IQ-Learn is the right tool when you want an imitation policy without adversarial training and a Q-based reward diagnostic alongside it. It is not the right tool when structural counterfactual validity is required; the current evidence shows that reward, value, and Q recovery do not pass their checks on any tested cell. For structural use, see NFXP, CCP, or UFXP.