MCE-IRL

Maximum causal entropy inverse reinforcement learning recovers reward parameters from demonstrated state-action trajectories by matching discounted feature expectations under a soft-optimal causal policy. For each candidate reward, the estimator solves a forward soft dynamic program, computes the implied feature moments, and updates the parameters until the model moments equal the expert moments. Counterfactuals are meaningful only through the fitted MDP primitives.

Read this page when demonstrations, not a structural likelihood, define the problem. The estimated object is a reward inside the supplied feature basis and normalization.

Source Papers

The estimator follows Ziebart et al. (2008), which introduces maximum-entropy inverse reinforcement learning through feature-count matching, and Ziebart (2010), which formulates the maximum causal entropy objective. The causal entropy conditions each action only on the information available when the choice is made (the current state and continuation values), not on the states the action later reaches. Trajectory maximum-entropy IRL instead scores the whole trajectory, which ties the entropy to the transition dynamics and biases the recovered policy toward actions with uncertain outcomes. The two formulations coincide under deterministic dynamics and diverge under stochastic ones. The causal form matches the logit dynamic-discrete-choice structure, which is why MCE-IRL is the reference entropy IRL route for these comparisons.

Theory Connections

For the proof route behind this page, start with Soft Bellman and DDC-MaxEnt Equivalence for the maximum-causal-entropy and logit-DDC equivalence, Identification and Anchors for reward normalization, and IRL Identification Boundaries for what feature matching can and cannot identify. Use Reward Projection and Feature Rank for the feature-rank condition behind finite reward parameters.

Notation

Throughout, \(s\) indexes the discrete state and \(a\) the discrete action, observed for individual \(i\) in period \(t\). The vector \(\phi(s, a)\) collects the action-dependent reward features and \(\theta\) the reward parameters to be estimated. The subscript \(k\) indexes the \(k\)-th component of \(\phi\), so \(\phi_k(s, a)\) denotes the \(k\)-th feature value at \((s, a)\). The discount factor is \(\beta\) and the logit shock scale is \(\sigma\). Setting \(\sigma = 1\) recovers the unit-scale convention used in the internal derivation and in Ziebart (2010); this page keeps \(\sigma\) explicit for the DDC comparison. The transition kernel \(P_a(s, s')\) gives the probability of moving to \(s'\) from \(s\) under action \(a\), stored in \((A, S, S)\) orientation. The initial state distribution is \(\rho_0(s)\). The soft value function is \(V_\theta(s)\), the choice-specific value is \(Q_\theta(s, a)\), and the causal policy is \(\pi_\theta(a \mid s)\). The empirical discounted expert occupancy is \(D_E(s, a)\) and the model occupancy is \(D_\theta(s, a)\). The expert and model feature moments are \(\mu_E\) and \(\mu_\theta\).

Model

The observed data are state, action, and next-state trajectories \((s_{it}, a_{it}, s_{i,t+1})\). The reward is linear in the action-dependent features:

\[ r_\theta(s, a) = \phi(s, a)^\top \theta. \]

The choice-specific value satisfies:

\[ Q_\theta(s, a) = r_\theta(s, a) + \beta \sum_{s'} P_a(s, s') V_\theta(s'). \]

The soft value function solves:

\[ V_\theta(s) = \sigma \log \sum_a \exp\!\left(\frac{Q_\theta(s, a)}{\sigma}\right). \]

The causal policy is the softmax of the choice-specific values:

\[ \pi_\theta(a \mid s) = \frac{\exp(Q_\theta(s, a) / \sigma)} {\sum_b \exp(Q_\theta(s, b) / \sigma)}. \]

Action probabilities at time \(t\) depend on the current state and continuation values, not on future realized states. This causal structure connects MCE-IRL to logit dynamic discrete choice: both use the same soft choice form, but MCE-IRL estimates the reward through feature moments rather than through a conditional likelihood alone.
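The recursion above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the package's solver: the function name `soft_value_iteration`, the random toy problem, and the stopping tolerance are assumptions. Only the formulas and the \((A, S, S)\) transition orientation come from this page.

```python
import numpy as np
from scipy.special import logsumexp, softmax


def soft_value_iteration(reward, transitions, beta, sigma=1.0, tol=1e-10, max_iter=10_000):
    """Iterate V <- sigma * logsumexp(Q / sigma) toward the soft Bellman fixed point.

    reward:      (S, A) array holding r_theta(s, a)
    transitions: (A, S, S) array with transitions[a, s, s2] = P_a(s, s2)
    """
    V = np.zeros(reward.shape[0])
    for _ in range(max_iter):
        # Q(s, a) = r(s, a) + beta * sum_{s2} P_a(s, s2) V(s2)
        Q = reward + beta * np.einsum("ast,t->sa", transitions, V)
        V_new = sigma * logsumexp(Q / sigma, axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = reward + beta * np.einsum("ast,t->sa", transitions, V)
    policy = softmax(Q / sigma, axis=1)  # pi(a | s), rows sum to one
    return V, Q, policy


# Hypothetical toy problem: 4 states, 2 actions, 2 features.
rng = np.random.default_rng(0)
n_states, n_actions, n_features = 4, 2, 2
features = rng.normal(size=(n_states, n_actions, n_features))
transitions = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
theta = np.array([1.0, -0.5])

V, Q, policy = soft_value_iteration(features @ theta, transitions, beta=0.9, sigma=1.0)
```

Because the soft Bellman operator is a contraction with modulus \(\beta\), the loop converges from any start, at a rate that slows as \(\beta\) approaches one.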

Identification

This section states when matching feature moments is enough to recover the intended reward representation, rather than only reproducing behavior.

MCE-IRL identifies a reward representation under the following assumptions.

Known transitions. The transition kernel \(P_a(s, s')\) is supplied or estimated outside the estimator. It does not depend on the reward parameters.
Causal behavioral model. The agent’s policy is the soft-optimal policy of the maximum causal entropy objective. The action distribution at each state follows the softmax of the choice-specific values from the soft Bellman recursion.
Additive linear reward. The reward is linear in the supplied feature matrix: \(r_\theta(s, a) = \phi(s, a)^\top \theta\). Structural counterfactuals require this parametric form.
Reward normalization. The reward is identified only up to transformations that leave behavior unchanged, including additive constants and reward shaping. A normalization anchor must be applied consistently when comparing estimated and reference rewards.
Action-dependent feature rank. The feature design must have full rank after applying the normalization. For multi-action reward recovery, features must vary across actions. State-only features broadcast across actions and leave action-specific payoff differences unidentified.
Sufficient action support. Each action must have enough observed support for the occupancy comparison. States with only one feasible action, or rare actions in the data, leave the corresponding reward directions weakly pinned.
Consistent encoding. Observations must be encoded in the same state-action indexing system as the transition tensor.

These assumptions are stated for a finite discrete state space with a stationary environment and a known discount factor \(\beta\). Under them, the feature-moment condition \(\mu_E = \mu_\theta\) uniquely determines \(\theta\) within the supplied feature basis and normalization. Identification weakens when the feature matrix is rank-deficient or state-only, when action support is thin, or when the normalization is invalid.
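The action-dependent rank condition can be illustrated with a small diagnostic. The sketch below is an assumption about what such a check might look like, not the package's own fit-time check: the helper `action_contrast_rank` is hypothetical, and it only inspects the contrasts \(\phi(s, a) - \phi(s, a_0)\) against a base action \(a_0\). It compares the action-dependent design from the Usage section with a state-only design.

```python
import numpy as np


def action_contrast_rank(features, base_action=0):
    """Rank of the stacked contrasts phi(s, a) - phi(s, base_action).

    features: (S, A, K) array. Hypothetical helper for illustration only.
    """
    contrasts = features - features[:, [base_action], :]
    contrasts = np.delete(contrasts, base_action, axis=1)  # drop the all-zero base rows
    design = contrasts.reshape(-1, features.shape[-1])
    return np.linalg.matrix_rank(design)


n_states, n_actions = 90, 2

# Action-dependent design, as in the Usage section of this page.
features = np.zeros((n_states, n_actions, 2))
features[:, 0, 0] = -np.arange(n_states) / 100.0
features[:, 1, 1] = -1.0
rank_action_dependent = action_contrast_rank(features)

# State-only feature broadcast across actions: every contrast is zero.
state_only = np.repeat(np.arange(n_states, dtype=float)[:, None, None], n_actions, axis=1)
rank_state_only = action_contrast_rank(state_only)

print(rank_action_dependent, rank_state_only)
```

A contrast rank below the number of features signals that some direction of \(\theta\) does not change payoff differences across actions.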

Estimator

MCE-IRL matches discounted feature expectations. The empirical and model feature moments are:

\[ \mu_E = \sum_{s,a} D_E(s, a)\,\phi(s, a), \qquad \mu_\theta = \sum_{s,a} D_\theta(s, a)\,\phi(s, a). \]
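As a concrete reading of the expert side, the sketch below accumulates a discounted empirical occupancy from trajectories and contracts it with the features. The normalization is an assumption: each visit at period \(t\) is weighted by \(\beta^t\) and the result is averaged over individuals. The helper name `expert_feature_moments` and the tiny data set are hypothetical.

```python
import numpy as np


def expert_feature_moments(trajectories, features, beta):
    """Discounted empirical occupancy D_E and feature moments mu_E.

    trajectories: list of (states, actions) pairs, one per individual
    features:     (S, A, K) array with features[s, a, k] = phi_k(s, a)
    Assumes weight beta**t on period t and an average over individuals.
    """
    n_states, n_actions, _ = features.shape
    D_E = np.zeros((n_states, n_actions))
    for states, actions in trajectories:
        discounts = beta ** np.arange(len(states))
        # np.add.at accumulates correctly when a (state, action) pair repeats
        np.add.at(D_E, (np.asarray(states), np.asarray(actions)), discounts)
    D_E /= len(trajectories)
    mu_E = np.einsum("sa,sak->k", D_E, features)
    return D_E, mu_E


# Hypothetical data: 3 states, 2 actions, 2 features, two short trajectories.
features = np.zeros((3, 2, 2))
features[:, 0, 0] = [0.0, 1.0, 2.0]
features[:, 1, 1] = 1.0
trajectories = [([0, 1, 2], [0, 0, 1]), ([0, 0, 1], [0, 1, 0])]

D_E, mu_E = expert_feature_moments(trajectories, features, beta=0.5)
print(mu_E)
```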

The estimator solves the moment condition:

\[ \mu_E - \mu_\theta = 0. \]

Equivalently, this is the stationarity condition of the causal-entropy dual objective. The primal problem is

\[ \max_{\pi \text{ causal}} H_\text{causal}(\pi) \quad \text{s.t.} \quad \mathbb{E}_\pi[\phi(s,a)] = \mu_E, \]

where \(H_\text{causal}(\pi)\) is the causal entropy of the policy. Introducing \(\theta\) as the Lagrange multiplier on the feature-matching constraint gives the dual objective

\[ L(\theta) = \min_{\pi \text{ causal}} \Bigl[\theta \cdot (\mu_E - \mu_\pi) - H_\text{causal}(\pi)\Bigr], \]

whose gradient, by the envelope theorem, is \(\nabla_\theta L(\theta) = \mu_E - \mu_\theta\) (Ziebart, 2010, ch. 3). The inner minimizer is the soft-optimal causal policy, which is the softmax of \(Q_\theta / \sigma\) derived above. The same gradient can be obtained by differentiating \(\log \pi_\theta(a \mid s)\) directly. Since the policy is the softmax of \(Q_\theta / \sigma\), the score has the logit form

\[ \frac{\partial \log \pi_\theta(a \mid s)}{\partial \theta_k} = \frac{1}{\sigma}\left( \frac{\partial Q_\theta(s, a)}{\partial \theta_k} - \sum_b \pi_\theta(b \mid s)\,\frac{\partial Q_\theta(s, b)}{\partial \theta_k} \right), \]

where the \(Q\)-gradient carries a continuation term through the value function,

\[ \frac{\partial Q_\theta(s, a)}{\partial \theta_k} = \phi_k(s, a) + \beta \sum_{s'} P_a(s, s')\,\frac{\partial V_\theta(s')}{\partial \theta_k}. \]

Aggregated over the discounted state-action occupancy of the expert and the model, the continuation terms telescope and the gradient reduces to the feature-expectation difference (Ziebart, 2010, §3.4):

\[ \nabla_\theta L(\theta) = \mu_E - \mu_\theta. \]
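One way to see the telescoping is sketched here under two assumptions that the text does not state explicitly: the unit scale \(\sigma = 1\), and an expert occupancy that obeys the same flow condition as the model, \(D_E(s') = \rho_0(s') + \beta \sum_{s,a} D_E(s, a)\,P_a(s, s')\). Write \(dV_k(s) = \partial V_\theta(s) / \partial \theta_k = \sum_b \pi_\theta(b \mid s)\,\partial Q_\theta(s, b) / \partial \theta_k\) and \(D_E(s) = \sum_a D_E(s, a)\). Weighting the score by \(D_E\) gives

\[ \sum_{s,a} D_E(s, a)\left[\frac{\partial Q_\theta(s, a)}{\partial \theta_k} - dV_k(s)\right] = \mu_{E,k} + \beta \sum_{s'} \Bigl[\sum_{s,a} D_E(s, a)\,P_a(s, s')\Bigr] dV_k(s') - \sum_s D_E(s)\,dV_k(s) = \mu_{E,k} - \sum_s \rho_0(s)\,dV_k(s). \]

The remaining term is the model moment. Because \(dV_k\) solves \((I - \beta P_\pi)\,dV_k = \sum_a \pi_\theta(a \mid s)\,\phi_k(s, a)\) and the model state occupancy solves \(D_\theta = \rho_0 + \beta P_\pi^\top D_\theta\) (both given later in this section),

\[ \sum_s \rho_0(s)\,dV_k(s) = \sum_s D_\theta(s) \sum_a \pi_\theta(a \mid s)\,\phi_k(s, a) = \mu_{\theta,k}. \]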

In the default L-BFGS-B path, the estimator maximizes the conditional log likelihood of the demonstrations under \(\pi_\theta\), with the gradient computed by implicit differentiation through the soft Bellman fixed point. The score differentiates through the value function via:

\[ (I - \beta P_\pi)\frac{\partial V}{\partial \theta_k} = \sum_a \pi_\theta(a \mid s)\,\phi_k(s, a), \]

where \(P_\pi = \sum_a \operatorname{diag}(\pi_\theta(a \mid \cdot))\,P_a\) is the policy-weighted transition matrix, with entries \(P_\pi(s, s') = \sum_a \pi_\theta(a \mid s)\,P_a(s, s')\).

The model state occupancy \(D_\theta(s)\) required for \(\mu_\theta\) is computed by a forward pass (Ziebart, 2010, Algorithm 1):

\[ D_\theta(s) = \rho_0(s) + \beta \sum_{s', a} D_\theta(s')\,\pi_\theta(a \mid s')\,P_a(s', s), \]

or in matrix form \(D_\theta = \rho_0 + \beta P_\pi^\top D_\theta\), solved by fixed-point iteration. The state-action occupancy is then \(D_\theta(s, a) = D_\theta(s)\,\pi_\theta(a \mid s)\), from which \(\mu_\theta = \sum_{s,a} D_\theta(s, a)\,\phi(s, a)\).
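The forward pass can be sketched as a plain fixed-point iteration on \(D_\theta = \rho_0 + \beta P_\pi^\top D_\theta\). The function name `state_occupancy`, the random policy, and the tolerance are assumptions for illustration. The policy here is an arbitrary stochastic matrix rather than a soft-optimal one, since the recursion holds for any policy.

```python
import numpy as np


def state_occupancy(policy, transitions, rho0, beta, tol=1e-12, max_iter=10_000):
    """Iterate D <- rho0 + beta * P_pi^T D until the update is below tol.

    policy:      (S, A) array with policy[s, a] = pi(a | s)
    transitions: (A, S, S) array with transitions[a, s, s2] = P_a(s, s2)
    """
    # P_pi[s, s2] = sum_a pi(a | s) P_a(s, s2)
    P_pi = np.einsum("sa,ast->st", policy, transitions)
    D = rho0.copy()
    for _ in range(max_iter):
        D_new = rho0 + beta * P_pi.T @ D
        converged = np.max(np.abs(D_new - D)) < tol
        D = D_new
        if converged:
            break
    return D


# Hypothetical toy inputs.
rng = np.random.default_rng(1)
n_states, n_actions, n_features = 5, 3, 2
transitions = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
policy = rng.dirichlet(np.ones(n_actions), size=n_states)
features = rng.normal(size=(n_states, n_actions, n_features))
rho0 = np.full(n_states, 1.0 / n_states)
beta = 0.9

D_state = state_occupancy(policy, transitions, rho0, beta)
D_sa = D_state[:, None] * policy                   # D(s, a) = D(s) pi(a | s)
mu_theta = np.einsum("sa,sak->k", D_sa, features)  # model feature moments
```

With this normalization the occupancy sums to \(1 / (1 - \beta)\), so expert and model moments are comparable only if the expert occupancy is scaled the same way.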

The final gradient of the log-likelihood with respect to \(\theta_k\) is

\[ \frac{\partial \mathcal{L}}{\partial \theta_k} = \frac{1}{\sigma} \sum_t \Bigl[ dQ_k(s_t, a_t) - \sum_a \pi_\theta(a \mid s_t)\,dQ_k(s_t, a) \Bigr], \]

where \(dQ_k(s, a) = \phi_k(s, a) + \beta (P_a\, dV_k)(s)\) and \(dV_k\) solves the implicit-differentiation system above. This step connects the implicit differentiation to the gradient used in L-BFGS-B. The resulting gradient has the same logit form as the structural conditional likelihood score.
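The gradient formula can be checked numerically. The sketch below computes the log likelihood and its implicit-differentiation gradient on a hypothetical toy problem and compares the result with central finite differences. All names (`solve_policy`, `loglik_and_grad`, `finite_difference_grad`) and the random data are assumptions, not the package's internals.

```python
import numpy as np
from scipy.special import logsumexp, softmax


def solve_policy(theta, features, transitions, beta, sigma, tol=1e-13, max_iter=5_000):
    """Soft value iteration, returning pi_theta as an (S, A) array."""
    reward = features @ theta
    V = np.zeros(reward.shape[0])
    for _ in range(max_iter):
        Q = reward + beta * np.einsum("ast,t->sa", transitions, V)
        V_new = sigma * logsumexp(Q / sigma, axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = reward + beta * np.einsum("ast,t->sa", transitions, V)
    return softmax(Q / sigma, axis=1)


def loglik_and_grad(theta, states, actions, features, transitions, beta, sigma):
    pi = solve_policy(theta, features, transitions, beta, sigma)
    n_states = features.shape[0]
    P_pi = np.einsum("sa,ast->st", pi, transitions)
    # (I - beta P_pi) dV_k = sum_a pi(a | s) phi_k(s, a), solved for all k at once
    rhs = np.einsum("sa,sak->sk", pi, features)
    dV = np.linalg.solve(np.eye(n_states) - beta * P_pi, rhs)         # (S, K)
    # dQ_k(s, a) = phi_k(s, a) + beta * (P_a dV_k)(s)
    dQ = features + beta * np.einsum("ast,tk->sak", transitions, dV)  # (S, A, K)
    # Logit score: dQ at each action minus its policy-weighted average
    score = (dQ - np.einsum("sa,sak->sk", pi, dQ)[:, None, :]) / sigma
    loglik = np.log(pi[states, actions]).sum()
    grad = score[states, actions].sum(axis=0)
    return loglik, grad


def finite_difference_grad(f, theta, eps=1e-5):
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        grad[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return grad


# Hypothetical toy problem and arbitrary observations.
rng = np.random.default_rng(3)
n_states, n_actions, n_features = 5, 3, 2
features = rng.normal(size=(n_states, n_actions, n_features))
transitions = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
states = rng.integers(0, n_states, size=200)
actions = rng.integers(0, n_actions, size=200)
theta = np.array([0.8, -0.3])
beta, sigma = 0.9, 0.7

loglik, grad = loglik_and_grad(theta, states, actions, features, transitions, beta, sigma)
fd_grad = finite_difference_grad(
    lambda th: loglik_and_grad(th, states, actions, features, transitions, beta, sigma)[0],
    theta,
)
```

The two gradients should agree up to finite-difference error. A mismatch usually points to a wrong transition orientation or a missing \(1/\sigma\) factor.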

Algorithm

Algorithm  MCE-IRL (default: L-BFGS-B outer, hybrid inner solver)
Input   panel {(s_it, a_it, s_{i,t+1})}, features phi, transitions P,
        discount beta, logit scale sigma
Output  theta_hat, policy pi, value V

 compute expert feature moments mu_E from the demonstration occupancy
 compute initial state distribution rho_0 from the data
 initialize theta
 repeat                                         # outer loop: L-BFGS-B
     r_theta(s, a) := phi(s, a)' theta
     solve  V_theta = T_theta V_theta           # inner loop: hybrid soft Bellman
     Q_theta(s, a) := r_theta(s, a) + beta * sum_{s'} P_a(s, s') V_theta(s')
     pi_theta(a | s) := exp(Q_theta(s, a)/sigma) / sum_b exp(Q_theta(s, b)/sigma)
     L(theta) := sum_{i,t} log pi_theta(a_it | s_it)
     solve  (I - beta P_pi) dV/dtheta_k = sum_a pi_theta phi_k(s, a)  for each k
     compute grad L from dV/dtheta_k via the logit score
     update theta using L-BFGS-B
 until the gradient norm is below tolerance
 return theta_hat, pi_theta, V_theta

The inner soft Bellman solve, V_theta = T_theta V_theta, defaults to inner_solver="hybrid": value iteration (a contraction) while far from the fixed point, then Newton-Kantorovich steps near the solution. Two pure variants are also available. "value" (successive approximation) converges linearly and is robust from any start. "policy" (policy iteration with matrix-inversion evaluation) converges faster near the solution but requires a good starting point.
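The hybrid idea can be sketched as follows. This is a minimal illustration, not the package's inner solver: the function name, the switching threshold `switch_tol`, and the toy problem are assumptions. The Newton-Kantorovich step applies Newton's method to \(F(V) = V - T(V)\), whose derivative is \(I - \beta P_\pi\) at the policy implied by the current \(V\).

```python
import numpy as np
from scipy.special import logsumexp, softmax


def hybrid_soft_bellman(reward, transitions, beta, sigma=1.0,
                        switch_tol=1e-2, tol=1e-10, max_iter=1_000):
    """Value iteration while the residual is large, Newton-Kantorovich steps after.

    Returns the value function and the number of iterations used.
    """
    n_states = reward.shape[0]
    identity = np.eye(n_states)
    V = np.zeros(n_states)
    for iteration in range(max_iter):
        Q = reward + beta * np.einsum("ast,t->sa", transitions, V)
        TV = sigma * logsumexp(Q / sigma, axis=1)
        residual = np.max(np.abs(TV - V))
        if residual < tol:
            return TV, iteration
        if residual > switch_tol:
            V = TV  # contraction step: robust far from the fixed point
        else:
            pi = softmax(Q / sigma, axis=1)
            P_pi = np.einsum("sa,ast->st", pi, transitions)
            # Newton step on F(V) = V - T(V), with F'(V) = I - beta * P_pi
            V = V - np.linalg.solve(identity - beta * P_pi, V - TV)
    raise RuntimeError("soft Bellman solve did not converge")


# Hypothetical toy problem.
rng = np.random.default_rng(4)
n_states, n_actions = 6, 3
reward = rng.normal(size=(n_states, n_actions))
transitions = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))

V_hybrid, n_hybrid = hybrid_soft_bellman(reward, transitions, beta=0.95)
# switch_tol=0.0 disables the Newton branch, leaving pure value iteration.
V_value, n_value = hybrid_soft_bellman(reward, transitions, beta=0.95,
                                       switch_tol=0.0, max_iter=5_000)
print(n_hybrid, n_value)
```

Each Newton step costs one linear solve, so the iteration savings matter most when \(\beta\) is close to one and value iteration alone is slow.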

An alternative outer path, used in the package’s own simulation study, is optimizer="root": a direct root-finding solver (HYBR method) that solves \(\mu_E - \mu_\theta = 0\) without maximizing the log likelihood. A gradient-descent path (optimizer="gradient") is also available, using Adam or plain SGD as the outer update.
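The root-finding idea can be sketched with SciPy's `scipy.optimize.root` and `method="hybr"`. This is an illustration of the moment condition, not the package's optimizer="root" code path. The helper `model_moments` and the toy problem are assumptions, and the "expert" moments are generated from a known \(\theta\) so that the root is known in advance.

```python
import numpy as np
from scipy.optimize import root
from scipy.special import logsumexp, softmax


def model_moments(theta, features, transitions, rho0, beta, sigma=1.0):
    """Return mu_theta: soft Bellman solve, then discounted occupancy."""
    reward = features @ theta
    V = np.zeros(reward.shape[0])
    for _ in range(5_000):
        Q = reward + beta * np.einsum("ast,t->sa", transitions, V)
        V_new = sigma * logsumexp(Q / sigma, axis=1)
        converged = np.max(np.abs(V_new - V)) < 1e-13
        V = V_new
        if converged:
            break
    Q = reward + beta * np.einsum("ast,t->sa", transitions, V)
    pi = softmax(Q / sigma, axis=1)
    P_pi = np.einsum("sa,ast->st", pi, transitions)
    # Direct solve of D = rho0 + beta * P_pi^T D
    D = np.linalg.solve(np.eye(rho0.size) - beta * P_pi.T, rho0)
    return np.einsum("s,sa,sak->k", D, pi, features)


# Hypothetical toy problem with a known reward parameter.
rng = np.random.default_rng(2)
n_states, n_actions, n_features = 5, 2, 2
features = rng.normal(size=(n_states, n_actions, n_features))
transitions = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
rho0 = np.full(n_states, 1.0 / n_states)
beta = 0.9
theta_true = np.array([1.0, -0.5])
mu_E = model_moments(theta_true, features, transitions, rho0, beta)

# Solve mu_E - mu_theta = 0 for theta, starting from zero.
solution = root(
    lambda th: mu_E - model_moments(th, features, transitions, rho0, beta),
    x0=np.zeros(n_features),
    method="hybr",
)
theta_hat = solution.x
```

With sample moments in place of `mu_E`, the root is an estimate rather than the true parameter, and its quality depends on the identification conditions listed above.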

System View

MCE-IRL starts from demonstrations rather than a structural likelihood. It asks which reward makes a soft-optimal agent visit the same state-action features as the expert.

Expert demonstrations
Known transition model, reward features, discount factor
        |
        v
Compute expert feature moments from observed behavior
        |
        v
Try one candidate reward parameter theta
        |
        v
Solve the soft dynamic program under that reward
        |
        v
Compute the model's feature moments
        |
        v
Update theta until model moments match expert moments

The reward is identified only inside the supplied feature span and normalization. If the features omit the real action contrast, the estimator can fit behavior without recovering the intended reward.

Applicability

Applicable when	Prefer an alternative when
Demonstrations come from a discrete sequential decision problem.	Likelihood-based structural standard errors are required.
Transitions are known or can be supplied.	Transition estimation is the main modeling challenge.
Reward features are supplied and action-dependent.	Reward features are unknown or require a neural representation.
The behavioral model is maximum causal entropy.	The target is deterministic control without entropy regularization.
Reward, policy, value, and counterfactual recovery are the goals.	Only fitted conditional choice probabilities are required.

MCE-IRL is the reference entropy IRL estimator for tabular discrete choice. The structural estimators (NFXP, CCP, MPEC, NNES, TD-CCP) target the same reward through likelihood or estimating-equation paths and report standard errors for \(\theta\). Neural MCE-IRL keeps the causal-entropy objective but replaces the tabular feature basis with a neural reward map.

Usage

import numpy as np

from econirl import MCEIRL
from econirl.datasets import load_rust_bus

n_states = 90
n_actions = 2

# Feature tensor indexed as features[s, a, k] = phi_k(s, a).
features = np.zeros((n_states, n_actions, 2))
features[:, 0, 0] = -np.arange(n_states) / 100.0  # action 0 (keep): cost rising with mileage bin
features[:, 1, 1] = -1.0                          # action 1 (replace): constant replacement cost

df = load_rust_bus()

model = MCEIRL(
    n_states=n_states,
    n_actions=n_actions,
    discount=0.99,
    feature_matrix=features,
    feature_names=["keep_mileage_cost", "replace_cost"],
)
model.fit(df, state="mileage_bin", action="replaced", id="bus_id")

print(model.params_)        # estimated reward parameters
print(model.policy_.shape)  # shape of the fitted policy array

The fitted policy gives action probabilities by state:

print(model.predict_proba([0, 10, 50, 89]))

General MDP

The example above is the Rust bus problem, where transitions=None estimates the two-action keep/replace kernel from the data. For any other problem, supply the dynamics explicitly: pass a transition tensor of shape (n_actions, n_states, n_states) and the observed next-state column. If a two-dimensional matrix is passed instead, the non-keep actions are filled with the bus reset-to-state-0 kernel. A model with more than two actions and no explicit tensor is rejected.

from econirl import MCEIRL
from econirl.estimators import estimate_empirical_transitions

# transitions[a, s, s2] = P(s2 | s, a)
# features[s, a, k]      = phi_k(s, a)
model = MCEIRL(
    n_states=n_states,
    n_actions=n_actions,
    discount=0.95,
    feature_matrix=features,
    feature_names=feature_names,
)
model.fit(
    df,
    state="state",
    action="action",
    id="id",
    next_state="next_state",
    transitions=transitions,
)

When the kernel is unknown, estimate it from the observed transitions in a Panel and pass the result:

transitions = estimate_empirical_transitions(panel, n_actions, n_states)
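For intuition about what an empirical kernel in \((A, S, S)\) orientation contains, here is a count-based sketch in plain NumPy. It is not the implementation of `estimate_empirical_transitions`. The helper `count_transitions` is hypothetical, and filling unvisited state-action rows with a uniform distribution is a placeholder assumption.

```python
import numpy as np


def count_transitions(states, actions, next_states, n_actions, n_states):
    """Row-normalized transition counts with P[a, s, s2] = P(s2 | s, a)."""
    counts = np.zeros((n_actions, n_states, n_states))
    np.add.at(counts, (actions, states, next_states), 1.0)
    row_totals = counts.sum(axis=2, keepdims=True)
    # Unvisited (a, s) rows have no data; use a uniform row as a placeholder.
    uniform = np.full_like(counts, 1.0 / n_states)
    return np.where(row_totals > 0, counts / np.maximum(row_totals, 1.0), uniform)


# Hypothetical observations: (state, action, next_state) triples.
states = np.array([0, 0, 1])
actions = np.array([0, 0, 1])
next_states = np.array([1, 0, 0])

P_hat = count_transitions(states, actions, next_states, n_actions=2, n_states=2)
print(P_hat[0, 0])  # row for action 0 in state 0
```

Rows built from few observations are noisy, which is one reason thin action support weakens identification.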

Reward parameters are identified only when the action-contrast features have full rank and each action has enough observed support. The estimator warns at fit time when the action-contrast design is rank deficient, which means action-specific payoffs are not identified even with correct transitions. See Pre-Estimation Checks.

Counterfactual analysis requires re-solving the dynamic program under changed primitives. The fitted primitives available for this are model.reward_matrix_, model.policy_, and model.value_function_. For controlled payoff, transition, or action-set interventions, use the simulation and evaluation utilities with an explicit problem and transition environment. The Counterfactuals page documents the three counterfactual families and the reported regret figures.
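Re-solving under changed primitives can be sketched as below. The arrays are random stand-ins: `reward` plays the role of a fitted reward matrix, and its \((S, A)\) shape is an assumption, as are the helper `soft_solve` and the three interventions, which only mimic the reward-shift, transition-change, and action-removal families named on this page. The package's simulation and evaluation utilities are not used here.

```python
import numpy as np
from scipy.special import logsumexp, softmax


def soft_solve(reward, transitions, beta, sigma=1.0, tol=1e-10, max_iter=10_000):
    """Soft value iteration returning (V, policy) for the given primitives."""
    V = np.zeros(reward.shape[0])
    for _ in range(max_iter):
        Q = reward + beta * np.einsum("ast,t->sa", transitions, V)
        V_new = sigma * logsumexp(Q / sigma, axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = reward + beta * np.einsum("ast,t->sa", transitions, V)
    return V, softmax(Q / sigma, axis=1)


# Stand-in primitives (hypothetical, not fitted values).
rng = np.random.default_rng(5)
n_states, n_actions = 5, 3
reward = rng.normal(size=(n_states, n_actions))
transitions = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
beta = 0.9

V_base, pi_base = soft_solve(reward, transitions, beta)

# Reward shift: lower the payoff of action 1 in every state.
reward_a = reward.copy()
reward_a[:, 1] -= 0.5
V_a, pi_a = soft_solve(reward_a, transitions, beta)

# Transition change: action 0 now leads to a uniform next-state draw.
transitions_b = transitions.copy()
transitions_b[0] = 1.0 / n_states
V_b, pi_b = soft_solve(reward, transitions_b, beta)

# Action removal: a reward of -inf gives action 2 zero probability.
reward_c = reward.copy()
reward_c[:, 2] = -np.inf
V_c, pi_c = soft_solve(reward_c, transitions, beta)
```

In each case the policy is recomputed from the new fixed point rather than reused, which is what separates a counterfactual from a prediction under the fitted policy.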

The Quick Start page documents the full set of fitted attributes and the full MCEIRLEstimator API.

Evidence

Behavioral recovery is measured on a synthetic benchmark with 25 states, 3 actions, and 8 action-dependent reward features. The reward, transitions, policy, value, Q functions, and counterfactual oracles are fully specified before any data are generated. The estimator sees only the 300,000 generated observations, the transition tensor, and the feature matrix. The root feature-matching path reaches a solution in 25 iterations.

Behavioral fit against the known oracle policy:

Metric	Value
Policy total variation	0.00698
Value RMSE	0.0319
Type A regret (reward shift)	0.000433
Type B regret (transition change)	0.000410
Type C regret (action removed)	9.44e-05

Counterfactual recovery under three perturbation families:

Counterfactual	Policy TV	Value RMSE	Regret
Type A (reward shift)	0.006456	0.000742	0.000433
Type B (transition change)	0.006284	0.000523	0.000410
Type C (action removed)	0.004211	0.000145	9.44e-05

These results are local to the known simulation environment. They depend on the same transition law, support, reward representation, and policy-response assumptions used in fitting. For the cross-estimator comparison on multiple dynamic choice problems, see the bus engine simulation study.

References

Source papers:

Ziebart, B. D., Maas, A. L., Bagnell, J. A., and Dey, A. K. (2008). Maximum Entropy Inverse Reinforcement Learning. Proceedings of the 23rd AAAI Conference on Artificial Intelligence, 1433-1438.
Ziebart, B. D. (2010). Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University.

Implementation and reproduction:

Estimator source: econirl.estimation.mce_irl.
sklearn wrapper: econirl.MCEIRL.
Validation runner: validation/estimators/mce_irl/run.py.
Results file: mce_irl.json.

Pages: