econirl.datasets

Built-in datasets for econirl.

This module provides loaders for real-world datasets used in dynamic discrete choice and inverse reinforcement learning research. Synthetic data generation lives on the environment classes via env.generate_panel().

Real DDC Datasets:

- Rust (1987): Bus engine replacement
- SCANIA Component X (IDA 2024): Heavy truck component replacement
- Keane & Wolpin (1994): Career decisions
- Aguirregabiria (1999): Supermarket pricing/inventory
- ICU-Sepsis: Clinical treatment decisions (abstracted MDP)

Real IRL / Trajectory Datasets:

- T-Drive: Beijing taxi GPS trajectories
- GeoLife: Human mobility GPS trajectories
- Stanford Drone: Campus pedestrian/cyclist trajectories
- ETH/UCY: Pedestrian trajectory benchmark
- NGSIM: Highway lane-change vehicle trajectories
- Shanghai: Taxi route-choice on road network

Real Sequential Choice Datasets:

- Trivago (2019): Hotel search sessions
- Foursquare: Venue check-in sequences

Hybrid Datasets (real data with synthetic fallback):

- Citibike Route: Station-to-station destination choice
- Citibike Usage: Daily ride/no-ride member panel

econirl.datasets.load_rust_bus(group=None, as_panel=False, original=False)[source]

Load the Rust (1987) bus engine replacement dataset.

This dataset contains monthly observations of odometer readings and engine replacement decisions for buses from the Madison Metropolitan Bus Company. The data was used in Rust’s pioneering work on dynamic discrete choice models.

Parameters:

group (int | None) – If specified, load only buses from this group (1-8 for synthetic, 1-4 for original). Groups differ by bus type and usage patterns. If None, loads all groups.
as_panel (bool) – If True, return data structured as a Panel object compatible with econirl estimators. If False (default), return as a pandas DataFrame.
original (bool) – If True, load original Rust (1987) data from NFXP files. If False (default), load synthetic data with similar characteristics.

Returns:

DataFrame with columns:

bus_id: Unique bus identifier
period: Month number (1-indexed)
mileage: Odometer reading (in thousands of miles)
mileage_bin: Discretized mileage state (0-89)
replaced: 1 if engine was replaced this period, 0 otherwise
group: Bus group (1-4 for original, 1-8 for synthetic)

Return type:

DataFrame | Panel

Example

>>> from econirl.datasets import load_rust_bus
>>> df = load_rust_bus()
>>> print(f"Observations: {len(df):,}")
>>> print(f"Buses: {df['bus_id'].nunique()}")
>>> print(f"Replacement rate: {df['replaced'].mean():.2%}")

>>> # Load original data for replication
>>> df_original = load_rust_bus(original=True)
>>> print(f"Original data: {len(df_original):,} observations")

Notes

The original Rust (1987) paper used groups 1-4:

- Group 1: Grumman model 870 (15 buses)
- Group 2: Chance model RT50 (4 buses)
- Group 3: GMC model T8H203 (48 buses)
- Group 4: GMC model A5308, 1975 (37 buses)

For the synthetic data, groups 5-8 represent additional bus types with different characteristics.

Mileage bins follow Rust’s discretization: each bin represents 5,000 miles, so bin 0 = [0, 5000), bin 1 = [5000, 10000), etc.
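
The binning rule above can be sketched in a few lines. This is an illustration only, not the loader's code: the helper name mileage_to_bin is hypothetical, the input is taken in miles (the mileage column itself is in thousands of miles), and clamping readings past the last bin is an assumption.

```python
BIN_WIDTH_MILES = 5_000   # each bin spans 5,000 miles
N_BINS = 90               # mileage_bin takes values 0-89


def mileage_to_bin(mileage_miles: float) -> int:
    """Map an odometer reading in miles to a bin index in 0..89 (hypothetical helper)."""
    if mileage_miles < 0:
        raise ValueError("mileage must be non-negative")
    # Assumption: readings beyond the last bin are clamped to bin 89.
    return min(int(mileage_miles // BIN_WIDTH_MILES), N_BINS - 1)


print(mileage_to_bin(4_999))   # 0, since bin 0 = [0, 5000)
print(mileage_to_bin(5_000))   # 1, since bin 1 = [5000, 10000)
```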

econirl.datasets.load_occupational_choice(n_individuals=500, n_periods=40, as_panel=False, seed=1997)[source]

Load synthetic occupational choice data (Keane-Wolpin style).

This dataset represents individuals making career choices over their working lives. The state space combines education level, work experience, and age into approximately 100 discrete states. Individuals choose between continuing school, working in white-collar or blue-collar jobs, or staying home.

Parameters:

n_individuals (int) – Number of individuals to generate (default: 500)
n_periods (int) – Number of time periods per individual (default: 40)
as_panel (bool) – If True, return data structured as a Panel object compatible with econirl estimators. If False (default), return as a pandas DataFrame.
seed (int) – Random seed for reproducibility (default: 1997)

Returns:

DataFrame with columns:

id: Individual identifier
period: Time period (0-indexed)
state: Discretized state index (0-99)
action: Chosen action (0=school, 1=white_collar, 2=blue_collar, 3=home)
education: Education level (0-4, representing years/degree levels)
experience: Work experience (0-9, discretized)
age: Age group (0-1: young/old)

Return type:

DataFrame | Panel

Example

>>> from econirl.datasets import load_occupational_choice
>>> df = load_occupational_choice()
>>> print(f"Observations: {len(df):,}")
>>> print(f"Individuals: {df['id'].nunique()}")
>>> print(f"States: {df['state'].nunique()}")

>>> # Get as Panel for estimation
>>> panel = load_occupational_choice(as_panel=True)
>>> print(f"Panel with {panel.num_individuals} individuals")

Notes

State encoding: state = education * 20 + experience * 2 + age // 20

This gives approximately 5 * 10 * 2 = 100 discrete states.
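
A minimal sketch of this mixed-radix encoding follows. It assumes the last term is the binary age group (0 or 1) listed among the returned columns. The helper names encode_state and decode_state are hypothetical and are not part of econirl.

```python
def encode_state(education: int, experience: int, age_group: int) -> int:
    """Pack (education 0-4, experience 0-9, age_group 0-1) into one index in 0..99."""
    if not (0 <= education <= 4 and 0 <= experience <= 9 and age_group in (0, 1)):
        raise ValueError("component out of range")
    # Assumption: the age term of the formula is already the 0/1 age group.
    return education * 20 + experience * 2 + age_group


def decode_state(state: int) -> tuple:
    """Invert encode_state."""
    education, rest = divmod(state, 20)
    experience, age_group = divmod(rest, 2)
    return education, experience, age_group


print(encode_state(4, 9, 1))   # 99, the largest state index
print(decode_state(47))        # (2, 3, 1)
```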

Action interpretation:

- 0 (school): Continue education, increases education level
- 1 (white_collar): Work in white-collar job, increases experience
- 2 (blue_collar): Work in blue-collar job, increases experience
- 3 (home): Stay home (unemployment, family care, etc.)

econirl.datasets.load_robinson_crusoe(n_individuals=200, n_periods=50, n_inventory_bins=20, include_hunt=False, as_panel=False, seed=1719)[source]

Load or generate Robinson Crusoe production/leisure data.

Model structure:

State: inventory (0 to n_inventory_bins-1)
Actions: 0 Fish (increases inventory), 1 Leisure (decreases inventory, increases utility), 2 Hunt (optional, higher risk/reward than fishing)
Transition: inventory += catch - consumption

Parameters:

n_individuals (int) – Number of individuals to simulate
n_periods (int) – Decision periods per individual
n_inventory_bins (int) – Number of discrete inventory states
include_hunt (bool) – If True, add hunting as third action
as_panel (bool) – If True, return as Panel object
seed (int | None) – Random seed for reproducibility

Returns:

DataFrame with columns:

id: Individual identifier
period: Decision period
inventory: Current inventory state
choice: Chosen action
catch: Units caught this period
consumption: Units consumed this period

Return type:

DataFrame | Panel

Example

>>> from econirl.datasets import load_robinson_crusoe
>>> df = load_robinson_crusoe(n_individuals=100, seed=42)
>>> print(df['choice'].value_counts())
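
The inventory transition in the model structure can be illustrated as below. This is a sketch under one loud assumption: the inventory is clipped to the valid bin range 0 to n_inventory_bins-1, which the description implies but does not state. The function name next_inventory is hypothetical.

```python
def next_inventory(inventory: int, catch: int, consumption: int,
                   n_inventory_bins: int = 20) -> int:
    """Apply inventory += catch - consumption, then keep the result on the grid."""
    raw = inventory + catch - consumption
    # Assumption: values outside [0, n_inventory_bins - 1] are clipped.
    return max(0, min(raw, n_inventory_bins - 1))


print(next_inventory(5, catch=3, consumption=1))    # 7
print(next_inventory(19, catch=3, consumption=0))   # 19 (clipped at the top bin)
```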

econirl.datasets.get_robinson_crusoe_info()[source]

Get metadata about Robinson Crusoe dataset.

Return type:

dict

econirl.datasets.load_equipment_replacement(variant='binary', n_machines=100, n_periods=100, as_panel=False, seed=1987)[source]

Load synthetic equipment replacement data with configurable variants.

This dataset represents machines making maintenance/replacement decisions over time. The state represents equipment wear level, and actions vary by variant (binary replacement, ternary with repairs, or continuous state).

Parameters:

variant (str) – Problem variant to generate:

- “binary”: 2 actions (keep, replace), 90 states, like Rust
- “ternary”: 3 actions (keep, minor_repair, major_repair), 90 states
- “continuous_state”: 2 actions (keep, replace), 200 states
n_machines (int) – Number of machines to simulate (default: 100)
n_periods (int) – Number of time periods per machine (default: 100)
as_panel (bool) – If True, return data structured as a Panel object compatible with econirl estimators. If False (default), return as a pandas DataFrame.
seed (int) – Random seed for reproducibility (default: 1987)

Returns:

DataFrame with columns:

id: Machine identifier
period: Time period (0-indexed)
state: Discretized wear state index
action: Chosen action (varies by variant)
wear_level: Continuous wear level (underlying state)
variant: The variant name used to generate this data

Return type:

DataFrame | Panel

Raises:

ValueError – If variant is not one of “binary”, “ternary”, “continuous_state”

Example

>>> from econirl.datasets import load_equipment_replacement
>>> df = load_equipment_replacement(variant="binary")
>>> print(f"Observations: {len(df):,}")
>>> print(f"Machines: {df['id'].nunique()}")
>>> print(f"States: {df['state'].nunique()}")

>>> # Test with ternary actions
>>> df_ternary = load_equipment_replacement(variant="ternary")
>>> print(f"Actions: {df_ternary['action'].unique()}")

>>> # Get as Panel for estimation
>>> panel = load_equipment_replacement(as_panel=True)
>>> print(f"Panel with {panel.num_individuals} machines")

Notes

Action interpretation by variant:

- binary: 0=keep, 1=replace
- ternary: 0=keep, 1=minor_repair, 2=major_repair
- continuous_state: 0=keep, 1=replace (but with finer state grid)

State encoding:

- binary: 90 states (wear bins, like Rust’s mileage bins)
- ternary: 90 states with different transition dynamics
- continuous_state: 200 states for finer granularity

econirl.datasets.load_keane_wolpin(version='kw_94', as_panel=False, source='bundled')[source]

Load the Keane & Wolpin career decisions dataset.

This dataset tracks individuals choosing between:

- 0: Schooling (accumulate education)
- 1: White-collar work (accumulate white-collar experience)
- 2: Blue-collar work (accumulate blue-collar experience)
- 3: Home production (no state accumulation)

State variables include:

- schooling: Years of completed education
- exp_white_collar: Years of white-collar experience
- exp_blue_collar: Years of blue-collar experience
- age: Current age

Parameters:

version (Literal['kw_94', 'kw_97']) – Which version of the KW model to load:

- “kw_94”: Original 1994 REStat specification
- “kw_97”: Extended 1997 JPE specification

as_panel (bool) – If True, return as Panel object for econirl estimators
source (Literal['respy', 'bundled']) – Data source:

- “respy”: Load from respy package (if installed)
- “bundled”: Load bundled sample data

Returns:

DataFrame with columns:

id: Individual identifier
period: Decision period (1-indexed)
age: Current age
schooling: Years of education
exp_white_collar: White-collar experience
exp_blue_collar: Blue-collar experience
choice: Chosen action (0-3)

Return type:

DataFrame | Panel

Example

>>> from econirl.datasets import load_keane_wolpin
>>> df = load_keane_wolpin()
>>> print(f"Individuals: {df['id'].nunique()}")
>>> print(f"Choice distribution:\n{df['choice'].value_counts()}")

Notes

For full replication of KW94/KW97, install respy: pip install respy

The bundled sample data is suitable for testing and tutorials.

econirl.datasets.get_keane_wolpin_info()[source]

Get metadata about the Keane-Wolpin dataset.

Return type:

dict

econirl.datasets.load_rdw_scrappage(data_dir=None, as_panel=False, max_vehicles=None)[source]

Load the RDW vehicle scrappage dataset.

This dataset contains annual observations of vehicle age, APK inspection defect severity, and scrappage decisions for Dutch passenger vehicles. When real data is not available, synthetic data is generated with matching structure.

Parameters:

data_dir (str | None) – Path to directory containing real RDW CSV data produced by scripts/download_rdw.py. If None or if the file does not exist, synthetic data is generated.
as_panel (bool) – If True, return data structured as a Panel object compatible with econirl estimators. If False (default), return as a pandas DataFrame.
max_vehicles (int | None) – If specified, limit the number of vehicles loaded. Useful for quick testing.

Returns:

DataFrame with columns:

vehicle_id: Unique vehicle identifier
year: Calendar year (or period index for synthetic data)
age_bin: Discretized vehicle age (0-24)
defect_level: APK defect severity (0=pass, 1=minor, 2=major)
scrapped: 1 if vehicle was scrapped this period, 0 otherwise
state: Flattened state index (age_bin * 3 + defect_level)

Return type:

DataFrame | Panel

Example

>>> from econirl.datasets import load_rdw_scrappage
>>> df = load_rdw_scrappage()
>>> print(f"Observations: {len(df):,}")
>>> print(f"Vehicles: {df['vehicle_id'].nunique()}")
>>> print(f"Scrappage rate: {df['scrapped'].mean():.2%}")

>>> # With real RDW data
>>> df = load_rdw_scrappage(data_dir="/path/to/rdw_data/")
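
The flattened state column can be reproduced and inverted with integer arithmetic. The sketch below uses hypothetical helper names (flatten_state, unflatten_state) and relies only on the documented formula state = age_bin * 3 + defect_level with age_bin in 0-24 and defect_level in 0-2.

```python
N_AGE_BINS = 25        # age_bin takes values 0-24
N_DEFECT_LEVELS = 3    # 0=pass, 1=minor, 2=major


def flatten_state(age_bin: int, defect_level: int) -> int:
    """Combine (age_bin, defect_level) into a single index in 0..74."""
    if not (0 <= age_bin < N_AGE_BINS and 0 <= defect_level < N_DEFECT_LEVELS):
        raise ValueError("component out of range")
    return age_bin * N_DEFECT_LEVELS + defect_level


def unflatten_state(state: int) -> tuple:
    """Recover (age_bin, defect_level) from the flattened index."""
    return divmod(state, N_DEFECT_LEVELS)


print(flatten_state(24, 2))   # 74, the largest of the 75 states
print(unflatten_state(10))    # (3, 1)
```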

econirl.datasets.get_rdw_scrappage_info()[source]

Get metadata about the RDW scrappage dataset.

Returns:

Dictionary with dataset information including number of vehicles, observations, and summary statistics.

Return type:

dict

econirl.datasets.load_scania(data_dir=None, as_panel=False, num_degradation_bins=50, max_vehicles=None)[source]

Load the SCANIA Component X replacement dataset.

If data_dir is provided and contains the real SCANIA CSV files, loads and transforms the real data into a DDC panel. Otherwise, generates a synthetic dataset that mimics the SCANIA data structure.

The real data transformation computes a degradation index via PCA on the 105 operational readout features. The first principal component captures 97 percent of variance and is discretized into percentile-based bins. The replacement action is set to 1 at the vehicle’s final observed time step if in_study_repair is 1, and 0 at all other time steps.

Parameters:

data_dir (Optional[str | Path]) – Path to directory containing SCANIA CSV files. If None, uses synthetic data.
as_panel (bool) – If True, return a Panel object compatible with econirl estimators. If False (default), return a DataFrame.
num_degradation_bins (int) – Number of bins for degradation discretization. Default 50.
max_vehicles (Optional[int]) – If set, limit to this many vehicles (for quick testing).

Returns:

DataFrame with columns:

vehicle_id: Unique vehicle identifier
period: Observation index within each vehicle (0-indexed)
time_step: Original continuous time stamp
degradation: PC1 score (continuous degradation index)
degradation_bin: Discretized degradation state
replaced: 1 if component replaced this period, 0 otherwise

Or Panel if as_panel=True.

Return type:

DataFrame | Panel

Example

>>> from econirl.datasets import load_scania
>>> df = load_scania()
>>> print(f"Vehicles: {df['vehicle_id'].nunique()}")
>>> print(f"Replacement rate: {df['replaced'].mean():.2%}")

>>> # With real data from Kaggle
>>> df = load_scania(data_dir="data/scania/Dataset/")
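
The replacement-action rule described above (1 at the final observed time step when in_study_repair is 1, otherwise 0 everywhere) can be sketched for a single vehicle as follows. The function name replacement_actions is hypothetical and the snippet does not reproduce the loader's actual code.

```python
def replacement_actions(n_time_steps: int, in_study_repair: int) -> list:
    """Return the per-step replaced flags for one vehicle."""
    actions = [0] * n_time_steps
    # Only the final observed time step can carry a replacement.
    if in_study_repair == 1 and n_time_steps > 0:
        actions[-1] = 1
    return actions


print(replacement_actions(4, in_study_repair=1))   # [0, 0, 0, 1]
print(replacement_actions(4, in_study_repair=0))   # [0, 0, 0, 0]
```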

econirl.datasets.get_scania_info()[source]

Get metadata about the SCANIA dataset.

Returns:

Dictionary with dataset information including number of vehicles, observations, and summary statistics.

Return type:

dict

econirl.datasets.load_supermarket(as_panel=False, data_path=None)[source]

Load the Aguirregabiria (1999) supermarket dataset.

Returns the real supermarket data with 534 products tracked over 26 usable periods (29 months minus lagged variable construction). Each observation is a product-month with discretized state and action variables suitable for DDC estimation.

Parameters:

as_panel (bool) – If True, return Panel object for econirl estimators.
data_path (str | Path | None) – Path to the supermarket_data.csv file. If None, uses the bundled dataset.

Returns:

DataFrame with columns product_id, period, state, action, next_state, inventory_bin, lagged_promotion, promotion, ordered, sales, inventory, orders, wholesale_price, retail_price, markup_pct, stockout.

If as_panel=True, returns Panel object.

Return type:

DataFrame | Panel

econirl.datasets.get_supermarket_info()[source]

Return metadata about the supermarket dataset.

Return type:

dict

econirl.datasets.load_icu_sepsis(n_individuals=500, max_steps=20, as_panel=False, seed=42, data_path=None)[source]

Generate ICU sepsis treatment trajectories from the expert policy.

Simulates patient trajectories by rolling out the clinician behavior policy (from MIMIC-III) through the MDP transition dynamics. Each trajectory starts from the empirical initial state distribution and runs until the patient reaches the absorbing terminal state or the maximum number of steps is reached.

Parameters:

n_individuals (int) – Number of patient trajectories to generate.
max_steps (int) – Maximum steps per trajectory. ICU stays in MIMIC-III are typically 5 to 20 four-hour windows.
as_panel (bool) – If True, return a Panel object for econirl estimators.
seed (int) – Random seed for reproducibility.
data_path (str | Path | None) – Path to NPZ file. If None, uses bundled data.

Returns:

DataFrame with columns patient_id, period, state, action, next_state, sofa_score, fluid_level, vaso_level, reward, terminated.

If as_panel=True, returns a Panel object.

Return type:

DataFrame | Panel

econirl.datasets.load_icu_sepsis_mdp(data_path=None)[source]

Load the raw ICU-Sepsis MDP components as numpy arrays.

Returns a dictionary with the following keys:

transitions: shape (25, 716, 716), transitions[a, s, s'] = P(s'|s,a)
rewards: shape (716,), state rewards (+1 at survival state 714)
initial_distribution: shape (716,), starting state probabilities
expert_policy: shape (716, 25), clinician behavior policy pi(a|s)
sofa_scores: shape (716,), mean SOFA score per state cluster

Parameters:

data_path (str | Path | None) – Path to NPZ file. If None, uses bundled data.

Returns:

Dictionary of numpy arrays.

Return type:

dict
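
To illustrate how these components fit together, here is a rough sketch of a policy rollout of the kind load_icu_sepsis describes. It uses a hypothetical 3-state, 2-action toy MDP built from plain lists rather than the bundled 716-state arrays, but keeps the same index conventions (transitions[a][s][s'] and expert_policy[s][a]). The function name rollout and the toy numbers are assumptions.

```python
import random


def rollout(transitions, policy, initial_distribution, terminal_states, max_steps, rng):
    """Sample one trajectory of (state, action, next_state) tuples."""
    n_actions, n_states = len(transitions), len(initial_distribution)
    state = rng.choices(range(n_states), weights=initial_distribution)[0]
    trajectory = []
    for _ in range(max_steps):
        if state in terminal_states:      # stop at the absorbing state
            break
        action = rng.choices(range(n_actions), weights=policy[state])[0]
        next_state = rng.choices(range(n_states), weights=transitions[action][state])[0]
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


# Toy MDP (hypothetical numbers): state 2 is absorbing.
toy_transitions = [
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]],   # action 0
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # action 1
]
toy_policy = [[0.5, 0.5], [0.5, 0.5], [1.0, 0.0]]
toy_initial = [1.0, 0.0, 0.0]

trajectory = rollout(toy_transitions, toy_policy, toy_initial, {2}, 20, random.Random(42))
print(len(trajectory), trajectory[0])
```

With the real arrays, the same loop would index the numpy arrays returned under the keys transitions, expert_policy, and initial_distribution.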

econirl.datasets.get_icu_sepsis_info()[source]

Return metadata about the ICU-Sepsis dataset.

Return type:

dict

econirl.datasets.load_citibike_route(as_panel=False, data_path=None, n_individuals=1000, n_periods=50, seed=42)[source]

Load Citibike route choice data.

If preprocessed data exists (from scripts/download_citibike.py), loads it directly. Otherwise generates synthetic route choice trajectories from the CitibikeRouteEnvironment.

Parameters:

as_panel (bool) – If True, return Panel object for econirl estimators.
data_path (str | Path | None) – Path to citibike_route.csv. If None, checks default location then falls back to synthetic generation.
n_individuals (int) – Number of riders for synthetic fallback.
n_periods (int) – Number of trips per rider for synthetic fallback.
seed (int) – Random seed for synthetic fallback.

Returns:

DataFrame with route choice data. If as_panel=True, returns Panel object.

Return type:

DataFrame | TrajectoryPanel

econirl.datasets.get_citibike_route_info()[source]

Return metadata about the Citibike route choice dataset.

Return type:

dict

econirl.datasets.load_citibike_usage(as_panel=False, data_path=None, n_individuals=500, n_periods=90, seed=42)[source]

Load Citibike daily usage frequency data.

If preprocessed data exists (from scripts/download_citibike.py), loads it directly. Otherwise generates synthetic usage panels from the CitibikeUsageEnvironment.

Parameters:

as_panel (bool) – If True, return Panel object for econirl estimators.
data_path (str | Path | None) – Path to citibike_usage.csv. If None, checks default location then falls back to synthetic generation.
n_individuals (int) – Number of members for synthetic fallback.
n_periods (int) – Number of days per member for synthetic fallback.
seed (int) – Random seed for synthetic fallback.

Returns:

DataFrame with daily usage data. If as_panel=True, returns Panel object.

Return type:

DataFrame | TrajectoryPanel

econirl.datasets.get_citibike_usage_info()[source]

Return metadata about the Citibike usage frequency dataset.

Return type:

dict

econirl.datasets.load_tdrive(n_taxis=None, as_trajectories=False, discretize=False, grid_size=100, seed=2010)[source]

Load T-Drive taxi trajectory data.

The original T-Drive dataset contains ~15 million GPS points from 10,357 Beijing taxis. This loader provides a bundled sample or can download the full dataset.

Parameters:

n_taxis (int | None) – Limit to first N taxis (None = all available)
as_trajectories (bool) – If True, return list of trajectory arrays
discretize (bool) – If True, convert GPS to discrete grid states
grid_size (int) – Number of grid cells per dimension if discretizing
seed (int | None) – Random seed for sample generation

Returns:

DataFrame with columns:

taxi_id: Taxi identifier
timestamp: GPS timestamp
longitude: GPS longitude
latitude: GPS latitude
(if discretize) state: Discrete grid cell index

If as_trajectories=True, returns a list of trajectory arrays.

Return type:

DataFrame | list

Example

>>> from econirl.datasets import load_tdrive
>>> df = load_tdrive(n_taxis=100)
>>> print(f"Points: {len(df):,}, Taxis: {df['taxi_id'].nunique()}")

>>> # For MaxEnt IRL
>>> trajectories = load_tdrive(as_trajectories=True, discretize=True)
>>> print(f"Trajectories: {len(trajectories)}")
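
As a rough illustration of what discretize=True with grid_size cells per dimension could mean, the sketch below maps a GPS point to a single grid cell index. The bounding box values, the row-major cell numbering, and the helper name gps_to_state are all assumptions. The loader's actual bounds and indexing are not documented here.

```python
# Hypothetical bounding box: (lon_min, lon_max, lat_min, lat_max).
BEIJING_BOUNDS = (116.0, 116.8, 39.6, 40.2)


def gps_to_state(longitude: float, latitude: float, bounds=BEIJING_BOUNDS,
                 grid_size: int = 100) -> int:
    """Map a GPS point to a cell index in 0..grid_size**2 - 1 (row-major, assumed)."""
    lon_min, lon_max, lat_min, lat_max = bounds
    col = int((longitude - lon_min) / (lon_max - lon_min) * grid_size)
    row = int((latitude - lat_min) / (lat_max - lat_min) * grid_size)
    # Points on or outside the box edge are clamped to the nearest cell.
    col = max(0, min(col, grid_size - 1))
    row = max(0, min(row, grid_size - 1))
    return row * grid_size + col


print(gps_to_state(116.0, 39.6))   # 0, the south-west corner cell
```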

econirl.datasets.get_tdrive_info()[source]

Get metadata about T-Drive dataset.

Return type:

dict

econirl.datasets.load_geolife(n_users=None, include_labels=False, as_trajectories=False, discretize=False, grid_size=100, seed=2008)[source]

Load GeoLife GPS trajectory data.

The original GeoLife dataset contains 17,621 trajectories from 182 users, representing 1.2 million kilometers of travel. Some trajectories include transportation mode labels.

Parameters:

n_users (int | None) – Limit to first N users (None = all available)
include_labels (bool) – Include transportation mode labels where available
as_trajectories (bool) – If True, return list of trajectory arrays
discretize (bool) – If True, convert GPS to discrete grid states
grid_size (int) – Number of grid cells per dimension if discretizing
seed (int | None) – Random seed for sample generation

Returns:

DataFrame with columns:

user_id: User identifier
trajectory_id: Trajectory identifier (trip)
timestamp: GPS timestamp
latitude: GPS latitude
longitude: GPS longitude
altitude: Altitude in meters
(if include_labels) mode: Transportation mode

If as_trajectories=True, returns a list of trajectory arrays.

Return type:

DataFrame | list

Example

>>> from econirl.datasets import load_geolife
>>> df = load_geolife(n_users=50)
>>> print(f"Users: {df['user_id'].nunique()}, Trips: {df['trajectory_id'].nunique()}")

>>> # For mobility IRL
>>> trajectories = load_geolife(as_trajectories=True, discretize=True)

econirl.datasets.get_geolife_info()[source]

Get metadata about GeoLife dataset.

Return type:

dict

econirl.datasets.load_stanford_drone(scene=None, agent_type=None, as_trajectories=False, discretize=False, grid_size=50, seed=2016)[source]

Load Stanford Drone Dataset pedestrian/cyclist trajectories.

The original SDD contains trajectories from 8 unique scenes on Stanford campus, captured via drone footage. Trajectories include pedestrians, cyclists, skaters, carts, and vehicles.

Parameters:

scene (Literal['bookstore', 'coupa', 'deathCircle', 'gates', 'hyang', 'little', 'nexus', 'quad'] | None) – Specific scene to load (None = all scenes)
agent_type (Literal['Pedestrian', 'Biker', 'Skater'] | None) – Filter by agent type (None = all types)
as_trajectories (bool) – If True, return list of trajectory arrays
discretize (bool) – If True, convert pixel coordinates to grid states
grid_size (int) – Grid size for discretization
seed (int | None) – Random seed for sample generation

Returns:

DataFrame with columns:

track_id: Unique trajectory identifier
frame: Video frame number
x: X coordinate (pixels)
y: Y coordinate (pixels)
agent_type: Type of agent
scene: Scene name

If as_trajectories=True, returns a list of trajectory arrays.

Return type:

DataFrame | list

Example

>>> from econirl.datasets import load_stanford_drone
>>> df = load_stanford_drone(scene="gates", agent_type="Pedestrian")
>>> print(f"Trajectories: {df['track_id'].nunique()}")

>>> # For trajectory IRL
>>> trajectories = load_stanford_drone(as_trajectories=True, discretize=True)

econirl.datasets.get_stanford_drone_info()[source]

Get metadata about Stanford Drone Dataset.

Return type:

dict

econirl.datasets.load_eth_ucy(scene=None, as_trajectories=False, discretize=False, grid_size=50, seed=2009)[source]

Load ETH/UCY pedestrian trajectory data.

The ETH/UCY datasets are classic benchmarks for pedestrian trajectory prediction, containing world-coordinate trajectories (in meters) at 2.5 FPS.

Parameters:

scene (Literal['eth', 'hotel', 'univ', 'zara1', 'zara2'] | None) – Specific scene to load (None = all scenes)
as_trajectories (bool) – If True, return list of trajectory arrays
discretize (bool) – If True, convert coordinates to grid states
grid_size (int) – Grid size for discretization
seed (int | None) – Random seed for sample generation

Returns:

DataFrame with columns:

pedestrian_id: Unique pedestrian identifier
frame: Frame number
x: X coordinate (meters, world coordinates)
y: Y coordinate (meters, world coordinates)
scene: Scene name

If as_trajectories=True, returns a list of trajectory arrays.

Return type:

DataFrame | list

Example

>>> from econirl.datasets import load_eth_ucy
>>> df = load_eth_ucy(scene="eth")
>>> print(f"Pedestrians: {df['pedestrian_id'].nunique()}")

>>> # For trajectory IRL
>>> trajectories = load_eth_ucy(as_trajectories=True, discretize=True)

econirl.datasets.get_eth_ucy_info()[source]

Get metadata about ETH/UCY datasets.

Return type:

dict

econirl.datasets.load_foursquare(as_panel=False, min_checkins=50, n_time_bins=4)[source]

Load the Foursquare NYC check-in dataset as a sequential venue choice problem.

Each user’s check-in sequence is converted to (state, action, next_state) tuples where state encodes the current venue category and time of day, and action is the next venue category visited.

Parameters:

as_panel (bool) – If True, return as Panel object for econirl estimators.
min_checkins (int) – Minimum check-ins per user to include (default 50).
n_time_bins (int) – Number of time-of-day bins (default 4).

Returns:

DataFrame with columns user_id, period, state, action, next_state, super_category, next_category, time_bin, hour, is_weekend.

Return type:

DataFrame | Panel

econirl.datasets.get_foursquare_info()[source]

Return metadata about the Foursquare NYC dataset.

Return type:

dict

econirl.datasets.load_ngsim(as_panel=False, n_speed_bins=10, subsample_frames=10, min_frames=50, max_vehicles=None)[source]

Load the NGSIM US-101 dataset as a lane-change discrete choice problem.

Parameters:

as_panel (bool) – If True, return as Panel object for econirl estimators.
n_speed_bins (int) – Number of speed bins (default 10, covering 0-50 ft/s).
subsample_frames (int) – Take every Nth frame to reduce autocorrelation (default 10, i.e., 1Hz from 10Hz raw data).
min_frames (int) – Minimum frames per vehicle after subsampling (default 50).
max_vehicles (int | None) – If set, limit to this many vehicles (for faster testing).

Returns:

DataFrame with columns vehicle_id, frame, state, action, next_state, lane, speed_bin, v_vel, space_headway, lane_change.

Return type:

DataFrame | Panel
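
The two preprocessing knobs above, n_speed_bins and subsample_frames, can be pictured with the following sketch. It assumes equal-width bins over 0-50 ft/s with speeds above the range clamped to the top bin. The helper names speed_to_bin and subsample are hypothetical.

```python
def speed_to_bin(speed_ft_s: float, n_speed_bins: int = 10, max_speed: float = 50.0) -> int:
    """Assign a speed in ft/s to one of n_speed_bins equal-width bins."""
    width = max_speed / n_speed_bins          # 5 ft/s per bin with the defaults
    # Assumption: speeds at or above max_speed fall into the last bin.
    return max(0, min(int(speed_ft_s // width), n_speed_bins - 1))


def subsample(frames: list, every: int = 10) -> list:
    """Keep every Nth frame, e.g. 10 Hz raw data becomes 1 Hz with every=10."""
    return frames[::every]


print(speed_to_bin(12.0))                  # 2
print(subsample(list(range(30)), 10))      # [0, 10, 20]
```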

econirl.datasets.get_ngsim_info()[source]

Return metadata about the NGSIM US-101 dataset.

Return type:

dict

econirl.datasets.load_taxi_gridworld(grid_size=10, n_individuals=200, n_periods=100, as_panel=False, seed=2008, step_penalty=-0.1, terminal_reward=10.0, distance_weight=0.1, discount_factor=0.99)[source]

Generate a taxi-gridworld dataset for MCE IRL benchmarking.

Creates synthetic taxi trajectories on an N x N grid where drivers navigate toward a destination at (N-1, N-1). The true reward parameters are known, enabling parameter recovery evaluation.

This replicates the canonical setup from Ziebart et al. (2008) where MaxEnt IRL was first applied to taxi route preferences.

Parameters:

grid_size (int) – Size of the grid (N x N). Default 10 gives 100 states.
n_individuals (int) – Number of taxi trajectories to simulate.
n_periods (int) – Number of steps per trajectory.
as_panel (bool) – If True, return as Panel object for econirl estimators.
seed (int) – Random seed for reproducibility.
step_penalty (float) – Cost per non-terminal step (default -0.1).
terminal_reward (float) – Reward for reaching destination (default 10.0).
distance_weight (float) – Weight on distance-to-destination feature (default 0.1).
discount_factor (float) – Time discount factor (default 0.99).

Returns:

DataFrame with columns taxi_id, period, state, action, next_state, row, col, next_row, next_col, manhattan_distance.

If as_panel=True, returns Panel object.

Return type:

DataFrame | Panel
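
The relationship between the state, row, col, and manhattan_distance columns can be sketched as below. Row-major state numbering (state = row * N + col) is an assumption, as are the helper names. The destination (N-1, N-1) comes from the description above.

```python
def state_to_cell(state: int, grid_size: int = 10) -> tuple:
    """Convert a state index to (row, col), assuming row-major numbering."""
    return divmod(state, grid_size)


def manhattan_to_destination(state: int, grid_size: int = 10) -> int:
    """Manhattan distance from a state to the destination cell (N-1, N-1)."""
    row, col = state_to_cell(state, grid_size)
    return (grid_size - 1 - row) + (grid_size - 1 - col)


print(state_to_cell(23))               # (2, 3)
print(manhattan_to_destination(0))     # 18 on the default 10 x 10 grid
print(manhattan_to_destination(99))    # 0 at the destination
```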

econirl.datasets.get_taxi_gridworld_info(grid_size=10)[source]

Return metadata about the taxi gridworld dataset.

Parameters:

grid_size (int)

Return type:

dict

econirl.datasets.load_shanghai_network(data_dir=None)[source]

Load Shanghai road network data.

Parameters:

data_dir (str or None) – Path to the dataset directory. If None, uses the default location at /Volumes/Expansion/datasets/shanghai_taxi_rcm_airl/data.

Returns:

Keys: "nodes" (DataFrame), "edges" (DataFrame), "transit" (ndarray of shape (1737, 3)), "n_states" (714), "n_actions" (8).

Return type:

dict

econirl.datasets.load_shanghai_route(split='train', cv=0, size=1000, data_dir=None)[source]

Load Shanghai route-choice data as a TrajectoryPanel.

Convenience function that loads trajectories and transit data, then parses them into a TrajectoryPanel ready for IRL estimators.

Parameters:

split (str) – "train" or "test".
cv (int) – Cross-validation fold index (0-4).
size (int) – Training set size (100, 1000, or 10000). Ignored for test split.
data_dir (str or None) – Path to the dataset directory.

Return type:

TrajectoryPanel

econirl.datasets.load_shanghai_trajectories(split='train', cv=0, size=1000, data_dir=None)[source]

Load Shanghai route-choice trajectory data.

Parameters:

split (str) – "train" or "test".
cv (int) – Cross-validation fold index (0-4).
size (int) – Training set size (100, 1000, or 10000). Ignored for test split.
data_dir (str or None) – Path to the dataset directory.

Returns:

Columns: ori, des, path (underscore-separated n_id sequence), len (number of edges in path).

Return type:

pd.DataFrame

econirl.datasets.build_transition_matrix(transit, n_states=714, n_actions=8)[source]

Build deterministic transition matrix from transit triples.

For each (from_state, direction, to_state) row in transit, sets T[direction, from_state, to_state] = 1.0. State-action pairs not present in transit get a self-loop (absorbing). Each row sums to 1.

Parameters:

transit (ndarray of shape (K, 3)) – Each row is [from_n_id, direction_id, to_n_id].
n_states (int) – Number of states (714).
n_actions (int) – Number of actions (8).

Returns:

Deterministic transition matrix with rows summing to 1.

Return type:

jnp.ndarray of shape (n_actions, n_states, n_states)
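
The construction described above can be sketched with nested lists standing in for the jnp.ndarray. This is not the library's implementation: the function name is hypothetical, and it assumes each (from_state, direction) pair appears at most once in transit.

```python
def build_transition_matrix_sketch(transit, n_states, n_actions):
    """Return T with T[a][s][s2] = 1.0 for listed triples and self-loops elsewhere."""
    T = [[[0.0] * n_states for _ in range(n_states)] for _ in range(n_actions)]
    seen = set()
    for from_state, direction, to_state in transit:
        T[direction][from_state][to_state] = 1.0
        seen.add((from_state, direction))
    # State-action pairs absent from transit become absorbing self-loops.
    for a in range(n_actions):
        for s in range(n_states):
            if (s, a) not in seen:
                T[a][s][s] = 1.0
    return T


toy_transit = [(0, 0, 1), (1, 0, 2), (1, 1, 0)]
T = build_transition_matrix_sketch(toy_transit, n_states=3, n_actions=2)
print(T[0][0])   # [0.0, 1.0, 0.0]
print(T[1][0])   # [1.0, 0.0, 0.0], a self-loop for the missing pair (0, 1)
```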

econirl.datasets.build_edge_features(edges_df, n_states=714)[source]

Build per-edge feature matrix.

Features:

0: length (normalized to [0, 1] by max length)
1-6: one-hot encoding of highway type

Parameters:

edges_df (pd.DataFrame) – Edge data with columns n_id, length, highway.
n_states (int) – Number of edge-states (714).

Return type:

jnp.ndarray of shape (n_states, 7)
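
A small sketch of the seven-column layout follows, using plain dictionaries in place of edges_df. The six highway labels are hypothetical placeholders, since the actual categories in the dataset are not listed here, and the function name is likewise an assumption.

```python
# Hypothetical category labels; the dataset's real highway types may differ.
HIGHWAY_TYPES = ("motorway", "trunk", "primary", "secondary", "tertiary", "residential")


def edge_feature_rows(edges):
    """Build one 7-element row per edge: normalized length plus a 6-way one-hot."""
    max_length = max(edge["length"] for edge in edges)
    rows = []
    for edge in edges:
        one_hot = [1.0 if edge["highway"] == h else 0.0 for h in HIGHWAY_TYPES]
        rows.append([edge["length"] / max_length] + one_hot)
    return rows


toy_edges = [
    {"n_id": 0, "length": 50.0, "highway": "primary"},
    {"n_id": 1, "length": 100.0, "highway": "residential"},
]
rows = edge_feature_rows(toy_edges)
print(rows[0])   # [0.5, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```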

econirl.datasets.build_state_action_features(edge_features, transit, n_states=714, n_actions=8)[source]

Build state-action feature matrix.

For each (from_state, action, to_state) triple, the feature vector at [from_state, action, :] is edge_features[to_state, :] – i.e., the features of the road segment reached by taking that action.

Invalid (state, action) pairs get zero features.

Parameters:

edge_features (jnp.ndarray of shape (n_states, n_features)) – Per-edge features.
transit (ndarray of shape (K, 3)) – Each row is [from_n_id, direction_id, to_n_id].
n_states (int) – Number of states (714).
n_actions (int) – Number of actions (8).

Return type:

jnp.ndarray of shape (n_states, n_actions, n_features)
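
The lookup rule above (features at [from_state, action, :] are those of the reached edge, with zeros for invalid pairs) can be sketched with nested lists. The function name is hypothetical and the toy inputs are made up for illustration.

```python
def build_state_action_features_sketch(edge_features, transit, n_states, n_actions):
    """Return F with F[s][a] = edge_features[to_state] for each transit triple."""
    n_features = len(edge_features[0])
    # Start from zeros so invalid (state, action) pairs keep zero features.
    F = [[[0.0] * n_features for _ in range(n_actions)] for _ in range(n_states)]
    for from_state, action, to_state in transit:
        F[from_state][action] = list(edge_features[to_state])
    return F


toy_edge_features = [[0.1, 1.0], [0.5, 0.0], [1.0, 1.0]]
toy_triples = [(0, 0, 1), (1, 1, 2)]
F = build_state_action_features_sketch(toy_edge_features, toy_triples, n_states=3, n_actions=2)
print(F[0][0])   # [0.5, 0.0], the features of edge 1
print(F[0][1])   # [0.0, 0.0], an invalid pair
```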

econirl.datasets.load_trivago_search(n_sessions=None, data_path=None)[source]

Convenience function: load Trivago data and return a Panel.

Parameters:

n_sessions (int, optional) – Number of sessions to load. None = all.
data_path (str, optional) – Path to train.csv.

Returns:

Panel of session trajectories.

Return type:

Panel

econirl.datasets.load_trivago_sessions(data_path=None, n_sessions=None)[source]

Load raw Trivago session data using polars for speed.

Parameters:

data_path (str, optional) – Path to the train.csv file. Defaults to the external drive location.
n_sessions (int, optional) – If specified, load only the first N unique session_ids.

Returns:

DataFrame with all original columns.

Return type:

pl.DataFrame

econirl.datasets.build_trivago_mdp(sessions_df, n_step_buckets=3, n_viewed_buckets=4)[source]

Build MDP tuples (state, action, next_state) from raw session data.

For each session, tracks cumulative items viewed, computes bucketed state features, maps actions, and handles terminal transitions.

Parameters:

sessions_df (pl.DataFrame) – Raw session data from load_trivago_sessions.
n_step_buckets (int) – Number of step depth buckets (default 3).
n_viewed_buckets (int) – Number of items-viewed buckets (default 4).

Returns:

Keys: all_states, all_actions, all_next_states, session_ids, n_states, n_actions, state_names, action_names.

Return type:

dict

econirl.datasets.build_trivago_panel(mdp_dict)[source]

Build a Panel of trajectories from the MDP dict.

Groups observations by session_id and creates one Trajectory per session.

Parameters:

mdp_dict (dict) – Output of build_trivago_mdp.

Returns:

Panel object with one trajectory per session.

Return type:

Panel
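The grouping step can be pictured with plain Python containers. The sketch below is illustrative only: group_by_session is a hypothetical helper, it returns dictionaries rather than the library's Trajectory and Panel objects, and it assumes the all_states, all_actions, all_next_states, and session_ids entries are aligned sequences with one element per observation.

```python
from collections import defaultdict

def group_by_session(mdp_dict):
    """Hypothetical helper: split aligned observation arrays into per-session lists."""
    grouped = defaultdict(lambda: {"states": [], "actions": [], "next_states": []})
    rows = zip(
        mdp_dict["session_ids"],
        mdp_dict["all_states"],
        mdp_dict["all_actions"],
        mdp_dict["all_next_states"],
    )
    for session_id, state, action, next_state in rows:
        trajectory = grouped[session_id]
        trajectory["states"].append(state)
        trajectory["actions"].append(action)
        trajectory["next_states"].append(next_state)
    return dict(grouped)

# Toy input with two sessions; only the four aligned keys are used here.
toy_mdp = {
    "session_ids": ["a", "a", "b"],
    "all_states": [0, 3, 0],
    "all_actions": [1, 2, 2],
    "all_next_states": [3, 36, 36],
}
trajectories = group_by_session(toy_mdp)
```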

econirl.datasets.build_trivago_features(n_states=37, n_actions=4)[source]

Build action-dependent feature matrix for Trivago hotel search.

Four features per (state, action) pair:

- step_cost: grows with search depth (negative for browse/refine)
- browse_indicator: -1 for browse, 0 otherwise
- refine_indicator: -1 for refine, 0 otherwise
- clickout_indicator: +1 for clickout, 0 otherwise

Parameters:

n_states (int) – Number of states including absorbing (default 37).
n_actions (int) – Number of actions (default 4).

Returns:

Feature matrix of shape (n_states, n_actions, 4).

Return type:

jnp.ndarray

econirl.datasets.build_trivago_transitions(mdp_dict, n_states=37, n_actions=4, smoothing=1e-08)[source]

Build empirical transition matrix P(s'|s,a) from training data.

Parameters:

mdp_dict (dict) – Output of build_trivago_mdp.
n_states (int) – Number of states including absorbing (default 37).
n_actions (int) – Number of actions (default 4).
smoothing (float) – Smoothing constant for unobserved (s, a) pairs.

Returns:

Transition matrix of shape (n_actions, n_states, n_states).

Return type:

jnp.ndarray
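One common way to build such a matrix is to count observed transitions, add the smoothing constant, and normalize each row. The NumPy sketch below follows that recipe as an assumption; the documentation does not say exactly how smoothing is applied, and empirical_transitions_sketch is a hypothetical name. Under this recipe, a (state, action) pair that never appears in the data ends up with a uniform next-state distribution.

```python
import numpy as np

def empirical_transitions_sketch(states, actions, next_states,
                                 n_states, n_actions, smoothing=1e-8):
    """Hypothetical helper: count-based P(s'|s,a) with additive smoothing."""
    counts = np.zeros((n_actions, n_states, n_states))
    # Accumulate one count per observed (action, state, next_state) triple.
    np.add.at(counts, (actions, states, next_states), 1.0)
    # Assumption: smoothing is added to every cell before normalizing.
    counts += smoothing
    return counts / counts.sum(axis=-1, keepdims=True)

# Toy data: 3 states, 2 actions, three observed transitions.
states = np.array([0, 0, 1])
actions = np.array([0, 0, 1])
next_states = np.array([1, 1, 2])
P = empirical_transitions_sketch(states, actions, next_states, n_states=3, n_actions=2)
```

The result is indexed as P[action, state, next_state], matching the (n_actions, n_states, n_states) layout documented above.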

econirl.datasets.get_trivago_info()[source]

Return metadata about the Trivago hotel search dataset.

Return type:

dict

econirl.datasets.load_rust_small(group=None, as_panel=False, original=False)

Load the Rust (1987) bus engine replacement dataset.

This dataset contains monthly observations of odometer readings and engine replacement decisions for buses from the Madison Metropolitan Bus Company. The data was used in Rust’s pioneering work on dynamic discrete choice models.

Parameters:

group (int | None) – If specified, load only buses from this group (1-8 for synthetic, 1-4 for original). Groups differ by bus type and usage patterns. If None, loads all groups.
as_panel (bool) – If True, return data structured as a Panel object compatible with econirl estimators. If False (default), return as a pandas DataFrame.
original (bool) – If True, load original Rust (1987) data from NFXP files. If False (default), load synthetic data with similar characteristics.

Returns:

DataFrame with columns:

bus_id: Unique bus identifier
period: Month number (1-indexed)
mileage: Odometer reading (in thousands of miles)
mileage_bin: Discretized mileage state (0-89)
replaced: 1 if engine was replaced this period, 0 otherwise
group: Bus group (1-4 for original, 1-8 for synthetic)

Return type:

DataFrame

Example

>>> from econirl.datasets import load_rust_bus
>>> df = load_rust_bus()
>>> print(f"Observations: {len(df):,}")
>>> print(f"Buses: {df['bus_id'].nunique()}")
>>> print(f"Replacement rate: {df['replaced'].mean():.2%}")

>>> # Load original data for replication
>>> df_original = load_rust_bus(original=True)
>>> print(f"Original data: {len(df_original):,} observations")

Notes

The original Rust (1987) paper used groups 1-4:

- Group 1: Grumman model 870 (15 buses)
- Group 2: Chance model RT50 (4 buses)
- Group 3: GMC model T8H203 (48 buses)
- Group 4: GMC model A5308, 1975 (37 buses)

For the synthetic data, groups 5-8 represent additional bus types with different characteristics.

Mileage bins follow Rust’s discretization: each bin represents 5,000 miles, so bin 0 = [0, 5000), bin 1 = [5000, 10000), etc.
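The discretization can be written out as a one-line mapping. This is a sketch rather than the loader's own code: mileage_to_bin is a hypothetical helper, it takes mileage in thousands of miles (the unit of the mileage column), and it assumes readings beyond the last bin are capped at bin 89, consistent with the documented 0-89 range.

```python
def mileage_to_bin(mileage_thousands, bin_width=5.0, n_bins=90):
    """Hypothetical helper: map an odometer reading (thousands of miles) to a bin."""
    if mileage_thousands < 0:
        raise ValueError("mileage must be non-negative")
    # Each bin spans 5,000 miles, i.e. 5.0 in thousands-of-miles units.
    raw_bin = int(mileage_thousands // bin_width)
    # Assumption: readings past the top bin are capped at n_bins - 1.
    return min(raw_bin, n_bins - 1)

bins = [mileage_to_bin(m) for m in (0.0, 4.999, 5.0, 12.3, 1000.0)]
```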

econirl.datasets.load_rust_big(n_dummies=30, dummy_cardinality=20, seed=42, as_panel=False)[source]

Load the Rust bus panel augmented with dummy state variables.

The mileage column from the original panel is preserved exactly so that any estimator that projects onto mileage recovers the same parameters as on rust-small. The dummies are appended as columns dummy_0, dummy_1, …, each drawn uniformly over 0 to dummy_cardinality - 1 independently of the action.

Parameters:

n_dummies (int) – Number of dummy state variables to append. Default 30 matches the middle test value in Kang et al. (2025). Set to zero to recover rust-small.
dummy_cardinality (int) – Number of values each dummy can take. Default 20 matches Kang et al. (2025).
seed (int) – Random seed for the dummy draws.
as_panel (bool) – If True, return a Panel object whose state column is the original mileage bin and whose metadata declares the dummy column names. If False, return the augmented DataFrame.

Returns:

DataFrame with the original Rust columns plus n_dummies dummy columns. If as_panel=True, a Panel object whose metadata['dummy_columns'] lists the dummy names.

Return type:

DataFrame

Notes

The Panel object encodes only the mileage state as the integer state column to keep it compatible with tabular estimators that expect a small state space. The dummy columns travel in the Panel metadata field. Estimators that consume the augmented state space (NNES, GLADIUS, TD-CCP) read the dummies through a multi-dimensional state encoder rather than through the primary state column.
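The augmentation amounts to appending independent uniform integer columns. The sketch below shows that idea with NumPy's default_rng; it is an assumption about the mechanism, not the loader's code, so the values will not match the library's draws for the same seed. draw_dummies is a hypothetical helper.

```python
import numpy as np

def draw_dummies(n_rows, n_dummies=30, dummy_cardinality=20, seed=42):
    """Hypothetical helper: uniform integer dummy columns, independent of actions."""
    rng = np.random.default_rng(seed)
    # Each entry is uniform over 0 .. dummy_cardinality - 1 (upper bound exclusive).
    draws = rng.integers(0, dummy_cardinality, size=(n_rows, n_dummies))
    return {f"dummy_{j}": draws[:, j] for j in range(n_dummies)}

dummies = draw_dummies(n_rows=1000)
no_dummies = draw_dummies(n_rows=1000, n_dummies=0)
```

Because the draws ignore both the mileage state and the action, they carry no information about the replacement decision, which is why an estimator that projects onto mileage should be unaffected by them.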

econirl.datasets.get_rust_big_info(n_dummies=30, dummy_cardinality=20)[source]

Return metadata about the augmented Rust bus dataset.

Parameters:

n_dummies (int) – Number of dummy state variables in the configuration.
dummy_cardinality (int) – Cardinality of each dummy state variable.

Returns:

Dictionary with dataset information including the augmented state-space dimensionality, the source paper, and the true reward parameters on the genuine mileage dimension.

Return type:

dict

econirl.datasets.load_ziebart_small(grid_size=10, n_individuals=200, n_periods=100, as_panel=False, seed=2008, step_penalty=-0.1, terminal_reward=10.0, distance_weight=0.1, discount_factor=0.99)

Generate a taxi-gridworld dataset for MCE IRL benchmarking.

Creates synthetic taxi trajectories on an N x N grid where drivers navigate toward a destination at (N-1, N-1). The true reward parameters are known, enabling parameter recovery evaluation.

This replicates the canonical setup from Ziebart et al. (2008) where MaxEnt IRL was first applied to taxi route preferences.

Parameters:

grid_size (int) – Size of the grid (N x N). Default 10 gives 100 states.
n_individuals (int) – Number of taxi trajectories to simulate.
n_periods (int) – Number of steps per trajectory.
as_panel (bool) – If True, return as Panel object for econirl estimators.
seed (int) – Random seed for reproducibility.
step_penalty (float) – Cost per non-terminal step (default -0.1).
terminal_reward (float) – Reward for reaching destination (default 10.0).
distance_weight (float) – Weight on distance-to-destination feature (default 0.1).
discount_factor (float) – Time discount factor (default 0.99).

Returns:

DataFrame with columns: taxi_id, period, state, action, next_state, row, col, next_row, next_col, manhattan_distance. If as_panel=True, returns a Panel object.

Return type:

DataFrame
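The state, row, col, and manhattan_distance columns are related by simple grid arithmetic. The sketch below assumes a row-major encoding (state = row * grid_size + col), which the documentation does not confirm; decode_state is a hypothetical helper. The distance is measured to the destination at (N-1, N-1).

```python
def decode_state(state, grid_size=10):
    """Hypothetical helper: (row, col, Manhattan distance to the goal corner)."""
    # Assumption: states are numbered row by row, starting at 0.
    row, col = divmod(state, grid_size)
    goal = grid_size - 1
    manhattan_distance = (goal - row) + (goal - col)
    return row, col, manhattan_distance

corner_start = decode_state(0)
interior = decode_state(27)
destination = decode_state(99)
```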

econirl.datasets.load_ziebart_big(grid_size=50, n_actions=8, slip_prob=0.1, n_basis=16, n_trajectories=50000, trajectory_length=30, discount_factor=0.99, scale_parameter=1.0, seed=42, as_panel=False)[source]

Load or generate the large Ziebart gridworld dataset.

A 50-by-50 stochastic gridworld with 8 actions and a slip probability of 0.1. The reward is a linear combination of 16 radial basis functions on the grid coordinates with coefficients drawn once at generation time and stored alongside the panel. 50000 trajectories of length 30 are sampled under the maximum causal entropy optimal policy at the true reward.

Parameters:

grid_size (int) – Side length of the square grid. Default 50 produces a 2500-cell state space.
n_actions (int) – Move set size. Only the 8-action diagonal move set is supported.
slip_prob (float) – Probability that a move resolves to a uniformly chosen alternative direction.
n_basis (int) – Number of radial basis functions used in the reward.
n_trajectories (int) – Number of trajectories to sample.
trajectory_length (int) – Number of state-action pairs per trajectory.
discount_factor (float) – Time discount factor for the optimal policy.
scale_parameter (float) – Logit scale parameter.
seed (int) – Random seed for reproducibility.
as_panel (bool) – If True, return a Panel object. If False, return a pandas DataFrame.

Returns:

DataFrame with trajectory records or Panel object whose metadata field carries the true reward coefficients.

Return type:

DataFrame
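The slip mechanism can be read as a distribution over the direction that is actually executed. The sketch below takes one interpretation, stated as an assumption: with probability slip_prob the intended move is replaced by one of the other n_actions - 1 directions, chosen uniformly. executed_direction_probs is a hypothetical helper.

```python
def executed_direction_probs(intended, n_actions=8, slip_prob=0.1):
    """Hypothetical helper: probability of each executed direction given the intent."""
    if not 0 <= intended < n_actions:
        raise ValueError("intended action out of range")
    # Assumption: the slip mass is split evenly over the non-intended directions.
    per_alternative = slip_prob / (n_actions - 1)
    probs = [per_alternative] * n_actions
    probs[intended] = 1.0 - slip_prob
    return probs

probs = executed_direction_probs(intended=3)
```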

econirl.datasets.get_ziebart_big_info()[source]

Return metadata about the large gridworld dataset.

Returns the true reward coefficients used by the data-generating process if the bundled metadata is available; otherwise returns only the static descriptive fields. Validation tests check recovered coefficients against the true_theta field.

Return type:

dict

econirl.datasets.load_lsw_synthetic(n_users=5000, n_books=50, n_chapters_per_book=30, pi_pay_segment=0.4, seed=42, as_panel=False)[source]

Load or generate the LSW serialized-content semi-synthetic panel.

Parameters:

n_users (int) – Number of simulated users in the panel.
n_books (int) – Maximum number of books each user encounters before running out of content. Users may exit a book early.
n_chapters_per_book (int) – Number of chapters per book.
pi_pay_segment (float) – Population mixture weight on the high-patience pay-and-read latent type. Default 0.4 matches the source paper’s reported relative segment shares.
seed (int) – Random seed for reproducibility.
as_panel (bool) – If True, return a Panel object whose metadata field carries the data-generating process parameters. If False, return a pandas DataFrame.

Returns:

DataFrame with one row per chapter decision, including the integer state encoding, the chosen action, the next state, the latent type, and a four-dimensional content embedding. If as_panel=True, a Panel object whose metadata declares the type-specific reward coefficients.

Return type:

DataFrame

econirl.datasets.get_lsw_synthetic_info()[source]

Return metadata about the LSW serialized-content dataset.

Returns the type-specific reward coefficients, the population mixture weights, the discount factor, and the action labels used by the AIRL-Het EM loop to validate recovered parameters.

Return type:

dict

econirl.datasets.load_shapeshifter(seed=42, n_individuals=500, n_periods=100, as_panel=True, **config_kwargs)[source]

Construct a shape-shifter environment and simulate a panel.

Parameters:

seed (int) – Random seed. Sets both the environment’s frozen-network and transition-matrix seed (via ShapeshifterConfig.seed) and the simulation’s trajectory sampler seed.
n_individuals (int) – Number of trajectories to simulate.
n_periods (int) – Number of time periods per trajectory. Ignored when the environment is finite-horizon (the env terminates earlier).
as_panel (bool) – Always treated as True; the worker pipeline expects a Panel. Argument retained for signature compatibility with the other loaders in this package.
**config_kwargs – Forwarded to ShapeshifterConfig. Any axis flag (for example reward_type="neural" or state_dim=2) goes here. The cell defines what to pass.

Returns:

Simulated panel ready to feed into an estimator.

Return type:

Panel

econirl.datasets.get_shapeshifter_info()[source]

Return metadata about the shape-shifter loader.

Return type:

dict[str, Any]