tanat.dataset.simulation package#
Submodules#
tanat.dataset.simulation.events module#
simulate_events: generate synthetic event sequence data.
- tanat.dataset.simulation.events.simulate_events(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) → DataFrame
Generate synthetic event sequence data.
Returns a pd.DataFrame with one row per event (entity).
The DataFrame contains columns: id (int64), time (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-event measurements attached to each sequence row.
Feature types are assigned by cycling through numeric, categorical and boolean in that order.
Use simulate_static() to generate a separate per-sequence static DataFrame when needed.
- Parameters:
n_ids – Number of distinct sequence IDs to generate.
seq_length_range – (min, max) number of events per ID (inclusive).
features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or an explicit list of column names. These become the per-event (entity-level) measurements in the pool.
time_range – (start, end) datetime bounds for generated timestamps. Defaults to 2000-01-01 to 2025-01-01.
seed – Random seed for reproducibility.
- Returns:
A pd.DataFrame with columns [id, time, <features>].
Examples:
from tanat.dataset.simulation import simulate_events

df = simulate_events(n_ids=50, seed=42)
df = simulate_events(n_ids=50, features=["value", "category"], seed=42)
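The type-cycling rule described above can be illustrated with a small standalone sketch. It is not tanat's implementation: `assign_feature_types` and the string type labels are hypothetical, and it assumes the cycle is applied by column position, both for auto-named columns and for an explicit list of names.

```python
from itertools import cycle

# Order taken from the description above; the labels are illustrative strings.
FEATURE_TYPES = ("numeric", "categorical", "boolean")


def assign_feature_types(features, prefix="f_"):
    """Map each feature column name to a type label by cycling FEATURE_TYPES.

    `features` is either a column count (auto-named f_0, f_1, ...) or an
    explicit list of names. Hypothetical helper, not part of tanat.
    """
    if isinstance(features, int):
        names = [f"{prefix}{i}" for i in range(features)]
    else:
        names = list(features)
    # zip stops at the shorter iterable, so the endless cycle() is safe here.
    return dict(zip(names, cycle(FEATURE_TYPES)))


auto = assign_feature_types(4)
named = assign_feature_types(["value", "category"])
print(auto)
print(named)
```

Under this assumption a fourth column wraps back to numeric, and an explicit two-name list yields one numeric and one categorical column.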
tanat.dataset.simulation.intervals module#
simulate_intervals: generate synthetic interval sequence data.
- tanat.dataset.simulation.intervals.simulate_intervals(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (1, 30), allow_overlaps: bool = True, features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) → DataFrame
Generate synthetic interval sequence data.
Returns a pd.DataFrame with one row per interval (entity).
The DataFrame contains columns: id (int64), start (datetime64[us]), end (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-interval measurements attached to each sequence row.
Feature types are assigned by cycling through numeric, categorical and boolean in that order.
Use simulate_static() to generate a separate per-sequence static DataFrame when needed.
- Parameters:
n_ids – Number of distinct sequence IDs to generate.
seq_length_range – (min, max) number of intervals per ID.
duration_range – (min, max) interval duration in days.
allow_overlaps – When True, intervals within an ID may overlap. When False, each interval starts after the previous one ends.
features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or an explicit list of column names. These become the per-interval (entity-level) measurements in the pool.
time_range – (start, end) datetime bounds. Defaults to 2000-01-01 to 2025-01-01.
seed – Random seed for reproducibility.
- Returns:
A pd.DataFrame with columns [id, start, end, <features>].
Examples:
from tanat.dataset.simulation import simulate_intervals

df = simulate_intervals(n_ids=200, allow_overlaps=False, seed=0)
df = simulate_intervals(n_ids=50, features=["duration_days", "label"], seed=0)
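To make the allow_overlaps=False guarantee concrete, here is a minimal standalone check on the (start, end) pairs of a single ID. `has_overlaps` is a hypothetical helper, not a tanat function, and it assumes that an interval starting exactly when the previous one ends counts as non-overlapping.

```python
from datetime import datetime


def has_overlaps(intervals):
    """Return True if any interval starts before an earlier one has ended.

    `intervals` is a list of (start, end) datetime pairs for one ID.
    Hypothetical helper, not part of tanat.
    """
    ordered = sorted(intervals)  # sort by start, then by end
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Assumption: touching boundaries (next_start == prev_end) are fine.
        if next_start < prev_end:
            return True
    return False


sequential = [
    (datetime(2020, 1, 1), datetime(2020, 1, 10)),
    (datetime(2020, 1, 12), datetime(2020, 1, 20)),
]
overlapping = [
    (datetime(2020, 1, 1), datetime(2020, 1, 10)),
    (datetime(2020, 1, 5), datetime(2020, 1, 8)),
]
print(has_overlaps(sequential), has_overlaps(overlapping))
```

Comparing only neighbouring pairs is sufficient once the intervals are sorted by start, since any overlapping pair implies an overlapping adjacent pair.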
tanat.dataset.simulation.states module#
simulate_states: generate synthetic contiguous state sequence data.
- tanat.dataset.simulation.states.simulate_states(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (5, 45), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) → DataFrame
Generate synthetic contiguous state sequence data.
Returns a pd.DataFrame with one row per state (entity).
States are strictly contiguous: end[i] == start[i+1] within each ID. The end column of the last state per ID is set to the end of time_range.
The DataFrame contains columns: id (int64), start (datetime64[us]), end (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-state measurements attached to each sequence row.
Feature types are assigned by cycling through numeric, categorical and boolean in that order.
Use simulate_static() to generate a separate per-sequence static DataFrame when needed.
- Parameters:
n_ids – Number of distinct sequence IDs to generate.
seq_length_range – (min, max) number of states per ID.
duration_range – (min, max) state duration in days.
features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or an explicit list of column names. These become the per-state (entity-level) measurements in the pool.
time_range – (start, end) datetime bounds. Defaults to 2000-01-01 to 2025-01-01.
seed – Random seed for reproducibility.
- Returns:
A pd.DataFrame with columns [id, start, end, <features>].
Examples:
from tanat.dataset.simulation import simulate_states

df = simulate_states(n_ids=50, seed=42)
df = simulate_states(n_ids=50, features=["score", "status"], seed=42)
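The contiguity rule stated above (end[i] == start[i+1], with the last state ending at the end of time_range) can be checked with a short standalone sketch. `is_contiguous` is a hypothetical helper operating on plain (start, end) pairs for one ID, not a tanat function.

```python
from datetime import datetime


def is_contiguous(states, range_end):
    """Check end[i] == start[i+1] and that the last state ends at range_end.

    `states` is a time-ordered list of (start, end) datetime pairs for one ID.
    Hypothetical helper, not part of tanat.
    """
    if not states:
        return True
    for (_, end), (next_start, _) in zip(states, states[1:]):
        if end != next_start:
            return False
    # The final state is expected to run to the end of the time range.
    return states[-1][1] == range_end


range_end = datetime(2025, 1, 1)  # documented default upper bound of time_range
contiguous = [
    (datetime(2024, 1, 1), datetime(2024, 2, 1)),
    (datetime(2024, 2, 1), datetime(2024, 3, 15)),
    (datetime(2024, 3, 15), range_end),
]
with_gap = [
    (datetime(2024, 1, 1), datetime(2024, 2, 1)),
    (datetime(2024, 2, 3), range_end),
]
print(is_contiguous(contiguous, range_end), is_contiguous(with_gap, range_end))
```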
tanat.dataset.simulation.static module#
simulate_static: generate synthetic per-sequence static data.
- tanat.dataset.simulation.static.simulate_static(*, n_ids: int = 100, features: int | list[str] = 2, seed: int | None = None) → DataFrame
Generate a synthetic static DataFrame with one row per sequence ID.
This function is independent of any temporal simulation: it can be used alongside any simulate_events, simulate_intervals or simulate_states call that shares the same n_ids. IDs are generated as consecutive integers 1 … n_ids.
Feature types are assigned by cycling through numeric, categorical and boolean in that order.
- Parameters:
n_ids – Number of distinct sequence IDs (and rows) to generate.
features – Number of feature columns to generate (auto-named s_0, s_1, …) or an explicit list of column names. Types cycle through numeric, categorical, boolean.
seed – Random seed for reproducibility.
- Returns:
A pd.DataFrame with an id column plus one column per feature.
Examples:
from tanat.dataset.simulation import simulate_static

static_df = simulate_static(n_ids=50, features=["age", "group"], seed=0)
static_df.head()
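Because static IDs run from 1 to n_ids, static rows can be matched to temporal rows on id whenever both simulations use the same n_ids. The sketch below shows that lookup on plain dictionaries. `attach_static` and the sample values are hypothetical; with real DataFrames the equivalent step would typically be a pandas merge on the id column.

```python
def attach_static(rows, static_rows):
    """Copy per-sequence static values onto each temporal row by matching 'id'.

    Hypothetical helper, not part of tanat.
    """
    static_by_id = {row["id"]: row for row in static_rows}
    merged = []
    for row in rows:
        # Raises KeyError if the two ID spaces do not line up.
        static = static_by_id[row["id"]]
        merged.append({**static, **row})
    return merged


n_ids = 3
# Made-up static table: one row per ID, IDs 1 … n_ids.
static_rows = [{"id": i, "group": "a" if i % 2 else "b"} for i in range(1, n_ids + 1)]
# Made-up event rows: several rows may share an ID.
event_rows = [
    {"id": 1, "value": 0.5},
    {"id": 1, "value": 0.7},
    {"id": 3, "value": 0.1},
]
merged = attach_static(event_rows, static_rows)
print(merged)
```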
tanat.dataset.simulation.trajectories module#
simulate_trajectories: generate synthetic multi-sequence trajectory data.
- tanat.dataset.simulation.trajectories.simulate_trajectories(sequences: dict[str, dict], *, shared_ids: bool = True, seed: int | None = None) → dict[str, DataFrame]
Generate synthetic data for multiple sequence types at once.
Convenience wrapper that calls simulate_events, simulate_intervals, or simulate_states for each entry in sequences.
- Parameters:
sequences – Mapping of {alias: config_dict}. Each config_dict must contain a "type" key ("event", "interval", or "state") and may contain any keyword accepted by the corresponding simulate_* function, including features to name the entity-level columns.
shared_ids – When True, all generated sequences use the same ID space (1..n_ids). When False, each sequence gets its own independent ID range.
seed – Master seed. Per-sequence seeds are derived deterministically when individual configs omit seed.
- Returns:
Dict of {alias: DataFrame} matching the input keys, ready to be piped into build_* and then build_trajectories.
Use simulate_static() separately to generate per-trajectory static data.
- Raises:
ValueError – When shared_ids=True and n_ids values differ across sequence configs.
ValueError – When an unknown sequence type is provided.
Examples:
from tanat.dataset.simulation import simulate_trajectories

data = simulate_trajectories(
    sequences={
        "admissions": {"type": "interval", "n_ids": 500},
        "procedures": {"type": "event", "n_ids": 500},
    },
    seed=42,
)
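The documented error conditions and the master-seed behaviour can be sketched without the library. This is one plausible reading, not tanat's code: `validate_sequences` and `derive_seeds` are hypothetical names, a missing "type" key is treated as an unknown type, and drawing per-sequence seeds from `random.Random(master_seed)` is an assumed derivation scheme.

```python
import random

VALID_TYPES = {"event", "interval", "state"}
DEFAULT_N_IDS = 100  # default n_ids shown in the simulate_* signatures


def validate_sequences(sequences, shared_ids=True):
    """Raise ValueError for unknown types or, with shared_ids, differing n_ids."""
    n_ids_seen = set()
    for alias, config in sequences.items():
        seq_type = config.get("type")
        if seq_type not in VALID_TYPES:
            raise ValueError(f"Unknown sequence type for {alias!r}: {seq_type!r}")
        n_ids_seen.add(config.get("n_ids", DEFAULT_N_IDS))
    if shared_ids and len(n_ids_seen) > 1:
        raise ValueError(f"shared_ids=True requires equal n_ids, got {sorted(n_ids_seen)}")


def derive_seeds(sequences, master_seed):
    """Give each alias a seed: its own if configured, else one drawn from the master."""
    rng = random.Random(master_seed)
    seeds = {}
    for alias, config in sequences.items():
        # Always draw, so an explicit seed on one alias does not shift the others.
        drawn = rng.randrange(2**32)
        seeds[alias] = config.get("seed", drawn)
    return seeds


sequences = {
    "admissions": {"type": "interval", "n_ids": 500},
    "procedures": {"type": "event", "n_ids": 500},
}
validate_sequences(sequences)
seeds = derive_seeds(sequences, master_seed=42)
print(seeds)
```

Calling `derive_seeds` twice with the same master seed returns the same mapping, which is the reproducibility property the seed parameter describes.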
Module contents#
Simulation sub-package: synthetic data generation.
- tanat.dataset.simulation.simulate_events(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) → DataFrame
Generate synthetic event sequence data.
- tanat.dataset.simulation.simulate_intervals(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (1, 30), allow_overlaps: bool = True, features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) → DataFrame
Generate synthetic interval sequence data.
- tanat.dataset.simulation.simulate_states(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (5, 45), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) → DataFrame
Generate synthetic contiguous state sequence data.
- tanat.dataset.simulation.simulate_static(*, n_ids: int = 100, features: int | list[str] = 2, seed: int | None = None) → DataFrame
Generate a synthetic static DataFrame with one row per sequence ID.
- tanat.dataset.simulation.simulate_trajectories(sequences: dict[str, dict], *, shared_ids: bool = True, seed: int | None = None) → dict[str, DataFrame]
Generate synthetic data for multiple sequence types at once.