tanat.dataset.simulation package

Submodules

tanat.dataset.simulation.events module

simulate_events: generate synthetic event sequence data.

tanat.dataset.simulation.events.simulate_events(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) -> DataFrame

Generate synthetic event sequence data.

Returns a pd.DataFrame with one row per event (entity).

The DataFrame contains columns: id (int64), time (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-event measurements attached to each sequence row.

Feature types are assigned by cycling through numeric, categorical and boolean in that order.
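The cycling rule can be pictured with a small sketch. This is not tanat's implementation: the helper name assign_feature_kinds and the kind labels are hypothetical, and the assumption is that types are assigned purely by column position.

```python
from itertools import cycle

# Assumed kind labels, in the order stated in the description above.
FEATURE_KINDS = ("numeric", "categorical", "boolean")


def assign_feature_kinds(features):
    """Map each feature name to a kind by cycling through FEATURE_KINDS.

    `features` is either an int (auto-named f_0, f_1, ...) or a list of names.
    Hypothetical helper for illustration only.
    """
    if isinstance(features, int):
        names = [f"f_{i}" for i in range(features)]
    else:
        names = list(features)
    # zip stops at the last name, so the cycle simply wraps as often as needed.
    return dict(zip(names, cycle(FEATURE_KINDS)))


kinds = assign_feature_kinds(4)
```

Under this assumption a fourth feature wraps back to numeric, and an explicit list such as ["value", "category"] would yield one numeric and one categorical column.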

Use simulate_static() to generate a separate per-sequence static DataFrame when needed.

Parameters:
  • n_ids – Number of distinct sequence IDs to generate.

  • seq_length_range – (min, max) number of events per ID (inclusive).

  • features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or an explicit list of column names. These become the per-event (entity-level) measurements in the pool.

  • time_range – (start, end) datetime bounds for generated timestamps. Defaults to 2000-01-01 to 2025-01-01.

  • seed – Random seed for reproducibility.

Returns:

A pd.DataFrame with columns [id, time, <features>].
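To make the roles of seq_length_range, time_range and seed concrete, here is a minimal standard-library sketch of drawing the timestamps for a single ID. It is an illustration under stated assumptions, not the library's code: draw_event_times is a hypothetical helper, timestamps are assumed to be drawn uniformly, and sorting them is also an assumption.

```python
import random
from datetime import datetime, timedelta

# Default bounds as documented for time_range.
DEFAULT_TIME_RANGE = (datetime(2000, 1, 1), datetime(2025, 1, 1))


def draw_event_times(seq_length_range=(3, 10), time_range=None, seed=None):
    """Draw sorted event timestamps for one sequence ID (hypothetical helper)."""
    rng = random.Random(seed)
    start, end = time_range or DEFAULT_TIME_RANGE
    # randint includes both endpoints, matching the "(inclusive)" wording.
    n_events = rng.randint(*seq_length_range)
    span_seconds = (end - start).total_seconds()
    times = [
        start + timedelta(seconds=rng.uniform(0, span_seconds))
        for _ in range(n_events)
    ]
    return sorted(times)


times = draw_event_times(seed=42)
```

Calling the helper twice with the same seed returns the same list, which is the behaviour the seed parameter is documented to provide.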

Examples:

from tanat.dataset.simulation import simulate_events

df = simulate_events(n_ids=50, seed=42)  # two auto-named features: f_0, f_1
df = simulate_events(n_ids=50, features=["value", "category"], seed=42)

tanat.dataset.simulation.intervals module

simulate_intervals: generate synthetic interval sequence data.

tanat.dataset.simulation.intervals.simulate_intervals(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (1, 30), allow_overlaps: bool = True, features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) -> DataFrame

Generate synthetic interval sequence data.

Returns a pd.DataFrame with one row per interval (entity).

The DataFrame contains columns: id (int64), start (datetime64[us]), end (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-interval measurements attached to each sequence row.

Feature types are assigned by cycling through numeric, categorical and boolean in that order.

Use simulate_static() to generate a separate per-sequence static DataFrame when needed.

Parameters:
  • n_ids – Number of distinct sequence IDs to generate.

  • seq_length_range – (min, max) number of intervals per ID.

  • duration_range – (min, max) interval duration in days.

  • allow_overlaps – When True, intervals within an ID may overlap. When False, each interval starts after the previous one ends.

  • features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or an explicit list of column names. These become the per-interval (entity-level) measurements in the pool.

  • time_range – (start, end) datetime bounds. Defaults to 2000-01-01 to 2025-01-01.

  • seed – Random seed for reproducibility.

Returns:

A pd.DataFrame with columns [id, start, end, <features>].

Examples:

from tanat.dataset.simulation import simulate_intervals

df = simulate_intervals(n_ids=200, allow_overlaps=False, seed=0)  # no overlaps within an ID
df = simulate_intervals(n_ids=50, features=["duration_days", "label"], seed=0)
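The allow_overlaps=False guarantee can be checked after generation. The sketch below works on plain (id, start, end) tuples using only the standard library; has_overlaps is a hypothetical helper, and treating intervals that merely touch (end equal to the next start) as non-overlapping is an assumption, since the documentation does not say how ties are handled.

```python
from collections import defaultdict
from datetime import datetime


def has_overlaps(rows):
    """Return True if any two intervals sharing an id overlap.

    `rows` is an iterable of (id, start, end) tuples. Hypothetical helper.
    """
    by_id = defaultdict(list)
    for seq_id, start, end in rows:
        by_id[seq_id].append((start, end))
    for spans in by_id.values():
        spans.sort()  # order by start, then end
        # After sorting, any overlap implies an overlap between neighbours.
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start < prev_end:
                return True
    return False


rows = [
    (1, datetime(2020, 1, 1), datetime(2020, 1, 5)),
    (1, datetime(2020, 1, 5), datetime(2020, 1, 9)),  # touches, does not overlap
    (2, datetime(2020, 1, 2), datetime(2020, 1, 4)),  # different id
]
overlap_found = has_overlaps(rows)
```

For a generated frame, the rows could presumably be obtained with df[["id", "start", "end"]].itertuples(index=False, name=None), assuming the column names listed under Returns.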

tanat.dataset.simulation.states module

simulate_states: generate synthetic contiguous state sequence data.

tanat.dataset.simulation.states.simulate_states(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (5, 45), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) -> DataFrame

Generate synthetic contiguous state sequence data.

Returns a pd.DataFrame with one row per state (entity).

States are strictly contiguous: end[i] == start[i+1] within each ID. The end column of the last state per ID is set to the end of time_range.
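The contiguity invariant is easy to verify on any table of states. A minimal sketch on (id, start, end) tuples follows; is_contiguous is a hypothetical helper, not part of tanat.

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter


def is_contiguous(rows):
    """Check that end[i] == start[i+1] for consecutive states of each id.

    `rows` is an iterable of (id, start, end) tuples. Hypothetical helper.
    """
    ordered = sorted(rows, key=itemgetter(0, 1))  # by id, then by start
    for _, group in groupby(ordered, key=itemgetter(0)):
        states = list(group)
        for prev, nxt in zip(states, states[1:]):
            if prev[2] != nxt[1]:
                return False
    return True


states = [
    (1, datetime(2020, 1, 1), datetime(2020, 1, 10)),
    (1, datetime(2020, 1, 10), datetime(2020, 2, 1)),
    (2, datetime(2020, 3, 1), datetime(2020, 3, 8)),
]
contiguous = is_contiguous(states)
```

A gap or an overlap between two consecutive states of the same ID makes the check return False.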

The DataFrame contains columns: id (int64), start (datetime64[us]), end (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-state measurements attached to each sequence row.

Feature types are assigned by cycling through numeric, categorical and boolean in that order.

Use simulate_static() to generate a separate per-sequence static DataFrame when needed.

Parameters:
  • n_ids – Number of distinct sequence IDs to generate.

  • seq_length_range – (min, max) number of states per ID.

  • duration_range – (min, max) state duration in days.

  • features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or an explicit list of column names. These become the per-state (entity-level) measurements in the pool.

  • time_range – (start, end) datetime bounds. Defaults to 2000-01-01 to 2025-01-01.

  • seed – Random seed for reproducibility.

Returns:

A pd.DataFrame with columns [id, start, end, <features>].

Examples:

from tanat.dataset.simulation import simulate_states

df = simulate_states(n_ids=50, seed=42)  # two auto-named features: f_0, f_1
df = simulate_states(n_ids=50, features=["score", "status"], seed=42)

tanat.dataset.simulation.static module

simulate_static: generate synthetic per-sequence static data.

tanat.dataset.simulation.static.simulate_static(*, n_ids: int = 100, features: int | list[str] = 2, seed: int | None = None) -> DataFrame

Generate a synthetic static DataFrame with one row per sequence ID.

This function is independent of any temporal simulation: it can be used alongside any simulate_events, simulate_intervals or simulate_states call that shares the same n_ids. IDs are generated as consecutive integers 1..n_ids.
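Because the static table and the temporal tables are keyed by id, static attributes can be attached to each temporal row with a simple lookup. The sketch below uses plain dictionaries with made-up values; attach_static is a hypothetical helper, and it assumes both tables were generated with the same n_ids.

```python
def attach_static(sequence_rows, static_rows):
    """Attach per-ID static attributes to every temporal row.

    Both arguments are lists of dicts that carry an "id" key.
    Raises KeyError when a temporal row has no matching static row.
    Hypothetical helper for illustration only.
    """
    static_by_id = {row["id"]: row for row in static_rows}
    return [{**static_by_id[row["id"]], **row} for row in sequence_rows]


# Made-up rows standing in for simulated output.
static_rows = [
    {"id": 1, "age": 34.0, "group": "a"},
    {"id": 2, "age": 51.5, "group": "b"},
]
event_rows = [
    {"id": 1, "time": "2001-05-02", "value": 0.3},
    {"id": 1, "time": "2003-11-17", "value": 1.2},
    {"id": 2, "time": "2010-01-09", "value": -0.7},
]
merged = attach_static(event_rows, static_rows)
```

With pandas frames, events_df.merge(static_df, on="id", how="left") would be the usual equivalent.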

Feature types are assigned by cycling through numeric, categorical and boolean in that order.

Parameters:
  • n_ids – Number of distinct sequence IDs (and rows) to generate.

  • features – Number of feature columns to generate (auto-named s_0, s_1, …) or an explicit list of column names. Types cycle through numeric, categorical, boolean.

  • seed – Random seed for reproducibility.

Returns:

A pd.DataFrame with columns id plus one column per feature.

Examples:

from tanat.dataset.simulation import simulate_static

static_df = simulate_static(n_ids=50, features=["age", "group"], seed=0)
static_df.head()  # one row per ID: id, age, group

tanat.dataset.simulation.trajectories module

simulate_trajectories: generate synthetic multi-sequence trajectory data.

tanat.dataset.simulation.trajectories.simulate_trajectories(sequences: dict[str, dict], *, shared_ids: bool = True, seed: int | None = None) -> dict[str, DataFrame]

Generate synthetic data for multiple sequence types at once.

Convenience wrapper that calls simulate_events, simulate_intervals, or simulate_states for each entry in sequences.

Parameters:
  • sequences – Mapping of {alias: config_dict}. Each config_dict must contain a "type" key ("event", "interval", or "state") and may contain any keyword accepted by the corresponding simulate_* function, including features to name the entity-level columns.

  • shared_ids – When True, all generated sequences use the same ID space (1..n_ids). When False, each sequence gets its own independent ID range.

  • seed – Master seed. Per-sequence seeds are derived deterministically when individual configs omit seed.

Returns:

Dict of {alias: DataFrame} matching the input keys, ready to be piped into build_* and then build_trajectories.

Use simulate_static() separately to generate per-trajectory static data.

Raises:
  • ValueError – When shared_ids=True and n_ids values differ across sequence configs.

  • ValueError – When an unknown sequence type is provided.
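The two documented error conditions can be sketched as a pre-flight check on the sequences mapping. This is an illustration, not tanat's code: validate_sequences is a hypothetical helper, and counting a config that omits n_ids as using the default of 100 is an assumption.

```python
VALID_TYPES = {"event", "interval", "state"}
DEFAULT_N_IDS = 100  # default n_ids in the simulate_* signatures


def validate_sequences(sequences, shared_ids=True):
    """Raise ValueError for the two documented failure cases.

    Hypothetical helper; the real checks may differ in order and wording.
    """
    n_ids_seen = set()
    for alias, config in sequences.items():
        seq_type = config.get("type")
        if seq_type not in VALID_TYPES:
            raise ValueError(f"Unknown sequence type {seq_type!r} for {alias!r}")
        # Assumption: a missing n_ids means the default value applies.
        n_ids_seen.add(config.get("n_ids", DEFAULT_N_IDS))
    if shared_ids and len(n_ids_seen) > 1:
        raise ValueError(
            f"shared_ids=True requires equal n_ids, got {sorted(n_ids_seen)}"
        )


# Passes silently: both configs use the same n_ids.
validate_sequences(
    {
        "admissions": {"type": "interval", "n_ids": 500},
        "procedures": {"type": "event", "n_ids": 500},
    }
)
```

Running such a check before generating any data keeps a misconfigured mapping from producing some frames and then failing halfway through.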

Examples:

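from tanat.dataset.simulation import simulate_trajectories

# n_ids matches across configs, as required when shared_ids=True (the default).
data = simulate_trajectories(
    sequences={
        "admissions": {"type": "interval", "n_ids": 500},
        "procedures": {"type": "event", "n_ids": 500},
    },
    seed=42,
)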
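One plausible way to derive per-sequence seeds deterministically from a master seed is sketched below. The actual derivation scheme is not documented here, so this is only an assumption; derive_seeds is a hypothetical helper built on the standard random module.

```python
import random


def derive_seeds(master_seed, sequences):
    """Return {alias: seed}, keeping explicit seeds and deriving the rest.

    Hypothetical helper; the library may derive seeds differently.
    """
    rng = random.Random(master_seed)
    seeds = {}
    for alias, config in sequences.items():
        # Always draw, so giving one config an explicit seed
        # does not shift the seeds derived for the others.
        derived = rng.randrange(2**32)
        explicit = config.get("seed")
        seeds[alias] = derived if explicit is None else explicit
    return seeds


sequences = {
    "admissions": {"type": "interval", "n_ids": 500},
    "procedures": {"type": "event", "n_ids": 500, "seed": 7},
}
seeds = derive_seeds(42, sequences)
```

With this scheme the same master seed and the same mapping order always give the same per-sequence seeds, while each sequence still draws from its own random stream.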

Module contents

Simulation sub-package: synthetic data generation.

tanat.dataset.simulation.simulate_events(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) -> DataFrame


tanat.dataset.simulation.simulate_intervals(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (1, 30), allow_overlaps: bool = True, features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) -> DataFrame


tanat.dataset.simulation.simulate_states(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (5, 45), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) -> DataFrame


tanat.dataset.simulation.simulate_static(*, n_ids: int = 100, features: int | list[str] = 2, seed: int | None = None) -> DataFrame


tanat.dataset.simulation.simulate_trajectories(sequences: dict[str, dict], *, shared_ids: bool = True, seed: int | None = None) -> dict[str, DataFrame]
