tanat.dataset package

Subpackages

Module contents

Dataset package.

tanat.dataset.access(data_type: str, cache_dir: Path | None = None, force: bool = False) → Any

Access a dataset from Zenodo and return a ready-to-use object.

Depending on the dataset type, this may return:

- a DataFrame (for file-based datasets: CSV, Parquet, etc.)
- a database handle (for SQL-based datasets; for "mimic4", the Path to the SQLite .db file)

Data is cached locally after the first download unless force=True is set.

Parameters:

data_type (str) – Name of the dataset registered in ZenodoAccessor.
cache_dir (Path, optional) – Directory used for caching. Defaults to the system temp directory.
force (bool) – If True, forces re-download even if data is already cached.

Returns:

A usable object for interacting with the dataset. The concrete type depends on the accessor implementation (e.g. a Path for SQL-based datasets such as "mimic4", or a pandas.DataFrame for CSV-based ones such as "mvad").

Return type:

Any

Raises:

ValueError – If data_type is not registered in the accessor.

Examples

>>> from pathlib import Path
>>> from tanat.dataset import access

>>> # Access the MVAD CSV dataset as a DataFrame
>>> df = access("mvad")

>>> # Access the mimic4 SQLite database: returns the Path to the .db file
>>> db_path: Path = access("mimic4")
>>> DB = f"sqlite:///{db_path}"  # SQLAlchemy-compatible URL
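The Path returned for a SQL-based dataset can also be opened directly with the standard-library sqlite3 module. The sketch below is illustrative only: it builds a small throwaway database to stand in for the file that access("mimic4") would return, and both the helper list_tables and the table name demo are hypothetical, not part of tanat or of the real dataset.

```python
import sqlite3
import tempfile
from pathlib import Path


def list_tables(db_path: Path) -> list[str]:
    """Return the names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Stand-in for `db_path = access("mimic4")`: a temporary file holding one
# hypothetical table, so the sketch runs without downloading anything.
tmp_dir = Path(tempfile.mkdtemp())
db_path = tmp_dir / "example.db"
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE demo (id INTEGER PRIMARY KEY, label TEXT)")
conn.commit()
conn.close()

tables = list_tables(db_path)
print(tables)
```

With the real dataset, the same helper would be called on the path returned by access("mimic4") to discover which tables are available before querying them.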

tanat.dataset.simulate_events(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) → DataFrame

Generate synthetic event sequence data.

Returns a pd.DataFrame with one row per event (entity).

The DataFrame contains columns: id (int64), time (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-event measurements attached to each sequence row.

Feature types are assigned by cycling through numeric, categorical and boolean in that order.

Use simulate_static() to generate a separate per-sequence static DataFrame when needed.

Parameters:

n_ids – Number of distinct sequence IDs to generate.
seq_length_range – (min, max) number of events per ID (inclusive).
features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or explicit list of column names. These become the per-event (entity-level) measurements in the pool.
time_range – (start, end) datetime bounds for generated timestamps. Defaults to 2000-01-01 to 2025-01-01.
seed – Random seed for reproducibility.

Returns:

A pd.DataFrame with columns [id, time, <features>].

Examples:

from tanat.dataset import simulate_events

df = simulate_events(n_ids=50, seed=42)  # two auto-named features
df = simulate_events(n_ids=50, features=["value", "category"], seed=42)  # explicit names
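The type-cycling rule described above (numeric, categorical, boolean, then back to numeric) can be sketched in a few lines. This is a minimal illustration of the documented rule, not the library's implementation; the helper assign_feature_types is hypothetical and only maps column names to type labels, it does not generate any values.

```python
from __future__ import annotations

from itertools import cycle

# Order stated in the documentation above.
TYPE_ORDER = ("numeric", "categorical", "boolean")


def assign_feature_types(features: int | list[str], prefix: str = "f_") -> dict[str, str]:
    """Map each feature column name to a type label by cycling through TYPE_ORDER."""
    if isinstance(features, int):
        # An integer means "auto-name this many columns": f_0, f_1, ...
        names = [f"{prefix}{i}" for i in range(features)]
    else:
        names = list(features)
    # zip stops when the names run out, so the endless cycle is safe here.
    return dict(zip(names, cycle(TYPE_ORDER)))


auto_named = assign_feature_types(4)
explicit = assign_feature_types(["value", "category"])
print(auto_named)
print(explicit)
```

Under this rule, a fourth feature wraps around to numeric again, and an explicit list such as ["value", "category"] would yield one numeric and one categorical column.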

tanat.dataset.simulate_intervals(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (1, 30), allow_overlaps: bool = True, features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) → DataFrame

Generate synthetic interval sequence data.

Returns a pd.DataFrame with one row per interval (entity).

The DataFrame contains columns: id (int64), start (datetime64[us]), end (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-interval measurements attached to each sequence row.

Feature types are assigned by cycling through numeric, categorical and boolean in that order.

Use simulate_static() to generate a separate per-sequence static DataFrame when needed.

Parameters:

n_ids – Number of distinct sequence IDs to generate.
seq_length_range – (min, max) number of intervals per ID.
duration_range – (min, max) interval duration in days.
allow_overlaps – When True, intervals within an ID may overlap. When False, each interval starts after the previous one ends.
features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or explicit list of column names. These become the per-interval (entity-level) measurements in the pool.
time_range – (start, end) datetime bounds. Defaults to 2000-01-01 to 2025-01-01.
seed – Random seed for reproducibility.

Returns:

A pd.DataFrame with columns [id, start, end, <features>].

Examples:

from tanat.dataset import simulate_intervals

df = simulate_intervals(n_ids=200, allow_overlaps=False, seed=0)  # no overlaps within an ID
df = simulate_intervals(n_ids=50, features=["duration_days", "label"], seed=0)  # explicit names
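To make the allow_overlaps behaviour concrete, the following sketch checks which IDs contain overlapping intervals. It works on plain (id, start, end) tuples so that it runs without tanat or pandas. The helper ids_with_overlaps is hypothetical, and it assumes that an interval starting exactly when the previous one ends does not count as an overlap; the documentation does not say how the library treats that boundary case.

```python
from __future__ import annotations

from collections import defaultdict
from datetime import datetime
from typing import Iterable


def ids_with_overlaps(rows: Iterable[tuple[int, datetime, datetime]]) -> set[int]:
    """Return the IDs that have at least one pair of overlapping intervals.

    Assumption: an interval that starts exactly when the previous one ends
    is treated as non-overlapping.
    """
    by_id: dict[int, list[tuple[datetime, datetime]]] = defaultdict(list)
    for seq_id, start, end in rows:
        by_id[seq_id].append((start, end))

    overlapping: set[int] = set()
    for seq_id, spans in by_id.items():
        spans.sort()  # order by start time
        latest_end = spans[0][1]
        for start, end in spans[1:]:
            if start < latest_end:
                overlapping.add(seq_id)
                break
            latest_end = max(latest_end, end)
    return overlapping


rows = [
    (1, datetime(2020, 1, 1), datetime(2020, 1, 10)),
    (1, datetime(2020, 1, 10), datetime(2020, 1, 20)),  # touches, no overlap
    (2, datetime(2020, 1, 1), datetime(2020, 1, 10)),
    (2, datetime(2020, 1, 5), datetime(2020, 1, 8)),  # starts inside the first
]
result = ids_with_overlaps(rows)
print(result)
```

With a DataFrame produced by simulate_intervals, the rows could be supplied as df[["id", "start", "end"]].itertuples(index=False, name=None); for data generated with allow_overlaps=False one would expect an empty set.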

tanat.dataset.simulate_states(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (5, 45), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) → DataFrame

Generate synthetic contiguous state sequence data.

Returns a pd.DataFrame with one row per state (entity).

States are strictly contiguous: end[i] == start[i+1] within each ID. The end column of the last state per ID is set to the end of time_range.

Feature types are assigned by cycling through numeric, categorical and boolean in that order.

Use simulate_static() to generate a separate per-sequence static DataFrame when needed.

Parameters:

n_ids – Number of distinct sequence IDs to generate.
seq_length_range – (min, max) number of states per ID.
duration_range – (min, max) state duration in days.
features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or explicit list of column names. These become the per-state (entity-level) measurements in the pool.
time_range – (start, end) datetime bounds. Defaults to 2000-01-01 to 2025-01-01.
seed – Random seed for reproducibility.

Returns:

A pd.DataFrame with columns [id, start, end, <features>].

Examples:

from tanat.dataset import simulate_states

df = simulate_states(n_ids=50, seed=42)  # two auto-named features
df = simulate_states(n_ids=50, features=["score", "status"], seed=42)  # explicit names
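The contiguity rule stated above (end[i] == start[i+1], with the last end pinned to the end of time_range) can be illustrated with a small sketch. The helper contiguous_states is hypothetical and is not how the library is known to generate states; in particular, discarding the last duration in favour of range_end is an assumption made here to satisfy the documented rule.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def contiguous_states(
    start: datetime, durations_days: list[int], range_end: datetime
) -> list[tuple[datetime, datetime]]:
    """Lay out states back to back so that end[i] == start[i + 1].

    Assumption: the last duration is ignored because the final state's end
    is pinned to range_end.
    """
    starts = [start]
    for days in durations_days[:-1]:
        starts.append(starts[-1] + timedelta(days=days))
    # Each state ends where the next one starts; the last ends at range_end.
    ends = starts[1:] + [range_end]
    return list(zip(starts, ends))


range_end = datetime(2025, 1, 1)
states = contiguous_states(datetime(2000, 1, 1), [10, 20, 5], range_end)
for state_start, state_end in states:
    print(state_start.date(), "->", state_end.date())
```

The same property can be verified on generated data by comparing each row's end with the next row's start within an ID.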

tanat.dataset.simulate_static(*, n_ids: int = 100, features: int | list[str] = 2, seed: int | None = None) → DataFrame

Generate a synthetic static DataFrame with one row per sequence ID.

This function is independent of any temporal simulation: it can be used alongside any simulate_events, simulate_intervals or simulate_states call that shares the same n_ids. IDs are generated as consecutive integers 1 … n_ids.

Feature types are assigned by cycling through numeric, categorical and boolean in that order.

Parameters:

n_ids – Number of distinct sequence IDs (and rows) to generate.
features – Number of feature columns to generate (auto-named s_0, s_1, …) or an explicit list of column names. Types cycle through numeric, categorical, boolean.
seed – Random seed for reproducibility.

Returns:

A pd.DataFrame with columns id plus one column per feature.

Examples:

from tanat.dataset import simulate_static

static_df = simulate_static(n_ids=50, features=["age", "group"], seed=0)
static_df.head()  # one row per sequence ID
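Because the static table has one row per sequence ID, it relates to a temporal table through the shared id column: every event, interval or state row can look up the static attributes of its sequence. The sketch below shows that lookup on plain dictionaries so that it runs without tanat or pandas. All rows and values are made up for illustration.

```python
# Hand-written stand-ins for a per-event table and a static table.
# The values are invented; only the shared "id" column matters here.
events = [
    {"id": 1, "value": 0.3},
    {"id": 1, "value": 0.7},
    {"id": 2, "value": 0.1},
]
static = [
    {"id": 1, "age": 34, "group": "a"},
    {"id": 2, "age": 51, "group": "b"},
]

# Index the static rows by ID, then attach them to each event row.
static_by_id = {row["id"]: row for row in static}
joined = [{**event, **static_by_id[event["id"]]} for event in events]

for row in joined:
    print(row)
```

With pandas DataFrames the equivalent operation would be events_df.merge(static_df, on="id", how="left"), where events_df and static_df stand for the outputs of a simulate_* call and simulate_static().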

tanat.dataset.simulate_trajectories(sequences: dict[str, dict], *, shared_ids: bool = True, seed: int | None = None) → dict[str, DataFrame]

Generate synthetic data for multiple sequence types at once.

Convenience wrapper that calls simulate_events, simulate_intervals, or simulate_states for each entry in sequences.

Parameters:

sequences – Mapping of {alias: config_dict}. Each config_dict must contain a "type" key ("event", "interval", or "state") and may contain any keyword accepted by the corresponding simulate_* function, including features to name the entity-level columns.
shared_ids – When True, all generated sequences use the same ID space (1..n_ids). When False, each sequence gets its own independent ID range.
seed – Master seed. Per-sequence seeds are derived deterministically when individual configs omit seed.

Returns:

Dict of {alias: DataFrame} matching the input keys, ready to be piped into build_* and then build_trajectories.

Use simulate_static() separately to generate per-trajectory static data.

Raises:

ValueError – When shared_ids=True and n_ids values differ across sequence configs.
ValueError – When an unknown sequence type is provided.

Examples:

from tanat.dataset import simulate_trajectories

data = simulate_trajectories(
    sequences={
        "admissions": {"type": "interval", "n_ids": 500},
        "procedures": {"type": "event", "n_ids": 500},
    },
    seed=42,
)
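The two documented error conditions and the idea of a master seed can be sketched as follows. Both helpers, check_configs and derive_seeds, are hypothetical. The documentation only says that per-sequence seeds are derived deterministically, so drawing them from a random.Random seeded with the master seed is one possible scheme, assumed here for illustration.

```python
from __future__ import annotations

import random

VALID_TYPES = {"event", "interval", "state"}
DEFAULT_N_IDS = 100  # default n_ids of the simulate_* functions documented above


def check_configs(sequences: dict[str, dict], shared_ids: bool = True) -> None:
    """Raise ValueError for the two documented error conditions."""
    for alias, config in sequences.items():
        if config.get("type") not in VALID_TYPES:
            raise ValueError(f"{alias!r}: unknown sequence type {config.get('type')!r}")
    if shared_ids:
        sizes = {config.get("n_ids", DEFAULT_N_IDS) for config in sequences.values()}
        if len(sizes) > 1:
            raise ValueError(f"shared_ids=True requires equal n_ids, got {sorted(sizes)}")


def derive_seeds(sequences: dict[str, dict], seed: int | None) -> dict[str, int | None]:
    """Give each config without its own seed one drawn from a master RNG (assumed scheme)."""
    rng = random.Random(seed)
    seeds: dict[str, int | None] = {}
    for alias, config in sequences.items():
        if "seed" in config:
            seeds[alias] = config["seed"]  # an explicit per-sequence seed wins
        elif seed is None:
            seeds[alias] = None  # no master seed: leave the sequence unseeded
        else:
            seeds[alias] = rng.randrange(2**32)
    return seeds


configs = {
    "admissions": {"type": "interval", "n_ids": 500},
    "procedures": {"type": "event", "n_ids": 500},
}
check_configs(configs)  # passes: known types, equal n_ids
seeds = derive_seeds(configs, seed=42)
print(seeds)
```

Calling derive_seeds twice with the same master seed yields the same per-sequence seeds, which is the property that makes the overall call reproducible.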