tanat.dataset package
Subpackages
- tanat.dataset.access package
- tanat.dataset.simulation package
Module contents
Dataset package.
- tanat.dataset.access(data_type: str, cache_dir: Path | None = None, force: bool = False) -> Any
Access a dataset from Zenodo and return a ready-to-use object.
Depending on the type, this may return:
- a DataFrame (for file-based datasets: CSV, Parquet, etc.)
- a path to the database file (for SQL-based datasets)
Data is cached locally after the first download unless force=True is set.
- Parameters:
data_type (str) – Name of the dataset registered in ZenodoAccessor.
cache_dir (Path, optional) – Directory used for caching. Defaults to the system temp directory.
force (bool) – If True, forces re-download even if data is already cached.
- Returns:
A usable object for interacting with the dataset. The concrete type depends on the accessor implementation (e.g. a Path for SQL-based datasets such as "mimic4", or a pandas.DataFrame for CSV-based ones such as "mvad").
- Return type:
Any
- Raises:
ValueError – If data_type is not registered in the accessor.
Examples
>>> # Access the MVAD CSV dataset as a DataFrame
>>> df = access("mvad")
>>> # Access the mimic4 SQLite database: returns Path to the .db file
>>> db_path: Path = access("mimic4")
>>> DB = f"sqlite:///{db_path}"  # SQLAlchemy-compatible URL
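The path returned for a SQLite-backed dataset can also be opened directly with the standard library. The following is a minimal sketch, not part of tanat: `list_tables` is a hypothetical helper, and a throwaway database file stands in for the result of `access("mimic4")` so that the snippet runs without a download.

```python
import sqlite3
import tempfile
from pathlib import Path


def list_tables(db_path: Path) -> list[str]:
    """Return the table names found in the SQLite file at db_path."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Stand-in for `db_path = access("mimic4")`: a temporary demo database.
# The table name below is made up for illustration.
db_path = Path(tempfile.mkdtemp()) / "demo.db"
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE admissions (id INTEGER, start TEXT)")
conn.commit()
conn.close()

tables = list_tables(db_path)
url = f"sqlite:///{db_path}"  # same URL pattern as in the example above
```

Listing the tables first is a cheap way to confirm that the cached file is a valid database before handing the URL to another tool.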
- tanat.dataset.simulate_events(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) -> DataFrame
Generate synthetic event sequence data.
Returns a pd.DataFrame with one row per event (entity).
The DataFrame contains columns: id (int64), time (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-event measurements attached to each sequence row.
Feature types are assigned by cycling through numeric, categorical and boolean in that order.
Use simulate_static() to generate a separate per-sequence static DataFrame when needed.
- Parameters:
n_ids – Number of distinct sequence IDs to generate.
seq_length_range – (min, max) number of events per ID (inclusive).
features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or an explicit list of column names. These become the per-event (entity-level) measurements in the pool.
time_range – (start, end) datetime bounds for generated timestamps. Defaults to 2000-01-01 to 2025-01-01.
seed – Random seed for reproducibility.
- Returns:
A pd.DataFrame with columns [id, time, <features>].
Examples:
df = simulate_events(n_ids=50, seed=42)
df = simulate_events(n_ids=50, features=["value", "category"], seed=42)
- tanat.dataset.simulate_intervals(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (1, 30), allow_overlaps: bool = True, features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) -> DataFrame
Generate synthetic interval sequence data.
Returns a pd.DataFrame with one row per interval (entity).
The DataFrame contains columns: id (int64), start (datetime64[us]), end (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-interval measurements attached to each sequence row.
Feature types are assigned by cycling through numeric, categorical and boolean in that order.
Use simulate_static() to generate a separate per-sequence static DataFrame when needed.
- Parameters:
n_ids – Number of distinct sequence IDs to generate.
seq_length_range – (min, max) number of intervals per ID.
duration_range – (min, max) interval duration in days.
allow_overlaps – When True, intervals within an ID may overlap. When False, each interval starts after the previous one ends.
features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or an explicit list of column names. These become the per-interval (entity-level) measurements in the pool.
time_range – (start, end) datetime bounds. Defaults to 2000-01-01 to 2025-01-01.
seed – Random seed for reproducibility.
- Returns:
A pd.DataFrame with columns [id, start, end, <features>].
Examples:
df = simulate_intervals(n_ids=200, allow_overlaps=False, seed=0)
df = simulate_intervals(n_ids=50, features=["duration_days", "label"], seed=0)
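When relying on allow_overlaps=False, it can be useful to verify the property on the generated rows. The sketch below is a hypothetical checker, not a tanat function; it works on plain (id, start, end) tuples and assumes that an interval starting exactly when the previous one ends does not count as an overlap.

```python
from collections import defaultdict
from datetime import datetime


def has_overlaps(rows: list[tuple[int, datetime, datetime]]) -> bool:
    """Return True if any two intervals sharing an ID overlap in time."""
    by_id = defaultdict(list)
    for seq_id, start, end in rows:
        by_id[seq_id].append((start, end))
    for spans in by_id.values():
        spans.sort()  # order by start time, then end time
        # After sorting, any overlap shows up between two adjacent intervals.
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start < prev_end:
                return True
    return False


rows_ok = [
    (1, datetime(2020, 1, 1), datetime(2020, 1, 5)),
    (1, datetime(2020, 1, 5), datetime(2020, 1, 9)),
    (2, datetime(2020, 1, 1), datetime(2020, 1, 10)),
]
# A second interval for ID 2 that falls inside its first one.
rows_bad = rows_ok + [(2, datetime(2020, 1, 3), datetime(2020, 1, 4))]

ok_result = has_overlaps(rows_ok)
bad_result = has_overlaps(rows_bad)
```

With a DataFrame, the same tuples could be obtained from the id, start and end columns before calling the checker.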
- tanat.dataset.simulate_states(*, n_ids: int = 100, seq_length_range: tuple[int, int] = (3, 10), duration_range: tuple[int, int] = (5, 45), features: int | list[str] = 2, time_range: tuple[datetime, datetime] | None = None, seed: int | None = None) -> DataFrame
Generate synthetic contiguous state sequence data.
Returns a pd.DataFrame with one row per state (entity).
States are strictly contiguous: end[i] == start[i+1] within each ID. The end column of the last state per ID is set to the end of time_range.
The DataFrame contains columns: id (int64), start (datetime64[us]), end (datetime64[us]), plus one column per entity feature. The features argument controls the entity-level columns, i.e. the per-state measurements attached to each sequence row.
Feature types are assigned by cycling through numeric, categorical and boolean in that order.
Use simulate_static() to generate a separate per-sequence static DataFrame when needed.
- Parameters:
n_ids – Number of distinct sequence IDs to generate.
seq_length_range – (min, max) number of states per ID.
duration_range – (min, max) state duration in days.
features – Number of entity feature columns to generate (auto-named f_0, f_1, …) or an explicit list of column names. These become the per-state (entity-level) measurements in the pool.
time_range – (start, end) datetime bounds. Defaults to 2000-01-01 to 2025-01-01.
seed – Random seed for reproducibility.
- Returns:
A pd.DataFrame with columns [id, start, end, <features>].
Examples:
df = simulate_states(n_ids=50, seed=42)
df = simulate_states(n_ids=50, features=["score", "status"], seed=42)
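The contiguity rule can be made concrete with a small construction. This is a sketch of the documented invariant only: `contiguous_states` is a hypothetical helper, and how tanat actually picks the first start time and the durations is not stated here, so both are passed in explicitly.

```python
from datetime import datetime, timedelta


def contiguous_states(
    first_start: datetime,
    durations_days: list[int],
    range_end: datetime,
) -> list[tuple[datetime, datetime]]:
    """Build (start, end) pairs for one ID so that end[i] == start[i+1]."""
    rows = []
    start = first_start
    for days in durations_days:
        end = start + timedelta(days=days)
        rows.append((start, end))
        start = end  # the next state begins exactly where this one ends
    if rows:
        # As documented, the last state's end is set to the end of time_range.
        rows[-1] = (rows[-1][0], range_end)
    return rows


states = contiguous_states(
    first_start=datetime(2020, 1, 1),
    durations_days=[5, 10, 7],
    range_end=datetime(2025, 1, 1),
)
# Check the invariant between each state and its successor.
is_contiguous = all(a[1] == b[0] for a, b in zip(states, states[1:]))
```

Note that in this sketch the duration drawn for the last state is discarded, since its end is overwritten by the end of the time range.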
- tanat.dataset.simulate_static(*, n_ids: int = 100, features: int | list[str] = 2, seed: int | None = None) -> DataFrame
Generate a synthetic static DataFrame with one row per sequence ID.
This function is independent of any temporal simulation: it can be used alongside any simulate_events, simulate_intervals or simulate_states call that shares the same n_ids. IDs are generated as consecutive integers 1 … n_ids.
Feature types are assigned by cycling through numeric, categorical and boolean in that order.
- Parameters:
n_ids – Number of distinct sequence IDs (and rows) to generate.
features – Number of feature columns to generate (auto-named s_0, s_1, …) or an explicit list of column names. Types cycle through numeric, categorical, boolean.
seed – Random seed for reproducibility.
- Returns:
A pd.DataFrame with columns id plus one column per feature.
Examples:
static_df = simulate_static(n_ids=50, features=["age", "group"], seed=0)
static_df.head()
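Because static rows and temporal rows share the same ID space when n_ids matches, the two can be combined on id. The sketch below shows the idea with plain dictionaries and made-up values rather than real simulator output; with DataFrames, a left merge on the id column would play the same role.

```python
# Hand-written stand-ins for simulated rows (values are illustrative only).
events = [
    {"id": 1, "time": "2020-01-01", "value": 0.4},
    {"id": 1, "time": "2020-02-01", "value": 0.9},
    {"id": 2, "time": "2020-01-15", "value": 0.1},
]
static = [
    {"id": 1, "age": 34, "group": "a"},
    {"id": 2, "age": 58, "group": "b"},
]

# Index the static rows by ID: one row per sequence ID.
static_by_id = {row["id"]: row for row in static}

# Attach the per-sequence attributes to every event of that sequence.
enriched = [{**event, **static_by_id[event["id"]]} for event in events]
```

The lookup raises KeyError if an event refers to an ID missing from the static data, which is one way to notice mismatched n_ids values early.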
- tanat.dataset.simulate_trajectories(sequences: dict[str, dict], *, shared_ids: bool = True, seed: int | None = None) -> dict[str, DataFrame]
Generate synthetic data for multiple sequence types at once.
Convenience wrapper that calls simulate_events, simulate_intervals, or simulate_states for each entry in sequences.
- Parameters:
sequences – Mapping of {alias: config_dict}. Each config_dict must contain a "type" key ("event", "interval", or "state") and may contain any keyword accepted by the corresponding simulate_* function, including features to name the entity-level columns.
shared_ids – When True, all generated sequences use the same ID space (1..n_ids). When False, each sequence gets its own independent ID range.
seed – Master seed. Per-sequence seeds are derived deterministically when individual configs omit seed.
- Returns:
Dict of {alias: DataFrame} matching the input keys, ready to be piped into build_* and then build_trajectories.
Use simulate_static() separately to generate per-trajectory static data.
- Raises:
ValueError – When shared_ids=True and n_ids values differ across sequence configs.
ValueError – When an unknown sequence type is provided.
Examples:
data = simulate_trajectories(
    sequences={
        "admissions": {"type": "interval", "n_ids": 500},
        "procedures": {"type": "event", "n_ids": 500},
    },
    seed=42,
)
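The validation and seeding rules described above can be sketched without calling the library. Everything below is hypothetical: `plan_sequences` is not a tanat function, the derivation of per-sequence seeds from a master `random.Random` is only one possible deterministic scheme, and the fallback of 100 for a missing n_ids mirrors the default shown in the simulate_* signatures.

```python
import random

VALID_TYPES = {"event", "interval", "state"}


def plan_sequences(sequences, *, shared_ids=True, seed=None):
    """Validate configs and return {alias: (type, kwargs)} with seeds filled in."""
    for alias, cfg in sequences.items():
        if cfg.get("type") not in VALID_TYPES:
            raise ValueError(f"unknown sequence type for {alias!r}: {cfg.get('type')!r}")
    if shared_ids:
        # A shared ID space only makes sense if every config asks for the same n_ids.
        n_ids_values = {cfg.get("n_ids", 100) for cfg in sequences.values()}
        if len(n_ids_values) > 1:
            raise ValueError(f"n_ids values differ across configs: {sorted(n_ids_values)}")
    master = random.Random(seed)
    plan = {}
    for alias, cfg in sequences.items():
        kwargs = {key: value for key, value in cfg.items() if key != "type"}
        if "seed" not in kwargs:
            # Assumed scheme: draw one child seed per config from the master generator.
            kwargs["seed"] = master.randrange(2**32)
        plan[alias] = (cfg["type"], kwargs)
    return plan


config = {
    "admissions": {"type": "interval", "n_ids": 500},
    "procedures": {"type": "event", "n_ids": 500},
}
plan_a = plan_sequences(config, seed=42)
plan_b = plan_sequences(config, seed=42)
```

With a fixed master seed, two runs produce identical per-sequence seeds, which is the property the seed parameter is described as providing.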