tanat.sequence package
Subpackages
- tanat.sequence.base package
- Submodules
- tanat.sequence.base.entity module
- tanat.sequence.base.pool module
SequencePool: SequencePool.MAX_BINS_LIMIT, SequencePool.__init__(), SequencePool.add_entity_features(), SequencePool.add_static_features(), SequencePool.apply(), SequencePool.binned_data(), SequencePool.cast_features(), SequencePool.cast_id(), SequencePool.cast_to_datetime(), SequencePool.cast_to_timestep(), SequencePool.copy(), SequencePool.describe(), SequencePool.drop_features(), SequencePool.extend(), SequencePool.filter_entities(), SequencePool.from_parent(), SequencePool.get_sequences(), SequencePool.is_dirty, SequencePool.save(), SequencePool.set_t0(), SequencePool.static_data(), SequencePool.subset(), SequencePool.survival_target(), SequencePool.t0_data(), SequencePool.temporal_data(), SequencePool.to_dummies(), SequencePool.to_tensor(), SequencePool.train_test_split(), SequencePool.unique_ids, SequencePool.which()
- tanat.sequence.base.sequence module
- tanat.sequence.base.settings module
SequenceSettings: SequenceSettings.__init__(), SequenceSettings.available_features(), SequenceSettings.entity_features, SequenceSettings.get_column_rename_map(), SequenceSettings.get_time_columns(), SequenceSettings.id_column, SequenceSettings.is_compatible_with(), SequenceSettings.model_dump(), SequenceSettings.normalize_entity_features(), SequenceSettings.normalize_static_features(), SequenceSettings.static_features, SequenceSettings.validate_compatibility(), SequenceSettings.validate_features()
- tanat.sequence.base.view_mixin module
SequenceFrameAssembler: SequenceFrameAssembler.__init__(), SequenceFrameAssembler.id_time_index(), SequenceFrameAssembler.ids(), SequenceFrameAssembler.merged_for_extend(), SequenceFrameAssembler.select(), SequenceFrameAssembler.static(), SequenceFrameAssembler.static_for_store(), SequenceFrameAssembler.temporal(), SequenceFrameAssembler.temporal_for_store()
SequenceViewMixin
- Module contents
- tanat.sequence.type package
- Subpackages
- Module contents
Submodules
tanat.sequence.shortcuts module
Quick-build helpers for sequence pools.
- tanat.sequence.shortcuts.build_events(temporal_data: DataFrame | DataFrame | LazyFrame, *, id_column: str, time_column: str, static_data: DataFrame | DataFrame | LazyFrame | None = None, store_name: str | None = None) EventSequencePool[source]#
Build an EventSequencePool from a single DataFrame.
All columns in temporal_data except id_column and time_column are treated as entity features. All columns in static_data except id_column are treated as static features.
- Parameters:
temporal_data – DataFrame or LazyFrame with at least id, time, and one feature column.
id_column – Name of the sequence identifier column (present in both temporal_data and static_data if provided).
time_column – Name of the timestamp column.
static_data – Optional DataFrame or LazyFrame with per-id static features. Must contain a column named id_column for joining.
store_name – Name for the on-disk store. When None, a unique name is generated automatically (_quick_event_<hex8>).
- Returns:
A ready-to-use EventSequencePool.
- Raises:
ValueError – If id_column or time_column are missing, if no feature columns remain, or if id_column is absent from static_data.
Examples:
    pool = build_events(df, id_column="patient", time_column="date")
    pool.temporal_data(fmt="polars").head()
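The column-partitioning rule described for build_events can be sketched without the library. The helper name split_event_columns is hypothetical and is not part of tanat; it only mirrors the documented rule (structural columns are set aside, everything else becomes an entity feature, and a ValueError is raised when a required column is missing or no feature remains).

```python
def split_event_columns(columns, id_column, time_column):
    """Hypothetical helper: mimic the documented column-partitioning rule."""
    missing = [c for c in (id_column, time_column) if c not in columns]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    # Everything that is not structural is treated as an entity feature.
    features = [c for c in columns if c not in (id_column, time_column)]
    if not features:
        raise ValueError("no feature columns remain")
    return features


entity_features = split_event_columns(
    ["patient", "date", "diagnosis", "dose"],
    id_column="patient",
    time_column="date",
)
```

Here entity_features holds the two non-structural column names; a frame with only the id and time columns would be rejected.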
- tanat.sequence.shortcuts.build_intervals(temporal_data: DataFrame | DataFrame | LazyFrame, *, id_column: str, start_column: str, end_column: str, static_data: DataFrame | DataFrame | LazyFrame | None = None, store_name: str | None = None) IntervalSequencePool[source]#
Build an IntervalSequencePool from a single DataFrame.
All columns in temporal_data except id_column, start_column, and end_column are treated as entity features.
- Parameters:
temporal_data – DataFrame or LazyFrame with at least id, start, end, and one feature column.
id_column – Name of the sequence identifier column.
start_column – Name of the interval start column.
end_column – Name of the interval end column.
static_data – Optional DataFrame or LazyFrame with per-id static features.
store_name – Name for the on-disk store. When None, a unique name is generated automatically (_quick_interval_<hex8>).
- Returns:
A ready-to-use IntervalSequencePool.
- Raises:
ValueError – If required columns are missing, if no feature columns remain, or if id_column is absent from static_data.
Examples:
    pool = build_intervals(
        df,
        id_column="id",
        start_column="start",
        end_column="end",
    )
    pool.temporal_data(fmt="polars").head()
- tanat.sequence.shortcuts.build_states(temporal_data: DataFrame | DataFrame | LazyFrame, *, id_column: str, start_column: str, end_column: str | None = None, static_data: DataFrame | DataFrame | LazyFrame | None = None, store_name: str | None = None) StateSequencePool[source]#
Build a StateSequencePool from a single DataFrame.
When end_column is None, the end of each state is derived from the start of the next state (the last state stays open-ended with null).
All columns in temporal_data except the structural columns (id, start, and optionally end) are treated as entity features.
- Parameters:
temporal_data – DataFrame or LazyFrame with at least id, start, and one feature column.
id_column – Name of the sequence identifier column.
start_column – Name of the state start column.
end_column – Name of the state end column. When None, the builder derives end values automatically.
static_data – Optional DataFrame or LazyFrame with per-id static features.
store_name – Name for the on-disk store. When None, a unique name is generated automatically (_quick_state_<hex8>).
- Returns:
A ready-to-use StateSequencePool.
- Raises:
ValueError – If required columns are missing, if no feature columns remain, or if id_column is absent from static_data.
Examples:
    pool = build_states(df, id_column="id", start_column="start")
    pool.temporal_data(fmt="polars").head()
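The end-derivation rule used when end_column is None (each state ends where the next one starts, and the last state stays open) can be illustrated in plain Python. This is only a sketch of the documented behaviour; derive_state_ends and the tuple layout are assumptions made for the example, not tanat internals.

```python
from itertools import groupby
from operator import itemgetter


def derive_state_ends(rows):
    """Hypothetical sketch: rows are (seq_id, start, feature) tuples."""
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        group = list(group)
        # Each state ends where the next state of the same sequence starts;
        # the last state of a sequence keeps an open end (None).
        next_starts = [r[1] for r in group[1:]] + [None]
        for (seq_id, start, feature), end in zip(group, next_starts):
            out.append((seq_id, start, end, feature))
    return out


states = derive_state_ends([
    ("a", 0, "healthy"),
    ("a", 5, "sick"),
    ("b", 2, "healthy"),
])
```

The last row of each sequence carries None as its end, which corresponds to the open-ended null described above.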
Module contents
Sequence module entry point.
- class tanat.sequence.EventEntity(id_value, store: str | Path | SequenceStore, features: list[str] | None = None, *, rank: int, store_index: int, cast_recipe: SequenceCastRecipe | dict | None = None, virtual_id: str | None = None, parent_metadata: SequenceMetadata | None = None)[source]#
Bases: Entity
Entity representing one event row (single timestamp).
- class tanat.sequence.EventSequence(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', time_column: str = 'time', entity_features: list[str] | None = None, static_features: list[str] | None = None)[source]#
Bases: Sequence
A single event sequence (one timestamp per entity row).
- SETTINGS_CLASS[source]#
alias of EventSequenceSettings
- __init__(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', time_column: str = 'time', entity_features: list[str] | None = None, static_features: list[str] | None = None) None[source]#
Create an event sequence for id_value.
- Parameters:
id_value – Sequence identifier.
store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
time_column – User-facing name for the event timestamp column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.
- class tanat.sequence.EventSequencePool(store: str | Path | SequenceStore, *, id_column: str = 'id', time_column: str = 'time', entity_features: list[str] | None = None, static_features: list[str] | None = None, cast_recipe: SequenceCastRecipe | dict | None = None)[source]#
Bases: SequencePool
Pool of event sequences (single timestamp per entity row).
- SETTINGS_CLASS[source]#
alias of EventSequenceSettings
- __init__(store: str | Path | SequenceStore, *, id_column: str = 'id', time_column: str = 'time', entity_features: list[str] | None = None, static_features: list[str] | None = None, cast_recipe: SequenceCastRecipe | dict | None = None) None[source]#
Create an event sequence pool backed by store.
- Parameters:
store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
time_column – User-facing name for the event timestamp column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.
cast_recipe – Optional cast recipe (or dict) applied at read time. Normalised via SequenceCastRecipe.coerce() and probed eagerly.
- Raises:
TypeError – If cast_recipe is not a SequenceCastRecipe, dict, or None.
- as_event() EventSequencePool[source]#
Return this pool unchanged: source and target types are identical.
A warning is emitted to signal the no-op conversion.
- Returns:
self (no copy, no I/O).
- as_interval(duration: Duration, *, start_column: str = 'start', end_column: str = 'end', destination: str | Path | None = None, overwrite: bool = False) IntervalSequencePool[source]#
Convert this event pool to an interval pool by computing _t_end.
Each event timestamp becomes _t_start; _t_end is computed as _t_start + duration. The resulting time index is stored as a virtual override (ephemeral) or written to a new persistent store.
- Parameters:
duration – Interval length added to each event timestamp. Can be:
  - A timedelta or numeric scalar: applied uniformly to every event.
  - A str: name of an entity feature column whose values provide per-row durations.
start_column – User-facing name for the start column. Defaults to "start".
end_column – User-facing name for the end column. Defaults to "end".
destination – None → ephemeral result; path → new persistent store.
overwrite – Replace destination if it already exists.
- Returns:
A new IntervalSequencePool.
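The two accepted forms of duration (a uniform scalar, or the name of a per-row column) can be sketched with plain dictionaries. The function add_interval_end and the "stay" key are hypothetical names chosen for illustration; the sketch only reproduces the documented arithmetic end = start + duration and says nothing about how tanat stores the result.

```python
from datetime import datetime, timedelta


def add_interval_end(events, duration):
    """Hypothetical sketch: events is a list of dicts with a 'start' key.

    duration is either a timedelta applied to every row, or the name of a
    key holding a per-row timedelta.
    """
    out = []
    for row in events:
        step = row[duration] if isinstance(duration, str) else duration
        out.append({**row, "end": row["start"] + step})
    return out


events = [
    {"start": datetime(2024, 1, 1), "stay": timedelta(days=2)},
    {"start": datetime(2024, 1, 10), "stay": timedelta(days=1)},
]
uniform = add_interval_end(events, timedelta(hours=6))  # same length everywhere
per_row = add_interval_end(events, "stay")              # length read from each row
```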
- as_state(*, end_value: datetime | int | float | str | None = None, start_column: str = 'start', end_column: str = 'end', destination: str | Path | None = None, overwrite: bool = False) StateSequencePool[source]#
Convert this event pool to a state pool by computing _t_end.
Each event timestamp becomes _t_start; _t_end is taken from the next event in the same sequence (shift(-1).over(_seq_id)).
- Parameters:
end_value – Sentinel for _t_end of the last event per sequence. None leaves the last row with _t_end = null. A str names a static feature column whose per-sequence value fills the last _t_end.
start_column – User-facing name for the start column. Defaults to "start".
end_column – User-facing name for the end column. Defaults to "end".
destination – None → ephemeral result; path → new persistent store.
overwrite – Replace destination if it already exists.
- Returns:
A new StateSequencePool.
- classmethod builder() EventSequenceStoreBuilder[source]#
Return a fluent builder for constructing an event sequence store.
- class tanat.sequence.EventSequenceSettings(*, id_column: str, entity_features: list[str], static_features: list[str] = <factory>, time_column: str)[source]#
Bases: SequenceSettings
Settings for event sequences (single timestamp column).
- class tanat.sequence.IntervalEntity(id_value, store: str | Path | SequenceStore, features: list[str] | None = None, *, rank: int, store_index: int, cast_recipe: SequenceCastRecipe | dict | None = None, virtual_id: str | None = None, parent_metadata: SequenceMetadata | None = None)[source]#
Bases: Entity
Entity representing one interval row (start/end timestamps).
Unlike StateEntity, intervals are not required to be contiguous: gaps between intervals are allowed, and two intervals may overlap in time.
- class tanat.sequence.IntervalSequence(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None)[source]#
Bases: Sequence
A single interval sequence.
Unlike state sequences, intervals are not required to be contiguous: gaps between intervals are allowed, and two intervals may overlap in time.
- SETTINGS_CLASS[source]#
alias of IntervalSequenceSettings
- __init__(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None) None[source]#
Create an interval sequence for id_value.
- Parameters:
id_value – Sequence identifier.
store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
start_column – User-facing name for the interval start column.
end_column – User-facing name for the interval end column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.
- class tanat.sequence.IntervalSequencePool(store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None, cast_recipe: SequenceCastRecipe | dict | None = None)[source]#
Bases: SequencePool
Pool of interval sequences.
Unlike state sequences, intervals are not required to be contiguous: gaps between intervals are allowed, and two intervals may overlap in time.
- SETTINGS_CLASS[source]#
alias of IntervalSequenceSettings
- __init__(store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None, cast_recipe: SequenceCastRecipe | dict | None = None) None[source]#
Create an interval sequence pool backed by store.
- Parameters:
store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
start_column – User-facing name for the interval start column.
end_column – User-facing name for the interval end column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.
cast_recipe – Optional cast recipe (or dict) applied at read time. Normalised via SequenceCastRecipe.coerce() and probed eagerly.
- Raises:
TypeError – If cast_recipe is not a SequenceCastRecipe, dict, or None.
- as_event(anchor: Literal['start', 'end', 'middle'], *, time_column: str = 'time', destination: str | Path | None = None, overwrite: bool = False) EventSequencePool[source]#
Convert this interval pool to an event pool by anchoring to one timestamp.
- Parameters:
anchor – "start", "end", or "middle": selects which timestamp (or their midpoint) becomes the event timestamp.
time_column – User-facing name for the event timestamp. Defaults to "time".
destination – None → ephemeral result; path → new persistent store.
overwrite – Replace destination if it already exists.
- Returns:
A new EventSequencePool.
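The three anchor choices reduce to a small selection rule. The sketch below is an illustration only; anchor_timestamp is a hypothetical helper, and the midpoint is written as start + (end - start) / 2 so that it works for datetime values as well as numbers.

```python
from datetime import datetime


def anchor_timestamp(start, end, anchor):
    """Hypothetical sketch of the three documented anchor choices."""
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "middle":
        # Midpoint written as start + half the span, valid for datetimes too.
        return start + (end - start) / 2
    raise ValueError(f"unknown anchor: {anchor!r}")


start, end = datetime(2024, 1, 1), datetime(2024, 1, 3)
middle = anchor_timestamp(start, end, "middle")
```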
- as_interval() IntervalSequencePool[source]#
Return this pool unchanged: source and target types are identical.
A warning is emitted to signal the no-op conversion.
- Returns:
self (no copy, no I/O).
- as_state() NoReturn[source]#
Not supported: interval → state conversion is ambiguous.
Intervals may overlap or contain gaps; neither property can be resolved into contiguous non-overlapping states without domain-specific merge / fill logic. Apply a manual Polars transformation instead.
- Raises:
NotImplementedError – Always.
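To see why the conversion is ambiguous, one can check whether a sequence's intervals happen to be contiguous before writing a manual transformation. The check below is a minimal sketch in plain Python; is_contiguous is a hypothetical helper and is not provided by tanat.

```python
def is_contiguous(intervals):
    """Hypothetical check: intervals are (start, end) pairs of one sequence."""
    ordered = sorted(intervals)
    # Contiguous means each interval starts exactly where the previous one ends.
    return all(
        prev_end == start
        for (_, prev_end), (start, _) in zip(ordered, ordered[1:])
    )


clean = is_contiguous([(0, 5), (5, 8)])      # back-to-back intervals
with_gap = is_contiguous([(0, 5), (6, 8)])   # nothing covers 5..6
overlapping = is_contiguous([(0, 5), (3, 8)])  # 3..5 is covered twice
```

Only the first case maps onto states without a domain decision; a gap needs a fill rule and an overlap needs a merge rule.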
- classmethod builder(*, sort_anchor: Literal['start', 'end', 'middle'] = 'start') IntervalSequenceStoreBuilder[source]#
Return a fluent builder for constructing an interval sequence store.
- Parameters:
sort_anchor – Intra-sequence sort column: "start" (default), "end" for right-censored datasets, or "middle" to sort by the interval midpoint (T_START + T_END) / 2.
- class tanat.sequence.IntervalSequenceSettings(*, id_column: str, entity_features: list[str], static_features: list[str] = <factory>, start_column: str, end_column: str)[source]#
Bases: SequenceSettings
Settings for interval sequences (start + end timestamp columns).
Unlike state sequences, intervals are not required to be contiguous: gaps between intervals are allowed, and two intervals may overlap in time.
- get_time_columns() list[str][source]#
Returns time index columns for Interval sequences [start, end].
- class tanat.sequence.Sequence(id_value, store: SequenceStore, settings)[source]#
Bases: ABC, SequenceViewMixin, CachableSettings, Registrable
Interface to a single sequence within a Store.
A Sequence is a scoped view on the data for one specific ID. It shares the same SequenceStore as its parent Pool (no copy).
Typical creation patterns:
    # From a Pool (recommended)
    seq = pool[42]
    # Standalone
    seq = StateSequence(id_value=42, store="my_store")
- __init__(id_value, store: SequenceStore, settings) None[source]#
Base initialiser. Delegated to by concrete subclasses and from_parent() after store and feature resolution have been performed.
- Parameters:
id_value – Unique identifier for this sequence in the store.
store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings (entity_features and static_features never None).
- apply(exprs: Expr | list[Expr], is_static: bool = False, *, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Evaluates Polars expressions against this sequence’s features.
This is a read-only computation scoped to this single sequence. The result is returned, not stored.
At the Pool level, use Pool.apply(by_id=True) for per-sequence computations across all sequences, then Pool.add_entity_features() or Pool.add_static_features() to persist.
- Parameters:
exprs – One or more Polars expressions producing new columns. Each must use .alias() to name the output.
is_static – If True, evaluate against static features; otherwise against entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
The computed columns for this sequence only.
Examples
Local normalization:
    seq = pool[42]
    result = seq.apply(
        (pl.col("value") - pl.col("value").mean()).alias("v_centered")
    )
Multiple expressions:
    result = seq.apply([
        (pl.col("value").diff()).alias("v_diff"),
        (pl.col("value").rolling_mean(3)).alias("v_rm3"),
    ])
See also
Pool.apply: Apply across all sequences (with optional by_id).
Pool.add_entity_features: Persist entity features.
Pool.add_static_features: Persist static features.
- copy() Sequence[source]#
Return a standalone copy of this sequence, detached from any parent pool.
- Returns:
A new standalone Sequence with _parent_pool=None.
Examples:
    seq = pool[42]
    standalone = seq.copy()  # detaches from pool
    standalone.filter_entities(crit, inplace=True)  # safe
- describe(fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Compute summary statistics for this single sequence.
- Parameters:
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
Single-row DataFrame with columns [length, n_unique_entities, temporal_span, …].
Examples:
    seq = pool[42]
    seq.describe()
    seq.describe(fmt="polars")
- filter_entities(criterion: Criterion, *, inplace: bool = False, verbose: bool = True) Sequence[source]#
Return a view with entities pruned by criterion.
- Parameters:
- Returns:
Filtered sequence (or self when inplace=True).
- Raises:
TypeError – If criterion is not a Criterion object.
CriterionLevelError – If the criterion does not support entity filtering.
- classmethod from_parent(id_value, store: SequenceStore, settings, *, parent_pool: SequencePool) Sequence[source]#
Create a pool-managed sequence. Not part of the public API.
Bypasses store resolution, feature resolution, and cast probe: all already performed by the pool. Every piece of pool context (casts, filters, virtual ID, T0) is read lazily from parent_pool via the corresponding cached properties.
- Parameters:
id_value – Sequence identifier.
store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings.
parent_pool – The owning SequencePool.
- Returns:
A new Sequence instance bound to parent_pool.
- match(criterion: Criterion) bool[source]#
Return True if this sequence satisfies criterion.
- Parameters:
criterion – A Criterion instance.
- Raises:
TypeError – If criterion is not a Criterion object.
CriterionLevelError – If the criterion is incompatible with this sequence.
- static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame | None[source]#
Return static (non-temporal) data for this sequence.
- Parameters:
features – Feature name(s) to include (None -> all).
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
Single-row DataFrame with columns [id, feature…]. None when no static features are exposed by this pool.
Examples:
    seq = pool[42]
    row = seq.static_data()  # pandas, all static features
    row = seq.static_data(["age", "sex"])  # subset
- property t0: datetime | date | int | float | None[source]#
T0 value for this sequence (scalar, not a DataFrame).
None when no valid T0 row was found (e.g. sequence too short, or no row matched the query).
- property t0_nearest_rank: int | None[source]#
0-based rank of the nearest row at or before T0 within this sequence.
None when no valid T0 row was found (e.g. sequence too short, T0 before all timestamps, or no row matched the query).
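The notion of a 0-based rank "at or before T0" can be sketched with a binary search over sorted timestamps. This is an illustration under stated assumptions, not the library's implementation: nearest_rank_at_or_before is a hypothetical helper, timestamps are assumed sorted in ascending order, and tie handling for duplicate timestamps in tanat is not specified by this page.

```python
from bisect import bisect_right


def nearest_rank_at_or_before(timestamps, t0):
    """Hypothetical sketch: timestamps must be sorted in ascending order."""
    if t0 is None:
        return None
    # bisect_right counts the timestamps <= t0; subtract 1 for a 0-based rank.
    rank = bisect_right(timestamps, t0) - 1
    # A negative rank means T0 lies before every timestamp.
    return rank if rank >= 0 else None


times = [10, 20, 30]
between = nearest_rank_at_or_before(times, 25)
too_early = nearest_rank_at_or_before(times, 5)
```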
- temporal_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Return temporal data for this sequence.
Each row is one entity: the atomic observation of this sequence (an event, a state, or a time-step). Each entity carries the sequence ID, its temporal position (one column for events, two for intervals), and entity features: the per-row measurements that vary along the sequence (e.g. heart rate, label, sensor value).
- Parameters:
features – Entity feature name(s) to include. None → all entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
DataFrame with columns [id, temporal…, feature…] scoped to this sequence ID.
Examples:
    seq = pool[42]
    df = seq.temporal_data()  # pandas, all features
    df = seq.temporal_data("heart_rate")  # single feature
    df = seq.temporal_data(fmt="polars")
- class tanat.sequence.SequencePool(store: SequenceStore, settings: SequenceSettings | dict, cast_recipe: SequenceCastRecipe | dict | None = None)[source]#
Bases: ABC, SequenceViewMixin, CachableSettings, Registrable
Base class for sequence pool objects.
- __init__(store: SequenceStore, settings: SequenceSettings | dict, cast_recipe: SequenceCastRecipe | dict | None = None) None[source]#
Base initialiser. Delegated to by concrete subclasses after store and feature resolution have been performed.
- Parameters:
store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings (or equivalent dict). entity_features and static_features are never None.
cast_recipe – Optional cast recipe (or dict) applied at read time. Normalised via SequenceCastRecipe.coerce() and probed eagerly.
- Raises:
TypeError – If cast_recipe is not a SequenceCastRecipe, dict, or None.
- add_entity_features(df: DataFrame | DataFrame | LazyFrame, *, overwrite: bool = False) None[source]#
Add new entity features to the virtual store.
The input DataFrame must be positionally aligned with the full entity row set of the store (i.e. it must have exactly as many rows as there are entity rows in the physical store, not just the current view). Use save() first to materialise a filtered view before calling this method.
- Parameters:
df – Feature-only DataFrame (no ID column) positionally aligned with the entity rows in the store. Can be pandas, Polars eager, or Polars lazy.
overwrite – If True, replace existing features with the same name in the virtual context.
- Raises:
RuntimeError – If the pool has an active _id_mask or entity filter expression. Call pool.save() first and then add features to the resulting unfiltered pool.
ValueError – If the number of rows in df does not match the number of entity rows in the store.
- add_static_features(df: DataFrame | DataFrame | LazyFrame, *, id_column: str | None = None, overwrite: bool = False) None[source]#
Add static features to the virtual store via an ID-keyed join.
The input DataFrame must include the ID column (either under settings.id_column or under the name given by id_column). A LEFT JOIN against the full sequence index is performed internally, so partial DataFrames (covering only a subset of IDs) are accepted: IDs absent from df receive null in the virtual context.
- Parameters:
df – DataFrame containing the ID column plus one or more feature columns. Can be pandas, Polars eager, or Polars lazy.
id_column – Name of the ID column in df. Defaults to settings.id_column when None. Pass an explicit name when the join key in df differs from the pool’s public ID name (e.g. id_column="patient_id").
overwrite – If True, replace existing features with the same name in the virtual context.
- Raises:
KeyError – If the resolved ID column is not found in df.
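The LEFT JOIN semantics described for add_static_features (every known ID is kept, and IDs missing from the input receive null) can be shown with dictionaries. This is a sketch only: left_join_static is a hypothetical helper, and None stands in for the null value.

```python
def left_join_static(all_ids, rows, id_column, feature):
    """Hypothetical sketch: rows is a list of dicts covering some of all_ids."""
    if any(id_column not in row for row in rows):
        raise KeyError(id_column)
    lookup = {row[id_column]: row[feature] for row in rows}
    # Every known ID is kept; IDs missing from rows get None (null).
    return {seq_id: lookup.get(seq_id) for seq_id in all_ids}


ages = left_join_static(
    all_ids=[1, 2, 3],
    rows=[{"patient_id": 1, "age": 40}, {"patient_id": 3, "age": 62}],
    id_column="patient_id",
    feature="age",
)
```

ID 2 is absent from the input rows, so it is still present in the result with a null value.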
- apply(exprs: Expr | list[Expr], is_static: bool = False, *, by_id: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Evaluate Polars expressions against the current features.
This is a read-only computation: the result is returned, not stored. Use add_entity_features() or add_static_features() to persist the result.
Each expression must produce a named column (.alias()).
When by_id=True, the result always includes the ID column (settings.id_column).
- Parameters:
exprs – One or more Polars expressions producing new columns.
is_static – If True, evaluate against static features; otherwise against entity features.
by_id – If True, expressions are evaluated per sequence (group_by on the sequence ID). Only valid for entity features (is_static=False). The ID column is included in the result.
fmt – Format of the returned object. One of:
  - "pandas" (default): returns a pandas.DataFrame.
  - "polars": returns a polars.DataFrame.
- Returns:
The computed columns as a DataFrame. When by_id=True, the first column is the sequence ID.
- Raises:
ValueError – If by_id=True and is_static=True.
Examples
Compute and inspect:
    result = pool.apply(
        (pl.col("age") * pl.col("score")).alias("age_score"),
        is_static=True,
    )
    print(result)
Persist entity features:
    result = pool.apply(
        (pl.col("value") - pl.col("value").mean()).alias("centered"),
    )
    pool.add_entity_features(result)
Per-sequence aggregation (result includes ID column):
    summary = pool.apply(
        pl.col("value").mean().alias("value_mean"),
        by_id=True,
    )
    pool.add_static_features(summary)
Per-sequence normalization (result includes ID column):
    normed = pool.apply(
        (pl.col("value") - pl.col("value").mean()).alias("v_normed"),
        by_id=True,
    )
    pool.add_entity_features(normed)
- binned_data(features: list[str] | str, bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True, bin_col: str = '__bin__') DataFrame | DataFrame[source]#
Project sequences onto a binned temporal table (long-format dataframe).
Each sequence is aligned to a shared time axis divided into fixed-size bins. When multiple values compete for the same bin, overlap_rule resolves the ambiguity. Empty bins are filled with fill_value.
For an ML-ready 3-D tensor with feature labels and ID order, see to_tensor().
- Parameters:
features – Feature(s) to project onto the grid.
bin_size – Width of each bin (duration string for datetime, numeric otherwise).
max_bins – Maximum number of bins. None infers from the data span, capped by MAX_BINS_LIMIT. An explicit value bypasses the safety cap: the caller opts in knowingly.
fill_value – Value used to fill empty bins. None keeps nulls.
overlap_rule – Polars aggregation name for in-bin conflict resolution ("first", "last", "mean", "max", "sum", "median", …).
ohe – One-hot encode features before binning. Requires Categorical or Enum dtypes.
fmt – "pandas" (default) or "polars".
use_arrow – Pandas conversion uses Arrow when True.
bin_col – Output column name for the bin index.
- Returns:
DataFrame with columns [id_col, bin_col, *feature_cols] and len(unique_ids) * n_bins rows.
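The shape of the binned table (one row per ID and per bin, with a "first" rule for conflicts and a fill value for empty bins) can be sketched in plain Python. The function bin_sequences is hypothetical and simplified: it assumes numeric times starting at 0 and a single feature, and it does not reproduce how tanat infers the number of bins.

```python
def bin_sequences(rows, bin_size, n_bins, fill_value=None):
    """Hypothetical sketch: rows are (seq_id, time, value), times start at 0.

    Conflicts inside a bin are resolved with a "first" rule.
    """
    ids = sorted({seq_id for seq_id, _, _ in rows})
    grid = {(seq_id, b): fill_value for seq_id in ids for b in range(n_bins)}
    seen = set()
    for seq_id, time, value in sorted(rows, key=lambda r: (r[0], r[1])):
        b = int(time // bin_size)
        if 0 <= b < n_bins and (seq_id, b) not in seen:
            grid[(seq_id, b)] = value  # keep only the first value per bin
            seen.add((seq_id, b))
    return [(seq_id, b, grid[(seq_id, b)]) for seq_id in ids for b in range(n_bins)]


table = bin_sequences(
    [("a", 0, 1.0), ("a", 1, 2.0), ("a", 5, 3.0), ("b", 3, 9.0)],
    bin_size=2,
    n_bins=3,
    fill_value=0.0,
)
```

With two IDs and three bins the table has six rows, matching the len(unique_ids) * n_bins row count stated above.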
- cast_features(schema: dict[str, DataType | type], is_static: bool = False) None[source]#
Casts feature columns to new types, scoped to this Pool only.
To make a cast permanent on disk, save the Pool first (pool.save()) and reload. Persisting changes might affect other views sharing the same store, so use with caution.
- Parameters:
schema – Dictionary mapping feature names to target Polars DataTypes.
is_static – Whether these are static features (True) or entity features (False).
- Raises:
TypeError – If schema is not a dict.
KeyError – If a feature name does not exist.
- cast_id(dtype: DataType) None[source]#
Casts the ID column to a new type.
- Parameters:
dtype – The target Polars DataType.
- cast_to_datetime(unit: str = 'us', time_zone: str | None = None)[source]#
Cast time columns to Datetime.
- Parameters:
unit – The datetime resolution (“ms”, “us”, “ns”). Default is “us” (microsecond), the Python standard.
time_zone – Optional timezone string (e.g. “UTC”, “Europe/Paris”).
- cast_to_timestep(dtype: DataType = Int64)[source]#
Cast time columns to numeric-based timesteps.
- Parameters:
dtype – The target numeric type (e.g., pl.UInt32, pl.Int64, pl.Float64). Default is pl.Int64 for safety.
- Raises:
TypeError – If dtype is not a numeric type.
TypeError – If the underlying data is already in Datetime format. (Conversion from Datetime to Timestep is not allowed).
- copy() SequencePool[source]#
Returns a shallow copy of this Pool, sharing the same store and preserving all view state (masks, casts, virtual features).
The virtual context is forked into a new UUID so that the copy owns its own independent context. Garbage-collecting either instance will not destroy the other’s virtual features.
The T0 strategy (_t0_setter) is propagated to the copy. The T0 result cache is not copied; it is recomputed on the first call to t0_data() on the copy.
- describe(by_id: bool = True, add_to_static: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Compute summary statistics for every sequence in the pool.
- Parameters:
by_id – If True (default), return one row per sequence ID with columns [id, length, n_unique_entities, …]. If False, return the cross-sequence pandas describe() output (count, mean, std, min, 25%, …).
add_to_static – If True, write the per-ID result to the static-feature store via add_static_features(). Ignored (with a warning) when by_id=False.
fmt – "pandas" (default) or "polars". Ignored when by_id=False (always pandas).
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
  - by_id=True: DataFrame with one row per sequence ID.
  - by_id=False: Aggregated statistics (pandas describe() output).
Examples:
    pool.describe()                    # one row per ID, pandas
    pool.describe(fmt="polars")        # same, polars
    pool.describe(by_id=False)         # cross-ID stats
    pool.describe(add_to_static=True)  # persist as static cols
- drop_features(features: list[str], is_static: bool = False, *, permanently: bool = False) None[source]#
Removes features from the current view.
By default, this is a soft drop: features are removed from the Pool settings so they no longer appear in temporal_data(), static_data() or metadata, but the underlying files are left untouched.
With permanently=True the columns are also physically deleted from disk (physical store and/or virtual store).
- Parameters:
features – Feature names to drop.
is_static – True for static features, False for entity features.
permanently – If True, also delete the columns from disk. This is irreversible for physical features.
- extend(other: SequencePool | Sequence, destination: str | Path | None = None, *, on_duplicate: Literal['raise', 'skip'] = 'raise', overwrite: bool = False) SequencePool[source]#
Merge other into this pool and write the result to disk.
Mirrors the semantics of save().
Same-store fast path: if self and other point to the same physical store and neither carries virtual content (_virtual_id is None on both sides), no read I/O is performed. A new pool backed by the same store with the union of both ID masks is built immediately. If destination is provided, the merged pool is then materialised to disk via save(); otherwise it is returned as an in-memory view with zero I/O.
Different stores (or virtual content present): a full merge-and-write is performed and destination is required. A named destination writes the merged data to a new store. To rewrite in-place, pass destination=self._store.root_path explicitly together with overwrite=True.
- Parameters:
other – Data to merge. Accepted types:
SequencePool: must be the same concrete subclass with an identical entity feature schema.
Sequence: single sequence object; schema is checked against this pool.
destination – None → in-memory view (same-store fast path only; no I/O); str / Path → materialise the merged data to disk. destination is required when merging from different stores.
on_duplicate – Behaviour when other contains an ID already present in this pool:
"raise" (default): raise ValueError listing the conflicting IDs.
"skip": silently ignore duplicates.
overwrite – Allows overwriting destination when it already exists on disk.
- Returns:
A new SequencePool instance.
- Raises:
TypeError – If other is not a SequencePool or Sequence.
ValueError – If other is a SequencePool of a different concrete type.
ValueError – If other is missing entity features declared in this pool’s settings.
ValueError – If on_duplicate="raise" and duplicate IDs are found.
ValueError – If destination=None and stores differ (same-store fast path only supports an in-memory view without I/O).
FileExistsError – If destination exists and overwrite=False.
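The on_duplicate rule of extend() can be pictured on plain ID lists. The following is a minimal sketch and not tanat's implementation: merge_ids is a hypothetical helper that only mimics the documented "raise" and "skip" behaviours.

```python
def merge_ids(current, incoming, on_duplicate="raise"):
    """Hypothetical helper: merge two ID lists following the documented on_duplicate rule."""
    if on_duplicate not in ("raise", "skip"):
        raise ValueError(f"unknown on_duplicate: {on_duplicate!r}")
    seen = set(current)
    duplicates = [i for i in incoming if i in seen]
    if duplicates and on_duplicate == "raise":
        # "raise": report every conflicting ID
        raise ValueError(f"duplicate IDs: {sorted(duplicates)}")
    # "skip": keep the existing IDs and silently ignore incoming duplicates
    return list(current) + [i for i in incoming if i not in seen]


merged = merge_ids([1, 2, 3], [3, 4], on_duplicate="skip")
```

With "skip", ID 3 is kept once and only ID 4 is appended; with the default "raise", the same call fails and names ID 3.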
- filter_entities(criterion: Criterion, *, inplace: bool = False, verbose: bool = True) SequencePool[source]#
Return a view with entities pruned by criterion.
- Parameters:
- Returns:
Filtered pool (or self when inplace=True).
- Raises:
TypeError – If criterion is not a Criterion object.
CriterionLevelError – If the criterion does not support entity filtering.
- classmethod from_parent(store_path: Path, *, parent_pool: TrajectoryPool) SequencePool[source]#
Create a managed sub-pool owned by parent_pool.
Builds the pool from store_path, aligns settings and casts with the parent TrajectoryPool, and marks it locked. T0 is resolved lazily via the _t0_setter property, which delegates to parent_pool.
- Parameters:
store_path – Root path of the sequence store to load.
parent_pool – The owning TrajectoryPool.
- Returns:
A locked SequencePool whose T0 delegates to parent_pool.
- get_sequences(entity_features: list[str] | None = None, static_features: list[str] | None = None) dict[str, Sequence][source]#
Return a mapping of sequence IDs to Sequence objects.
- Parameters:
entity_features – Entity feature subset to expose in each sequence. None -> use pool-level settings.
static_features – Static feature subset to expose in each sequence. None -> use pool-level settings.
- Returns:
Dict mapping each visible sequence ID to its Sequence instance.
Examples:
seqs = pool.get_sequences()
print(seqs[42].temporal_data())
- property is_dirty: bool[source]#
True if the pool has state not yet written to disk.
Covers virtual features, view scopes, type casts, and soft feature drops. A dirty pool needs save() to materialise its current view.
- save(destination: str | Path | None = None, *, overwrite: bool = False) Path[source]#
Persist the current pool state (virtual features + view masks).
Without destination the store is rewritten in-place. With destination the pool is rebuilt into that path and then redirected to it; the original files are left untouched.
In both cases the pool is left in a clean state after a successful save: virtual context, masks, soft-drops and cast recipe are all reset, and is_dirty becomes False.
When a mask is active and destination is None, the store is overwritten with a subset of the data. overwrite=True is required to confirm.
- Parameters:
destination – None → in-place; path → rebuild into new path and redirect the pool there.
overwrite – Required when saving a filtered view in-place. Also allows overwriting an existing destination.
- Returns:
The Path of the written store.
- Raises:
FileExistsError – If destination exists and overwrite is False.
RuntimeError – If a mask is active in-place without overwrite.
- set_t0(*, position: int | None = None, direct=None, feature: str | None = None, query: Expr | None = None, anchor: Literal['start', 'end', 'middle'] | None = None, use_first: bool = True) SequencePool[source]#
Configure the T0 strategy for this pool.
Exactly one strategy keyword must be provided. All others must remain None.
- Parameters:
position – Row index (0-based; negative indexing supported).
direct – Scalar value or {seq_id: value} dict.
feature – Static feature column name.
query – Polars boolean expression on any sequence column (time columns or entity features).
anchor – Which end of each interval/state row to use as the reference timestamp for the floor lookup:
"start" (default): use the start timestamp.
"end": use the end timestamp.
"middle": use the midpoint (start + end) / 2.
Omitting anchor= on an interval/state pool emits a UserWarning and defaults to "start". Passing anchor= on an event pool emits a UserWarning and the value is ignored (single time column, anchor is irrelevant).
use_first – For the query strategy, whether to take the first (True) or last (False) matching row.
- Returns:
self for chaining.
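To make the position and query strategies concrete, here is a minimal sketch on a single sequence held as plain Python lists. The helpers t0_by_position and t0_by_query are hypothetical and not part of tanat; the Polars expression is replaced by a precomputed list of booleans, and returning None when nothing matches is an assumption.

```python
def t0_by_position(times, position):
    """Pick T0 as the time at a 0-based row index (negative indices count from the end)."""
    return times[position]


def t0_by_query(times, matches, use_first=True):
    """Pick T0 as the time of the first (or last) row where the predicate matched."""
    hits = [t for t, ok in zip(times, matches) if ok]
    if not hits:
        return None  # assumption: no matching row means no T0 for this sequence
    return hits[0] if use_first else hits[-1]


times = [10, 20, 30, 40]
is_admission = [False, True, False, True]  # stand-in for a boolean query result

first_t0 = t0_by_position(times, 0)
last_t0 = t0_by_position(times, -1)
query_first = t0_by_query(times, is_admission)
query_last = t0_by_query(times, is_admission, use_first=False)
```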
- static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame | None[source]#
Return static (non-temporal) data for all sequences in this pool.
- Parameters:
features – Feature name(s) to include (None -> all static features).
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
One-row-per-sequence DataFrame with columns [id, feature...]. None when no static features are exposed by this pool.
Examples:
df = pool.static_data()                # pandas, all static features
df = pool.static_data(["age", "sex"])  # subset
df = pool.subset([1, 2, 3]).static_data()
- subset(ids, *, inplace=False) SequencePool[source]#
Returns a new Pool containing only the specified sequence IDs.
- Parameters:
ids – A list of sequence IDs to include in the subset.
inplace – If True, modify this Pool’s view instead of returning a new one.
- Returns:
A new SequencePool instance with the subset of IDs, or self if inplace=True.
- Raises:
ValueError – If any ID in ids is not present in unique_ids.
- survival_target(endpoint_time: str, occurred: str | None = None, censure_time: str | None = None, fmt: Literal['sksurv', 'polars', 'pandas'] = 'sksurv') tuple[ndarray | DataFrame | DataFrame, list][source]#
Build a survival target (occurred, time) from static features stored in the pool.
For each patient, assembles a binary indicator (did the endpoint occur?) and the corresponding duration (time from T0 to the endpoint, or to the last observation for censored patients). Patients with unresolvable or non-positive durations are excluded and reported via a warning.
Durations are computed as the difference between the absolute value stored in the static column and the per-patient T0.
If censure_time is None, the last recorded time in the sequence is used as the censoring reference.
- Parameters:
endpoint_time – Name of the static column containing the absolute time at which the endpoint occurred (e.g. age at death, event datetime). T0 is subtracted internally to produce the duration. When occurred is None, a null value means the endpoint was not observed (censored). Expected dtype: same as the pool time axis (numeric for timestep pools, datetime for datetime pools).
occurred – Name of the static column with a binary endpoint indicator. True (or 1) = endpoint observed, False (or 0) or null = censored. Expected dtype: bool or numeric (int or float); cast to bool internally. If None, inferred as endpoint_time.is_not_null().
censure_time – Name of the static column containing the absolute time of the last observation for censored patients (e.g. age at last visit, last visit datetime). T0 is subtracted internally to produce the duration. Expected dtype: same as the pool time axis. If None, derived automatically as max(get_temporal_columns()[-1]) per patient (i.e. max of time_column for events, max of end_column for state/interval pools).
fmt – Format of the returned target y. "sksurv" returns a structured np.ndarray with fields (occurred: bool, time: float) compatible with scikit-survival. "polars" and "pandas" return a DataFrame with columns ["id", "occurred", "time"].
- Returns:
A tuple (y, valid_ids). y is the survival target in the requested format. valid_ids is the list of patient identifiers retained after filtering invalid rows.
- Raises:
KeyError – If a column name is not found in the pool’s static features.
RuntimeError – If no static features are available in this pool.
Examples
>>> y, valid_ids = pool.survival_target(
...     endpoint_time="death_age_occur",
... )
>>> y, valid_ids = pool.survival_target(
...     endpoint_time="death_age_occur",
...     occurred="death",
... )
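The duration rule described for survival_target() (endpoint or censoring time minus T0, with unresolvable or non-positive durations excluded) can be sketched in plain Python. This is an illustration under assumptions, not the library's code: survival_rows and the record layout are hypothetical, and the indicator is inferred from a non-null endpoint_time as in the occurred=None case.

```python
def survival_rows(records):
    """Hypothetical helper: build (id, occurred, time) rows from per-patient dicts."""
    kept = []
    for r in records:
        # Inferred indicator, as when occurred=None: endpoint observed if its time is set
        occurred = r["endpoint_time"] is not None
        reference = r["endpoint_time"] if occurred else r["censure_time"]
        if reference is None or r["t0"] is None:
            continue  # unresolvable duration: excluded
        duration = reference - r["t0"]
        if duration <= 0:
            continue  # non-positive duration: excluded
        kept.append((r["id"], occurred, float(duration)))
    return kept


patients = [
    {"id": "a", "t0": 50, "endpoint_time": 62, "censure_time": 70},    # event at 62
    {"id": "b", "t0": 40, "endpoint_time": None, "censure_time": 55},  # censored at 55
    {"id": "c", "t0": 60, "endpoint_time": 58, "censure_time": 65},    # endpoint before T0
]
rows = survival_rows(patients)
valid_ids = [row[0] for row in rows]
```

Patient "c" is dropped because its endpoint precedes T0, which is why valid_ids is returned alongside the target.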
- t0_data(fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Return the T0 table for the sequences visible in this pool view.
Thin public wrapper around _get_t0_df() that handles format conversion.
- Parameters:
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
DataFrame with columns [id_col, _T0_, _T0_NEAREST_RANK_], one row per visible sequence.
Examples:
pool.set_t0(position=0, anchor="start")
df = pool.t0_data()
df_pl = pool.t0_data(fmt="polars")
- temporal_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Return temporal data for all sequences visible in this pool.
Each row is one entity: the atomic observation of a sequence (an event, a state, or a time-step). Each entity carries the sequence ID, its temporal position (one column for events, two for intervals), and entity features: the per-row measurements that vary along the sequence (e.g. heart rate, label, sensor value).
- Parameters:
features – Entity feature name(s) to include. None → all entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
Long-format DataFrame with columns [id, temporal…, feature…] covering every visible sequence.
Examples:
df = pool.temporal_data()                          # pandas, all features
df = pool.temporal_data("heart_rate")              # single feature
df = pool.temporal_data(["a", "b"], fmt="polars")
# Restrict to a subset of IDs:
df = pool.subset([1, 2, 3]).temporal_data()
- to_dummies(features: list[str] | str, is_static: bool = False, *, drop_first: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
One-hot encode categorical features.
Returns a DataFrame with binary columns for each category. Only features typed as Categorical or Enum are accepted. Cast first with cast_features if needed.
This is a consumption method: the result is returned, not stored in the pool. Use it when preparing data for training.
- Parameters:
features – Feature name(s) to encode.
is_static – Whether these are static or entity features.
drop_first – Drop the first category column to avoid multicollinearity (useful for linear models).
fmt – Format of the returned object. One of:
"pandas" (default): returns a pandas.DataFrame.
"polars": returns a polars.DataFrame.
- Returns:
DataFrame with binary columns (one per category per feature).
- Raises:
TypeError – If any feature is not Categorical or Enum.
Examples
Basic usage:
pool.cast_features(schema={"status": pl.Categorical})
dummies = pool.to_dummies("status")
# → status_OK, status_ERROR, status_WARNING columns
With drop_first for linear models:
X = pool.to_dummies("group", is_static=True, drop_first=True)
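The effect of drop_first is easiest to see on a toy column. The sketch below is a plain-Python illustration, not what to_dummies() runs internally: one_hot is a hypothetical helper, and ordering categories alphabetically is an assumption made only so the "first" category is well defined here.

```python
def one_hot(values, name, drop_first=False):
    """Hypothetical helper: one binary column per category, optionally dropping the first."""
    categories = sorted(set(values))  # assumption: alphabetical category order
    if drop_first:
        categories = categories[1:]  # the dropped category is implied by all-zero rows
    return {f"{name}_{c}": [int(v == c) for v in values] for c in categories}


status = ["OK", "ERROR", "OK", "WARNING"]
full = one_hot(status, "status")
reduced = one_hot(status, "status", drop_first=True)
```

In full, the three columns sum to 1 on every row, so they are collinear with an intercept term; reduced removes that redundancy while still encoding the same information.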
- to_tensor(features: list[str] | str, bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, bin_col: str = '__bin__') tuple[ndarray, list, list[str]][source]#
Project sequences onto a 3-D temporal grid (ML-ready tensor).
Returns a dense (N, M, K) ndarray together with the sequence IDs and the list of K-axis feature labels. Sequence order on the N axis follows unique_ids.
For a long-format dataframe variant (joins, plotting, exploration), see binned_data().
- Parameters:
features – Feature(s) to project.
bin_size – Bin width (duration string for datetime, numeric otherwise).
max_bins – Maximum number of bins. None infers it from the data span, capped by MAX_BINS_LIMIT. An explicit value bypasses the safety cap; the caller opts in knowingly.
fill_value – Value for empty bins. None keeps NaN.
overlap_rule – In-bin aggregation. See binned_data().
ohe – One-hot encode before binning. Post-OHE column names are reflected in the returned feature_names.
bin_col – Internal bin column name (forwarded to the underlying pipeline for consistency; not present in the output).
- Returns:
A 3-tuple (arr, ids, feature_names) where:
arr has shape (N, M, K) = (len(unique_ids), n_bins, len(feature_names)).
ids is the list of sequence IDs matching the N-axis order (identical to unique_ids).
feature_names lists the K-axis labels in column order (post-OHE names when ohe=True).
Examples:
arr, ids, names = pool.to_tensor(["dose", "route"], "1d", ohe=True)
# arr.shape == (N, M, K)
# names == ["dose", "route_oral", "route_iv"]
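As a rough picture of the (N, M, K) projection, the sketch below bins numeric event times into a nested-list grid with a single feature (K = 1). It is an illustration under assumptions rather than tanat's pipeline: to_grid is a hypothetical helper, bins are assumed to start at time 0, and "first" is read here as "keep the first value encountered in each bin" (see binned_data() for the actual rule).

```python
def to_grid(sequences, bin_size, n_bins, fill_value=float("nan")):
    """Hypothetical helper: {seq_id: [(time, value), ...]} -> (grid, ids) with K = 1."""
    ids = list(sequences)  # N axis follows the mapping's order
    grid = [[[fill_value] for _ in range(n_bins)] for _ in ids]
    for n, seq_id in enumerate(ids):
        filled = set()
        for time, value in sequences[seq_id]:
            m = int(time // bin_size)  # bin index on the M axis
            if 0 <= m < n_bins and m not in filled:
                grid[n][m][0] = value  # assumed "first" rule: later values in the bin are ignored
                filled.add(m)
    return grid, ids


events = {"p1": [(0, 1.0), (1, 2.0), (5, 3.0)], "p2": [(2, 4.0)]}
grid, ids = to_grid(events, bin_size=2, n_bins=3)
shape = (len(grid), len(grid[0]), len(grid[0][0]))
```

Empty bins keep the fill value (NaN here), which mirrors the documented fill_value=None default.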
- train_test_split(*, test_size: float | int | None = None, train_size: float | int | None = None, random_state: int | None = None, shuffle: bool = True) tuple[SequencePool, SequencePool][source]#
Split the pool into train and test subsets.
Mirrors the interface of sklearn.model_selection.train_test_split().
- Parameters:
test_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the test subset. Defaults to 0.25 when both test_size and train_size are None.
train_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the train subset. Defaults to the complement of test_size.
random_state – Seed for the random number generator. Pass an integer for reproducibility.
shuffle – Whether to shuffle IDs before splitting. When False, the first IDs go to train and the last to test.
- Returns:
(train_pool, test_pool): two new non-overlapping pool views.
- Raises:
ValueError – If the pool is empty, sizes are non-positive, or n_train + n_test exceeds the pool size.
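The size rules above can be sketched on a plain list of IDs. This is a minimal illustration, not tanat's code: split_ids is a hypothetical helper, and rounding fractional sizes with ceil for test and floor for train follows scikit-learn's convention, which tanat is only assumed to share.

```python
import math
import random


def split_ids(ids, test_size=None, train_size=None, random_state=None, shuffle=True):
    """Hypothetical helper: resolve split sizes and return (train_ids, test_ids)."""
    ids = list(ids)
    n = len(ids)
    if n == 0:
        raise ValueError("cannot split an empty pool")
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default

    def resolve(size, rounder):
        # Floats are proportions of n, ints are absolute counts
        return rounder(size * n) if isinstance(size, float) else size

    n_test = None if test_size is None else resolve(test_size, math.ceil)
    n_train = None if train_size is None else resolve(train_size, math.floor)
    if n_test is None:
        n_test = n - n_train  # complement of train_size
    if n_train is None:
        n_train = n - n_test  # complement of test_size
    if n_train <= 0 or n_test <= 0 or n_train + n_test > n:
        raise ValueError("invalid split sizes")
    if shuffle:
        random.Random(random_state).shuffle(ids)
    return ids[:n_train], ids[n_train:n_train + n_test]


train_ids, test_ids = split_ids(range(8), shuffle=False)
```

With shuffle=False the first six IDs go to train and the last two to test, matching the documented ordering.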
- property unique_ids: list[source]#
Visible sequence IDs in store order as a plain Python list.
Respects _id_mask. Deterministic order (sorted at build time).
Warning
list erases rich Polars dtypes (Categorical → str). Prefer _id_lf when the result feeds a Polars join.
- which(criterion: Criterion, *, verbose: bool = True) set[source]#
Return the set of IDs in this pool satisfying criterion.
- Parameters:
criterion – A Criterion instance.
verbose – If True, print a one-line report.
- Returns:
Set of matching IDs.
- Raises:
TypeError – If criterion is not a Criterion object.
- class tanat.sequence.StateEntity(id_value, store: str | Path | SequenceStore, features: list[str] | None = None, *, rank: int, store_index: int, cast_recipe: SequenceCastRecipe | dict | None = None, virtual_id: str | None = None, parent_metadata: SequenceMetadata | None = None)[source]#
Bases: Entity
Entity representing one state row (start/end).
States are contiguous and non-overlapping: the end of one state is always the start of the next, with no gaps in between.
- class tanat.sequence.StateSequence(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None)[source]#
Bases: Sequence
A single state sequence.
States are contiguous and non-overlapping: the end of one state is always the start of the next, with no gaps in between.
- SETTINGS_CLASS[source]#
alias of StateSequenceSettings
- __init__(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None) None[source]#
Create a state sequence for id_value.
- Parameters:
id_value – Sequence identifier.
store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
start_column – User-facing name for the state start column.
end_column – User-facing name for the state end column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.
- class tanat.sequence.StateSequencePool(store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None, cast_recipe: SequenceCastRecipe | dict | None = None)[source]#
Bases: SequencePool
Pool of state sequences.
States are contiguous and non-overlapping: the end of one state is always the start of the next, with no gaps in between.
- SETTINGS_CLASS[source]#
alias of StateSequenceSettings
- __init__(store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None, cast_recipe: SequenceCastRecipe | dict | None = None) None[source]#
Create a state sequence pool backed by store.
- Parameters:
store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
start_column – User-facing name for the state start column.
end_column – User-facing name for the state end column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.
cast_recipe – Optional cast recipe (or dict) applied at read time. Normalised via SequenceCastRecipe.coerce() and probed eagerly.
- Raises:
TypeError – If cast_recipe is not a SequenceCastRecipe, dict, or None.
- as_event(anchor: Literal['start', 'end', 'middle'], *, time_column: str = 'time', destination: str | Path | None = None, overwrite: bool = False) EventSequencePool[source]#
Convert this state pool to an event pool by anchoring to one timestamp.
- Parameters:
anchor – "start", "end", or "middle": selects which timestamp (or their midpoint) becomes the event timestamp.
time_column – User-facing name for the event timestamp. Defaults to "time".
destination – None → ephemeral result; path → new persistent store.
overwrite – Replace destination if it already exists.
- Returns:
A new EventSequencePool.
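The anchor choice reduces each (start, end) row to one timestamp. The sketch below shows that reduction with a hypothetical anchor_time helper; it is not tanat's implementation. The midpoint is written as start + (end - start) / 2 because that form works for datetimes as well as numbers, whereas adding two datetimes is not defined in Python.

```python
from datetime import datetime


def anchor_time(start, end, anchor):
    """Hypothetical helper: collapse one state row to a single event timestamp."""
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "middle":
        # Equivalent to (start + end) / 2, but also valid for datetime values
        return start + (end - start) / 2
    raise ValueError(f"unknown anchor: {anchor!r}")


start = datetime(2024, 1, 1)
end = datetime(2024, 1, 3)
mid = anchor_time(start, end, "middle")
numeric_mid = anchor_time(0, 10, "middle")
```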
- as_interval(*, start_column: str | None = None, end_column: str | None = None, destination: str | Path | None = None, overwrite: bool = False) IntervalSequencePool[source]#
Convert this state pool to an interval pool.
States and intervals share the same (_t_start, _t_end) physical layout, so no temporal recomputation is needed.
- Parameters:
start_column – User-facing name for the start column. None inherits this pool’s current setting.
end_column – User-facing name for the end column. None inherits this pool’s current setting.
destination – None → ephemeral result; path → new persistent store.
overwrite – Replace destination if it already exists.
- Returns:
A new IntervalSequencePool.
- as_state() StateSequencePool[source]#
Return this pool unchanged - source and target types are identical.
A warning is emitted to signal the no-op conversion.
- Returns:
self (no copy, no I/O).
- classmethod builder(*, end_value: datetime | int | float | None = None, validate_continuity: bool = True) StateSequenceStoreBuilder[source]#
Return a fluent builder for constructing a state sequence store.
- Parameters:
end_value – Sentinel for T_END of the last state in each sequence when end_column is not provided at source registration time. None → leaves the last T_END as null.
validate_continuity – When end_column is provided, verify that states are truly contiguous (T_END[i] == T_START[i+1]) before writing. Defaults to True. Set to False on large datasets where the cost of a full collect() is unacceptable.
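The continuity condition T_END[i] == T_START[i+1] can be checked on one sequence as follows. This is a minimal sketch with a hypothetical is_contiguous helper operating on Python tuples; the builder itself works on the store's data, and how it reports a violation is not shown here.

```python
def is_contiguous(states):
    """Hypothetical helper: states is a list of (start, end) tuples sorted by start."""
    # Compare each state's end with the start of the state that follows it
    return all(
        prev_end == next_start
        for (_, prev_end), (next_start, _) in zip(states, states[1:])
    )


ok = is_contiguous([(0, 5), (5, 9), (9, 12)])
gap = is_contiguous([(0, 5), (6, 9)])  # 5 != 6: a gap between the two states
```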
- class tanat.sequence.StateSequenceSettings(*, id_column: str, entity_features: list[str], static_features: list[str] = <factory>, start_column: str, end_column: str)[source]#
Bases: SequenceSettings
Settings for state sequences (start + end timestamp columns).
States are contiguous and non-overlapping: the end of one state is always the start of the next, with no gaps in between.
- tanat.sequence.build_events(temporal_data: DataFrame | DataFrame | LazyFrame, *, id_column: str, time_column: str, static_data: DataFrame | DataFrame | LazyFrame | None = None, store_name: str | None = None) EventSequencePool[source]#
Build an EventSequencePool from a single DataFrame.
All columns in temporal_data except id_column and time_column are treated as entity features. All columns in static_data except id_column are treated as static features.
- Parameters:
temporal_data – DataFrame or LazyFrame with at least id, time, and one feature column.
id_column – Name of the sequence identifier column (present in both temporal_data and static_data if provided).
time_column – Name of the timestamp column.
static_data – Optional DataFrame or LazyFrame with per-id static features. Must contain a column named id_column for joining.
store_name – Name for the on-disk store. When None, a unique name is generated automatically (_quick_event_<hex8>).
- Returns:
A ready-to-use EventSequencePool.
- Raises:
ValueError – If id_column or time_column is missing, if no feature columns remain, or if id_column is absent from static_data.
Examples:
pool = build_events(df, id_column="patient", time_column="date")
pool.temporal_data(fmt="polars").head()
- tanat.sequence.build_intervals(temporal_data: DataFrame | DataFrame | LazyFrame, *, id_column: str, start_column: str, end_column: str, static_data: DataFrame | DataFrame | LazyFrame | None = None, store_name: str | None = None) IntervalSequencePool[source]#
Build an IntervalSequencePool from a single DataFrame.
All columns in temporal_data except id_column, start_column, and end_column are treated as entity features.
- Parameters:
temporal_data – DataFrame or LazyFrame with at least id, start, end, and one feature column.
id_column – Name of the sequence identifier column.
start_column – Name of the interval start column.
end_column – Name of the interval end column.
static_data – Optional DataFrame or LazyFrame with per-id static features.
store_name – Name for the on-disk store. When None, a unique name is generated automatically (_quick_interval_<hex8>).
- Returns:
A ready-to-use IntervalSequencePool.
- Raises:
ValueError – If required columns are missing, if no feature columns remain, or if id_column is absent from static_data.
Examples:
pool = build_intervals(
    df,
    id_column="id",
    start_column="start",
    end_column="end",
)
pool.temporal_data(fmt="polars").head()
- tanat.sequence.build_states(temporal_data: DataFrame | DataFrame | LazyFrame, *, id_column: str, start_column: str, end_column: str | None = None, static_data: DataFrame | DataFrame | LazyFrame | None = None, store_name: str | None = None) StateSequencePool[source]#
Build a StateSequencePool from a single DataFrame.
When end_column is None, the end of each state is derived from the start of the next state (the last state stays open-ended with null).
All columns in temporal_data except the structural columns (id, start, and optionally end) are treated as entity features.
- Parameters:
temporal_data – DataFrame or LazyFrame with at least id, start, and one feature column.
id_column – Name of the sequence identifier column.
start_column – Name of the state start column.
end_column – Name of the state end column. When None, the builder derives end values automatically.
static_data – Optional DataFrame or LazyFrame with per-id static features.
store_name – Name for the on-disk store. When None, a unique name is generated automatically (_quick_state_<hex8>).
- Returns:
A ready-to-use StateSequencePool.
- Raises:
ValueError – If required columns are missing, if no feature columns remain, or if id_column is absent from static_data.
Examples:
pool = build_states(df, id_column="id", start_column="start")
pool.temporal_data(fmt="polars").head()
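When end_column is omitted, each end is taken from the next start. The sketch below illustrates that derivation for one sequence in plain Python; derive_ends is a hypothetical helper, and its end_value argument only mirrors the sentinel described for StateSequencePool.builder(), since build_states() itself leaves the last end as null.

```python
def derive_ends(starts, end_value=None):
    """Hypothetical helper: pair each start with the next start; the last end is end_value."""
    ends = list(starts[1:]) + [end_value]
    return list(zip(starts, ends))


states = derive_ends([0, 5, 9])                # last state stays open-ended (None)
closed = derive_ends([0, 5, 9], end_value=12)  # last state closed by a sentinel
```

States built this way are contiguous by construction, since every end is defined as the following start.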