tanat.sequence package

Subpackages

Submodules

tanat.sequence.shortcuts module

Quick-build helpers for sequence pools.

Build an EventSequencePool from a single DataFrame.

All columns in temporal_data except id_column and time_column are treated as entity features. All columns in static_data except id_column are treated as static features.

Parameters:

temporal_data – DataFrame or LazyFrame with at least id, time, and one feature column.
id_column – Name of the sequence identifier column (present in both temporal_data and static_data if provided).
time_column – Name of the timestamp column.
static_data – Optional DataFrame or LazyFrame with per-id static features. Must contain a column named id_column for joining.
store_name – Name for the on-disk store. When None a unique name is generated automatically (_quick_event_<hex8>).

Returns:

A ready-to-use EventSequencePool.

Raises:

ValueError – If id_column or time_column are missing, if no feature columns remain, or if id_column is absent from static_data.

Examples:

pool = build_events(df, id_column="patient", time_column="date")
pool.temporal_data(fmt="polars").head()
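The column rule above can be sketched in plain Python. This is only an illustration of the documented behaviour, not tanat's implementation; the helper name `split_columns` is hypothetical.

```python
def split_columns(columns, id_column, time_column):
    """Return the entity feature columns implied by the rule above.

    Hypothetical helper: every column that is not structural
    (id or time) is treated as an entity feature.
    """
    missing = [c for c in (id_column, time_column) if c not in columns]
    if missing:
        raise ValueError(f"missing required column(s): {missing}")
    features = [c for c in columns if c not in (id_column, time_column)]
    if not features:
        raise ValueError("no feature columns remain")
    return features


features = split_columns(["patient", "date", "drug", "dose"], "patient", "date")
print(features)  # ['drug', 'dose']
```

A frame holding only the id and time columns would fail the second check, which matches the "no feature columns remain" case listed under Raises.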

Build an IntervalSequencePool from a single DataFrame.

All columns in temporal_data except id_column, start_column, and end_column are treated as entity features.

Parameters:

temporal_data – DataFrame or LazyFrame with at least id, start, end, and one feature column.
id_column – Name of the sequence identifier column.
start_column – Name of the interval start column.
end_column – Name of the interval end column.
static_data – Optional DataFrame or LazyFrame with per-id static features.
store_name – Name for the on-disk store. When None a unique name is generated automatically (_quick_interval_<hex8>).

Returns:

A ready-to-use IntervalSequencePool.

Raises:

ValueError – If required columns are missing, if no feature columns remain, or if id_column is absent from static_data.

Examples:

pool = build_intervals(
    df, id_column="id", start_column="start", end_column="end",
)
pool.temporal_data(fmt="polars").head()

Build a StateSequencePool from a single DataFrame.

When end_column is None, the end of each state is derived from the start of the next state in the same sequence (the last state stays open-ended, with a null end).

All columns in temporal_data except the structural columns (id, start, and optionally end) are treated as entity features.

Parameters:

temporal_data – DataFrame or LazyFrame with at least id, start, and one feature column.
id_column – Name of the sequence identifier column.
start_column – Name of the state start column.
end_column – Name of the state end column. When None the builder derives end values automatically.
static_data – Optional DataFrame or LazyFrame with per-id static features.
store_name – Name for the on-disk store. When None a unique name is generated automatically (_quick_state_<hex8>).

Returns:

A ready-to-use StateSequencePool.

Raises:

ValueError – If required columns are missing, if no feature columns remain, or if id_column is absent from static_data.

Examples:

pool = build_states(df, id_column="id", start_column="start")
pool.temporal_data(fmt="polars").head()

Module contents

Sequence module entry point.

Bases: Entity

Entity representing one event row (single timestamp).

class tanat.sequence.EventSequence(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', time_column: str = 'time', entity_features: list[str] | None = None, static_features: list[str] | None = None)[source]#

Bases: Sequence

A single event sequence (one timestamp per entity row).

SETTINGS_CLASS[source]#: alias of EventSequenceSettings

__init__(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', time_column: str = 'time', entity_features: list[str] | None = None, static_features: list[str] | None = None) → None[source]#

Create an event sequence for id_value.

Parameters:

id_value – Sequence identifier.
store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
time_column – User-facing name for the event timestamp column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.

Bases: SequencePool

Pool of event sequences (single timestamp per entity row).

SETTINGS_CLASS[source]#: alias of EventSequenceSettings

Create an event sequence pool backed by store.

Parameters:

store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
time_column – User-facing name for the event timestamp column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.
cast_recipe – Optional cast recipe (or dict) applied at read time. Normalised via SequenceCastRecipe.coerce() and probed eagerly.

Raises:

TypeError – If cast_recipe is not a SequenceCastRecipe, dict, or None.

as_event() → EventSequencePool[source]#

Return this pool unchanged: source and target types are identical.

A warning is emitted to signal the no-op conversion.

Returns: self (no copy, no I/O).

as_interval(duration: Duration, *, start_column: str = 'start', end_column: str = 'end', destination: str | Path | None = None, overwrite: bool = False) → IntervalSequencePool[source]#

Convert this event pool to an interval pool by computing _t_end.

Each event timestamp becomes _t_start; _t_end is computed as _t_start + duration. The resulting time index is stored as a virtual override (ephemeral) or written to a new persistent store.

Parameters:

duration –
Interval length added to each event timestamp. Can be:
- A timedelta or numeric scalar: applied uniformly to every event.
- A str: name of an entity feature column whose values provide per-row durations.
start_column – User-facing name for the start column. Defaults to "start".
end_column – User-facing name for the end column. Defaults to "end".
destination – None → ephemeral result; path → new persistent store.
overwrite – Replace destination if it already exists.

Returns:

A new IntervalSequencePool.
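The two forms of duration can be illustrated with a small stdlib sketch. It assumes rows are plain dicts with a "start" key; the function name `add_end` and the "stay" column are hypothetical, and the real method operates on the store rather than on Python lists.

```python
from datetime import datetime, timedelta


def add_end(rows, duration):
    """Compute end = start + duration for every row.

    duration is either a scalar (applied uniformly) or a str naming
    a per-row duration column, mirroring the parameter description above.
    """
    out = []
    for row in rows:
        delta = row[duration] if isinstance(duration, str) else duration
        out.append({**row, "end": row["start"] + delta})
    return out


events = [
    {"start": datetime(2024, 1, 1), "stay": timedelta(days=2)},
    {"start": datetime(2024, 1, 5), "stay": timedelta(days=1)},
]

uniform = add_end(events, timedelta(hours=6))  # same length for every event
per_row = add_end(events, "stay")              # length read from a column

print(uniform[0]["end"])  # 2024-01-01 06:00:00
print(per_row[1]["end"])  # 2024-01-06 00:00:00
```

The same arithmetic works for numeric timesteps, since only `+` is used.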

Convert this event pool to a state pool by computing _t_end.

Each event timestamp becomes _t_start; _t_end is taken from the next event in the same sequence (shift(-1).over(_seq_id)).

Parameters:

end_value – Sentinel for _t_end of the last event per sequence. None leaves the last row with _t_end = null. A str names a static feature column whose per-sequence value fills the last _t_end.
start_column – User-facing name for the start column. Defaults to "start".
end_column – User-facing name for the end column. Defaults to "end".
destination – None → ephemeral result; path → new persistent store.
overwrite – Replace destination if it already exists.

Returns:

A new StateSequencePool.
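As a rough plain-Python equivalent of the "next event in the same sequence" rule, the sketch below derives each end from the following timestamp. The `last_end` mapping is a hypothetical stand-in for the per-sequence static value that end_value may name; sorting within each sequence is an assumption of this sketch.

```python
from itertools import groupby
from operator import itemgetter


def events_to_states(rows, last_end=None):
    """rows: (seq_id, t) tuples. Returns (seq_id, start, end) tuples.

    last_end: optional dict mapping seq_id to the end of its last state;
    sequences without an entry keep an open-ended (None) last state.
    """
    states = []
    ordered = sorted(rows, key=itemgetter(0, 1))
    for seq_id, group in groupby(ordered, key=itemgetter(0)):
        times = [t for _, t in group]
        # Equivalent of shifting the timestamps by -1 within the sequence.
        final = last_end.get(seq_id) if last_end else None
        ends = times[1:] + [final]
        states.extend((seq_id, s, e) for s, e in zip(times, ends))
    return states


rows = [("a", 1), ("a", 4), ("a", 9), ("b", 2)]
open_ended = events_to_states(rows)
closed = events_to_states(rows, last_end={"a": 12})

print(open_ended)  # [('a', 1, 4), ('a', 4, 9), ('a', 9, None), ('b', 2, None)]
print(closed[2])   # ('a', 9, 12)
```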

classmethod builder() → EventSequenceStoreBuilder[source]#: Return a fluent builder for constructing an event sequence store.

class tanat.sequence.EventSequenceSettings(*, id_column: str, entity_features: list[str], static_features: list[str] = <factory>, time_column: str)[source]#

Bases: SequenceSettings

Settings for event sequences (single timestamp column).

__init__(*args: Any, **kwargs: Any) → None[source]#

get_time_columns() → list[str][source]#: Returns time index columns for Event sequences [time].

id_column: str[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

time_column: str[source]#

Bases: Entity

Entity representing one interval row (start/end timestamps).

Unlike StateEntity, intervals are not required to be contiguous: gaps between intervals are allowed, and two intervals may overlap in time.

class tanat.sequence.IntervalSequence(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None)[source]#

Bases: Sequence

A single interval sequence.

Unlike state sequences, intervals are not required to be contiguous: gaps between intervals are allowed, and two intervals may overlap in time.

SETTINGS_CLASS[source]#: alias of IntervalSequenceSettings

__init__(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None) → None[source]#

Create an interval sequence for id_value.

Parameters:

id_value – Sequence identifier.
store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
start_column – User-facing name for the interval start column.
end_column – User-facing name for the interval end column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.

class tanat.sequence.IntervalSequencePool(store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None, cast_recipe: SequenceCastRecipe | dict | None = None)[source]#

Bases: SequencePool

Pool of interval sequences.

Unlike state sequences, intervals are not required to be contiguous: gaps between intervals are allowed, and two intervals may overlap in time.

SETTINGS_CLASS[source]#: alias of IntervalSequenceSettings

Create an interval sequence pool backed by store.

Parameters:

store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
start_column – User-facing name for the interval start column.
end_column – User-facing name for the interval end column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.
cast_recipe – Optional cast recipe (or dict) applied at read time. Normalised via SequenceCastRecipe.coerce() and probed eagerly.

Raises:

TypeError – If cast_recipe is not a SequenceCastRecipe, dict, or None.

as_event(anchor: Literal['start', 'end', 'middle'], *, time_column: str = 'time', destination: str | Path | None = None, overwrite: bool = False) → EventSequencePool[source]#

Convert this interval pool to an event pool by anchoring to one timestamp.

Parameters:

anchor – "start", "end", or "middle" - selects which timestamp (or their midpoint) becomes the event timestamp.
time_column – User-facing name for the event timestamp. Defaults to "time".
destination – None → ephemeral result; path → new persistent store.
overwrite – Replace destination if it already exists.

Returns:

A new EventSequencePool.
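The three anchors reduce to simple arithmetic on the interval bounds. The sketch below is illustrative only (the function name `anchor_time` is hypothetical); the midpoint is written as start + (end - start) / 2 so that it works for datetimes as well as numeric timesteps.

```python
from datetime import datetime


def anchor_time(start, end, anchor):
    """Pick the event timestamp for one interval."""
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "middle":
        # Same value as (start + end) / 2, but valid for datetimes too.
        return start + (end - start) / 2
    raise ValueError(f"unknown anchor: {anchor!r}")


mid_dt = anchor_time(datetime(2024, 1, 1), datetime(2024, 1, 3), "middle")
mid_num = anchor_time(0, 10, "middle")

print(mid_dt)   # 2024-01-02 00:00:00
print(mid_num)  # 5.0
```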

as_interval() → IntervalSequencePool[source]#

Return this pool unchanged - source and target types are identical.

A warning is emitted to signal the no-op conversion.

Returns: self (no copy, no I/O).

as_state() → NoReturn[source]#

Not supported: interval → state conversion is ambiguous.

Intervals may overlap or contain gaps; neither property can be resolved into contiguous non-overlapping states without domain-specific merge / fill logic. Apply a manual Polars transformation instead.

Raises: NotImplementedError – Always.
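To see why the conversion is ambiguous, it can help to list the gaps and overlaps in one sequence before deciding how to merge or fill them. The following is a stdlib sketch with a hypothetical helper name, operating on (start, end) tuples rather than on a pool.

```python
def find_conflicts(intervals):
    """Return (gaps, overlaps) for the intervals of a single sequence.

    Both lists are empty only when the intervals are contiguous and
    non-overlapping, i.e. when they already behave like states.
    """
    ordered = sorted(intervals)
    gaps, overlaps = [], []
    if not ordered:
        return gaps, overlaps
    current_end = ordered[0][1]
    for start, end in ordered[1:]:
        if start > current_end:
            gaps.append((current_end, start))
        elif start < current_end:
            overlaps.append((start, min(end, current_end)))
        # Track the furthest end seen so long intervals are not missed.
        current_end = max(current_end, end)
    return gaps, overlaps


gaps, overlaps = find_conflicts([(0, 5), (3, 8), (10, 12)])
print(gaps)      # [(8, 10)]
print(overlaps)  # [(3, 5)]
```

How each gap or overlap should be resolved (fill, truncate, prioritise one label) is the domain-specific decision the method declines to make.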

classmethod builder(*, sort_anchor: Literal['start', 'end', 'middle'] = 'start') → IntervalSequenceStoreBuilder[source]#

Return a fluent builder for constructing an interval sequence store.

Parameters: sort_anchor – Intra-sequence sort column: "start" (default), "end" for right-censored datasets, or "middle" to sort by the interval midpoint (T_START + T_END) / 2.

class tanat.sequence.IntervalSequenceSettings(*, id_column: str, entity_features: list[str], static_features: list[str] = <factory>, start_column: str, end_column: str)[source]#

Bases: SequenceSettings

Settings for interval sequences (start + end timestamp columns).

Unlike state sequences, intervals are not required to be contiguous: gaps between intervals are allowed, and two intervals may overlap in time.

__init__(*args: Any, **kwargs: Any) → None[source]#

end_column: str[source]#

get_time_columns() → list[str][source]#: Returns time index columns for Interval sequences [start, end].

id_column: str[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

start_column: str[source]#

class tanat.sequence.Sequence(id_value, store: SequenceStore, settings)[source]#

Bases: ABC, SequenceViewMixin, CachableSettings, Registrable

Interface to a single sequence within a Store.

A Sequence is a scoped view on the data for one specific ID. It shares the same SequenceStore as its parent Pool (no copy).

Typical creation patterns:

# From a Pool (recommended)
seq = pool[42]

# Standalone
seq = StateSequence(id_value=42, store="my_store")

__init__(id_value, store: SequenceStore, settings) → None[source]#

Base initialiser. Delegated to by concrete subclasses and from_parent() after store and feature resolution have been performed.

Parameters:

id_value – Unique identifier for this sequence in the store.
store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings (entity_features and static_features never None).

apply(exprs: Expr | list[Expr], is_static: bool = False, *, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

Evaluates Polars expressions against this sequence’s features.

This is a read-only computation scoped to this single sequence. The result is returned, not stored.

At the Pool level, use Pool.apply(by_id=True) for per-sequence computations across all sequences, then Pool.add_entity_features() or Pool.add_static_features() to persist.

Parameters:

exprs – One or more Polars expressions producing new columns. Each must use .alias() to name the output.
is_static – Whether to read static or entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

The computed columns for this sequence only.

Examples

Local normalization:

seq = pool[42]
result = seq.apply(
    (pl.col("value") - pl.col("value").mean()).alias("v_centered")
)

Multiple expressions:

result = seq.apply([
    (pl.col("value").diff()).alias("v_diff"),
    (pl.col("value").rolling_mean(3)).alias("v_rm3"),
])

See also

Pool.apply – Apply across all sequences (with optional by_id).
Pool.add_entity_features – Persist entity features.
Pool.add_static_features – Persist static features.

copy() → Sequence[source]#

Return a standalone copy of this sequence, detached from any parent pool.

Returns: A new standalone Sequence with _parent_pool=None.

Examples:

seq = pool[42]
standalone = seq.copy()         # detaches from pool
standalone.filter_entities(crit, inplace=True)  # safe

describe(fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

Compute summary statistics for this single sequence.

Parameters:

fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

Single-row DataFrame with columns [length, n_unique_entities, temporal_span, …].

Examples:

seq = pool[42]
seq.describe()
seq.describe(fmt="polars")

filter_entities(criterion: Criterion, *, inplace: bool = False, verbose: bool = True) → Sequence[source]#

Return a view with entities pruned by criterion.

Parameters:

criterion – A Criterion instance supporting ENTITY.
inplace – If True, modify this sequence in place.
verbose – If True, print a one-line report.

Returns:

Filtered sequence (or self when inplace=True).

Raises:

TypeError – If criterion is not a Criterion object.
CriterionLevelError – If the criterion does not support entity filtering.

classmethod from_parent(id_value, store: SequenceStore, settings, *, parent_pool: SequencePool) → Sequence[source]#

Create a pool-managed sequence. Not part of the public API.

Bypasses store resolution, feature resolution, and cast probe: all already performed by the pool. Every piece of pool context (casts, filters, virtual ID, T0) is read lazily from parent_pool via the corresponding cached properties.

Parameters:

id_value – Sequence identifier.
store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings.
parent_pool – The owning SequencePool.

Returns:

A new Sequence instance bound to parent_pool.

property id_value[source]#: The sequence identifier.

match(criterion: Criterion) → bool[source]#

Return True if this sequence satisfies criterion.

Parameters:

criterion – A Criterion instance.

Raises:

TypeError – If criterion is not a Criterion object.
CriterionLevelError – If the criterion is incompatible with this sequence.

static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars', 'dict'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame | None[source]#

Return static (non-temporal) data for this sequence.

Parameters:

features – Feature name(s) to include (None -> all).
fmt – "pandas" (default), "polars", or "dict".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

Single-row DataFrame with columns [id, feature…]; a Python dictionary of feature name/value pairs when fmt="dict"; or None when no static features are exposed by this sequence.

Examples:

seq = pool[42]
row = seq.static_data()               # pandas, all static features
row = seq.static_data(["age", "sex"]) # subset of features
row = seq.static_data(fmt="dict")     # returns a dictionary

property t0: datetime | date | int | float | None[source]#

T0 value for this sequence (scalar, not a DataFrame).

None when no valid T0 row was found (e.g. sequence too short, or no row matched the query).

property t0_nearest_rank: int | None[source]#

0-based rank of the nearest row at or before T0 within this sequence.

None when no valid T0 row was found (e.g. sequence too short, T0 before all timestamps, or no row matched the query).
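The "nearest row at or before T0" rule maps naturally onto a binary search. The sketch below assumes the timestamps of one sequence are sorted ascending and that ties resolve to the last matching row; both are assumptions of this illustration, and the function name is hypothetical.

```python
from bisect import bisect_right


def nearest_rank_at_or_before(timestamps, t0):
    """0-based rank of the last timestamp <= t0, or None if there is none."""
    if t0 is None:
        return None
    rank = bisect_right(timestamps, t0) - 1
    return rank if rank >= 0 else None


times = [10, 20, 30]
print(nearest_rank_at_or_before(times, 25))  # 1
print(nearest_rank_at_or_before(times, 20))  # 1
print(nearest_rank_at_or_before(times, 5))   # None
```

The last call shows the "T0 before all timestamps" case, where no rank exists.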

temporal_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

Return temporal data for this sequence.

Each row is one entity: the atomic observation of this sequence (an event, a state, or a time-step). Each entity carries the sequence ID, its temporal position (one column for events, two for intervals), and entity features: the per-row measurements that vary along the sequence (e.g. heart rate, label, sensor value).

Parameters:

features – Entity feature name(s) to include. None → all entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

DataFrame with columns [id, temporal…, feature…] scoped to this sequence ID.

Examples:

seq = pool[42]
df = seq.temporal_data()                    # pandas, all features
df = seq.temporal_data("heart_rate")        # single feature
df = seq.temporal_data(fmt="polars")

class tanat.sequence.SequencePool(store: SequenceStore, settings: SequenceSettings | dict, cast_recipe: SequenceCastRecipe | dict | None = None)[source]#

Bases: ABC, SequenceViewMixin, CachableSettings, Registrable

Base class for sequence pool objects.

MAX_BINS_LIMIT: int = 2000[source]#

__init__(store: SequenceStore, settings: SequenceSettings | dict, cast_recipe: SequenceCastRecipe | dict | None = None) → None[source]#

Base initialiser. Delegated to by concrete subclasses after store and feature resolution have been performed.

Parameters:

store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings (or equivalent dict). entity_features and static_features never None.
cast_recipe – Optional cast recipe (or dict) applied at read time. Normalised via SequenceCastRecipe.coerce() and probed eagerly.

Raises:

TypeError – If cast_recipe is not a SequenceCastRecipe, dict, or None.

add_entity_features(df: DataFrame | DataFrame | LazyFrame, *, overwrite: bool = False) → None[source]#

Add new entity features to the virtual store.

The input DataFrame must be positionally aligned with the full entity row set of the store (i.e. it must have exactly as many rows as there are entity rows in the physical store, not just the current view). Use save() first to materialise a filtered view before calling this method.

Parameters:

df – Feature-only DataFrame (no ID column) positionally aligned with the entity rows in the store. Can be pandas, Polars eager, or Polars lazy.
overwrite – If True, replace existing features with the same name in the virtual context.

Raises:

RuntimeError – If the pool has an active _id_mask or entity filter expression. Call pool.save() first and then add features to the resulting unfiltered pool.
ValueError – If the number of rows in df does not match the number of entity rows in the store.

add_static_features(df: DataFrame | DataFrame | LazyFrame, *, id_column: str | None = None, overwrite: bool = False) → None[source]#

Add static features to the virtual store via an ID-keyed join.

The input DataFrame must include the ID column (either under settings.id_column or under the name given by id_column). A LEFT JOIN against the full sequence index is performed internally, so partial DataFrames (covering only a subset of IDs) are accepted: IDs absent from df receive null in the virtual context.

Parameters:

df – DataFrame containing the ID column plus one or more feature columns. Can be pandas, Polars eager, or Polars lazy.
id_column – Name of the ID column in df. Defaults to settings.id_column when None. Pass an explicit name when the join key in df differs from the pool’s public ID name (e.g. id_column="patient_id").
overwrite – If True, replace existing features with the same name in the virtual context.

Raises:

KeyError – If the resolved ID column is not found in df.
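The left-join behaviour can be pictured with plain dictionaries. This is a simplified sketch, not the library's code: `left_join_static` is a hypothetical name, rows are dicts, and `all_ids` stands for the full sequence index of the pool.

```python
def left_join_static(all_ids, rows, id_column):
    """Attach static features to every ID; IDs absent from rows get None."""
    if any(id_column not in row for row in rows):
        raise KeyError(id_column)
    by_id = {
        row[id_column]: {k: v for k, v in row.items() if k != id_column}
        for row in rows
    }
    feature_names = sorted({name for feats in by_id.values() for name in feats})
    return {
        seq_id: {name: by_id.get(seq_id, {}).get(name) for name in feature_names}
        for seq_id in all_ids
    }


partial = [{"patient_id": 1, "age": 40}, {"patient_id": 3, "age": 55}]
joined = left_join_static([1, 2, 3], partial, id_column="patient_id")
print(joined)  # {1: {'age': 40}, 2: {'age': None}, 3: {'age': 55}}
```

ID 2 is missing from the partial input, so it receives a null value rather than being dropped.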

apply(exprs: Expr | list[Expr], is_static: bool = False, *, by_id: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

Evaluate Polars expressions against the current features.

This is a read-only computation: the result is returned, not stored. Use add_entity_features() or add_static_features() to persist the result.

Each expression must produce a named column (.alias()).

When by_id=True, the result always includes the ID column (settings.id_column).

Parameters:

exprs – One or more Polars expressions producing new columns.
is_static – Whether to read static or entity features.
by_id – If True, expressions are evaluated per sequence (group_by on the sequence ID). Only valid for entity features (is_static=False). The ID column is included in the result.
fmt –
Format of the returned object. One of:
- "pandas" (default): returns a pandas.DataFrame.
- "polars": returns a polars.DataFrame.

Returns:

The computed columns as a DataFrame. When by_id=True, the first column is the sequence ID.

Raises:

ValueError – If by_id=True and is_static=True.

Examples

Compute and inspect:

result = pool.apply(
    (pl.col("age") * pl.col("score")).alias("age_score"),
    is_static=True,
)
print(result)

Persist entity features:

result = pool.apply(
    (pl.col("value") - pl.col("value").mean()).alias("centered"),
)
pool.add_entity_features(result)

Per-sequence aggregation (result includes ID column):

summary = pool.apply(
    pl.col("value").mean().alias("value_mean"),
    by_id=True,
)
pool.add_static_features(summary)

Per-sequence normalization (result includes ID column):

normed = pool.apply(
    (pl.col("value") - pl.col("value").mean()).alias("v_normed"),
    by_id=True,
)
pool.add_entity_features(normed)
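The difference between the two centering examples above comes down to the scope of the mean. The plain-Python sketch below assumes that by_id=True scopes the aggregate to each sequence while the default evaluates it over all rows; it uses tuples instead of a pool and is only meant to show the arithmetic.

```python
from collections import defaultdict
from statistics import mean

rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)]

# Default scope: one mean over every entity row.
global_mean = mean(value for _, value in rows)
centered_global = [value - global_mean for _, value in rows]

# Per-sequence scope: one mean per sequence ID.
groups = defaultdict(list)
for seq_id, value in rows:
    groups[seq_id].append(value)
seq_mean = {seq_id: mean(values) for seq_id, values in groups.items()}
centered_by_id = [(seq_id, value - seq_mean[seq_id]) for seq_id, value in rows]

print(centered_global)  # [-6.0, -4.0, 3.0, 7.0]
print(centered_by_id)   # [('a', -1.0), ('a', 1.0), ('b', -2.0), ('b', 2.0)]
```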

binned_data(features: list[str] | str, bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True, bin_col: str = '__bin__') → DataFrame | DataFrame[source]#

Project sequences onto a binned temporal table (long-format dataframe).

Each sequence is aligned to a shared time axis divided into fixed-size bins. When multiple values compete for the same bin, overlap_rule resolves the ambiguity. Empty bins are filled with fill_value.

For an ML-ready 3-D tensor with feature labels and ID order, see to_tensor().

Parameters:

features – Feature(s) to project onto the grid.
bin_size – Width of each bin (duration string for datetime, numeric otherwise).
max_bins – Maximum number of bins. None infers the count from the data span, capped by MAX_BINS_LIMIT. An explicit value bypasses the safety cap; the caller opts in knowingly.
fill_value – Value used to fill empty bins. None keeps nulls.
overlap_rule – Polars aggregation name for in-bin conflict resolution ("first", "last", "mean", "max", "sum", "median", …).
ohe – One-hot encode features before binning. Requires Categorical or Enum dtypes.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
bin_col – Output column name for the bin index.

Returns:

DataFrame with columns [id_col, bin_col, *feature_cols] and len(unique_ids) * n_bins rows.
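A toy version of the projection makes the row count and the roles of overlap_rule and fill_value concrete. The sketch assumes numeric timestamps and a shared axis starting at the global minimum; those choices, and the function name `bin_sequences`, are assumptions of this illustration rather than documented behaviour.

```python
from statistics import mean


def bin_sequences(rows, bin_size, fill_value=None, overlap_rule="first"):
    """rows: (seq_id, t, value) tuples. Returns long-format (seq_id, bin, value)."""
    rules = {
        "first": lambda vs: vs[0],
        "last": lambda vs: vs[-1],
        "max": max,
        "sum": sum,
        "mean": mean,
    }
    pick = rules[overlap_rule]
    t_min = min(t for _, t, _ in rows)
    t_max = max(t for _, t, _ in rows)
    n_bins = int((t_max - t_min) // bin_size) + 1
    buckets = {}
    for seq_id, t, value in sorted(rows, key=lambda r: (r[0], r[1])):
        index = int((t - t_min) // bin_size)
        buckets.setdefault((seq_id, index), []).append(value)
    ids = sorted({seq_id for seq_id, _, _ in rows})
    # Every (id, bin) pair is emitted, so the result has len(ids) * n_bins rows.
    return [
        (i, b, pick(buckets[(i, b)]) if (i, b) in buckets else fill_value)
        for i in ids
        for b in range(n_bins)
    ]


rows = [("a", 0, 1), ("a", 1, 2), ("a", 5, 3), ("b", 2, 7)]
table = bin_sequences(rows, bin_size=2, fill_value=0)
print(table)
# [('a', 0, 1), ('a', 1, 0), ('a', 2, 3), ('b', 0, 0), ('b', 1, 7), ('b', 2, 0)]
```

With overlap_rule="sum", the first bin of sequence "a" would hold 3 instead of 1, because two values fall into it.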

cast_features(schema: dict[str, DataType | type], is_static: bool = False, strict: bool = True) → None[source]#

Casts feature columns to new types, scoped to this Pool only.

To make a cast permanent on disk, save the Pool first (pool.save()) and reload. Persisting changes might affect other views sharing the same store, so use with caution.

Parameters:

schema – Dictionary mapping feature names to target Polars DataTypes.
is_static – Whether these are static features (True) or entity features (False).
strict – When True (default), non-convertible values raise a TypeError during probing. When False, non-convertible values silently become null.

Raises:

TypeError – If schema is not a dict.
KeyError – If a feature name does not exist.
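The strict flag can be illustrated with a minimal stdlib sketch: in strict mode a non-convertible value raises, otherwise it becomes null. The helper `cast_column` is hypothetical and works on a Python list rather than on a feature column.

```python
def cast_column(values, target, strict=True):
    """Cast every value with target(); failures raise or become None."""
    out = []
    for value in values:
        try:
            out.append(target(value))
        except (TypeError, ValueError) as exc:
            if strict:
                raise TypeError(
                    f"cannot cast {value!r} to {target.__name__}"
                ) from exc
            out.append(None)
    return out


lenient = cast_column(["1", "2", "x"], int, strict=False)
print(lenient)  # [1, 2, None]
```

Calling the same function with strict=True on this input raises TypeError at the value "x".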

cast_id(dtype: DataType) → None[source]#

Casts the ID column to a new type.

Parameters: dtype – The target Polars DataType.

cast_to_datetime(unit: str = 'us', time_zone: str | None = None)[source]#

Cast time columns to Datetime.

Parameters:

unit – The datetime resolution ("ms", "us", "ns"). Default is "us" (microsecond), the resolution of Python's standard datetime type.
time_zone – Optional timezone string (e.g. "UTC", "Europe/Paris").

cast_to_timestep(dtype: DataType = Int64)[source]#

Cast time columns to numeric-based timesteps.

Parameters:

dtype – The target numeric type (e.g., pl.UInt32, pl.Int64, pl.Float64). Default is pl.Int64 for safety.

Raises:

TypeError – If dtype is not a numeric type.
TypeError – If the underlying data is already in Datetime format (conversion from Datetime to timestep is not allowed).

copy() → SequencePool[source]#

Returns a shallow copy of this Pool that shares the same store and preserves all view state (masks, casts, virtual features).

The virtual context is forked into a new UUID so that the copy owns its own independent context. Garbage-collecting either instance will not destroy the other’s virtual features.

The T0 strategy (_t0_setter) is propagated to the copy. The T0 result cache is not copied; it is recomputed on the first call to t0_data() on the copy.

describe(by_id: bool = True, add_to_static: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

Compute summary statistics for every sequence in the pool.

Parameters:

by_id – If True (default), return one row per sequence ID with columns [id, length, n_unique_entities, …]. If False, return the cross-sequence pandas .describe() (count, mean, std, min, 25%, …).
add_to_static – If True, write the per-ID result to the static-feature store via add_static_features(). Ignored (with a warning) when by_id=False.
fmt – "pandas" (default) or "polars". Ignored when by_id=False (always pandas).
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

DataFrame with one row per sequence ID when by_id=True. Aggregated statistics (pandas describe() output) when by_id=False.

Examples:

pool.describe()                          # one row per ID, pandas
pool.describe(fmt="polars")              # same, polars
pool.describe(by_id=False)               # cross-ID stats
pool.describe(add_to_static=True)        # persist as static cols

drop_features(features: list[str], is_static: bool = False, *, permanently: bool = False) → None[source]#

Removes features from the current view.

By default, this is a soft drop: features are removed from the Pool settings so they no longer appear in temporal_data(), static_data() or metadata, but the underlying files are left untouched.

With permanently=True the columns are also physically deleted from disk (physical store and/or virtual store).

Parameters:

features – Feature names to drop.
is_static – True for static features, False for entity features.
permanently – If True, also delete the columns from disk. This is irreversible for physical features.

extend(other: SequencePool | Sequence, destination: str | Path | None = None, *, on_duplicate: Literal['raise', 'skip'] = 'raise', overwrite: bool = False) → SequencePool[source]#

Merge other into this pool and write the result to disk.

Mirrors the semantics of save().

Same-store fast path: if self and other point to the same physical store and neither carries virtual content (_virtual_id is None on both sides), no read I/O is performed. A new pool backed by the same store with the union of both ID masks is built immediately. If destination is provided the merged pool is then materialised to disk via save(); otherwise it is returned as an in-memory view with zero I/O.

Different stores (or virtual content present): a full merge-and-write is performed and destination is required. A named destination writes the merged data to a new store. To rewrite in-place, pass destination=self._store.root_path explicitly together with overwrite=True.

Parameters:

other –
Data to merge. Accepted types:
- SequencePool: must be the same concrete subclass with an identical entity feature schema.
- Sequence: single sequence object; schema is checked against this pool.
destination – None → in-memory view (same-store fast path only; no I/O); str / Path → materialise the merged data to disk. destination is required when merging from different stores.
on_duplicate –
Behaviour when other contains an ID already present in this pool:
- "raise" (default): raise ValueError listing the conflicting IDs.
- "skip": silently ignore duplicates.
overwrite – Allows overwriting an existing destination when it already exists on disk.

Returns:

A new SequencePool instance.

Raises:

TypeError – If other is not a SequencePool or Sequence.
ValueError – If other is a SequencePool of a different concrete type.
ValueError – If other is missing entity features declared in this pool’s settings.
ValueError – If on_duplicate="raise" and duplicate IDs are found.
ValueError – If destination=None and the stores differ (an in-memory view without I/O is only available on the same-store fast path).
FileExistsError – If destination exists and overwrite=False.

See also

save()
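The on_duplicate policy can be sketched at the level of ID lists. This is an illustration of the documented "raise" and "skip" behaviours only; `merge_ids` is a hypothetical helper and ignores schema checks and storage entirely.

```python
def merge_ids(own_ids, other_ids, on_duplicate="raise"):
    """Return the merged ID list, applying the duplicate policy."""
    if on_duplicate not in ("raise", "skip"):
        raise ValueError(f"unknown on_duplicate: {on_duplicate!r}")
    own = set(own_ids)
    duplicates = sorted(i for i in other_ids if i in own)
    if duplicates and on_duplicate == "raise":
        raise ValueError(f"duplicate IDs: {duplicates}")
    # "skip": keep this pool's version and ignore the incoming duplicates.
    return list(own_ids) + [i for i in other_ids if i not in own]


merged = merge_ids([1, 2], [2, 3], on_duplicate="skip")
print(merged)  # [1, 2, 3]
```

With the default policy the same call raises ValueError and names ID 2 as the conflict.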

filter_entities(criterion: Criterion, *, inplace: bool = False, verbose: bool = True) → SequencePool[source]#

Return a view with entities pruned by criterion.

Parameters:

criterion – A Criterion instance supporting ENTITY.
inplace – If True, modify this pool in place.
verbose – If True, print a one-line report.

Returns:

Filtered pool (or self when inplace=True).

Raises:

TypeError – If criterion is not a Criterion object.
CriterionLevelError – If the criterion does not support entity filtering.

classmethod from_parent(store_path: Path, *, parent_pool: TrajectoryPool) → SequencePool[source]#

Create a managed sub-pool owned by parent_pool.

Builds the pool from store_path, aligns settings and casts with the parent TrajectoryPool, and marks it locked. T0 is resolved lazily via the _t0_setter property which delegates to parent_pool.

Parameters:

store_path – Root path of the sequence store to load.
parent_pool – The owning TrajectoryPool.

Returns:

A locked SequencePool whose T0 delegates to parent_pool.

get_sequences(entity_features: list[str] | None = None, static_features: list[str] | None = None) → dict[str, Sequence][source]#

Return a mapping of sequence IDs to Sequence objects.

Parameters:

entity_features – Entity feature subset to expose in each sequence. None -> use pool-level settings.
static_features – Static feature subset to expose in each sequence. None -> use pool-level settings.

Returns:

Dict mapping each visible sequence ID to its Sequence instance.

Examples:

seqs = pool.get_sequences()
print(seqs[42].temporal_data())

property is_dirty: bool[source]#

True if the pool has state not yet written to disk.

Covers virtual features, view scopes, type casts, and soft feature drops. A dirty pool needs save() to materialise its current view.

save(destination: str | Path | None = None, *, overwrite: bool = False) → Path[source]#

Persist the current pool state (virtual features + view masks).

Without destination the store is rewritten in-place. With destination the pool is rebuilt into that path and then redirects to it; the original files are left untouched.

In both cases the pool is left in a clean state after a successful save: virtual context, masks, soft-drops and cast recipe are all reset, and is_dirty becomes False.

When a mask is active and destination is None, the store is overwritten with a subset of the data. overwrite=True is required to confirm.

Parameters:

destination – None → in-place; path → rebuild into new path and redirect the pool there.
overwrite – Required when saving a filtered view in-place. Also allows overwriting an existing destination.

Returns:

The Path of the written store.

Raises:

FileExistsError – If destination exists and overwrite is False.
RuntimeError – If a mask is active and the save is in-place (destination is None) without overwrite=True.
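The documented preconditions can be summarised as a small decision function. This is only a sketch of the rules stated above: check_save and its boolean arguments are hypothetical and do not exist in the library.

```python
def check_save(mask_active, destination, overwrite, destination_exists=False):
    """Mirror the documented save() preconditions (hypothetical helper)."""
    if destination is None:
        # In-place save: a filtered view would overwrite the store with a subset
        if mask_active and not overwrite:
            raise RuntimeError("in-place save of a filtered view needs overwrite=True")
        return "in-place"
    # Save to a new path: refuse to clobber an existing store unless asked to
    if destination_exists and not overwrite:
        raise FileExistsError(str(destination))
    return "rebuild"


print(check_save(mask_active=True, destination=None, overwrite=True))
```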

set_t0(*, position: int | None = None, direct=None, feature: str | None = None, query: Expr | None = None, anchor: Literal['start', 'end', 'middle'] | None = None, use_first: bool = True) → SequencePool[source]#

Configure the T0 strategy for this pool.

Exactly one strategy keyword must be provided. All others must remain None.

Parameters:

position – Row index (0-based; negative indexing supported).
direct – Scalar value or {seq_id: value} dict.
feature – Static feature column name.
query – Polars boolean expression on any sequence column (time columns or entity features).
anchor –
Which end of each interval/state row to use as the reference timestamp for the floor lookup:
- "start" (default): use the start timestamp.
- "end": use the end timestamp.
- "middle": use the midpoint (start + end) / 2.
Omitting anchor= on an interval/state pool emits a UserWarning and defaults to "start". Passing anchor= on an event pool emits a UserWarning and the value is ignored (single time column, anchor is irrelevant).
use_first – For the query strategy, whether to take the first (True) or last (False) matching row.

Returns:

self for chaining.
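To make the anchor options concrete, the sketch below computes the reference timestamp of a single interval/state row. anchor_timestamp is a hypothetical helper written for illustration. The midpoint is computed as start + (end - start) / 2, which equals (start + end) / 2 and also works for datetime values.

```python
from datetime import datetime


def anchor_timestamp(start, end, anchor="start"):
    """Pick the reference timestamp of one interval row (hypothetical helper)."""
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "middle":
        # Same value as (start + end) / 2, but valid for datetimes too
        return start + (end - start) / 2
    raise ValueError(f"Unknown anchor: {anchor!r}")


print(anchor_timestamp(0, 10, "middle"))
print(anchor_timestamp(datetime(2024, 1, 1), datetime(2024, 1, 3), "middle"))
```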

static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → pandas.DataFrame | polars.DataFrame | None[source]#

Return static (non-temporal) data for all sequences in this pool.

Parameters:

features – Feature name(s) to include (None -> all static features).
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

One-row-per-sequence DataFrame with columns [id, feature...]. None when no static features are exposed by this pool.

Examples:

df = pool.static_data()               # pandas, all static features
df = pool.static_data(["age", "sex"]) # subset
df = pool.subset([1, 2, 3]).static_data()

subset(ids, *, inplace=False) → SequencePool[source]#

Return a new pool containing only the specified sequence IDs.

Parameters:

ids – A list of sequence IDs to include in the subset.
inplace – If True, modify this Pool’s view instead of returning a new one.

Returns:

A new SequencePool instance with the subset of IDs, or self if inplace=True.

Raises:

ValueError – If any ID in ids is not present in unique_ids.

survival_target(endpoint_time: str, occurred: str | None = None, censure_time: str | None = None, fmt: Literal['sksurv', 'polars', 'pandas'] = 'sksurv') → tuple[ndarray | pandas.DataFrame | polars.DataFrame, list][source]#

Build a survival target (occurred, time) from static features stored in the pool.

For each patient, assembles a binary indicator (did the endpoint occur?) and the corresponding duration (time from T0 to the endpoint, or to the last observation for censored patients). Patients with unresolvable or non-positive durations are excluded and reported via a warning.

Durations are computed as the difference between the absolute value stored in the static column and the per-patient T0.

If censure_time is None, the last recorded time in the sequence is used as the censoring reference.

Parameters:

endpoint_time – Name of the static column containing the absolute time at which the endpoint occurred (e.g. age at death, event datetime). T0 is subtracted internally to produce the duration. When occurred is None, a null value means the endpoint was not observed (censored). Expected dtype: same as the pool time axis (numeric for timestep pools, datetime for datetime pools).
occurred – Name of the static column with a binary endpoint indicator. True (or 1) = endpoint observed, False (or 0) or null = censored. Expected dtype: bool or numeric (int or float); cast to bool internally. If None, inferred as endpoint_time.is_not_null().
censure_time – Name of the static column containing the absolute time of the last observation for censored patients (e.g. age at last visit, last visit datetime). T0 is subtracted internally to produce the duration. Expected dtype: same as the pool time axis. If None, derived automatically as max(get_temporal_columns()[-1]) per patient (i.e. max of time_column for events, max of end_column for state/interval pools).
fmt – Format of the returned target y. “sksurv” returns a structured np.ndarray with fields (occurred: bool, time: float) compatible with scikit-survival. “polars” and “pandas” return a DataFrame with columns [“id”, “occurred”, “time”].

Returns:

A tuple (y, valid_ids). y is the survival target in the requested format. valid_ids is the list of patient identifiers retained after filtering invalid rows.

Raises:

KeyError – If a column name is not found in the pool’s static features.
RuntimeError – If no static features are available in this pool.

Examples

>>> y, valid_ids = pool.survival_target(
...     endpoint_time="death_age_occur",
... )
>>> y, valid_ids = pool.survival_target(
...     endpoint_time="death_age_occur",
...     occurred="death",
... )
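The duration logic described above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the library's code: survival_rows is a hypothetical helper, times are numeric, and occurred is inferred from a non-null endpoint (the occurred=None case).

```python
def survival_rows(records, t0):
    """Build (id, occurred, duration) rows (hypothetical helper).

    records: {id: (endpoint_time or None, censor_time or None)}
    t0:      {id: T0 value}
    """
    rows, dropped = [], []
    for pid, (endpoint, censor) in records.items():
        occurred = endpoint is not None
        reference = endpoint if occurred else censor
        start = t0.get(pid)
        # Unresolvable or non-positive durations are excluded
        if reference is None or start is None or reference - start <= 0:
            dropped.append(pid)
            continue
        rows.append((pid, occurred, float(reference - start)))
    return rows, dropped


records = {"a": (70, 75), "b": (None, 80), "c": (50, None)}
rows, dropped = survival_rows(records, t0={"a": 60, "b": 60, "c": 55})
print(rows, dropped)
```

Here "c" is dropped because its endpoint precedes T0, which gives a negative duration.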

t0_data(fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → pandas.DataFrame | polars.DataFrame[source]#

Return the T0 table for the sequences visible in this pool view.

Thin public wrapper around _get_t0_df() that handles format conversion.

Parameters:

fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

DataFrame with columns [id_col, _T0_, _T0_NEAREST_RANK_], one row per visible sequence.

Examples:

pool.set_t0(position=0, anchor="start")
df = pool.t0_data()
df_pl = pool.t0_data(fmt="polars")

temporal_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → pandas.DataFrame | polars.DataFrame[source]#

Return temporal data for all sequences visible in this pool.

Each row is one entity: the atomic observation of a sequence (an event, a state, or a time-step). Each entity carries the sequence ID, its temporal position (one column for events, two for intervals), and entity features: the per-row measurements that vary along the sequence (e.g. heart rate, label, sensor value).

Parameters:

features – Entity feature name(s) to include. None → all entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

Long-format DataFrame with columns [id, temporal…, feature…] covering every visible sequence.

Examples:

df = pool.temporal_data()                    # pandas, all features
df = pool.temporal_data("heart_rate")        # single feature
df = pool.temporal_data(["a", "b"], fmt="polars")
# Restrict to a subset of IDs:
df = pool.subset([1, 2, 3]).temporal_data()

to_dummies(features: list[str] | str, is_static: bool = False, *, drop_first: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → pandas.DataFrame | polars.DataFrame[source]#

One-hot encode categorical features.

Returns a DataFrame with binary columns for each category. Only features typed as Categorical or Enum are accepted. Cast first with cast_features if needed.

This is a consumption method: the result is returned, not stored in the pool. Use it when preparing data for training.

Parameters:

features – Feature name(s) to encode.
is_static – Whether these are static or entity features.
drop_first – Drop the first category column to avoid multicollinearity (useful for linear models).
fmt –
Format of the returned object:
- "pandas" (default): returns a pandas.DataFrame.
- "polars": returns a polars.DataFrame.

Returns:

DataFrame with binary columns (one per category per feature).

Raises:

TypeError – If any feature is not Categorical or Enum.

Examples

Basic usage:

pool.cast_features(schema={"status": pl.Categorical})
dummies = pool.to_dummies("status")
# → status_OK, status_ERROR, status_WARNING columns

With drop_first for linear models:

X = pool.to_dummies("group", is_static=True, drop_first=True)
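The effect of drop_first can be shown without any dataframe library. The sketch below is a hand-rolled encoder, not what to_dummies does internally: one_hot is a hypothetical helper, and the category order is passed explicitly as an assumption.

```python
def one_hot(values, categories, prefix, drop_first=False):
    """One-hot encode a list of values (hypothetical helper)."""
    # With drop_first, the first category is implied by all-zero rows
    kept = categories[1:] if drop_first else categories
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]


cats = ["OK", "ERROR", "WARNING"]
print(one_hot(["OK", "ERROR"], cats, "status"))
print(one_hot(["OK", "ERROR"], cats, "status", drop_first=True))
```

Dropping one column removes the exact linear dependency between the dummy columns, which is why it helps linear models.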

to_tensor(features: list[str] | str, bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, bin_col: str = '__bin__') → tuple[ndarray, list, list[str]][source]#

Project sequences onto a 3-D temporal grid (ML-ready tensor).

Returns a dense (N, M, K) ndarray together with the sequence IDs on the N axis and the list of K-axis feature labels. Sequence order on the N axis follows unique_ids.

For a long-format dataframe variant (joins, plotting, exploration), see binned_data().

Parameters:

features – Feature(s) to project.
bin_size – Bin width (duration string for datetime, numeric otherwise).
max_bins – Maximum number of bins. None infers it from the data span, capped by MAX_BINS_LIMIT. An explicit value bypasses the safety cap; the caller opts in knowingly.
fill_value – Value for empty bins. None keeps NaN.
overlap_rule – In-bin aggregation. See binned_data().
ohe – One-hot encode before binning. Post-OHE column names are reflected in the returned feature_names.
bin_col – Internal bin column name (forwarded to the underlying pipeline for consistency; not present in the output).

Returns:

A 3-tuple (arr, ids, feature_names) where:
- arr has shape (N, M, K) = (len(unique_ids), n_bins, len(feature_names)).
- ids is the sequence of entity IDs matching the N-axis order (identical to unique_ids).
- feature_names lists the K-axis labels in column order (post-OHE names when ohe=True).

Examples:

arr, ids, names = pool.to_tensor(["dose", "route"], "1d", ohe=True)
# arr.shape == (N, M, K)
# names == ["dose", "route_oral", "route_iv"]
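The projection onto a fixed grid can be sketched with nested lists. This is a simplified illustration, not the library's pipeline: to_grid is a hypothetical helper, times are assumed to be numeric offsets starting at 0, only one feature is handled (K = 1), and overlap_rule="first" is assumed to keep the earliest row in each bin.

```python
import math


def to_grid(sequences, bin_size, n_bins, fill_value=None):
    """Project {seq_id: [(time, value), ...]} onto an (N, M, 1) grid."""
    ids = sorted(sequences)
    grid = [[[fill_value] for _ in range(n_bins)] for _ in ids]
    for n, seq_id in enumerate(ids):
        filled = set()
        for t, value in sorted(sequences[seq_id], key=lambda row: row[0]):
            m = math.floor(t / bin_size)  # bin index on the M axis
            # Keep only the earliest row per bin; skip out-of-range bins
            if 0 <= m < n_bins and m not in filled:
                grid[n][m][0] = value
                filled.add(m)
    return grid, ids


sequences = {2: [(0.5, "a"), (0.7, "b"), (2.1, "c")], 1: [(1.0, "x")]}
grid, ids = to_grid(sequences, bin_size=1.0, n_bins=3)
print(ids, grid)
```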

Split the pool into train and test subsets.

Mirrors the interface of sklearn.model_selection.train_test_split().

Parameters:

test_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the test subset. Defaults to 0.25 when both test_size and train_size are None.
train_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the train subset. Defaults to the complement of test_size.
random_state – Seed for the random number generator. Pass an integer for reproducibility.
shuffle – Whether to shuffle IDs before splitting. When False, the first IDs go to train and the last to test.

Returns:

A tuple (train_pool, test_pool) of two new non-overlapping pool views.

Raises:

ValueError – If the pool is empty, sizes are non-positive, or n_train + n_test exceeds the pool size.
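An ID-level split of this kind can be sketched with the standard library. split_ids is a hypothetical helper, and rounding the test count up is an assumption borrowed from scikit-learn's convention, not a documented detail of this method.

```python
import math
import random


def split_ids(ids, test_size=0.25, random_state=None, shuffle=True):
    """Split a list of IDs into (train, test) (hypothetical helper)."""
    ids = list(ids)
    if not ids:
        raise ValueError("cannot split an empty pool")
    # A float is a proportion, an int is an absolute count
    n_test = math.ceil(test_size * len(ids)) if isinstance(test_size, float) else test_size
    if not 0 < n_test < len(ids):
        raise ValueError("test_size leaves an empty train or test subset")
    if shuffle:
        random.Random(random_state).shuffle(ids)
    n_train = len(ids) - n_test
    # Without shuffling, the first IDs go to train and the last to test
    return ids[:n_train], ids[n_train:]


train, test = split_ids(range(8), test_size=0.25, random_state=0)
print(len(train), len(test))
```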

property unique_ids: list[source]#

Visible sequence IDs in store order as a plain Python list.

Respects _id_mask. Deterministic order (sorted at build time).

Warning

list erases rich Polars dtypes (Categorical → str). Prefer _id_lf when the result feeds a Polars join.

which(criterion: Criterion, *, verbose: bool = True) → set[source]#

Return the set of IDs in this pool satisfying criterion.

Parameters:

criterion – A Criterion instance.
verbose – If True, print a one-line report.

Returns:

Set of matching IDs.

Raises:

TypeError – If criterion is not a Criterion object.

Bases: Entity

Entity representing one state row (start/end).

States are contiguous and non-overlapping: the end of one state is always the start of the next, with no gaps in between.

class tanat.sequence.StateSequence(id_value, store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None)[source]#

Bases: Sequence

A single state sequence.

States are contiguous and non-overlapping: the end of one state is always the start of the next, with no gaps in between.

SETTINGS_CLASS[source]#: alias of StateSequenceSettings

Create a state sequence for id_value.

Parameters:

id_value – Sequence identifier.
store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
start_column – User-facing name for the state start column.
end_column – User-facing name for the state end column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.

filter_entities(criterion, *, inplace=False, verbose=True)[source]#

Not supported on state sequences.

States are contiguous and non-overlapping by definition: removing individual rows would leave temporal gaps and break the invariant T_END[i] == T_START[i+1].

class tanat.sequence.StateSequencePool(store: str | Path | SequenceStore, *, id_column: str = 'id', start_column: str = 'start', end_column: str = 'end', entity_features: list[str] | None = None, static_features: list[str] | None = None, cast_recipe: SequenceCastRecipe | dict | None = None)[source]#

Bases: SequencePool

Pool of state sequences.

States are contiguous and non-overlapping: the end of one state is always the start of the next, with no gaps in between.

SETTINGS_CLASS[source]#: alias of StateSequenceSettings

Create a state sequence pool backed by store.

Parameters:

store – Store path, name, or SequenceStore instance.
id_column – User-facing name for the sequence ID column.
start_column – User-facing name for the state start column.
end_column – User-facing name for the state end column.
entity_features – Subset of entity feature names to expose. None → all available from the store.
static_features – Static feature names to expose. None → all available. [] → none.
cast_recipe – Optional cast recipe (or dict) applied at read time. Normalised via SequenceCastRecipe.coerce() and probed eagerly.

Raises:

TypeError – If cast_recipe is not a SequenceCastRecipe, dict, or None.

as_event(anchor: Literal['start', 'end', 'middle'], *, time_column: str = 'time', destination: str | Path | None = None, overwrite: bool = False) → EventSequencePool[source]#

Convert this state pool to an event pool by anchoring to one timestamp.

Parameters:

anchor – "start", "end", or "middle" - selects which timestamp (or their midpoint) becomes the event timestamp.
time_column – User-facing name for the event timestamp. Defaults to "time".
destination – None → ephemeral result; path → new persistent store.
overwrite – Replace destination if it already exists.

Returns:

A new EventSequencePool.

as_interval(*, start_column: str | None = None, end_column: str | None = None, destination: str | Path | None = None, overwrite: bool = False) → IntervalSequencePool[source]#

Convert this state pool to an interval pool.

States and intervals share the same (_t_start, _t_end) physical layout - no temporal recomputation needed.

Parameters:

start_column – User-facing name for the start column. None inherits this pool’s current setting.
end_column – User-facing name for the end column. None inherits this pool’s current setting.
destination – None → ephemeral result; path → new persistent store.
overwrite – Replace destination if it already exists.

Returns:

A new IntervalSequencePool.

as_state() → StateSequencePool[source]#

Return this pool unchanged - source and target types are identical.

A warning is emitted to signal the no-op conversion.

Returns:

self (no copy, no I/O).

classmethod builder(*, end_value: datetime | int | float | None = None, validate_continuity: bool = True) → StateSequenceStoreBuilder[source]#

Return a fluent builder for constructing a state sequence store.

Parameters:

end_value – Sentinel for T_END of the last state in each sequence when end_column is not provided at source registration time. None → leaves the last T_END as null.
validate_continuity – When end_column is provided, verify that states are truly contiguous (T_END[i] == T_START[i+1]) before writing. Defaults to True. Set to False on large datasets where the cost of a full collect() is unacceptable.
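The continuity invariant T_END[i] == T_START[i+1] is easy to state in code. The check below is a minimal sketch on plain tuples: is_contiguous is a hypothetical helper, and the rows of one sequence are assumed to be sorted by start.

```python
def is_contiguous(rows):
    """Check T_END[i] == T_START[i+1] for one sequence of (start, end) rows."""
    return all(
        prev_end == next_start
        for (_, prev_end), (next_start, _) in zip(rows, rows[1:])
    )


states = [(0, 2), (2, 5), (5, 9)]
print(is_contiguous(states))                   # contiguous states
print(is_contiguous([states[0], states[2]]))   # middle row removed: gap between 2 and 5
```

The second call shows why dropping individual rows from a state sequence breaks the invariant.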

filter_entities(criterion, *, inplace=False, verbose=True)[source]#

Not supported on state pools.

States are contiguous and non-overlapping by definition: removing individual rows would leave temporal gaps and break the invariant T_END[i] == T_START[i+1].

class tanat.sequence.StateSequenceSettings(*, id_column: str, entity_features: list[str], static_features: list[str] = <factory>, start_column: str, end_column: str)[source]#

Bases: SequenceSettings

Settings for state sequences (start + end timestamp columns).

States are contiguous and non-overlapping: the end of one state is always the start of the next, with no gaps in between.

__init__(*args: Any, **kwargs: Any) → None[source]#

end_column: str[source]#

get_time_columns() → list[str][source]#: Returns time index columns for State sequences [start, end].

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

start_column: str[source]#

Build an EventSequencePool from a single DataFrame.

All columns in temporal_data except id_column and time_column are treated as entity features. All columns in static_data except id_column are treated as static features.

Parameters:

temporal_data – DataFrame or LazyFrame with at least id, time, and one feature column.
id_column – Name of the sequence identifier column (present in both temporal_data and static_data if provided).
time_column – Name of the timestamp column.
static_data – Optional DataFrame or LazyFrame with per-id static features. Must contain a column named id_column for joining.
store_name – Name for the on-disk store. When None a unique name is generated automatically (_quick_event_<hex8>).

Returns:

A ready-to-use EventSequencePool.

Raises:

ValueError – If id_column or time_column are missing, if no feature columns remain, or if id_column is absent from static_data.

Examples:

pool = build_events(df, id_column="patient", time_column="date")
pool.temporal_data(fmt="polars").head()

Build an IntervalSequencePool from a single DataFrame.

All columns in temporal_data except id_column, start_column, and end_column are treated as entity features.

Parameters:

temporal_data – DataFrame or LazyFrame with at least id, start, end, and one feature column.
id_column – Name of the sequence identifier column.
start_column – Name of the interval start column.
end_column – Name of the interval end column.
static_data – Optional DataFrame or LazyFrame with per-id static features.
store_name – Name for the on-disk store. When None a unique name is generated automatically (_quick_interval_<hex8>).

Returns:

A ready-to-use IntervalSequencePool.

Raises:

ValueError – If required columns are missing, if no feature columns remain, or if id_column is absent from static_data.

Examples:

pool = build_intervals(
    df, id_column="id", start_column="start", end_column="end",
)
pool.temporal_data(fmt="polars").head()

Build a StateSequencePool from a single DataFrame.

When end_column is None the end of each state is derived from the start of the next state (last state stays open-ended with null).

All columns in temporal_data except the structural columns (id, start, and optionally end) are treated as entity features.

Parameters:

temporal_data – DataFrame or LazyFrame with at least id, start, and one feature column.
id_column – Name of the sequence identifier column.
start_column – Name of the state start column.
end_column – Name of the state end column. When None the builder derives end values automatically.
static_data – Optional DataFrame or LazyFrame with per-id static features.
store_name – Name for the on-disk store. When None a unique name is generated automatically (_quick_state_<hex8>).

Returns:

A ready-to-use StateSequencePool.

Raises:

ValueError – If required columns are missing, if no feature columns remain, or if id_column is absent from static_data.

Examples:

pool = build_states(df, id_column="id", start_column="start")
pool.temporal_data(fmt="polars").head()
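Deriving end values from the next start can be sketched as follows. derive_ends is a hypothetical helper operating on the sorted start times of one sequence. Its end_value argument mirrors the sentinel described for the state builder and is an assumption here, since build_states is documented as leaving the last end null.

```python
def derive_ends(starts, end_value=None):
    """Pair each start with the next start as its end (hypothetical helper)."""
    starts = list(starts)
    # The last state has no successor, so it gets end_value (null by default)
    ends = starts[1:] + [end_value]
    return list(zip(starts, ends))


print(derive_ends([0, 3, 7]))
print(derive_ends([0, 3], end_value=10))
```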