tanat.sequence.base package#

Submodules#

tanat.sequence.base.entity module#

Entity: Flyweight object representing a single row in a Sequence (Event, State, etc.).

Bases: Registrable, ABC

Abstract Flyweight object acting as a proxy to a specific row in a Sequence.

Create an entity proxy for a single row in a sequence.

Parameters:

id_value – Sequence identifier this entity belongs to.
store – Store path, name, or SequenceStore instance.
features – Visible feature names propagated from the parent Sequence. None → all store features.
rank – 0-based position within the sequence as seen by the user (accounts for filtering/masking).
store_index – Absolute physical row index in the store (SCH.STORE_INDEX).
cast_recipe – Cast recipe propagated from the parent Sequence. Normalised via SequenceCastRecipe.coerce().
virtual_id – Virtual context UUID from the parent Sequence.
parent_metadata – Pre-computed SequenceMetadata from the parent pool. When provided, metadata returns this directly (no extra I/O).
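The flyweight idea behind these parameters can be illustrated in plain Python. This is a minimal sketch, not tanat's implementation: `ToyStore` and `RowProxy` are hypothetical names, and the store is assumed to be a simple list of dicts. It shows how a proxy can hold only a reference to shared data plus two indices (`rank` as seen by the user, `store_index` as the physical row).

```python
from typing import Any


class ToyStore:
    """Hypothetical stand-in for a store: every row is held exactly once."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self.rows = rows


class RowProxy:
    """Lightweight proxy: keeps a store reference and indices, never a row copy."""

    __slots__ = ("_store", "rank", "store_index", "_features")

    def __init__(self, store, *, rank, store_index, features=None):
        self._store = store
        self.rank = rank                # position as seen by the user
        self.store_index = store_index  # absolute row index in the store
        self._features = features       # None means "all features"

    def data(self) -> dict[str, Any]:
        # The row is looked up on demand from the shared store.
        row = self._store.rows[self.store_index]
        names = self._features if self._features is not None else list(row)
        return {name: row[name] for name in names}


store = ToyStore([
    {"hr": 60, "label": "a"},
    {"hr": 72, "label": "b"},
    {"hr": 65, "label": "c"},
])
# A filtered view that hides physical row 0: user rank 0 maps to store row 1.
proxy = RowProxy(store, rank=0, store_index=1, features=["hr"])
print(proxy.data())
```

Keeping `rank` and `store_index` separate is what lets a filtered or masked view renumber rows for the user without touching the underlying data.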

data(features: list[str] | None = None) → dict[str, Any][source]#

Access the feature values for this entity as a dictionary.

Only feature columns are returned; the sequence identifier and time columns are excluded (use id_value and temporal_extent instead).

Parameters: features – Feature name(s) to include (None → all visible entity features).
Returns: A dict mapping feature names to their scalar values.

property feature_names: list[str][source]#: Visible feature names.

classmethod from_parent(parent: Sequence, *, rank: int, store_index: int, prefetched: PrefetchedEntityData | None = None) → Entity[source]#

Build a sequence-managed entity. Not part of the public API.

Snapshots the shared context from parent (store, features, casts, virtual_id, metadata), bypassing store and feature resolution. The per-row prefetched is optional: present during iteration, absent for point access.

property id_value[source]#: The sequence identifier this entity belongs to.

property metadata: dict[str, FeatureInfo][source]#

Feature metadata for this entity.

Returns a dictionary mapping each visible feature name to its FeatureInfo descriptor (type, stats, …). When created from a Sequence (which itself comes from a Pool), the pool-level metadata is reused directly. Stats are consistent across all entities in the pool, no extra I/O.

property rank: int[source]#

0-based position of this entity within its sequence.

Always matches the index used to retrieve it:

entity = seq[3]
entity.rank  # 3

property temporal_extent: Any[source]#

The temporal extent of this entity.

Returns: A list of two values [start, end] for interval-based sequences, or a single scalar for event sequences.

class tanat.sequence.base.entity.PrefetchedEntityData(_row: dict[str, Any], _time_columns: tuple[str, ...], _feature_columns: tuple[str, ...])[source]#

Bases: object

One entity row pre-collected by the parent Sequence during iteration.

Owns the view-schema row and knows how to expose it the way Entity does, so Entity never slices the raw row itself.

__init__(_row: dict[str, Any], _time_columns: tuple[str, ...], _feature_columns: tuple[str, ...]) → None[source]#

features() → dict[str, Any][source]#: Return the feature values from the prefetched row.

temporal_extent() → Any[source]#: Return the temporal extent from the prefetched row.

tanat.sequence.base.pool module#

Base class for sequence pool objects.

class tanat.sequence.base.pool.SequencePool(store: SequenceStore, settings: SequenceSettings | dict, cast_recipe: SequenceCastRecipe | dict | None = None)[source]#

Bases: ABC, SequenceViewMixin, CachableSettings, Registrable

Base class for sequence pool objects.

MAX_BINS_LIMIT: int = 2000[source]#

__init__(store: SequenceStore, settings: SequenceSettings | dict, cast_recipe: SequenceCastRecipe | dict | None = None) → None[source]#

Base initialiser. Delegated to by concrete subclasses after store and feature resolution have been performed.

Parameters:

store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings (or equivalent dict). entity_features and static_features never None.
cast_recipe – Optional cast recipe (or dict) applied at read time. Normalised via SequenceCastRecipe.coerce() and probed eagerly.

Raises:

TypeError – If cast_recipe is not a SequenceCastRecipe, dict, or None.

add_entity_features(df: DataFrame | DataFrame | LazyFrame, *, overwrite: bool = False) → None[source]#

Add new entity features to the virtual store.

The input DataFrame must be positionally aligned with the full entity row set of the store (i.e. it must have exactly as many rows as there are entity rows in the physical store, not just the current view). Use save() first to materialise a filtered view before calling this method.

Parameters:

df – Feature-only DataFrame (no ID column) positionally aligned with the entity rows in the store. Can be pandas, Polars eager, or Polars lazy.
overwrite – If True, replace existing features with the same name in the virtual context.

Raises:

RuntimeError – If the pool has an active _id_mask or entity filter expression. Call pool.save() first and then add features to the resulting unfiltered pool.
ValueError – If the number of rows in df does not match the number of entity rows in the store.

add_static_features(df: DataFrame | DataFrame | LazyFrame, *, id_column: str | None = None, overwrite: bool = False) → None[source]#

Add static features to the virtual store via an ID-keyed join.

The input DataFrame must include the ID column (either under settings.id_column or under the name given by id_column). A LEFT JOIN against the full sequence index is performed internally, so partial DataFrames (covering only a subset of IDs) are accepted: IDs absent from df receive null in the virtual context.

Parameters:

df – DataFrame containing the ID column plus one or more feature columns. Can be pandas, Polars eager, or Polars lazy.
id_column – Name of the ID column in df. Defaults to settings.id_column when None. Pass an explicit name when the join key in df differs from the pool’s public ID name (e.g. id_column="patient_id").
overwrite – If True, replace existing features with the same name in the virtual context.

Raises:

KeyError – If the resolved ID column is not found in df.
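The LEFT JOIN behaviour described above (partial inputs accepted, missing IDs filled with null) can be sketched without any DataFrame library. This is an illustration of the semantics only: `left_join_static` is a hypothetical helper, rows are plain dicts, and the output key `"id"` stands in for the pool's public ID name.

```python
def left_join_static(all_ids, partial_rows, id_column, feature_names):
    """Attach features to every ID; IDs absent from partial_rows get None."""
    if any(id_column not in row for row in partial_rows):
        # Mirrors the documented KeyError when the ID column is missing.
        raise KeyError(id_column)
    by_id = {row[id_column]: row for row in partial_rows}
    joined = []
    for seq_id in all_ids:
        source = by_id.get(seq_id, {})
        features = {name: source.get(name) for name in feature_names}
        joined.append({"id": seq_id, **features})
    return joined


all_ids = [1, 2, 3]
partial = [{"patient_id": 1, "age": 54}, {"patient_id": 3, "age": 61}]
joined = left_join_static(all_ids, partial, "patient_id", ["age"])
print(joined)
```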

apply(exprs: Expr | list[Expr], is_static: bool = False, *, by_id: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

Evaluate Polars expressions against the current features.

This is a read-only computation: the result is returned, not stored. Use add_entity_features() or add_static_features() to persist the result.

Each expression must produce a named column (.alias()).

When by_id=True, the result always includes the ID column (settings.id_column).

Parameters:

exprs – One or more Polars expressions producing new columns.
is_static – Whether to read static or entity features.
by_id – If True, expressions are evaluated per sequence (group_by on the sequence ID). Only valid for entity features (is_static=False). The ID column is included in the result.
fmt –
Format of the returned object. One of:
- "pandas" (default): returns a pandas.DataFrame.
- "polars": returns a polars.DataFrame.
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

The computed columns as a DataFrame. When by_id=True, the first column is the sequence ID.

Raises:

ValueError – If by_id=True and is_static=True.

Examples

Compute and inspect:

result = pool.apply(
    (pl.col("age") * pl.col("score")).alias("age_score"),
    is_static=True,
)
print(result)

Persist entity features:

result = pool.apply(
    (pl.col("value") - pl.col("value").mean()).alias("centered"),
)
pool.add_entity_features(result)

Per-sequence aggregation (result includes ID column):

summary = pool.apply(
    pl.col("value").mean().alias("value_mean"),
    by_id=True,
)
pool.add_static_features(summary)

Per-sequence normalization (result includes ID column):

normed = pool.apply(
    (pl.col("value") - pl.col("value").mean()).alias("v_normed"),
    by_id=True,
)
pool.add_entity_features(normed)
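The difference between the default evaluation and by_id=True can be illustrated with a plain-Python sketch. It does not use tanat or Polars; the `rows` list of `(sequence_id, value)` pairs is made-up data, and the two computations stand in for evaluating the same centering expression over the whole table versus per sequence.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical entity rows: (sequence ID, value).
rows = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 14.0)]

# Default behaviour: the mean is taken over every row of the view.
global_mean = mean(v for _, v in rows)
centered_global = [v - global_mean for _, v in rows]

# by_id=True analogue: the mean is taken within each sequence.
by_seq = defaultdict(list)
for seq_id, v in rows:
    by_seq[seq_id].append(v)
centered_by_id = [(seq_id, v - mean(by_seq[seq_id])) for seq_id, v in rows]

print(centered_global)
print(centered_by_id)
```

The per-sequence variant carries the ID alongside each value, which matches the documented behaviour that the ID column is included in the result when by_id=True.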

binned_data(features: list[str] | str, bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True, bin_col: str = '__bin__') → DataFrame | DataFrame[source]#

Project sequences onto a binned temporal table (long-format dataframe).

Each sequence is aligned to a shared time axis divided into fixed-size bins. When multiple values compete for the same bin, overlap_rule resolves the ambiguity. Empty bins are filled with fill_value.

For an ML-ready 3-D tensor with feature labels and ID order, see to_tensor().

Parameters:

features – Feature(s) to project onto the grid.
bin_size – Width of each bin (duration string for datetime, numeric otherwise).
max_bins – Maximum number of bins. None infers from the data span, capped by MAX_BINS_LIMIT. An explicit value bypasses the safety cap — the caller opts in knowingly.
fill_value – Value used to fill empty bins. None keeps nulls.
overlap_rule – Polars aggregation name for in-bin conflict resolution ("first", "last", "mean", "max", "sum", "median", …).
ohe – One-hot encode features before binning. Requires Categorical or Enum dtypes.
fmt – "pandas" (default) or "polars".
use_arrow – Pandas conversion uses Arrow when True.
bin_col – Output column name for the bin index.

Returns:

DataFrame with columns [id_col, bin_col, *feature_cols] and len(unique_ids) * n_bins rows.
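The binning logic can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's pipeline: `bin_long` is a hypothetical helper, the grid origin and the number of bins are passed explicitly instead of being inferred from the data span, input rows are assumed sorted by time, and conflicts are resolved with a "first" rule only.

```python
def bin_long(rows, ids, bin_size, n_bins, origin=0, fill_value=None):
    """Return one (id, bin, value) row per sequence and per bin.

    rows: iterable of (seq_id, time, value), assumed sorted by time.
    """
    grid = {(seq_id, b): fill_value for seq_id in ids for b in range(n_bins)}
    seen = set()
    for seq_id, t, value in rows:
        b = int((t - origin) // bin_size)
        if not 0 <= b < n_bins or (seq_id, b) in seen:
            continue  # outside the grid, or the bin already holds its first value
        grid[(seq_id, b)] = value
        seen.add((seq_id, b))
    return [(seq_id, b, grid[(seq_id, b)]) for seq_id in ids for b in range(n_bins)]


rows = [("a", 0, 1.0), ("a", 1, 2.0), ("a", 5, 3.0), ("b", 3, 9.0)]
table = bin_long(rows, ids=["a", "b"], bin_size=2, n_bins=3, fill_value=0.0)
for line in table:
    print(line)
```

The output always has `len(ids) * n_bins` rows because empty bins are kept and filled, which is what makes the table easy to reshape into a dense grid later.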

cast_features(schema: dict[str, DataType | type], is_static: bool = False, strict: bool = True) → None[source]#

Casts feature columns to new types, scoped to this Pool only.

To make a cast permanent on disk, save the Pool first (pool.save()) and reload. Persisting changes might affect other views sharing the same store, so use with caution.

Parameters:

schema – Dictionary mapping feature names to target Polars DataTypes.
is_static – Whether these are static features (True) or entity features (False).
strict – When True (default), non-convertible values raise a TypeError during probing. When False, non-convertible values silently become null.

Raises:

TypeError – If schema is not a dict.
KeyError – If a feature name does not exist.

cast_id(dtype: DataType) → None[source]#

Casts the ID column to a new type.

Parameters: dtype – The target Polars DataType.

cast_to_datetime(unit: str = 'us', time_zone: str | None = None)[source]#

Cast time columns to Datetime.

Parameters:

unit – The datetime resolution ("ms", "us", "ns"). Default is "us" (microsecond), the Python standard.
time_zone – Optional timezone string (e.g. "UTC", "Europe/Paris").

cast_to_timestep(dtype: DataType = Int64)[source]#

Cast time columns to numeric-based timesteps.

Parameters:

dtype – The target numeric type (e.g., pl.UInt32, pl.Int64, pl.Float64). Default is pl.Int64 for safety.

Raises:

TypeError – If dtype is not a numeric type.
TypeError – If the underlying data is already in Datetime format. (Conversion from Datetime to Timestep is not allowed).

copy() → SequencePool[source]#

Returns a shallow copy of this Pool. The copy shares the same store and preserves all view state (masks, casts, virtual features).

The virtual context is forked into a new UUID so that the copy owns its own independent context. Garbage-collecting either instance will not destroy the other’s virtual features.

The T0 strategy (_t0_setter) is propagated to the copy. The T0 result cache is not copied; it is recomputed on the first call to t0_data() on the copy.

describe(by_id: bool = True, add_to_static: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

Compute summary statistics for every sequence in the pool.

Parameters:

by_id – If True (default), return one row per sequence ID with columns [id, length, n_unique_entities, …]. If False, return the cross-sequence pandas .describe() (count, mean, std, min, 25%, …).
add_to_static – If True, write the per-ID result to the static-feature store via add_static_features(). Ignored (with a warning) when by_id=False.
fmt – "pandas" (default) or "polars". Ignored when by_id=False (always pandas).
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

DataFrame with one row per sequence ID when by_id=True. Aggregated statistics (pandas describe() output) when by_id=False.

Examples:

pool.describe()                          # one row per ID, pandas
pool.describe(fmt="polars")    # same, polars
pool.describe(by_id=False)               # cross-ID stats
pool.describe(add_to_static=True)        # persist as static cols

drop_features(features: list[str], is_static: bool = False, *, permanently: bool = False) → None[source]#

Removes features from the current view.

By default, this is a soft drop: features are removed from the Pool settings so they no longer appear in temporal_data(), static_data() or metadata, but the underlying files are left untouched.

With permanently=True the columns are also physically deleted from disk (physical store and/or virtual store).

Parameters:

features – Feature names to drop.
is_static – True for static features, False for entity features.
permanently – If True, also delete the columns from disk. This is irreversible for physical features.

extend(other: SequencePool | Sequence, destination: str | Path | None = None, *, on_duplicate: Literal['raise', 'skip'] = 'raise', overwrite: bool = False) → SequencePool[source]#

Merge other into this pool and write the result to disk.

Mirrors the semantics of save().

Same-store fast path: if self and other point to the same physical store and neither carries virtual content (_virtual_id is None on both sides), no read I/O is performed. A new pool backed by the same store with the union of both ID masks is built immediately. If destination is provided the merged pool is then materialised to disk via save(); otherwise it is returned as an in-memory view with zero I/O.

Different stores (or virtual content present): a full merge-and-write is performed and destination is required. A named destination writes the merged data to a new store. To rewrite in-place, pass destination=self._store.root_path explicitly together with overwrite=True.

Parameters:

other –
Data to merge. Accepted types:
- SequencePool: must be the same concrete subclass with an identical entity feature schema.
- Sequence: single sequence object; schema is checked against this pool.
destination – None → in-memory view (same-store fast path only; no I/O); str / Path → materialise the merged data to disk. destination is required when merging from different stores.
on_duplicate –
Behaviour when other contains an ID already present in this pool:
- "raise" (default): raise ValueError listing the conflicting IDs.
- "skip": silently ignore duplicates.
overwrite – Allows overwriting an existing destination when it already exists on disk.

Returns:

A new SequencePool instance.

Raises:

TypeError – If other is not a SequencePool or Sequence.
ValueError – If other is a SequencePool of a different concrete type.
ValueError – If other is missing entity features declared in this pool’s settings.
ValueError – If on_duplicate="raise" and duplicate IDs are found.
ValueError – If destination=None and stores differ (same-store fast path only supports an in-memory view without I/O).
FileExistsError – If destination exists and overwrite=False.

See also

save()
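The on_duplicate policy can be sketched on bare ID lists. This is an illustration of the documented "raise" / "skip" semantics only: `merge_ids` is a hypothetical helper and ignores stores, schemas, and I/O entirely.

```python
def merge_ids(own, incoming, on_duplicate="raise"):
    """Merge two ID lists, applying the duplicate policy to overlapping IDs."""
    own_set = set(own)
    duplicates = sorted(own_set & set(incoming))
    if duplicates and on_duplicate == "raise":
        raise ValueError(f"duplicate IDs: {duplicates}")
    # "skip": keep this pool's version and ignore the incoming duplicates.
    return list(own) + [i for i in incoming if i not in own_set]


merged = merge_ids([1, 2], [2, 3], on_duplicate="skip")
print(merged)
```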

filter_entities(criterion: Criterion, *, inplace: bool = False, verbose: bool = True) → SequencePool[source]#

Return a view with entities pruned by criterion.

Parameters:

criterion – A Criterion instance supporting ENTITY.
inplace – If True, modify this pool in place.
verbose – If True, print a one-line report.

Returns:

Filtered pool (or self when inplace=True).

Raises:

TypeError – If criterion is not a Criterion object.
CriterionLevelError – If the criterion does not support entity filtering.

classmethod from_parent(store_path: Path, *, parent_pool: TrajectoryPool) → SequencePool[source]#

Create a managed sub-pool owned by parent_pool.

Builds the pool from store_path, aligns settings and casts with the parent TrajectoryPool, and marks it locked. T0 is resolved lazily via the _t0_setter property which delegates to parent_pool.

Parameters:

store_path – Root path of the sequence store to load.
parent_pool – The owning TrajectoryPool.

Returns:

A locked SequencePool whose T0 delegates to parent_pool.

get_sequences(entity_features: list[str] | None = None, static_features: list[str] | None = None) → dict[str, Sequence][source]#

Return a mapping of sequence IDs to Sequence objects.

Parameters:

entity_features – Entity feature subset to expose in each sequence. None -> use pool-level settings.
static_features – Static feature subset to expose in each sequence. None -> use pool-level settings.

Returns:

Dict mapping each visible sequence ID to its Sequence instance.

Examples:

seqs = pool.get_sequences()
print(seqs[42].temporal_data())

property is_dirty: bool[source]#

True if the pool has state not yet written to disk.

Covers virtual features, view scopes, type casts, and soft feature drops. A dirty pool needs save() to materialise its current view.

save(destination: str | Path | None = None, *, overwrite: bool = False) → Path[source]#

Persist the current pool state (virtual features + view masks).

Without destination the store is rewritten in-place. With destination the pool is rebuilt into that path and then redirects to it; the original files are left untouched.

In both cases the pool is left in a clean state after a successful save: virtual context, masks, soft-drops and cast recipe are all reset, and is_dirty becomes False.

When a mask is active and destination is None, the store is overwritten with a subset of the data. overwrite=True is required to confirm.

Parameters:

destination – None → in-place; path → rebuild into new path and redirect the pool there.
overwrite – Required when saving a filtered view in-place. Also allows overwriting an existing destination.

Returns:

The Path of the written store.

Raises:

FileExistsError – If destination exists and overwrite is False.
RuntimeError – If a mask is active in-place without overwrite.

set_t0(*, position: int | None = None, direct=None, feature: str | None = None, query: Expr | None = None, anchor: Literal['start', 'end', 'middle'] | None = None, use_first: bool = True) → SequencePool[source]#

Configure the T0 strategy for this pool.

Exactly one strategy keyword must be provided. All others must remain None.

Parameters:

position – Row index (0-based; negative indexing supported).
direct – Scalar value or {seq_id: value} dict.
feature – Static feature column name.
query – Polars boolean expression on any sequence column (time columns or entity features).
anchor –
Which end of each interval/state row to use as the reference timestamp for the floor lookup:
- "start" (default): use the start timestamp.
- "end": use the end timestamp.
- "middle": use the midpoint (start + end) / 2.
Omitting anchor= on an interval/state pool emits a UserWarning and defaults to "start". Passing anchor= on an event pool emits a UserWarning and the value is ignored (single time column, anchor is irrelevant).
use_first – For the query strategy, whether to take the first (True) or last (False) matching row.

Returns:

self for chaining.
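Two rules stated above lend themselves to a short sketch: exactly one strategy keyword may be set, and the anchor selects a reference timestamp within an interval. The helpers `pick_strategy` and `anchor_time` are hypothetical and only mirror the documented contract; they say nothing about how tanat resolves T0 internally.

```python
def pick_strategy(**strategies):
    """Return the (name, value) pair of the single strategy that is not None."""
    provided = {name: value for name, value in strategies.items() if value is not None}
    if len(provided) != 1:
        raise ValueError(f"expected exactly one strategy, got {sorted(provided)}")
    return next(iter(provided.items()))


def anchor_time(start, end, anchor="start"):
    """Pick the reference timestamp of an interval row."""
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "middle":
        return (start + end) / 2
    raise ValueError(f"unknown anchor: {anchor!r}")


# position=0 is a valid choice: the check tests for None, not for truthiness.
choice = pick_strategy(position=0, direct=None, feature=None, query=None)
print(choice, anchor_time(10, 20, "middle"))
```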

static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame | None[source]#

Return static (non-temporal) data for all sequences in this pool.

Parameters:

features – Feature name(s) to include (None -> all static features).
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

One-row-per-sequence DataFrame with columns [id, feature...]. None when no static features are exposed by this pool.

Examples:

df = pool.static_data()               # pandas, all static features
df = pool.static_data(["age", "sex"]) # subset
df = pool.subset([1, 2, 3]).static_data()

subset(ids, *, inplace=False) → SequencePool[source]#

Returns a new Pool containing only the specified sequence IDs.

Parameters:

ids – A list of sequence IDs to include in the subset.
inplace – If True, modify this Pool’s view instead of returning a new one.

Returns:

A new SequencePool instance with the subset of IDs, or self if inplace=True.

Raises:

ValueError – If any ID in ids is not present in unique_ids.

survival_target(endpoint_time: str, occurred: str | None = None, censure_time: str | None = None, fmt: Literal['sksurv', 'polars', 'pandas'] = 'sksurv') → tuple[ndarray | DataFrame | DataFrame, list][source]#

Build a survival target (occurred, time) from static features stored in the pool.

For each patient, assembles a binary indicator (did the endpoint occur?) and the corresponding duration (time from T0 to the endpoint, or to the last observation for censored patients). Patients with unresolvable or non-positive durations are excluded and reported via a warning.

Durations are computed as the difference between the absolute value stored in the static column and the per-patient T0.

If censure_time is None, the last recorded time in the sequence is used as the censoring reference.

Parameters:

endpoint_time – Name of the static column containing the absolute time at which the endpoint occurred (e.g. age at death, event datetime). T0 is subtracted internally to produce the duration. Null = endpoint not observed (censored), when occurred is None. Expected dtype: same as the pool time axis (numeric for timestep pools, datetime for datetime pools).
occurred – Name of the static column with a binary endpoint indicator. True (or 1) = endpoint observed, False (or 0) or null = censored. Expected dtype: bool or numeric (int or float); cast to bool internally. If None, inferred as endpoint_time.is_not_null().
censure_time – Name of the static column containing the absolute time of the last observation for censored patients (e.g. age at last visit, last visit datetime). T0 is subtracted internally to produce the duration. Expected dtype: same as the pool time axis. If None, derived automatically as max(get_temporal_columns()[-1]) per patient (i.e. max of time_column for events, max of end_column for state/interval pools).
fmt – Format of the returned target y. "sksurv" returns a structured np.ndarray with fields (occurred: bool, time: float) compatible with scikit-survival. "polars" and "pandas" return a DataFrame with columns ["id", "occurred", "time"].

Returns:

A tuple (y, valid_ids). y is the survival target in the requested format. valid_ids is the list of patient identifiers retained after filtering invalid rows.

Raises:

KeyError – If a column name is not found in the pool’s static features.
RuntimeError – If no static features are available in this pool.

Examples

>>> y, valid_ids = pool.survival_target(
...     endpoint_time="death_age_occur",
... )
>>> y, valid_ids = pool.survival_target(
...     endpoint_time="death_age_occur",
...     occurred="death",
... )
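The duration arithmetic described for this method can be sketched on plain dicts. This is a simplified illustration under assumptions: `survival_rows` is a hypothetical helper, the keys `t0`, `endpoint_time`, and `last_time` are made up for the example, times are numeric, and `occurred` is inferred from whether the endpoint time is present (the behaviour documented for occurred=None).

```python
def survival_rows(records):
    """Build (id, occurred, duration) rows; drop unresolvable or non-positive durations."""
    kept, dropped = [], []
    for r in records:
        occurred = r["endpoint_time"] is not None
        # Observed endpoint: use its time. Censored: use the last observation.
        reference = r["endpoint_time"] if occurred else r["last_time"]
        if reference is None or r["t0"] is None or reference - r["t0"] <= 0:
            dropped.append(r["id"])
            continue
        kept.append((r["id"], occurred, float(reference - r["t0"])))
    return kept, dropped


records = [
    {"id": 1, "t0": 50, "endpoint_time": 62, "last_time": 62},      # endpoint observed
    {"id": 2, "t0": 40, "endpoint_time": None, "last_time": 47},    # censored
    {"id": 3, "t0": 70, "endpoint_time": None, "last_time": 70},    # zero duration
]
kept, dropped = survival_rows(records)
print(kept, dropped)
```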

t0_data(fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

Return the T0 table for the sequences visible in this pool view.

Thin public wrapper around _get_t0_df() that handles format conversion.

Parameters:

fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

DataFrame with columns [id_col, _T0_, _T0_NEAREST_RANK_], one row per visible sequence.

Examples:

pool.set_t0(position=0, anchor="start")
df = pool.t0_data()
df_pl = pool.t0_data(fmt="polars")

temporal_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

Return temporal data for all sequences visible in this pool.

Each row is one entity: the atomic observation of a sequence (an event, a state, or a time-step). Each entity carries the sequence ID, its temporal position (one column for events, two for intervals), and entity features: the per-row measurements that vary along the sequence (e.g. heart rate, label, sensor value).

Parameters:

features – Entity feature name(s) to include. None → all entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

Long-format DataFrame with columns [id, temporal…, feature…] covering every visible sequence.

Examples:

df = pool.temporal_data()                    # pandas, all features
df = pool.temporal_data("heart_rate")        # single feature
df = pool.temporal_data(["a", "b"], fmt="polars")
# Restrict to a subset of IDs:
df = pool.subset([1, 2, 3]).temporal_data()

to_dummies(features: list[str] | str, is_static: bool = False, *, drop_first: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

One-hot encode categorical features.

Returns a DataFrame with binary columns for each category. Only features typed as Categorical or Enum are accepted. Cast first with cast_features if needed.

This is a consumption method: the result is returned, not stored in the pool. Use it when preparing data for training.

Parameters:

features – Feature name(s) to encode.
is_static – Whether these are static or entity features.
drop_first – Drop the first category column to avoid multicollinearity (useful for linear models).
fmt –
Format of the returned object. One of:
- "pandas" (default): returns a pandas.DataFrame.
- "polars": returns a polars.DataFrame.

Returns:

DataFrame with binary columns (one per category per feature).

Raises:

TypeError – If any feature is not Categorical or Enum.

Examples

Basic usage:

pool.cast_features(schema={"status": pl.Categorical})
dummies = pool.to_dummies("status")
# → status_OK, status_ERROR, status_WARNING columns

With drop_first for linear models:

X = pool.to_dummies("group", is_static=True, drop_first=True)
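The effect of drop_first can be shown with a small plain-Python sketch. `one_hot` is a hypothetical helper that takes the category list explicitly; the `prefix_category` column naming follows the `status_OK` style shown in the example above.

```python
def one_hot(values, categories, prefix, drop_first=False):
    """Encode each value as a dict of 0/1 indicator columns."""
    columns = categories[1:] if drop_first else categories
    return [{f"{prefix}_{c}": int(v == c) for c in columns} for v in values]


values = ["OK", "ERROR", "OK"]
categories = ["OK", "ERROR", "WARNING"]

full = one_hot(values, categories, "status")
reduced = one_hot(values, categories, "status", drop_first=True)
print(full[0])
print(reduced[0])
```

With drop_first=True the first category is represented implicitly by all remaining indicators being zero, which removes the exact linear dependency between the columns.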

to_tensor(features: list[str] | str, bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, bin_col: str = '__bin__') → tuple[ndarray, list, list[str]][source]#

Project sequences onto a 3-D temporal grid (ML-ready tensor).

Returns a dense (N, M, K) ndarray together with the sequence IDs and the list of K-axis feature labels. Sequence order on the N axis follows unique_ids.

For a long-format dataframe variant (joins, plotting, exploration), see binned_data().

Parameters:

features – Feature(s) to project.
bin_size – Bin width (duration string for datetime, numeric otherwise).
max_bins – Maximum bins. None infers from the data span, capped by MAX_BINS_LIMIT. An explicit value bypasses the safety cap — the caller opts in knowingly.
fill_value – Value for empty bins. None keeps NaN.
overlap_rule – In-bin aggregation. See binned_data().
ohe – One-hot encode before binning. Post-OHE column names are reflected in the returned feature_names.
bin_col – Internal bin column name (forwarded to the underlying pipeline for consistency; not present in the output).

Returns:

A 3-tuple (arr, ids, feature_names) where:
- arr has shape (N, M, K) = (len(unique_ids), n_bins, len(feature_names)).
- ids is the list of sequence IDs matching the N-axis order (identical to unique_ids).
- feature_names lists the K-axis labels in column order (post-OHE names when ohe=True).

Examples:

arr, ids, names = pool.to_tensor(["dose", "route"], "1d", ohe=True)
# arr.shape == (N, M, K)
# names == ["dose", "route_oral", "route_iv"]
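The relationship between the long-format table and the (N, M, K) layout can be sketched with nested lists. This is an illustration only: `to_grid` is a hypothetical helper, the row keys `"id"` and `"bin"` are assumptions for the example, and the real method returns a NumPy array.

```python
def to_grid(long_rows, ids, n_bins, feature_names, fill_value=None):
    """Pivot long rows into a nested list indexed as grid[n][m][k]."""
    index = {seq_id: n for n, seq_id in enumerate(ids)}
    grid = [
        [[fill_value] * len(feature_names) for _ in range(n_bins)]
        for _ in ids
    ]
    for row in long_rows:
        n, m = index[row["id"]], row["bin"]
        for k, name in enumerate(feature_names):
            grid[n][m][k] = row[name]
    return grid


long_rows = [
    {"id": "a", "bin": 0, "dose": 1.0},
    {"id": "a", "bin": 1, "dose": 2.0},
    {"id": "b", "bin": 1, "dose": 5.0},
]
grid = to_grid(long_rows, ids=["a", "b"], n_bins=2, feature_names=["dose"], fill_value=0.0)
print(grid)
```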

Split the pool into train and test subsets.

Mirrors the interface of sklearn.model_selection.train_test_split().

Parameters:

test_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the test subset. Defaults to 0.25 when both test_size and train_size are None.
train_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the train subset. Defaults to the complement of test_size.
random_state – Seed for the random number generator. Pass an integer for reproducibility.
shuffle – Whether to shuffle IDs before splitting. When False, the first IDs go to train and the last to test.

Returns:

(train_pool, test_pool): two new non-overlapping pool views.

Raises:

ValueError – If the pool is empty, sizes are non-positive, or n_train + n_test exceeds the pool size.
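The size resolution and split order described above can be sketched on a bare list of IDs. This is a simplified illustration, not the library's code: `split_ids` is a hypothetical helper, and the rounding (ceiling for a fractional test size, floor for a fractional train size) is an assumption borrowed from scikit-learn's convention.

```python
import math
import random


def split_ids(ids, test_size=None, train_size=None, random_state=None, shuffle=True):
    """Return (train_ids, test_ids) as two non-overlapping lists."""
    ids = list(ids)
    n = len(ids)
    if n == 0:
        raise ValueError("cannot split an empty pool")
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default

    def resolve(size, rounder):
        # Floats are proportions of the pool; ints are absolute counts.
        return rounder(size * n) if isinstance(size, float) else size

    n_test = resolve(test_size, math.ceil) if test_size is not None else None
    n_train = resolve(train_size, math.floor) if train_size is not None else None
    if n_test is None:
        n_test = n - n_train
    if n_train is None:
        n_train = n - n_test
    if n_train <= 0 or n_test <= 0 or n_train + n_test > n:
        raise ValueError(f"invalid split sizes: train={n_train}, test={n_test}, pool={n}")
    if shuffle:
        random.Random(random_state).shuffle(ids)
    # First IDs go to train, last IDs go to test.
    return ids[:n_train], ids[n - n_test:]


train, test = split_ids(range(8), shuffle=False)
print(train, test)
```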

property unique_ids: list[source]#

Visible sequence IDs in store order as a plain Python list.

Respects _id_mask. Deterministic order (sorted at build time).

Warning

list erases rich Polars dtypes (Categorical → str). Prefer _id_lf when the result feeds a Polars join.

which(criterion: Criterion, *, verbose: bool = True) → set[source]#

Return the set of IDs in this pool satisfying criterion.

Parameters:

criterion – A Criterion instance.
verbose – If True, print a one-line report.

Returns:

Set of matching IDs.

Raises:

TypeError – If criterion is not a Criterion object.

tanat.sequence.base.sequence module#

Base class for sequence objects.

class tanat.sequence.base.sequence.Sequence(id_value, store: SequenceStore, settings)[source]#

Bases: ABC, SequenceViewMixin, CachableSettings, Registrable

Interface to a single sequence within a Store.

A Sequence is a scoped view on the data for one specific ID. It shares the same SequenceStore as its parent Pool (no copy).

Typical creation patterns:

# From a Pool (recommended)
seq = pool[42]

# Standalone
seq = StateSequence(id_value=42, store="my_store")

__init__(id_value, store: SequenceStore, settings) → None[source]#

Base initialiser. Delegated to by concrete subclasses and from_parent() after store and feature resolution have been performed.

Parameters:

id_value – Unique identifier for this sequence in the store.
store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings (entity_features and static_features never None).

apply(exprs: Expr | list[Expr], is_static: bool = False, *, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) → DataFrame | DataFrame[source]#

Evaluates Polars expressions against this sequence’s features.

This is a read-only computation scoped to this single sequence. The result is returned, not stored.

At the Pool level, use Pool.apply(by_id=True) for per-sequence computations across all sequences, then Pool.add_entity_features() or Pool.add_static_features() to persist.

Parameters:

exprs – One or more Polars expressions producing new columns. Each must use .alias() to name the output.
is_static – Whether to read static or entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

The computed columns for this sequence only.

Examples

Local normalization:

seq = pool[42]
result = seq.apply(
    (pl.col("value") - pl.col("value").mean()).alias("v_centered")
)

Multiple expressions:

result = seq.apply([
    (pl.col("value").diff()).alias("v_diff"),
    (pl.col("value").rolling_mean(3)).alias("v_rm3"),
])

tanat.sequence.base.settings module#

Abstract class for sequence settings.

class tanat.sequence.base.settings.SequenceSettings(*, id_column: str, entity_features: list[str], static_features: list[str] = <factory>)[source]#

Bases: ABC

Abstract class for sequence settings.

__init__(*args: Any, **kwargs: Any) → None[source]#

available_features(is_static: bool = False) → list[str][source]#

Returns all feature names for the given scope.

Parameters: is_static – If True, return static features; otherwise entity features.
Returns: List of feature names (may be empty).

entity_features: list[str][source]#

get_column_rename_map(is_static: bool = False) → dict[str, str][source]#

Returns a mapping from store internal column names (StoreSchema) to user-facing column names.

Inferred from get_time_columns():

- 1 column → T_EVENT
- 2 columns → T_START, T_END (in order)
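The inference rule can be sketched in a few lines. This is a minimal illustration, not the tanat implementation: the function name infer_time_rename_map is hypothetical, the strings "T_EVENT", "T_START" and "T_END" stand in for the real StoreSchema values, and only the time columns are covered.

```python
def infer_time_rename_map(time_columns: list[str]) -> dict[str, str]:
    """Map placeholder store-internal time names to user-facing names."""
    if len(time_columns) == 1:
        # A single time column is treated as the event timestamp.
        return {"T_EVENT": time_columns[0]}
    if len(time_columns) == 2:
        # Two time columns are read as (start, end), in that order.
        start, end = time_columns
        return {"T_START": start, "T_END": end}
    # Assumption: any other count is rejected; the real method may differ.
    raise ValueError(f"expected 1 or 2 time columns, got {len(time_columns)}")


print(infer_time_rename_map(["visit_date"]))
print(infer_time_rename_map(["admission", "discharge"]))
```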

abstractmethod get_time_columns() → list[str][source]#: Returns a list of time index columns configured for this sequence type.

id_column: str[source]#

is_compatible_with(other: SequenceSettings) → tuple[bool, list[str]][source]#

Check if these settings are compatible with another SequenceSettings instance.

Compatibility rules:

- id_column must be identical
- time index columns must be identical
- entity_features must be identical or a subset (no extra features)
- static_features must be identical or a subset (no extra features)

Parameters:: other – SequenceSettings instance to compare with.
Returns:: Tuple of (is_compatible, list_of_errors)
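As a rough illustration of these rules, the sketch below checks two toy settings objects and collects every violation. ToySettings and check_compatibility are hypothetical stand-ins, not tanat classes. The subset direction is an assumption: here other may carry fewer features than the base settings, but no features the base lacks.

```python
from dataclasses import dataclass, field


@dataclass
class ToySettings:
    """Hypothetical stand-in for SequenceSettings (not the tanat class)."""

    id_column: str
    time_columns: list[str]
    entity_features: list[str]
    static_features: list[str] = field(default_factory=list)


def check_compatibility(base: ToySettings, other: ToySettings) -> tuple[bool, list[str]]:
    """Collect every rule violation instead of stopping at the first one."""
    errors: list[str] = []
    if base.id_column != other.id_column:
        errors.append(f"id_column differs: {base.id_column!r} != {other.id_column!r}")
    if base.time_columns != other.time_columns:
        errors.append(f"time columns differ: {base.time_columns} != {other.time_columns}")
    for scope in ("entity_features", "static_features"):
        # Assumed direction: `other` may drop features but not add new ones.
        extra = sorted(set(getattr(other, scope)) - set(getattr(base, scope)))
        if extra:
            errors.append(f"{scope}: unexpected extra features {extra}")
    return (not errors, errors)


base = ToySettings("patient_id", ["date"], ["dose", "drug"], ["sex"])
subset = ToySettings("patient_id", ["date"], ["dose"])
extra = ToySettings("patient_id", ["date"], ["dose", "weight"])

print(check_compatibility(base, subset))
print(check_compatibility(base, extra))
```

Returning the full list of errors, rather than a bare boolean, lets a caller report all mismatches at once.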

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

classmethod normalize_entity_features(v)[source]#: Normalize to a sorted, deduplicated list; require at least one feature.

classmethod normalize_static_features(v)[source]#: Normalize to a sorted, deduplicated list.
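The normalization described by both validators amounts to sorting and deduplicating. A minimal sketch follows, assuming a hypothetical helper normalize_features and assuming that an empty entity feature list is rejected with a ValueError (the actual exception type is not stated here).

```python
from typing import Iterable


def normalize_features(names: Iterable[str], *, require_non_empty: bool = False) -> list[str]:
    """Return a sorted list of unique names; optionally reject an empty result."""
    normalized = sorted(set(names))
    if require_non_empty and not normalized:
        # Assumption: a ValueError is raised; the real validator may differ.
        raise ValueError("at least one feature is required")
    return normalized


# Entity features: duplicates collapse and order becomes deterministic.
print(normalize_features(["dose", "drug", "dose"], require_non_empty=True))
# Static features: an empty list is allowed.
print(normalize_features([]))
```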

static_features: list[str][source]#

validate_compatibility(other: SequenceSettings) → None[source]#

Validate compatibility with another SequenceSettings instance.

Parameters:: other – SequenceSettings instance to validate against.
Raises:: ValueError – If settings are incompatible.

validate_features(features: list[str] | str, is_static: bool = False, on_missing: Literal['raise', 'warn', 'ignore'] = 'raise') → list[str][source]#

Validates explicit feature names against the current settings.

Parameters:

features – Feature name(s) to validate.
is_static – If True, check against static features; otherwise against entity features.
on_missing – Strategy when a requested feature does not exist:
- "raise" – raise a KeyError immediately (default).
- "warn" – log a warning and skip the feature.
- "ignore" – silently skip the feature.

Returns:

List of validated feature names that exist in the configuration.

Raises:

KeyError – If on_missing="raise" and a feature is not found.
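The three on_missing strategies can be sketched with the standard library alone. The function validate_feature_names and its available argument are hypothetical; the real method reads the available features from the settings object.

```python
import logging

logger = logging.getLogger(__name__)


def validate_feature_names(requested, available, on_missing="raise"):
    """Keep the requested names that exist; handle the rest per on_missing."""
    if on_missing not in ("raise", "warn", "ignore"):
        raise ValueError(f"unknown on_missing strategy: {on_missing!r}")
    # Accept a single name as well as a list of names.
    names = [requested] if isinstance(requested, str) else list(requested)
    valid = []
    for name in names:
        if name in available:
            valid.append(name)
        elif on_missing == "raise":
            raise KeyError(f"unknown feature: {name!r}")
        elif on_missing == "warn":
            logger.warning("skipping unknown feature %r", name)
        # "ignore": drop the name without any message.
    return valid


available = ["dose", "drug"]
print(validate_feature_names("dose", available))
print(validate_feature_names(["drug", "weight"], available, on_missing="ignore"))
```

With the default "raise", the second call would fail on "weight" instead of returning a shortened list.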

tanat.sequence.base.view_mixin module#

SequenceViewMixin: shared view-layer logic for Pool and Sequence.

Both SequencePool and Sequence are scoped views on a SequenceStore. This mixin factors out the logic they share.

class tanat.sequence.base.view_mixin.SequenceFrameAssembler(view: Sequence | SequencePool)[source]#

Bases: object

Assembles view-schema LazyFrames from the store for Sequence and SequencePool views.

__init__(view: Sequence | SequencePool) → None[source]#

id_time_index(*, with_store_index: bool = False) → LazyFrame[source]#: Return id + time-index columns with view scopes applied.

ids() → LazyFrame[source]#: Return visible IDs with the view ID dtype and schema.

merged_for_extend(other: SequencePool, other_ids_to_add: list) → tuple[pl.LazyFrame, pl.LazyFrame | None][source]#

Build merged entity and static LazyFrames for a cross-store extend.

Reads both sides (self._view and other), applies view scopes, projects to self._view’s entity feature set, and concatenates. Returned frames are lazy (no I/O until collected by the builder).

Parameters:

other – The pool to merge from.
other_ids_to_add – IDs from other to include (duplicates already resolved by the caller).

Returns:

(merged_entity, merged_static) — second element is None when neither side has static features.

select(lf: LazyFrame, feature_names: list[str], is_static: bool = False) → LazyFrame[source]#: Select store structural columns plus feature_names.

static(features: list[str] | str | None = None) → LazyFrame | None[source]#: Return static data in view schema with scopes and casts applied.

static_for_store(features: list[str] | str | None = None) → LazyFrame | None[source]#: Return static data in store schema after view scopes and casts.

temporal(features: list[str] | str | None = None, *, with_store_index: bool = False) → LazyFrame[source]#: Return temporal data in view schema with scopes and casts applied.

temporal_for_store(features: list[str] | str | None = None) → LazyFrame[source]#: Return temporal data in store schema after view scopes and casts.

class tanat.sequence.base.view_mixin.SequenceViewMixin[source]#

Bases: object

Mixin providing the view-layer helpers shared by SequencePool and Sequence.

Note: CachableSettings is inherited by SequencePool and Sequence themselves, not by this mixin.

property has_entity_filter_expr: bool[source]#: Whether this view has an expression-based entity filter.

property metadata: SequenceMetadata[source]#

Returns rich metadata fully reflecting this view’s cast recipes, masks, and feature selection.

When the view is created from a parent SequencePool, the pool’s metadata is returned directly, or scoped by the view’s settings when the view was built with a feature subset.

For a standalone view (no parent), metadata is inferred directly from the assembled, cast and masked LazyFrames. The seq_id dtype is derived from the cast recipe (or the store schema when no cast is active).

Automatically cached via CachableSettings: the cache is invalidated whenever settings change (e.g. after cast_features or drop_features).
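The caching behaviour can be pictured with a toy class. This is only a sketch of the general pattern (compute once, clear the cache when settings change). CachedMetadataView is hypothetical and says nothing about how CachableSettings is actually implemented.

```python
class CachedMetadataView:
    """Hypothetical view whose metadata cache is cleared on settings changes."""

    def __init__(self, features: list[str]):
        self._features = list(features)
        self._metadata_cache = None
        self.compute_count = 0  # only here to make cache hits observable

    def drop_features(self, names: list[str]) -> None:
        self._features = [f for f in self._features if f not in names]
        self._metadata_cache = None  # settings changed: invalidate the cache

    @property
    def metadata(self) -> dict[str, str]:
        if self._metadata_cache is None:
            self.compute_count += 1
            # Placeholder payload; real metadata would describe each feature.
            self._metadata_cache = {name: "unknown" for name in self._features}
        return self._metadata_cache


view = CachedMetadataView(["dose", "drug"])
view.metadata
view.metadata                 # served from the cache, no recomputation
view.drop_features(["drug"])  # invalidates the cache
view.metadata                 # recomputed for the reduced feature set
print(view.compute_count)
```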

Module contents#

Package stub.