tanat.sequence.base package
Submodules
tanat.sequence.base.entity module
Entity: Flyweight object representing a single row in a Sequence (Event, State, etc.).
- class tanat.sequence.base.entity.Entity(id_value, store: str | Path | SequenceStore, features: list[str] | None = None, *, rank: int, store_index: int, cast_recipe: SequenceCastRecipe | dict | None = None, virtual_id: str | None = None, parent_metadata: SequenceMetadata | None = None)[source]#
Bases: Registrable, ABC
Abstract Flyweight object acting as a proxy to a specific row in a Sequence.
- __init__(id_value, store: str | Path | SequenceStore, features: list[str] | None = None, *, rank: int, store_index: int, cast_recipe: SequenceCastRecipe | dict | None = None, virtual_id: str | None = None, parent_metadata: SequenceMetadata | None = None) None[source]#
Create an entity proxy for a single row in a sequence.
- Parameters:
id_value – Sequence identifier this entity belongs to.
store – Store path, name, or SequenceStore instance.
features – Visible feature names propagated from the parent Sequence. None → all store features.
rank – 0-based position within the sequence as seen by the user (accounts for filtering/masking).
store_index – Absolute physical row index in the store (SCH.STORE_INDEX).
cast_recipe – Cast recipe propagated from the parent Sequence. Normalised via SequenceCastRecipe.coerce().
virtual_id – Virtual context UUID from the parent Sequence.
parent_metadata – Pre-computed SequenceMetadata from the parent pool. When provided, metadata returns this directly (no extra I/O).
- data(features: list[str] | None = None) dict[str, Any][source]#
Access the feature values for this entity as a dictionary.
Only feature columns are returned; the sequence identifier and time columns are excluded (use id_value and temporal_extent instead).
- Parameters:
features – Feature name(s) to include (None → all visible entity features).
- Returns:
A dict mapping feature names to their scalar values.
- classmethod from_parent(parent: Sequence, *, rank: int, store_index: int, prefetched: PrefetchedEntityData | None = None) Entity[source]#
Build a sequence-managed entity. Not part of the public API.
Snapshots the shared context from parent (store, features, casts, virtual_id, metadata), bypassing store and feature resolution. The per-row prefetched is optional: present during iteration, absent for point access.
- property metadata: dict[str, FeatureInfo][source]#
Feature metadata for this entity.
Returns a dictionary mapping each visible feature name to its FeatureInfo descriptor (type, stats, …). When created from a Sequence (which itself comes from a Pool), the pool-level metadata is reused directly. Stats are consistent across all entities in the pool, and no extra I/O is needed.
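The flyweight idea behind Entity can be sketched in plain Python. This is an illustration only, not tanat's implementation: RowStore and RowProxy are hypothetical names, the store is a simple in-memory list of dicts, and the column names "id" and "time" are assumptions.

```python
from __future__ import annotations

from typing import Any


class RowStore:
    """Hypothetical in-memory store: one dict per physical row."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self.rows = rows


class RowProxy:
    """Flyweight-style proxy: keeps a reference to the shared store and a row index, never a copy."""

    __slots__ = ("store", "store_index", "features")

    def __init__(self, store: RowStore, store_index: int, features: list[str] | None = None) -> None:
        self.store = store
        self.store_index = store_index
        self.features = features  # None means "every column except id/time"

    def data(self, features: list[str] | None = None) -> dict[str, Any]:
        # The row is looked up on demand; only feature columns are exposed.
        row = self.store.rows[self.store_index]
        visible = self.features or [k for k in row if k not in ("id", "time")]
        wanted = features or visible
        return {name: row[name] for name in wanted}


store = RowStore([
    {"id": 1, "time": 0, "heart_rate": 71, "label": "rest"},
    {"id": 1, "time": 5, "heart_rate": 96, "label": "walk"},
])
first = RowProxy(store, 0)
second = RowProxy(store, 1, features=["heart_rate"])
```

Both proxies share one store object, so creating many of them costs little beyond an index and a reference.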
- class tanat.sequence.base.entity.PrefetchedEntityData(_row: dict[str, Any], _time_columns: tuple[str, ...], _feature_columns: tuple[str, ...])[source]#
Bases: object
One entity row pre-collected by the parent Sequence during iteration.
Owns the view-schema row and knows how to expose it the way Entity does, so Entity never slices the raw row itself.
tanat.sequence.base.pool module
Base class for sequence pool objects.
- class tanat.sequence.base.pool.SequencePool(store: SequenceStore, settings: SequenceSettings | dict, cast_recipe: SequenceCastRecipe | dict | None = None)[source]#
Bases: ABC, SequenceViewMixin, CachableSettings, Registrable
Base class for sequence pool objects.
- __init__(store: SequenceStore, settings: SequenceSettings | dict, cast_recipe: SequenceCastRecipe | dict | None = None) None[source]#
Base initialiser. Delegated to by concrete subclasses after store and feature resolution have been performed.
- Parameters:
store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings (or equivalent dict). entity_features and static_features are never None.
cast_recipe – Optional cast recipe (or dict) applied at read time. Normalised via SequenceCastRecipe.coerce() and probed eagerly.
- Raises:
TypeError – If cast_recipe is not a SequenceCastRecipe, dict, or None.
- add_entity_features(df: DataFrame | DataFrame | LazyFrame, *, overwrite: bool = False) None[source]#
Add new entity features to the virtual store.
The input DataFrame must be positionally aligned with the full entity row set of the store (i.e. it must have exactly as many rows as there are entity rows in the physical store, not just the current view). Use save() first to materialise a filtered view before calling this method.
- Parameters:
df – Feature-only DataFrame (no ID column) positionally aligned with the entity rows in the store. Can be pandas, Polars eager, or Polars lazy.
overwrite – If True, replace existing features with the same name in the virtual context.
- Raises:
RuntimeError – If the pool has an active _id_mask or entity filter expression. Call pool.save() first and then add features to the resulting unfiltered pool.
ValueError – If the number of rows in df does not match the number of entity rows in the store.
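To make the positional-alignment requirement concrete, here is a minimal sketch in plain Python. It is not tanat code: attach_positional is a hypothetical helper and the store is modelled as a list of row dicts. It only shows why the row counts must match.

```python
def attach_positional(store_rows, new_columns):
    """Attach each new column to the store rows by position (hypothetical helper)."""
    n_rows = len(store_rows)
    for name, values in new_columns.items():
        # Without an ID column there is nothing to join on, so the lengths must agree.
        if len(values) != n_rows:
            raise ValueError(f"{name!r} has {len(values)} rows, store has {n_rows}")
    # Row i of every new column lands on row i of the store.
    return [
        {**row, **{name: values[i] for name, values in new_columns.items()}}
        for i, row in enumerate(store_rows)
    ]


store_rows = [
    {"id": 1, "value": 2.0},
    {"id": 1, "value": 4.0},
    {"id": 2, "value": 6.0},
]
extended = attach_positional(store_rows, {"doubled": [4.0, 8.0, 12.0]})

try:
    attach_positional(store_rows, {"doubled": [4.0, 8.0]})  # one row short
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)
```

A column computed on a filtered view would be shorter than the physical store, which is the situation the RuntimeError and ValueError above guard against.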
- add_static_features(df: DataFrame | DataFrame | LazyFrame, *, id_column: str | None = None, overwrite: bool = False) None[source]#
Add static features to the virtual store via an ID-keyed join.
The input DataFrame must include the ID column (either under settings.id_column or under the name given by id_column). A LEFT JOIN against the full sequence index is performed internally, so partial DataFrames (covering only a subset of IDs) are accepted: IDs absent from df receive null in the virtual context.
- Parameters:
df – DataFrame containing the ID column plus one or more feature columns. Can be pandas, Polars eager, or Polars lazy.
id_column – Name of the ID column in df. Defaults to settings.id_column when None. Pass an explicit name when the join key in df differs from the pool’s public ID name (e.g. id_column="patient_id").
overwrite – If True, replace existing features with the same name in the virtual context.
- Raises:
KeyError – If the resolved ID column is not found in df.
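The LEFT JOIN behaviour can be illustrated with a small stdlib-only sketch. left_join_static is a hypothetical stand-in, not the library's code, and None plays the role of null.

```python
def left_join_static(all_ids, partial_rows, id_column="id"):
    """Left-join partial feature rows onto the full ID index (illustrative only)."""
    # row[id_column] raises KeyError when the join key is absent from a row.
    by_id = {row[id_column]: row for row in partial_rows}
    feature_names = sorted({k for row in partial_rows for k in row if k != id_column})
    joined = []
    for seq_id in all_ids:
        match = by_id.get(seq_id, {})
        # IDs without a matching row receive None for every feature.
        joined.append({"id": seq_id, **{f: match.get(f) for f in feature_names}})
    return joined


joined = left_join_static(
    [1, 2, 3],
    [{"patient_id": 1, "age": 54}, {"patient_id": 3, "age": 61}],
    id_column="patient_id",
)
```

Every ID of the index appears exactly once in the result, whether or not the input covered it.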
- apply(exprs: Expr | list[Expr], is_static: bool = False, *, by_id: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Evaluate Polars expressions against the current features.
This is a read-only computation: the result is returned, not stored. Use add_entity_features() or add_static_features() to persist the result.
Each expression must produce a named column (.alias()).
When by_id=True, the result always includes the ID column (settings.id_column).
- Parameters:
exprs – One or more Polars expressions producing new columns.
is_static – Whether to read static or entity features.
by_id – If True, expressions are evaluated per sequence (group_by on the sequence ID). Only valid for entity features (is_static=False). The ID column is included in the result.
fmt – Format of the returned object. One of:
    "pandas" (default): returns a pandas.DataFrame.
    "polars": returns a polars.DataFrame.
- Returns:
The computed columns as a DataFrame. When by_id=True, the first column is the sequence ID.
- Raises:
ValueError – If by_id=True and is_static=True.
Examples
Compute and inspect:
    result = pool.apply(
        (pl.col("age") * pl.col("score")).alias("age_score"),
        is_static=True,
    )
    print(result)
Persist entity features:
    result = pool.apply(
        (pl.col("value") - pl.col("value").mean()).alias("centered"),
    )
    pool.add_entity_features(result)
Per-sequence aggregation (result includes ID column):
    summary = pool.apply(
        pl.col("value").mean().alias("value_mean"),
        by_id=True,
    )
    pool.add_static_features(summary)
Per-sequence normalization (result includes ID column):
    normed = pool.apply(
        (pl.col("value") - pl.col("value").mean()).alias("v_normed"),
        by_id=True,
    )
    pool.add_entity_features(normed)
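The difference between pool-wide and per-sequence evaluation matters most for aggregates such as mean(). The following stdlib-only sketch mimics the two modes on toy data. It assumes that, without by_id, an aggregate sees every entity row at once; that is an inference from the description above, not a statement about tanat internals.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"id": "a", "value": 1.0},
    {"id": "a", "value": 3.0},
    {"id": "b", "value": 10.0},
    {"id": "b", "value": 20.0},
]

# Pool-wide evaluation: a single mean over all entity rows.
global_mean = mean(r["value"] for r in rows)
centered_global = [r["value"] - global_mean for r in rows]

# Per-sequence evaluation: one mean per sequence ID, as a group_by would give.
groups = defaultdict(list)
for r in rows:
    groups[r["id"]].append(r["value"])
group_means = {seq_id: mean(values) for seq_id, values in groups.items()}
centered_by_id = [r["value"] - group_means[r["id"]] for r in rows]
```

With per-sequence centering each sequence ends up with a zero mean, which the pool-wide variant does not guarantee.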
- binned_data(features: list[str] | str, bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True, bin_col: str = '__bin__') DataFrame | DataFrame[source]#
Project sequences onto a binned temporal table (long-format dataframe).
Each sequence is aligned to a shared time axis divided into fixed-size bins. When multiple values compete for the same bin, overlap_rule resolves the ambiguity. Empty bins are filled with fill_value.
For an ML-ready 3-D tensor with feature labels and ID order, see to_tensor().
- Parameters:
features – Feature(s) to project onto the grid.
bin_size – Width of each bin (duration string for datetime, numeric otherwise).
max_bins – Maximum number of bins. None infers from the data span, capped by MAX_BINS_LIMIT. An explicit value bypasses the safety cap; the caller opts in knowingly.
fill_value – Value used to fill empty bins. None keeps nulls.
overlap_rule – Polars aggregation name for in-bin conflict resolution ("first", "last", "mean", "max", "sum", "median", …).
ohe – One-hot encode features before binning. Requires Categorical or Enum dtypes.
fmt – "pandas" (default) or "polars".
use_arrow – Pandas conversion uses Arrow when True.
bin_col – Output column name for the bin index.
- Returns:
DataFrame with columns [id_col, bin_col, *feature_cols] and len(unique_ids) * n_bins rows.
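As a rough illustration of the binning logic, the sketch below projects toy events onto a fixed grid in plain Python. It is not the library's pipeline: bin_long_format is a hypothetical helper, bins are assumed to start at time 0, and only the "first" overlap rule is modelled.

```python
def bin_long_format(events, ids, bin_size, n_bins, fill_value=None):
    """Project (id, time, value) events onto a dense id x bin grid, keeping the first value per bin."""
    grid = {(seq_id, b): fill_value for seq_id in ids for b in range(n_bins)}
    filled = set()
    for seq_id, t, value in events:  # events are assumed to be time-ordered
        b = int(t // bin_size)
        if 0 <= b < n_bins and (seq_id, b) not in filled:
            grid[(seq_id, b)] = value  # "first" rule: later values in the bin are ignored
            filled.add((seq_id, b))
    # One row per (id, bin) pair, so len(ids) * n_bins rows in total.
    return [
        {"id": seq_id, "__bin__": b, "value": grid[(seq_id, b)]}
        for seq_id in ids
        for b in range(n_bins)
    ]


events = [(1, 0, "A"), (1, 3, "B"), (1, 12, "C"), (2, 11, "D")]
table = bin_long_format(events, ids=[1, 2], bin_size=10, n_bins=2)
```

Here "B" loses to "A" in bin 0 of sequence 1, and the empty bin 0 of sequence 2 keeps the fill value.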
- cast_features(schema: dict[str, DataType | type], is_static: bool = False) None[source]#
Casts feature columns to new types, scoped to this Pool only.
To make a cast permanent on disk, save the Pool first (pool.save()) and reload. Persisting changes might affect other views sharing the same store, so use with caution.
- Parameters:
schema – Dictionary mapping feature names to target Polars DataTypes.
is_static – Whether these are static features (True) or entity features (False).
- Raises:
TypeError – If schema is not a dict.
KeyError – If a feature name does not exist.
- cast_id(dtype: DataType) None[source]#
Casts the ID column to a new type.
- Parameters:
dtype – The target Polars DataType.
- cast_to_datetime(unit: str = 'us', time_zone: str | None = None)[source]#
Cast time columns to Datetime.
- Parameters:
unit – The datetime resolution (“ms”, “us”, “ns”). Default is “us” (microsecond), the Python standard.
time_zone – Optional timezone string (e.g. “UTC”, “Europe/Paris”).
- cast_to_timestep(dtype: DataType = Int64)[source]#
Cast time columns to numeric-based timesteps.
- Parameters:
dtype – The target numeric type (e.g., pl.UInt32, pl.Int64, pl.Float64). Default is pl.Int64 for safety.
- Raises:
TypeError – If dtype is not a numeric type.
TypeError – If the underlying data is already in Datetime format. (Conversion from Datetime to Timestep is not allowed).
- copy() SequencePool[source]#
Returns a shallow copy of this Pool, sharing the same store but with all view state (masks, casts, virtual features) preserved.
The virtual context is forked into a new UUID so that the copy owns its own independent context. Garbage-collecting either instance will not destroy the other’s virtual features.
The T0 strategy (_t0_setter) is propagated to the copy. The T0 result cache is not copied; it is recomputed on the first call to t0_data() on the copy.
- describe(by_id: bool = True, add_to_static: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Compute summary statistics for every sequence in the pool.
- Parameters:
by_id – If True (default), return one row per sequence ID with columns [id, length, n_unique_entities, …]. If False, return the cross-sequence pandas describe() output (count, mean, std, min, 25%, …).
add_to_static – If True, write the per-ID result to the static-feature store via add_static_features(). Ignored (with a warning) when by_id=False.
fmt – "pandas" (default) or "polars". Ignored when by_id=False (always pandas).
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
by_id=True: DataFrame with one row per sequence ID.
by_id=False: Aggregated statistics (pandas describe() output).
Examples:
    pool.describe()                    # one row per ID, pandas
    pool.describe(fmt="polars")        # same, polars
    pool.describe(by_id=False)         # cross-ID stats
    pool.describe(add_to_static=True)  # persist as static cols
- drop_features(features: list[str], is_static: bool = False, *, permanently: bool = False) None[source]#
Removes features from the current view.
By default, this is a soft drop: features are removed from the Pool settings so they no longer appear in temporal_data(), static_data() or metadata, but the underlying files are left untouched.
With permanently=True the columns are also physically deleted from disk (physical store and/or virtual store).
- Parameters:
features – Feature names to drop.
is_static – True for static features, False for entity features.
permanently – If True, also delete the columns from disk. This is irreversible for physical features.
- extend(other: SequencePool | Sequence, destination: str | Path | None = None, *, on_duplicate: Literal['raise', 'skip'] = 'raise', overwrite: bool = False) SequencePool[source]#
Merge other into this pool and write the result to disk.
Mirrors the semantics of save().
Same-store fast path: if self and other point to the same physical store and neither carries virtual content (_virtual_id is None on both sides), no read I/O is performed. A new pool backed by the same store with the union of both ID masks is built immediately. If destination is provided the merged pool is then materialised to disk via save(); otherwise it is returned as an in-memory view with zero I/O.
Different stores (or virtual content present): a full merge-and-write is performed and destination is required. A named destination writes the merged data to a new store. To rewrite in-place, pass destination=self._store.root_path explicitly together with overwrite=True.
- Parameters:
other – Data to merge. Accepted types:
    SequencePool: must be the same concrete subclass with an identical entity feature schema.
    Sequence: single sequence object; schema is checked against this pool.
destination – None → in-memory view (same-store fast path only; no I/O); str / Path → materialise the merged data to disk. destination is required when merging from different stores.
on_duplicate – Behaviour when other contains an ID already present in this pool:
    "raise" (default): raise ValueError listing the conflicting IDs.
    "skip": silently ignore duplicates.
overwrite – Allows overwriting an existing destination when it already exists on disk.
- Returns:
A new SequencePool instance.
- Raises:
TypeError – If other is not a SequencePool or Sequence.
ValueError – If other is a SequencePool of a different concrete type.
ValueError – If other is missing entity features declared in this pool’s settings.
ValueError – If on_duplicate="raise" and duplicate IDs are found.
ValueError – If destination=None and stores differ (same-store fast path only supports an in-memory view without I/O).
FileExistsError – If destination exists and overwrite=False.
See also
- filter_entities(criterion: Criterion, *, inplace: bool = False, verbose: bool = True) SequencePool[source]#
Return a view with entities pruned by criterion.
- Parameters:
- Returns:
Filtered pool (or self when inplace=True).
- Raises:
TypeError – If criterion is not a Criterion object.
CriterionLevelError – If the criterion does not support entity filtering.
- classmethod from_parent(store_path: Path, *, parent_pool: TrajectoryPool) SequencePool[source]#
Create a managed sub-pool owned by parent_pool.
Builds the pool from store_path, aligns settings and casts with the parent TrajectoryPool, and marks it locked. T0 is resolved lazily via the _t0_setter property which delegates to parent_pool.
- Parameters:
store_path – Root path of the sequence store to load.
parent_pool – The owning TrajectoryPool.
- Returns:
A locked SequencePool whose T0 delegates to parent_pool.
- get_sequences(entity_features: list[str] | None = None, static_features: list[str] | None = None) dict[str, Sequence][source]#
Return a mapping of sequence IDs to Sequence objects.
- Parameters:
entity_features – Entity feature subset to expose in each sequence. None -> use pool-level settings.
static_features – Static feature subset to expose in each sequence. None -> use pool-level settings.
- Returns:
Dict mapping each visible sequence ID to its Sequence instance.
Examples:
    seqs = pool.get_sequences()
    print(seqs[42].temporal_data())
- property is_dirty: bool[source]#
True if the pool has state not yet written to disk.
Covers virtual features, view scopes, type casts, and soft feature drops. A dirty pool needs save() to materialise its current view.
- save(destination: str | Path | None = None, *, overwrite: bool = False) Path[source]#
Persist the current pool state (virtual features + view masks).
Without destination the store is rewritten in-place. With destination the pool is rebuilt into that path and then redirects to it; the original files are left untouched.
In both cases the pool is left in a clean state after a successful save: virtual context, masks, soft-drops and cast recipe are all reset, and is_dirty becomes False.
When a mask is active and destination is None, the store is overwritten with a subset of the data. overwrite=True is required to confirm.
- Parameters:
destination – None → in-place; path → rebuild into new path and redirect the pool there.
overwrite – Required when saving a filtered view in-place. Also allows overwriting an existing destination.
- Returns:
The Path of the written store.
- Raises:
FileExistsError – If destination exists and overwrite is False.
RuntimeError – If a mask is active and the save is in-place without overwrite=True.
- set_t0(*, position: int | None = None, direct=None, feature: str | None = None, query: Expr | None = None, anchor: Literal['start', 'end', 'middle'] | None = None, use_first: bool = True) SequencePool[source]#
Configure the T0 strategy for this pool.
Exactly one strategy keyword (position, direct, feature, or query) must be provided; the others must remain None. anchor and use_first only refine how the chosen strategy is resolved.
- Parameters:
position – Row index (0-based; negative indexing supported).
direct – Scalar value or {seq_id: value} dict.
feature – Static feature column name.
query – Polars boolean expression on any sequence column (time columns or entity features).
anchor – Which end of each interval/state row to use as the reference timestamp for the floor lookup:
    "start" (default): use the start timestamp.
    "end": use the end timestamp.
    "middle": use the midpoint (start + end) / 2.
Omitting anchor= on an interval/state pool emits a UserWarning and defaults to "start". Passing anchor= on an event pool emits a UserWarning and the value is ignored (single time column, anchor is irrelevant).
use_first – For the query strategy, whether to take the first (True) or last (False) matching row.
- Returns:
self for chaining.
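The "exactly one strategy" rule and the three anchor choices can be sketched as follows. Both functions are hypothetical helpers written for this illustration; in particular, the exception type raised by tanat for conflicting keywords is not documented here, so ValueError is an assumption.

```python
def pick_strategy(*, position=None, direct=None, feature=None, query=None):
    """Return the single (name, value) strategy pair; raise if zero or several are given."""
    candidates = (
        ("position", position),
        ("direct", direct),
        ("feature", feature),
        ("query", query),
    )
    # Compare against None explicitly so that position=0 still counts as provided.
    given = [(name, value) for name, value in candidates if value is not None]
    if len(given) != 1:
        raise ValueError(f"expected exactly one strategy keyword, got {len(given)}")
    return given[0]


def reference_time(start, end, anchor="start"):
    """Pick the reference timestamp of an interval row for a given anchor."""
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "middle":
        return (start + end) / 2
    raise ValueError(f"unknown anchor: {anchor!r}")


chosen = pick_strategy(position=0)
midpoint = reference_time(10, 20, anchor="middle")

try:
    pick_strategy(position=0, feature="t0_age")  # two strategies at once
    conflict = None
except ValueError as exc:
    conflict = str(exc)
```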
- static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame | None[source]#
Return static (non-temporal) data for all sequences in this pool.
- Parameters:
features – Feature name(s) to include (None -> all static features).
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
One-row-per-sequence DataFrame with columns [id, feature...]. None when no static features are exposed by this pool.
Examples:
    df = pool.static_data()                  # pandas, all static features
    df = pool.static_data(["age", "sex"])    # subset
    df = pool.subset([1, 2, 3]).static_data()
- subset(ids, *, inplace=False) SequencePool[source]#
Returns a new Pool containing only the specified sequence IDs.
- Parameters:
ids – A list of sequence IDs to include in the subset.
inplace – If True, modify this Pool’s view instead of returning a new one.
- Returns:
A new SequencePool instance with the subset of IDs, or self if inplace=True.
- Raises:
ValueError – If any ID in ids is not present in unique_ids.
- survival_target(endpoint_time: str, occurred: str | None = None, censure_time: str | None = None, fmt: Literal['sksurv', 'polars', 'pandas'] = 'sksurv') tuple[ndarray | DataFrame | DataFrame, list][source]#
Build a survival target (occurred, time) from static features stored in the pool.
For each patient, assembles a binary indicator (did the endpoint occur?) and the corresponding duration (time from T0 to the endpoint, or to the last observation for censored patients). Patients with unresolvable or non-positive durations are excluded and reported via a warning.
Durations are computed as the difference between the absolute value stored in the static column and the per-patient T0.
If censure_time is None, the last recorded time in the sequence is used as the censoring reference.
- Parameters:
endpoint_time – Name of the static column containing the absolute time at which the endpoint occurred (e.g. age at death, event datetime). T0 is subtracted internally to produce the duration. Null = endpoint not observed (censored), when occurred is None. Expected dtype: same as the pool time axis (numeric for timestep pools, datetime for datetime pools).
occurred – Name of the static column with a binary endpoint indicator. True (or 1) = endpoint observed, False (or 0) or null = censored. Expected dtype: bool or numeric (int or float); cast to bool internally. If None, inferred as endpoint_time.is_not_null().
censure_time – Name of the static column containing the absolute time of the last observation for censored patients (e.g. age at last visit, last visit datetime). T0 is subtracted internally to produce the duration. Expected dtype: same as the pool time axis. If None, derived automatically as max(get_temporal_columns()[-1]) per patient (i.e. max of time_column for events, max of end_column for state/interval pools).
fmt – Format of the returned target y. “sksurv” returns a structured np.ndarray with fields (occurred: bool, time: float) compatible with scikit-survival. “polars” and “pandas” return a DataFrame with columns [“id”, “occurred”, “time”].
- Returns:
A tuple (y, valid_ids). y is the survival target in the requested format. valid_ids is the list of patient identifiers retained after filtering invalid rows.
- Raises:
KeyError – If a column name is not found in the pool’s static features.
RuntimeError – If no static features are available in this pool.
Examples
>>> y, valid_ids = pool.survival_target(
...     endpoint_time="death_age_occur",
... )
>>> y, valid_ids = pool.survival_target(
...     endpoint_time="death_age_occur",
...     occurred="death",
... )
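The duration arithmetic described above can be traced on toy numbers. The sketch below is a stdlib-only approximation: survival_rows is a hypothetical helper, the occurred flag is inferred from a non-null endpoint_time, and the warning for excluded patients is omitted.

```python
def survival_rows(static_rows, t0, last_seen):
    """Build (occurred, duration) pairs and the list of retained IDs (illustrative only)."""
    y, valid_ids = [], []
    for row in static_rows:
        pid = row["id"]
        occurred = row["endpoint_time"] is not None
        # Observed endpoint: use its time. Censored: fall back to the last observation.
        reference = row["endpoint_time"] if occurred else last_seen[pid]
        duration = reference - t0[pid]
        if duration <= 0:
            continue  # non-positive durations are excluded
        y.append((occurred, float(duration)))
        valid_ids.append(pid)
    return y, valid_ids


static_rows = [
    {"id": 1, "endpoint_time": 70},    # endpoint observed at age 70
    {"id": 2, "endpoint_time": None},  # censored
    {"id": 3, "endpoint_time": 40},    # endpoint before T0, will be dropped
]
t0 = {1: 60, 2: 55, 3: 45}
last_seen = {1: 72, 2: 63, 3: 50}

y, valid_ids = survival_rows(static_rows, t0, last_seen)
```

Patient 3 is dropped because the endpoint precedes T0, which is why valid_ids is returned alongside y.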
- t0_data(fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Return the T0 table for the sequences visible in this pool view.
Thin public wrapper around _get_t0_df() that handles format conversion.
- Parameters:
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
DataFrame with columns [id_col, _T0_, _T0_NEAREST_RANK_], one row per visible sequence.
Examples:
    pool.set_t0(position=0, anchor="start")
    df = pool.t0_data()
    df_pl = pool.t0_data(fmt="polars")
- temporal_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Return temporal data for all sequences visible in this pool.
Each row is one entity: the atomic observation of a sequence (an event, a state, or a time-step). Each entity carries the sequence ID, its temporal position (one column for events, two for intervals), and entity features: the per-row measurements that vary along the sequence (e.g. heart rate, label, sensor value).
- Parameters:
features – Entity feature name(s) to include. None → all entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
Long-format DataFrame with columns [id, temporal…, feature…] covering every visible sequence.
Examples:
    df = pool.temporal_data()                          # pandas, all features
    df = pool.temporal_data("heart_rate")              # single feature
    df = pool.temporal_data(["a", "b"], fmt="polars")
    # Restrict to a subset of IDs:
    df = pool.subset([1, 2, 3]).temporal_data()
- to_dummies(features: list[str] | str, is_static: bool = False, *, drop_first: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
One-hot encode categorical features.
Returns a DataFrame with binary columns for each category. Only features typed as Categorical or Enum are accepted. Cast first with cast_features if needed.
This is a consumption method: the result is returned, not stored in the pool. Use it when preparing data for training.
- Parameters:
features – Feature name(s) to encode.
is_static – Whether these are static or entity features.
drop_first – Drop the first category column to avoid multicollinearity (useful for linear models).
fmt – Format of the returned object. One of:
    "pandas" (default): returns a pandas.DataFrame.
    "polars": returns a polars.DataFrame.
- Returns:
DataFrame with binary columns (one per category per feature).
- Raises:
TypeError – If any feature is not Categorical or Enum.
Examples
Basic usage:
    pool.cast_features(schema={"status": pl.Categorical})
    dummies = pool.to_dummies("status")
    # → status_OK, status_ERROR, status_WARNING columns
With drop_first for linear models:
    X = pool.to_dummies("group", is_static=True, drop_first=True)
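For readers unfamiliar with drop_first, this plain-Python sketch shows what one-hot encoding produces with and without it. one_hot is a hypothetical helper, and the feature_category column naming simply follows the status_OK pattern shown above.

```python
def one_hot(values, categories, prefix, drop_first=False):
    """One-hot encode values against a fixed category list (illustrative only)."""
    # Dropping the first category removes the column that is implied by the others.
    kept = categories[1:] if drop_first else categories
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]


categories = ["OK", "ERROR", "WARNING"]
values = ["OK", "ERROR", "OK"]

full = one_hot(values, categories, "status")
reduced = one_hot(values, categories, "status", drop_first=True)
```

In the reduced encoding a row of all zeros stands for the dropped category, so no information is lost while the columns stop summing to one.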
- to_tensor(features: list[str] | str, bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, bin_col: str = '__bin__') tuple[ndarray, list, list[str]][source]#
Project sequences onto a 3-D temporal grid (ML-ready tensor).
Returns a dense (N, M, K) ndarray together with the N-axis sequence IDs and the list of K-axis feature labels. Sequence order on the N axis follows unique_ids.
For a long-format dataframe variant (joins, plotting, exploration), see binned_data().
- Parameters:
features – Feature(s) to project.
bin_size – Bin width (duration string for datetime, numeric otherwise).
max_bins – Maximum bins. None infers from the data span, capped by MAX_BINS_LIMIT. An explicit value bypasses the safety cap; the caller opts in knowingly.
fill_value – Value for empty bins. None keeps NaN.
overlap_rule – In-bin aggregation. See binned_data().
ohe – One-hot encode before binning. Post-OHE column names are reflected in the returned feature_names.
bin_col – Internal bin column name (forwarded to the underlying pipeline for consistency; not present in the output).
- Returns:
A 3-tuple (arr, ids, feature_names) where arr has shape (N, M, K) = (len(unique_ids), n_bins, len(feature_names)), ids is the list of sequence IDs matching the N-axis order (identical to unique_ids), and feature_names lists the K-axis labels in column order (post-OHE names when ohe=True).
Examples:
    arr, ids, names = pool.to_tensor(["dose", "route"], "1d", ohe=True)
    # arr.shape == (N, M, K)
    # names == ["dose", "route_oral", "route_iv"]
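The relation between a long-format binned table and the (N, M, K) layout can be sketched with nested lists. to_dense is a hypothetical helper, not the library's code; the real method returns a NumPy ndarray, and the "__bin__" column name is taken from the default bin_col.

```python
def to_dense(table, ids, n_bins, feature_names, fill_value=None):
    """Scatter long-format rows into a nested list indexed as [sequence][bin][feature]."""
    index = {seq_id: n for n, seq_id in enumerate(ids)}
    arr = [[[fill_value] * len(feature_names) for _ in range(n_bins)] for _ in ids]
    for row in table:
        n, m = index[row["id"]], row["__bin__"]
        for k, name in enumerate(feature_names):
            arr[n][m][k] = row[name]
    return arr, list(ids), list(feature_names)


table = [
    {"id": "a", "__bin__": 0, "dose": 1.0, "route_oral": 1},
    {"id": "a", "__bin__": 1, "dose": 2.0, "route_oral": 0},
    {"id": "b", "__bin__": 0, "dose": 0.5, "route_oral": 1},
]
arr, ids, names = to_dense(table, ids=["a", "b"], n_bins=2, feature_names=["dose", "route_oral"])
shape = (len(arr), len(arr[0]), len(arr[0][0]))
```

Returning ids and names next to the array keeps the axis labels available once the data has left the dataframe world.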
- train_test_split(*, test_size: float | int | None = None, train_size: float | int | None = None, random_state: int | None = None, shuffle: bool = True) tuple[SequencePool, SequencePool][source]#
Split the pool into train and test subsets.
Mirrors the interface of sklearn.model_selection.train_test_split().
- Parameters:
test_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the test subset. Defaults to 0.25 when both test_size and train_size are None.
train_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the train subset. Defaults to the complement of test_size.
random_state – Seed for the random number generator. Pass an integer for reproducibility.
shuffle – Whether to shuffle IDs before splitting. When False, the first IDs go to train and the last to test.
- Returns:
(train_pool, test_pool): two new non-overlapping pool views.
- Raises:
ValueError – If the pool is empty, sizes are non-positive, or n_train + n_test exceeds the pool size.
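A possible way to resolve the sizes and split a list of IDs is sketched below. split_ids is a hypothetical helper operating on plain lists rather than pools; rounding fractional sizes with ceil for test and floor for train follows scikit-learn's convention and is an assumption about tanat.

```python
import math
import random


def split_ids(ids, *, test_size=None, train_size=None, random_state=None, shuffle=True):
    """Split a list of IDs into non-overlapping train and test lists (illustrative only)."""
    n = len(ids)
    if n == 0:
        raise ValueError("cannot split an empty pool")
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default

    def resolve(size, rounder):
        if size is None:
            return None
        return rounder(size * n) if isinstance(size, float) else int(size)

    n_test = resolve(test_size, math.ceil)
    n_train = resolve(train_size, math.floor)
    # Whichever size is missing becomes the complement of the other.
    if n_test is None:
        n_test = n - n_train
    if n_train is None:
        n_train = n - n_test
    if n_train <= 0 or n_test <= 0 or n_train + n_test > n:
        raise ValueError(f"invalid split: n_train={n_train}, n_test={n_test}, n={n}")

    order = list(ids)
    if shuffle:
        random.Random(random_state).shuffle(order)
    # First IDs go to train, the following ones to test.
    return order[:n_train], order[n_train:n_train + n_test]


train, test = split_ids(list(range(10)), shuffle=False)
train_a, test_a = split_ids(list(range(10)), random_state=0)
train_b, test_b = split_ids(list(range(10)), random_state=0)
```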
- property unique_ids: list[source]#
Visible sequence IDs in store order as a plain Python list.
Respects _id_mask. Deterministic order (sorted at build time).
Warning
list erases rich Polars dtypes (Categorical → str). Prefer _id_lf when the result feeds a Polars join.
- which(criterion: Criterion, *, verbose: bool = True) set[source]#
Return the set of IDs in this pool satisfying criterion.
- Parameters:
criterion – A Criterion instance.
verbose – If True, print a one-line report.
- Returns:
Set of matching IDs.
- Raises:
TypeError – If criterion is not a Criterion object.
tanat.sequence.base.sequence module
Base class for sequence objects.
- class tanat.sequence.base.sequence.Sequence(id_value, store: SequenceStore, settings)[source]#
Bases: ABC, SequenceViewMixin, CachableSettings, Registrable
Interface to a single sequence within a Store.
A Sequence is a scoped view on the data for one specific ID. It shares the same SequenceStore as its parent Pool (no copy).
Typical creation patterns:
    # From a Pool (recommended)
    seq = pool[42]
    # Standalone
    seq = StateSequence(id_value=42, store="my_store")
- __init__(id_value, store: SequenceStore, settings) None[source]#
Base initialiser. Delegated to by concrete subclasses and from_parent() after store and feature resolution have been performed.
- Parameters:
id_value – Unique identifier for this sequence in the store.
store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings (entity_features and static_features are never None).
- apply(exprs: Expr | list[Expr], is_static: bool = False, *, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Evaluates Polars expressions against this sequence’s features.
This is a read-only computation scoped to this single sequence. The result is returned, not stored.
At the Pool level, use Pool.apply(by_id=True) for per-sequence computations across all sequences, then Pool.add_entity_features() or Pool.add_static_features() to persist.
- Parameters:
exprs – One or more Polars expressions producing new columns. Each must use .alias() to name the output.
is_static – Whether to read static or entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
The computed columns for this sequence only.
Examples
Local normalization:
    seq = pool[42]
    result = seq.apply(
        (pl.col("value") - pl.col("value").mean()).alias("v_centered")
    )
Multiple expressions:
    result = seq.apply([
        (pl.col("value").diff()).alias("v_diff"),
        (pl.col("value").rolling_mean(3)).alias("v_rm3"),
    ])
See also
Pool.apply: Apply across all sequences (with optional by_id).
Pool.add_entity_features: Persist entity features.
Pool.add_static_features: Persist static features.
- copy() Sequence[source]#
Return a standalone copy of this sequence, detached from any parent pool.
- Returns:
A new standalone Sequence with _parent_pool=None.
Examples:
    seq = pool[42]
    standalone = seq.copy()                         # detaches from pool
    standalone.filter_entities(crit, inplace=True)  # safe
- describe(fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Compute summary statistics for this single sequence.
- Parameters:
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
Single-row DataFrame with columns [length, n_unique_entities, temporal_span, …].
Examples:
    seq = pool[42]
    seq.describe()
    seq.describe(fmt="polars")
- filter_entities(criterion: Criterion, *, inplace: bool = False, verbose: bool = True) Sequence[source]#
Return a view with entities pruned by criterion.
- Parameters:
- Returns:
Filtered sequence (or self when inplace=True).
- Raises:
TypeError – If criterion is not a Criterion object.
CriterionLevelError – If the criterion does not support entity filtering.
- classmethod from_parent(id_value, store: SequenceStore, settings, *, parent_pool: SequencePool) Sequence[source]#
Create a pool-managed sequence. Not part of the public API.
Bypasses store resolution, feature resolution, and cast probe: all already performed by the pool. Every piece of pool context (casts, filters, virtual ID, T0) is read lazily from parent_pool via the corresponding cached properties.
- Parameters:
id_value – Sequence identifier.
store – Already-resolved SequenceStore.
settings – Fully-resolved SequenceSettings.
parent_pool – The owning SequencePool.
- Returns:
A new Sequence instance bound to parent_pool.
- match(criterion: Criterion) bool[source]#
Return True if this sequence satisfies criterion.
- Parameters:
criterion – A Criterion instance.
- Raises:
TypeError – If criterion is not a Criterion object.
CriterionLevelError – If the criterion is incompatible with this sequence.
- static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame | None[source]#
Return static (non-temporal) data for this sequence.
- Parameters:
features – Feature name(s) to include (None -> all).
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
Single-row DataFrame with columns [id, feature…]. None when no static features are exposed by this pool.
Examples:
seq = pool[42]
row = seq.static_data()  # pandas, all static features
row = seq.static_data(["age", "sex"])  # subset, passed as one list
- property t0: datetime | date | int | float | None[source]#
T0 value for this sequence (scalar, not a DataFrame).
None when no valid T0 row was found (e.g. sequence too short, or no row matched the query).
- property t0_nearest_rank: int | None[source]#
0-based rank of the nearest row at or before T0 within this sequence.
None when no valid T0 row was found (e.g. sequence too short, T0 before all timestamps, or no row matched the query).
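The "nearest row at or before T0" lookup can be illustrated with a binary search over sorted timestamps. This is only a sketch of the described behaviour, not the library's implementation: the function name nearest_rank_at_or_before is hypothetical, timestamps are assumed to be sorted in ascending order, and with duplicate timestamps this version picks the last matching row, which may differ from what the library does.

```python
from bisect import bisect_right


def nearest_rank_at_or_before(timestamps, t0):
    """Return the 0-based rank of the last timestamp <= t0, or None.

    Hypothetical helper; assumes `timestamps` is sorted ascending.
    """
    if t0 is None or not timestamps:
        return None
    # bisect_right returns the insertion point after any entries equal to t0,
    # so the element just before it is the last one at or before t0.
    pos = bisect_right(timestamps, t0)
    return pos - 1 if pos > 0 else None


times = [10, 20, 30, 40]
between = nearest_rank_at_or_before(times, 25)  # T0 falls between two rows
before_all = nearest_rank_at_or_before(times, 5)  # T0 precedes every row
```

The second call mirrors the "T0 before all timestamps" case listed above, where no rank can be reported.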
- temporal_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#
Return temporal data for this sequence.
Each row is one entity: the atomic observation of this sequence (an event, a state, or a time-step). Each entity carries the sequence ID, its temporal position (one column for events, two for intervals), and entity features: the per-row measurements that vary along the sequence (e.g. heart rate, label, sensor value).
- Parameters:
features – Entity feature name(s) to include. None → all entity features.
fmt – "pandas" (default) or "polars".
use_arrow – Use Arrow extension arrays for polars -> pandas conversion.
- Returns:
DataFrame with columns [id, temporal…, feature…] scoped to this sequence ID.
Examples:
seq = pool[42]
df = seq.temporal_data()  # pandas, all features
df = seq.temporal_data("heart_rate")  # single feature
df = seq.temporal_data(fmt="polars")
tanat.sequence.base.settings module#
Abstract class for sequence settings.
- class tanat.sequence.base.settings.SequenceSettings(*, id_column: str, entity_features: list[str], static_features: list[str] = <factory>)[source]#
Bases: ABC
Abstract class for sequence settings.
- available_features(is_static: bool = False) list[str][source]#
Returns all feature names for the given scope.
- Parameters:
is_static – If True, return static features; otherwise entity features.
- Returns:
List of feature names (may be empty).
- get_column_rename_map(is_static: bool = False) dict[str, str][source]#
Returns a mapping from store internal column names (StoreSchema) to user-facing column names.
The time-column part of the mapping is inferred from get_time_columns():
- 1 column → T_EVENT
- 2 columns → T_START, T_END (in order)
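The time-column inference can be sketched as follows. The function name time_rename_map is hypothetical, the string keys stand in for the StoreSchema constants (whose actual values are not shown here), and raising ValueError for any other column count is an assumption. The real mapping presumably also covers the identifier and feature columns, which this sketch leaves out.

```python
def time_rename_map(time_columns):
    """Map placeholder store-internal time names to user-facing names.

    Hypothetical helper; the keys below stand in for StoreSchema constants.
    """
    if len(time_columns) == 1:
        internal = ["T_EVENT"]
    elif len(time_columns) == 2:
        # Order matters: the first user column is the start, the second the end.
        internal = ["T_START", "T_END"]
    else:
        # Assumption: other counts are rejected.
        raise ValueError(f"expected 1 or 2 time columns, got {len(time_columns)}")
    return dict(zip(internal, time_columns))


event_map = time_rename_map(["date"])
interval_map = time_rename_map(["admission", "discharge"])
```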
- abstractmethod get_time_columns() list[str][source]#
Returns a list of time index columns configured for this sequence type.
- is_compatible_with(other: SequenceSettings) tuple[bool, list[str]][source]#
Check if these settings are compatible with another SequenceSettings instance.
Compatibility rules:
- id_column must be identical
- time index columns must be identical
- entity_features must be identical or a subset (no extra features)
- static_features must be identical or a subset (no extra features)
- Parameters:
other – SequenceSettings instance to compare with.
- Returns:
Tuple of (is_compatible, list_of_errors)
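A minimal sketch of the four rules, returning the same (is_compatible, list_of_errors) shape. ToySettings and check_compatibility are hypothetical stand-ins, not tanat classes, and the subset direction is an assumption: here other may not carry features that are absent from the base settings.

```python
from dataclasses import dataclass, field


@dataclass
class ToySettings:
    """Hypothetical stand-in for a concrete SequenceSettings subclass."""
    id_column: str
    time_columns: list
    entity_features: list
    static_features: list = field(default_factory=list)


def check_compatibility(base, other):
    """Return (is_compatible, errors), collecting every violated rule."""
    errors = []
    if base.id_column != other.id_column:
        errors.append(f"id_column differs: {base.id_column!r} vs {other.id_column!r}")
    if base.time_columns != other.time_columns:
        errors.append("time index columns differ")
    # Assumption: "subset" means other must not add features unknown to base.
    for name in ("entity_features", "static_features"):
        extra = sorted(set(getattr(other, name)) - set(getattr(base, name)))
        if extra:
            errors.append(f"{name} has extra features: {extra}")
    return (not errors, errors)


base = ToySettings("patient_id", ["date"], ["hr", "label"], ["age"])
ok, ok_errors = check_compatibility(base, ToySettings("patient_id", ["date"], ["hr"]))
bad, bad_errors = check_compatibility(
    base, ToySettings("patient_id", ["date"], ["hr", "spo2"])
)
```

Collecting all errors before returning, rather than stopping at the first, lets a caller such as validate_compatibility report every mismatch in a single exception message.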
- model_dump(*, mode='python', **dump_kwargs)[source]#
Dump settings to a dict via Pydantic serialization.
- classmethod normalize_entity_features(v)[source]#
Normalize to a sorted, deduplicated list; require at least one.
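The normalization amounts to roughly the following. This is a sketch only: the function name normalize_features is hypothetical, and both the acceptance of a bare string and the use of ValueError are assumptions rather than documented behaviour.

```python
def normalize_features(values):
    """Return a sorted, deduplicated list; reject an empty result.

    Hypothetical helper illustrating the validator's contract.
    """
    if isinstance(values, str):
        # Assumption: a single name is treated as a one-element list.
        values = [values]
    normalized = sorted(set(values))
    if not normalized:
        # Assumption: the exception type is ValueError.
        raise ValueError("at least one entity feature is required")
    return normalized


features = normalize_features(["label", "hr", "label"])
```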
- validate_compatibility(other: SequenceSettings) None[source]#
Validate compatibility with another SequenceSettings instance.
- Parameters:
other – SequenceSettings instance to validate against.
- Raises:
ValueError – If settings are incompatible.
- validate_features(features: list[str] | str, is_static: bool = False, on_missing: Literal['raise', 'warn', 'ignore'] = 'raise') list[str][source]#
Validates explicit feature names against the current settings.
- Parameters:
features – Feature name(s) to validate.
is_static – Whether to check static or entity features.
on_missing – Strategy when a requested feature does not exist.
- "raise" – raise a KeyError immediately (default).
- "warn" – log a warning and skip.
- "ignore" – silently skip.
- Returns:
List of validated feature names that exist in the configuration.
- Raises:
KeyError – If on_missing="raise" and a feature is not found.
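The three on_missing strategies can be sketched with a standalone function. The name validate_feature_names and its available argument are hypothetical (the real method reads the configured features from the settings object), and rejecting an unknown strategy with ValueError is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def validate_feature_names(requested, available, on_missing="raise"):
    """Return the requested names that exist in `available`, in request order."""
    if isinstance(requested, str):
        requested = [requested]
    if on_missing not in ("raise", "warn", "ignore"):
        # Assumption: unknown strategies are rejected up front.
        raise ValueError(f"unknown on_missing strategy: {on_missing!r}")
    valid = []
    for name in requested:
        if name in available:
            valid.append(name)
        elif on_missing == "raise":
            raise KeyError(name)
        elif on_missing == "warn":
            logger.warning("feature %r not found; skipping", name)
        # "ignore": the missing name is skipped silently.
    return valid


kept = validate_feature_names(["hr", "spo2"], ["hr", "label"], on_missing="ignore")
```

With "warn" and "ignore" the caller always gets back a list, possibly empty, so downstream code needs to handle the case where nothing survived validation.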
tanat.sequence.base.view_mixin module#
SequenceViewMixin: shared view-layer logic for Pool and Sequence.
Both SequencePool and Sequence are scoped views on a SequenceStore. This mixin factors out the logic they share:
- class tanat.sequence.base.view_mixin.SequenceFrameAssembler(view: Sequence | SequencePool)[source]#
Bases: object
Assembles view-schema LazyFrames from the store for a Sequence/SequencePool view.
- __init__(view: Sequence | SequencePool) None[source]#
- id_time_index(*, with_store_index: bool = False) LazyFrame[source]#
Return id + time-index columns with view scopes applied.
- merged_for_extend(other: SequencePool, other_ids_to_add: list) tuple[pl.LazyFrame, pl.LazyFrame | None][source]#
Build merged entity and static LazyFrames for a cross-store extend.
Reads both sides (self._view and other), applies view scopes, projects to self._view’s entity feature set, and concatenates. Returned frames are lazy (no I/O until collected by the builder).
- Parameters:
other – The pool to merge from.
other_ids_to_add – IDs from other to include (duplicates already resolved by the caller).
- Returns:
(merged_entity, merged_static); the second element is None when neither side has static features.
- select(lf: LazyFrame, feature_names: list[str], is_static: bool = False) LazyFrame[source]#
Select store structural columns plus feature_names.
- static(features: list[str] | str | None = None) LazyFrame | None[source]#
Return static data in view schema with scopes and casts applied.
- static_for_store(features: list[str] | str | None = None) LazyFrame | None[source]#
Return static data in store schema after view scopes and casts.
- class tanat.sequence.base.view_mixin.SequenceViewMixin[source]#
Bases: object
Mixin providing the view-layer helpers shared by SequencePool and Sequence.
Note: SequencePool and Sequence inherit from CachableSettings, not this mixin.
- property has_entity_filter_expr: bool[source]#
Whether this view has an expression-based entity filter.
- property metadata: SequenceMetadata[source]#
Returns rich metadata fully reflecting this view’s cast recipes, masks, and feature selection.
When created from a parent SequencePool, the pool’s metadata is returned directly or scoped by the view’s settings (when built with a feature subset).
For a standalone view (no parent), metadata is inferred directly from the assembled, cast and masked LazyFrames. The seq_id dtype is derived from the cast recipe (or the store schema when no cast is active).
Automatically cached via CachableSettings: the cache is invalidated whenever settings change (e.g. after cast_features or drop_features).
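The cache-then-invalidate behaviour can be pictured with a toy class. This is not how CachableSettings is implemented (its internals are not shown here); CachedView, its compute_count counter, and the dictionary returned as metadata are all hypothetical and only illustrate that a settings change forces recomputation.

```python
class CachedView:
    """Toy view whose metadata is cached until its settings change."""

    def __init__(self, features):
        self._features = tuple(features)
        self._cache = {}
        self.compute_count = 0  # Counts real computations, for illustration.

    @property
    def metadata(self):
        if "metadata" not in self._cache:
            self.compute_count += 1
            self._cache["metadata"] = {"features": self._features}
        return self._cache["metadata"]

    def drop_features(self, names):
        # A settings change: update the state, then invalidate the cache.
        self._features = tuple(f for f in self._features if f not in names)
        self._cache.clear()


view = CachedView(["hr", "label"])
first = view.metadata
second = view.metadata  # Served from the cache; nothing is recomputed.
view.drop_features(["label"])
third = view.metadata  # Recomputed because the settings changed.
```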
Module contents#
Package stub.