tanat.trajectory package

Submodules

tanat.trajectory.pool module

TrajectoryPool: aggregation of SequencePool views.

class tanat.trajectory.pool.TrajectoryPool(store: str | Path | TrajectoryStore, *, id_column: str = 'id', static_features: list[str] | None = None, cast_recipe: TrajectoryCastRecipe | dict | None = None)[source]#

Bases: TrajectoryViewMixin, CachableSettings

Aggregates SequencePool views into trajectories.

Accepts a store name, path, or TrajectoryStore instance, following the same convention as SequencePool.

Usage:

store_path = (
    TrajectoryPool.builder()
    .add("medical", medical_pool)
    .add("lab", lab_pool)
    .build("./my_trajectories")
)
pool = TrajectoryPool(store="./my_trajectories")

SETTINGS_CLASS[source]#

alias of TrajectorySettings

__init__(store: str | Path | TrajectoryStore, *, id_column: str = 'id', static_features: list[str] | None = None, cast_recipe: TrajectoryCastRecipe | dict | None = None) None[source]#

Create a trajectory pool backed by store.

Parameters:
  • store – Store path, name, or TrajectoryStore instance.

  • id_column – User-facing name for the trajectory ID column.

  • static_features – Static feature names to expose. None → all available. [] → none.

  • cast_recipe – Optional cast recipe (or dict) applied at read time. Only id and static fields are meaningful at this level. Normalised via TrajectoryCastRecipe.coerce() and probed eagerly.

Raises:

TypeError – If cast_recipe is not a TrajectoryCastRecipe, dict, or None.

add_static_features(df: pd.DataFrame | pl.LazyFrame | pl.DataFrame, *, id_column: str | None = None, overwrite: bool = False) None[source]#

Add static features to the trajectory pool via an ID-keyed join.

The input DataFrame must include the trajectory ID column (either under settings.id_column or under the name given by id_column). A LEFT JOIN against the full trajectory index is performed internally, so partial DataFrames (covering only a subset of trajectory IDs) are valid: absent IDs receive null in the virtual context.

Because alignment is handled by the join rather than by row position, this method works on views with pending changes (cast, virtual features, masks). Only the IDs visible in the view are exposed when reading back with static_data().
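
The ID-keyed alignment described above can be pictured with a small pure-Python sketch. It is not the tanat implementation: `left_join_static` and the row-dict layout are hypothetical stand-ins, used only to show that a partial input leaves absent IDs with a null value and that row order in the input does not matter.

```python
def left_join_static(trajectory_ids, rows, id_column="id"):
    """Align feature rows to the full trajectory index by ID (LEFT JOIN).

    Hypothetical helper: rows are plain dicts, not DataFrames.
    """
    if any(id_column not in row for row in rows):
        raise KeyError(f"ID column {id_column!r} not found in input rows")
    by_id = {row[id_column]: row for row in rows}
    feature_names = sorted({k for row in rows for k in row if k != id_column})
    joined = []
    for traj_id in trajectory_ids:
        row = by_id.get(traj_id, {})  # absent IDs fall back to an empty row
        features = {name: row.get(name) for name in feature_names}
        joined.append({id_column: traj_id, **features})
    return joined


index = [1, 2, 3]
# Partial input, deliberately out of order: ID 2 is missing.
partial = [{"traj_id": 3, "group": "B"}, {"traj_id": 1, "group": "A"}]
joined = left_join_static(index, partial, id_column="traj_id")
```

Because the join key drives the alignment, the result follows the order of the trajectory index, not the order of the input rows.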

Parameters:
  • df – DataFrame containing the ID column plus one or more feature columns. Can be pandas, Polars eager, or Polars lazy.

  • id_column – Name of the ID column in df. Defaults to settings.id_column when None. Pass an explicit name when the join key in df differs from the pool’s public ID name (e.g. id_column="traj_id").

  • overwrite – If True, replaces features that already exist in the virtual context.

Raises:

KeyError – If the resolved ID column is not found in df.

binned_data(features: dict[str, list[str] | str], bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True, bin_col: str = '__bin__') pd.DataFrame | pl.DataFrame[source]#

Project all aliases onto a single shared binned table (long format).

All sub-pools share one global (t_min, t_max, bin_size) axis, derived from the union of their temporal spans. Output columns are prefixed "{alias}_{feature}" to avoid collisions.

For an ML-ready 3-D tensor with feature labels and ID order, see to_tensor().

Parameters:
  • features – Mapping {alias: feature(s)}. Each alias must exist in this pool. str values are auto-promoted to [str].

  • bin_size – Bin width on the shared axis.

  • max_bins – Upper bound on the number of bins. When None, MAX_BINS_LIMIT applies. An explicit value bypasses that cap, so the caller opts in knowingly.

  • fill_value – Applied once, after the cross-join over trajectory IDs. Per-alias fills are not applied.

  • overlap_rule – In-bin aggregation, applied per alias.

  • ohe – One-hot encode per alias. Output names reflect post-OHE columns.

  • fmt – "pandas" or "polars".

  • use_arrow – Arrow-backed pandas conversion.

  • bin_col – Output bin index column name.

Returns:

DataFrame with columns [traj_id, bin_col, "{alias1}_{feat1}", "{alias1}_{feat2}", ..., "{alias2}_{feat1}", ...].
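
As a rough illustration of the shared axis, the sketch below derives one global (t_min, t_max) span from per-alias spans and maps a timestamp to a bin index. The helper names are hypothetical, numeric timesteps are assumed, and the exact edge handling in tanat may differ: here bins are assumed half-open, with the final timestamp clamped into the last bin.

```python
import math


def shared_bin_axis(spans, bin_size):
    """Return (t_min, t_max, n_bins) for the union of per-alias (start, end) spans."""
    t_min = min(start for start, _ in spans.values())
    t_max = max(end for _, end in spans.values())
    n_bins = max(1, math.ceil((t_max - t_min) / bin_size))
    return t_min, t_max, n_bins


def bin_index(t, t_min, bin_size, n_bins):
    """Map timestamp t to a 0-based bin, clamping t_max into the last bin."""
    return min(int((t - t_min) // bin_size), n_bins - 1)


# Hypothetical temporal spans for two aliases, in numeric timesteps.
spans = {"medical": (0, 10), "lab": (4, 25)}
t_min, t_max, n_bins = shared_bin_axis(spans, bin_size=5)
lab_bin = bin_index(12, t_min, 5, n_bins)
```

Both aliases are indexed against the same axis, so a "lab" event at t=12 and a "medical" event at t=12 would land in the same bin.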

classmethod builder() TrajectoryStoreBuilder[source]#

Return a fluent builder for constructing a trajectory store.

cast_id(dtype: DataType) None[source]#

Casts the trajectory ID column to a new type.

The cast is propagated automatically to all linked sequence pools (accessible via sequence_pools) so that entity data and static data surface IDs in the same type at every level.

Parameters:

dtype – Target Polars DataType (e.g. pl.String, pl.UInt32).

Raises:

TypeError – If the cast is incompatible with the stored ID values.

cast_static_features(schema: dict[str, DataType | type]) None[source]#

Casts trajectory-level static-feature columns to new types.

Only static features can be cast at trajectory level: entity features live inside the linked sequence stores and must be cast there.

Parameters:

schema – Dictionary mapping feature names to target Polars DataTypes (e.g. {"group": pl.Categorical}).

Raises:
  • TypeError – If schema is not a dict.

  • KeyError – If a feature name does not exist in the current view.

cast_to_datetime(unit: str = 'us', time_zone: str | None = None) None[source]#

Casts time columns to Datetime across all linked sequence pools.

All sequence stores are guaranteed to share the same temporal schema (enforced at build time), so a single probe against the trajectory store is sufficient - exactly like cast_id(). The cast is stored in the trajectory-level recipe and re-propagated to every pool on next sequence_pools access.

Parameters:
  • unit – Datetime resolution ("ms", "us", "ns"). Default is "us" (microsecond).

  • time_zone – Optional timezone string (e.g. "UTC", "Europe/Paris").

Raises:
  • ValueError – If unit is not one of the accepted values.

  • TypeError – If the cast is incompatible with the temporal data.

cast_to_timestep(dtype: DataType = Int64) None[source]#

Casts time columns to numeric-based timesteps across all linked sequence pools.

All sequence stores are guaranteed to share the same temporal schema (enforced at build time), so a single probe against the trajectory store is sufficient - exactly like cast_id(). The cast is stored in the trajectory-level recipe and re-propagated to every pool on next sequence_pools access.

Parameters:

dtype – Target numeric type (e.g. pl.UInt32, pl.Int64, pl.Float64). Default is pl.Int64.

Raises:

TypeError – If dtype is not a numeric type, or if the temporal data is already in Datetime format.

copy() TrajectoryPool[source]#

Return a shallow copy sharing the same store, with all view state preserved.

The new pool references the same TrajectoryStore and the same virtual context (_virtual_id) so virtual features are immediately visible.

Returns:

A new TrajectoryPool with identical settings, casts, masks and virtual context.

Note

Chaining with save() produces a fully independent pool at a new path without mutating the original instance:

pool2 = TrajectoryPool(store=pool.copy().save("other_path"))

Use this when you need both the original and a snapshot at a new destination. pool.save("other_path") alone would redirect pool itself to "other_path".
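
The difference between the two call patterns can be seen with a toy stand-in. `ToyPool` is hypothetical and models only the documented redirect-on-save behaviour; it performs no I/O and is not the tanat class.

```python
class ToyPool:
    """Toy model: save() redirects the instance, copy() shares the store."""

    def __init__(self, store):
        self.store = store

    def copy(self):
        # Shallow copy: the new object points at the same store.
        return ToyPool(self.store)

    def save(self, destination):
        # Mirrors the documented behaviour: self now points at destination.
        self.store = destination
        return destination


pool = ToyPool("original_path")
pool2 = ToyPool(store=pool.copy().save("other_path"))

redirected = ToyPool("original_path")
redirected.save("other_path")
```

After this runs, `pool` still points at "original_path" because only the throwaway copy was redirected, whereas `redirected` now points at "other_path".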

See also

save()

describe(by_id: bool = True, add_to_static: bool = False, separator: str = '_', fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) pd.DataFrame | pl.DataFrame[source]#

Compute summary statistics across all sequences and all trajectories.

Parameters:
  • by_id – If True (default), return one row per trajectory. If False, return summary statistics aggregated across trajectories, in the style of pandas .describe().

  • add_to_static – If True, persist the per-ID result via add_static_features(). Ignored (with a warning) when by_id=False.

  • separator – Separator between alias and metric name (default _).

  • fmt – "pandas" (default) or "polars".

  • use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

DataFrame with columns [id, n_sequences, {alias}{sep}length, …].

Examples:

traj_pool.describe()
traj_pool.describe(separator=".")
traj_pool.describe(by_id=False)
traj_pool.describe(add_to_static=True)

drop_sequence_pools(*aliases: str) None[source]#

Hides one or more store aliases from this view.

The underlying TrajectoryStore is not modified. Only the pool’s visible aliases (and derived properties like trajectory_index and unique_ids) are affected.

Parameters:

aliases – One or more alias names to hide.

Raises:
  • RuntimeError – If the pool is not built yet.

  • KeyError – If an alias does not exist in the store.

drop_static_features(features: list[str] | str, *, permanently: bool = False) None[source]#

Removes static features from the view (and optionally from disk).

By default this is a soft drop: features are removed from the settings so they no longer appear in static_data(), but the underlying data is left untouched.

With permanently=True the columns are also deleted from disk / virtual context (irreversible).

Parameters:
  • features – Feature name(s) to drop.

  • permanently – If True, also remove from disk/virtual.

extend(other: TrajectoryPool | Trajectory, destination: str | Path | None = None, *, on_duplicate: Literal['raise', 'skip'] = 'raise', overwrite: bool = False) TrajectoryPool[source]#

Merge other into this trajectory pool and write the result to disk.

Mirrors the semantics of save().

Same-store fast path - if both trajectory pools share _store.root_path and neither carries virtual content (_virtual_id is None on both sides), no I/O is performed. A new pool backed by the same store with the union of ID masks is returned immediately. If destination is provided the merged pool is materialised via save(); otherwise it is returned as an in-memory view with zero I/O.

Different stores (or virtual content present) - destination is required. For each alias in this pool, extend() is called on the corresponding sub-pools; the results are assembled into a new trajectory store via the builder. Pass destination=self._store.root_path with overwrite=True to rewrite in-place.

Parameters:
  • other – Trajectory pool or single Trajectory to merge.

  • destination – None → in-memory view (same-store fast path only; no I/O); str / Path → materialise the merged data to disk. Required when merging from different stores.

  • on_duplicate

    Behaviour when other contains a trajectory ID already present in this pool:

    • "raise" (default): raise ValueError.

    • "skip": silently ignore duplicates.

  • overwrite – Allows overwriting an existing destination when it already exists on disk.

Returns:

Always a new TrajectoryPool - never self.

Raises:
  • TypeError – If other is not a TrajectoryPool or Trajectory.

  • TypeError – If a sub-pool has an incompatible ID dtype or temporal schema.

  • ValueError – If a sub-pool in other is missing features present in the corresponding sub-pool of self.

  • ValueError – If on_duplicate="raise" and duplicate IDs are found.

  • ValueError – If destination=None and stores differ.

  • FileExistsError – If destination exists and overwrite=False.
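
The on_duplicate rule can be sketched on plain ID lists. `merge_ids` is a hypothetical helper, not part of tanat; it only mirrors the documented "raise" and "skip" behaviours.

```python
def merge_ids(current, incoming, on_duplicate="raise"):
    """Merge two trajectory ID lists, handling IDs already present in current."""
    if on_duplicate not in ("raise", "skip"):
        raise ValueError(f"unknown on_duplicate value: {on_duplicate!r}")
    seen = set(current)
    duplicates = [i for i in incoming if i in seen]
    if duplicates and on_duplicate == "raise":
        raise ValueError(f"duplicate trajectory IDs: {duplicates}")
    merged = list(current)
    for i in incoming:
        if i not in seen:  # "skip": duplicates are silently ignored
            merged.append(i)
            seen.add(i)
    return merged


merged = merge_ids([1, 2], [2, 3], on_duplicate="skip")
```

With the default on_duplicate="raise", the same call would raise ValueError because ID 2 appears on both sides.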

Note

Aliases present in other but absent from self are silently ignored (logged at WARNING). Aliases present in self but absent from other are carried over unchanged.

See also

save(), extend()

filter_entities(criterion: Criterion, *, alias: str, inplace: bool = False, verbose: bool = True) TrajectoryPool[source]#

Return a new TrajectoryPool view with entities filtered by criterion.

Parameters:
  • criterion – A Criterion instance.

  • alias – Sequence alias to apply the criterion on.

  • inplace – If True, modify this pool in place instead of returning a new view.

  • verbose – If True, print a one-line report.

Returns:

A new TrajectoryPool view with the criterion applied, or self if inplace is True.

Raises:
  • TypeError – If criterion is not a Criterion object.

  • CriterionLevelError – If the criterion is incompatible with entity filtering.

get_trajectories(static_features: list[str] | None = None, aliases: list[str] | None = None) dict[str, Trajectory][source]#

All visible Trajectory instances, keyed by ID.

Materialises every trajectory reachable through the current view (respecting _id_mask and _alias_mask). Useful for iteration-heavy workflows where the same trajectory is accessed multiple times.

Parameters:
  • static_features – Static features to expose in each Trajectory. None → use the pool-level setting. [] → no static features.

  • aliases – Sequence-store aliases to expose in each Trajectory. None → use the pool-level alias mask. Must be a subset of the pool’s visible aliases.

Raises:

KeyError – If any alias in aliases is not visible in the current pool view.

property is_dirty: bool[source]#

True if the pool (or any linked sequence pool) has unsaved state.

Trajectory-level: virtual features, ID mask, type casts, soft drops. Sub-pool level: delegates to each pool’s is_dirty.

A dirty pool needs save() with a destination to materialise all pending changes (sub-pool changes cannot be saved in-place).

items() Iterator[tuple][source]#

Yield (id, trajectory) pairs for all visible trajectories.

save(destination: str | Path | None = None, *, overwrite: bool = False, deep: bool = False) Path[source]#

Persists the current pool state to disk.

Trajectory-level (trajectory_index.arrow, static_features.arrow) is always written, with ID and static casts baked in.

Sequence pools - persisted according to their state:

  • Modified pools (virtual features or casts) are always saved to destination/stores/<alias>/, regardless of deep.

  • Unmodified pools: copied when deep=True; referenced by absolute path when deep=False.

For in-place saves the linked sequence stores are never touched (they may be shared with other pool instances).

Without destination the trajectory-level files are rewritten in-place. With destination a copy is created; the original is untouched.

Parameters:
  • destination – Where to save. Can be: - None → in-place, - a workspace store name (no / or \), - or a filesystem Path / path string. Passing a path that resolves to the current store root is equivalent to None (treated as in-place).

  • overwrite – Required when saving in-place (trajectory-level files will be overwritten). Also allows overwriting an existing destination. Each dirty sub-pool is saved in-place at its current location with overwrite=True automatically.

  • deep – If True, all sequence stores (including unmodified ones) are copied to destination/stores/<alias>/. When False (default), only modified stores are materialised there; the rest are kept as absolute links. Ignored when saving in-place.
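
The three accepted forms of destination can be told apart as follows. This is a sketch under the rules stated above, with a hypothetical `classify_destination` helper; how tanat resolves workspace names to directories is not modelled.

```python
from pathlib import Path


def classify_destination(destination, current_root):
    """Classify a save() destination as in-place, a workspace name, or a path."""
    if destination is None:
        return "in-place"
    is_bare_name = (
        isinstance(destination, str)
        and "/" not in destination
        and "\\" not in destination
    )
    if is_bare_name:
        return "workspace-name"
    # A path that resolves to the current store root counts as in-place.
    if Path(destination).resolve() == Path(current_root).resolve():
        return "in-place"
    return "path"


root = "/data/store"
kinds = [
    classify_destination(None, root),
    classify_destination("snapshots", root),
    classify_destination("/data/store", root),
    classify_destination("/data/other", root),
]
```

Only the cases classified as in-place are subject to the overwrite=True requirement for in-place saves.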

Returns:

The Path of the written store - the in-place root when destination is None, otherwise the resolved destination path. Useful for chaining:

pool2 = TrajectoryPool(store=pool.save("my_trajectories"))

Raises:
  • RuntimeError – If saving in-place without overwrite=True.

  • FileExistsError – If destination already exists and overwrite is False.

Note

This method mutates self: after the call, the pool is redirected to destination (its store, masks and virtual context are all reset to the written state). To keep the original instance unchanged while creating an independent copy elsewhere, use copy() first:

pool2 = TrajectoryPool(store=pool.copy().save("other_path"))
# pool is still pointing to its original store

See also

copy()

Warning

Saving in-place with dirty sequence pools will overwrite those stores at their current location (stores/<alias>/ within the trajectory root). If the stores are shared with other pool instances those instances will also reflect the changes.

Note

With deep=True all links are relative - suitable for archiving or transfer. With deep=False, absolute links to unchanged stores are not portable across machines.

property sequence_pools: MappingProxyType[source]#

Visible SequencePool instances, keyed by alias.

Returns a read-only mapping filtered by the current alias mask. Direct item assignment (e.g. tpool.sequence_pools[alias] = ...) raises TypeError - use subset() or drop_sequence_pools() to change the visible pools.
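
The read-only behaviour comes from the standard library's `types.MappingProxyType`. The sketch below uses placeholder strings in place of SequencePool objects and a hypothetical alias mask to show the filtering and the TypeError on assignment.

```python
from types import MappingProxyType

# Placeholder values stand in for SequencePool instances.
pools = {"medical": "medical_pool", "lab": "lab_pool"}
alias_mask = {"medical"}

# Filter by the alias mask, then expose a read-only view.
visible = MappingProxyType(
    {alias: pool for alias, pool in pools.items() if alias in alias_mask}
)

error = None
try:
    visible["lab"] = "other_pool"  # item assignment is not supported
except TypeError as exc:
    error = exc
```

Reads such as `visible["medical"]` and iteration work as with a regular dict.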

set_t0(*, position: int | None = None, direct: datetime | date | int | float | None | dict[Any, datetime | date | int | float | None] = None, feature: str | None = None, query: Expr | None = None, anchor: Literal['start', 'end', 'middle'] | None = None, use_first: bool = True, on: str | None = None) TrajectoryPool[source]#

Configure the T0 strategy for this trajectory pool.

Builds a T0Setter via the registry and delegates to setter.compute_from_trajectory(self, on=on). The setter stores the resulting [id_col, _T0_] DataFrame; per-alias nearest ranks are computed lazily in _get_traj_t0_df().

Parameters:
  • position – Row index (0-based; negative indexing supported).

  • direct – Scalar value or {traj_id: value} dict.

  • feature – Trajectory-level static feature column name.

  • query – Polars boolean expression evaluated on the reference sub-pool’s columns.

  • anchor – "start" / "end" / "middle" for interval/state pools.

  • use_first – For the query strategy only.

  • on – Alias of the sub-pool used to compute T0. Required for position and query strategies. Ignored (with warning) for direct and feature.

Returns:

self for chaining.

Raises:
  • TypeError – If on is missing for position/query.

  • KeyError – If on refers to an alias not visible in this pool.

static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) pd.DataFrame | pl.DataFrame | None[source]#

Return trajectory-level static data for visible trajectories.

Parameters:
  • features – Static feature name(s) to include. None -> all visible static features.

  • fmt – "pandas" (default) or "polars".

  • use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

One-row-per-trajectory DataFrame with columns [id, feature...]. None when no static features are exposed by this pool view.

To restrict to a subset of IDs, use pool.subset(ids).static_data().

subset(ids, *, inplace: bool = False) TrajectoryPool[source]#

Return a view restricted to the given trajectory IDs.

All IDs must be present in the current unique_ids (i.e. they must pass the existing mask, if any). The new view inherits the full pool state (casts, virtual features, alias mask).

Parameters:
  • ids – Trajectory ID(s) to keep. A single value is accepted and treated as a one-element list.

  • inplace – If True, modify this pool in-place rather than returning a new instance.

Returns:

A TrajectoryPool restricted to ids (or self when inplace=True).

Raises:

KeyError – If any ID is not present in unique_ids.
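
The validation rule can be sketched on plain lists. `subset_ids` is a hypothetical helper that mirrors the documented behaviour: a single value is promoted to a one-element list, and any ID outside the currently visible IDs raises KeyError.

```python
def subset_ids(visible_ids, ids):
    """Return the requested IDs after checking them against the visible IDs."""
    # A single scalar (including a string ID) becomes a one-element list.
    if isinstance(ids, (str, bytes)) or not hasattr(ids, "__iter__"):
        ids = [ids]
    ids = list(ids)
    visible = set(visible_ids)
    missing = [i for i in ids if i not in visible]
    if missing:
        raise KeyError(f"IDs not present in unique_ids: {missing}")
    return ids


kept_many = subset_ids([10, 11, 12], [12, 10])
kept_one = subset_ids([10, 11, 12], 11)
```

Since the check runs against the visible IDs, an ID hidden by an earlier subset cannot be brought back by a later one.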

t0_data(fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) pd.DataFrame | pl.DataFrame[source]#

Return the T0 table for all visible trajectories.

Columns: [id_col, _T0_, <alias1>_T0_NEAREST_RANK_, ...]. Each alias gets its own nearest-rank column because the floor lookup depends on the alias-specific temporal index.

Parameters:
  • fmt – "pandas" (default) or "polars".

  • use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

One row per visible trajectory ID.

to_tensor(features: dict[str, list[str] | str], bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, bin_col: str = '__bin__') tuple[ndarray, list, list[str]][source]#

Project all aliases onto a single 3-D tensor with prefixed labels.

The K axis stacks features from every alias, ordered by the iteration order of features then by feature order within each alias. The returned feature_names list mirrors that ordering exactly.

For a long-format dataframe variant (joins, plotting, exploration), see binned_data().
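
To make the (N, M, K) layout concrete, the sketch below builds the K-axis labels and fills a nested-list tensor from long-format rows. It uses plain lists in place of a NumPy array, and the row keys "id" and "bin" as well as both helper names are hypothetical.

```python
def feature_labels(features):
    """K-axis labels: alias order first, then feature order within each alias."""
    labels = []
    for alias, feats in features.items():
        if isinstance(feats, str):  # str values are promoted to [str]
            feats = [feats]
        labels.extend(f"{alias}_{feat}" for feat in feats)
    return labels


def to_nested_tensor(rows, ids, n_bins, names, fill_value=None):
    """Build arr[n][m][k] from long-format rows keyed by 'id' and 'bin'."""
    position = {traj_id: n for n, traj_id in enumerate(ids)}
    arr = [[[fill_value for _ in names] for _ in range(n_bins)] for _ in ids]
    for row in rows:
        n, m = position[row["id"]], row["bin"]
        for k, name in enumerate(names):
            if name in row:
                arr[n][m][k] = row[name]
    return arr


names = feature_labels({"drugs": "dose", "labs": ["hb", "wbc"]})
rows = [
    {"id": "a", "bin": 0, "drugs_dose": 5.0},
    {"id": "b", "bin": 1, "labs_hb": 13.2, "labs_wbc": 6.1},
]
arr = to_nested_tensor(rows, ids=["a", "b"], n_bins=2, names=names)
```

Cells with no observation keep fill_value, which is why the fill is applied once after all aliases are placed on the shared axis.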

Returns:

A 3-tuple (arr, ids, feature_names) where

  • arr has shape (N, M, K) = (len(unique_ids), n_bins, len(feature_names)).

  • ids is the trajectory ID sequence matching the N-axis order (identical to unique_ids).

  • feature_names lists the K-axis labels in column order, prefixed "{alias}_{feat}".

Examples:

arr, ids, names = tpool.to_tensor(
    {"drugs": "dose", "labs": ["hb", "wbc"]}, "1d"
)
# names == ["drugs_dose", "labs_hb", "labs_wbc"]
# arr.shape == (N, M, 3)

train_test_split(*, test_size: float | int | None = None, train_size: float | int | None = None, random_state: int | None = None, shuffle: bool = True) tuple[TrajectoryPool, TrajectoryPool][source]#

Split the pool into train and test subsets.

Mirrors the interface of sklearn.model_selection.train_test_split().

Parameters:
  • test_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the test subset. Defaults to 0.25 when both test_size and train_size are None.

  • train_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the train subset. Defaults to the complement of test_size.

  • random_state – Seed for the random number generator. Pass an integer for reproducibility.

  • shuffle – Whether to shuffle IDs before splitting. When False, the first IDs go to train and the last to test.

Returns:

(train_pool, test_pool) - two new non-overlapping pool views.

Raises:

ValueError – If the pool is empty, sizes are non-positive, or n_train + n_test exceeds the pool size.
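
The size resolution can be sketched on a plain ID list. The helpers below are hypothetical and assume scikit-learn's rounding convention (ceiling for a fractional test size, floor for a fractional train size); tanat's exact rounding is not stated here.

```python
import math
import random


def resolve_split_sizes(n, test_size=None, train_size=None):
    """Turn fractional or absolute sizes into (n_train, n_test) counts."""
    if n == 0:
        raise ValueError("pool is empty")
    if test_size is None and train_size is None:
        test_size = 0.25

    def to_count(size, rounder):
        if size is None:
            return None
        return int(rounder(size * n)) if isinstance(size, float) else int(size)

    n_test = to_count(test_size, math.ceil)
    n_train = to_count(train_size, math.floor)
    if n_test is None:
        n_test = n - n_train
    if n_train is None:
        n_train = n - n_test
    if n_train <= 0 or n_test <= 0 or n_train + n_test > n:
        raise ValueError(f"invalid split sizes: train={n_train}, test={n_test}")
    return n_train, n_test


def split_ids(ids, *, test_size=None, train_size=None, random_state=None, shuffle=True):
    """Split IDs into two non-overlapping lists (train, test)."""
    ids = list(ids)
    n_train, n_test = resolve_split_sizes(len(ids), test_size, train_size)
    if shuffle:
        random.Random(random_state).shuffle(ids)
    return ids[:n_train], ids[n_train:n_train + n_test]


sizes = resolve_split_sizes(10)
train, test = split_ids(range(10), shuffle=False)
```

Passing the same random_state twice yields the same shuffled order, which is what makes a split reproducible.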

property unique_ids: list[source]#

Visible trajectory IDs as a plain Python list.

Respects _id_mask.

Warning

list erases rich Polars dtypes. Prefer _id_lf when the result feeds a Polars join.

which(criterion: Criterion, *, verbose: bool = True) set[source]#

Return the set of IDs in this pool satisfying criterion.

Parameters:
  • criterion – A Criterion instance.

  • verbose – If True, print a one-line report.

Returns:

Set of matching IDs.

Raises:
  • TypeError – If criterion is not a Criterion object.

  • CriterionLevelError – If the criterion is incompatible with Trajectory level.

tanat.trajectory.settings module

Settings for TrajectoryPool views.

Mirrors the SequenceSettings pattern: the Store holds all columns on disk; the view exposes only the features listed here.

class tanat.trajectory.settings.TrajectorySettings(*, id_column: str = '_traj_id', static_features: list[str] = <factory>)[source]#

Bases: object

View-layer settings for a TrajectoryPool.

id_column[source]#

User-facing name of the trajectory-ID column. Defaults to _traj_id, the store-internal name.

Type:

str

static_features[source]#

Feature names visible in static_data(). An empty list (the default until add_static_features is called) means no static features are exposed.

Type:

list[str]

__init__(*args: Any, **kwargs: Any) None[source]#

available_features() list[str][source]#

Returns all visible static feature names (may be empty).

get_column_rename_map() dict[str, str][source]#

Returns the mapping from store-internal column names to user-facing names.

Currently only the trajectory-ID column is renamed: _traj_id → id_column.

id_column: str = '_traj_id'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

classmethod normalize_static_features(v)[source]#

Normalize to a sorted, deduplicated list.

static_features: list[str][source]#

validate_features(features: list[str] | str, *, on_missing: str = 'raise') list[str][source]#

Validates explicit feature names against the current settings.

Parameters:
  • features – Feature name(s) to validate.

  • on_missing – "raise" (default), "warn" or "ignore".

Returns:

List of validated feature names.

Raises:

KeyError – If on_missing="raise" and a feature is missing.
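
The three on_missing modes can be sketched as below. This is a standalone function with a hypothetical `available` argument, and it assumes that "warn" and "ignore" drop unknown names from the result, which the description above does not spell out.

```python
import warnings


def validate_features(features, available, on_missing="raise"):
    """Check feature names against the available ones."""
    if on_missing not in ("raise", "warn", "ignore"):
        raise ValueError(f"unknown on_missing value: {on_missing!r}")
    if isinstance(features, str):
        features = [features]
    missing = [name for name in features if name not in available]
    if missing:
        if on_missing == "raise":
            raise KeyError(f"unknown static features: {missing}")
        if on_missing == "warn":
            warnings.warn(f"ignoring unknown static features: {missing}")
    # Assumption: unknown names are dropped in "warn" and "ignore" modes.
    return [name for name in features if name in available]


available = ["age", "group"]
single = validate_features("age", available)
lenient = validate_features(["age", "height"], available, on_missing="ignore")
```

With the default on_missing="raise", the second call would raise KeyError for "height".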

tanat.trajectory.shortcuts module

Quick-build helper for trajectory pools.

tanat.trajectory.shortcuts.build_trajectories(pools: dict[str, SequencePool], *, static_data: pd.DataFrame | pl.DataFrame | pl.LazyFrame | None = None, id_column: str | None = None, store_name: str | None = None) TrajectoryPool[source]#

Build a TrajectoryPool from a dict of pre-built sequence pools.

Parameters:
  • pools – Mapping of {alias: SequencePool}. Each alias becomes the key used to access the sub-sequence inside a trajectory (e.g. traj["admissions"]).

  • static_data – Optional DataFrame or LazyFrame with per-trajectory static features. When provided, id_column must also be set.

  • id_column – Name of the id column in static_data. Required when static_data is not None. Ignored otherwise.

  • store_name – Name for the on-disk store. When None a unique name is generated automatically (_quick_trajectory_<hex8>).

Returns:

A ready-to-use TrajectoryPool.

Raises:

ValueError – If static_data is provided without id_column, or if id_column is absent from static_data.

Examples:

tpool = build_trajectories(
    pools={"admissions": adm_pool, "procedures": proc_pool},
)
tpool[tpool.unique_ids[0]]["admissions"].temporal_data(fmt="polars")

tanat.trajectory.trajectory module

Single trajectory: sequences sharing the same ID across stores.

class tanat.trajectory.trajectory.Trajectory(id_value, store: str | Path | TrajectoryStore, *, id_column: str = 'id', static_features: list[str] | None = None)[source]#

Bases: TrajectoryViewMixin, CachableSettings

Access every Sequence that shares a given ID across the linked stores.

Usage:

traj["medical"]                      # → Sequence
"medical" in traj                    # → bool
for alias in traj: ...               # iterates aliases
for alias, seq in traj.items(): ...  # iterates (alias, Sequence) pairs

SETTINGS_CLASS[source]#

alias of TrajectorySettings

__init__(id_value, store: str | Path | TrajectoryStore, *, id_column: str = 'id', static_features: list[str] | None = None) None[source]#

Create a trajectory view for id_value.

Parameters:
  • id_value – Trajectory identifier.

  • store – Store path, name, or TrajectoryStore instance.

  • id_column – User-facing name for the trajectory ID column.

  • static_features – Static feature names to expose. None → all available. [] → none.

describe(separator: str = '_', fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) pd.DataFrame | pl.DataFrame[source]#

Compute summary statistics for this single trajectory.

Calls seq.describe() for each visible sequence and prefixes metric columns with {alias}{separator}. The result is a single-row DataFrame.
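
The prefixing scheme can be sketched with plain dicts standing in for the per-sequence describe() results. `describe_trajectory` is a hypothetical helper, and n_sequences is assumed to count the visible sequences.

```python
def describe_trajectory(per_alias_metrics, separator="_"):
    """Flatten {alias: {metric: value}} into one row with prefixed columns."""
    row = {"n_sequences": len(per_alias_metrics)}
    for alias, metrics in per_alias_metrics.items():
        for name, value in metrics.items():
            row[f"{alias}{separator}{name}"] = value
    return row


# Hypothetical per-sequence metrics for two aliases.
metrics = {"medical": {"length": 4}, "lab": {"length": 7}}
row = describe_trajectory(metrics, separator=".")
```

Choosing a separator that cannot occur in alias names keeps the prefixed column names unambiguous when they are split again later.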

Parameters:
  • separator – Separator between alias and metric name (default _).

  • fmt – "pandas" (default) or "polars".

  • use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

Single-row DataFrame with columns [n_sequences, {alias}{sep}length, …].

Examples:

traj = traj_pool[42]
traj.describe()
traj.describe(separator=".", fmt="polars")

classmethod from_parent(id_value, store: TrajectoryStore, settings: TrajectorySettings, *, parent_pool: TrajectoryPool, alias_mask: set[str] | None = None) Trajectory[source]#

Create a pool-managed trajectory. Not part of the public API.

Bypasses store resolution, feature resolution, and cast probe: all already performed by the pool. Pool context (casts, virtual ID, sequence pools, metadata) is read lazily from parent_pool via the corresponding properties.

Parameters:
  • id_value – Trajectory identifier.

  • store – Already-resolved TrajectoryStore.

  • settings – Fully-resolved TrajectorySettings.

  • parent_pool – The owning TrajectoryPool.

  • alias_mask – Override the pool-level alias mask (used by TrajectoryPool.get_trajectories() with explicit aliases).

Returns:

A new Trajectory instance bound to parent_pool.

property id_value[source]#

The trajectory identifier, in the cast type if a cast is active.

items() Iterator[tuple[str, Sequence]][source]#

Yield (alias, sequence) pairs - mirrors dict.items().

keys() Iterator[str][source]#

Yield visible store aliases - mirrors dict.keys().

match(criterion: Criterion) bool[source]#

Return True if this trajectory satisfies criterion.

Parameters:

criterion – A Criterion instance.

Raises:
  • TypeError – If criterion is not a Criterion object.

  • CriterionLevelError – If the criterion is incompatible with trajectories.

property sequences: dict[source]#

All visible Sequence instances for this trajectory, keyed by store alias.

Cached per trajectory state: built once and reused across calls. Invalidated automatically when underlying settings change.

static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) pd.DataFrame | pl.DataFrame | None[source]#

Return trajectory-level static data for this trajectory only.

Parameters:
  • features – Static feature name(s) to include. None -> all visible static features.

  • fmt – "pandas" (default) or "polars".

  • use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

One-row DataFrame with [id, feature...] or None when no static features are available in the current view.

property t0: datetime | date | int | float | None[source]#

T0 value for this trajectory.

None when T0 could not be determined (e.g. no matching row).

property t0_nearest_rank: dict[str, int | None][source]#

Per-alias nearest rank at or before T0.

Returns a dict keyed by visible alias name, e.g. {"medical": 2, "lab": 5}. Value is None when T0 is None or when no row satisfies start <= T0 in that alias. An empty dict is returned for standalone (non-pool) trajectories.
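
The "at or before T0" lookup is a floor search over sorted start times, which the standard `bisect` module expresses directly. The sketch assumes 0-based ranks and numeric timesteps; `nearest_rank` and the sample data are hypothetical.

```python
from bisect import bisect_right


def nearest_rank(starts, t0):
    """0-based index of the last row with start <= t0, or None if there is none.

    starts must be sorted in ascending order.
    """
    if t0 is None:
        return None
    pos = bisect_right(starts, t0)
    return pos - 1 if pos > 0 else None


# Hypothetical per-alias start times, each sorted ascending.
starts_by_alias = {"medical": [1, 4, 9], "lab": [6, 7]}
ranks = {alias: nearest_rank(s, 5) for alias, s in starts_by_alias.items()}
```

Each alias needs its own rank because the same T0 falls at a different position in each alias-specific temporal index.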

values() Iterator[Sequence][source]#

Yield Sequence objects for each visible alias - mirrors dict.values().

tanat.trajectory.view_mixin module

TrajectoryViewMixin: shared view-layer logic for TrajectoryPool and Trajectory.

Both TrajectoryPool and Trajectory are scoped views on a TrajectoryStore.

class tanat.trajectory.view_mixin.TrajectoryFrameAssembler(view: Trajectory | TrajectoryPool)[source]#

Bases: object

Assembles view-schema LazyFrames from the store for one trajectory view.

__init__(view: Trajectory | TrajectoryPool) None[source]#

ids() LazyFrame[source]#

Return visible IDs with the view ID dtype and schema.

select(lf: LazyFrame, feature_names: list[str]) LazyFrame[source]#

Select the trajectory ID column plus feature_names.

static(features: list[str] | str | None = None) LazyFrame | None[source]#

Return static data in view schema with scopes and casts applied.

static_for_store(features: list[str] | str | None = None) LazyFrame | None[source]#

Return static data in store schema after view scopes and casts.

class tanat.trajectory.view_mixin.TrajectoryViewMixin[source]#

Bases: object

Mixin providing view-layer helpers shared by TrajectoryPool and Trajectory.

apply(exprs: Expr | list[Expr], *, lazy: bool = False, to_pandas: bool = False) pl.LazyFrame | pl.DataFrame | pd.DataFrame[source]#

Evaluates Polars expressions against trajectory-level static features.

This is a read-only computation: the result is returned, not stored. Use add_static_features() to persist the result.

Each expression must produce a named column (.alias()).

Parameters:
  • exprs – One or more Polars expressions producing new columns.

  • lazy – If True, returns a pl.LazyFrame (no collect).

  • to_pandas – If True, returns a pandas.DataFrame.

Returns:

The computed columns as a DataFrame (or LazyFrame).

Raises:

ValueError – If no static features are available.

Examples:

result = pool.apply(
    (pl.col("score") * pl.col("weight")).alias("weighted_score"),
)
pool.add_static_features(result)

property metadata: TrajectoryMetadata[source]#

Returns trajectory-level metadata, fully reflecting this view’s cast recipes, masks, and feature selection.

When created from a parent TrajectoryPool, the pool’s metadata is returned directly or scoped by the view’s settings (when built with a feature subset).

For a standalone view, the traj_id dtype is derived from the cast recipe (or the store schema when no cast is active) - no full plan traversal required.

Automatically cached via CachableSettings: the cache is invalidated whenever settings change.

Module contents

Trajectory module.

class tanat.trajectory.Trajectory(id_value, store: str | Path | TrajectoryStore, *, id_column: str = 'id', static_features: list[str] | None = None)[source]#

Bases: TrajectoryViewMixin, CachableSettings

Access every Sequence that shares a given ID across the linked stores.

Usage:

traj["medical"]         → Sequence
"medical" in traj       → bool
for alias in traj: ...  → iterates aliases
for alias, seq in traj.items(): ...
SETTINGS_CLASS[source]#

alias of TrajectorySettings

__init__(id_value, store: str | Path | TrajectoryStore, *, id_column: str = 'id', static_features: list[str] | None = None) None[source]#

Create a trajectory view for id_value.

Parameters:
  • id_value – Trajectory identifier.

  • store – Store path, name, or TrajectoryStore instance.

  • id_column – User-facing name for the trajectory ID column.

  • static_features – Static feature names to expose. None → all available. [] → none.

describe(separator: str = '_', fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#

Compute summary statistics for this single trajectory.

Calls seq.describe() for each visible sequence and prefixes metric columns with {alias}{separator}. The result is a single-row DataFrame.

Parameters:
  • separator – Separator between alias and metric name (default _).

  • fmt – "pandas" (default) or "polars".

  • use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

Single-row DataFrame with columns [n_sequences, {alias}{sep}length, …].

Examples:

traj = traj_pool[42]
traj.describe()
traj.describe(separator=".", fmt="polars")
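
The column naming described above can be sketched in plain Python. This is only an illustration of the {alias}{separator}{metric} convention, not tanat's implementation: flatten_describe is a hypothetical helper, the metric values are made up, and n_sequences is assumed to count the visible aliases.

```python
def flatten_describe(per_alias_metrics, separator="_"):
    """Flatten {alias: {metric: value}} into one row of prefixed columns.

    Hypothetical helper: assumes n_sequences is the number of visible aliases.
    """
    row = {"n_sequences": len(per_alias_metrics)}
    for alias, metrics in per_alias_metrics.items():
        for name, value in metrics.items():
            # Prefix every metric with its alias, e.g. "medical.length".
            row[f"{alias}{separator}{name}"] = value
    return row


# Toy per-alias metrics (illustrative values only).
row = flatten_describe(
    {"medical": {"length": 12}, "lab": {"length": 30}},
    separator=".",
)
```

With separator=".", the row carries the keys n_sequences, medical.length and lab.length.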
classmethod from_parent(id_value, store: TrajectoryStore, settings: TrajectorySettings, *, parent_pool: TrajectoryPool, alias_mask: set[str] | None = None) Trajectory[source]#

Create a pool-managed trajectory. Not part of the public API.

Bypasses store resolution, feature resolution, and cast probe: all already performed by the pool. Pool context (casts, virtual ID, sequence pools, metadata) is read lazily from parent_pool via the corresponding properties.

Parameters:
Returns:

A new Trajectory instance bound to parent_pool.

property id_value[source]#

The trajectory identifier, in the cast type if a cast is active.

items() Iterator[tuple[str, Sequence]][source]#

Yield (alias, sequence) pairs - mirrors dict.items().

keys() Iterator[str][source]#

Yield visible store aliases - mirrors dict.keys().

match(criterion: Criterion) bool[source]#

Return True if this trajectory satisfies criterion.

Parameters:

criterion – A Criterion instance.

Raises:
  • TypeError – If criterion is not a Criterion object.

  • CriterionLevelError – If the criterion is incompatible with trajectories.

property sequences: dict[source]#

All visible Sequence instances for this trajectory, keyed by store alias.

Cached per trajectory state: built once and reused across calls. Invalidated automatically when underlying settings change.

static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame | None[source]#

Return trajectory-level static data for this trajectory only.

Parameters:
  • features – Static feature name(s) to include. None -> all visible static features.

  • fmt – "pandas" (default) or "polars".

  • use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

One-row DataFrame with [id, feature...] or None when no static features are available in the current view.

property t0: datetime | date | int | float | None[source]#

T0 value for this trajectory.

None when T0 could not be determined (e.g. no matching row).

property t0_nearest_rank: dict[str, int | None][source]#

Per-alias nearest rank at or before T0.

Returns a dict keyed by visible alias name, e.g. {"medical": 2, "lab": 5}. Value is None when T0 is None or when no row satisfies start <= T0 in that alias. An empty dict is returned for standalone (non-pool) trajectories.
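
The "nearest rank at or before T0" lookup can be sketched with the standard library. This is a minimal illustration rather than tanat's implementation: nearest_rank_at_or_before is a hypothetical helper, start values are assumed to be sorted per alias, and ranks are assumed to be 0-based positions.

```python
from bisect import bisect_right


def nearest_rank_at_or_before(starts, t0):
    """Return the 0-based rank of the last start <= t0, or None.

    Hypothetical helper: assumes `starts` is sorted ascending and that
    ranks are 0-based positions (an assumption, not confirmed by the docs).
    """
    if t0 is None:
        return None
    # bisect_right counts how many start values are <= t0.
    n_at_or_before = bisect_right(starts, t0)
    return n_at_or_before - 1 if n_at_or_before > 0 else None


# Toy start times per alias (illustrative values only).
starts_by_alias = {"medical": [1, 4, 9, 12], "lab": [10, 11]}
ranks = {
    alias: nearest_rank_at_or_before(starts, 9)
    for alias, starts in starts_by_alias.items()
}
```

Here "lab" maps to None because no row in that alias satisfies start <= T0.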

values() Iterator[Sequence][source]#

Yield Sequence objects for each visible alias - mirrors dict.values().
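
The dict-like protocol of keys(), values() and items() can be pictured with a small read-only mapping. The sketch below is not the Trajectory class itself: AliasView is a hypothetical name, and the strings stand in for Sequence objects. It only shows how hidden aliases drop out of every accessor at once.

```python
from collections.abc import Mapping


class AliasView(Mapping):
    """Hypothetical read-only mapping exposing only visible aliases."""

    def __init__(self, sequences, alias_mask=None):
        self._sequences = dict(sequences)
        # None means "no mask": every alias is visible.
        self._visible = (
            set(self._sequences) if alias_mask is None else set(alias_mask)
        )

    def __getitem__(self, alias):
        if alias not in self._visible or alias not in self._sequences:
            raise KeyError(alias)
        return self._sequences[alias]

    def __iter__(self):
        # Iterate aliases in insertion order, skipping hidden ones.
        return (a for a in self._sequences if a in self._visible)

    def __len__(self):
        return sum(1 for _ in self)


# Placeholder strings stand in for Sequence instances.
view = AliasView({"medical": "seq-A", "lab": "seq-B"}, alias_mask={"lab"})
```

Because the class derives from Mapping, keys(), values(), items() and the in operator all follow from the three methods defined here.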

class tanat.trajectory.TrajectoryPool(store: str | Path | TrajectoryStore, *, id_column: str = 'id', static_features: list[str] | None = None, cast_recipe: TrajectoryCastRecipe | dict | None = None)[source]#

Bases: TrajectoryViewMixin, CachableSettings

Aggregates SequencePool views into trajectories.

Accepts a store name, path, or TrajectoryStore instance, following the same convention as SequencePool.

Usage:

store_path = (
    TrajectoryPool.builder()
    .add("medical", medical_pool)
    .add("lab", lab_pool)
    .build("./my_trajectories")
)
pool = TrajectoryPool(store="./my_trajectories")
SETTINGS_CLASS[source]#

alias of TrajectorySettings

__init__(store: str | Path | TrajectoryStore, *, id_column: str = 'id', static_features: list[str] | None = None, cast_recipe: TrajectoryCastRecipe | dict | None = None) None[source]#

Create a trajectory pool backed by store.

Parameters:
  • store – Store path, name, or TrajectoryStore instance.

  • id_column – User-facing name for the trajectory ID column.

  • static_features – Static feature names to expose. None → all available. [] → none.

  • cast_recipe – Optional cast recipe (or dict) applied at read time. Only id and static fields are meaningful at this level. Normalised via TrajectoryCastRecipe.coerce() and probed eagerly.

Raises:

TypeError – If cast_recipe is not a TrajectoryCastRecipe, dict, or None.

add_static_features(df: DataFrame | LazyFrame | DataFrame, *, id_column: str | None = None, overwrite: bool = False) None[source]#

Add static features to the trajectory pool via an ID-keyed join.

The input DataFrame must include the trajectory ID column (either under settings.id_column or under the name given by id_column). A LEFT JOIN against the full trajectory index is performed internally, so partial DataFrames (covering only a subset of trajectory IDs) are valid: absent IDs receive null in the virtual context.

Because alignment is handled by the join rather than by row position, this method works on views with pending changes (cast, virtual features, masks). Only the IDs visible in the view are exposed when reading back with static_data().

Parameters:
  • df – DataFrame containing the ID column plus one or more feature columns. Can be pandas, Polars eager, or Polars lazy.

  • id_column – Name of the ID column in df. Defaults to settings.id_column when None. Pass an explicit name when the join key in df differs from the pool’s public ID name (e.g. id_column="traj_id").

  • overwrite – If True, replaces features that already exist in the virtual context.

Raises:

KeyError – If the resolved ID column is not found in df.
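
The ID-keyed LEFT JOIN semantics can be illustrated without any dataframe library. The following is a minimal sketch under stated assumptions, not the library's code: left_join_static is a hypothetical helper, rows are plain dicts, and None plays the role of null.

```python
def left_join_static(trajectory_ids, rows, id_column="id"):
    """Align feature rows to the full trajectory index by ID.

    Hypothetical helper: `rows` is a list of dicts that each carry the ID
    under `id_column`. IDs missing from `rows` get None for every feature.
    """
    for row in rows:
        if id_column not in row:
            # Mirrors the documented KeyError for a missing ID column.
            raise KeyError(id_column)
    feature_names = sorted({k for row in rows for k in row if k != id_column})
    by_id = {row[id_column]: row for row in rows}
    return [
        {id_column: tid, **{f: by_id.get(tid, {}).get(f) for f in feature_names}}
        for tid in trajectory_ids
    ]


index = [1, 2, 3]                                        # full trajectory index
partial = [{"id": 1, "age": 54}, {"id": 3, "age": 61}]   # covers a subset only
joined = left_join_static(index, partial)
```

Row order follows the trajectory index, not the input, which is why alignment does not depend on row position.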

binned_data(features: dict[str, list[str] | str], bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True, bin_col: str = '__bin__') DataFrame | DataFrame[source]#

Project all aliases onto a single shared binned table (long format).

All sub-pools share one global (t_min, t_max, bin_size) axis, derived from the union of their temporal spans. Output columns are prefixed "{alias}_{feature}" to avoid collisions.

For an ML-ready 3-D tensor with feature labels and ID order, see to_tensor().

Parameters:
  • features – Mapping {alias: feature(s)}. Each alias must exist in this pool. str values are auto-promoted to [str].

  • bin_size – Bin width on the shared axis.

  • max_bins – Maximum number of bins. When None, the bin count is capped by MAX_BINS_LIMIT. An explicit value bypasses the cap: the caller opts in knowingly.

  • fill_value – Applied once, after the cross-join over trajectory IDs. Per-alias fills are not applied.

  • overlap_rule – In-bin aggregation, applied per alias.

  • ohe – One-hot encode per alias. Output names reflect post-OHE columns.

  • fmt – "pandas" or "polars".

  • use_arrow – Arrow-backed pandas conversion.

  • bin_col – Output bin index column name.

Returns:

DataFrame with columns [traj_id, bin_col, "{alias1}_{feat1}", "{alias1}_{feat2}", ..., "{alias2}_{feat1}", ...].
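
The shared axis can be sketched for the numeric-timestep case. This is an illustration only: shared_bin_axis and bin_index are hypothetical helpers, the spans are made up, and left-closed bins counted from t_min are an assumed convention rather than documented behaviour.

```python
import math


def shared_bin_axis(spans, bin_size):
    """Return (t_min, t_max, n_bins) for the union of per-alias spans.

    Hypothetical helper; `spans` maps alias -> (start, end) on a numeric
    time axis.
    """
    t_min = min(start for start, _ in spans.values())
    t_max = max(end for _, end in spans.values())
    n_bins = max(1, math.ceil((t_max - t_min) / bin_size))
    return t_min, t_max, n_bins


def bin_index(t, t_min, bin_size):
    """0-based index of the left-closed bin containing t (assumed convention)."""
    return int((t - t_min) // bin_size)


# Toy temporal spans per alias (illustrative values only).
spans = {"medical": (0, 10), "lab": (4, 25)}
t_min, t_max, n_bins = shared_bin_axis(spans, bin_size=5)
```

Every alias is binned against the same t_min, so a given bin index refers to the same time window in all "{alias}_{feature}" columns.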

classmethod builder() TrajectoryStoreBuilder[source]#

Return a fluent builder for constructing a trajectory store.

cast_id(dtype: DataType) None[source]#

Casts the trajectory ID column to a new type.

The cast is propagated automatically to all linked sequence pools (accessible via sequence_pools) so that entity data and static data surface IDs in the same type at every level.

Parameters:

dtype – Target Polars DataType (e.g. pl.String, pl.UInt32).

Raises:

TypeError – If the cast is incompatible with the stored ID values.

cast_static_features(schema: dict[str, DataType | type]) None[source]#

Casts trajectory-level static-feature columns to new types.

Only static features can be cast at trajectory level: entity features live inside the linked sequence stores and must be cast there.

Parameters:

schema – Dictionary mapping feature names to target Polars DataTypes (e.g. {"group": pl.Categorical}).

Raises:
  • TypeError – If schema is not a dict.

  • KeyError – If a feature name does not exist in the current view.

cast_to_datetime(unit: str = 'us', time_zone: str | None = None) None[source]#

Casts time columns to Datetime across all linked sequence pools.

All sequence stores are guaranteed to share the same temporal schema (enforced at build time), so a single probe against the trajectory store is sufficient - exactly like cast_id(). The cast is stored in the trajectory-level recipe and re-propagated to every pool on next sequence_pools access.

Parameters:
  • unit – Datetime resolution ("ms", "us", "ns"). Default is "us" (microsecond).

  • time_zone – Optional timezone string (e.g. "UTC", "Europe/Paris").

Raises:
  • ValueError – If unit is not one of the accepted values.

  • TypeError – If the cast is incompatible with the temporal data.

cast_to_timestep(dtype: DataType = Int64) None[source]#

Casts time columns to numeric-based timesteps across all linked sequence pools.

All sequence stores are guaranteed to share the same temporal schema (enforced at build time), so a single probe against the trajectory store is sufficient - exactly like cast_id(). The cast is stored in the trajectory-level recipe and re-propagated to every pool on next sequence_pools access.

Parameters:

dtype – Target numeric type (e.g. pl.UInt32, pl.Int64, pl.Float64). Default is pl.Int64.

Raises:

TypeError – If dtype is not a numeric type, or if the temporal data is already in Datetime format.

copy() TrajectoryPool[source]#

Return a shallow copy sharing the same store, with all view state preserved.

The new pool references the same TrajectoryStore and the same virtual context (_virtual_id) so virtual features are immediately visible.

Returns:

A new TrajectoryPool with identical settings, casts, masks and virtual context.

Note

Chaining with save() produces a fully independent pool at a new path without mutating the original instance:

pool2 = TrajectoryPool(store=pool.copy().save("other_path"))

Use this when you need both the original and a snapshot at a new destination. pool.save("other_path") alone would redirect pool itself to "other_path".

See also

save()

describe(by_id: bool = True, add_to_static: bool = False, separator: str = '_', fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#

Compute summary statistics across all sequences and all trajectories.

Parameters:
  • by_id – If True (default), return one row per trajectory. If False, return cross-trajectory summary statistics computed with pandas .describe().

  • add_to_static – If True, persist the per-ID result via add_static_features(). Ignored (with a warning) when by_id=False.

  • separator – Separator between alias and metric name (default _).

  • fmt – "pandas" (default) or "polars".

  • use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

DataFrame with columns [id, n_sequences, {alias}{sep}length, …].

Examples:

traj_pool.describe()
traj_pool.describe(separator=".")
traj_pool.describe(by_id=False)
traj_pool.describe(add_to_static=True)
drop_sequence_pools(*aliases: str) None[source]#

Hides one or more store aliases from this view.

The underlying TrajectoryStore is not modified. Only the pool’s visible aliases (and derived properties like trajectory_index and unique_ids) are affected.

Parameters:

aliases – One or more alias names to hide.

Raises:
  • RuntimeError – If the pool is not built yet.

  • KeyError – If an alias does not exist in the store.

drop_static_features(features: list[str] | str, *, permanently: bool = False) None[source]#

Removes static features from the view (and optionally from disk).

By default this is a soft drop: features are removed from the settings so they no longer appear in static_data(), but the underlying data is left untouched.

With permanently=True the columns are also deleted from disk / virtual context (irreversible).

Parameters:
  • features – Feature name(s) to drop.

  • permanently – If True, also remove from disk/virtual.

extend(other: TrajectoryPool | Trajectory, destination: str | Path | None = None, *, on_duplicate: Literal['raise', 'skip'] = 'raise', overwrite: bool = False) TrajectoryPool[source]#

Merge other into this trajectory pool and write the result to disk.

Mirrors the semantics of save().

Same-store fast path - if both trajectory pools share _store.root_path and neither carries virtual content (_virtual_id is None on both sides), no I/O is performed. A new pool backed by the same store with the union of ID masks is returned immediately. If destination is provided the merged pool is materialised via save(); otherwise it is returned as an in-memory view with zero I/O.

Different stores (or virtual content present) - destination is required. For each alias in this pool, extend() is called on the corresponding sub-pools; the results are assembled into a new trajectory store via the builder. Pass destination=self._store.root_path with overwrite=True to rewrite in-place.

Parameters:
  • other – Trajectory pool or single Trajectory to merge.

  • destination – None → in-memory view (same-store fast path only; no I/O); str / Path → materialise the merged data to disk. destination is required when merging from different stores.

  • on_duplicate

    Behaviour when other contains a trajectory ID already present in this pool:

    • "raise" (default): raise ValueError.

    • "skip": silently ignore duplicates.

  • overwrite – Allows overwriting an existing destination when it already exists on disk.

Returns:

Always a new TrajectoryPool - never self.

Raises:
  • TypeError – If other is not a TrajectoryPool or Trajectory.

  • TypeError – If a sub-pool has an incompatible ID dtype or temporal schema.

  • ValueError – If a sub-pool in other is missing features present in the corresponding sub-pool of self.

  • ValueError – If on_duplicate="raise" and duplicate IDs are found.

  • ValueError – If destination=None and stores differ.

  • FileExistsError – If destination exists and overwrite=False.

Note

Aliases present in other but absent from self are silently ignored (logged at WARNING). Aliases present in self but absent from other are carried over unchanged.

See also

save(), extend()
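
The on_duplicate policy can be sketched at the level of trajectory IDs. This is a simplified illustration, not the actual merge: merge_ids is a hypothetical helper, and a real merge also has to reconcile schemas, features and stores.

```python
def merge_ids(own_ids, other_ids, on_duplicate="raise"):
    """Merge two ID lists following the documented duplicate policy.

    Hypothetical helper operating on IDs only.
    """
    if on_duplicate not in ("raise", "skip"):
        raise ValueError(f"unknown on_duplicate: {on_duplicate!r}")
    own = list(own_ids)
    seen = set(own)
    duplicates = [i for i in other_ids if i in seen]
    if duplicates and on_duplicate == "raise":
        raise ValueError(f"duplicate trajectory IDs: {duplicates}")
    # "skip": keep this pool's version and ignore the incoming duplicates.
    return own + [i for i in other_ids if i not in seen]


merged = merge_ids([1, 2, 3], [3, 4], on_duplicate="skip")
```

With the default on_duplicate="raise", the same call would raise ValueError because ID 3 appears on both sides.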

filter_entities(criterion: Criterion, *, alias: str, inplace: bool = False, verbose: bool = True) TrajectoryPool[source]#

Return a new TrajectoryPool view with entities filtered by criterion.

Parameters:
  • criterion – A Criterion instance.

  • alias – Sequence alias to apply the criterion on.

  • inplace – If True, modify this pool in place instead of returning a new view.

  • verbose – If True, print a one-line report.

Returns:

A new TrajectoryPool view with the criterion applied, or self if inplace is True.

Raises:
  • TypeError – If criterion is not a Criterion object.

  • CriterionLevelError – If the criterion is incompatible with entity filtering.

get_trajectories(static_features: list[str] | None = None, aliases: list[str] | None = None) dict[str, Trajectory][source]#

All visible Trajectory instances, keyed by ID.

Materialises every trajectory reachable through the current view (respecting _id_mask and _alias_mask). Useful for iteration-heavy workflows where the same trajectory is accessed multiple times.

Parameters:
  • static_features – Static features to expose in each Trajectory. None → use the pool-level setting. [] → no static features.

  • aliases – Sequence-store aliases to expose in each Trajectory. None → use the pool-level alias mask. Must be a subset of the pool’s visible aliases.

Raises:

KeyError – If any alias in aliases is not visible in the current pool view.

property is_dirty: bool[source]#

True if the pool (or any linked sequence pool) has unsaved state.

Trajectory-level: virtual features, ID mask, type casts, soft drops. Sub-pool level: delegates to each pool’s is_dirty.

A dirty pool needs save() with a destination to materialise all pending changes (sub-pool changes cannot be saved in-place).

items() Iterator[tuple][source]#

Yield (id, trajectory) pairs for all visible trajectories.

save(destination: str | Path | None = None, *, overwrite: bool = False, deep: bool = False) Path[source]#

Persists the current pool state to disk.

Trajectory-level (trajectory_index.arrow, static_features.arrow) is always written, with ID and static casts baked in.

Sequence pools - persisted according to their state:

  • Modified pools (virtual features or casts) are always saved to destination/stores/<alias>/, regardless of deep.

  • Unmodified pools: copied when deep=True; referenced by absolute path when deep=False.

For in-place saves the linked sequence stores are never touched (they may be shared with other pool instances).

Without destination the trajectory-level files are rewritten in-place. With destination a copy is created; the original is untouched.

Parameters:
  • destination

    Where to save. Can be:

    • None → in-place,

    • a workspace store name (no / or \),

    • a filesystem Path / path string.

    Passing a path that resolves to the current store root is equivalent to None (treated as in-place).

  • overwrite – Required when saving in-place (trajectory-level files will be overwritten). Also allows overwriting an existing destination. Each dirty sub-pool is saved in-place at its current location with overwrite=True automatically.

  • deep – If True, all sequence stores (including unmodified ones) are copied to destination/stores/<alias>/. When False (default), only modified stores are materialised there; the rest are kept as absolute links. Ignored when saving in-place.

Returns:

The Path of the written store - the in-place root when destination is None, otherwise the resolved destination path. Useful for chaining:

pool2 = TrajectoryPool(store=pool.save("my_trajectories"))

Raises:
  • RuntimeError – If saving in-place without overwrite=True.

  • FileExistsError – If destination already exists and overwrite is False.

Note

This method mutates self: after the call, the pool is redirected to destination (its store, masks and virtual context are all reset to the written state). To keep the original instance unchanged while creating an independent copy elsewhere, use copy() first:

pool2 = TrajectoryPool(store=pool.copy().save("other_path"))
# pool is still pointing to its original store

See also

copy()

Warning

Saving in-place with dirty sequence pools will overwrite those stores at their current location (stores/<alias>/ within the trajectory root). If the stores are shared with other pool instances those instances will also reflect the changes.

Note

With deep=True all links are relative - suitable for archiving or transfer. With deep=False, absolute links to unchanged stores are not portable across machines.

property sequence_pools: MappingProxyType[source]#

Visible SequencePool instances, keyed by alias.

Returns a read-only mapping filtered by the current alias mask. Direct item assignment (e.g. tpool.sequence_pools[alias] = ) raises TypeError - use subset() or drop_sequence_pools() to change the visible pools.
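
The read-only behaviour comes from the standard library's MappingProxyType and can be reproduced on its own. In the sketch below the pool values are placeholder strings and alias_mask is an illustrative variable, not a tanat attribute.

```python
from types import MappingProxyType

# Placeholder strings stand in for SequencePool instances.
pools = {"medical": "medical-pool", "lab": "lab-pool"}
alias_mask = {"lab"}  # illustrative mask: only "lab" is visible

# Filter by the mask, then wrap the result in a read-only proxy.
visible = MappingProxyType(
    {alias: pool for alias, pool in pools.items() if alias in alias_mask}
)

try:
    visible["medical"] = "other-pool"
except TypeError as exc:
    # mappingproxy objects do not support item assignment.
    assignment_error = exc
else:
    assignment_error = None
```

Reads such as visible["lab"] and iteration work as on a normal dict; only mutation is rejected.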

set_t0(*, position: int | None = None, direct: datetime | date | int | float | None | dict[Any, datetime | date | int | float | None] = None, feature: str | None = None, query: Expr | None = None, anchor: Literal['start', 'end', 'middle'] | None = None, use_first: bool = True, on: str | None = None) TrajectoryPool[source]#

Configure the T0 strategy for this trajectory pool.

Builds a T0Setter via the registry and delegates to setter.compute_from_trajectory(self, on=on). The setter stores the resulting [id_col, _T0_] DataFrame; per-alias nearest ranks are computed lazily in _get_traj_t0_df().

Parameters:
  • position – Row index (0-based; negative indexing supported).

  • direct – Scalar value or {traj_id: value} dict.

  • feature – Trajectory-level static feature column name.

  • query – Polars boolean expression evaluated on the reference sub-pool’s columns.

  • anchor – "start" / "end" / "middle" for interval/state pools.

  • use_first – For the query strategy only.

  • on – Alias of the sub-pool used to compute T0. Required for position and query strategies. Ignored (with warning) for direct and feature.

Returns:

self for chaining.

Raises:
  • TypeError – If on is missing for position/query.

  • KeyError – If on refers to an alias not visible in this pool.

static_data(features: list[str] | str | None = None, fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame | None[source]#

Return trajectory-level static data for visible trajectories.

Parameters:
  • features – Static feature name(s) to include. None -> all visible static features.

  • fmt – "pandas" (default) or "polars".

  • use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

One-row-per-trajectory DataFrame with columns [id, feature...]. None when no static features are exposed by this pool view.

To restrict to a subset of IDs, use pool.subset(ids).static_data().

subset(ids, *, inplace: bool = False) TrajectoryPool[source]#

Return a view restricted to the given trajectory IDs.

All IDs must be present in the current unique_ids (i.e. they must pass the existing mask, if any). The new view inherits the full pool state (casts, virtual features, alias mask).

Parameters:
  • ids – Trajectory ID(s) to keep. A single value is accepted and treated as a one-element list.

  • inplace – If True, modify this pool in-place rather than returning a new instance.

Returns:

A TrajectoryPool restricted to ids (or self when inplace=True).

Raises:

KeyError – If any ID is not present in unique_ids.

t0_data(fmt: Literal['pandas', 'polars'] = 'pandas', use_arrow: bool = True) DataFrame | DataFrame[source]#

Return the T0 table for all visible trajectories.

Columns: [id_col, _T0_, <alias1>_T0_NEAREST_RANK_, ...]. Each alias gets its own nearest-rank column because the floor lookup depends on the alias-specific temporal index.

Parameters:
  • fmt – "pandas" (default) or "polars".

  • use_arrow – Use Arrow extension arrays for polars -> pandas conversion.

Returns:

One row per visible trajectory ID.

to_tensor(features: dict[str, list[str] | str], bin_size: str | int | float, max_bins: int | None = None, fill_value: Any = None, overlap_rule: str = 'first', ohe: bool = False, bin_col: str = '__bin__') tuple[ndarray, list, list[str]][source]#

Project all aliases onto a single 3-D tensor with prefixed labels.

The K axis stacks features from every alias, ordered by the iteration order of features then by feature order within each alias. The returned feature_names list mirrors that ordering exactly.

For a long-format dataframe variant (joins, plotting, exploration), see binned_data().

Parameters:
Returns:

A 3-tuple (arr, ids, feature_names) where

  • arr has shape (N, M, K) = (len(unique_ids), n_bins, len(feature_names)).

  • ids is the trajectory ID sequence matching the N-axis order (identical to unique_ids).

  • feature_names lists the K-axis labels in column order, prefixed "{alias}_{feat}".

Examples:

arr, ids, names = tpool.to_tensor(
    {"drugs": "dose", "labs": ["hb", "wbc"]}, "1d"
)
# names == ["drugs_dose", "labs_hb", "labs_wbc"]
# arr.shape == (N, M, 3)
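
The K-axis label ordering can be reproduced with a few lines of plain Python. This sketch covers only the non-OHE case and is not the library's code; tensor_feature_names is a hypothetical helper.

```python
def tensor_feature_names(features):
    """Build "{alias}_{feat}" labels in the order of the features mapping.

    Hypothetical helper: a str value is promoted to a one-element list, and
    one-hot encoding (which changes the column names) is not modelled.
    """
    names = []
    for alias, feats in features.items():
        if isinstance(feats, str):
            feats = [feats]
        names.extend(f"{alias}_{feat}" for feat in feats)
    return names


names = tensor_feature_names({"drugs": "dose", "labs": ["hb", "wbc"]})
```

The result follows the iteration order of the mapping first, then the feature order within each alias.
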
train_test_split(*, test_size: float | int | None = None, train_size: float | int | None = None, random_state: int | None = None, shuffle: bool = True) tuple[TrajectoryPool, TrajectoryPool][source]#

Split the pool into train and test subsets.

Mirrors the interface of sklearn.model_selection.train_test_split().

Parameters:
  • test_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the test subset. Defaults to 0.25 when both test_size and train_size are None.

  • train_size – Proportion (float in (0, 1)) or absolute count (int) of samples for the train subset. Defaults to the complement of test_size.

  • random_state – Seed for the random number generator. Pass an integer for reproducibility.

  • shuffle – Whether to shuffle IDs before splitting. When False, the first IDs go to train and the last to test.

Returns:

(train_pool, test_pool) - two new non-overlapping pool views.

Raises:

ValueError – If the pool is empty, sizes are non-positive, or n_train + n_test exceeds the pool size.
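
The size resolution can be sketched on a plain list of IDs. This is an assumption-laden illustration, not tanat's implementation: split_ids is a hypothetical helper, and the rounding (ceil for test, floor for train) is borrowed from scikit-learn rather than confirmed for this library.

```python
import math
import random


def split_ids(ids, test_size=None, train_size=None, random_state=None, shuffle=True):
    """Split IDs into (train, test) lists with sklearn-style size handling."""
    ids = list(ids)
    n = len(ids)
    if n == 0:
        raise ValueError("cannot split an empty pool")
    if test_size is None and train_size is None:
        test_size = 0.25  # documented default

    def resolve(size, rounder):
        # Floats are proportions of n; ints are absolute counts.
        if size is None:
            return None
        count = rounder(size * n) if isinstance(size, float) else size
        if count <= 0:
            raise ValueError("sizes must be positive")
        return count

    n_test = resolve(test_size, math.ceil)
    n_train = resolve(train_size, math.floor)
    if n_test is None:
        n_test = n - n_train
    if n_train is None:
        n_train = n - n_test
    if n_train <= 0 or n_test <= 0 or n_train + n_test > n:
        raise ValueError("invalid train/test sizes")
    if shuffle:
        random.Random(random_state).shuffle(ids)
    # Without shuffling, the first IDs go to train and the last to test.
    return ids[:n_train], ids[n_train:n_train + n_test]


train_ids, test_ids = split_ids(range(8), test_size=0.25, shuffle=False)
```

The two returned lists never overlap, and passing the same random_state reproduces the same split.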

property unique_ids: list[source]#

Visible trajectory IDs as a plain Python list.

Respects _id_mask.

Warning

list erases rich Polars dtypes. Prefer _id_lf when the result feeds a Polars join.

which(criterion: Criterion, *, verbose: bool = True) set[source]#

Return the set of IDs in this pool satisfying criterion.

Parameters:
  • criterion – A Criterion instance.

  • verbose – If True, print a one-line report.

Returns:

Set of matching IDs.

Raises:
  • TypeError – If criterion is not a Criterion object.

  • CriterionLevelError – If the criterion is incompatible with Trajectory level.

class tanat.trajectory.TrajectorySettings(*, id_column: str = '_traj_id', static_features: list[str] = <factory>)[source]#

Bases: object

View-layer settings for a TrajectoryPool.

id_column[source]#

Name of the trajectory-ID column (always _traj_id).

Type:

str

static_features[source]#

Feature names visible in static_data(). None means no static features exposed (the default until add_static_features is called).

Type:

list[str]

__init__(*args: Any, **kwargs: Any) None[source]#
available_features() list[str][source]#

Returns all visible static feature names (may be empty).

get_column_rename_map() dict[str, str][source]#

Returns the mapping from store-internal column names to user-facing names.

Currently only the trajectory-ID column is renamed: _traj_id → id_column.

id_column: str = '_traj_id'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

classmethod normalize_static_features(v)[source]#

Normalize to a sorted, deduplicated list.

static_features: list[str][source]#
validate_features(features: list[str] | str, *, on_missing: str = 'raise') list[str][source]#

Validates explicit feature names against the current settings.

Parameters:
  • features – Feature name(s) to validate.

  • on_missing – "raise" (default), "warn" or "ignore".

Returns:

List of validated feature names.

Raises:

KeyError – If on_missing="raise" and a feature is missing.
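
The three on_missing modes can be sketched as follows. This is a hedged illustration, not the method's source: the standalone function and its available argument are hypothetical, and dropping unknown names under "warn" and "ignore" is an assumption about the intended behaviour.

```python
import warnings


def validate_features(features, available, on_missing="raise"):
    """Return the requested features that exist in `available`.

    Hypothetical standalone version: unknown names raise KeyError, emit a
    warning, or are dropped silently depending on `on_missing`.
    """
    if isinstance(features, str):
        features = [features]
    missing = [f for f in features if f not in available]
    if missing:
        if on_missing == "raise":
            raise KeyError(f"unknown static features: {missing}")
        if on_missing == "warn":
            warnings.warn(f"ignoring unknown static features: {missing}")
    return [f for f in features if f in available]


validated = validate_features(
    ["age", "height"], available=["age", "sex"], on_missing="ignore"
)
```

A single string is accepted and treated as a one-element list, matching the list[str] | str signature.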

tanat.trajectory.build_trajectories(pools: dict[str, SequencePool], *, static_data: pd.DataFrame | pl.DataFrame | pl.LazyFrame | None = None, id_column: str | None = None, store_name: str | None = None) TrajectoryPool[source]#

Build a TrajectoryPool from a dict of pre-built sequence pools.

Parameters:
  • pools – Mapping of {alias: SequencePool}. Each alias becomes the key used to access the sub-sequence inside a trajectory (e.g. traj["admissions"]).

  • static_data – Optional DataFrame or LazyFrame with per-trajectory static features. When provided, id_column must also be set.

  • id_column – Name of the id column in static_data. Required when static_data is not None. Ignored otherwise.

  • store_name – Name for the on-disk store. When None a unique name is generated automatically (_quick_trajectory_<hex8>).

Returns:

A ready-to-use TrajectoryPool.

Raises:

ValueError – If static_data is provided without id_column, or if id_column is absent from static_data.

Examples:

tpool = build_trajectories(
    pools={"admissions": adm_pool, "procedures": proc_pool},
)
tpool[tpool.unique_ids[0]]["admissions"].temporal_data(fmt="polars")