Data Manipulation#
Reference for the main operations available on sequence pools, trajectory pools, individual sequences, trajectories, and entities.
Iteration#
All containers implement the standard Python iteration protocol.
| Syntax → yields | SP | S | TP | T |
|---|---|---|---|---|
| for x in obj | Sequence | Entity | Trajectory | alias (str) |
| for k, v in obj.items() | ✗ | ✗ | (id, Trajectory) | (alias, Sequence) |
SP: SequencePool · TP: TrajectoryPool · S: Sequence · T: Trajectory
for traj in tpool:  # TP → Trajectory
    print(traj.id_value)
    for alias, seq in traj.items():  # T → (alias, Sequence)
        print(alias, len(seq))

for seq in pool:  # SP → Sequence
    print(seq.id_value, len(seq))
    for entity in seq:  # S → Entity
        print(entity.temporal_extent, entity.data())
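The yields listed in the table correspond to Python's `__iter__` plus a dict-style `items()`. The sketch below uses hypothetical stand-in classes (`ToySequence` and `ToyTrajectory` are not part of the library) to show how a sequence-like and a trajectory-like container could produce those values; it is an illustration of the protocol, not the library's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ToySequence:
    """Hypothetical stand-in for a Sequence: iterating yields its entities."""
    id_value: str
    entities: list = field(default_factory=list)

    def __iter__(self):
        return iter(self.entities)

    def __len__(self):
        return len(self.entities)


@dataclass
class ToyTrajectory:
    """Hypothetical stand-in for a Trajectory: a mapping alias -> sequence."""
    id_value: str
    sequences: dict = field(default_factory=dict)

    def __iter__(self):
        # Plain iteration yields aliases, like iterating a dict.
        return iter(self.sequences)

    def items(self):
        # items() yields (alias, sequence) pairs.
        return self.sequences.items()


visits = ToySequence("id_001", ["admission", "discharge"])
labs = ToySequence("id_001", ["hb", "crp", "hb"])
traj = ToyTrajectory("id_001", {"visits": visits, "labs": labs})

aliases = list(traj)
lengths = {alias: len(seq) for alias, seq in traj.items()}
print(aliases, lengths)
```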
Subset#
Restrict a pool to a subset of IDs without copying data.
view = pool.subset(ids=["id_001", "id_042", "id_099"])
print(len(view)) # 3
The returned object is a view: it shares the underlying store with the pool it was created from. Changes made to the view (casts, feature drops…) are visible only through that view.
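One way to picture a view is as an ID mask over a shared store, plus settings that belong to the view alone. The sketch below assumes that general design; `ToyPool`, its attributes, and `row()` are hypothetical names and do not describe the library's internals.

```python
class ToyPool:
    """Hypothetical pool: a shared store plus a view-local ID mask and settings."""

    def __init__(self, store, ids=None, hidden=None):
        self._store = store                      # shared, never copied
        self._ids = list(store) if ids is None else list(ids)
        self._hidden = set(hidden or ())         # view-local hidden features

    def subset(self, ids):
        # Keep only requested IDs present in this view; the store is shared.
        wanted = set(ids)
        kept = [i for i in self._ids if i in wanted]
        return ToyPool(self._store, kept, self._hidden)

    def drop_features(self, names):
        # Recorded on this view only; the store is untouched.
        self._hidden.update(names)

    def row(self, id_value):
        return {k: v for k, v in self._store[id_value].items()
                if k not in self._hidden}

    def __len__(self):
        return len(self._ids)


store = {
    "id_001": {"age": 31, "flag": True},
    "id_042": {"age": 58, "flag": False},
    "id_099": {"age": 47, "flag": True},
    "id_100": {"age": 65, "flag": False},
}
pool = ToyPool(store)
view = pool.subset(ids=["id_001", "id_042", "id_099"])
view.drop_features(["flag"])

print(len(view), len(pool))
print(view.row("id_001"), pool.row("id_001"))
```

In this toy model the drop is visible through `view` but not through `pool`, while both objects still point at the same store.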
Feature Engineering#
All methods below operate lazily: transformations are applied on the fly at
materialisation time and do not rewrite the store. Call pool.save to persist them.
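Lazy evaluation of this kind can be sketched as a list of recorded steps that only run when data is requested. The `LazyView` class and its `cast`, `drop`, and `materialise` methods below are hypothetical and exist only to illustrate the idea; how the library records and replays transformations is not specified here.

```python
class LazyView:
    """Hypothetical view that records transformations instead of applying them."""

    def __init__(self, rows):
        self._rows = rows        # stands in for the on-disk store
        self._steps = []         # pending transformations, in call order

    def cast(self, column, to_type):
        self._steps.append(lambda row: {**row, column: to_type(row[column])})
        return self

    def drop(self, column):
        self._steps.append(
            lambda row: {k: v for k, v in row.items() if k != column}
        )
        return self

    def materialise(self):
        # Steps run only now, and each step builds a new dict,
        # so the stored rows are never mutated.
        out = []
        for row in self._rows:
            for step in self._steps:
                row = step(row)
            out.append(row)
        return out


stored = [{"id": "a", "age": "31", "tmp": 1}, {"id": "b", "age": "58", "tmp": 2}]
view = LazyView(stored).cast("age", int).drop("tmp")
result = view.materialise()
print(result)
```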
Add and remove columns#
Attach new columns to the view or hide existing ones; the underlying store is never rewritten.
| Method | Scope | Description |
|---|---|---|
| add_entity_features | SP | Append new entity-level columns. |
| add_static_features | SP, TP | Append new static columns joined by ID. Works on filtered views. Pass … |
| drop_features | SP | Hide entity (default) or static features from the view. Pass … |
| drop_static_features | TP | Hide static features from a TrajectoryPool view. Pass … |
SP: SequencePool · TP: TrajectoryPool
Type casting#
All casts are lazy and scoped to the current view. Call pool.save to persist.
| Method | Scope | Description |
|---|---|---|
| cast_features | SP | Re-type entity (default) or static features. |
| cast_static_features | TP | Re-type static features. Entity features must be cast on each linked sequence pool directly. |
| cast_to_datetime | SP, TP | Convert the time index to … |
| cast_to_timestep | SP, TP | Convert the time index to an integer type (e.g. …) |
SP: SequencePool · TP: TrajectoryPool
import polars as pl
# SequencePool: cast entity feature
pool.cast_features({"status": pl.Categorical})
# SequencePool: cast static feature
pool.cast_features({"age": pl.UInt8}, is_static=True)
# TrajectoryPool: cast static feature (different method name!)
tpool.cast_static_features({"group": pl.Categorical})
# Both: convert time index
pool.cast_to_datetime(unit="us", time_zone="UTC")
pool.cast_to_timestep(pl.Int32)
# Drop features (SP: drop_features accepts is_static; TP: use drop_static_features)
pool.drop_features(["flag_valid"], is_static=False)
tpool.drop_static_features(["debug_col"])
Transformation#
All methods in this section return new data (a DataFrame, or an ndarray for to_tensor) and do not modify the pool.
apply: evaluate an expression#
Evaluate a Polars expression against the pool’s temporal or static data.
Available on SP and TP. Pass is_static=True to target static features.
At pool level, by_id=True groups the evaluation per ID, making it ideal
for deriving per-sequence aggregates. A natural follow-up is to pipe the result
directly into add_static_features or add_entity_features:
# Per-sequence mean → attach as a static feature
means = pool.apply(pl.col("value").mean().alias("value_mean"), by_id=True)
pool.add_static_features(means)
# Without by_id: expression runs over the full temporal data
flags = pool.apply(pl.col("value") > 0)
# On static data
result = pool.apply(pl.col("age") > 65, is_static=True)
# Works on TrajectoryPool too
stats = tpool.apply(pl.col("score").max().alias("score_max"), by_id=True)
SP: SequencePool · TP: TrajectoryPool
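The difference between evaluating with and without by_id can be illustrated in plain Python. This is a sketch of the grouping semantics described above, not of the library's Polars-based implementation; the `rows` and `static` structures are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Toy temporal data: (id, value) pairs standing in for entity rows.
rows = [
    ("id_001", 2.0), ("id_001", 4.0),
    ("id_042", -1.0), ("id_042", 5.0), ("id_042", 8.0),
]

# Without grouping: one aggregate over the full temporal data.
overall_mean = mean([value for _, value in rows])

# With per-ID grouping: one aggregate per sequence.
grouped = defaultdict(list)
for id_value, value in rows:
    grouped[id_value].append(value)
value_mean = {id_value: mean(values) for id_value, values in grouped.items()}

# A per-ID result has one row per ID, so it can be joined onto static data.
static = {"id_001": {"age": 31}, "id_042": {"age": 58}}
for id_value, m in value_mean.items():
    static[id_value]["value_mean"] = m

print(overall_mean, value_mean)
```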
to_dummies: one-hot encode#
One-hot encode one or more Categorical features into binary indicator
columns. Pass is_static=True to target static features instead of
entity features.
# Entity features (default)
dummies = pool.to_dummies(["status", "category"])
# Static features
dummies = pool.to_dummies(["site"], is_static=True)
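For readers unfamiliar with one-hot encoding, the plain-Python sketch below shows the transformation on a toy table. The helper `one_hot` is hypothetical, and the `<column>_<category>` naming of the indicator columns is an assumption; the column names actually produced by to_dummies may differ.

```python
def one_hot(rows, column):
    """Return copies of rows with `column` replaced by 0/1 indicator columns."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        out = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            # Assumed naming scheme: "<column>_<category>".
            out[f"{column}_{cat}"] = int(row[column] == cat)
        encoded.append(out)
    return encoded


rows = [
    {"id": "a", "status": "open"},
    {"id": "a", "status": "closed"},
    {"id": "b", "status": "open"},
]
dummies = one_hot(rows, "status")
print(dummies)
```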
binned_data / to_tensor: regular time bins#
Project temporal features onto a regular time grid.
binned_data() returns a long-format DataFrame (pandas or polars). Useful for exploration, joins, and plotting.
to_tensor() returns a dense (N, M, K) ndarray together with IDs and K-axis feature labels. Useful for ML pipelines.
# Long-format dataframe
df = pool.binned_data(features=["value", "score"], bin_size="1d")
# ML-ready tensor with IDs and feature names
arr, ids, feature_names = pool.to_tensor(features=["value", "score"], bin_size="1d")
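The (N, M, K) layout can be pictured as N sequences, M time bins, and K features. The sketch below builds such a structure as nested lists from toy records. The helper `to_dense` is hypothetical, and its rules (half-open bins, averaging within a bin, None for empty bins) are assumptions made for illustration; the library's aggregation and fill behaviour are not documented in this section.

```python
from statistics import mean


def to_dense(records, features, bin_size, n_bins):
    """Bin (id, time, {feature: value}) records into a nested [N][M][K] list.

    Assumptions for this sketch: bin b covers [b*bin_size, (b+1)*bin_size),
    values falling in the same bin are averaged, and empty bins hold None.
    """
    ids = sorted({rec_id for rec_id, _, _ in records})
    buckets = {}
    for rec_id, t, values in records:
        b = int(t // bin_size)
        if 0 <= b < n_bins:
            for name in features:
                if name in values:
                    buckets.setdefault((rec_id, b, name), []).append(values[name])
    tensor = [
        [
            [mean(buckets[(i, b, f)]) if (i, b, f) in buckets else None
             for f in features]
            for b in range(n_bins)
        ]
        for i in ids
    ]
    return tensor, ids, list(features)


records = [
    ("id_001", 0.5, {"value": 1.0, "score": 10.0}),
    ("id_001", 0.9, {"value": 3.0}),
    ("id_001", 2.1, {"value": 5.0, "score": 30.0}),
    ("id_042", 1.0, {"value": 7.0, "score": 70.0}),
]
arr, ids, feature_names = to_dense(
    records, ["value", "score"], bin_size=1.0, n_bins=3
)
shape = (len(arr), len(arr[0]), len(arr[0][0]))
print(shape, ids, feature_names)
```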
Descriptive Statistics#
# One row per sequence (length, temporal span, …)
pool.describe()
# Cross-ID aggregated stats (equivalent to pandas .describe())
pool.describe(by_id=False)
# Attach stats as static features (side-effect)
pool.describe(add_to_static=True)
# Single sequence
seq = pool[pool.unique_ids[0]]
seq.describe()
# TrajectoryPool: one row per trajectory, columns prefixed by alias
tpool.describe()
Persistence#
Transformations are lazy by default. Save a snapshot to make them permanent or to share a modified pool.
# Save under a new name (returns the new store path)
saved_path = pool.save("my_pool_optimised", overwrite=True)
# Copy the pool in-memory (deep copy of settings, same store)
clone = pool.copy()
Composition#
extend#
Merge another pool into the current one. Two execution paths:
| Situation | Behaviour |
|---|---|
| Both pools share the same store | Fast path: union of ID masks, no I/O |
| Different stores | Cross-store: rebuilds a new store on disk; … |
# Same-store fast path (e.g. after train_test_split)
train, test = pool.train_test_split(test_size=0.3)
merged = train.extend(test)
# Cross-store merge
pool_a.extend(pool_b, destination="merged_pool",
              on_duplicate="skip", overwrite=True)
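The two execution paths can be sketched with dictionaries standing in for stores. The function `merge_pools` is hypothetical, and the reading of "skip" as "keep the rows already present and ignore the incoming duplicate" is an assumption, not documented behaviour.

```python
def merge_pools(store_a, ids_a, store_b, ids_b):
    """Toy merge of two (store, id-mask) pairs. Returns (store, ids)."""
    if store_a is store_b:
        # Fast path: same store, so only the ID masks are unioned.
        seen = set(ids_a)
        return store_a, list(ids_a) + [i for i in ids_b if i not in seen]
    # Cross-store path: build a new store. On a duplicate ID, keep the
    # first pool's rows (the "skip" behaviour assumed in this sketch).
    merged = {i: store_a[i] for i in ids_a}
    for i in ids_b:
        merged.setdefault(i, store_b[i])
    return merged, list(merged)


store = {"a": 1, "b": 2, "c": 3}
same_store, same_ids = merge_pools(store, ["a", "b"], store, ["c"])

other = {"c": 30, "d": 40}
new_store, new_ids = merge_pools(store, ["a", "c"], other, ["c", "d"])
print(same_ids, new_store)
```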
train_test_split#
Split a pool by unique IDs. The interface mirrors
sklearn.model_selection.train_test_split.
train, test = pool.train_test_split(test_size=0.2, random_state=42)
# Guarantee: zero ID overlap
assert not set(train.unique_ids) & set(test.unique_ids)
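Splitting on unique IDs rather than on rows is what makes the zero-overlap guarantee possible: every row of a given sequence lands on the same side. A minimal sketch of that idea follows; `split_ids` is a hypothetical helper, and its rounding rule for the test size is an assumption that may not match the library or scikit-learn exactly.

```python
import random


def split_ids(unique_ids, test_size=0.2, random_state=None):
    """Shuffle IDs with a seeded RNG and cut them into (train, test) lists."""
    ids = sorted(unique_ids)                      # deterministic starting order
    random.Random(random_state).shuffle(ids)
    n_test = max(1, round(len(ids) * test_size))  # assumed rounding rule
    return ids[n_test:], ids[:n_test]


ids = [f"id_{k:03d}" for k in range(10)]
train_ids, test_ids = split_ids(ids, test_size=0.2, random_state=42)

# Same seed, same split.
again = split_ids(ids, test_size=0.2, random_state=42)
print(len(train_ids), len(test_ids))
```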
Type Conversion#
Convert a pool between the three sequence types. The conversion is always view-level: the original store is not modified.
| Method | Converts to |
|---|---|
| as_event() | Event |
| … | … |
| … | … |
event_view = interval_pool.as_event() # treat interval start as event time
Temporal Alignment#
See Zeroing & Alignment for the full T0 reference.
# Set a reference date using the position strategy
pool.set_t0(position=0, anchor="start")
# Retrieve T0 values as a DataFrame
pool.t0_data()
# Sequence-level properties (available after set_t0)
seq = pool[pool.unique_ids[0]]
seq.t0 # T0 value for this sequence
seq.t0_nearest_rank # 0-based index of the entity at or just before T0
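The "at or just before T0" rule can be expressed with a binary search over sorted entity times. The helper `nearest_rank` below is hypothetical; returning None when every entity lies after T0, and picking the last entity among ties, are assumptions of this sketch rather than documented behaviour.

```python
from bisect import bisect_right


def nearest_rank(entity_times, t0):
    """0-based index of the last entity whose time is <= t0, or None if none.

    `entity_times` must be sorted in ascending order.
    """
    pos = bisect_right(entity_times, t0)
    return pos - 1 if pos > 0 else None


times = [2, 5, 5, 9]
print(nearest_rank(times, 6))
print(nearest_rank(times, 1))
```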
See Also#
Builder & Storage - How to build and load pools.
Zeroing & Alignment - T0 strategies and temporal alignment.
Metadata - Inspect dtypes, feature info, and cast methods.