Zeroing & Alignment

Zeroing aligns sequences to a common reference date (T0 / index date), transforming absolute timestamps into relative ones. This is essential when comparing sequences across individuals who were observed at different calendar times, for example aligning patients to their first hospitalisation or users to their registration date.
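As a rough illustration of what zeroing means conceptually (this is not the library's implementation), the plain-Python sketch below subtracts a per-individual T0 from absolute timestamps. The `to_relative` helper and the sample data are hypothetical and exist only for this example.

```python
from datetime import datetime, timedelta


def to_relative(timestamps, t0):
    """Return each timestamp as an offset from t0 (None when t0 is missing)."""
    if t0 is None:
        return [None] * len(timestamps)
    return [ts - t0 for ts in timestamps]


# Two hypothetical patients observed in different calendar years.
events = {
    "pat_01": [datetime(2020, 3, 15), datetime(2020, 3, 20), datetime(2020, 4, 14)],
    "pat_02": [datetime(2021, 6, 1), datetime(2021, 6, 6), datetime(2021, 7, 1)],
}

# Here T0 is each patient's first event, comparable to a position-0 strategy.
relative = {pid: to_relative(ts, ts[0]) for pid, ts in events.items()}
```

After the subtraction both patients share the same relative timeline, even though their calendar dates are more than a year apart.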

Once set_t0 is called on a pool, every Sequence object automatically exposes seq.t0 and seq.t0_nearest_rank.

Strategies

Four strategies are available. Pass exactly one keyword to set_t0.

| Strategy | Keyword | Description |
| --- | --- | --- |
| `position` | `set_t0(position=N)` | T0 = temporal value at row index `N` (0-based; negative indices are supported: `-1` is the last row). For interval and state pools, the `anchor` parameter selects the reference point within the period (`"start"`, `"end"`, or `"middle"`). |
| `direct` | `set_t0(direct=value)` | T0 = the same scalar timestamp for every sequence. Alternatively, pass a `dict[id, timestamp]` to assign a different T0 per individual; IDs absent from the dict receive `_T0_ = null`. |
| `feature` | `set_t0(feature="col")` | T0 = the value of a static feature column. The feature dtype must exactly match the pool’s temporal dtype; use `cast_features` to align if needed. IDs with a `null` static value receive `_T0_ = null`. |
| `query` | `set_t0(query=expr)` | T0 = temporal value of the first (or last, with `use_first=False`) entity row where the Polars expression evaluates to `True`. The `anchor` parameter determines the reference point within the matched period. Sequences with no matching row receive `_T0_ = null`. |

The anchor parameter ("start" | "end" | "middle") applies only to the position and query strategies, and only for IntervalSequencePool and StateSequencePool; it is ignored otherwise.
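To make the interplay between `position` and `anchor` concrete, here is a minimal stand-alone sketch. It assumes intervals are stored as `(start, end)` tuples sorted by time; `anchor_time` and `t0_from_position` are hypothetical helpers, not functions of the library.

```python
from datetime import datetime


def anchor_time(start, end, anchor="start"):
    """Pick the reference point inside a single interval."""
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "middle":
        return start + (end - start) / 2
    raise ValueError(f"unknown anchor: {anchor!r}")


def t0_from_position(intervals, position, anchor="start"):
    """Return the anchored time of the interval at `position`.

    Negative positions count from the end; an out-of-range position
    yields None, mirroring the null T0 described in this page.
    """
    try:
        start, end = intervals[position]
    except IndexError:
        return None
    return anchor_time(start, end, anchor)


# Hypothetical hospital stays for one patient.
stays = [
    (datetime(2020, 1, 1), datetime(2020, 1, 5)),
    (datetime(2020, 2, 10), datetime(2020, 2, 20)),
]

first_start = t0_from_position(stays, 0, "start")
last_end = t0_from_position(stays, -1, "end")
first_middle = t0_from_position(stays, 0, "middle")
too_far = t0_from_position(stays, 5)
```

For an event pool there is only one timestamp per row, which is why the anchor has nothing to choose between in that case.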

Usage

import polars as pl
from datetime import datetime

# position: first row, start of interval
pool.set_t0(position=0, anchor="start")

# position: last row, end of interval
pool.set_t0(position=-1, anchor="end")

# direct: same T0 for all sequences
pool.set_t0(direct=datetime(2000, 1, 1))

# direct: per-id mapping
pool.set_t0(direct={
    "pat_01": datetime(2020, 3, 15),
    "pat_02": datetime(2021, 6, 1),
})

# feature: read T0 from a static column
pool.cast_features({"registration_date": pl.Datetime("us")}, is_static=True)
pool.set_t0(feature="registration_date")

# query: first row where status == "error"
pool.set_t0(query=pl.col("status") == "error", anchor="start", use_first=True)

# query: last row where value > 0.9
pool.set_t0(query=pl.col("value") > 0.9, anchor="end", use_first=False)
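The first/last semantics of the `query` strategy can be pictured with a small plain-Python sketch. It replaces the Polars expression with an ordinary predicate function and works on event-like rows with a single `time` field, so no anchor is involved; `t0_from_query` and the sample rows are hypothetical.

```python
from datetime import datetime


def t0_from_query(rows, predicate, use_first=True):
    """Return the time of the first (or last) matching row, or None if none match."""
    matches = [row["time"] for row in rows if predicate(row)]
    if not matches:
        return None
    return matches[0] if use_first else matches[-1]


# Hypothetical rows of one sequence, already sorted by time.
rows = [
    {"time": datetime(2022, 1, 1), "status": "ok"},
    {"time": datetime(2022, 1, 3), "status": "error"},
    {"time": datetime(2022, 1, 7), "status": "ok"},
    {"time": datetime(2022, 1, 9), "status": "error"},
]


def is_error(row):
    return row["status"] == "error"


first_error = t0_from_query(rows, is_error)
last_error = t0_from_query(rows, is_error, use_first=False)
no_match = t0_from_query(rows, lambda row: row["status"] == "timeout")
```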

Pool-Level Inspection

pool.t0_data() returns the full T0 table as a DataFrame with columns [id, _T0_, _T0_NEAREST_RANK_].

The _T0_NEAREST_RANK_ column holds the 0-based index of the entity whose start time is the latest one at or before T0 (a floor lookup). It is always computed from the start column, regardless of the anchor passed to set_t0.
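A floor lookup of this kind can be sketched with the standard `bisect` module. The `nearest_rank` helper is hypothetical, and returning `None` when T0 precedes every start time is an assumption made for this sketch; the text above does not say how the library handles that case.

```python
from bisect import bisect_right
from datetime import datetime


def nearest_rank(starts, t0):
    """0-based index of the last start time <= t0 in a sorted list."""
    if t0 is None or not starts:
        return None
    idx = bisect_right(starts, t0) - 1
    # Assumption: no entity starts at or before t0 -> no rank.
    return idx if idx >= 0 else None


# Hypothetical start times of one sequence, sorted ascending.
starts = [datetime(2020, 1, 1), datetime(2020, 1, 10), datetime(2020, 2, 1)]

between = nearest_rank(starts, datetime(2020, 1, 15))
exact = nearest_rank(starts, datetime(2020, 1, 10))
before_all = nearest_rank(starts, datetime(2019, 12, 31))
```

`bisect_right` counts the start times that are less than or equal to T0, so subtracting one gives the index of the closest start on or before T0.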

pool.set_t0(position=0, anchor="start")

# t0_data() returns a pandas DataFrame by default
pool.t0_data().head()

Sequence-Level Properties

Once set_t0 has been called on the pool, every sequence object exposes two read-only properties:

| Property | Type | Description |
| --- | --- | --- |
| `seq.t0` | scalar or `None` | T0 value for this sequence. `None` when T0 could not be determined (sequence too short, no query match, `null` static feature value…). |
| `seq.t0_nearest_rank` | `int` or `None` | 0-based index of the entity at or just before T0. `None` when `seq.t0` is `None`. |

seq = pool[pool.unique_ids[0]]
print(seq.t0)               # e.g. datetime(2020, 3, 15, ...)
print(seq.t0_nearest_rank)  # e.g. 2

Trajectory-Level Zeroing

TrajectoryPool.set_t0 accepts the same four strategy keywords as the sequence-level set_t0, plus an on= parameter that selects the sub-pool from which T0 is computed.

The `on=` parameter

Strategies that inspect temporal rows (position, query) require on= because the row index or filter expression is evaluated against a specific sub-pool. Strategies that do not read rows (direct, feature) do not need on=; if it is provided, it is ignored and a warning is emitted.

# position: first admission, start of interval
tpool.set_t0(position=0, anchor="start", on="admissions")

# direct: no on= needed
tpool.set_t0(direct=datetime(2010, 6, 1))

# feature: trajectory-level static column
tpool.set_t0(feature="admission_date")

# query: first lab matching a condition
tpool.set_t0(query=pl.col("status") == "error", on="labs")

Trajectory-Level Inspection

tpool.t0_data() returns one row per trajectory with columns [id, _T0_, <alias1>_T0_NEAREST_RANK_, <alias2>_T0_NEAREST_RANK_, ...].

Each sub-pool gets its own nearest-rank column because the floor-index depends on each pool’s temporal grid. The column is named <alias>_T0_NEAREST_RANK_ (alias prefix, then the constant suffix).

tpool.set_t0(position=0, anchor="start", on="admissions")
tpool.t0_data().head()  # columns: id, _T0_, admissions_T0_NEAREST_RANK_, labs_T0_NEAREST_RANK_, ...
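Why one T0 produces a different nearest rank per alias can be seen in a small stand-alone sketch. It assumes each sub-pool is reduced to a sorted list of start times; `floor_index` and the sample grids are hypothetical and do not use the library.

```python
from bisect import bisect_right
from datetime import datetime


def floor_index(starts, t0):
    """Index of the last start time <= t0, or None if there is none."""
    idx = bisect_right(starts, t0) - 1
    return idx if idx >= 0 else None


# Hypothetical start times of two sub-pools for one trajectory.
grids = {
    "admissions": [datetime(2020, 3, 15), datetime(2020, 9, 1)],
    "labs": [
        datetime(2020, 3, 1),
        datetime(2020, 3, 10),
        datetime(2020, 3, 14),
        datetime(2020, 3, 20),
    ],
}

# T0 taken from the first admission, comparable to position=0 on "admissions".
t0 = grids["admissions"][0]

# Same T0, one floor index per alias.
ranks = {alias: floor_index(starts, t0) for alias, starts in grids.items()}
```

The first admission is row 0 of its own grid, while three lab results already precede it, so the lab floor index is 2.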

Trajectory Properties

Once set_t0 has been called on the trajectory pool, every Trajectory exposes:

| Property | Type | Description |
| --- | --- | --- |
| `traj.t0` | scalar or `None` | T0 for this trajectory. `None` when no T0 could be determined. |
| `traj.t0_nearest_rank` | `dict[str, int \| None]` | Per-alias floor index: `{"admissions": 0, "labs": 2, ...}`. `None` per alias when `traj.t0` is `None`. |

traj = tpool[tpool.unique_ids[0]]
print(traj.t0)               # e.g. datetime(2020, 3, 15, ...)
print(traj.t0_nearest_rank)  # e.g. {'admissions': 0, 'labs': 2, 'phases': 1}

T0 is shared across all children

A single tpool.set_t0(...) call is enough. Every object you retrieve from the pool (a sub-pool, a trajectory, or an individual sequence) automatically returns the same t0 value. t0_nearest_rank still varies: each pool computes its floor index on its own temporal grid.

tpool.set_t0(position=0, anchor="start", on="admissions")

traj = tpool[tpool.unique_ids[0]]
print(traj.t0)                      # e.g. datetime(2020, 3, 15, ...)

seq = traj["labs"]
print(seq.t0)                       # == traj.t0
print(seq.t0_nearest_rank)          # floor index on the labs temporal grid

tpool.sequence_pools["labs"].t0_data().head()   # _T0_ column == traj.t0

Null Handling

A sequence receives _T0_ = null in any of these situations:

position - the index is out of range for that sequence.
direct (dict) - the sequence ID is not a key in the dict.
feature - the static feature value is null for that ID.
query - no entity row matches the expression (or the sequence is empty).

Sequences with null T0 are not dropped from the pool. seq.t0 returns None and seq.t0_nearest_rank returns None for those individuals.

To inspect how many sequences are affected:

# t0_data() returns a pandas DataFrame by default, so Series.isnull() applies
null_count = pool.t0_data()["_T0_"].isnull().sum()
print(f"{null_count}/{len(pool)} sequences with T0 = null")
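The `direct` case of the null rules can be reproduced without the library. In the sketch below, `resolve_direct` is a hypothetical helper that accepts either a scalar or a per-ID dict, and IDs missing from the dict end up with `None`, which stands in for `_T0_ = null`.

```python
from datetime import datetime


def resolve_direct(ids, direct):
    """Map every ID to its T0: a shared scalar, or a dict lookup (None if absent)."""
    if isinstance(direct, dict):
        return {seq_id: direct.get(seq_id) for seq_id in ids}
    return {seq_id: direct for seq_id in ids}


ids = ["pat_01", "pat_02", "pat_03"]

# pat_03 is not a key of the mapping, so it receives no T0.
t0_by_id = resolve_direct(
    ids,
    {"pat_01": datetime(2020, 3, 15), "pat_02": datetime(2021, 6, 1)},
)

# Every ID is kept; only the T0 value is missing.
null_count = sum(value is None for value in t0_by_id.values())
summary = f"{null_count}/{len(t0_by_id)} sequences with T0 = null"
```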