Zeroing & Alignment#
Zeroing aligns sequences to a common reference date (T0 / index date), transforming absolute timestamps into relative ones. This is essential when comparing sequences across individuals who were observed at different calendar times, for example aligning patients to their first hospitalisation or users to their registration date.
Once set_t0 is called on a pool, every Sequence
object automatically exposes seq.t0 and seq.t0_nearest_rank.
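To make the idea of relative time concrete, here is a minimal sketch in plain Python. It does not use the library; zero_timestamps and the patient data are hypothetical names and values chosen only for illustration.

```python
from datetime import datetime


def zero_timestamps(timestamps, t0):
    """Return each timestamp as an offset (timedelta) from the reference date t0."""
    return [ts - t0 for ts in timestamps]


# Two patients observed at different calendar times (illustrative data).
patient_a = [datetime(2020, 3, 15), datetime(2020, 3, 20), datetime(2020, 4, 1)]
patient_b = [datetime(2021, 6, 1), datetime(2021, 6, 6), datetime(2021, 6, 18)]

# Use each patient's first event as T0.
rel_a = zero_timestamps(patient_a, t0=patient_a[0])
rel_b = zero_timestamps(patient_b, t0=patient_b[0])

# After zeroing, both sequences sit on the same relative timeline.
print([d.days for d in rel_a])  # [0, 5, 17]
print([d.days for d in rel_b])  # [0, 5, 17]
```

Although the two patients were observed more than a year apart, their zeroed sequences are directly comparable.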
Strategies#
Four strategies are available. Pass exactly one keyword to set_t0.
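| Strategy | Keyword | Description |
|---|---|---|
| Position | position | T0 = temporal value at row index position. |
| Direct | direct | T0 = the same scalar timestamp for every sequence. Alternatively, pass a dict mapping sequence IDs to timestamps. |
| Feature | feature | T0 = the value of a static feature column. The feature dtype must exactly match the pool’s temporal dtype; use cast_features to convert it first. |
| Query | query | T0 = temporal value of the first (or last, with use_first=False) row matching the expression. |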
The anchor parameter ("start" | "end" | "middle") applies only to the
position and query strategies, and only for
IntervalSequencePool and
StateSequencePool; it is ignored
otherwise.
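The position and query strategies, together with anchor, can be sketched in plain Python as follows. This is an illustration of the selection logic only, not the library's implementation: the helpers anchor_time, t0_by_position and t0_by_query are hypothetical, and treating "middle" as the midpoint of the interval is an assumption.

```python
from datetime import datetime

# Hypothetical interval rows for one sequence: (start, end, status).
rows = [
    (datetime(2020, 1, 1), datetime(2020, 1, 5), "ok"),
    (datetime(2020, 1, 10), datetime(2020, 1, 20), "error"),
    (datetime(2020, 2, 1), datetime(2020, 2, 3), "error"),
]


def anchor_time(row, anchor="start"):
    """Pick the timestamp of an interval row according to the anchor."""
    start, end, _ = row
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "middle":
        # Assumption: "middle" means the midpoint of the interval.
        return start + (end - start) / 2
    raise ValueError(f"unknown anchor: {anchor!r}")


def t0_by_position(rows, position, anchor="start"):
    """T0 from the row at a given index; None if the index is out of range."""
    try:
        return anchor_time(rows[position], anchor)
    except IndexError:
        return None


def t0_by_query(rows, predicate, anchor="start", use_first=True):
    """T0 from the first (or last) row matching the predicate; None if no match."""
    matches = [row for row in rows if predicate(row)]
    if not matches:
        return None
    return anchor_time(matches[0] if use_first else matches[-1], anchor)


def is_error(row):
    return row[2] == "error"


print(t0_by_position(rows, 0, anchor="start"))   # 2020-01-01 00:00:00
print(t0_by_position(rows, -1, anchor="end"))    # 2020-02-03 00:00:00
print(t0_by_query(rows, is_error, anchor="middle", use_first=True))  # 2020-01-15 00:00:00
print(t0_by_position(rows, 5))                   # None
```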
Usage#
import polars as pl
from datetime import datetime
# position: first row, start of interval
pool.set_t0(position=0, anchor="start")
# position: last row, end of interval
pool.set_t0(position=-1, anchor="end")
# direct: same T0 for all sequences
pool.set_t0(direct=datetime(2000, 1, 1))
# direct: per-id mapping
pool.set_t0(direct={
    "pat_01": datetime(2020, 3, 15),
    "pat_02": datetime(2021, 6, 1),
})
# feature: read T0 from a static column
pool.cast_features({"registration_date": pl.Datetime("us")}, is_static=True)
pool.set_t0(feature="registration_date")
# query: first row where status == "error"
pool.set_t0(query=pl.col("status") == "error", anchor="start", use_first=True)
# query: last row where value > 0.9
pool.set_t0(query=pl.col("value") > 0.9, anchor="end", use_first=False)
Pool-Level Inspection#
pool.t0_data() returns the full T0 table as a DataFrame with columns
[id, _T0_, _T0_NEAREST_RANK_].
The _T0_NEAREST_RANK_ column holds the 0-based index of the last entity
whose start time is at or before T0 (a floor lookup). It is always computed
from the start column, regardless of the anchor passed to set_t0.
pool.set_t0(position=0, anchor="start")
# pandas (default)
pool.t0_data().head()
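The floor lookup behind the nearest rank can be sketched with the standard bisect module. This is an illustration, not the library's code: nearest_rank is a hypothetical helper, the start times are assumed to be sorted, and returning None when every entity starts after T0 is an assumption.

```python
from bisect import bisect_right
from datetime import datetime


def nearest_rank(starts, t0):
    """0-based index of the last start time <= t0 (starts must be sorted)."""
    if t0 is None:
        return None
    rank = bisect_right(starts, t0) - 1
    # Assumption: no entity at or before T0 yields None rather than -1.
    return rank if rank >= 0 else None


starts = [
    datetime(2020, 1, 1),
    datetime(2020, 2, 1),
    datetime(2020, 3, 1),
    datetime(2020, 4, 1),
]

print(nearest_rank(starts, datetime(2020, 3, 15)))   # 2 (T0 falls after the third start)
print(nearest_rank(starts, datetime(2020, 2, 1)))    # 1 (exact match counts as "at")
print(nearest_rank(starts, datetime(2019, 12, 31)))  # None
```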
Sequence-Level Properties#
Once set_t0 has been called on the pool, every sequence object exposes
two read-only properties:
| Property | Type | Description |
|---|---|---|
| seq.t0 | scalar or None | T0 value for this sequence. |
| seq.t0_nearest_rank | int or None | 0-based index of the entity at or just before T0. |
seq = pool[pool.unique_ids[0]]
print(seq.t0) # e.g. datetime(2020, 3, 15, ...)
print(seq.t0_nearest_rank) # e.g. 2
Trajectory-Level Zeroing#
TrajectoryPool.set_t0 accepts the same four strategy keywords as the
sequence-level set_t0, plus an additional on= parameter that selects
the reference sub-pool from which T0 is computed.
The on= parameter#
Strategies that inspect temporal rows (position, query) require
on= because the row index or filter expression is evaluated against a
specific sub-pool. Strategies that do not read rows (direct, feature)
do not need on=; if provided it is ignored with a warning.
# position: first admission, start of interval
tpool.set_t0(position=0, anchor="start", on="admissions")
# direct: no on= needed
tpool.set_t0(direct=datetime(2010, 6, 1))
# feature: trajectory-level static column
tpool.set_t0(feature="admission_date")
# query: first lab matching a condition
tpool.set_t0(query=pl.col("status") == "error", on="labs")
Trajectory-Level Inspection#
tpool.t0_data() returns one row per trajectory with columns
[id, _T0_, <alias1>_T0_NEAREST_RANK_, <alias2>_T0_NEAREST_RANK_, ...].
Each sub-pool gets its own nearest-rank column because the floor-index
depends on each pool’s temporal grid. The column is named
<alias>_T0_NEAREST_RANK_ (alias prefix, then the constant suffix).
tpool.set_t0(position=0, anchor="start", on="admissions")
tpool.t0_data().head() # columns: id, _T0_, admissions_T0_NEAREST_RANK_, labs_T0_NEAREST_RANK_, ...
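To see why each sub-pool needs its own nearest-rank column, the sketch below applies one T0 to two temporal grids and builds a row with the column naming described above. It is plain Python with hypothetical data; floor_index and sub_pool_starts are illustrative names, not part of the library.

```python
from bisect import bisect_right
from datetime import datetime


def floor_index(starts, t0):
    """0-based index of the last start time <= t0, or None if there is none."""
    rank = bisect_right(starts, t0) - 1
    return rank if rank >= 0 else None


# Hypothetical sorted start times per sub-pool alias, for one trajectory.
sub_pool_starts = {
    "admissions": [datetime(2020, 3, 15), datetime(2020, 9, 1)],
    "labs": [
        datetime(2020, 3, 1),
        datetime(2020, 3, 10),
        datetime(2020, 3, 15),
        datetime(2020, 4, 2),
    ],
}

# T0 taken from the first admission (position=0, anchor="start", on="admissions").
t0 = sub_pool_starts["admissions"][0]

# The same T0 maps to a different rank on each sub-pool's grid.
ranks = {alias: floor_index(starts, t0) for alias, starts in sub_pool_starts.items()}

# One row of the T0 table: alias prefix, then the constant suffix.
row = {"id": "pat_01", "_T0_": t0}
row.update({f"{alias}_T0_NEAREST_RANK_": rank for alias, rank in ranks.items()})

print(ranks)  # {'admissions': 0, 'labs': 2}
print(list(row))
```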
Trajectory Properties#
Once set_t0 has been called on the trajectory pool, every
Trajectory exposes:
| Property | Type | Description |
|---|---|---|
| traj.t0 | scalar or None | T0 for this trajectory. |
| traj.t0_nearest_rank | dict | Per-alias floor index: {alias: rank}. |
traj = tpool[tpool.unique_ids[0]]
print(traj.t0) # e.g. datetime(2020, 3, 15, ...)
print(traj.t0_nearest_rank) # e.g. {'admissions': 0, 'labs': 2, 'phases': 1}
Null Handling#
A sequence receives _T0_ = null in any of these situations:
position - the index is out of range for that sequence.
direct (dict) - the sequence ID is not a key in the dict.
feature - the static feature value is null for that ID.
query - no entity row matches the expression (or the sequence is empty).
Sequences with null T0 are not dropped from the pool. Both seq.t0 and
seq.t0_nearest_rank return None for those sequences.
To inspect how many sequences are affected:
null_count = pool.t0_data()["_T0_"].isnull().sum()
print(f"{null_count}/{len(pool)} sequences with T0 = null")
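The dict case can be illustrated without the library. In this minimal sketch (hypothetical IDs and names), an ID missing from the mapping keeps its place but receives a null T0, which is then easy to count or list.

```python
from datetime import datetime

# Hypothetical pool IDs and a per-ID mapping that omits one of them.
sequence_ids = ["pat_01", "pat_02", "pat_03"]
direct = {
    "pat_01": datetime(2020, 3, 15),
    "pat_02": datetime(2021, 6, 1),
}

# Every ID is retained; IDs absent from the dict get None as T0.
t0_by_id = {seq_id: direct.get(seq_id) for seq_id in sequence_ids}

# List the affected IDs rather than only counting them.
null_ids = [seq_id for seq_id, t0 in t0_by_id.items() if t0 is None]

print(f"{len(null_ids)}/{len(t0_by_id)} sequences with T0 = null")  # 1/3 sequences with T0 = null
print(null_ids)  # ['pat_03']
```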
See Also#
Data Manipulation - Full operation reference including set_t0 and t0_data.
Sequence Level Zeroing - Set T0 at the sequence level.
Trajectory-Level Zeroing - Set T0 at the trajectory level.