Zeroing & Alignment

Zeroing aligns sequences to a common reference date (T0 / index date), transforming absolute timestamps into relative ones. This is essential when comparing sequences across individuals who were observed at different calendar times, for example aligning patients to their first hospitalisation or users to their registration date.
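
To make the idea concrete, here is a minimal sketch in plain Python that does not use the library itself. The helper name to_relative and the sample dates are hypothetical; the sketch only shows how subtracting a per-individual T0 turns calendar timestamps into comparable offsets.

```python
from datetime import datetime, timedelta

def to_relative(timestamps, t0):
    """Return each timestamp as an offset from t0 (negative means before t0)."""
    return [ts - t0 for ts in timestamps]

# Two patients observed in different calendar years (made-up dates)
patient_a = [datetime(2019, 1, 10), datetime(2019, 1, 12), datetime(2019, 2, 1)]
patient_b = [datetime(2021, 6, 1), datetime(2021, 6, 3), datetime(2021, 6, 23)]

# Align each patient on their own first event, used here as the index date
rel_a = to_relative(patient_a, t0=patient_a[0])
rel_b = to_relative(patient_b, t0=patient_b[0])

# After zeroing, both sequences are offsets of 0, 2 and 22 days
print(rel_a == rel_b)
```

Once expressed relative to T0, the two sequences can be compared directly even though they are more than two years apart on the calendar.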

Once set_t0 is called on a pool, every Sequence object automatically exposes seq.t0 and seq.t0_nearest_rank.


Strategies

Four strategies are available. Pass exactly one keyword to set_t0.

| Strategy | Keyword | Description |
| --- | --- | --- |
| position | set_t0(position=N) | T0 = temporal value at row index N (0-based; negative indices are supported: -1 is the last row). For interval and state pools, the anchor parameter selects the reference point within the period ("start", "end", or "middle"). |
| direct | set_t0(direct=value) | T0 = the same scalar timestamp for every sequence. Alternatively, pass a dict[id, timestamp] to assign a different T0 per individual; IDs absent from the dict receive _T0_ = null. |
| feature | set_t0(feature="col") | T0 = the value of a static feature column. The feature dtype must exactly match the pool’s temporal dtype; use cast_features to align if needed. IDs with a null static value receive _T0_ = null. |
| query | set_t0(query=expr) | T0 = temporal value of the first (or last, with use_first=False) entity row where the Polars expression evaluates to True. The anchor parameter determines the reference point within the matched period. Sequences with no matching row receive _T0_ = null. |

The anchor parameter ("start" | "end" | "middle") applies only to the position and query strategies, and only for IntervalSequencePool and StateSequencePool; it is ignored otherwise.
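
The position strategy and the anchor parameter can be sketched in plain Python as follows. This is an illustration, not the library's implementation: resolve_anchor and t0_from_position are hypothetical helpers, intervals are modelled as (start, end) tuples, and "middle" is assumed to mean the midpoint of the period.

```python
from datetime import datetime

def resolve_anchor(start, end, anchor="start"):
    """Pick the reference point inside a [start, end] period."""
    if anchor == "start":
        return start
    if anchor == "end":
        return end
    if anchor == "middle":
        # Assumption: "middle" is the midpoint between start and end
        return start + (end - start) / 2
    raise ValueError(f"unknown anchor: {anchor!r}")

def t0_from_position(intervals, position, anchor="start"):
    """T0 at a 0-based row index; negative indices count from the end."""
    if not -len(intervals) <= position < len(intervals):
        return None  # index out of range: T0 stays null
    start, end = intervals[position]
    return resolve_anchor(start, end, anchor)

# One individual's hospital stays as (start, end) pairs (made-up dates)
stays = [
    (datetime(2020, 3, 15), datetime(2020, 3, 20)),
    (datetime(2020, 5, 1), datetime(2020, 5, 11)),
]

first_start = t0_from_position(stays, 0, "start")    # start of the first stay
last_end = t0_from_position(stays, -1, "end")        # end of the last stay
last_middle = t0_from_position(stays, -1, "middle")  # midpoint of the last stay
too_far = t0_from_position(stays, 5)                 # None: only two rows exist
```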


Usage

import polars as pl
from datetime import datetime

# position: first row, start of interval
pool.set_t0(position=0, anchor="start")

# position: last row, end of interval
pool.set_t0(position=-1, anchor="end")

# direct: same T0 for all sequences
pool.set_t0(direct=datetime(2000, 1, 1))

# direct: per-id mapping
pool.set_t0(direct={
    "pat_01": datetime(2020, 3, 15),
    "pat_02": datetime(2021, 6, 1),
})

# feature: read T0 from a static column
pool.cast_features({"registration_date": pl.Datetime("us")}, is_static=True)
pool.set_t0(feature="registration_date")

# query: first row where status == "error"
pool.set_t0(query=pl.col("status") == "error", anchor="start", use_first=True)

# query: last row where value > 0.9
pool.set_t0(query=pl.col("value") > 0.9, anchor="end", use_first=False)
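
The query strategy's first-or-last selection can be pictured with a small library-free sketch. Here a plain Python callable stands in for the Polars expression, rows are dicts assumed to be sorted by time, and t0_from_query is a hypothetical helper rather than part of the API.

```python
from datetime import datetime

def t0_from_query(rows, predicate, use_first=True):
    """Timestamp of the first (or last) row matching predicate, else None."""
    matches = [row["time"] for row in rows if predicate(row)]
    if not matches:
        return None  # no matching row: T0 stays null
    return matches[0] if use_first else matches[-1]

# Event rows for one individual, sorted by time (made-up data)
rows = [
    {"time": datetime(2024, 1, 1), "status": "ok"},
    {"time": datetime(2024, 1, 2), "status": "error"},
    {"time": datetime(2024, 1, 5), "status": "error"},
]

def is_error(row):
    return row["status"] == "error"

first_error = t0_from_query(rows, is_error)                  # 2 January
last_error = t0_from_query(rows, is_error, use_first=False)  # 5 January
no_match = t0_from_query(rows, lambda row: row["status"] == "timeout")
```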

Pool-Level Inspection

pool.t0_data() returns the full T0 table as a DataFrame with columns [id, _T0_, _T0_NEAREST_RANK_].

The _T0_NEAREST_RANK_ column holds the 0-based index of the last entity whose start time is at or before T0 (a floor lookup). It is always computed from the start column, regardless of the anchor passed to set_t0.

pool.set_t0(position=0, anchor="start")

# pandas (default)
pool.t0_data().head()
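
The floor lookup behind _T0_NEAREST_RANK_ can be sketched with the standard-library bisect module. The helper nearest_rank is hypothetical, and returning None when T0 precedes every start is an assumption of this sketch, not documented behaviour.

```python
from bisect import bisect_right
from datetime import datetime

def nearest_rank(starts, t0):
    """0-based index of the last start <= t0 in a sorted list, else None."""
    if t0 is None:
        return None
    idx = bisect_right(starts, t0) - 1
    # Assumption: there is no floor when T0 precedes every start
    return idx if idx >= 0 else None

# Sorted start times of one sequence (made-up dates)
starts = [datetime(2020, 1, 1), datetime(2020, 2, 1), datetime(2020, 3, 1)]

between = nearest_rank(starts, datetime(2020, 2, 15))  # falls after the second start
exact = nearest_rank(starts, datetime(2020, 2, 1))     # ties resolve to that row
before_all = nearest_rank(starts, datetime(2019, 12, 1))
```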

Sequence-Level Properties

Once set_t0 has been called on the pool, every sequence object exposes two read-only properties:

| Property | Type | Description |
| --- | --- | --- |
| seq.t0 | scalar or None | T0 value for this sequence. None when T0 could not be determined (sequence too short, no query match, null static feature value…). |
| seq.t0_nearest_rank | int or None | 0-based index of the entity at or just before T0. None when seq.t0 is None. |

seq = pool[pool.unique_ids[0]]
print(seq.t0)               # e.g. datetime(2020, 3, 15, ...)
print(seq.t0_nearest_rank)  # e.g. 2

Trajectory-Level Zeroing

TrajectoryPool.set_t0 accepts the same four strategy keywords as the sequence-level set_t0, plus an additional on= parameter that selects the reference sub-pool from which T0 is computed.

The on= parameter

Strategies that inspect temporal rows (position, query) require on=, because the row index or filter expression must be evaluated against a specific sub-pool. Strategies that do not read rows (direct, feature) do not need on=; if it is provided, it is ignored with a warning.

# position: first admission, start of interval
tpool.set_t0(position=0, anchor="start", on="admissions")

# direct: no on= needed
tpool.set_t0(direct=datetime(2010, 6, 1))

# feature: trajectory-level static column
tpool.set_t0(feature="admission_date")

# query: first lab matching a condition
tpool.set_t0(query=pl.col("status") == "error", on="labs")
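
The keyword rules described above (exactly one strategy, on= required for row-based strategies, on= ignored with a warning otherwise) can be summarised in a small validation sketch. The function check_set_t0_args is hypothetical, and raising ValueError is an assumption; the library may report these cases differently.

```python
import warnings

# Strategies that read temporal rows and therefore need a reference sub-pool
ROW_STRATEGIES = {"position", "query"}

def check_set_t0_args(on=None, **strategies):
    """Validate a keyword combination and return the chosen strategy name."""
    chosen = [name for name, value in strategies.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError("pass exactly one of position, direct, feature, query")
    strategy = chosen[0]
    if strategy in ROW_STRATEGIES and on is None:
        # Assumption: a missing on= is treated as an error
        raise ValueError(f"{strategy} requires on=")
    if strategy not in ROW_STRATEGIES and on is not None:
        warnings.warn(f"on= is ignored for the {strategy} strategy")
    return strategy

print(check_set_t0_args(position=0, on="admissions"))
print(check_set_t0_args(feature="admission_date"))
```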

Trajectory-Level Inspection

tpool.t0_data() returns one row per trajectory with columns [id, _T0_, <alias1>_T0_NEAREST_RANK_, <alias2>_T0_NEAREST_RANK_, ...].

Each sub-pool gets its own nearest-rank column because the floor index depends on that pool’s temporal grid. The column is named <alias>_T0_NEAREST_RANK_ (the alias prefix followed by the constant suffix).

tpool.set_t0(position=0, anchor="start", on="admissions")
tpool.t0_data().head()  # columns: id, _T0_, admissions_T0_NEAREST_RANK_, labs_T0_NEAREST_RANK_, ...
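
To see why one T0 yields a different rank per sub-pool, the sketch below applies the same floor lookup to two temporal grids. It is a plain-Python illustration with made-up dates; floor_rank and the grids dict are hypothetical stand-ins for the sub-pools, and only the column naming follows the text above.

```python
from bisect import bisect_right
from datetime import datetime

def floor_rank(starts, t0):
    """0-based index of the last start <= t0 in a sorted list, else None."""
    idx = bisect_right(starts, t0) - 1
    return idx if idx >= 0 else None

# Sorted start times per sub-pool alias for one trajectory
grids = {
    "admissions": [datetime(2020, 3, 15), datetime(2020, 9, 1)],
    "labs": [
        datetime(2020, 3, 1),
        datetime(2020, 3, 10),
        datetime(2020, 3, 14),
        datetime(2020, 4, 2),
    ],
}

# Mimics position=0, anchor="start", on="admissions"
t0 = grids["admissions"][0]

# One shared T0, one floor index per temporal grid
ranks = {alias: floor_rank(starts, t0) for alias, starts in grids.items()}

# Row shaped like the trajectory-level T0 table
row = {"id": "pat_01", "_T0_": t0}
row.update({f"{alias}_T0_NEAREST_RANK_": rank for alias, rank in ranks.items()})
print(row)
```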

Trajectory Properties

Once set_t0 has been called on the trajectory pool, every Trajectory exposes:

| Property | Type | Description |
| --- | --- | --- |
| traj.t0 | scalar or None | T0 for this trajectory. None when no T0 could be determined. |
| traj.t0_nearest_rank | dict[str, int \| None] | Per-alias floor index: {"admissions": 0, "labs": 2, ...}. None per alias when traj.t0 is None. |

traj = tpool[tpool.unique_ids[0]]
print(traj.t0)               # e.g. datetime(2020, 3, 15, ...)
print(traj.t0_nearest_rank)  # e.g. {'admissions': 0, 'labs': 2, 'phases': 1}

T0 is shared across all children

A single tpool.set_t0(...) call is enough. Every object you retrieve from the pool (a sub-pool, a trajectory, or an individual sequence) automatically returns the same t0 value. t0_nearest_rank still varies: each pool computes its floor index on its own temporal grid.

tpool.set_t0(position=0, anchor="start", on="admissions")

traj = tpool[tpool.unique_ids[0]]
print(traj.t0)                      # e.g. datetime(2020, 3, 15, ...)

seq = traj["labs"]
print(seq.t0)                       # == traj.t0
print(seq.t0_nearest_rank)          # floor index on the labs temporal grid

tpool.sequence_pools["labs"].t0_data().head()   # _T0_ column == traj.t0

Null Handling

A sequence receives _T0_ = null in any of these situations:

  • position - the index is out of range for that sequence.

  • direct (dict) - the sequence ID is not a key in the dict.

  • feature - the static feature value is null for that ID.

  • query - no entity row matches the expression (or the sequence is empty).

Sequences with null T0 are not dropped from the pool. seq.t0 returns None and seq.t0_nearest_rank returns None for those individuals.

To inspect how many sequences are affected:

null_count = pool.t0_data()["_T0_"].isnull().sum()
print(f"{null_count}/{len(pool)} sequences with T0 = null")
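
The null rule for the direct (dict) strategy can be mimicked without the library, which also shows one way to select the individuals that remain usable for alignment. The IDs and dates are made up, and the variable names are hypothetical.

```python
from datetime import datetime

ids = ["pat_01", "pat_02", "pat_03"]
per_id_t0 = {
    "pat_01": datetime(2020, 3, 15),
    "pat_02": datetime(2021, 6, 1),
}

# IDs missing from the mapping get None, mirroring _T0_ = null
t0_by_id = {seq_id: per_id_t0.get(seq_id) for seq_id in ids}

# Null T0s are kept rather than dropped, so usable IDs are selected explicitly
usable_ids = [seq_id for seq_id, t0 in t0_by_id.items() if t0 is not None]
missing_count = len(ids) - len(usable_ids)
print(f"{missing_count}/{len(ids)} sequences with T0 = null")
```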

See Also