Sequence Level Zeroing#

Align an IntervalSequencePool to a reference date (T0) using each of the four built-in strategies.

Strategy	Description
`position`	T0 = temporal value at a given row index (`0` = first, `-1` = last)
`direct`	T0 = fixed scalar or per-id `dict`
`feature`	T0 = value of a static feature column
`query`	T0 = first/last row matching a Polars expression

After calling set_t0, inspect the results with pool.t0_data(), seq.t0, and seq.t0_nearest_rank.

See Zeroing & Alignment for the complete reference.

Imports#

import polars as pl
import pandas as pd
from datetime import datetime, timedelta

from tanat import build_intervals
from tanat.dataset import simulate_intervals, simulate_static

Simulate data#

Generate a small IntervalSequencePool with both temporal and static features so we can exercise all four strategies.

temporal = simulate_intervals(
    n_ids=40,
    features=["value", "status"],
    seed=42,
)
temporal.head()

	id	start	end	value	status
0	1	2001-04-19 12:22:42.057926	2001-05-14 23:30:23.456650	17	D
1	1	2002-06-24 02:25:44.449601	2002-07-21 06:40:32.167026	76	A
2	1	2023-02-14 15:44:33.896222	2023-03-16 01:31:26.193500	71	B
3	2	2003-05-13 17:32:18.288573	2003-05-25 21:33:38.530002	36	B
4	2	2011-07-28 06:48:40.890238	2011-08-16 19:56:50.622650	7	D

static = simulate_static(n_ids=40, features=["age"], seed=0)
static.head()

	id	age
0	1	86
1	2	64
2	3	52
3	4	27
4	5	31

Build the pool#

pool = build_intervals(
    temporal_data=temporal,
    id_column="id",
    start_column="start",
    end_column="end",
    static_data=static,
)
print(pool)

┌─ Interval SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity, time index & static features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (40 sequences · 271 entities · 0.01s)
┌────────────────────────────────────────────────┐
│          IntervalSequencePool Summary          │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequences          40
  Store              /home/runner/.tanat/_quick_interval_356d7dc5
  id_column          id

Time Index
─────────────────────────
  Type               Datetime(time_unit='us', time_zone=None) [2000-02-06 14:20:19.371107 → 2024-12-25 10:06:22.688850]
  Columns            ['start', 'end']
  t0                 position=0, anchor=start

Entity Features (2)
─────────────────────────
  • status              String [len 1 → 1]
  • value               Numerical [1 → 100]

Static Features (1)
─────────────────────────
  • age                 Numerical [1 → 98]

Strategy 1 - position#

set_t0(position=N) selects the temporal value at row index N (0-based; negative indices count from the end). For interval and state pools the anchor parameter controls which end of the interval is used: "start" (default) or "end".

First row, start of interval

pool.set_t0(position=0, anchor="start")
print("position=0, anchor='start'")
pool.t0_data().head()

position=0, anchor='start'

	id	_T0_
0	1	2001-04-19 12:22:42.057926
1	2	2003-05-13 17:32:18.288573
2	3	2007-06-04 01:49:23.443822
3	4	2002-05-18 09:56:41.813009
4	5	2006-05-02 21:31:22.419217

Last row, end of interval

pool.set_t0(position=-1, anchor="end")
print("position=-1, anchor='end'")
pool.t0_data().head()

position=-1, anchor='end'

	id	_T0_	_T0_NEAREST_RANK_
0	1	2023-03-16 01:31:26.193500	2
1	2	2021-06-21 23:22:50.981274	8
2	3	2023-10-14 06:46:47.003936	7
3	4	2021-08-02 07:29:31.752257	5
4	5	2023-09-10 15:11:05.892085	5
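The selection rule described above can be sketched in plain Python, independent of the library. The helper `t0_from_position` is hypothetical and not part of tanat. It assumes the intervals of one sequence are already sorted by start, and returning `None` for an out-of-range index is an assumption of this sketch rather than documented behaviour.

```python
from datetime import datetime


# Hypothetical helper (not a tanat function): mimics what position/anchor select.
def t0_from_position(intervals, position=0, anchor="start"):
    """Return the start or end of the interval at `position`, or None if out of range."""
    if anchor not in ("start", "end"):
        raise ValueError("anchor must be 'start' or 'end'")
    try:
        start, end = intervals[position]  # negative indices count from the end
    except IndexError:
        return None  # assumption of this sketch: no such row means no T0
    return start if anchor == "start" else end


# Intervals of id 1 from the simulated data above, truncated to whole seconds.
seq_1 = [
    (datetime(2001, 4, 19, 12, 22, 42), datetime(2001, 5, 14, 23, 30, 23)),
    (datetime(2002, 6, 24, 2, 25, 44), datetime(2002, 7, 21, 6, 40, 32)),
    (datetime(2023, 2, 14, 15, 44, 33), datetime(2023, 3, 16, 1, 31, 26)),
]

first_start = t0_from_position(seq_1, position=0, anchor="start")
last_end = t0_from_position(seq_1, position=-1, anchor="end")
print(first_start)
print(last_end)
```

Up to the truncated microseconds, the two values correspond to the `_T0_` entries shown for id 1 in the two tables above.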

Strategy 2 - direct#

set_t0(direct=value) assigns the same timestamp to every sequence. set_t0(direct={id: value, ...}) assigns a per-id timestamp; IDs absent from the dict receive _T0_ = null.

Scalar: same T0 for all sequences

pool.set_t0(direct=datetime(2020, 1, 1))
print("direct scalar")
pool.t0_data().head()

direct scalar

	id	_T0_	_T0_NEAREST_RANK_
0	1	2020-01-01 00:00:00	1
1	2	2020-01-01 00:00:00	6
2	3	2020-01-01 00:00:00	5
3	4	2020-01-01 00:00:00	4
4	5	2020-01-01 00:00:00	4

Dict: per-id mapping

first_ids = pool.unique_ids[:3]
per_id_map = {
    first_ids[0]: datetime(2020, 1, 10),
    first_ids[1]: datetime(2020, 2, 20),
    first_ids[2]: datetime(2020, 3, 15),
}
pool.set_t0(direct=per_id_map)
print("direct per-id (only 3 IDs in dict → remaining get null)")
pool.t0_data().head(6)

/home/runner/work/TanaT/TanaT/src/tanat/sequence/base/pool.py:509: UserWarning: 37 sequence(s) received _t0 = null (no valid row found): [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
  setter.compute_from_sequence(
direct per-id (only 3 IDs in dict → remaining get null)

	id	_T0_	_T0_NEAREST_RANK_
0	1	2020-01-10 00:00:00	1
1	2	2020-02-20 00:00:00	7
2	3	2020-03-15 00:00:00	5
3	4	<NA>	<NA>
4	5	<NA>	<NA>
5	6	<NA>	<NA>
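As a rough mental model of the two `direct` forms, the following plain-Python sketch broadcasts a scalar to every id and looks up a dict per id, leaving absent ids as `None`. The helper `t0_from_direct` is hypothetical and only illustrates the mapping; it does not reproduce how the library stores or validates the values.

```python
from datetime import datetime


# Hypothetical helper (not a tanat function): scalar broadcast vs per-id lookup.
def t0_from_direct(ids, direct):
    """Map each id to its T0; a dict gives per-id values, anything else is broadcast."""
    if isinstance(direct, dict):
        # ids missing from the dict get None (shown as null in the tables above)
        return {i: direct.get(i) for i in ids}
    return {i: direct for i in ids}


ids = [1, 2, 3, 4, 5]
scalar_t0 = t0_from_direct(ids, datetime(2020, 1, 1))
per_id_t0 = t0_from_direct(
    ids, {1: datetime(2020, 1, 10), 2: datetime(2020, 2, 20)}
)
missing = [i for i, t0 in per_id_t0.items() if t0 is None]
print(missing)
```

Listing the ids with a missing T0, as done here with `missing`, is the same information the `UserWarning` above reports.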

Strategy 3 - feature#

set_t0(feature="col") reads T0 from a static feature column. The feature dtype must match the pool’s temporal dtype; cast it with cast_features() if needed.

We first attach a custom static column index_date whose dtype already matches the pool’s temporal dtype (Datetime[us]).

Build a per-id index_date column (Datetime[us] to match the pool’s time index)

n = len(pool)
index_dates = pd.DataFrame(
    {
        "id": pool.unique_ids,
        "index_date": pd.array(
            [datetime(2020, 1, 1) + timedelta(days=int(i * 7)) for i in range(n)],
            dtype="datetime64[us]",
        ),
    }
)
pool.add_static_features(index_dates)

pool.set_t0(feature="index_date")
print("feature='index_date'")
pool.t0_data().head()

feature='index_date'

	id	_T0_	_T0_NEAREST_RANK_
0	1	2020-01-01 00:00:00	1
1	2	2020-01-08 00:00:00	6
2	3	2020-01-15 00:00:00	5
3	4	2020-01-22 00:00:00	4
4	5	2020-01-29 00:00:00	4

Every ID has an index_date, so this strategy produces no nulls.

null_count = pool.t0_data()["_T0_"].isnull().sum()
print(f"Sequences with _T0_ = null: {null_count}/{len(pool)}")

Sequences with _T0_ = null: 0/40
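The dtype requirement mentioned for the `feature` strategy has a close analogy in plain Python, where a `date` and a `datetime` cannot be ordered against each other. The snippet below is only that analogy, using the standard library; it is not the check the library performs, and the variable names are illustrative.

```python
from datetime import date, datetime, time

index_date = date(2020, 1, 1)  # a calendar date, not a timestamp
interval_start = datetime(2019, 12, 25, 8, 30)

# Ordering a date against a datetime raises TypeError in Python.
try:
    index_date > interval_start
    comparable = True
except TypeError:
    comparable = False

# Converting the date to a datetime (midnight) makes the comparison well defined.
as_datetime = datetime.combine(index_date, time.min)
print(comparable, as_datetime > interval_start)
```

Converting the reference column to the temporal type before using it as T0 plays the same role as the `datetime.combine` call here.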

Strategy 4 - query#

set_t0(query=expr) scans entity rows and picks the first (or last with use_first=False) row where the Polars expression is True. The anchor parameter controls which end of the interval becomes T0. Sequences with no matching row receive _T0_ = null.

T0 = start of the first row where status == “D”

pool.set_t0(query=pl.col("status") == "D", anchor="start", use_first=True)
print("First 'D' row (start)")
pool.t0_data().head()

/home/runner/work/TanaT/TanaT/src/tanat/sequence/base/pool.py:509: UserWarning: 15 sequence(s) received _t0 = null (no valid row found): [3, 4, 6, 7, 9, 15, 19, 21, 30, 31, 32, 33, 34, 39, 40]
  setter.compute_from_sequence(
First 'D' row (start)

	id	_T0_	_T0_NEAREST_RANK_
0	1	2001-04-19 12:22:42.057926	0
1	2	2011-07-28 06:48:40.890238	1
2	3	<NA>	<NA>
3	4	<NA>	<NA>
4	5	2019-07-26 22:09:35.148414	3

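T0 = end of the last row where value > 0.8. In this simulated pool value ranges from 1 to 100, so every row matches and the result is the same as position=-1 with anchor="end"; no sequence receives null.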

pool.set_t0(query=pl.col("value") > 0.8, anchor="end", use_first=False)
print("Last row with value > 0.8 (end)")
pool.t0_data().head()

Last row with value > 0.8 (end)

	id	_T0_	_T0_NEAREST_RANK_
0	1	2023-03-16 01:31:26.193500	2
1	2	2021-06-21 23:22:50.981274	8
2	3	2023-10-14 06:46:47.003936	7
3	4	2021-08-02 07:29:31.752257	5
4	5	2023-09-10 15:11:05.892085	5
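The first/last-match rule can be illustrated without the library. In this sketch a Python callable stands in for the Polars expression, and the helper `t0_from_query` is hypothetical. The rows are loosely based on id 1 from the simulated data, with timestamps truncated to dates.

```python
from datetime import datetime


# Hypothetical helper (not a tanat function): first/last row satisfying a predicate.
def t0_from_query(rows, predicate, anchor="start", use_first=True):
    """Return the anchor of the first (or last) matching row, or None if none match."""
    matches = [row for row in rows if predicate(row)]
    if not matches:
        return None  # no matching row: T0 stays null
    chosen = matches[0] if use_first else matches[-1]
    return chosen[anchor]


rows = [
    {"start": datetime(2001, 4, 19), "end": datetime(2001, 5, 14), "value": 17, "status": "D"},
    {"start": datetime(2002, 6, 24), "end": datetime(2002, 7, 21), "value": 76, "status": "A"},
    {"start": datetime(2023, 2, 14), "end": datetime(2023, 3, 16), "value": 71, "status": "B"},
]

first_d = t0_from_query(rows, lambda r: r["status"] == "D")
last_high = t0_from_query(rows, lambda r: r["value"] > 70, anchor="end", use_first=False)
no_match = t0_from_query(rows, lambda r: r["value"] > 100)
print(first_d, last_high, no_match)
```

A threshold inside the observed range (here 70) selects a subset of rows, while one above it (here 100) matches nothing and yields `None`.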

Sequence-level properties#

After any set_t0 call, every Sequence exposes seq.t0 and seq.t0_nearest_rank.

Property	Type	Description
`seq.t0`	scalar | `None`	T0 for this sequence; `None` when no T0 could be computed
`seq.t0_nearest_rank`	`int` | `None`	0-based index of the entity at or just before T0

T0 is always set at the pool level via pool.set_t0(...) and propagated to every sequence automatically. There is no seq.set_t0(): the pool is the single source of truth, which prevents desynchronisation between sequences after filtering or iterating.

pool.set_t0(position=0, anchor="start")

seq = pool[pool.unique_ids[0]]
print(f"id              : {seq.id_value}")
print(f"t0              : {seq.t0}")
print(f"t0_nearest_rank : {seq.t0_nearest_rank}")

id              : 1
t0              : 2001-04-19 12:22:42.057926
t0_nearest_rank : 0
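One way to picture "the entity at or just before T0" is a binary search over sorted interval starts. The sketch below is a plain-Python assumption, not the library's implementation. The helper `nearest_rank` is hypothetical, it ranks by interval start only, and returning `None` when T0 precedes every interval is a choice made for this sketch.

```python
from bisect import bisect_right
from datetime import datetime


# Hypothetical helper (not a tanat function): index of the last start <= T0.
def nearest_rank(starts, t0):
    """Return the 0-based index of the entity at or just before t0, else None."""
    if t0 is None:
        return None
    rank = bisect_right(starts, t0) - 1  # number of starts <= t0, minus one
    return rank if rank >= 0 else None   # assumption: nothing before T0 gives None


# Interval starts of id 1 from the simulated data, truncated to whole seconds.
starts = [
    datetime(2001, 4, 19, 12, 22, 42),
    datetime(2002, 6, 24, 2, 25, 44),
    datetime(2023, 2, 14, 15, 44, 33),
]

rank_first = nearest_rank(starts, starts[0])
rank_2020 = nearest_rank(starts, datetime(2020, 1, 1))
rank_last = nearest_rank(starts, datetime(2023, 3, 16, 1, 31, 26))
print(rank_first, rank_2020, rank_last)
```

Under these assumptions the three ranks agree with the `_T0_NEAREST_RANK_` values shown for id 1 in the position, direct, and query examples.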

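Null case: when a query matches no row in a sequence, seq.t0 is None. Here value ranges from 1 to 100, so value > 0.999 matches every row and no sequence ends up with a null T0; only a threshold above the observed range would leave sequences unmatched.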

pool.set_t0(query=pl.col("value") > 0.999, anchor="start")

null_seqs = [seq.id_value for seq in pool if seq.t0 is None]
print(
    f"{len(null_seqs)}/{len(pool)} sequence(s) with t0 = None  "
    f"(no row matched value > 0.999)"
)

0/40 sequence(s) with t0 = None  (no row matched value > 0.999)
