LengthCriterion#
Select sequences by their number of entity rows (sequence length).
| Parameter | Description |
|---|---|
| gt / ge | Strictly greater than / greater than or equal to. |
| lt / le | Strictly less than / less than or equal to. |
At least one bound must be supplied. Contradictory bounds (e.g. gt=5,
lt=3) are rejected at construction time.
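The bound rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not tanat's implementation: `LengthBounds` is a hypothetical stand-in class, and it assumes that "contradictory" means no integer length can satisfy all bounds at once.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LengthBounds:
    """Hypothetical stand-in for illustration; not tanat's LengthCriterion."""

    gt: Optional[int] = None
    ge: Optional[int] = None
    lt: Optional[int] = None
    le: Optional[int] = None

    def __post_init__(self) -> None:
        if all(b is None for b in (self.gt, self.ge, self.lt, self.le)):
            raise ValueError("At least one bound must be supplied.")
        lo, hi = self.min_length(), self.max_length()
        # Assumption: bounds are contradictory when no integer length fits.
        if lo is not None and hi is not None and lo > hi:
            raise ValueError(f"Contradictory bounds: min {lo} > max {hi}.")

    def min_length(self) -> Optional[int]:
        """Smallest integer length allowed by the lower bounds, if any."""
        lows = []
        if self.gt is not None:
            lows.append(self.gt + 1)  # strict bound excludes gt itself
        if self.ge is not None:
            lows.append(self.ge)
        return max(lows) if lows else None

    def max_length(self) -> Optional[int]:
        """Largest integer length allowed by the upper bounds, if any."""
        highs = []
        if self.lt is not None:
            highs.append(self.lt - 1)  # strict bound excludes lt itself
        if self.le is not None:
            highs.append(self.le)
        return min(highs) if highs else None

    def matches(self, length: int) -> bool:
        """Return True when the length satisfies every supplied bound."""
        lo, hi = self.min_length(), self.max_length()
        return (lo is None or length >= lo) and (hi is None or length <= hi)


medium = LengthBounds(gt=3, le=6)
print(medium.matches(4), medium.matches(7))

try:
    LengthBounds(gt=5, lt=3)  # the contradictory example from the text
except ValueError as exc:
    print(f"rejected: {exc}")
```

Validating in the constructor means an impossible selection fails immediately instead of silently returning an empty set of IDs later.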
LengthCriterion operates at the SEQUENCE level only: it works with
which() and match(), but filter_entities() is not available.
See Criteria for the full reference.
Imports#
from tanat import build_intervals
from tanat.criterion import LengthCriterion
from tanat.dataset import simulate_intervals
Simulate data#
temporal = simulate_intervals(n_ids=50, features=["value", "status"], seed=42)
pool = build_intervals(
temporal_data=temporal,
id_column="id",
start_column="start",
end_column="end",
)
┌─ Interval SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (50 sequences · 343 entities · 0.00s)
print(pool)
┌────────────────────────────────────────────────┐
│ IntervalSequencePool Summary │
└────────────────────────────────────────────────┘
Overview
─────────────────────────
Sequences 50
Store /home/runner/.tanat/_quick_interval_50580529
id_column id
Time Index
─────────────────────────
Type Datetime(time_unit='us', time_zone=None) [2000-01-12 06:14:52.240595 → 2025-01-20 05:35:23.188780]
Columns ['start', 'end']
t0 position=0, anchor=start
Entity Features (2)
─────────────────────────
• status String [len 1 → 1]
• value Numerical [1 → 100]
Inspect the length distribution and other summary statistics with describe().
pool.describe(by_id=False)
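To make "sequence length" concrete, here is a rough, library-independent sketch of counting entity rows per ID and tallying the resulting length distribution. The `rows` data is hypothetical and the code does not reproduce what describe() prints.

```python
from collections import Counter

# Hypothetical entity rows as (id, start, end) tuples; one row per interval.
rows = [
    (1, 0, 2), (1, 2, 5), (1, 5, 9),
    (2, 0, 1),
    (3, 1, 4), (3, 4, 6),
    (4, 0, 3), (4, 3, 4), (4, 4, 8),
]

# Sequence length = number of entity rows sharing the same ID.
lengths = Counter(seq_id for seq_id, _start, _end in rows)

# Length distribution = how many sequences have each length.
distribution = Counter(lengths.values())

print(dict(lengths))
print(sorted(distribution.items()))
```

Looking at such a distribution first helps pick thresholds (such as 3 and 6 in the sections that follow) that split the pool into groups of useful size.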
which(): single-bound selection#
# Long sequences: more than 6 entities.
ids_long = pool.which(LengthCriterion(gt=6))
[which] LengthCriterion → 29 / 50 IDs (58.0%)
# Short sequences: at most 3 entities.
ids_short = pool.which(LengthCriterion(le=3))
print(f"Length ≤ 3 : {len(ids_short)} / {len(pool)} IDs")
[which] LengthCriterion → 6 / 50 IDs (12.0%)
Length ≤ 3 : 6 / 50 IDs
Range selection#
Combine bounds to select sequences whose length falls in a range.
# Medium sequences: length in (3, 6], i.e. 4 to 6 entities.
ids_medium = pool.which(LengthCriterion(gt=3, le=6))
[which] LengthCriterion → 15 / 50 IDs (30.0%)
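The three criteria used on this page (le=3, gt=3 with le=6, and gt=6) are mutually exclusive and together cover every possible length, which is why the reported counts add up to the pool size: 6 + 15 + 29 = 50. The sketch below checks that arithmetic and repeats the same partition on a small set of hypothetical lengths, using plain comparisons rather than the tanat API.

```python
# Counts reported by the three which() calls on this page.
n_short, n_medium, n_long, n_total = 6, 15, 29, 50
assert n_short + n_medium + n_long == n_total

# Hypothetical ID -> length mapping, for illustration only.
lengths = {"a": 2, "b": 3, "c": 4, "d": 6, "e": 7, "f": 10}

short_ids = {i for i, n in lengths.items() if n <= 3}       # le=3
medium_ids = {i for i, n in lengths.items() if 3 < n <= 6}  # gt=3, le=6
long_ids = {i for i, n in lengths.items() if n > 6}         # gt=6

# Every ID lands in exactly one group.
assert short_ids | medium_ids | long_ids == set(lengths)
assert not (short_ids & medium_ids)
assert not (medium_ids & long_ids)
assert not (short_ids & long_ids)

print(sorted(short_ids), sorted(medium_ids), sorted(long_ids))
```

Pairing a strict bound with an inclusive one at each threshold (le=3 with gt=3, le=6 with gt=6) is what keeps boundary lengths from being counted twice or dropped.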
Subset the pool#
Use subset() to obtain a restricted pool from the selected IDs.
pool_long = pool.subset(ids_long)
print(pool_long)
┌────────────────────────────────────────────────┐
│ IntervalSequencePool Summary │
└────────────────────────────────────────────────┘
Overview
─────────────────────────
Sequences 29
Store /home/runner/.tanat/_quick_interval_50580529
id_column id
Time Index
─────────────────────────
Type Datetime(time_unit='us', time_zone=None) [2000-01-12 06:14:52.240595 → 2025-01-05 21:55:52.963626]
Columns ['start', 'end']
t0 position=0, anchor=start
Entity Features (2)
─────────────────────────
• status String [len 1 → 1]
• value Numerical [1 → 100]
# Inspect the length distribution in the subset.
pool_long.describe(by_id=False)
match(): single-sequence evaluation#
seq = pool[pool.unique_ids[0]]
seq_len = len(seq)
print(
f"Sequence {seq.id_value}: length={seq_len} "
f"gt=6? {seq.match(LengthCriterion(gt=6))} "
f"le=3? {seq.match(LengthCriterion(le=3))}"
)
Sequence 1: length=3 gt=6? False le=3? True