LengthCriterion

Select sequences by their number of entity rows (sequence length).

Parameter    Description
gt / ge      Strictly greater than / greater than or equal to.
lt / le      Strictly less than / less than or equal to.

At least one bound must be supplied. Contradictory bounds (e.g. gt=5, lt=3) are rejected at construction time.
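
The validation rule above can be illustrated with a minimal pure-Python sketch. It does not use tanat and is not the library's implementation: the function name `check_length_bounds` and the `ValueError` error type are assumptions made for illustration. How tanat treats edge cases such as gt=5, le=5 is not stated here, so the sketch takes one plausible reading: integer bounds are contradictory when no length can satisfy all of them.

```python
from typing import Optional


def check_length_bounds(
    gt: Optional[int] = None,
    ge: Optional[int] = None,
    lt: Optional[int] = None,
    le: Optional[int] = None,
) -> None:
    """Hypothetical validator (not tanat's code) for the rules described above."""
    if all(bound is None for bound in (gt, ge, lt, le)):
        raise ValueError("at least one bound must be supplied")

    # Smallest integer length allowed by the lower bounds (a length is never negative).
    lowest = 0
    if gt is not None:
        lowest = max(lowest, gt + 1)
    if ge is not None:
        lowest = max(lowest, ge)

    # Largest integer length allowed by the upper bounds (None means unbounded).
    highest = None
    if lt is not None:
        highest = lt - 1
    if le is not None:
        highest = le if highest is None else min(highest, le)

    if highest is not None and lowest > highest:
        raise ValueError("contradictory bounds: no length can satisfy them")


check_length_bounds(gt=3, le=6)  # accepted: lengths 4, 5 and 6 qualify

try:
    check_length_bounds(gt=5, lt=3)  # the contradictory example from the text
except ValueError as exc:
    print(f"rejected: {exc}")
```

Checking the bounds once, when the criterion is built, means a mistake surfaces before any selection runs rather than as a silently empty result.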

LengthCriterion operates at the SEQUENCE level only: it works with which() and match(), but filter_entities() is not available.

See Criteria for the full reference.

Imports

from tanat import build_intervals
from tanat.criterion import LengthCriterion
from tanat.dataset import simulate_intervals

Simulate data

temporal = simulate_intervals(n_ids=50, features=["value", "status"], seed=42)

pool = build_intervals(
    temporal_data=temporal,
    id_column="id",
    start_column="start",
    end_column="end",
)
┌─ Interval SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (50 sequences · 343 entities · 0.00s)
print(pool)
┌────────────────────────────────────────────────┐
│          IntervalSequencePool Summary          │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequences          50
  Store              /home/runner/.tanat/_quick_interval_50580529
  id_column          id

Time Index
─────────────────────────
  Type               Datetime(time_unit='us', time_zone=None) [2000-01-12 06:14:52.240595 → 2025-01-20 05:35:23.188780]
  Columns            ['start', 'end']
  t0                 position=0, anchor=start

Entity Features (2)
─────────────────────────
  • status              String [len 1 → 1]
  • value               Numerical [1 → 100]

Inspect the length distribution and other summary statistics.

pool.describe(by_id=False)
       length    n_unique_entities  temporal_span               mean_duration             median_duration           duration_std
count  50.0      50.0               50                          50                        50                        50
mean   6.86      6.76               6480 days, 2:48:22.247079   15 days, 3:30:25.259960   15 days, 5:12:18.547281   7 days, 21:43:35.511771
std    2.285804  2.254791           1941 days, 6:52:13.531688   3 days, 5:28:15.660040    4 days, 17:46:58.276359   2 days, 14:25:50.553305
min    3.0       3.0                1706 days, 19:27:07.917732  5 days, 8:01:28.714157    5 days, 0:11:27.379177    1 day, 4:52:50.506351
25%    5.0       5.0                5254 days 23:37:06.250402   13 days 00:36:46.440355   11 days 18:29:55.648568   6 days 18:26:20.009336
50%    7.0       7.0                7335 days 12:11:54.677473   16 days 02:40:15.396103   15 days 21:25:56.404198   8 days 03:08:02.672516
75%    9.0       9.0                7857 days 05:37:09.423327   17 days 03:52:02.575867   18 days 10:58:30.878432   9 days 15:57:33.150707
max    10.0      10.0               9050 days, 11:50:31.178892  20 days, 0:24:03.241089   24 days, 11:07:49.202588  14 days, 11:20:26.388911


which(): single-bound selection

# Long sequences: more than 6 entities.
ids_long = pool.which(LengthCriterion(gt=6))
[which]           LengthCriterion → 29 / 50 IDs (58.0%)
# Short sequences: at most 3 entities.
ids_short = pool.which(LengthCriterion(le=3))
print(f"Length ≤ 3 : {len(ids_short)} / {len(pool)} IDs")
[which]           LengthCriterion → 6 / 50 IDs (12.0%)
Length ≤ 3 : 6 / 50 IDs

Range selection

Combine bounds to select sequences whose length falls in a range.

# Medium sequences: more than 3 and at most 6 entities (3 < length ≤ 6).
ids_medium = pool.which(LengthCriterion(gt=3, le=6))
[which]           LengthCriterion → 15 / 50 IDs (30.0%)
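
As a sanity check on the reported counts: the three criteria used so far (le=3, gt=3 with le=6, and gt=6) cover every possible length exactly once, so their ID counts should add up to the pool size. The sketch below redoes that arithmetic with the numbers taken from the [which] log lines above. The dictionary and variable names are illustrative only, and no tanat call is involved.

```python
# Counts copied from the [which] log lines above.
counts = {
    "le=3": 6,
    "gt=3, le=6": 15,
    "gt=6": 29,
}
pool_size = 50

# The three length ranges are disjoint and exhaustive, so the counts must sum to 50.
total = sum(counts.values())
print(total == pool_size)

# Recompute the percentages shown in the log lines.
for label, n in counts.items():
    print(f"{label:<12} {n:>2} / {pool_size} IDs ({n / pool_size:.1%})")
```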

Subset the pool

Use subset() to obtain a restricted pool from the selected IDs.

pool_long = pool.subset(ids_long)
print(pool_long)
┌────────────────────────────────────────────────┐
│          IntervalSequencePool Summary          │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequences          29
  Store              /home/runner/.tanat/_quick_interval_50580529
  id_column          id

Time Index
─────────────────────────
  Type               Datetime(time_unit='us', time_zone=None) [2000-01-12 06:14:52.240595 → 2025-01-05 21:55:52.963626]
  Columns            ['start', 'end']
  t0                 position=0, anchor=start

Entity Features (2)
─────────────────────────
  • status              String [len 1 → 1]
  • value               Numerical [1 → 100]
# Inspect the length distribution in the subset.
pool_long.describe(by_id=False)
       length    n_unique_entities  temporal_span               mean_duration             median_duration           duration_std
count  29.0      29.0               29                          29                        29                        29
mean   8.551724  8.413793           7328 days, 8:26:00.520809   15 days, 20:58:02.933330  15 days, 23:06:09.672359  8 days, 1:20:48.519965
std    0.985111  1.052794           1134 days, 21:23:06.091385  2 days, 9:56:11.098362    3 days, 22:47:57.361909   1 day, 12:37:21.243486
min    7.0       6.0                4366 days, 10:32:34.098950  9 days, 8:30:53.548390    5 days, 0:11:27.379177    4 days, 16:56:01.574529
25%    8.0       8.0                6845 days 05:07:11.823338   14 days 12:02:41.460306   14 days 17:06:18.669219   7 days 02:18:28.157876
50%    9.0       9.0                7689 days 12:33:58.121600   16 days 03:03:48.430141   15 days 21:59:53.357110   8 days 00:15:55.894990
75%    9.0       9.0                8004 days 19:04:00.961682   17 days 09:54:28.346099   18 days 12:41:38.304297   9 days 06:20:38.367288
max    10.0      10.0               9050 days, 11:50:31.178892  19 days, 12:30:30.076833  24 days, 11:07:49.202588  11 days, 1:30:43.139872


match(): single-sequence evaluation

seq = pool[pool.unique_ids[0]]
seq_len = len(seq)
print(
    f"Sequence {seq.id_value}: length={seq_len}  "
    f"gt=6? {seq.match(LengthCriterion(gt=6))}  "
    f"le=3? {seq.match(LengthCriterion(le=3))}"
)
Sequence 1: length=3  gt=6? False  le=3? True
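
The printed result follows directly from the bound semantics in the parameter table. Below is a minimal pure-Python sketch of that comparison, assuming a hypothetical helper `length_matches`. It is not part of tanat and does not claim to reflect how the library evaluates criteria internally.

```python
from typing import Optional


def length_matches(
    length: int,
    gt: Optional[int] = None,
    ge: Optional[int] = None,
    lt: Optional[int] = None,
    le: Optional[int] = None,
) -> bool:
    """Hypothetical helper: True when `length` satisfies every supplied bound."""
    checks = [
        gt is None or length > gt,   # strictly greater than
        ge is None or length >= ge,  # greater than or equal to
        lt is None or length < lt,   # strictly less than
        le is None or length <= le,  # less than or equal to
    ]
    return all(checks)


seq_len = 3  # length of the sequence shown above
print(length_matches(seq_len, gt=6))  # False
print(length_matches(seq_len, le=3))  # True
```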

