EntityCriterion#
Select sequences or prune entity rows using any Polars expression evaluated against the temporal data.
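| Level | Behaviour |
|---|---|
| which() (sequence-level) | Returns IDs that have at least one row satisfying the expression. |
| filter_entities() (entity-level) | Keeps only the rows where the expression is true. |
| match() (single sequence) | Returns True if the sequence satisfies the criterion. |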
See Criteria for the full reference.
Imports#
import polars as pl
from tanat import build_intervals
from tanat.criterion import EntityCriterion
from tanat.dataset import simulate_intervals, simulate_static
Simulate data#
temporal = simulate_intervals(
    n_ids=50,
    features=["value", "status"],
    seed=42,
)
static = simulate_static(n_ids=50, features=["age", "group"], seed=0)
pool = build_intervals(
    temporal_data=temporal,
    id_column="id",
    start_column="start",
    end_column="end",
    static_data=static,
)
┌─ Interval SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity, time index & static features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (50 sequences · 343 entities · 0.01s)
print(pool)
┌────────────────────────────────────────────────┐
│ IntervalSequencePool Summary │
└────────────────────────────────────────────────┘
Overview
─────────────────────────
Sequences 50
Store /home/runner/.tanat/_quick_interval_26faba44
id_column id
Time Index
─────────────────────────
Type Datetime(time_unit='us', time_zone=None) [2000-01-12 06:14:52.240595 → 2025-01-20 05:35:23.188780]
Columns ['start', 'end']
t0 position=0, anchor=start
Entity Features (2)
─────────────────────────
• status String [len 1 → 1]
• value Numerical [1 → 100]
Static Features (2)
─────────────────────────
• age Numerical [1 → 98]
• group String [len 1 → 1]
# Inspect the unique status values present in the data.
pool.temporal_data()["status"].unique()
<ArrowExtensionArray>
['C', 'D', 'E', 'A', 'B']
Length: 5, dtype: large_string[pyarrow]
which(): sequence-level selection#
Return the IDs of all sequences that have at least one entity row satisfying the expression. The original pool is left unchanged.
# Pick a status value that exists in the data
target_status = "A"
# Select sequences that have at least one entity with that status.
ids_with_status = pool.which(EntityCriterion(query=pl.col("status") == target_status))
[which] EntityCriterion → 31 / 50 IDs (62.0%)
Numeric threshold: sequences with at least one high-value entity.
ids_high_value = pool.which(EntityCriterion(query=pl.col("value") > 80))
[which] EntityCriterion → 36 / 50 IDs (72.0%)
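Combine conditions in a single Polars expression using &. The expression is evaluated per entity row, so both conditions must hold on the same row.
ids_combined = pool.which(
    EntityCriterion(query=(pl.col("status") == target_status) & (pl.col("value") > 80))
)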
[which] EntityCriterion → 9 / 50 IDs (18.0%)
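Because the rule is "at least one row satisfies the expression", a combined expression can select fewer IDs than the intersection of the two single-condition selections. The plain-Python sketch below models that behaviour on hypothetical toy rows. The helper `which_ids` and the data are illustrative assumptions, not part of tanat, and the sketch says nothing about how the library evaluates Polars expressions internally.

```python
# Toy entity rows: (id, status, value). Hypothetical data, not the simulated pool.
rows = [
    (1, "A", 90),   # status and value both match on the same row
    (2, "A", 10),   # status matches here ...
    (2, "B", 95),   # ... but the high value sits on a different row
    (3, "C", 50),   # nothing matches
]


def which_ids(rows, predicate):
    """Return the set of IDs with at least one row satisfying `predicate`."""
    return {seq_id for seq_id, status, value in rows if predicate(status, value)}


ids_status = which_ids(rows, lambda status, value: status == "A")
ids_high = which_ids(rows, lambda status, value: value > 80)
ids_both = which_ids(rows, lambda status, value: status == "A" and value > 80)

print(ids_status)             # IDs 1 and 2
print(ids_high)               # IDs 1 and 2
print(ids_status & ids_high)  # IDs 1 and 2: intersection of the two selections
print(ids_both)               # ID 1 only: both conditions on the same row
```

ID 2 appears in both single-condition selections but not in the combined one, because no single row of that sequence meets both conditions.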
filter_entities(): entity-level pruning#
Return a new pool view that contains only the rows satisfying the expression. The original pool is unchanged. Sequences with zero surviving rows no longer appear in the filtered pool.
filtered = pool.filter_entities(
    EntityCriterion(query=pl.col("status") == target_status)
)
[filter_entities] EntityCriterion → 73 / 343 entities (21.3%) · 19 IDs affected
# Combine two conditions in a single criterion to narrow further.
filtered2 = pool.filter_entities(
    EntityCriterion(query=(pl.col("status") == target_status) & (pl.col("value") > 80))
)
[filter_entities] EntityCriterion → 11 / 343 entities (3.2%) · 41 IDs affected
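As a conceptual model of entity-level pruning, the sketch below filters hypothetical toy rows in plain Python and then checks which IDs still have rows left. `filter_rows` and the data are assumptions made for illustration. This is not tanat's implementation, and it does not attempt to reproduce the "IDs affected" figure in the log above.

```python
from collections import defaultdict

# Toy entity rows: (id, status, value). Hypothetical data.
rows = [
    (1, "A", 90),
    (1, "B", 20),
    (2, "B", 95),
    (3, "A", 40),
]


def filter_rows(rows, predicate):
    """Return a new list with only the rows satisfying `predicate`."""
    return [row for row in rows if predicate(row)]


kept = filter_rows(rows, lambda row: row[1] == "A")

# Group surviving rows by ID to see which sequences remain.
by_id = defaultdict(list)
for row in kept:
    by_id[row[0]].append(row)

all_ids = {row[0] for row in rows}
dropped_ids = all_ids - set(by_id)

print(kept)                 # the status "A" rows of IDs 1 and 3
print(sorted(by_id))        # [1, 3]
print(sorted(dropped_ids))  # [2]
print(len(rows))            # 4: the original list is unchanged
```

ID 1 keeps one of its two rows, while ID 2 has no surviving row and so disappears from the filtered result.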
match(): single-sequence evaluation#
Evaluate the criterion against one sequence at a time. The example uses the result as a truth value.
criterion = EntityCriterion(query=pl.col("status") == target_status)
# Iterate to find the first sequence that matches; None if there is no match.
first_match = next((s for s in pool if s.match(criterion)), None)
if first_match is not None:
    print(f"First matching sequence: id={first_match.id_value}")
First matching sequence: id=2
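The `next(generator, None)` idiom used above stops at the first hit and falls back to `None` when nothing matches. The plain-Python sketch below shows this on hypothetical data. It assumes that a single-sequence match applies the same "at least one row" rule as `which()`, which the example does not state explicitly. The helper `matches` is illustrative and not a tanat function.

```python
# Hypothetical sequences: ID -> list of status values, in insertion order.
sequences = {
    1: ["B", "C"],
    2: ["A", "B"],
    3: ["A"],
}


def matches(statuses, target):
    """True if at least one row of a single sequence has the target status."""
    return any(status == target for status in statuses)


# next() stops at the first hit; the second argument is the fallback value.
first_id = next((i for i, s in sequences.items() if matches(s, "A")), None)
no_id = next((i for i, s in sequences.items() if matches(s, "Z")), None)

print(first_id)  # 2
print(no_id)     # None
```

Comparing the result with `is not None` keeps the "no match" case distinct from a matching object that happens to be falsy.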