PatternCriterion#
Select sequences or extract witness rows based on an ordered pattern of string values in a feature column.
| Sentinel | Meaning |
|---|---|
| ANY | Zero or more elements: free gap between adjacent sub-patterns. |
| WILDCARD | Exactly one element of any value at that position. |
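| Level | Behaviour |
|---|---|
| which() | IDs whose temporal sequence contains (present=True) or does not contain (present=False) the pattern. |
| filter_entities() | Keeps the “witness” rows of the greedy first match (present=True). |
| match() | Returns whether a single sequence matches the pattern. |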
See Criteria for the full reference.
Imports#
from tanat import build_intervals
from tanat.criterion import ANY, WILDCARD, PatternCriterion
from tanat.dataset import simulate_intervals
Simulate data#
temporal = simulate_intervals(n_ids=50, features=["score", "status"], seed=42)
pool = build_intervals(
    temporal_data=temporal,
    id_column="id",
    start_column="start",
    end_column="end",
)
┌─ Interval SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (50 sequences · 343 entities · 0.00s)
print(pool)
┌────────────────────────────────────────────────┐
│ IntervalSequencePool Summary │
└────────────────────────────────────────────────┘
Overview
─────────────────────────
Sequences 50
Store /home/runner/.tanat/_quick_interval_83e4ce61
id_column id
Time Index
─────────────────────────
Type Datetime(time_unit='us', time_zone=None) [2000-01-12 06:14:52.240595 → 2025-01-20 05:35:23.188780]
Columns ['start', 'end']
t0 position=0, anchor=start
Entity Features (2)
─────────────────────────
• score Numerical [1 → 100]
• status String [len 1 → 1]
# Pick two status values that exist in the temporal data.
A = "A"
B = "B"
Single-element pattern#
A plain string (or single-element list) selects sequences that contain at least one entity with that value.
ids_has_A = pool.which(PatternCriterion(feature="status", pattern=A))
[which] PatternCriterion → 31 / 50 IDs (62.0%)
# Exclusion: sequences that never show status A.
ids_no_A = pool.which(PatternCriterion(feature="status", pattern=A, present=False))
[which] PatternCriterion → 19 / 50 IDs (38.0%)
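The two calls above partition the pool: 31 IDs contain A and 19 do not, which sums to the 50 sequences. The following is a minimal pure-Python sketch of that complement logic on toy data. The helper `which_ids` and the `toy` dictionary are hypothetical and only illustrate the idea; they are not part of tanat.

```python
def which_ids(sequences, value, present=True):
    """Return IDs whose sequence contains `value` (or lacks it when present=False).

    Hypothetical helper for illustration only; not tanat's implementation.
    """
    return {
        seq_id
        for seq_id, values in sequences.items()
        if (value in values) == present
    }


# Toy data: sequence ID -> ordered status values.
toy = {1: ["A", "B"], 2: ["C"], 3: ["B", "A", "A"]}

has_a = which_ids(toy, "A")
no_a = which_ids(toy, "A", present=False)

print(sorted(has_a), sorted(no_a))  # [1, 3] [2]
# The two selections are disjoint and together cover every ID.
print(has_a | no_a == set(toy), has_a & no_a == set())  # True True
```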
Adjacent pattern: A directly followed by B#
[A, B] matches only if B appears immediately after A in the ordered
sequence of entities.
ids_adj = pool.which(PatternCriterion(feature="status", pattern=[A, B]))
[which] PatternCriterion → 10 / 50 IDs (20.0%)
Free gap: A anywhere before B#
Insert ANY between elements to allow an arbitrary
number of rows in between.
ids_gap = pool.which(PatternCriterion(feature="status", pattern=[A, ANY, B]))
[which] PatternCriterion → 16 / 50 IDs (32.0%)
Wildcard: exactly one element between A and B#
WILDCARD matches exactly one entity of any value.
ids_wildcard = pool.which(PatternCriterion(feature="status", pattern=[A, WILDCARD, B]))
[which] PatternCriterion → 5 / 50 IDs (10.0%)
Combining sentinels#
ANY and WILDCARD can be mixed freely with literal values in the same pattern.
The example here uses ANY only: A, then a gap of any length, then two B’s in a row.
ids_double_B = pool.which(PatternCriterion(feature="status", pattern=[A, ANY, B, B]))
[which] PatternCriterion → 4 / 50 IDs (8.0%)
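To make the sentinel semantics concrete, here is a small pure-Python matcher that follows the rules described above: ANY absorbs zero or more elements and WILDCARD absorbs exactly one. It is a sketch under simplifying assumptions, not tanat's implementation: the `ANY` and `WILDCARD` objects are local stand-ins, elements are compared with plain equality rather than as regular expressions, and the function names are hypothetical.

```python
# Local stand-ins for the sentinels; NOT the objects exported by tanat.
ANY = object()       # zero or more elements
WILDCARD = object()  # exactly one element of any value


def matches_at(values, pattern, i=0, j=0):
    """Return True if pattern[j:] matches a prefix of values[i:]."""
    if j == len(pattern):
        return True  # the whole pattern has been consumed
    element = pattern[j]
    if element is ANY:
        # Try every gap length, from zero up to the rest of the sequence.
        return any(
            matches_at(values, pattern, k, j + 1)
            for k in range(i, len(values) + 1)
        )
    if i == len(values):
        return False  # pattern still needs an element but none is left
    if element is WILDCARD or values[i] == element:
        return matches_at(values, pattern, i + 1, j + 1)
    return False


def contains_pattern(values, pattern):
    """Return True if the pattern occurs anywhere in the sequence."""
    return any(
        matches_at(values, pattern, start)
        for start in range(len(values) + 1)
    )


seq = ["A", "C", "B", "B"]
print(contains_pattern(seq, ["A", "B"]))            # False: C sits between A and B
print(contains_pattern(seq, ["A", ANY, "B"]))       # True
print(contains_pattern(seq, ["A", WILDCARD, "B"]))  # True: exactly one element in between
print(contains_pattern(seq, ["A", ANY, "B", "B"]))  # True
```

Because ANY may match an empty gap, every sequence matched by [A, B] is also matched by [A, ANY, B]. This is consistent with the counts above, where the free-gap pattern selects 16 IDs and the adjacent pattern selects 10.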
Regex and case options#
By default elements are treated as regular expressions (regex=True).
Use regex=False for literal substring matching.
Add case_sensitive=False for case-insensitive matching.
# Literal, case-insensitive: same result as the exact match above.
a_lower = A.lower()
ids_ci = pool.which(
    PatternCriterion(
        feature="status", pattern=a_lower, regex=False, case_sensitive=False
    )
)
[which] PatternCriterion → 31 / 50 IDs (62.0%)
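The difference between the two modes can be illustrated with Python's standard `re` module. The helper below is a hypothetical per-element test, not tanat's code. It assumes that a regex element may match anywhere in the value (`re.search`); the library's actual anchoring rules are not stated here and may differ.

```python
import re


def element_matches(value, element, regex=True, case_sensitive=True):
    """Hypothetical per-element test; tanat's real matching rules may differ."""
    if regex:
        flags = 0 if case_sensitive else re.IGNORECASE
        # Assumption: the regex may match anywhere in the value.
        return re.search(element, value, flags) is not None
    if not case_sensitive:
        value, element = value.lower(), element.lower()
    return element in value  # literal substring test


print(element_matches("A", "a"))                                     # False
print(element_matches("A", "a", regex=False, case_sensitive=False))  # True
print(element_matches("AxB", "A.B"))               # True: "." is a regex metacharacter
print(element_matches("AxB", "A.B", regex=False))  # False: "." is taken literally
```

When values contain characters such as `.` or `+` and regex mode is in use, `re.escape` turns a literal string into a pattern that matches only itself.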
filter_entities(): witness rows#
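With present=True (default), only the witness rows of the greedy first match
are kept, so each ID contributes at most len(pattern) rows. For [A, B], the 10
matching IDs keep 2 rows each, which accounts for the 20 entities reported below.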
pattern = [A, B]
filtered = pool.filter_entities(PatternCriterion(feature="status", pattern=pattern))
[filter_entities] PatternCriterion → 20 / 343 entities (5.8%) · 40 IDs affected
# Inspect the lengths of the filtered sequences.
filtered.describe(by_id=False)
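The idea of witness rows can be sketched in plain Python for a pattern without sentinels. The helper below is hypothetical and reads "greedy first match" as the leftmost contiguous occurrence, which is an assumption about tanat's behaviour. It does not cover how rows absorbed by ANY or WILDCARD are treated.

```python
def first_match_rows(values, pattern):
    """Return row indices of the first contiguous occurrence of pattern, or [].

    Hypothetical helper; handles literal patterns only (no sentinels).
    """
    n = len(pattern)
    for start in range(len(values) - n + 1):
        if list(values[start:start + n]) == list(pattern):
            return list(range(start, start + n))
    return []


statuses = ["C", "A", "B", "A", "B"]
rows = first_match_rows(statuses, ["A", "B"])
print(rows)                         # [1, 2]: the later A, B pair at rows 3-4 is ignored
print([statuses[i] for i in rows])  # ['A', 'B']
print(first_match_rows(["B", "A"], ["A", "B"]))  # []: no match, so no rows are kept
```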
match(): single-sequence evaluation#
criterion = PatternCriterion(feature="status", pattern=[A, B])
# Find all matching sequences by iterating.
matching_seqs = [s for s in pool if s.match(criterion)]
print(f"{len(matching_seqs)} sequence(s) contain A→B")
10 sequence(s) contain A→B