PatternCriterion#

Select sequences or extract witness rows based on an ordered pattern of string values in a feature column.

Sentinel          Meaning
ANY ("...")       Zero or more elements: free gap between adjacent sub-patterns.
WILDCARD ("*")    Exactly one element of any value at that position.

Level                Behaviour
which()              IDs whose temporal sequence contains (present=True) or does not contain (present=False) the ordered pattern.
filter_entities()    Keeps the “witness” rows of the greedy first match (present=True), or all non-witness rows (present=False).
match()              Returns True if and only if the pattern is found (present=True) or absent (present=False).

See Criteria for the full reference.

Imports#

from tanat import build_intervals
from tanat.criterion import ANY, WILDCARD, PatternCriterion
from tanat.dataset import simulate_intervals

Simulate data#

temporal = simulate_intervals(n_ids=50, features=["score", "status"], seed=42)

pool = build_intervals(
    temporal_data=temporal,
    id_column="id",
    start_column="start",
    end_column="end",
)
┌─ Interval SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (50 sequences · 343 entities · 0.00s)
print(pool)
┌────────────────────────────────────────────────┐
│          IntervalSequencePool Summary          │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequences          50
  Store              /home/runner/.tanat/_quick_interval_83e4ce61
  id_column          id

Time Index
─────────────────────────
  Type               Datetime(time_unit='us', time_zone=None) [2000-01-12 06:14:52.240595 → 2025-01-20 05:35:23.188780]
  Columns            ['start', 'end']
  t0                 position=0, anchor=start

Entity Features (2)
─────────────────────────
  • score               Numerical [1 → 100]
  • status              String [len 1 → 1]
# Pick two status values that occur in the simulated data.
A = "A"
B = "B"

Single-element pattern#

A plain string (or single-element list) selects sequences that contain at least one entity with that value.

ids_has_A = pool.which(PatternCriterion(feature="status", pattern=A))
[which]           PatternCriterion → 31 / 50 IDs (62.0%)
# Exclusion: sequences that never show status A.
ids_no_A = pool.which(PatternCriterion(feature="status", pattern=A, present=False))
[which]           PatternCriterion → 19 / 50 IDs (38.0%)
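
The two selections above are complementary: 31 + 19 = 50 IDs. The following is a minimal pure-Python sketch of that relationship. It does not call tanat; the `sequences` mapping and the `ids_with` helper are hypothetical stand-ins, and values are compared by plain equality rather than by regular expression.

```python
# Toy data (hypothetical): each ID maps to its ordered status values.
sequences = {
    1: ["A", "B", "C"],
    2: ["B", "C"],
    3: ["C", "A"],
    4: ["B"],
}


def ids_with(value, present=True):
    """IDs whose sequence contains `value` (or lacks it when present=False)."""
    return {sid for sid, vals in sequences.items() if (value in vals) == present}


has_a = ids_with("A")
no_a = ids_with("A", present=False)

# The two selections are disjoint and together cover every ID.
assert has_a.isdisjoint(no_a)
assert has_a | no_a == set(sequences)
```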

Adjacent pattern: A directly followed by B#

[A, B] matches only if B appears immediately after A in the ordered sequence of entities.

ids_adj = pool.which(PatternCriterion(feature="status", pattern=[A, B]))
[which]           PatternCriterion → 10 / 50 IDs (20.0%)
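
To make the adjacency rule concrete, here is a small sketch on plain Python lists. It is an illustration of the semantics described above, not tanat's implementation; `contains_adjacent` is a hypothetical helper that compares values by equality.

```python
def contains_adjacent(values, pattern):
    """True if `pattern` occurs as a contiguous run inside `values`."""
    n, m = len(values), len(pattern)
    # Slide a window of len(pattern) over the sequence.
    return any(values[i:i + m] == pattern for i in range(n - m + 1))


print(contains_adjacent(["C", "A", "B"], ["A", "B"]))  # B directly after A
print(contains_adjacent(["A", "C", "B"], ["A", "B"]))  # C sits in between
```

The first call finds the pattern; the second does not, because an element separates A from B.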

Free gap: A anywhere before B#

Insert ANY between elements to allow an arbitrary number of rows (including none) in between.

ids_gap = pool.which(PatternCriterion(feature="status", pattern=[A, ANY, B]))
[which]           PatternCriterion → 16 / 50 IDs (32.0%)
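
Because the gap may be empty, every sequence matched by [A, B] is also matched by [A, ANY, B], which is consistent with 16 IDs here versus 10 for the adjacent pattern. A minimal sketch of "in order, with free gaps" on plain lists follows. It assumes equality comparison and a gap between every pair of elements; `contains_in_order` is a hypothetical helper, not part of tanat.

```python
def contains_in_order(values, pattern):
    """True if the pattern elements appear in order, with any gaps in between."""
    it = iter(values)
    # `p in it` advances the iterator until p is found, so each element
    # is searched for only after the position of the previous one.
    return all(p in it for p in pattern)


print(contains_in_order(["A", "C", "C", "B"], ["A", "B"]))  # gap of two
print(contains_in_order(["A", "B"], ["A", "B"]))            # empty gap
print(contains_in_order(["B", "A"], ["A", "B"]))            # wrong order
```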

Wildcard: exactly one element between A and B#

WILDCARD matches exactly one entity of any value.

ids_wildcard = pool.which(PatternCriterion(feature="status", pattern=[A, WILDCARD, B]))
[which]           PatternCriterion → 5 / 50 IDs (10.0%)

Combining sentinels#

ANY and WILDCARD can be mixed freely with literal values in one pattern. The example below uses ANY followed by a repeated literal: A, then a gap of any length, then two consecutive B’s.

ids_double_B = pool.which(PatternCriterion(feature="status", pattern=[A, ANY, B, B]))
[which]           PatternCriterion → 4 / 50 IDs (8.0%)
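
The sketch below shows one way both sentinels could be interpreted together on a plain list. It is a simple backtracking matcher written for illustration only: `ANY_GAP`, `ONE`, `matches_from` and `contains` are hypothetical names, values are compared by equality, and nothing here reflects how tanat evaluates patterns internally.

```python
ANY_GAP = object()  # stand-in for ANY: zero or more elements
ONE = object()      # stand-in for WILDCARD: exactly one element


def matches_from(values, pattern, i, j=0):
    """True if pattern[j:] matches a run of `values` starting at index i."""
    if j == len(pattern):
        return True
    if pattern[j] is ANY_GAP:
        # Try every gap length, from zero up to the rest of the sequence.
        return any(
            matches_from(values, pattern, k, j + 1)
            for k in range(i, len(values) + 1)
        )
    if i == len(values):
        return False
    if pattern[j] is ONE or pattern[j] == values[i]:
        return matches_from(values, pattern, i + 1, j + 1)
    return False


def contains(values, pattern):
    """True if the pattern matches starting at any position."""
    return any(matches_from(values, pattern, s) for s in range(len(values) + 1))


print(contains(["A", "C", "B", "B"], ["A", ANY_GAP, "B", "B"]))
print(contains(["A", "B", "C", "B"], ["A", ANY_GAP, "B", "B"]))
print(contains(["A", "C", "B"], ["A", ONE, "B"]))
print(contains(["A", "B"], ["A", ONE, "B"]))
```

Only the first and third calls succeed: the second sequence never has two B's in a row, and the fourth has no element between A and B for the single-element wildcard to consume.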

Regex and case options#

By default elements are treated as regular expressions (regex=True). Use regex=False for literal substring matching. Add case_sensitive=False for case-insensitive matching.

# Literal, case-insensitive: same result as the single-element match on A above.
a_lower = A.lower()
ids_ci = pool.which(
    PatternCriterion(
        feature="status", pattern=a_lower, regex=False, case_sensitive=False
    )
)
[which]           PatternCriterion → 31 / 50 IDs (62.0%)
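
The difference between the two modes can be sketched with Python's standard `re` module. This is an assumption-laden illustration, not tanat's code: `element_matches` is a hypothetical helper, and it assumes search-style (substring) matching in both modes, which the text above only states for regex=False.

```python
import re


def element_matches(value, element, regex=True, case_sensitive=True):
    """Test one pattern element against one string value."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # In literal mode, escape the element so metacharacters lose their meaning.
    expr = element if regex else re.escape(element)
    return re.search(expr, value, flags) is not None


print(element_matches("B", "."))                # "." as regex: any character
print(element_matches("B", ".", regex=False))   # "." as literal: no dot in "B"
print(element_matches("A", "a", regex=False))   # case differs
print(element_matches("A", "a", regex=False, case_sensitive=False))
```

A practical consequence is that status values containing characters such as `.`, `+` or `(` are safer to match with regex=False unless a regular expression is really intended.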

filter_entities(): witness rows#

With present=True (default), only the greedy first-match witness rows are kept. Each ID contributes at most len(pattern) rows.

pattern = [A, B]
filtered = pool.filter_entities(PatternCriterion(feature="status", pattern=pattern))
[filter_entities] PatternCriterion → 20 / 343 entities (5.8%) · 40 IDs affected
# inspect length of filtered sequences
filtered.describe(by_id=False)
       length  n_unique_entities  temporal_span               mean_duration            median_duration          duration_std
count  10.0    10.0               10                          10                       10                       10
mean   2.0     2.0                1222 days, 16:41:01.780923  17 days, 9:12:34.286665  17 days, 9:12:34.286665  10 days, 21:20:52.699737
std    0.0     0.0                888 days, 11:31:10.541073   4 days, 9:57:07.565975   4 days, 9:57:07.565975   4 days, 13:03:56.596301
min    2.0     2.0                190 days, 10:47:26.668664   10 days, 5:27:50.340245  10 days, 5:27:50.340245  1 day, 0:52:28.011947
25%    2.0     2.0                776 days 23:51:01.762421    14 days 02:30:11.993972  14 days 02:30:11.993972  9 days 02:12:51.814717
50%    2.0     2.0                931 days 02:36:23.198788    17 days 22:58:39.097644  17 days 22:58:39.097644  12 days 08:39:24.691320
75%    2.0     2.0                1366 days 17:30:50.403257   20 days 08:55:46.355679  20 days 08:55:46.355679  13 days 15:42:30.296421
max    2.0     2.0                3302 days, 22:47:11.638428  24 days, 8:55:27.619821  24 days, 8:55:27.619821  16 days, 10:23:19.726099
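
The summary is consistent with the log line: 10 matching sequences with 2 witness rows each give the 20 retained entities. A rough sketch of witness-row selection on a plain list is shown below. It assumes that "greedy first match" means the earliest position at which the pattern matches, handles only literal adjacent patterns, and uses hypothetical helpers (`witness_indices`, `filter_rows`) that are not part of tanat.

```python
def witness_indices(values, pattern):
    """Row indices of the earliest contiguous match, or [] if there is none."""
    m = len(pattern)
    for i in range(len(values) - m + 1):
        if values[i:i + m] == pattern:
            return list(range(i, i + m))
    return []


def filter_rows(values, pattern, present=True):
    """Keep the witness rows (present=True) or every other row (present=False)."""
    witnesses = set(witness_indices(values, pattern))
    return [v for i, v in enumerate(values) if (i in witnesses) == present]


rows = ["C", "A", "B", "A", "B"]
print(witness_indices(rows, ["A", "B"]))             # first match only
print(filter_rows(rows, ["A", "B"]))                 # witness rows
print(filter_rows(rows, ["A", "B"], present=False))  # everything else
```

Note that the second A, B pair is not part of the witness set: only the first match contributes rows, which is why each ID yields at most len(pattern) rows.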


match(): single-sequence evaluation#

criterion = PatternCriterion(feature="status", pattern=[A, B])
# Find all matching sequences by iterating.
matching_seqs = [s for s in pool if s.match(criterion)]
print(f"{len(matching_seqs)} sequence(s) contain A→B")
10 sequence(s) contain A→B
