PatternCriterion#

Select sequences or extract witness rows based on an ordered pattern of string values in a feature column.

Sentinel	Meaning
`ANY` (`"..."`)	Zero or more elements: free gap between adjacent sub-patterns.
`WILDCARD` (`"*"`)	Exactly one element of any value at that position.

Level	Behaviour
`which()`	IDs whose temporal sequence contains (`present=True`) or does not contain (`present=False`) the ordered pattern.
`filter_entities()`	Keeps the “witness” rows of the greedy first match (`present=True`), or all non-witness rows (`present=False`).
`match()`	Returns `True` iff the pattern is found in a single sequence (`present=True`) or is absent from it (`present=False`).

See Criteria for the full reference.
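To make the two sentinels concrete, here is a minimal pure-Python sketch of the matching rules described in the tables above. It is not tanat's implementation: `contains_pattern` and `_match_at` are hypothetical helpers, the sentinels are local stand-ins using the string values shown in the table, and elements are compared by plain equality rather than as regular expressions.

```python
# Local stand-ins for the sentinels (string values taken from the table above).
ANY = "..."
WILDCARD = "*"


def _match_at(values, i, pattern, j):
    """Return True if pattern[j:] matches values starting exactly at index i."""
    if j == len(pattern):
        return True
    if pattern[j] == ANY:
        # A free gap may swallow zero or more elements.
        return any(
            _match_at(values, k, pattern, j + 1) for k in range(i, len(values) + 1)
        )
    if i >= len(values):
        return False
    if pattern[j] == WILDCARD or values[i] == pattern[j]:
        # WILDCARD consumes exactly one element, whatever its value.
        return _match_at(values, i + 1, pattern, j + 1)
    return False


def contains_pattern(values, pattern, present=True):
    """Hypothetical helper: is the ordered pattern found anywhere in values?"""
    found = any(
        _match_at(values, start, pattern, 0) for start in range(len(values) + 1)
    )
    return found if present else not found


seq = ["A", "C", "B", "B"]
print(contains_pattern(seq, ["A", "B"]))                 # False: C sits between A and B
print(contains_pattern(seq, ["A", ANY, "B"]))            # True
print(contains_pattern(seq, ["A", WILDCARD, "B"]))       # True
print(contains_pattern(seq, ["A", "B"], present=False))  # True
```

The pattern is searched at every start position, which mirrors the idea that a pattern may occur anywhere in the sequence.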

Imports#

from tanat import build_intervals
from tanat.criterion import ANY, WILDCARD, PatternCriterion
from tanat.dataset import simulate_intervals

Simulate data#

temporal = simulate_intervals(n_ids=50, features=["score", "status"], seed=42)

pool = build_intervals(
    temporal_data=temporal,
    id_column="id",
    start_column="start",
    end_column="end",
)

┌─ Interval SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (50 sequences · 343 entities · 0.00s)

print(pool)

┌────────────────────────────────────────────────┐
│          IntervalSequencePool Summary          │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequences          50
  Store              /home/runner/.tanat/_quick_interval_c07fcc9f
  id_column          id

Time Index
─────────────────────────
  Type               Datetime(time_unit='us', time_zone=None) [2000-01-12 06:14:52.240595 → 2025-01-20 05:35:23.188780]
  Columns            ['start', 'end']
  t0                 position=0, anchor=start

Entity Features (2)
─────────────────────────
  • score               Numerical [1 → 100]
  • status              String [len 1 → 1]

# Pick two status values that occur in the simulated data.
A = "A"
B = "B"

Single-element pattern#

A plain string (or single-element list) selects sequences that contain at least one entity with that value.

ids_has_A = pool.which(PatternCriterion(feature="status", pattern=A))

[which]           PatternCriterion → 31 / 50 IDs (62.0%)

# Exclusion: sequences that never show status A.
ids_no_A = pool.which(PatternCriterion(feature="status", pattern=A, present=False))

[which]           PatternCriterion → 19 / 50 IDs (38.0%)
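The two counts above add up to the pool size (31 + 19 = 50), because `present=False` selects the complement of `present=True`. The sketch below illustrates that relationship on made-up data; `which_ids` is a hypothetical helper working on a plain dict, not part of tanat.

```python
# Toy data: ID -> ordered status values (made up for illustration).
sequences = {
    1: ["A", "B"],
    2: ["C", "C"],
    3: ["B", "A", "C"],
    4: ["B"],
}


def which_ids(sequences, value, present=True):
    """Hypothetical helper: IDs whose sequence contains (or lacks) `value`."""
    return {sid for sid, values in sequences.items() if (value in values) == present}


has_a = which_ids(sequences, "A")
no_a = which_ids(sequences, "A", present=False)

# The two selections never overlap and together cover every ID.
assert has_a.isdisjoint(no_a)
assert has_a | no_a == set(sequences)
print(sorted(has_a), sorted(no_a))  # [1, 3] [2, 4]
```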

Adjacent pattern: A directly followed by B#

[A, B] matches only if B appears immediately after A in the ordered sequence of entities.

ids_adj = pool.which(PatternCriterion(feature="status", pattern=[A, B]))

[which]           PatternCriterion → 10 / 50 IDs (20.0%)

Free gap: A anywhere before B#

Insert ANY between two elements to allow any number of rows, including none, between them.

ids_gap = pool.which(PatternCriterion(feature="status", pattern=[A, ANY, B]))

[which]           PatternCriterion → 16 / 50 IDs (32.0%)
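Because a free gap may be empty, every sequence that matches the adjacent pattern [A, B] also matches [A, ANY, B], which is consistent with the counts reported (10 and 16 IDs). A small standalone sketch of the two checks on plain lists follows; `has_adjacent` and `has_ordered` are hypothetical helpers using equality comparison, not tanat functions.

```python
def has_adjacent(values, first, second):
    """True if `second` directly follows `first` somewhere in values."""
    return any(a == first and b == second for a, b in zip(values, values[1:]))


def has_ordered(values, first, second):
    """True if `second` occurs at any position after some `first`."""
    if first not in values:
        return False
    # Starting from the earliest `first` leaves the largest tail to search.
    return second in values[values.index(first) + 1:]


print(has_adjacent(["A", "C", "B"], "A", "B"))  # False
print(has_ordered(["A", "C", "B"], "A", "B"))   # True
print(has_adjacent(["A", "B"], "A", "B"))       # True
print(has_ordered(["A", "B"], "A", "B"))        # True: the gap may be empty
print(has_ordered(["B", "A"], "A", "B"))        # False: order matters
```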

Wildcard: exactly one element between A and B#

WILDCARD matches exactly one entity of any value.

ids_wildcard = pool.which(PatternCriterion(feature="status", pattern=[A, WILDCARD, B]))

[which]           PatternCriterion → 5 / 50 IDs (10.0%)
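Unlike a free gap, a wildcard fixes the distance between its neighbours: A and B must be exactly two positions apart. The following sketch shows that constraint on plain lists, assuming equality comparison; `has_one_between` is a hypothetical helper, not a tanat function.

```python
def has_one_between(values, first, second):
    """True if `first` and `second` are separated by exactly one element."""
    # Pair each element with the one two positions later.
    return any(a == first and b == second for a, b in zip(values, values[2:]))


print(has_one_between(["A", "C", "B"], "A", "B"))       # True
print(has_one_between(["A", "B"], "A", "B"))            # False: nothing in between
print(has_one_between(["A", "C", "C", "B"], "A", "B"))  # False: two in between
print(has_one_between(["A", "B", "B"], "A", "B"))       # True: the middle B fills the wildcard
```

This is why the wildcard selection is not a superset of the adjacent one: a sequence whose only A→B occurrence is adjacent does not qualify.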

Combining sentinels#

ANY and WILDCARD can be combined freely with literal elements in a single pattern. Here: A, then a gap of any length, then two consecutive B’s.

ids_double_B = pool.which(PatternCriterion(feature="status", pattern=[A, ANY, B, B]))

[which]           PatternCriterion → 4 / 50 IDs (8.0%)
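One way to reason about mixed patterns is to translate them into an ordinary regular expression over the concatenated statuses. This only works as written because each status here is a single character (the pool summary reports `String [len 1 → 1]`). The sketch is an analogy, not how tanat evaluates patterns; `pattern_to_regex` and `contains` are hypothetical helpers, and literal elements are escaped rather than interpreted as regexes.

```python
import re

# Local stand-ins for the sentinels (string values from the sentinel table).
ANY = "..."
WILDCARD = "*"


def pattern_to_regex(pattern):
    """Translate a list pattern into a regex over single-character statuses."""
    parts = []
    for element in pattern:
        if element == ANY:
            parts.append(".*")  # zero or more statuses
        elif element == WILDCARD:
            parts.append(".")   # exactly one status
        else:
            parts.append(re.escape(element))
    return re.compile("".join(parts))


def contains(values, pattern):
    """True if the translated regex is found in the joined status string."""
    return pattern_to_regex(pattern).search("".join(values)) is not None


print(pattern_to_regex(["A", ANY, "B", "B"]).pattern)         # A.*BB
print(contains(list("ACBB"), ["A", ANY, "B", "B"]))           # True
print(contains(list("ACBCB"), ["A", ANY, "B", "B"]))          # False: the B's are not consecutive
print(contains(list("ABXB"), ["A", WILDCARD, ANY, "B"]))      # True
print(contains(list("AB"), ["A", WILDCARD, ANY, "B"]))        # False: at least one element is required
```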

Regex and case options#

By default, each element is treated as a regular expression (regex=True). Use regex=False for literal substring matching, and add case_sensitive=False for case-insensitive matching.

# Literal, case-insensitive: same result as the exact match above.
a_lower = A.lower()
ids_ci = pool.which(
    PatternCriterion(
        feature="status", pattern=a_lower, regex=False, case_sensitive=False
    )
)

[which]           PatternCriterion → 31 / 50 IDs (62.0%)
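The difference between the two modes can be illustrated with the standard `re` module. This is a sketch under assumptions: `element_matches` is a hypothetical helper, and using `re.search` for both modes is a guess at the behaviour described above, not a statement about tanat's internals.

```python
import re


def element_matches(value, element, regex=True, case_sensitive=True):
    """Hypothetical per-element test mirroring the two options described above."""
    flags = 0 if case_sensitive else re.IGNORECASE
    # With regex=False the element is escaped, so metacharacters lose their meaning.
    expr = element if regex else re.escape(element)
    return re.search(expr, value, flags) is not None


print(element_matches("A", "a"))                        # False: case differs
print(element_matches("A", "a", case_sensitive=False))  # True
print(element_matches("A", "A|B"))                      # True: "|" is an alternation
print(element_matches("A", "A|B", regex=False))         # False: "A|B" is not a substring of "A"
```

The last two lines show why regex=False matters when status values contain characters such as `|`, `.` or `*`.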

`filter_entities()`: witness rows#

With present=True (default), only the greedy first-match witness rows are kept. Each ID contributes at most len(pattern) rows.

pattern = [A, B]
filtered = pool.filter_entities(PatternCriterion(feature="status", pattern=pattern))

[filter_entities] PatternCriterion → 20 / 343 entities (5.8%) · 40 IDs affected

# inspect length of filtered sequences
filtered.describe(by_id=False)

	length	n_unique_entities	temporal_span	mean_duration	median_duration	duration_std
count	10.0	10.0	10	10	10	10
mean	2.0	2.0	1222 days, 16:41:01.780923	17 days, 9:12:34.286665	17 days, 9:12:34.286665	10 days, 21:20:52.699737
std	0.0	0.0	888 days, 11:31:10.541073	4 days, 9:57:07.565975	4 days, 9:57:07.565975	4 days, 13:03:56.596301
min	2.0	2.0	190 days, 10:47:26.668664	10 days, 5:27:50.340245	10 days, 5:27:50.340245	1 day, 0:52:28.011947
25%	2.0	2.0	776 days 23:51:01.762421	14 days 02:30:11.993972	14 days 02:30:11.993972	9 days 02:12:51.814717
50%	2.0	2.0	931 days 02:36:23.198788	17 days 22:58:39.097644	17 days 22:58:39.097644	12 days 08:39:24.691320
75%	2.0	2.0	1366 days 17:30:50.403257	20 days 08:55:46.355679	20 days 08:55:46.355679	13 days 15:42:30.296421
max	2.0	2.0	3302 days, 22:47:11.638428	24 days, 8:55:27.619821	24 days, 8:55:27.619821	16 days, 10:23:19.726099
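The summary is consistent with the log line: 10 matching IDs with 2 witness rows each give the 20 entities kept. To illustrate what "witness rows" means for a pattern without sentinels, here is a standalone sketch on a plain list. `first_window` and `split_rows` are hypothetical helpers using equality comparison; how tanat treats rows consumed by ANY or WILDCARD is not covered here.

```python
def first_window(values, pattern):
    """Row indices of the first contiguous occurrence of pattern, or None."""
    n = len(pattern)
    for start in range(len(values) - n + 1):
        if values[start:start + n] == pattern:
            return list(range(start, start + n))
    return None


def split_rows(values, pattern):
    """Split row indices into witness rows and all remaining rows."""
    witness = first_window(values, pattern) or []
    others = [i for i in range(len(values)) if i not in witness]
    return witness, others


values = ["C", "A", "B", "A", "B"]
witness, others = split_rows(values, ["A", "B"])
print(witness)  # [1, 2]: only the first A→B occurrence counts
print(others)   # [0, 3, 4]: what present=False would keep

print(split_rows(["B", "A"], ["A", "B"]))  # ([], [0, 1]): no match, no witness rows
```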

`match()`: single-sequence evaluation#

criterion = PatternCriterion(feature="status", pattern=[A, B])
# Find all matching sequences by iterating.
matching_seqs = [s for s in pool if s.match(criterion)]
print(f"{len(matching_seqs)} sequence(s) contain A→B")

10 sequence(s) contain A→B

Total running time of the script: (0 minutes 0.126 seconds)
