Sequence Metrics: LinearPairwise#
This example demonstrates LinearPairwiseSequenceMetric,
which computes position-wise distances between sequences using an entity metric,
then aggregates them (mean by default).
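As a rough illustration of the idea, the sketch below reimplements a position-wise distance in plain Python. It is not tanat's implementation: the names `hamming` and `linear_pairwise_distance`, and the handling of unequal lengths (each unmatched trailing position is charged `padding_penalty`), are assumptions made for illustration only.

```python
from statistics import mean


def hamming(a, b, mismatch_cost=1.0):
    """Return 0.0 when two states are equal, otherwise mismatch_cost."""
    return 0.0 if a == b else mismatch_cost


def linear_pairwise_distance(seq_a, seq_b, entity_metric=hamming,
                             agg_fun=mean, padding_penalty=1.0):
    """Compare two sequences position by position, then aggregate.

    Assumption: positions present in only one sequence are charged
    padding_penalty. This mirrors the parameter name used in the
    example, not necessarily the library's exact behaviour.
    """
    shared = min(len(seq_a), len(seq_b))
    longest = max(len(seq_a), len(seq_b))
    # Per-position costs over the overlapping prefix.
    costs = [entity_metric(a, b) for a, b in zip(seq_a, seq_b)]
    # One penalty per position that exists in only the longer sequence.
    costs += [padding_penalty] * (longest - shared)
    return agg_fun(costs) if costs else 0.0


a = ["A", "B", "B", "C"]
b = ["A", "C", "B"]
# Costs are [0, 1, 0] for the shared positions plus one padding penalty of 1.
print(linear_pairwise_distance(a, b))  # 0.5
```

Under this assumed rule, a mean-aggregated distance is always the total cost divided by the length of the longer sequence, so with unit costs it stays within [0, 1].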
Setup#
import matplotlib.pyplot as plt
import polars as pl
from tanat import build_states
from tanat.dataset import simulate_states
from tanat.metric.entity import HammingEntityMetric
from tanat.metric.sequence import LinearPairwiseSequenceMetric
Generate synthetic data#
SEED = 42
N_IDS = 80
raw_df = simulate_states(
    n_ids=N_IDS,
    seq_length_range=(3, 8),
    features=["score", "status"],
    seed=SEED,
)
pool = build_states(raw_df, id_column="id", start_column="start", end_column="end")
┌─ State SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (80 sequences · 450 entities · 0.01s)
# Cast features to categorical
pool.cast_features({"status": pl.Categorical})
print(pool)
┌────────────────────────────────────────────────┐
│ StateSequencePool Summary │
└────────────────────────────────────────────────┘
Overview
─────────────────────────
Sequences 80
Store /home/runner/.tanat/_quick_state_8fd28d80
id_column id
Time Index
─────────────────────────
Type Datetime(time_unit='us', time_zone=None) [2000-01-12 06:14:52.240595 → 2025-05-09 18:37:51.409412]
Columns ['start', 'end']
t0 position=0, anchor=start
Entity Features (2)
─────────────────────────
• score Numerical [1 → 100]
• status Categorical (5 categories)
Define metric#
hamming = HammingEntityMetric(entity_feature="status")
metric = LinearPairwiseSequenceMetric(
    entity_metric=hamming, agg_fun="mean", padding_penalty=1.0
)
print(metric)
LinearPairwiseSequenceMetric(settings=LinearPairwiseSettings(entity_metric=HammingEntityMetric(settings=HammingSettings(entity_feature='status', cost=None, mismatch_cost=1.0)), agg_fun='mean', padding_penalty=1.0))
Compute distance between a single pair#
ids = pool.unique_ids
dist = metric(pool[ids[0]], pool[ids[1]])
print(f"Distance between {ids[0]} and {ids[1]}: {dist:.4f}")
Distance between 1 and 2: 0.8571
Compute full pairwise distance matrix#
dm = metric.compute_matrix(pool)
print(f"Distance matrix shape: {dm.shape}")
┌─ LinearPairwiseSequenceMetric
│
│ Chunks: 0%| | 0/1 [00:00<?, ?it/s]
│ Chunks: 100%|██████████| 1/1 [00:01<00:00, 1.15s/it]
│
└─ Done (80 sequences · 1.16s)
Distance matrix shape: (80, 80)
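For intuition about what a square distance matrix of this kind contains, the following sketch builds a small one by applying a position-wise mismatch distance to every pair of toy sequences. The helper `toy_distance` and its padding rule are assumptions for illustration and stand in for the library's metric; nothing here calls tanat.

```python
def toy_distance(seq_a, seq_b, padding_penalty=1.0):
    """Mean position-wise mismatch; unmatched positions cost padding_penalty (assumed rule)."""
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 0.0
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    padded = longest - min(len(seq_a), len(seq_b))
    return (mismatches + padding_penalty * padded) / longest


sequences = [["A", "B", "C"], ["A", "B"], ["C", "B", "C", "A"]]
n = len(sequences)
matrix = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        # The toy distance is symmetric, so each pair is computed once and mirrored.
        d = toy_distance(sequences[i], sequences[j])
        matrix[i][j] = d
        matrix[j][i] = d

print(len(matrix), len(matrix[0]))  # 3 3
for row in matrix:
    print([round(value, 3) for value in row])
```

Under this toy rule the diagonal is zero and the matrix is symmetric, which is why only the upper triangle is computed.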
Visualize distances#
arr = dm.to_numpy()
fig, ax = plt.subplots(figsize=(6.5, 5.5))
im = ax.imshow(arr, cmap="viridis_r", vmin=0, vmax=1)
ax.set_title("LinearPairwise distance matrix", fontsize=12, fontweight="bold")
ax.set_xlabel("Sequence index")
ax.set_ylabel("Sequence index")
cbar = plt.colorbar(im, ax=ax)
cbar.set_label("Distance")
plt.tight_layout()
plt.show()
