Entity Metric: Hamming Distance#
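This example demonstrates the Hamming entity metric, which measures the distance between two individual entities (single point-in-time observations) by testing one categorical feature for equality: the distance is 0.0 when both entities share the same category and a fixed mismatch cost (1.0 by default) otherwise.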
Note
Most sequence-level metrics require an entity metric as a building block. Hamming is the most common choice for categorical features.
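To make the "building block" idea concrete, here is a minimal plain-Python sketch of one way a sequence-level distance could be assembled from an entity-level one: sum the entity distances over aligned positions. The names `entity_distance` and `positionwise_distance` are hypothetical and are not part of tanat; the sketch also assumes equal-length sequences, which real sequence metrics generally do not require.

```python
from typing import Sequence


def entity_distance(a: str, b: str, mismatch_cost: float = 1.0) -> float:
    """Hypothetical entity-level rule: 0.0 if the categories match, else mismatch_cost."""
    return 0.0 if a == b else mismatch_cost


def positionwise_distance(seq_a: Sequence[str], seq_b: Sequence[str]) -> float:
    """Sum entity distances over aligned positions (assumption: equal lengths)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only handles sequences of equal length")
    return sum(entity_distance(a, b) for a, b in zip(seq_a, seq_b))


# Two toy sequences of categorical states that differ at positions 1 and 3
total = positionwise_distance(["A", "B", "B", "C"], ["A", "C", "B", "D"])
print(total)  # 2.0
```

Swapping `entity_distance` for another rule changes the sequence-level result without touching the aggregation, which is the reason for keeping the two layers separate.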
Setup#
import polars as pl
from tanat import build_states
from tanat.dataset import simulate_states
from tanat.metric.entity import HammingEntityMetric
Generate synthetic data#
N_IDS = 50
SEED = 42
raw_df = simulate_states(
    n_ids=N_IDS,
    seq_length_range=(3, 8),
    features=["score", "status"],
    seed=SEED,
)
pool = build_states(
    temporal_data=raw_df,
    id_column="id",
    start_column="start",
    end_column="end",
)
┌─ State SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (50 sequences · 288 entities · 0.00s)
# HammingEntityMetric requires Categorical features
pool.cast_features({"status": pl.Categorical})
print(pool)
┌────────────────────────────────────────────────┐
│ StateSequencePool Summary │
└────────────────────────────────────────────────┘
Overview
─────────────────────────
Sequences 50
Store /home/runner/.tanat/_quick_state_bd0c49ef
id_column id
Time Index
─────────────────────────
Type Datetime(time_unit='us', time_zone=None) [2000-03-07 19:05:41.124579 → 2025-02-13 19:08:47.918854]
Columns ['start', 'end']
t0 position=0, anchor=start
Entity Features (2)
─────────────────────────
• score Numerical [1 → 100]
• status Categorical (5 categories)
Create Hamming entity metric#
hamming = HammingEntityMetric(entity_feature="status")
print(hamming)
HammingEntityMetric(settings=HammingSettings(entity_feature='status', cost=None, mismatch_cost=1.0))
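The printed settings show `entity_feature='status'` and `mismatch_cost=1.0`. As a rough illustration of how these two fields could interact, the sketch below uses a hypothetical `ToyHammingSettings` dataclass and `toy_hamming` function on plain dictionaries. It is not tanat's implementation, and the `cost` field is left aside because the example does not show how it is used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyHammingSettings:
    """Hypothetical stand-in for illustration; not tanat's HammingSettings."""

    entity_feature: str
    mismatch_cost: float = 1.0


def toy_hamming(a: dict, b: dict, settings: ToyHammingSettings) -> float:
    """Compare only the configured feature; every other feature is ignored."""
    same = a[settings.entity_feature] == b[settings.entity_feature]
    return 0.0 if same else settings.mismatch_cost


ent_x = {"score": 76, "status": "B"}
ent_y = {"score": 47, "status": "A"}

default = toy_hamming(ent_x, ent_y, ToyHammingSettings("status"))
weighted = toy_hamming(ent_x, ent_y, ToyHammingSettings("status", mismatch_cost=2.5))
print(default, weighted)  # 1.0 2.5
```

Note that `score` differs between the two toy entities but has no effect on the result, since only the feature named in the settings is compared.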
Compute distance between individual entities#
ids = pool.unique_ids
seq_a = pool[ids[0]]
seq_b = pool[ids[1]]
# Extract first entity from each sequence
ent_a, ent_b = seq_a[0], seq_b[0]
# Entity A
print(ent_a)
┌────────────────────────────────────────────────┐
│ StateEntity Summary │
└────────────────────────────────────────────────┘
Overview
─────────────────────────
Sequence ID 1
Rank 0
Entity Features
─────────────────────────
score 76
status B
# Entity B
print(ent_b)
┌────────────────────────────────────────────────┐
│ StateEntity Summary │
└────────────────────────────────────────────────┘
Overview
─────────────────────────
Sequence ID 2
Rank 0
Entity Features
─────────────────────────
score 47
status B
# Compute Hamming distance
dist = hamming(ent_a, ent_b)
print(f"\nHamming distance: {dist}")
print(" Same categories → 0.0")
print(" Different categories → 1.0 (default mismatch_cost)")
Hamming distance: 0.0
Same categories → 0.0
Different categories → 1.0 (default mismatch_cost)
Try multiple pairs#
print("\nDistances between random entity pairs:")
print("-" * 50)
for i in range(5):
    seq_1 = pool[ids[i]]
    seq_2 = pool[ids[i + 1]]
    # Compare first entities from each sequence
    e1, e2 = seq_1[0], seq_2[0]
    d = hamming(e1, e2)
    print(f"Pair {i+1}: {e1['status']!r:10} vs {e2['status']!r:10} → {d:.1f}")
Distances between random entity pairs:
--------------------------------------------------
Pair 1: 'B' vs 'B' → 0.0
Pair 2: 'B' vs 'A' → 1.0
Pair 3: 'A' vs 'E' → 1.0
Pair 4: 'E' vs 'D' → 1.0
Pair 5: 'D' vs 'A' → 1.0
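The five results above can be checked by hand. The sketch below takes the first-entity statuses of the six sequences involved, read off the printed output (`B, B, A, E, D, A`), and applies the equality rule with the default mismatch cost of 1.0 in plain Python. It does not call tanat; it only mirrors the rule the output illustrates.

```python
# First-entity statuses of the six consecutive sequences, copied from the output above
statuses = ["B", "B", "A", "E", "D", "A"]

# Consecutive pairs, matching the loop that compares ids[i] with ids[i + 1]
pairs = list(zip(statuses, statuses[1:]))

# Equality rule with the default mismatch cost of 1.0
distances = [0.0 if left == right else 1.0 for left, right in pairs]

for number, ((left, right), d) in enumerate(zip(pairs, distances), start=1):
    print(f"Pair {number}: {left!r} vs {right!r} -> {d:.1f}")

# Share of pairs whose categories differ
mismatch_rate = sum(distances) / len(distances)
print(f"Mismatch rate: {mismatch_rate:.1f}")  # 0.8
```

Four of the five pairs differ, so the share of mismatching pairs is 0.8.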