Entity Metric: Hamming Distance

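This example demonstrates the Hamming entity metric, which measures the distance between two individual entities (single point-in-time observations) based on categorical feature equality. The distance is 0.0 when both entities carry the same category for the chosen feature and a fixed mismatch cost (1.0 by default) otherwise.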

Note

Most sequence-level metrics require an entity metric as a building block. Hamming is the most common choice for categorical features.
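
To make the building-block idea concrete, here is a minimal pure-Python sketch. It is not tanat's implementation: the names `hamming_entity` and `positionwise_distance` are hypothetical, entities are reduced to plain category strings, and the sketch assumes both sequences have the same length.

```python
from typing import Callable, Sequence


def hamming_entity(a: str, b: str, mismatch_cost: float = 1.0) -> float:
    """Return 0.0 when the two categories are equal, else mismatch_cost."""
    return 0.0 if a == b else mismatch_cost


def positionwise_distance(
    seq_a: Sequence[str],
    seq_b: Sequence[str],
    entity_metric: Callable[[str, str], float] = hamming_entity,
) -> float:
    """Sum the entity-level distance over aligned positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(entity_metric(a, b) for a, b in zip(seq_a, seq_b))


# Two hypothetical status sequences that differ at positions 1 and 3.
print(positionwise_distance(["A", "B", "C", "D"], ["A", "E", "C", "A"]))  # 2.0
```

Passing the entity metric as a callable keeps the sequence-level logic independent of how two entities are compared, which is the role an entity metric plays in the note above.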

Setup

import polars as pl
from tanat import build_states
from tanat.dataset import simulate_states
from tanat.metric.entity import HammingEntityMetric

Generate synthetic data

N_IDS = 50
SEED = 42

raw_df = simulate_states(
    n_ids=N_IDS,
    seq_length_range=(3, 8),
    features=["score", "status"],
    seed=SEED,
)

pool = build_states(
    temporal_data=raw_df,
    id_column="id",
    start_column="start",
    end_column="end",
)
┌─ State SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (50 sequences · 288 entities · 0.00s)
# HammingEntityMetric requires Categorical features
pool.cast_features({"status": pl.Categorical})

print(pool)
┌────────────────────────────────────────────────┐
│           StateSequencePool Summary            │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequences          50
  Store              /home/runner/.tanat/_quick_state_bd0c49ef
  id_column          id

Time Index
─────────────────────────
  Type               Datetime(time_unit='us', time_zone=None) [2000-03-07 19:05:41.124579 → 2025-02-13 19:08:47.918854]
  Columns            ['start', 'end']
  t0                 position=0, anchor=start

Entity Features (2)
─────────────────────────
  • score               Numerical [1 → 100]
  • status              Categorical (5 categories)

Create Hamming entity metric

hamming = HammingEntityMetric(entity_feature="status")
print(hamming)
HammingEntityMetric(settings=HammingSettings(entity_feature='status', cost=None, mismatch_cost=1.0))
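
The printed settings show two fields that drive the result: `entity_feature` and `mismatch_cost`. The following pure-Python sketch illustrates how such settings could determine the distance. `SketchHammingSettings` and `sketch_hamming` are hypothetical names, entities are modelled as plain dicts, and the `cost` field is left out because its semantics are not shown in this example.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SketchHammingSettings:
    """Hypothetical stand-in for the printed settings (the `cost` field is omitted)."""

    entity_feature: str
    mismatch_cost: float = 1.0


def sketch_hamming(
    a: Mapping[str, object],
    b: Mapping[str, object],
    settings: SketchHammingSettings,
) -> float:
    """Compare one feature of two entities: 0.0 on equality, mismatch_cost otherwise."""
    feature = settings.entity_feature
    return 0.0 if a[feature] == b[feature] else settings.mismatch_cost


default = SketchHammingSettings(entity_feature="status")
heavier = SketchHammingSettings(entity_feature="status", mismatch_cost=2.5)

print(sketch_hamming({"status": "A"}, {"status": "E"}, default))  # 1.0
print(sketch_hamming({"status": "A"}, {"status": "E"}, heavier))  # 2.5
print(sketch_hamming({"status": "A"}, {"status": "A"}, heavier))  # 0.0
```

In this sketch, raising the mismatch cost only rescales the distance between unequal categories; equal categories stay at 0.0.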

Compute distance between individual entities

ids = pool.unique_ids
seq_a = pool[ids[0]]
seq_b = pool[ids[1]]

# Extract first entity from each sequence
ent_a, ent_b = seq_a[0], seq_b[0]
# Entity A
print(ent_a)
┌────────────────────────────────────────────────┐
│              StateEntity Summary               │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequence ID        1
  Rank               0

Entity Features
─────────────────────────
  score              76
  status             B
# Entity B
print(ent_b)
┌────────────────────────────────────────────────┐
│              StateEntity Summary               │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequence ID        2
  Rank               0

Entity Features
─────────────────────────
  score              47
  status             B
# Compute Hamming distance
dist = hamming(ent_a, ent_b)
print(f"\nHamming distance: {dist}")
print("  Same categories → 0.0")
print("  Different categories → 1.0 (default mismatch_cost)")
Hamming distance: 0.0
  Same categories → 0.0
  Different categories → 1.0 (default mismatch_cost)
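
The two entities printed above differ in `score` (76 vs 47) yet their distance is 0.0, because only the configured feature, `status`, is compared. A small sketch with plain dicts standing in for the entities makes this explicit; `single_feature_hamming` is a hypothetical helper, not part of tanat.

```python
# Plain-dict stand-ins for the two entities printed above.
ent_a = {"score": 76, "status": "B"}
ent_b = {"score": 47, "status": "B"}


def single_feature_hamming(a, b, feature, mismatch_cost=1.0):
    """Hypothetical helper: compare only `feature`, ignoring all other keys."""
    return 0.0 if a[feature] == b[feature] else mismatch_cost


# Only `status` is inspected, so the differing scores do not matter.
print(single_feature_hamming(ent_a, ent_b, "status"))  # 0.0

# For contrast (sketch only): equality on the numeric `score` would flag a mismatch.
print(single_feature_hamming(ent_a, ent_b, "score"))  # 1.0
```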

Try multiple pairs

print("\nDistances between random entity pairs:")
print("-" * 50)

for i in range(5):
    seq_1 = pool[ids[i]]
    seq_2 = pool[ids[i + 1]]

    # Compare first entities from each sequence
    e1, e2 = seq_1[0], seq_2[0]
    d = hamming(e1, e2)

    print(f"Pair {i+1}: {e1['status']!r:10} vs {e2['status']!r:10} → {d:.1f}")
Distances between random entity pairs:
--------------------------------------------------
Pair 1: 'B'        vs 'B'        → 0.0
Pair 2: 'B'        vs 'A'        → 1.0
Pair 3: 'A'        vs 'E'        → 1.0
Pair 4: 'E'        vs 'D'        → 1.0
Pair 5: 'D'        vs 'A'        → 1.0
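
The loop only compares consecutive sequences. As a rough illustration of the full picture, the sketch below builds the complete pairwise matrix for the six first-entity statuses read off the listing above (B, B, A, E, D, A). It uses plain strings and a hypothetical `hamming` helper rather than tanat objects.

```python
# First-entity statuses of the six sequences, read off the listing above.
statuses = ["B", "B", "A", "E", "D", "A"]


def hamming(a, b, mismatch_cost=1.0):
    """Hypothetical helper: 0.0 for equal categories, mismatch_cost otherwise."""
    return 0.0 if a == b else mismatch_cost


# Full pairwise matrix: matrix[i][j] is the distance between entity i and entity j.
matrix = [[hamming(a, b) for b in statuses] for a in statuses]

for row in matrix:
    print(" ".join(f"{d:.1f}" for d in row))

# The entries just above the diagonal reproduce the five consecutive pairs.
consecutive = [matrix[i][i + 1] for i in range(len(statuses) - 1)]
print(consecutive)  # [0.0, 1.0, 1.0, 1.0, 1.0]
```

The matrix is symmetric with zeros on the diagonal, and it also exposes matches that the consecutive loop never visits, such as the third and sixth entities, which both have status A.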

Total running time of the script: (0 minutes 0.048 seconds)
