Clustering: Hierarchical Clustering
This example clusters a pool of simulated state sequences with HierarchicalClusterer, using a normalized edit distance computed on the categorical status feature.
import polars as pl
from tanat import build_states
from tanat.clustering import HierarchicalClusterer
from tanat.dataset import simulate_states
from tanat.metric.entity import HammingEntityMetric
from tanat.metric.sequence import EditSequenceMetric
Generate synthetic data
N_IDS = 50
SEED = 42
raw_df = simulate_states(
    n_ids=N_IDS,
    seq_length_range=(3, 8),
    features=["score", "status"],
    seed=SEED,
)
pool = build_states(
    temporal_data=raw_df,
    id_column="id",
    start_column="start",
    end_column="end",
)
┌─ State SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (50 sequences · 288 entities · 0.01s)
# Cast the status feature to categorical
pool.cast_features({"status": pl.Categorical})
print(pool)
┌────────────────────────────────────────────────┐
│ StateSequencePool Summary │
└────────────────────────────────────────────────┘
Overview
─────────────────────────
Sequences 50
Store /home/runner/.tanat/_quick_state_ee93c800
id_column id
Time Index
─────────────────────────
Type Datetime(time_unit='us', time_zone=None) [2000-03-07 19:05:41.124579 → 2025-02-13 19:08:47.918854]
Columns ['start', 'end']
t0 position=0, anchor=start
Entity Features (2)
─────────────────────────
• score Numerical [1 → 100]
• status Categorical (5 categories)
Define the metric used by the clusterer
hamming = HammingEntityMetric(entity_feature="status")
metric = EditSequenceMetric(entity_metric=hamming, normalize=True)
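To make the metric concrete, here is a minimal pure-Python sketch of an edit distance whose substitution cost is a Hamming comparison (0 when two states match, 1 otherwise). It is not tanat's implementation: the unit insertion/deletion cost, the normalization by the length of the longer sequence, and the function names are assumptions chosen for illustration.

```python
def hamming_cost(a, b):
    """Hamming-style entity cost: 0 when the two states match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def edit_distance(seq_a, seq_b, indel_cost=1.0, normalize=True):
    """Edit distance between two state sequences via dynamic programming.

    Assumptions (not taken from tanat): unit insertion/deletion cost and
    normalization by the length of the longer sequence.
    """
    n, m = len(seq_a), len(seq_b)
    # prev[j]: cost of aligning the items of seq_a seen so far
    # with the first j items of seq_b.
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j] + indel_cost,      # delete seq_a[i - 1]
                curr[j - 1] + indel_cost,  # insert seq_b[j - 1]
                # substitute (cost 1) or match (cost 0)
                prev[j - 1] + hamming_cost(seq_a[i - 1], seq_b[j - 1]),
            )
        prev = curr
    distance = prev[m]
    if normalize and max(n, m) > 0:
        distance /= max(n, m)
    return distance


a = ["A", "B", "C", "D"]
b = ["A", "C", "D"]
raw = edit_distance(a, b, normalize=False)
scaled = edit_distance(a, b)
print(raw, scaled)
```

Under these assumptions the two toy sequences differ by a single deletion, so the raw distance is 1.0 and the normalized distance is 0.25. Normalizing keeps distances between short and long sequences comparable, which matters here because sequence lengths range from 3 to 8.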
Perform hierarchical clustering
clusterer = HierarchicalClusterer(
    metric=metric,
    n_clusters=4,
)
clusterer.fit(pool)
┌─ HierarchicalClusterer
│
│ Step 1/2: Computing distance matrix
│
│ ┌─ EditSequenceMetric
│ │
│ │ Chunks: 100%|██████████| 1/1 [00:01<00:00, 1.43s/it]
│ │
│ └─ Done (50 sequences · 2.00s)
│
│ Step 2/2: Clustering (HierarchicalClusterer)
│
└─ Done (50 items, 4 clusters · 2.03s)
HierarchicalClusterer(clusters=4)
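The two logged steps (compute a pairwise distance matrix, then cluster it) can be illustrated with a small pure-Python sketch of agglomerative clustering with complete linkage on a precomputed matrix. This is a naive illustration, not the code HierarchicalClusterer runs; the function name `complete_linkage` and the toy 1-D points are hypothetical.

```python
def complete_linkage(dist, n_clusters):
    """Agglomerative clustering with complete linkage on a precomputed matrix.

    dist is a symmetric list of lists of pairwise distances. Returns one
    cluster label per item. Naive cubic-time loop, for illustration only.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > max(n_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is
                # the largest distance between any pair of their members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy 1-D points standing in for sequences; distance is absolute difference.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
dist = [[abs(p - q) for q in points] for p in points]
labels = complete_linkage(dist, n_clusters=3)
print(labels)
```

On the toy data this groups the three points near 0, the two points near 5, and the isolated point at 9. Because complete linkage scores a merge by the farthest pair, it tends to produce compact clusters.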
# Clustering results
print(clusterer)
┌────────────────────────────────────────────────┐
│ HierarchicalClusterer │
└────────────────────────────────────────────────┘
Settings
─────────────────────────
n_clusters 4
distance_threshold None
metric EditSequenceMetric
linkage complete
cluster_column __HCLUSTERS__
Results
─────────────────────────
Clusters 4
Avg size 12.5
Min size 3
Max size 26
Clusters
─────────────────────────
#0 26 items
#1 12 items
#2 3 items
#3 9 items
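As a quick sanity check, the figures in the Results block follow directly from the per-cluster counts. The snippet below recomputes them in plain Python from the sizes printed above; the variable names are illustrative only.

```python
from statistics import mean

# Cluster sizes copied from the summary printed above.
sizes = {0: 26, 1: 12, 2: 3, 3: 9}

total = sum(sizes.values())
avg_size = mean(sizes.values())
min_size = min(sizes.values())
max_size = max(sizes.values())
largest_share = max_size / total  # fraction of sequences in the biggest cluster

print(total, avg_size, min_size, max_size, largest_share)
```

The sizes sum to the 50 simulated sequences, and the largest cluster alone holds just over half of them, so the partition is noticeably unbalanced.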
Inspect cluster assignments
print("\nCluster assignments injected as static features:")
print(pool.static_data().head())
Cluster assignments injected as static features:
   id  __HCLUSTERS__
0   1              0
1   2              3
2   3              0
3   4              3
4   5              0
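Once labels sit next to the sequence ids, a common next step is to list the members of each cluster. The sketch below does this in plain Python using only the five rows shown above, hard-coded as tuples; it does not call any tanat API.

```python
from collections import defaultdict

# (id, __HCLUSTERS__) pairs copied from the five rows printed above.
rows = [(1, 0), (2, 3), (3, 0), (4, 3), (5, 0)]

members = defaultdict(list)
for seq_id, label in rows:
    members[label].append(seq_id)

for label in sorted(members):
    print(f"cluster {label}: ids {members[label]}")
```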
Total running time of the script: (0 minutes 4.144 seconds)