Clustering: Hierarchical Clustering

This example clusters a pool of simulated state sequences with HierarchicalClusterer, using a normalized edit distance computed on the categorical "status" feature.

import polars as pl

from tanat import build_states
from tanat.clustering import HierarchicalClusterer
from tanat.dataset import simulate_states
from tanat.metric.entity import HammingEntityMetric
from tanat.metric.sequence import EditSequenceMetric

Generate synthetic data

N_IDS = 50
SEED = 42

raw_df = simulate_states(
    n_ids=N_IDS,
    seq_length_range=(3, 8),
    features=["score", "status"],
    seed=SEED,
)

pool = build_states(
    temporal_data=raw_df,
    id_column="id",
    start_column="start",
    end_column="end",
)
┌─ State SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (50 sequences · 288 entities · 0.01s)
# Cast the "status" feature to a categorical dtype
pool.cast_features({"status": pl.Categorical})
print(pool)
┌────────────────────────────────────────────────┐
│           StateSequencePool Summary            │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequences          50
  Store              /home/runner/.tanat/_quick_state_ee93c800
  id_column          id

Time Index
─────────────────────────
  Type               Datetime(time_unit='us', time_zone=None) [2000-03-07 19:05:41.124579 → 2025-02-13 19:08:47.918854]
  Columns            ['start', 'end']
  t0                 position=0, anchor=start

Entity Features (2)
─────────────────────────
  • score               Numerical [1 → 100]
  • status              Categorical (5 categories)

Define the metric used by the clusterer

hamming = HammingEntityMetric(entity_feature="status")
metric = EditSequenceMetric(entity_metric=hamming, normalize=True)
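To give an intuition for what this metric combination measures, here is a minimal pure-Python sketch of an edit distance whose substitution cost is a Hamming-style comparison (0 if two states match, 1 otherwise). It is not tanat's implementation: the function names, the unit insertion/deletion cost, and the normalization by the longer sequence length are all assumptions made for illustration.

```python
def hamming_cost(a, b):
    """Hamming-style cost between two categorical values: 0 if equal, else 1."""
    return 0.0 if a == b else 1.0


def edit_distance(seq_a, seq_b, indel_cost=1.0, normalize=True):
    """Edit distance between two sequences of categorical states.

    Assumptions (not taken from tanat): insertions and deletions cost
    `indel_cost`, substitutions cost `hamming_cost`, and normalization
    divides by the length of the longer sequence.
    """
    n, m = len(seq_a), len(seq_b)
    # prev[j] is the cost of turning the current prefix of seq_a into seq_b[:j]
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j] + indel_cost,      # delete seq_a[i - 1]
                curr[j - 1] + indel_cost,  # insert seq_b[j - 1]
                prev[j - 1] + hamming_cost(seq_a[i - 1], seq_b[j - 1]),
            )
        prev = curr
    dist = prev[m]
    if normalize:
        longest = max(n, m)
        return dist / longest if longest else 0.0
    return dist


# Hypothetical status sequences, for illustration only
raw = edit_distance(["A", "B", "C"], ["A", "C"], normalize=False)
scaled = edit_distance(["A", "B", "C"], ["A", "C"])
print(raw, scaled)
```

Under these assumptions, the normalized value lies between 0 (identical sequences) and 1, which keeps sequences of different lengths comparable when they are fed to the clusterer.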

Perform hierarchical clustering

clusterer = HierarchicalClusterer(
    metric=metric,
    n_clusters=4,
)

clusterer.fit(pool)
┌─ HierarchicalClusterer
│
│ Step 1/2: Computing distance matrix
│
│   ┌─ EditSequenceMetric
│   │
│   │ Chunks: 100%|██████████| 1/1 [00:01<00:00,  1.43s/it]
│   │
│   └─ Done (50 sequences · 2.00s)
│
│ Step 2/2: Clustering (HierarchicalClusterer)
│
└─ Done (50 items, 4 clusters · 2.03s)

HierarchicalClusterer(clusters=4)
# Clustering results
print(clusterer)
┌────────────────────────────────────────────────┐
│             HierarchicalClusterer              │
└────────────────────────────────────────────────┘

Settings
─────────────────────────
  n_clusters         4
  distance_threshold None
  metric             EditSequenceMetric
  linkage            complete
  cluster_column     __HCLUSTERS__

Results
─────────────────────────
  Clusters           4
  Avg size           12.5
  Min size           3
  Max size           26

Clusters
─────────────────────────
  #0                 26 items
  #1                 12 items
  #2                 3 items
  #3                 9 items
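The settings above report complete linkage. As a rough illustration of what that means, the following sketch runs a naive agglomerative clustering on a precomputed distance matrix: it repeatedly merges the two clusters whose farthest pair of members is closest, until the requested number of clusters remains. This is a simplified stand-in, not the algorithm tanat runs internally; the function name and the toy data are hypothetical, and the resulting label numbering is arbitrary.

```python
def complete_linkage(dist, n_clusters):
    """Naive complete-linkage clustering on a square symmetric distance matrix.

    `dist` is a list of lists; returns one integer label per item.
    Illustrative only: cubic-or-worse cost, suitable for small inputs.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the farthest pair of members
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy one-dimensional points; pairwise distance is the absolute difference
points = [0, 1, 10, 11, 50]
dist = [[abs(p - q) for q in points] for p in points]
print(complete_linkage(dist, n_clusters=3))  # -> [0, 0, 1, 1, 2]
```

Because complete linkage scores a merge by its worst-case pair, it tends to produce compact clusters, and an isolated item such as the last point here stays in a cluster of its own until few clusters remain.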

Inspect cluster assignments

print("\nCluster assignments injected as static features:")
print(pool.static_data().head())
Cluster assignments injected as static features:
   id  __HCLUSTERS__
0   1              0
1   2              3
2   3              0
3   4              3
4   5              0

Total running time of the script: (0 minutes 4.144 seconds)
