Clustering: CLARA (Clustering LARge Applications)

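This example demonstrates CLARA clustering, a scalable variant of PAM (Partitioning Around Medoids) for large datasets. Instead of running PAM on the full dataset, CLARA runs it on several random subsets, keeps the set of medoids with the lowest total dissimilarity over the whole dataset, and assigns every sequence to its nearest medoid.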
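
To make the sampling idea concrete, here is a minimal pure-Python sketch of the CLARA loop on toy 1-D data. It is an illustration under stated assumptions, not tanat's implementation: the helper names (`total_cost`, `k_medoids`, `clara`) are hypothetical, and a simple alternating k-medoids update stands in for PAM's build and swap phases.

```python
import random


def total_cost(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def k_medoids(points, k, dist, rng, max_iter=50):
    """Simplified alternating k-medoids, used here as a stand-in for PAM."""
    medoids = rng.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: dist(p, medoids[j]))
            clusters[nearest].append(p)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(c, q) for q in members))
            if members
            else medoids[j]
            for j, members in enumerate(clusters)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def clara(points, k, dist, sample_size, n_instances, seed=0):
    """Run k-medoids on several random samples and keep the best medoid set."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_instances):
        sample = rng.sample(points, sample_size)
        medoids = k_medoids(sample, k, dist, rng)
        # Candidates are scored on the FULL dataset, not just the sample.
        cost = total_cost(points, medoids, dist)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Final assignment of every point to its nearest retained medoid.
    labels = [
        min(range(k), key=lambda j: dist(p, best_medoids[j])) for p in points
    ]
    return best_medoids, labels, best_cost


# Toy data: three well-separated groups on a line.
points = [0, 1, 2, 10, 11, 12, 20, 21, 22]
dist = lambda a, b: abs(a - b)
medoids, labels, cost = clara(
    points, k=3, dist=dist, sample_size=6, n_instances=3, seed=42
)
print(medoids, labels, cost)
```

Only the sampled points are clustered in each instance, but every candidate medoid set is evaluated against all points, which is what lets the best sample win.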

Setup

import polars as pl

from tanat import build_states
from tanat.clustering import CLARAClusterer
from tanat.dataset import simulate_states
from tanat.metric.entity import HammingEntityMetric
from tanat.metric.sequence import EditSequenceMetric

Generate synthetic data

N_IDS = 100
SEED = 42

raw_df = simulate_states(
    n_ids=N_IDS,
    seq_length_range=(3, 8),
    features=["score", "status"],
    seed=SEED,
)

pool = build_states(
    temporal_data=raw_df,
    id_column="id",
    start_column="start",
    end_column="end",
)
┌─ State SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (100 sequences · 564 entities · 0.01s)
# Cast features to categorical
pool.cast_features({"status": pl.Categorical})
print(pool)
┌────────────────────────────────────────────────┐
│           StateSequencePool Summary            │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequences          100
  Store              /home/runner/.tanat/_quick_state_1959bbdb
  id_column          id

Time Index
─────────────────────────
  Type               Datetime(time_unit='us', time_zone=None) [2000-01-16 03:58:26.326814 → 2025-01-01 00:00:00]
  Columns            ['start', 'end']
  t0                 position=0, anchor=start

Entity Features (2)
─────────────────────────
  • score               Numerical [1 → 100]
  • status              Categorical (5 categories)

Define the metric used by the clusterer

hamming = HammingEntityMetric(entity_feature="status")
metric = EditSequenceMetric(entity_metric=hamming, normalize=True)
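
For intuition about what such a metric measures, the sketch below computes a classic edit distance between two sequences of categorical states, with a Hamming-style substitution cost (0 if the states match, 1 otherwise). The unit insertion/deletion cost and the normalization by the longer sequence length are assumptions made for illustration; tanat's `EditSequenceMetric` may use different costs or a different normalization. The function names are hypothetical.

```python
def hamming(a, b):
    """Hamming-style cost between two states: 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0


def edit_distance(seq_a, seq_b, entity_cost=hamming, indel_cost=1.0, normalize=True):
    """Edit distance via dynamic programming, keeping only two rows in memory."""
    n, m = len(seq_a), len(seq_b)
    # prev[j] = cost of turning an empty prefix of seq_a into seq_b[:j].
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j] + indel_cost,  # delete seq_a[i - 1]
                curr[j - 1] + indel_cost,  # insert seq_b[j - 1]
                prev[j - 1] + entity_cost(seq_a[i - 1], seq_b[j - 1]),  # substitute
            )
        prev = curr
    raw = prev[m]
    if not normalize:
        return raw
    # Assumed normalization: divide by the longer length so the result is in [0, 1].
    longest = max(n, m)
    return raw / longest if longest else 0.0


a = ["employed", "unemployed", "employed"]
b = ["employed", "student", "employed"]
print(edit_distance(a, b, normalize=False))
print(edit_distance(a, b))
```

With unit costs, the raw distance is the minimum number of substitutions, insertions, and deletions needed to turn one sequence into the other.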

Perform CLARA clustering

n_clusters = 5
n_samples = 40  # subset size per PAM instance
n_iterations = 3  # number of PAM instances

clusterer = CLARAClusterer(
    metric=metric,
    n_clusters=n_clusters,
    sampling_ratio=n_samples / N_IDS,
    nb_pam_instances=n_iterations,
    random_state=SEED,
)
clusterer.fit(pool)
┌─ CLARAClusterer
│
│ Step 1/3: PAM instance 1/3 (sample: 40)
│
│   ┌─ PAMClusterer
│   │
│   │ Step 1/2: Computing distance matrix
│   │
│   │   ┌─ EditSequenceMetric
│   │   │
│   │   │ Chunks: 100%|██████████| 1/1 [00:00<00:00, 4064.25it/s]
│   │   │
│   │   └─ Done (40 sequences · 0.00s)
│   │
│   │ Step 2/2: Clustering (PAMClusterer)
│   │
│   └─ Done (40 items, 5 clusters · 0.02s)
│
│ Step 2/3: PAM instance 2/3 (sample: 40)
│
│   ┌─ PAMClusterer
│   │
│   │ Step 1/2: Computing distance matrix
│   │
│   │   ┌─ EditSequenceMetric
│   │   │
│   │   │ Chunks: 100%|██████████| 1/1 [00:00<00:00, 4288.65it/s]
│   │   │
│   │   └─ Done (40 sequences · 0.00s)
│   │
│   │ Step 2/2: Clustering (PAMClusterer)
│   │
│   └─ Done (40 items, 5 clusters · 0.02s)
│
│ Step 3/3: PAM instance 3/3 (sample: 40)
│
│   ┌─ PAMClusterer
│   │
│   │ Step 1/2: Computing distance matrix
│   │
│   │   ┌─ EditSequenceMetric
│   │   │
│   │   │ Chunks: 100%|██████████| 1/1 [00:00<00:00, 196.79it/s]
│   │   │
│   │   └─ Done (40 sequences · 0.01s)
│   │
│   │ Step 2/2: Clustering (PAMClusterer)
│   │
│   └─ Done (40 items, 5 clusters · 0.03s)
│
└─ Done (100 items, 5 clusters · 0.16s)

CLARAClusterer(clusters=5)
# Clustering results
print(clusterer)
┌────────────────────────────────────────────────┐
│                 CLARAClusterer                 │
└────────────────────────────────────────────────┘

Settings
─────────────────────────
  sampling_ratio     0.4
  nb_pam_instances   3
  n_clusters         5
  max_iter           50
  metric             EditSequenceMetric
  random_state       42
  cluster_column     __CLARA_CLUSTERS__

Results
─────────────────────────
  Clusters           5
  Avg size           20.0
  Min size           16
  Max size           28

Clusters
─────────────────────────
  #0                 23 items
  #1                 28 items
  #2                 17 items
  #3                 16 items
  #4                 16 items
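
The settings and sizes reported above can be checked with a little arithmetic. The sketch below recomputes the sample size from `sampling_ratio`, compares the number of pairwise distances in one sampled distance matrix with what a full PAM run on all 100 sequences would need, and verifies the cluster size summary. It counts only the pairwise distance matrices; any extra distances computed when assigning all sequences to medoids are not included, and the rounding rule for the sample size is an assumption.

```python
from math import comb

n_ids = 100
sampling_ratio = 0.4
nb_pam_instances = 3

# Assumed rounding rule; the fit log above reports "sample: 40".
sample_size = round(n_ids * sampling_ratio)

# Unordered pairs in a symmetric distance matrix of that size.
pairs_per_instance = comb(sample_size, 2)
pairs_all_instances = nb_pam_instances * pairs_per_instance
pairs_full_pam = comb(n_ids, 2)

print(f"sample size:            {sample_size}")
print(f"pairs per PAM instance: {pairs_per_instance}")
print(f"pairs over 3 instances: {pairs_all_instances}")
print(f"pairs for full PAM:     {pairs_full_pam}")

# Cluster sizes copied from the summary above.
cluster_sizes = [23, 28, 17, 16, 16]
avg_size = sum(cluster_sizes) / len(cluster_sizes)
print(f"total={sum(cluster_sizes)} avg={avg_size} "
      f"min={min(cluster_sizes)} max={max(cluster_sizes)}")
```

Because the pair count grows quadratically with the number of sequences, the gap between a sampled matrix and the full matrix widens quickly as the dataset grows, which is where CLARA pays off.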

Inspect cluster assignments and medoids

print("\nMedoids (representative sequences):")
for i, medoid_id in enumerate(clusterer.medoids):
    print(f"  Cluster {i}: {medoid_id}")

print("\nCluster assignments injected as static features:")
print(pool.static_data().head())
Medoids (representative sequences):
  Cluster 0: 22
  Cluster 1: 54
  Cluster 2: 17
  Cluster 3: 43
  Cluster 4: 30

Cluster assignments injected as static features:
   id  __CLARA_CLUSTERS__
0   1                   1
1   2                   2
2   3                   0
3   4                   2
4   5                   3

Total running time of the script: (0 minutes 0.185 seconds)
