Clustering: CLARA (Clustering LARge Applications)
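This example demonstrates CLARA clustering, a scalable variant of PAM (Partitioning Around Medoids). Instead of running PAM on the full dataset, CLARA runs it on several random subsamples. In the standard formulation, it then keeps the medoids that give the lowest total dissimilarity once every sequence is assigned to its nearest medoid.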
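The sampling idea can be sketched in plain Python. This is a minimal illustration of the textbook CLARA procedure, not tanat's implementation: the helper names (`total_cost`, `pam_on_sample`, `clara`, `abs_dist`) and the greedy swap search are assumptions made for this example, and the 1-D toy points stand in for sequences.

```python
import random


def abs_dist(a, b):
    """Toy dissimilarity between two 1-D points."""
    return abs(a - b)


def total_cost(targets, medoid_points, dist):
    """Sum of each target's distance to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoid_points) for p in targets)


def pam_on_sample(points, sample_idx, k, dist, rng):
    """Greedy PAM-style swap search restricted to the sampled indices."""
    sample = [points[i] for i in sample_idx]
    medoids = rng.sample(sample_idx, k)

    def cost(meds):
        return total_cost(sample, [points[m] for m in meds], dist)

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in sample_idx:
                if cand in medoids:
                    continue
                # Try replacing one medoid with a non-medoid from the sample.
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = cost(trial)
                if trial_cost < best:
                    medoids, best, improved = trial, trial_cost, True
    return medoids


def clara(points, k, sample_size, n_instances, dist, seed=0):
    """Run PAM on several samples; keep the medoids that are best on ALL points."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_instances):
        sample_idx = rng.sample(range(len(points)), sample_size)
        medoids = pam_on_sample(points, sample_idx, k, dist, rng)
        cost = total_cost(points, [points[m] for m in medoids], dist)
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    # Assign every point (sampled or not) to its nearest medoid.
    labels = [
        min(range(k), key=lambda j: dist(p, points[best_medoids[j]]))
        for p in points
    ]
    return best_medoids, labels, best_cost


# Three well-separated groups of ten points each.
points = [center + 0.1 * j for center in (0.0, 10.0, 20.0) for j in range(10)]
medoids, labels, cost = clara(
    points, k=3, sample_size=12, n_instances=3, dist=abs_dist, seed=42
)
print("medoid indices:", medoids)
print("total cost:", cost)
```

Only the sampled points are candidates for medoids, but the cost used to compare PAM instances is computed over the whole dataset, which is what keeps the result representative.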
Setup
import polars as pl
from tanat import build_states
from tanat.clustering import CLARAClusterer
from tanat.dataset import simulate_states
from tanat.metric.entity import HammingEntityMetric
from tanat.metric.sequence import EditSequenceMetric
Generate synthetic data
N_IDS = 100
SEED = 42
raw_df = simulate_states(
    n_ids=N_IDS,
    seq_length_range=(3, 8),
    features=["score", "status"],
    seed=SEED,
)
pool = build_states(
    temporal_data=raw_df,
    id_column="id",
    start_column="start",
    end_column="end",
)
┌─ State SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity & time index features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (100 sequences · 564 entities · 0.01s)
# Cast features to categorical
pool.cast_features({"status": pl.Categorical})
print(pool)
┌────────────────────────────────────────────────┐
│ StateSequencePool Summary │
└────────────────────────────────────────────────┘
Overview
─────────────────────────
Sequences 100
Store /home/runner/.tanat/_quick_state_1959bbdb
id_column id
Time Index
─────────────────────────
Type Datetime(time_unit='us', time_zone=None) [2000-01-16 03:58:26.326814 → 2025-01-01 00:00:00]
Columns ['start', 'end']
t0 position=0, anchor=start
Entity Features (2)
─────────────────────────
• score Numerical [1 → 100]
• status Categorical (5 categories)
Define the metric used by the clusterer
hamming = HammingEntityMetric(entity_feature="status")
metric = EditSequenceMetric(entity_metric=hamming, normalize=True)
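To make the metric concrete, here is a minimal sketch of an edit distance whose substitution cost is a Hamming comparison of one categorical feature. It is not tanat's code: the function names, the unit insertion/deletion cost, and the normalization by the longer sequence length are all assumptions chosen for illustration, and the status labels `"A"`, `"B"`, ... are hypothetical.

```python
def hamming_cost(a, b):
    """0 if the two categorical values match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def edit_distance(s, t, sub_cost=hamming_cost, indel_cost=1.0, normalize=True):
    """Edit distance between two sequences of categorical states.

    Assumption: when normalize=True the raw distance is divided by the
    length of the longer sequence, which keeps it in [0, 1] for unit costs.
    """
    n, m = len(s), len(t)
    # prev[j] = cost of turning the first i-1 items of s into the first j of t.
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            curr[j] = min(
                prev[j] + indel_cost,                        # delete s[i-1]
                curr[j - 1] + indel_cost,                    # insert t[j-1]
                prev[j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitute
            )
        prev = curr
    distance = prev[m]
    if normalize and max(n, m) > 0:
        distance /= max(n, m)
    return distance


same = edit_distance(["A", "B", "C"], ["A", "B", "C"])
one_sub = edit_distance(["A", "B", "C"], ["A", "X", "C"])
two_inserts = edit_distance(["A", "B"], ["A", "B", "C", "D"])
print(same, one_sub, two_inserts)
```

Normalizing makes distances comparable between short and long sequences, which matters here because the simulated sequences have between 3 and 8 states.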
Perform CLARA clustering
n_clusters = 5
n_samples = 40 # subset size per PAM instance
n_iterations = 3 # number of PAM instances
clusterer = CLARAClusterer(
    metric=metric,
    n_clusters=n_clusters,
    sampling_ratio=n_samples / N_IDS,
    nb_pam_instances=n_iterations,
    random_state=SEED,
)
clusterer.fit(pool)
┌─ CLARAClusterer
│
│ Step 1/3: PAM instance 1/3 (sample: 40)
│
│ ┌─ PAMClusterer
│ │
│ │ Step 1/2: Computing distance matrix
│ │
│ │ ┌─ EditSequenceMetric
│ │ │
│ │ │ Chunks: 0%| | 0/1 [00:00<?, ?it/s]
│ │ │ Chunks: 100%|██████████| 1/1 [00:00<00:00, 4064.25it/s]
│ │ │
│ │ └─ Done (40 sequences · 0.00s)
│ │
│ │ Step 2/2: Clustering (PAMClusterer)
│ │
│ └─ Done (40 items, 5 clusters · 0.02s)
│
│ Step 2/3: PAM instance 2/3 (sample: 40)
│
│ ┌─ PAMClusterer
│ │
│ │ Step 1/2: Computing distance matrix
│ │
│ │ ┌─ EditSequenceMetric
│ │ │
│ │ │ Chunks: 0%| | 0/1 [00:00<?, ?it/s]
│ │ │ Chunks: 100%|██████████| 1/1 [00:00<00:00, 4288.65it/s]
│ │ │
│ │ └─ Done (40 sequences · 0.00s)
│ │
│ │ Step 2/2: Clustering (PAMClusterer)
│ │
│ └─ Done (40 items, 5 clusters · 0.02s)
│
│ Step 3/3: PAM instance 3/3 (sample: 40)
│
│ ┌─ PAMClusterer
│ │
│ │ Step 1/2: Computing distance matrix
│ │
│ │ ┌─ EditSequenceMetric
│ │ │
│ │ │ Chunks: 0%| | 0/1 [00:00<?, ?it/s]
│ │ │ Chunks: 100%|██████████| 1/1 [00:00<00:00, 196.79it/s]
│ │ │
│ │ └─ Done (40 sequences · 0.01s)
│ │
│ │ Step 2/2: Clustering (PAMClusterer)
│ │
│ └─ Done (40 items, 5 clusters · 0.03s)
│
└─ Done (100 items, 5 clusters · 0.16s)
CLARAClusterer(clusters=5)
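The log above shows each PAM instance building a distance matrix for 40 sequences rather than 100. A back-of-the-envelope count shows why this matters as the dataset grows. The helper names below are hypothetical, and the count assumes a symmetric metric, no caching between instances, and one distance from every sequence to every medoid for the final assignment (an upper bound).

```python
def full_pam_pairs(n):
    """Pairwise distances needed for a full n x n symmetric matrix."""
    return n * (n - 1) // 2


def clara_pairs(n, sample_size, n_clusters, n_instances):
    """Upper bound on distances evaluated by CLARA under the stated assumptions."""
    per_sample_matrix = sample_size * (sample_size - 1) // 2
    assignment = n * n_clusters  # every item compared with every medoid
    return n_instances * (per_sample_matrix + assignment)


# Settings from this example: sampling_ratio = 40 / 100, 5 clusters, 3 instances.
sample_size = round(0.4 * 100)
small_full = full_pam_pairs(100)                      # 4950
small_clara = clara_pairs(100, sample_size, 5, 3)     # 3840

# Same sample size on a much larger (hypothetical) dataset.
large_full = full_pam_pairs(10_000)
large_clara = clara_pairs(10_000, sample_size, 5, 3)
print(small_full, small_clara, large_full, large_clara)
```

With 100 sequences the saving is modest, but with a fixed sample size the CLARA count grows linearly in the number of sequences while the full matrix grows quadratically.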
# Clustering results
print(clusterer)
┌────────────────────────────────────────────────┐
│ CLARAClusterer │
└────────────────────────────────────────────────┘
Settings
─────────────────────────
sampling_ratio 0.4
nb_pam_instances 3
n_clusters 5
max_iter 50
metric EditSequenceMetric
random_state 42
cluster_column __CLARA_CLUSTERS__
Results
─────────────────────────
Clusters 5
Avg size 20.0
Min size 16
Max size 28
Clusters
─────────────────────────
#0 23 items
#1 28 items
#2 17 items
#3 16 items
#4 16 items
Inspect cluster assignments and medoids
print("\nMedoids (representative sequences):")
for i, medoid_id in enumerate(clusterer.medoids):
    print(f" Cluster {i}: {medoid_id}")
print("\nCluster assignments injected as static features:")
print(pool.static_data().head())
Medoids (representative sequences):
Cluster 0: 22
Cluster 1: 54
Cluster 2: 17
Cluster 3: 43
Cluster 4: 30
Cluster assignments injected as static features:
   id  __CLARA_CLUSTERS__
0   1                   1
1   2                   2
2   3                   0
3   4                   2
4   5                   3
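As a small follow-up, the two outputs above can be combined to look up the representative sequence for each id. The sketch below copies the printed values into plain Python structures instead of calling tanat, and it assumes that the position of a medoid in `clusterer.medoids` equals its cluster label, as the enumeration in the printing loop suggests.

```python
from collections import defaultdict

# Values copied from the printed output above.
medoids = [22, 54, 17, 43, 30]  # assumed: list index == cluster label
assignments = [(1, 1), (2, 2), (3, 0), (4, 2), (5, 3)]  # (id, cluster)

# Group sequence ids by cluster label.
members = defaultdict(list)
for seq_id, cluster in assignments:
    members[cluster].append(seq_id)

# Map each sequence id to the medoid that represents its cluster.
medoid_of = {seq_id: medoids[cluster] for seq_id, cluster in assignments}

for seq_id, medoid_id in medoid_of.items():
    print(f"sequence {seq_id} is represented by medoid {medoid_id}")
```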