Clustering learning sessions by action patterns#

Scenario: Building on Exploring learner activity sequences, you want to group sessions with similar action sequences together to identify recurring learning strategies. This is the second tutorial in the MOOC series.

Concepts covered:

Rebuild the session pool with build_states() (self-contained)
Compute edit distances with HammingEntityMetric + EditSequenceMetric (Optimal Matching)
Cluster with HierarchicalClusterer
Inspect cluster membership via clusters
Visualise per-cluster state distributions with a faceted plot

Imports#

import pandas as pd
import polars as pl

from tanat import build_states
from tanat.clustering import HierarchicalClusterer
from tanat.criterion import LengthCriterion
from tanat.dataset import access
from tanat.metric import EditSequenceMetric, HammingEntityMetric
from tanat.visualization import SequenceVisualizer

Rebuild the filtered session pool#

Self-contained rebuild (see Exploring learner activity sequences for details).

INACTIVITY = pd.Timedelta("2h")

df = access("mooc_events")
df["timecreated"] = pd.to_datetime(df["timecreated"])
df = df.sort_values(["user", "timecreated"])

df["session"] = (
    (df["user"] != df["user"].shift()) | (df["timecreated"].diff() > INACTIVITY)
).cumsum()

sessions = df[["user", "session"]].drop_duplicates()
df["position"] = df.groupby("session").cumcount()

pool = build_states(
    df[["session", "position", "Action"]],
    id_column="session",
    start_column="position",
    static_data=sessions,
    store_name="mooc_sessions_store",
)
# ``pl.Categorical`` enables consistent colour-coding across visualisations
# and is required by the metric module.
pool.cast_features({"Action": pl.Categorical}, is_static=False)

ids_keep = pool.which(LengthCriterion(ge=2, le=40))
pool_filtered = pool.subset(ids_keep)

┌─ State SequenceStore
│
│ Step 1/4: Sorting & preparing data
│
│ Step 2/4: Building sequence index
│
│ Step 3/4: Writing entity, time index & static features
│
│ Step 4/4: Computing & writing metadata
│
└─ Done (5,700 sequences · 95,626 entities · 0.02s)
[which]           LengthCriterion → 5,150 / 5,700 IDs (90.4%)

print(pool_filtered)

┌────────────────────────────────────────────────┐
│           StateSequencePool Summary            │
└────────────────────────────────────────────────┘

Overview
─────────────────────────
  Sequences          5,150
  Store              /home/runner/.tanat_workspace/building_pools_tutorial/mooc_sessions_store
  id_column          session

Time Index
─────────────────────────
  Type               Int64 (Timestep) [0 → 39]
  Columns            ['position', 'end']
  t0                 position=0, anchor=start

Entity Features (1)
─────────────────────────
  • Action              Categorical (12 categories)

Static Features (1)
─────────────────────────
  • user                String [len 9 → 9]

Step 1: Define the sequence metric#

We use Optimal Matching (edit distance), the standard metric for sequence analysis in the social sciences.

HammingEntityMetric compares two actions at the same position: distance 0 if they share the same type, 1 otherwise. EditSequenceMetric extends this to full sequences by counting insertions, deletions, and substitutions.

Tip

You can provide a custom substitution cost matrix to HammingEntityMetric to reflect domain knowledge about action similarity (e.g. “Course_view” is closer to “Group_work” than to “Feedback”). See the API reference for details.

entity_metric = HammingEntityMetric(entity_feature="Action")
sequence_metric = EditSequenceMetric(
    entity_metric=entity_metric,
    indel_cost=1.0,
)

Step 2: Cluster sessions#

We group sessions into 5 clusters using complete-linkage hierarchical clustering. After fit(), the cluster label is automatically added as a static feature under session_cluster.

clusterer = HierarchicalClusterer(
    metric=sequence_metric,
    n_clusters=5,
    linkage="complete",
    cluster_column="session_cluster",
)
clusterer.fit(pool_filtered)

┌─ HierarchicalClusterer
│
│ Step 1/2: Computing distance matrix
│
│   ┌─ EditSequenceMetric
│   │

│   │ Chunks:   0%|          | 0/1 [00:00<?, ?it/s]
│   │ Chunks: 100%|██████████| 1/1 [00:03<00:00,  3.28s/it]
│   │ Chunks: 100%|██████████| 1/1 [00:03<00:00,  3.28s/it]
│   │
│   └─ Done (5150 sequences · 3.34s)
│
│ Step 2/2: Clustering (HierarchicalClusterer)
│
└─ Done (5150 items, 5 clusters · 3.74s)

HierarchicalClusterer(clusters=5)

Step 3: Inspect cluster membership#

clusters exposes the fitted Cluster objects directly.

for cluster in clusterer.clusters:
    print(cluster)

Cluster(id=0, size=159)
Cluster(id=1, size=161)
Cluster(id=2, size=38)
Cluster(id=3, size=4523)
Cluster(id=4, size=269)

Cluster labels are also stored as a static feature for downstream use.

pool_filtered.static_data().head()

	session	session_cluster	user
0	1	3	00a05cc62
1	2	3	00a05cc62
2	3	3	00a05cc62
3	4	3	00a05cc62
4	5	3	00a05cc62

Step 4: Faceted distribution per cluster#

Each panel shows how action proportions evolve across positions for one cluster. Structural differences between clusters reveal distinct learning strategies.

# fmt: off
SequenceVisualizer.distribution(bin_size=1) \
    .title("Action distribution by learning cluster") \
    .x_axis(label="Position in session") \
    .facet(by="session_cluster", is_static=True, cols=3, share_y=True) \
    .draw(pool_filtered, entity_feature="Action") \
    .show()
# fmt: on

Action distribution by learning cluster, session_cluster = 0, session_cluster = 1, session_cluster = 2, session_cluster = 3, session_cluster = 4

/home/runner/work/TanaT/TanaT/src/tanat/visualization/sequence/base/builder.py:280: UserWarning: 5150 row(s) have a null time index (__END__) and will be excluded from the visualisation.
  return self._draw_faceted(

Total running time of the script: (0 minutes 4.957 seconds)

Gallery generated by Sphinx-Gallery