tanat.clustering package

Subpackages

Submodules

tanat.clustering.base module

Clusterer ABC: base class for all clustering algorithms.

class tanat.clustering.base.Clusterer(settings=None)

Bases: SettingsMixin, Registrable, DisplayMixin, ABC

Abstract base class for all clustering algorithms.

__init__(settings=None) → None

Initialise with the given settings.

property clusters: list[Cluster] | None

Cluster results. None before fit().

fit(pool: SequencePool | TrajectoryPool) → Self

Fit the clustering model to pool.

Returns:

self for method chaining.
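
The contract above (clusters is None before fit(), and fit() returns self) can be sketched with a toy subclass. SketchClusterer, ParityClusterer, the _compute hook, and the list-of-integers pool are all hypothetical stand-ins chosen for illustration; they are not part of tanat.

```python
from abc import ABC, abstractmethod


class SketchClusterer(ABC):
    """Hypothetical stand-in mirroring the documented fit/clusters contract."""

    def __init__(self):
        self._clusters = None  # stays None until fit() has run

    @property
    def clusters(self):
        return self._clusters

    def fit(self, pool):
        self._clusters = self._compute(pool)
        return self  # returning self is what allows method chaining

    @abstractmethod
    def _compute(self, pool):
        """Return the clusters found in pool (assumed hook, not a tanat name)."""


class ParityClusterer(SketchClusterer):
    """Toy algorithm: groups integer IDs by parity."""

    def _compute(self, pool):
        evens = [i for i in pool if i % 2 == 0]
        odds = [i for i in pool if i % 2 == 1]
        return [evens, odds]


clusterer = ParityClusterer()
before = clusterer.clusters                    # nothing fitted yet
after = clusterer.fit([1, 2, 3, 4]).clusters   # chained call on the returned self
```

Because fit() hands back the instance, construction, fitting, and reading results can be written as one expression.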

Module contents

Clustering module.

class tanat.clustering.CLARAClusterer(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', sampling_ratio: float = 0.1, nb_pam_instances: int = 5, n_clusters: int = 2, max_iter: int = 50, random_state: int | None = None, cluster_column: str = '__CLARA_CLUSTERS__')

Bases: MedoidMixin, Clusterer

CLARA (Clustering Large Applications), sampling-based PAM.

Runs PAM on nb_pam_instances random sub-samples of the pool, evaluates each result on the full pool, and keeps the best medoids.
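
The procedure described above can be sketched in pure Python on a plain distance matrix. This is an illustration under stated assumptions, not the tanat implementation: the helper names (cost, pam, clara), the list-of-lists matrix, and the rule for rounding the sample size are all hypothetical.

```python
import math
import random


def cost(dist, medoids, items):
    """Total distance from each item to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in items)


def pam(dist, items, k, max_iter=50):
    """Compact PAM restricted to the given item indices."""
    medoids = []
    while len(medoids) < k:  # BUILD: greedy medoid selection
        rest = [i for i in items if i not in medoids]
        medoids.append(min(rest, key=lambda c: cost(dist, medoids + [c], items)))
    for _ in range(max_iter):  # SWAP: best single exchange per iteration
        current = cost(dist, medoids, items)
        trials = [
            medoids[:p] + [c] + medoids[p + 1:]
            for p in range(k)
            for c in items
            if c not in medoids
        ]
        best = min(trials, key=lambda t: cost(dist, t, items), default=None)
        if best is None or cost(dist, best, items) >= current:
            break
        medoids = best
    return medoids


def clara(dist, k, sampling_ratio=0.1, nb_pam_instances=5, random_state=None):
    """Run PAM on random sub-samples; keep the medoids that score best on all items."""
    rng = random.Random(random_state)
    everything = range(len(dist))
    # Assumed rounding rule: at least k items, never more than the pool.
    sample_size = min(len(dist), max(k, math.ceil(sampling_ratio * len(dist))))
    best_medoids, best_cost = None, math.inf
    for _ in range(nb_pam_instances):
        sample = rng.sample(everything, sample_size)
        medoids = pam(dist, sample, k)
        full_cost = cost(dist, medoids, everything)  # evaluated on the full pool
        if full_cost < best_cost:
            best_medoids, best_cost = medoids, full_cost
    return best_medoids, best_cost


points = [float(v) for v in range(20)] + [100.0 + v for v in range(20)]
dist = [[abs(a - b) for b in points] for a in points]
medoids, full_cost = clara(dist, k=2, sampling_ratio=0.25, random_state=42)
again, again_cost = clara(dist, k=2, sampling_ratio=0.25, random_state=42)
```

Each PAM run only sees the sampled rows, so its cost grows with the sample size rather than the pool size; only the final scoring pass touches every item. Fixing random_state makes the two calls at the end return the same medoids.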

Example:

clara = CLARAClusterer(
    metric="linearpairwise",
    n_clusters=5,
    sampling_ratio=0.1,
    nb_pam_instances=5,
    random_state=42,
)
clara.fit(pool)
clara.medoids   # best medoids across all PAM instances

SETTINGS_CLASS

alias of CLARASettings

__init__(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', sampling_ratio: float = 0.1, nb_pam_instances: int = 5, n_clusters: int = 2, max_iter: int = 50, random_state: int | None = None, cluster_column: str = '__CLARA_CLUSTERS__') → None

Initialise with the given settings.

class tanat.clustering.CLARASettings(*, metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', sampling_ratio: float = 0.1, nb_pam_instances: int = 5, n_clusters: int = 2, max_iter: int = 50, random_state: int | None = None, cluster_column: str = '__CLARA_CLUSTERS__')

Bases: object

Settings for CLARAClusterer.

Parameters:
  • metric – Metric name or instance. Default: "linearpairwise".

  • sampling_ratio – Fraction of the pool used per PAM instance. Must be in (0, 1]. Default: 0.1.

  • nb_pam_instances – Number of PAM runs on independent random samples. Default: 5.

  • n_clusters – Number of medoids per PAM run. Default: 2.

  • max_iter – Maximum SWAP iterations per PAM run. Default: 50.

  • random_state – Seed for the random number generator. None → non-reproducible.

  • cluster_column – Static-feature column name injected after fit().

__init__(*args: Any, **kwargs: Any) → None

cluster_column: str = '__CLARA_CLUSTERS__'

max_iter: int = 50

metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

n_clusters: int = 2

nb_pam_instances: int = 5

random_state: int | None = None

sampling_ratio: float = 0.1

class tanat.clustering.Cluster(cluster_id: int, items: list)

Bases: object

A cluster of items produced by a Clusterer.

Immutable value object: populated once after fit() and never mutated.

Parameters:
  • cluster_id – Integer identifier for the cluster.

  • items – List of item IDs belonging to this cluster.

__init__(cluster_id: int, items: list) → None

property id: int

Cluster identifier.

property items: list

Item IDs belonging to this cluster.

property size: int

Number of items in the cluster.
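
One common way to get a "populated once, never mutated" record in Python is a frozen dataclass. The sketch below is hypothetical (SketchCluster is not the tanat class, and it stores items as a tuple rather than the documented list) and only illustrates the value-object idea with the same three read-only attributes.

```python
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class SketchCluster:
    """Hypothetical immutable cluster record (not the tanat class)."""

    id: int
    items: tuple  # a tuple, so the membership itself cannot be edited in place

    @property
    def size(self):
        return len(self.items)


cluster = SketchCluster(id=0, items=("a", "b", "c"))

try:
    cluster.id = 1  # frozen dataclasses reject attribute assignment
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Deriving size from items instead of storing it keeps the two from ever disagreeing.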

class tanat.clustering.Clusterer(settings=None)

Bases: SettingsMixin, Registrable, DisplayMixin, ABC

Abstract base class for all clustering algorithms.

__init__(settings=None) → None

Initialise with the given settings.

property clusters: list[Cluster] | None

Cluster results. None before fit().

fit(pool: SequencePool | TrajectoryPool) → Self

Fit the clustering model to pool.

Returns:

self for method chaining.

class tanat.clustering.HierarchicalClusterer(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, distance_threshold: float | None = None, linkage: str = 'complete', cluster_column: str = '__HCLUSTERS__')

Bases: Clusterer

Agglomerative hierarchical clustering (sklearn AgglomerativeClustering).

Passes a precomputed DistanceMatrix to AgglomerativeClustering with metric="precomputed".

Example:

clusterer = HierarchicalClusterer(metric="linearpairwise", n_clusters=5)
clusterer.fit(pool)
clusterer.clusters    # list[Cluster]

SETTINGS_CLASS

alias of HierarchicalSettings

__init__(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, distance_threshold: float | None = None, linkage: str = 'complete', cluster_column: str = '__HCLUSTERS__') → None

Initialise with the given settings.

class tanat.clustering.HierarchicalSettings(*, metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, distance_threshold: float | None = None, linkage: str = 'complete', cluster_column: str = '__HCLUSTERS__')

Bases: object

Settings for HierarchicalClusterer.

Parameters:
  • metric – Metric name or instance.

  • n_clusters – Target number of clusters (ignored when distance_threshold is set).

  • distance_threshold – Cut-off distance for dendrogram trimming.

  • linkage – "complete", "average", "single", or "ward".

  • cluster_column – Static-feature column injected after fit().
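
How n_clusters and distance_threshold interact can be illustrated with a naive agglomerative loop over a precomputed distance matrix. This is a pure-Python sketch under stated assumptions, not what the documented class runs (it delegates to sklearn): agglomerate and its arguments are hypothetical, "ward" is left out, and merging stops once the closest pair is at or above the threshold.

```python
def agglomerate(dist, n_clusters=2, distance_threshold=None, linkage="complete"):
    """Naive agglomerative clustering on a precomputed square distance matrix."""
    combine = {
        "complete": max,
        "single": min,
        "average": lambda d: sum(d) / len(d),
    }[linkage]
    clusters = [[i] for i in range(len(dist))]

    def gap(a, b):
        # Linkage distance between two clusters of item indices.
        return combine([dist[i][j] for i in a for j in b])

    while len(clusters) > 1:
        # The cluster count is the stopping rule only when no threshold is given.
        if distance_threshold is None and len(clusters) <= n_clusters:
            break
        pairs = [
            (gap(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        ]
        d, x, y = min(pairs)
        if distance_threshold is not None and d >= distance_threshold:
            break  # the closest pair is already too far apart to merge
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0, 50.0]
dist = [[abs(a - b) for b in points] for a in points]

by_count = agglomerate(dist, n_clusters=2)
by_threshold = agglomerate(dist, n_clusters=2, distance_threshold=5.0)
```

With the same n_clusters=2, the threshold run stops earlier and keeps three groups, which is the sense in which n_clusters is ignored once a cut-off distance is supplied.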

__init__(*args: Any, **kwargs: Any) → None

cluster_column: str = '__HCLUSTERS__'

distance_threshold: float | None = None

linkage: str = 'complete'

metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

n_clusters: int = 2

class tanat.clustering.MedoidMixin

Bases: object

Mixin for clusterers that expose medoids (representative objects).

__init__() → None

property medoids: list | None

Medoid item IDs. None before fit().

class tanat.clustering.PAMClusterer(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, max_iter: int = 50, cluster_column: str = '__PAM_CLUSTERS__')

Bases: MedoidMixin, Clusterer

Partition Around Medoids (PAM) clustering.

Two-phase algorithm:

  1. BUILD: greedy initial medoid selection.

  2. SWAP: iterative improvement by swapping medoid/non-medoid pairs.

Inner loops use Numba-compiled kernels for performance.
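
The two phases can be sketched in plain Python on a small distance matrix. This is a readable, unoptimised illustration rather than the Numba kernels mentioned above; total_cost, pam, and the list-of-lists matrix are hypothetical names introduced for the example.

```python
def total_cost(dist, medoids):
    """Sum of each item's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, n_clusters, max_iter=50):
    """Naive PAM on a square distance matrix (list of lists)."""
    n = len(dist)

    # BUILD: greedily add the item that lowers the total cost the most.
    medoids = []
    while len(medoids) < n_clusters:
        candidates = [i for i in range(n) if i not in medoids]
        best = min(candidates, key=lambda c: total_cost(dist, medoids + [c]))
        medoids.append(best)

    # SWAP: apply the best medoid/non-medoid exchange until none improves.
    cost = total_cost(dist, medoids)
    for _ in range(max_iter):
        best_cost, best_swap = cost, None
        for pos in range(n_clusters):
            for cand in range(n):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < best_cost:
                    best_cost, best_swap = trial_cost, (pos, cand)
        if best_swap is None:
            break  # local optimum: no single swap lowers the cost
        medoids[best_swap[0]] = best_swap[1]
        cost = best_cost

    # Assign every item to the index of its nearest medoid.
    labels = [
        min(range(n_clusters), key=lambda k: dist[i][medoids[k]])
        for i in range(n)
    ]
    return medoids, labels


points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels = pam(dist, n_clusters=2)
```

On this data BUILD starts from items 2 and 4, and a single SWAP moves the first medoid to item 1, the centre of the left group. max_iter only caps the number of SWAP rounds; the loop exits earlier as soon as no exchange improves the cost.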

Example:

pam = PAMClusterer(metric="linearpairwise", n_clusters=3, max_iter=100)
pam.fit(pool)
pam.medoids   # list of representative item IDs
pam.clusters  # list[Cluster]

SETTINGS_CLASS

alias of PAMSettings

__init__(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, max_iter: int = 50, cluster_column: str = '__PAM_CLUSTERS__') → None

Initialise with the given settings.

class tanat.clustering.PAMSettings(*, metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, max_iter: int = 50, cluster_column: str = '__PAM_CLUSTERS__')

Bases: object

Settings for PAMClusterer.

Parameters:
  • metric – Metric name or instance. Default: "linearpairwise".

  • n_clusters – Number of medoids to find. Must be > 0.

  • max_iter – Maximum number of SWAP iterations. Must be > 0.

  • cluster_column – Static-feature column name injected after fit().
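
The "Must be > 0" constraints above can be illustrated with a small validated record. The documented class relies on Pydantic; the sketch below uses a standard-library dataclass as a stand-in, so SketchPAMSettings, its __post_init__ checks, and the use of asdict in place of model_dump are assumptions made for the example.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class SketchPAMSettings:
    """Hypothetical settings record; the documented class uses Pydantic instead."""

    metric: str = "linearpairwise"
    n_clusters: int = 2
    max_iter: int = 50
    cluster_column: str = "__PAM_CLUSTERS__"

    def __post_init__(self):
        # Reject values the parameter list above rules out.
        if self.n_clusters <= 0:
            raise ValueError("n_clusters must be > 0")
        if self.max_iter <= 0:
            raise ValueError("max_iter must be > 0")


settings = SketchPAMSettings(n_clusters=3)
dumped = asdict(settings)  # plain dict, playing the role of model_dump()

try:
    SketchPAMSettings(max_iter=0)
    rejected = False
except ValueError:
    rejected = True
```

Validating at construction time means a clusterer never has to re-check its settings inside fit().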

__init__(*args: Any, **kwargs: Any) → None

cluster_column: str = '__PAM_CLUSTERS__'

max_iter: int = 50

metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

n_clusters: int = 2