tanat.clustering.type package

Submodules

tanat.clustering.type.clara module

CLARAClusterer: sampling-based PAM for large datasets.

class tanat.clustering.type.clara.CLARAClusterer(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', sampling_ratio: float = 0.1, nb_pam_instances: int = 5, n_clusters: int = 2, max_iter: int = 50, random_state: int | None = None, cluster_column: str = '__CLARA_CLUSTERS__')

Bases: MedoidMixin, Clusterer

CLARA (Clustering Large Applications), sampling-based PAM.

Runs PAM on nb_pam_instances random sub-samples of the pool, evaluates each result on the full pool, and keeps the best medoids.

Example:

clara = CLARAClusterer(
    metric="linearpairwise",
    n_clusters=5,
    sampling_ratio=0.1,
    nb_pam_instances=5,
    random_state=42,
)
clara.fit(pool)
clara.medoids   # best medoids across all PAM instances
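
The sampling loop described above can be sketched in plain Python. This is an illustration under stated assumptions, not tanat's implementation: the helper names (total_cost, brute_force_medoids, clara_sketch) are hypothetical, an exhaustive search stands in for PAM as the per-sample solver to keep the sketch short, and the rule that turns sampling_ratio into a sample size is a guess.

```python
import itertools
import random


def total_cost(dist, medoids, items):
    """Sum of each item's distance to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in items)


def brute_force_medoids(dist, sample, k):
    """Stand-in for PAM: try every k-subset of the sample (tiny samples only)."""
    return min(
        itertools.combinations(sample, k),
        key=lambda cand: total_cost(dist, cand, sample),
    )


def clara_sketch(dist, k, sampling_ratio=0.1, nb_instances=5, seed=None):
    """Return (medoids, cost) for the sample-derived medoids that score best on the full pool."""
    if not 0 < sampling_ratio <= 1:
        raise ValueError("sampling_ratio must be in (0, 1]")
    rng = random.Random(seed)
    items = list(range(len(dist)))
    # Assumption: the sample size is the rounded fraction, clamped to [k, n].
    sample_size = min(len(items), max(k, round(sampling_ratio * len(items))))
    best_medoids, best_cost = None, float("inf")
    for _ in range(nb_instances):
        sample = sorted(rng.sample(items, sample_size))
        medoids = brute_force_medoids(dist, sample, k)
        cost = total_cost(dist, medoids, items)  # scored on the FULL pool
        if cost < best_cost:
            best_medoids, best_cost = list(medoids), cost
    return best_medoids, best_cost


# Twelve points on a line, in two well-separated groups.
points = [0, 1, 2, 3, 4, 5, 100, 101, 102, 103, 104, 105]
dist = [[abs(a - b) for b in points] for a in points]
medoids, cost = clara_sketch(dist, k=2, sampling_ratio=0.5, nb_instances=5, seed=42)
```

With sampling_ratio=1.0 the single sample is the whole pool, so the sketch returns the exact optimum for this toy data (cost 18). Smaller ratios trade that guarantee for speed, which is why several independent samples are drawn and the best one is kept.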

SETTINGS_CLASS

alias of CLARASettings

__init__(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', sampling_ratio: float = 0.1, nb_pam_instances: int = 5, n_clusters: int = 2, max_iter: int = 50, random_state: int | None = None, cluster_column: str = '__CLARA_CLUSTERS__') → None

Initialise with the given settings.

class tanat.clustering.type.clara.CLARASettings(*, metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', sampling_ratio: float = 0.1, nb_pam_instances: int = 5, n_clusters: int = 2, max_iter: int = 50, random_state: int | None = None, cluster_column: str = '__CLARA_CLUSTERS__')

Bases: object

Settings for CLARAClusterer.

Parameters:
  • metric – Metric name or instance. Default: "linearpairwise".

  • sampling_ratio – Fraction of the pool used per PAM instance. Must be in (0, 1]. Default: 0.1.

  • nb_pam_instances – Number of PAM runs on independent random samples. Default: 5.

  • n_clusters – Number of medoids per PAM run. Default: 2.

  • max_iter – Maximum SWAP iterations per PAM run. Default: 50.

  • random_state – Seed for the random number generator. None → non-reproducible.

  • cluster_column – Static-feature column name injected after fit().

__init__(*args: Any, **kwargs: Any) → None
cluster_column: str = '__CLARA_CLUSTERS__'
max_iter: int = 50
metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise'
model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

n_clusters: int = 2
nb_pam_instances: int = 5
random_state: int | None = None
sampling_ratio: float = 0.1

tanat.clustering.type.hierarchical module

HierarchicalClusterer: agglomerative hierarchical clustering via sklearn.

class tanat.clustering.type.hierarchical.HierarchicalClusterer(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, distance_threshold: float | None = None, linkage: str = 'complete', cluster_column: str = '__HCLUSTERS__')

Bases: Clusterer

Agglomerative hierarchical clustering (sklearn AgglomerativeClustering).

Consumes a precomputed DistanceMatrix with metric="precomputed".

Example:

clusterer = HierarchicalClusterer(metric="linearpairwise", n_clusters=5)
clusterer.fit(pool)
clusterer.clusters    # list[Cluster]
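
To show what agglomerative clustering does with a precomputed distance matrix, here is a small pure-Python sketch. It is not sklearn's or tanat's code: the function name agglomerate is hypothetical, "ward" is left out, and the stopping rules (stop at n_clusters, or stop once the closest pair is at or above distance_threshold) are written to mirror the parameter descriptions as an assumption.

```python
def agglomerate(dist, n_clusters=2, distance_threshold=None, linkage="complete"):
    """Repeatedly merge the two closest clusters until a stopping rule is met."""
    combine = {
        "complete": max,                          # farthest pair of members
        "single": min,                            # closest pair of members
        "average": lambda ds: sum(ds) / len(ds),  # mean over all member pairs
    }[linkage]
    clusters = [[i] for i in range(len(dist))]

    def linkage_distance(a, b):
        return combine([dist[i][j] for i in a for j in b])

    while len(clusters) > 1:
        # n_clusters only applies when no threshold is given.
        if distance_threshold is None and len(clusters) <= n_clusters:
            break
        d, x, y = min(
            (linkage_distance(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if distance_threshold is not None and d >= distance_threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Seven points on a line: two tight groups and one outlier.
points = [0, 1, 2, 10, 11, 12, 30]
dist = [[abs(a - b) for b in points] for a in points]

by_count = agglomerate(dist, n_clusters=3)
by_threshold = agglomerate(dist, distance_threshold=15)
```

Here by_count is [[0, 1, 2], [3, 4, 5], [6]], while the threshold of 15 allows one more merge (complete-linkage distance 12) and yields [[0, 1, 2, 3, 4, 5], [6]].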

SETTINGS_CLASS

alias of HierarchicalSettings

__init__(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, distance_threshold: float | None = None, linkage: str = 'complete', cluster_column: str = '__HCLUSTERS__') → None

Initialise with the given settings.

class tanat.clustering.type.hierarchical.HierarchicalSettings(*, metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, distance_threshold: float | None = None, linkage: str = 'complete', cluster_column: str = '__HCLUSTERS__')

Bases: object

Settings for HierarchicalClusterer.

Parameters:
  • metric – Metric name or instance.

  • n_clusters – Target number of clusters (ignored when distance_threshold is set).

  • distance_threshold – Cut-off distance for dendrogram trimming.

  • linkage – "complete", "average", "single", or "ward".

  • cluster_column – Static-feature column injected after fit().

__init__(*args: Any, **kwargs: Any) → None
cluster_column: str = '__HCLUSTERS__'
distance_threshold: float | None = None
linkage: str = 'complete'
metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise'
model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

n_clusters: int = 2

tanat.clustering.type.pam module

PAMClusterer: Partition Around Medoids clustering with Numba-optimized kernels.

class tanat.clustering.type.pam.MedoidMixin

Bases: object

Mixin for clusterers that expose medoids (representative objects).

__init__() → None
property medoids: list | None

Medoid item IDs. None before fit().

class tanat.clustering.type.pam.PAMClusterer(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, max_iter: int = 50, cluster_column: str = '__PAM_CLUSTERS__')

Bases: MedoidMixin, Clusterer

Partition Around Medoids (PAM) clustering.

Two-phase algorithm:

  1. BUILD: greedy initial medoid selection.

  2. SWAP: iterative improvement by swapping medoid/non-medoid pairs.

Inner loops use Numba-compiled kernels for performance.
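
The two phases can be sketched in plain Python on a precomputed distance matrix. This is a minimal, unoptimized illustration without Numba, not the library's kernels: the function names (total_cost, build, swap, pam_sketch) are hypothetical, and the swap rule used here (apply the single best cost-reducing swap per iteration) is one common variant chosen as an assumption.

```python
def total_cost(dist, medoids):
    """Sum of each item's distance to its nearest medoid."""
    return sum(min(row[m] for m in medoids) for row in dist)


def build(dist, k):
    """BUILD: greedily add the item that lowers the total cost the most."""
    medoids = []
    for _ in range(k):
        candidates = [i for i in range(len(dist)) if i not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(dist, medoids + [c])))
    return medoids


def swap(dist, medoids, max_iter=50):
    """SWAP: apply the best cost-reducing medoid/non-medoid swap until none is left."""
    medoids = list(medoids)
    cost = total_cost(dist, medoids)
    for _ in range(max_iter):
        best_cost, best_swap = cost, None
        for pos in range(len(medoids)):
            for cand in range(len(dist)):
                if cand in medoids:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < best_cost:
                    best_cost, best_swap = trial_cost, (pos, cand)
        if best_swap is None:  # local optimum reached
            break
        medoids[best_swap[0]] = best_swap[1]
        cost = best_cost
    return medoids, cost


def pam_sketch(dist, k, max_iter=50):
    medoids, cost = swap(dist, build(dist, k), max_iter)
    # Label every item with the position of its nearest medoid.
    labels = [min(range(k), key=lambda j: row[medoids[j]]) for row in dist]
    return medoids, labels, cost


points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels, cost = pam_sketch(dist, k=2)
```

On this toy data BUILD picks items 2 and 4 (total cost 5), and a single SWAP replaces item 2 with item 1, giving medoids [1, 4] at cost 4. That is the case the second phase exists for: the greedy start is good but not locally optimal.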

Example:

pam = PAMClusterer(metric="linearpairwise", n_clusters=3, max_iter=100)
pam.fit(pool)
pam.medoids   # list of representative item IDs
pam.clusters  # list[Cluster]

SETTINGS_CLASS

alias of PAMSettings

__init__(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, max_iter: int = 50, cluster_column: str = '__PAM_CLUSTERS__') → None

Initialise with the given settings.

class tanat.clustering.type.pam.PAMSettings(*, metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, max_iter: int = 50, cluster_column: str = '__PAM_CLUSTERS__')

Bases: object

Settings for PAMClusterer.

Parameters:
  • metric – Metric name or instance. Default: "linearpairwise".

  • n_clusters – Number of medoids to find. Must be > 0.

  • max_iter – Maximum number of SWAP iterations. Must be > 0.

  • cluster_column – Static-feature column name injected after fit().
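
The constraints listed above (n_clusters > 0, max_iter > 0) can be illustrated with a standard-library stand-in. This is only a sketch: PAMSettingsSketch is a hypothetical dataclass, not the real settings class, and dataclasses.asdict plays the role that model_dump() has in the documented API. How the real class reports invalid values is not stated here, so the ValueError is an assumption.

```python
from dataclasses import asdict, dataclass


@dataclass
class PAMSettingsSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    metric: str = "linearpairwise"
    n_clusters: int = 2
    max_iter: int = 50
    cluster_column: str = "__PAM_CLUSTERS__"

    def __post_init__(self):
        # Documented constraints: both values must be > 0.
        if self.n_clusters <= 0:
            raise ValueError("n_clusters must be > 0")
        if self.max_iter <= 0:
            raise ValueError("max_iter must be > 0")


settings = PAMSettingsSketch(n_clusters=3, max_iter=100)
as_dict = asdict(settings)  # plays the role of model_dump() in this sketch
```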

__init__(*args: Any, **kwargs: Any) → None
cluster_column: str = '__PAM_CLUSTERS__'
max_iter: int = 50
metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise'
model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

n_clusters: int = 2

Module contents

Clustering subtypes.

class tanat.clustering.type.CLARAClusterer(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', sampling_ratio: float = 0.1, nb_pam_instances: int = 5, n_clusters: int = 2, max_iter: int = 50, random_state: int | None = None, cluster_column: str = '__CLARA_CLUSTERS__')

Bases: MedoidMixin, Clusterer

CLARA (Clustering Large Applications), sampling-based PAM.

Runs PAM on nb_pam_instances random sub-samples of the pool, evaluates each result on the full pool, and keeps the best medoids.

Example:

clara = CLARAClusterer(
    metric="linearpairwise",
    n_clusters=5,
    sampling_ratio=0.1,
    nb_pam_instances=5,
    random_state=42,
)
clara.fit(pool)
clara.medoids   # best medoids across all PAM instances

SETTINGS_CLASS

alias of CLARASettings

__init__(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', sampling_ratio: float = 0.1, nb_pam_instances: int = 5, n_clusters: int = 2, max_iter: int = 50, random_state: int | None = None, cluster_column: str = '__CLARA_CLUSTERS__') → None

Initialise with the given settings.

class tanat.clustering.type.CLARASettings(*, metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', sampling_ratio: float = 0.1, nb_pam_instances: int = 5, n_clusters: int = 2, max_iter: int = 50, random_state: int | None = None, cluster_column: str = '__CLARA_CLUSTERS__')

Bases: object

Settings for CLARAClusterer.

Parameters:
  • metric – Metric name or instance. Default: "linearpairwise".

  • sampling_ratio – Fraction of the pool used per PAM instance. Must be in (0, 1]. Default: 0.1.

  • nb_pam_instances – Number of PAM runs on independent random samples. Default: 5.

  • n_clusters – Number of medoids per PAM run. Default: 2.

  • max_iter – Maximum SWAP iterations per PAM run. Default: 50.

  • random_state – Seed for the random number generator. None → non-reproducible.

  • cluster_column – Static-feature column name injected after fit().

__init__(*args: Any, **kwargs: Any) → None
cluster_column: str = '__CLARA_CLUSTERS__'
max_iter: int = 50
metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise'
model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

n_clusters: int = 2
nb_pam_instances: int = 5
random_state: int | None = None
sampling_ratio: float = 0.1

class tanat.clustering.type.HierarchicalClusterer(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, distance_threshold: float | None = None, linkage: str = 'complete', cluster_column: str = '__HCLUSTERS__')

Bases: Clusterer

Agglomerative hierarchical clustering (sklearn AgglomerativeClustering).

Consumes a precomputed DistanceMatrix with metric="precomputed".

Example:

clusterer = HierarchicalClusterer(metric="linearpairwise", n_clusters=5)
clusterer.fit(pool)
clusterer.clusters    # list[Cluster]

SETTINGS_CLASS

alias of HierarchicalSettings

__init__(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, distance_threshold: float | None = None, linkage: str = 'complete', cluster_column: str = '__HCLUSTERS__') → None

Initialise with the given settings.

class tanat.clustering.type.HierarchicalSettings(*, metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, distance_threshold: float | None = None, linkage: str = 'complete', cluster_column: str = '__HCLUSTERS__')

Bases: object

Settings for HierarchicalClusterer.

Parameters:
  • metric – Metric name or instance.

  • n_clusters – Target number of clusters (ignored when distance_threshold is set).

  • distance_threshold – Cut-off distance for dendrogram trimming.

  • linkage – "complete", "average", "single", or "ward".

  • cluster_column – Static-feature column injected after fit().

__init__(*args: Any, **kwargs: Any) → None
cluster_column: str = '__HCLUSTERS__'
distance_threshold: float | None = None
linkage: str = 'complete'
metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise'
model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

n_clusters: int = 2

class tanat.clustering.type.MedoidMixin

Bases: object

Mixin for clusterers that expose medoids (representative objects).

__init__() → None
property medoids: list | None

Medoid item IDs. None before fit().

class tanat.clustering.type.PAMClusterer(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, max_iter: int = 50, cluster_column: str = '__PAM_CLUSTERS__')

Bases: MedoidMixin, Clusterer

Partition Around Medoids (PAM) clustering.

Two-phase algorithm:

  1. BUILD: greedy initial medoid selection.

  2. SWAP: iterative improvement by swapping medoid/non-medoid pairs.

Inner loops use Numba-compiled kernels for performance.

Example:

pam = PAMClusterer(metric="linearpairwise", n_clusters=3, max_iter=100)
pam.fit(pool)
pam.medoids   # list of representative item IDs
pam.clusters  # list[Cluster]

SETTINGS_CLASS

alias of PAMSettings

__init__(metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, max_iter: int = 50, cluster_column: str = '__PAM_CLUSTERS__') → None

Initialise with the given settings.

class tanat.clustering.type.PAMSettings(*, metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise', n_clusters: int = 2, max_iter: int = 50, cluster_column: str = '__PAM_CLUSTERS__')

Bases: object

Settings for PAMClusterer.

Parameters:
  • metric – Metric name or instance. Default: "linearpairwise".

  • n_clusters – Number of medoids to find. Must be > 0.

  • max_iter – Maximum number of SWAP iterations. Must be > 0.

  • cluster_column – Static-feature column name injected after fit().

__init__(*args: Any, **kwargs: Any) → None
cluster_column: str = '__PAM_CLUSTERS__'
max_iter: int = 50
metric: str | SequenceMetric | TrajectoryMetric = 'linearpairwise'
model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

n_clusters: int = 2