tanat.metric package

Subpackages

Submodules

tanat.metric.matrix module

DistanceMatrix: thin numpy wrapper with associated IDs.

class tanat.metric.matrix.DistanceMatrix(data: ndarray, ids: list)[source]#

Bases: object

Pairwise distance matrix with associated IDs.

Wrapper around a square numpy.ndarray associating each row/column with a sequence (or trajectory) identifier.

Example:

dm = DistanceMatrix(np.zeros((3, 3), dtype="float32"), ids=[1, 2, 3])
dm.to_frame()           # pandas DataFrame (default)
dm.to_frame("polars")   # polars DataFrame
dm.to_numpy()           # raw array
__init__(data: ndarray, ids: list) None[source]#

Create a DistanceMatrix.

Parameters:
  • data – Square numpy array of shape (n, n).

  • ids – List of n identifiers matching the matrix rows/columns. Stored as-is (order preserved from pool.unique_ids).

property data: ndarray[source]#

Raw numpy array (shape (n, n)).

classmethod empty(ids: list, dtype: str = 'float32') DistanceMatrix[source]#

Create a zero-initialised square matrix.

Parameters:
  • ids – List of identifiers.

  • dtype – Numpy dtype string (default "float32").

Returns:

A DistanceMatrix of shape (n, n) filled with zeros.

classmethod from_path(path: str | Path) DistanceMatrix[source]#

Load a previously computed distance matrix from disk.

Uses resolve_path to resolve the storage directory (workspace name or filesystem path), then opens the memmap in read-only mode.

Parameters:

path – Storage directory. Same formats as StorageOptions.store_path: plain name ("distances"), relative path ("./distances"), or absolute path.

Returns:

A DistanceMatrix backed by a read-only memmap.

Raises:
  • FileNotFoundError – If the directory or required files don’t exist.

  • ValueError – If progress.json status is not "complete" (incomplete computation).

property ids: list[source]#

Ordered list of identifiers (same order as rows/columns).

property is_memmap: bool[source]#

True if the underlying data is a memory-mapped file.

property shape: tuple[int, int][source]#

Shape of the underlying array.

to_frame(fmt: Literal['pandas', 'polars'] = 'pandas') → pandas.DataFrame | polars.DataFrame

Return a labelled dataframe with IDs as index/columns.

Parameters:

fmt – "pandas" (default) returns a pandas.DataFrame; "polars" returns a polars.DataFrame with an extra "id" column (Polars has no row index).

Returns:

Square dataframe of shape (n, n).

Raises:

ValueError – If fmt is not "pandas" or "polars".

to_numpy() ndarray[source]#

Return the underlying numpy array.

Returns:

Square array of shape (n, n).

Module contents

Metric Module.

class tanat.metric.AggregationSettings(*, default_metric: SequenceMetric = 'linearpairwise', sequence_metrics: dict[str, SequenceMetric] | None = None, agg_fun: str = 'mean', weights: dict[str, float] | None = None)[source]#

Bases: object

Settings for AggregationTrajectoryMetric.

agg_fun and weights are orthogonal: "mean" + weights computes a weighted mean, "sum" + weights a weighted sum. Aliases absent from weights default to 1.0.
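The interaction between agg_fun and weights can be illustrated with a small standalone sketch. It does not call tanat; the function name aggregate is hypothetical, and dividing the weighted mean by the sum of the weights is an assumption about what "weighted mean" means here.

```python
def aggregate(distances, agg_fun="mean", weights=None):
    """Combine per-alias distances into one scalar (illustrative only)."""
    weights = weights or {}
    # Aliases absent from `weights` default to 1.0, as described above.
    pairs = [(d, weights.get(alias, 1.0)) for alias, d in distances.items()]
    weighted_sum = sum(d * w for d, w in pairs)
    if agg_fun == "sum":
        return weighted_sum
    if agg_fun == "mean":
        # Assumption: a weighted mean divides by the total weight.
        return weighted_sum / sum(w for _, w in pairs)
    raise ValueError(f"unsupported agg_fun: {agg_fun!r}")


per_alias = {"events": 0.4, "states": 0.8}
print(aggregate(per_alias, "mean"))                          # unweighted mean
print(aggregate(per_alias, "sum", weights={"states": 0.5}))  # weighted sum
```

The same weights dict is accepted by both branches, which is the sense in which the two options are orthogonal.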

__init__(*args: Any, **kwargs: Any) None[source]#
agg_fun: str = 'mean'[source]#
default_metric: SequenceMetric = 'linearpairwise'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

sequence_metrics: dict[str, SequenceMetric] | None = None[source]#
weights: dict[str, float] | None = None[source]#
class tanat.metric.AggregationTrajectoryMetric(default_metric: SequenceMetric | str = 'linearpairwise', sequence_metrics: dict[str, SequenceMetric | str] | None = None, agg_fun: str = 'mean', weights: dict[str, float] | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: TrajectoryMetric

Trajectory distance by per-alias sequence distances, then weighted aggregation.

For each store alias visible on both trajectories, computes the sequence-level distance using the configured SequenceMetric. The resulting per-alias distances are aggregated (weighted mean/sum) into a scalar trajectory distance.
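A minimal sketch of the per-alias step, under stated assumptions: trajectories are modelled as plain dicts mapping a store alias to a list of values, sequence_metrics is assumed to override default_metric for the aliases it names, and both per_alias_distances and mismatch_ratio are hypothetical helpers rather than tanat functions.

```python
def mismatch_ratio(seq_a, seq_b):
    """Toy sequence metric: share of aligned positions that differ."""
    pairs = list(zip(seq_a, seq_b))
    return sum(a != b for a, b in pairs) / len(pairs)


def per_alias_distances(traj_a, traj_b, default_metric, sequence_metrics=None):
    """Return {alias: distance} for aliases present on both trajectories."""
    sequence_metrics = sequence_metrics or {}
    common = sorted(set(traj_a) & set(traj_b))
    return {
        # Assumption: a per-alias metric takes precedence over the default.
        alias: sequence_metrics.get(alias, default_metric)(traj_a[alias], traj_b[alias])
        for alias in common
    }


traj_a = {"events": list("ABC"), "states": list("XY")}
traj_b = {"events": list("ABD"), "labs": ["L"]}
print(per_alias_distances(traj_a, traj_b, mismatch_ratio))
```

Only "events" appears in the result, because it is the only alias visible on both trajectories; the resulting dict is what the aggregation step then reduces to a scalar.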

Example:

hamming = HammingEntityMetric(entity_feature="status")
lp = LinearPairwiseSequenceMetric(entity_metric=hamming)

agg = AggregationTrajectoryMetric(
    default_metric=lp,
    agg_fun="mean",
    weights={"events": 1.0, "states": 0.5},
)

dist = agg(traj_a, traj_b)
dm   = agg.compute_matrix(traj_pool)
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation.

SETTINGS_CLASS[source]#

alias of AggregationSettings

__init__(default_metric: SequenceMetric | str = 'linearpairwise', sequence_metrics: dict[str, SequenceMetric | str] | None = None, agg_fun: str = 'mean', weights: dict[str, float] | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
class tanat.metric.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Chi-squared distance between the state-time distributions of two sequences.

Rather than comparing sequences element-by-element, Chi² compares the proportion of time (or event count) spent in each categorical state.

  • Event sequences: each event contributes a weight of 1.

  • Interval / State sequences: each entity contributes end − start as its weight.

The per-category proportions are computed independently for each sequence, then compared via the Chi-squared distance formula:

\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]

Note

Chi² does not use an entity metric; entity_metric is absent from its settings. The validate_composition method checks only that the requested feature is present.

Empty-sequence behaviour:

  • Both empty → 0.0 (identical empty distributions).

  • One empty → 1.0 (maximally different distributions).
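The formula and the empty-sequence cases can be checked with a small pure-Python sketch. It is not the library's implementation: sequences are modelled as lists of (state, weight) pairs, an empty sequence is assumed to have all-zero proportions, and categories absent from both sequences are skipped to avoid a zero denominator.

```python
import math
from collections import defaultdict


def state_proportions(entities):
    """Map each state to its share of the total weight ({} if empty)."""
    totals = defaultdict(float)
    for state, weight in entities:
        totals[state] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {state: w / grand_total for state, w in totals.items()}


def chi2_distance(seq_a, seq_b):
    """Chi-squared distance between two state-time distributions."""
    p_a, p_b = state_proportions(seq_a), state_proportions(seq_b)
    total = 0.0
    for state in set(p_a) | set(p_b):
        a, b = p_a.get(state, 0.0), p_b.get(state, 0.0)
        if a + b > 0:  # skip categories with no mass on either side
            total += (a - b) ** 2 / (a + b)
    return math.sqrt(total)


events = [("A", 1), ("A", 1), ("B", 1), ("B", 1)]  # each event weighs 1
intervals = [("A", 3.0), ("B", 1.0)]               # weight = end - start
print(chi2_distance(events, intervals))
print(chi2_distance([], []), chi2_distance(events, []))
```

Under these assumptions the two empty-sequence values listed above fall out of the formula directly, without special-casing.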

Example:

chi2 = Chi2SequenceMetric(entity_feature="status")
d    = chi2(seq_a, seq_b)
dm   = chi2.compute_matrix(pool)
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of Chi2Settings

__init__(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
prepare_batch_data(pool: SequencePool) tuple[source]#

Build histogram arrays for all sequences in pool.

Returns:

(hists, n_cats) where hists is a float32 numpy array of shape (n, n_cats) containing raw (unnormalised) weights, with rows ordered to match pool.unique_ids.

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Resolve and validate the target feature.

If entity_feature was not specified, resolves to the first entity feature of seq_a and stores it in target_feature. Then checks that the feature is present and categorical in every provided sequence.

Parameters:
  • seq_a – Primary sequence.

  • seq_b – Optional second sequence.

Raises:
  • KeyError – If the feature is absent from a sequence.

  • TypeError – If the feature is not categorical.

class tanat.metric.Chi2Settings(*, entity_feature: str | None = None)[source]#

Bases: object

Settings for Chi2SequenceMetric.

Parameters:

entity_feature – Categorical feature name used as the histogram key (same semantics as HammingEntityMetric). None → resolved from the first entity feature of the sequence at validate_composition() time.

__init__(*args: Any, **kwargs: Any) None[source]#
entity_feature: str | None = None[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

class tanat.metric.DTWSequenceMetric(entity_metric: EntityMetric | str = 'hamming', window: int | None = None, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Dynamic Time Warping distance between two sequences.

Uses a space-optimised 2-row DP. The Sakoe-Chiba band is applied when window is set, limiting the warping path to stay within window cells of the diagonal.

Empty-sequence behaviour:

  • Both empty → nan (no alignment possible).

  • One empty → nan (no alignment possible).

When normalize=True, divides the raw DTW cost by len_a + len_b (an approximation that does not require path backtracking).
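The two-row recurrence, the band and the normalisation can be sketched in plain Python as follows. This is an illustration rather than the library's kernel: dtw_distance and hamming are hypothetical names, and widening the band to at least |len_a − len_b| (so that a path always exists) is an assumption, not documented behaviour.

```python
import math


def hamming(x, y):
    return 0.0 if x == y else 1.0


def dtw_distance(a, b, cost=hamming, window=None, normalize=False):
    """DTW cost using two rolling rows and an optional Sakoe-Chiba band."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return math.nan  # no alignment possible
    # Assumption: the band is widened so the end cell stays reachable.
    band = max(n, m) if window is None else max(window, abs(n - m))
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(max(1, i - band), min(m, i + band) + 1):
            best = min(prev[j], curr[j - 1], prev[j - 1])
            curr[j] = cost(a[i - 1], b[j - 1]) + best
        prev = curr
    total = prev[m]
    return total / (n + m) if normalize else total


print(dtw_distance("AAB", "AB"))                    # repeated "A" is warped away
print(dtw_distance("ABC", "ABD", window=1))         # banded computation
print(dtw_distance("ABC", "ABD", normalize=True))   # divided by len_a + len_b
```

Only the previous and current rows are kept, so memory grows with the length of one sequence instead of the product of both.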

Example:

dtw = DTWSequenceMetric(window=3, normalize=True)
d   = dtw(seq_a, seq_b)
dm  = dtw.compute_matrix(pool)
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of DTWSettings

__init__(entity_metric: EntityMetric | str = 'hamming', window: int | None = None, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.DTWSettings(*, entity_metric: EntityMetric = 'hamming', window: int | None = None, normalize: bool = False)[source]#

Bases: object

Settings for DTWSequenceMetric.

Parameters:
  • entity_metric – Entity-level distance metric. Default: "hamming".

  • window – Sakoe-Chiba band width (number of cells off the diagonal). None means no constraint (full DTW). Must be > 0 when set.

  • normalize – When True, divide the DTW cost by len_a + len_b (approximation that avoids O(n×m) backtracking). Default: False.

__init__(*args: Any, **kwargs: Any) None[source]#
entity_metric: EntityMetric = 'hamming'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

normalize: bool = False[source]#
window: int | None = None[source]#
class tanat.metric.DistanceMatrix(data: ndarray, ids: list)[source]#

Bases: object

Pairwise distance matrix with associated IDs.

Wrapper around a square numpy.ndarray associating each row/column with a sequence (or trajectory) identifier.

Example:

dm = DistanceMatrix(np.zeros((3, 3), dtype="float32"), ids=[1, 2, 3])
dm.to_frame()           # pandas DataFrame (default)
dm.to_frame("polars")   # polars DataFrame
dm.to_numpy()           # raw array
__init__(data: ndarray, ids: list) None[source]#

Create a DistanceMatrix.

Parameters:
  • data – Square numpy array of shape (n, n).

  • ids – List of n identifiers matching the matrix rows/columns. Stored as-is (order preserved from pool.unique_ids).

property data: ndarray[source]#

Raw numpy array (shape (n, n)).

classmethod empty(ids: list, dtype: str = 'float32') DistanceMatrix[source]#

Create a zero-initialised square matrix.

Parameters:
  • ids – List of identifiers.

  • dtype – Numpy dtype string (default "float32").

Returns:

A DistanceMatrix of shape (n, n) filled with zeros.

classmethod from_path(path: str | Path) DistanceMatrix[source]#

Load a previously computed distance matrix from disk.

Uses resolve_path to resolve the storage directory (workspace name or filesystem path), then opens the memmap in read-only mode.

Parameters:

path – Storage directory. Same formats as StorageOptions.store_path: plain name ("distances"), relative path ("./distances"), or absolute path.

Returns:

A DistanceMatrix backed by a read-only memmap.

Raises:
  • FileNotFoundError – If the directory or required files don’t exist.

  • ValueError – If progress.json status is not "complete" (incomplete computation).

property ids: list[source]#

Ordered list of identifiers (same order as rows/columns).

property is_memmap: bool[source]#

True if the underlying data is a memory-mapped file.

property shape: tuple[int, int][source]#

Shape of the underlying array.

to_frame(fmt: Literal['pandas', 'polars'] = 'pandas') → pandas.DataFrame | polars.DataFrame

Return a labelled dataframe with IDs as index/columns.

Parameters:

fmt – "pandas" (default) returns a pandas.DataFrame; "polars" returns a polars.DataFrame with an extra "id" column (Polars has no row index).

Returns:

Square dataframe of shape (n, n).

Raises:

ValueError – If fmt is not "pandas" or "polars".

to_numpy() ndarray[source]#

Return the underlying numpy array.

Returns:

Square array of shape (n, n).

class tanat.metric.EditSequenceMetric(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Needleman-Wunsch edit distance between two sequences.

Computes the minimum-cost alignment between two sequences using a full O(n × m) DP matrix (Needleman-Wunsch). Substitution cost comes from the entity metric; insertions and deletions cost indel_cost each.

When normalize=True, the raw distance is divided by max(len_a, len_b) so the result lies in [0, 1].

Empty-sequence behaviour:

  • Both empty → 0.0 (no edits needed).

  • One empty → n × indel_cost (all insertions/deletions).
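The alignment described above corresponds to the textbook Needleman-Wunsch recurrence, sketched here in plain Python. The names edit_distance and hamming are hypothetical and the code is an illustration, not the library's kernel.

```python
def hamming(x, y):
    return 0.0 if x == y else 1.0


def edit_distance(a, b, cost=hamming, indel_cost=1.0, normalize=False):
    """Minimum-cost alignment using a full (n + 1) x (m + 1) DP table."""
    n, m = len(a), len(b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost  # delete everything from `a`
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost  # insert everything from `b`
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost(a[i - 1], b[j - 1]),  # substitution
                dp[i - 1][j] + indel_cost,                    # deletion
                dp[i][j - 1] + indel_cost,                    # insertion
            )
    raw = dp[n][m]
    if normalize and max(n, m) > 0:
        return raw / max(n, m)
    return raw


print(edit_distance("kitten", "sitting"))
print(edit_distance("ABC", "", indel_cost=0.5))
print(edit_distance("kitten", "sitting", normalize=True))
```

In this sketch the normalised value stays within [0, 1] as long as indel_cost and every substitution cost are at most 1.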

Example:

edit = EditSequenceMetric(indel_cost=0.5, normalize=True)
d    = edit(seq_a, seq_b)
dm   = edit.compute_matrix(pool)
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of EditSettings

__init__(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.EditSettings(*, entity_metric: EntityMetric = 'hamming', indel_cost: float = 1.0, normalize: bool = False)[source]#

Bases: object

Settings for EditSequenceMetric.

Parameters:
  • entity_metric – Entity-level substitution cost metric. Default: "hamming".

  • indel_cost – Cost per insertion or deletion. Must be > 0. Default: 1.0.

  • normalize – When True, divide the raw edit distance by max(len_a, len_b) to obtain a value in [0, 1]. Default: False.

__init__(*args: Any, **kwargs: Any) None[source]#
entity_metric: EntityMetric = 'hamming'[source]#
indel_cost: float = 1.0[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

normalize: bool = False[source]#
class tanat.metric.EntityMetric(settings: Any = None)[source]#

Bases: SettingsMixin, Registrable, ABC

Abstract base for entity-level distance metrics.

Computes a scalar distance between two Entity objects.

IS_SYMMETRIC: bool = True[source]#

Set to True when dist(a, b) == dist(b, a) for all inputs. Subclasses that implement a directional distance must set this to False so that the full n² kernel is used instead.
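Why symmetry matters for cost can be seen in a short sketch: a symmetric distance only needs the upper triangle, which is then mirrored, while a directional one needs every cell. The function pairwise_matrix is hypothetical, and leaving the diagonal at 0.0 in the symmetric case assumes dist(x, x) == 0.

```python
def pairwise_matrix(items, dist, symmetric=True):
    """Return (matrix, number_of_dist_calls) for a list of items."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    calls = 0
    for i in range(n):
        # Symmetric: visit j > i only and mirror. Otherwise visit every j.
        for j in range(i + 1 if symmetric else 0, n):
            matrix[i][j] = dist(items[i], items[j])
            calls += 1
            if symmetric:
                matrix[j][i] = matrix[i][j]
    return matrix, calls


values = [1, 4, 6, 10]
_, sym_calls = pairwise_matrix(values, lambda a, b: abs(a - b))
directed, full_calls = pairwise_matrix(
    values, lambda a, b: max(b - a, 0), symmetric=False
)
print(sym_calls, full_calls)
```

For n items the symmetric path makes n·(n − 1)/2 calls and the full path n², which is the trade-off the flag controls.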

NUMBA_OPTIM: bool = False[source]#

Subclasses that provide prepare_batch_data / distance_kernel / prepare_cross_batch_data set this to True to opt into the Numba fast path.

abstractmethod validate_entity(ent_a: Entity, ent_b: Entity | None = None) None[source]#

Validate one or two entities against this metric’s requirements.

Called from __call__() and from validate_composition().

Implementations should call _validate_entity_instance() first for the type check, then add metric-specific checks.

Parameters:
  • ent_a – Primary entity.

  • ent_b – Optional second entity (None → probe single entity only).

Raises:
  • TypeError – Wrong argument type or incompatible feature dtype.

  • KeyError – Required feature absent from the entity.

class tanat.metric.HammingEntityMetric(entity_feature: str | None = None, cost: dict[tuple, float] | None = None, mismatch_cost: float = 1.0)[source]#

Bases: EntityMetric

Categorical Hamming distance between two entities.

Returns 0.0 when both entities share the same value for the configured feature, and mismatch_cost (default 1.0) when they differ. A custom cost dict enables partial costs.
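The lookup order can be sketched on raw feature values. The function hamming_cost is a hypothetical stand-in that works on plain values instead of Entity objects; checking both key orders follows the description of the cost setting further down this page.

```python
def hamming_cost(val_a, val_b, cost=None, mismatch_cost=1.0):
    """Categorical distance with an optional pairwise cost table."""
    if val_a == val_b:
        return 0.0
    if cost:
        # Key order does not matter: try (a, b), then (b, a).
        if (val_a, val_b) in cost:
            return cost[(val_a, val_b)]
        if (val_b, val_a) in cost:
            return cost[(val_b, val_a)]
    return mismatch_cost


table = {("A", "B"): 0.5}
print(hamming_cost("A", "A"))
print(hamming_cost("B", "A", cost=table, mismatch_cost=0.8))
print(hamming_cost("A", "C", cost=table, mismatch_cost=0.8))
```

Pairs missing from the table fall back to mismatch_cost, so a partial table is enough to express a few "cheaper" substitutions.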

Example:

hamming = HammingEntityMetric()
hamming(ent_a, ent_b)                           # 0.0 or 1.0

hamming = HammingEntityMetric(
    entity_feature="status",
    cost={("A", "B"): 0.5},
    mismatch_cost=0.8,
)
hamming(ent_a, ent_b)                           # looks up in cost dict
IS_SYMMETRIC: bool = True[source]#

Set to True when dist(a, b) == dist(b, a) for all inputs. Subclasses that implement a directional distance must set this to False so that the full n² kernel is used instead.

NUMBA_OPTIM: bool = True[source]#

Subclasses that provide prepare_batch_data / distance_kernel / prepare_cross_batch_data set this to True to opt into the Numba fast path.

SETTINGS_CLASS[source]#

alias of HammingSettings

__init__(entity_feature: str | None = None, cost: dict[tuple, float] | None = None, mismatch_cost: float = 1.0) None[source]#
property distance_kernel: Callable[source]#

Numba-compiled entity distance kernel (simple or weighted).

prepare_batch_data(pool: SequencePool) tuple[source]#

Extract and encode the categorical feature for Numba batch computation.

Returns:

(arrays, lengths, context)

prepare_cross_batch_data(pool_rows, pool_cols) tuple[source]#

Encode two pools with a shared vocabulary for cross-distance.

Returns:

(arrays_rows, lengths_rows, arrays_cols, lengths_cols, context)

validate_entity(ent_a: Entity, ent_b: Entity | None = None) None[source]#

Verify the configured feature exists and is categorical.

class tanat.metric.HammingSettings(*, entity_feature: str | None = None, cost: dict[tuple, float] | None = None, mismatch_cost: float = 1.0)[source]#

Bases: object

Settings for HammingEntityMetric.

Parameters:
  • entity_feature – Name of the categorical feature to compare. None → first entity feature from the pool/entity metadata.

  • cost – Pairwise cost lookup. Keys are (val_a, val_b) tuples; order does not matter (both (A, B) and (B, A) are checked). Conflicting entries are rejected at construction. Default: None (every mismatch uses mismatch_cost).

  • mismatch_cost – Default cost applied when the pair is not in cost and values differ (default: 1.0).

__init__(*args: Any, **kwargs: Any) None[source]#
cost: dict[tuple, float] | None = None[source]#
entity_feature: str | None = None[source]#
mismatch_cost: float = 1.0[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

classmethod validate_cost_symmetry(v)[source]#

Reject cost dicts with conflicting asymmetric entries.
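One way such a check could look is sketched below. The helper check_cost_symmetry is hypothetical, and raising ValueError is an assumption; the page does not state which exception the validator uses.

```python
def check_cost_symmetry(cost):
    """Return `cost` unchanged, or raise if (a, b) and (b, a) disagree."""
    for (a, b), value in cost.items():
        mirrored = cost.get((b, a))
        if mirrored is not None and mirrored != value:
            # Assumption: the exception type is not documented.
            raise ValueError(
                f"conflicting costs for {(a, b)}: {value} vs {mirrored}"
            )
    return cost


check_cost_symmetry({("A", "B"): 0.5, ("B", "A"): 0.5})  # consistent, accepted
try:
    check_cost_symmetry({("A", "B"): 0.5, ("B", "A"): 0.9})
except ValueError as exc:
    print("rejected:", exc)
```

Listing a pair in both orders is harmless as long as the two values agree; only a disagreement is an error.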

class tanat.metric.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Longest Common Prefix distance between two sequences.

Scans the sequences from the start and counts consecutive positions where the two entities are equal (i.e. their entity distance ≤ equality_threshold). The scan stops at the first mismatch.

Three output modes are available:

  • "length" → raw prefix length (not a proper distance).

  • "distance" → len_a + len_b − 2·LCP (always ≥ 0).

  • "normalized" → 1 − 2·LCP / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

  • Both empty → 0.0 (for all modes).

  • One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
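The prefix scan and the three modes can be written out as a short sketch. It is illustrative only; lcp_metric and hamming are hypothetical names, and the both-empty case is special-cased to 0.0 in normalized mode to avoid dividing by zero.

```python
def hamming(x, y):
    return 0.0 if x == y else 1.0


def lcp_metric(a, b, cost=hamming, equality_threshold=0.0, mode="distance"):
    """Longest-common-prefix value in one of three output modes."""
    lcp = 0
    for x, y in zip(a, b):
        if cost(x, y) > equality_threshold:
            break  # the scan stops at the first mismatch
        lcp += 1
    total = len(a) + len(b)
    if mode == "length":
        return float(lcp)
    if mode == "distance":
        return float(total - 2 * lcp)
    if mode == "normalized":
        return 0.0 if total == 0 else 1.0 - 2 * lcp / total
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("length", "distance", "normalized"):
    print(mode, lcp_metric("ABCD", "ABXD", mode=mode))
```

The trailing "D" matches in both sequences but is not counted, because the scan never resumes after the mismatch at position 3.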

Example:

lcp = LCPSequenceMetric(mode="normalized")
d   = lcp(seq_a, seq_b)
dm  = lcp.compute_matrix(pool)
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of LCPSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')[source]#

Bases: object

Settings for LCPSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Default: "hamming".

  • equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Must be ≥ 0. Default: 0.0.

  • mode

    Output mode.

    • "length" → raw LCP length (not a distance, can be > 1).

    • "distance" → additive distance: len_a + len_b − 2·LCP (default).

    • "normalized" → Jaccard-like distance: 1 − 2·LCP / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) None[source]#
entity_metric: EntityMetric = 'hamming'[source]#
equality_threshold: float = 0.0[source]#
mode: Literal['length', 'distance', 'normalized'] = 'distance'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

class tanat.metric.LCSSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Longest Common Subsequence distance between two sequences.

Computes the LCS length using a space-optimised DP (2-row rolling array). Two entities are considered equal when their entity distance ≤ equality_threshold.

Three output modes are available:

  • "length" → raw LCS length (not a proper distance).

  • "distance" → len_a + len_b − 2·LCS (always ≥ 0).

  • "normalized" → 1 − 2·LCS / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

  • Both empty → 0.0 (for all modes).

  • One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
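A plain-Python sketch of the two-row LCS recurrence and the three modes follows. The names lcs_metric and hamming are hypothetical, and the code illustrates the technique without claiming to match the library's kernel.

```python
def hamming(x, y):
    return 0.0 if x == y else 1.0


def lcs_metric(a, b, cost=hamming, equality_threshold=0.0, mode="distance"):
    """Longest-common-subsequence value using a 2-row rolling array."""
    m = len(b)
    prev = [0] * (m + 1)
    for x in a:
        curr = [0] * (m + 1)
        for j, y in enumerate(b, start=1):
            if cost(x, y) <= equality_threshold:
                curr[j] = prev[j - 1] + 1          # extend a common subsequence
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr
    lcs = prev[m]
    total = len(a) + len(b)
    if mode == "length":
        return float(lcs)
    if mode == "distance":
        return float(total - 2 * lcs)
    if mode == "normalized":
        return 0.0 if total == 0 else 1.0 - 2 * lcs / total
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("length", "distance", "normalized"):
    print(mode, lcs_metric("ABCBDAB", "BDCABA", mode=mode))
```

Unlike the prefix variant, matches do not have to be contiguous or start at position 1, which is why the DP table is needed.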

Example:

lcs = LCSSequenceMetric(mode="normalized", equality_threshold=0.1)
d   = lcs(seq_a, seq_b)
dm  = lcs.compute_matrix(pool)
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of LCSSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.LCSSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')[source]#

Bases: object

Settings for LCSSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Default: "hamming".

  • equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Default: 0.0.

  • mode

    Output mode.

    • "length" → raw LCS length (not a proper distance).

    • "distance" → additive distance: len_a + len_b − 2·LCS.

    • "normalized" → Jaccard-like: 1 − 2·LCS / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) None[source]#
entity_metric: EntityMetric = 'hamming'[source]#
equality_threshold: float = 0.0[source]#
mode: Literal['length', 'distance', 'normalized'] = 'distance'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

class tanat.metric.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Sequence metric by linear (position-wise) alignment of entities.

Aligns seq_a and seq_b rank-by-rank and applies the configured entity metric to each aligned pair. The resulting vector of entity distances is aggregated (mean, sum, …) to produce a single scalar sequence distance.

When sequences differ in length, padding_penalty is applied for each unmatched position of the longer sequence. If padding_penalty is None, only the overlapping prefix is used.

Empty-sequence behaviour:

  • Both empty → nan (distance is undefined).

  • One empty, padding_penalty is set → all positions are padded.

  • One empty, padding_penalty is None → nan with a warning suggesting to set padding_penalty.
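The alignment and padding rules can be sketched as follows. The function linear_pairwise is a hypothetical stand-in operating on plain Python sequences; it returns nan for an empty distance vector and omits the warning mentioned above.

```python
import math


def hamming(x, y):
    return 0.0 if x == y else 1.0


def linear_pairwise(a, b, cost=hamming, agg_fun="mean", padding_penalty=None):
    """Position-wise distance with optional padding for unmatched ranks."""
    distances = [cost(x, y) for x, y in zip(a, b)]  # overlapping prefix
    if padding_penalty is not None:
        # One penalty per unmatched position of the longer sequence.
        distances += [padding_penalty] * abs(len(a) - len(b))
    if not distances:
        return math.nan  # nothing to aggregate: distance is undefined
    if agg_fun == "mean":
        return sum(distances) / len(distances)
    if agg_fun == "sum":
        return sum(distances)
    raise ValueError(f"unsupported agg_fun: {agg_fun!r}")


print(linear_pairwise("ABC", "AB"))                       # overlap only
print(linear_pairwise("ABC", "AB", padding_penalty=1.0))  # third rank is padded
print(linear_pairwise("AB", "", padding_penalty=1.0))     # every rank is padded
```

Without a penalty, "ABC" and "AB" look identical because only their shared prefix is compared; setting padding_penalty makes the length difference count.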

Example:

hamming = HammingEntityMetric(entity_feature="status")
lp = LinearPairwiseSequenceMetric(entity_metric=hamming)

dist = lp(seq_a, seq_b)
dm   = lp.compute_matrix(pool)
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of LinearPairwiseSettings

__init__(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)[source]#

Bases: object

Settings for LinearPairwiseSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Accepts a registration name (string) or an instance. Pydantic auto-resolves strings via Registrable.__get_pydantic_core_schema__. Default: "hamming".

  • agg_fun – Aggregation function applied to the vector of entity distances. One of "mean" (default) or "sum".

  • padding_penalty – Distance value used for unmatched positions when sequences have different lengths. None → unmatched positions are ignored (only the overlap is aggregated). When the overlap is empty (one sequence has length 0), None makes the distance undefined (nan in a matrix, ValueError on a direct call).

__init__(*args: Any, **kwargs: Any) None[source]#
agg_fun: str = 'mean'[source]#
entity_metric: EntityMetric = 'hamming'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

padding_penalty: float | None = None[source]#
class tanat.metric.SequenceMetric(settings=None, storage: StorageOptions | dict | None = None)[source]#

Bases: SettingsMixin, Registrable, DisplayMixin, ABC

Abstract base for sequence-level distance metrics.

Computes a scalar distance between two Sequence objects and a full pairwise DistanceMatrix over a pool.

MEMMAP_SUPPORT: bool = False[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

__init__(settings=None, storage: StorageOptions | dict | None = None) None[source]#
compute_cross_matrix(pool_rows: SequencePool, pool_cols: SequencePool) ndarray[source]#

Compute an asymmetric (n × k) distance matrix between two pools.

Row i ↔ sequence i in pool_rows; column j ↔ sequence j in pool_cols. The result is not symmetric.

Validates both pools (type-check + composition probe), then delegates to _compute_cross_matrix_impl(). Subclasses override _compute_cross_matrix_impl() to use Numba kernels when available.

Parameters:
  • pool_rows – Pool whose sequences form the rows (n items).

  • pool_cols – Pool whose sequences form the columns (k items).

Returns:

float32 numpy array of shape (n, k).
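The row/column convention can be shown with a few lines of plain Python. The function cross_matrix is hypothetical and returns nested lists instead of a float32 numpy array; pools are modelled as plain lists of sequences.

```python
def cross_matrix(rows, cols, dist):
    """Return an n x k table where cell (i, j) is dist(rows[i], cols[j])."""
    return [[float(dist(r, c)) for c in cols] for r in rows]


def mismatch_count(a, b):
    return sum(x != y for x, y in zip(a, b))


pool_rows = ["ABC", "ABD", "XYZ"]   # n = 3
pool_cols = ["ABC", "XYD"]          # k = 2
table = cross_matrix(pool_rows, pool_cols, mismatch_count)
print(len(table), len(table[0]))
print(table)
```

Because the two pools differ, cell (i, j) and cell (j, i) refer to unrelated pairs, so no mirroring shortcut applies.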

compute_matrix(pool: SequencePool, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') DistanceMatrix[source]#

Compute the full pairwise distance matrix for pool.

Parameters:
  • pool – A SequencePool.

  • store_path – Storage directory (None → in-memory).

  • chunk_size – Rows per flush chunk (default 500).

  • resume – Skip already-computed chunks (default True).

  • dtype – Numpy dtype for the matrix (default "float32").

Returns:

A DistanceMatrix of shape (n, n).

property entity_metric: EntityMetric[source]#

Resolve settings.entity_metric (string → instance or pass-through).

Returns:

The resolved EntityMetric.

Raises:

AttributeError – If the concrete settings class has no entity_metric field.

abstractmethod validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Composition compatibility check between this metric and the given sequence(s).

Called from __call__() with both sequences, and from compute_matrix() with a single sample sequence (seq_b=None). Subclasses that compose with an EntityMetric probe a sample entity to surface compatibility errors early.

Parameters:
  • seq_a – Primary sequence to probe.

  • seq_b – Optional second sequence.

Raises:
  • TypeError – If the entity feature has an incompatible dtype.

  • KeyError – If a required feature is absent.

class tanat.metric.SoftDTWSequenceMetric(entity_metric: EntityMetric | str = 'hamming', gamma: float = 1.0, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Soft Dynamic Time Warping distance between two sequences.

Replaces the min operator in the DTW recurrence with a differentiable soft-minimum parameterised by gamma:

\[\text{soft-min}(a, b, c; \gamma) = -\gamma \log\bigl( e^{-a/\gamma} + e^{-b/\gamma} + e^{-c/\gamma}\bigr)\]

As gamma → 0, SoftDTW converges to standard DTW. As gamma → ∞, it approaches the mean of all alignment costs.

Empty-sequence behaviour:

  • Both empty → nan (no alignment possible).

  • One empty → nan (no alignment possible).
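The soft-minimum and the resulting recurrence can be sketched in plain Python. This is an illustration, not the library's kernel: soft_min and soft_dtw are hypothetical names, the soft-minimum is computed with the usual log-sum-exp shift for numerical stability, and a full DP table is used for readability.

```python
import math


def hamming(x, y):
    return 0.0 if x == y else 1.0


def soft_min(values, gamma):
    """-gamma * log(sum(exp(-v / gamma))), shifted by min(values)."""
    lowest = min(values)
    if math.isinf(lowest):
        return lowest
    shifted = sum(math.exp(-(v - lowest) / gamma) for v in values)
    return lowest - gamma * math.log(shifted)


def soft_dtw(a, b, cost=hamming, gamma=1.0):
    """Soft-DTW value for two sequences (nan if either is empty)."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return math.nan
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            neighbours = (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1])
            r[i][j] = cost(a[i - 1], b[j - 1]) + soft_min(neighbours, gamma)
    return r[n][m]


print(soft_min([1.0, 2.0, 3.0], gamma=1e-3))  # close to the hard minimum
print(soft_dtw("ABC", "ABD", gamma=1e-3))     # close to the DTW cost
print(soft_dtw("ABC", "ABD", gamma=1.0))      # smoothed value
```

Since the soft-minimum never exceeds the hard minimum, the smoothed value is at most the DTW cost and can be negative for large gamma.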

References

Cuturi & Blondel (2017) — Soft-DTW: a Differentiable Loss Function for Time-Series, ICML.

Example:

sdtw = SoftDTWSequenceMetric(gamma=0.1)
d    = sdtw(seq_a, seq_b)
dm   = sdtw.compute_matrix(pool)
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of SoftDTWSettings

__init__(entity_metric: EntityMetric | str = 'hamming', gamma: float = 1.0, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.SoftDTWSettings(*, entity_metric: EntityMetric = 'hamming', gamma: float = 1.0)[source]#

Bases: object

Settings for SoftDTWSequenceMetric.

Parameters:
  • entity_metric – Entity-level distance metric. Default: "hamming".

  • gamma – Regularisation parameter for the soft-min operator. Must be > 0. Large values produce a smoother (mean-like) approximation; small values approach standard DTW. Default: 1.0.

__init__(*args: Any, **kwargs: Any) None[source]#
entity_metric: EntityMetric = 'hamming'[source]#
gamma: float = 1.0[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

class tanat.metric.StorageOptions(*, store_path: str | Path, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: object

Disk-backed storage options for distance matrix computation.

Parameters:
  • store_path

    Directory where the matrix file and metadata are stored (required). Accepts the same formats as resolve_path:

    • a plain name (e.g. "distances") - resolved via workspace store,

    • a relative or absolute path (e.g. "./distances", Path(...)).

  • chunk_size – Number of matrix rows computed per chunk before flushing to disk. Larger = fewer I/O ops, smaller = finer resume granularity. Default: 500.

  • resume – If True (default), skip chunks already computed. If False, delete and recompute from scratch.

  • dtype – Numpy dtype string for the matrix. Default: "float32".
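How chunk_size and resume interact can be pictured with a small sketch. Everything in it is hypothetical: chunk_bounds, run_chunks and the in-memory completed set stand in for whatever progress tracking the library keeps on disk.

```python
def chunk_bounds(n_rows, chunk_size=500):
    """Split row indices 0..n_rows into (start, stop) chunks."""
    return [
        (start, min(start + chunk_size, n_rows))
        for start in range(0, n_rows, chunk_size)
    ]


def run_chunks(n_rows, chunk_size, completed, compute_chunk, resume=True):
    """Compute missing chunks; with resume=False start from scratch."""
    if not resume:
        completed.clear()  # stand-in for deleting the stored results
    for index, (start, stop) in enumerate(chunk_bounds(n_rows, chunk_size)):
        if index in completed:
            continue  # already flushed in an earlier run
        compute_chunk(start, stop)
        completed.add(index)
    return completed


done = {0}  # pretend the first chunk survived an interrupted run
seen = []
run_chunks(1200, 500, done, lambda start, stop: seen.append((start, stop)))
print(seen)
```

A smaller chunk_size means less work is lost when a run is interrupted, at the price of more frequent flushes.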

__init__(*args: Any, **kwargs: Any) None[source]#
chunk_size: int = 500[source]#
dtype: str = 'float32'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

resume: bool = True[source]#
store_path: str | Path[source]#
class tanat.metric.TrajectoryMetric(settings=None, storage: StorageOptions | dict | None = None)[source]#

Bases: SettingsMixin, Registrable, DisplayMixin, ABC

Abstract base for trajectory-level distance metrics.

Computes a scalar distance between two Trajectory objects and a full pairwise DistanceMatrix over a TrajectoryPool.

MEMMAP_SUPPORT: bool = False[source]#

Set to True in subclasses that implement disk-backed (memmap) computation.

__init__(settings=None, storage: StorageOptions | dict | None = None) None[source]#
compute_cross_matrix(pool_rows: TrajectoryPool, pool_cols: TrajectoryPool) ndarray[source]#

Compute an asymmetric (n × k) distance matrix between two pools.

Row i ↔ trajectory i in pool_rows; column j ↔ trajectory j in pool_cols.

Validates both pools, then delegates to _compute_cross_matrix_impl(). Subclasses override _compute_cross_matrix_impl() to use optimised kernels.

Parameters:
  • pool_rows – Pool whose trajectories form the rows (n items).

  • pool_cols – Pool whose trajectories form the columns (k items).

Returns:

float32 numpy array of shape (n, k).

compute_matrix(pool: TrajectoryPool, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') DistanceMatrix[source]#

Compute full pairwise trajectory distance matrix.

Storage kwargs are forwarded to StorageOptions.

Parameters:
  • pool – A TrajectoryPool.

  • store_path – Storage directory (None → in-memory).

  • chunk_size – Rows per flush chunk (default 500).

  • resume – Skip already-computed chunks (default True).

  • dtype – Numpy dtype for the matrix (default "float32").

Returns:

A DistanceMatrix.