tanat.metric package#

Subpackages#

Submodules#

tanat.metric.matrix module#

DistanceMatrix: thin numpy wrapper with associated IDs.

class tanat.metric.matrix.DistanceMatrix(data: ndarray, ids: list)[source]#

Bases: object

Pairwise distance matrix with associated IDs.

Wrapper around a square numpy.ndarray associating each row/column with a sequence (or trajectory) identifier.

Example:

dm = DistanceMatrix(np.zeros((3, 3), dtype="float32"), ids=[1, 2, 3])
dm.to_frame()           # pandas DataFrame (default)
dm.to_frame("polars")   # polars DataFrame
dm.to_numpy()           # raw array

__init__(data: ndarray, ids: list) → None[source]#

Create a DistanceMatrix.

Parameters:

data – Square numpy array of shape (n, n).
ids – List of n identifiers matching the matrix rows/columns. Stored as-is (order preserved from pool.unique_ids).

property data: ndarray[source]#: Raw numpy array (shape (n, n)).

classmethod empty(ids: list, dtype: str = 'float32') → DistanceMatrix[source]#

Create a zero-initialised square matrix.

Parameters:

ids – List of identifiers.
dtype – Numpy dtype string (default "float32").

Returns:

A DistanceMatrix of shape (n, n) filled with zeros.

classmethod from_path(path: str | Path) → DistanceMatrix[source]#

Load a previously computed distance matrix from disk.

Uses resolve_path to resolve the storage directory (workspace name or filesystem path), then opens the memmap in read-only mode.

Parameters:

path – Storage directory. Same formats as StorageOptions.store_path: plain name ("distances"), relative path ("./distances"), or absolute path.

Returns:

A DistanceMatrix backed by a read-only memmap.

Raises:

FileNotFoundError – If the directory or required files don’t exist.
ValueError – If progress.json status is not "complete" (incomplete computation).

property ids: list[source]#: Ordered list of identifiers (same order as rows/columns).

property is_memmap: bool[source]#: True if the underlying data is a memory-mapped file.

property shape: tuple[int, int][source]#: Shape of the underlying array.

to_frame(fmt: Literal['pandas', 'polars'] = 'pandas') → pandas.DataFrame | polars.DataFrame[source]#

Return a labelled dataframe with IDs as index/columns.

Parameters:

fmt – "pandas" (default) returns a pandas.DataFrame; "polars" returns a polars.DataFrame with an extra "id" column (Polars has no row index).

Returns:

Square dataframe of shape (n, n).

Raises:

ValueError – If fmt is not "pandas" or "polars".

to_numpy() → ndarray[source]#

Return the underlying numpy array.

Returns:

Square array of shape (n, n).

Module contents#

Metric Module.

class tanat.metric.AggregationSettings(*, sequence_metrics: dict[str, SequenceMetric] | None = None, agg_fun: str = 'mean', weights: dict[str, float] | None = None)[source]#

Bases: object

Settings for AggregationTrajectoryMetric.

agg_fun and weights are orthogonal: "mean" + weights computes a weighted mean, "sum" + weights a weighted sum. Aliases absent from weights default to 1.0.

__init__(*args: Any, **kwargs: Any) → None[source]#

agg_fun: str = 'mean'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

sequence_metrics: dict[str, SequenceMetric] | None = None[source]#

weights: dict[str, float] | None = None[source]#

class tanat.metric.AggregationTrajectoryMetric(sequence_metrics: dict[str, SequenceMetric | str] | None = None, static_metric: StaticMetric | Callable | None = None, static_metric_weight: float = 1.0, agg_fun: str = 'mean', weights: dict[str, float] | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: TrajectoryMetric

Trajectory distance computed from per-alias sequence distances, followed by weighted aggregation.

For each store alias visible on both trajectories, computes the sequence-level distance using the configured SequenceMetric. The resulting per-alias distances are aggregated (weighted mean/sum) into a scalar trajectory distance.

Example:

lp = LinearPairwiseSequenceMetric(entity_metric="hamming")

agg = AggregationTrajectoryMetric(
    sequence_metrics={"events": lp, "states": lp},
    agg_fun="mean",
    weights={"events": 1.0, "states": 0.5},
)

dist = agg(traj_a, traj_b)
dm   = agg.compute_matrix(traj_pool)
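
The aggregation step can be sketched without any tanat objects. The helper below is a hypothetical illustration, not the library's implementation: `aggregate_distances` is an invented name, the per-alias distances are passed in as a plain dict, and the weighted mean is assumed to be the weighted sum divided by the total weight.

```python
def aggregate_distances(per_alias, agg_fun="mean", weights=None):
    """Combine per-alias distances into one scalar (illustrative helper)."""
    if not per_alias:
        raise ValueError("no per-alias distances to aggregate")
    weights = weights or {}
    # Aliases absent from `weights` default to 1.0, as documented above.
    w = {alias: float(weights.get(alias, 1.0)) for alias in per_alias}
    weighted_sum = sum(w[alias] * d for alias, d in per_alias.items())
    if agg_fun == "sum":
        return weighted_sum
    if agg_fun == "mean":
        # Assumption: weighted mean = weighted sum / total weight.
        return weighted_sum / sum(w.values())
    raise ValueError(f"unsupported agg_fun: {agg_fun!r}")


per_alias = {"events": 0.4, "states": 0.8}
weights = {"events": 1.0, "states": 0.5}
weighted_mean = aggregate_distances(per_alias, "mean", weights)
weighted_total = aggregate_distances(per_alias, "sum", weights)
```

Under these assumptions the same weights feed both aggregation functions, which is what makes agg_fun and weights orthogonal.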

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation.

SETTINGS_CLASS[source]#: alias of AggregationSettings

__init__(sequence_metrics: dict[str, SequenceMetric | str] | None = None, static_metric: StaticMetric | Callable | None = None, static_metric_weight: float = 1.0, agg_fun: str = 'mean', weights: dict[str, float] | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

class tanat.metric.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Chi-squared distance between the state-time distributions of two sequences.

Rather than comparing sequences element-by-element, Chi² compares the proportion of time (or event count) spent in each categorical state.

Event sequences: each event contributes a weight of 1.
Interval / State sequences: each entity contributes end − start as its weight.

The per-category proportions are computed independently for each sequence, then compared via the Chi-squared distance formula:

\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]

Note

Chi² does not use an entity metric; entity_metric is absent from its settings. The validate_composition method checks only that the requested feature is present.

Empty-sequence behaviour:

Both empty → 0.0 (identical empty distributions).
One empty → 1.0 (maximally different distributions).

Example:

chi2 = Chi2SequenceMetric(entity_feature="status")
d    = chi2(seq_a, seq_b)
dm   = chi2.compute_matrix(pool)
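
As a rough illustration of the formula above, the sketch below computes the Chi-squared distance from two plain dicts of per-state weights (event counts or durations). `chi2_distance` is a hypothetical helper, not part of tanat; the empty-sequence return values simply follow the conventions listed above, and categories with zero mass in both sequences are assumed to contribute nothing.

```python
import math


def chi2_distance(weights_a, weights_b):
    """Chi-squared distance between two state-weight histograms."""
    total_a = sum(weights_a.values())
    total_b = sum(weights_b.values())
    # Empty-sequence conventions taken from the documentation above.
    if total_a == 0 and total_b == 0:
        return 0.0
    if total_a == 0 or total_b == 0:
        return 1.0
    acc = 0.0
    for state in set(weights_a) | set(weights_b):
        p_a = weights_a.get(state, 0.0) / total_a
        p_b = weights_b.get(state, 0.0) / total_b
        denom = p_a + p_b
        if denom > 0:  # skip categories absent from both sequences
            acc += (p_a - p_b) ** 2 / denom
    return math.sqrt(acc)


# Durations spent in each state by two hypothetical sequences.
d = chi2_distance({"A": 3.0, "B": 1.0}, {"A": 1.0, "B": 3.0})
```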

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of Chi2Settings

__init__(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

prepare_batch_data(pool: SequencePool) → tuple[source]#

Build histogram arrays for all sequences in pool.

Returns:

(hists, n_cats) where hists is a float32 numpy array of shape (n, n_cats) containing raw (unnormalised) weights, with rows ordered to match pool.unique_ids.

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#

Resolve and validate the target feature.

If entity_feature was not specified, resolves to the first entity feature of seq_a and stores it in target_feature. Then checks that the feature is present and categorical in every provided sequence.

Parameters:

seq_a – Primary sequence.
seq_b – Optional second sequence.

Raises:

KeyError – If the feature is absent from a sequence.
TypeError – If the feature is not categorical.

class tanat.metric.Chi2Settings(*, entity_feature: str | None = None)[source]#

Bases: object

Settings for Chi2SequenceMetric.

Parameters:

entity_feature – Categorical feature name used as the histogram key (same semantics as HammingEntityMetric). None → resolved from the first entity feature of the sequence at validate_composition() time.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_feature: str | None = None[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

class tanat.metric.DTWSequenceMetric(entity_metric: EntityMetric | str = 'hamming', window: int | None = None, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Dynamic Time Warping distance between two sequences.

Uses a space-optimised 2-row DP. The Sakoe-Chiba band is applied when window is set, limiting the warping path to stay within window cells of the diagonal.

Empty-sequence behaviour:

Both empty → nan (no alignment possible).
One empty → nan (no alignment possible).

When normalize=True, divides the raw DTW cost by len_a + len_b (an approximation that does not require path backtracking).

Example:

dtw = DTWSequenceMetric(window=3, normalize=True)
d   = dtw(seq_a, seq_b)
dm  = dtw.compute_matrix(pool)
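
The two-row recurrence with a Sakoe-Chiba band can be sketched in plain Python. This is a minimal illustration under stated assumptions, not tanat's kernel: `dtw_distance` and `hamming` are hypothetical helpers, sequences are plain Python sequences, and the band is widened to at least |len_a − len_b| so that a warping path always exists.

```python
import math


def hamming(a, b):
    """Stand-in entity cost: 0.0 when equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def dtw_distance(seq_a, seq_b, cost, window=None, normalize=False):
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        return math.nan  # no alignment possible
    # Assumption: widen the band so the corner cell stays reachable.
    band = max(n, m) if window is None else max(window, abs(n - m))
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(max(1, i - band), min(m, i + band) + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            curr[j] = step + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr  # only two rows are kept in memory
    total = prev[m]
    return total / (n + m) if normalize else total


raw = dtw_distance("ABC", "ABD", hamming, window=1)
scaled = dtw_distance("ABC", "ABD", hamming, normalize=True)
```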

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of DTWSettings

__init__(entity_metric: EntityMetric | str = 'hamming', window: int | None = None, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.DTWSettings(*, entity_metric: EntityMetric = 'hamming', window: int | None = None, normalize: bool = False)[source]#

Bases: object

Settings for DTWSequenceMetric.

Parameters:

entity_metric – Entity-level distance metric. Default: "hamming".
window – Sakoe-Chiba band width (number of cells off the diagonal). None means no constraint (full DTW). Must be > 0 when set.
normalize – When True, divide the DTW cost by len_a + len_b (approximation that avoids O(n×m) backtracking). Default: False.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_metric: EntityMetric = 'hamming'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

normalize: bool = False[source]#

window: int | None = None[source]#

class tanat.metric.DistanceMatrix(data: ndarray, ids: list)[source]#

Bases: object

Pairwise distance matrix with associated IDs.

Wrapper around a square numpy.ndarray associating each row/column with a sequence (or trajectory) identifier.

Example:

dm = DistanceMatrix(np.zeros((3, 3), dtype="float32"), ids=[1, 2, 3])
dm.to_frame()           # pandas DataFrame (default)
dm.to_frame("polars")   # polars DataFrame
dm.to_numpy()           # raw array

__init__(data: ndarray, ids: list) → None[source]#

Create a DistanceMatrix.

Parameters:

data – Square numpy array of shape (n, n).
ids – List of n identifiers matching the matrix rows/columns. Stored as-is (order preserved from pool.unique_ids).

property data: ndarray[source]#: Raw numpy array (shape (n, n)).

classmethod empty(ids: list, dtype: str = 'float32') → DistanceMatrix[source]#

Create a zero-initialised square matrix.

Parameters:

ids – List of identifiers.
dtype – Numpy dtype string (default "float32").

Returns:

A DistanceMatrix of shape (n, n) filled with zeros.

classmethod from_path(path: str | Path) → DistanceMatrix[source]#

Load a previously computed distance matrix from disk.

Uses resolve_path to resolve the storage directory (workspace name or filesystem path), then opens the memmap in read-only mode.

Parameters:

path – Storage directory. Same formats as StorageOptions.store_path: plain name ("distances"), relative path ("./distances"), or absolute path.

Returns:

A DistanceMatrix backed by a read-only memmap.

Raises:

FileNotFoundError – If the directory or required files don’t exist.
ValueError – If progress.json status is not "complete" (incomplete computation).

property ids: list[source]#: Ordered list of identifiers (same order as rows/columns).

property is_memmap: bool[source]#: True if the underlying data is a memory-mapped file.

property shape: tuple[int, int][source]#: Shape of the underlying array.

to_frame(fmt: Literal['pandas', 'polars'] = 'pandas') → pandas.DataFrame | polars.DataFrame[source]#

Return a labelled dataframe with IDs as index/columns.

Parameters:

fmt – "pandas" (default) returns a pandas.DataFrame; "polars" returns a polars.DataFrame with an extra "id" column (Polars has no row index).

Returns:

Square dataframe of shape (n, n).

Raises:

ValueError – If fmt is not "pandas" or "polars".

to_numpy() → ndarray[source]#

Return the underlying numpy array.

Returns:

Square array of shape (n, n).

class tanat.metric.EditSequenceMetric(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Needleman-Wunsch edit distance between two sequences.

Computes the minimum-cost alignment between two sequences using a full O(n × m) DP matrix (Needleman-Wunsch). Substitution cost comes from the entity metric; insertions and deletions cost indel_cost each.

When normalize=True, the raw distance is divided by max(len_a, len_b) so the result lies in [0, 1].

Empty-sequence behaviour:

Both empty → 0.0 (no edits needed).
One empty → n × indel_cost (all insertions/deletions).

Example:

edit = EditSequenceMetric(indel_cost=0.5, normalize=True)
d    = edit(seq_a, seq_b)
dm   = edit.compute_matrix(pool)
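
The full-matrix recurrence described above can be illustrated as follows. This is a sketch, assuming plain Python sequences and a caller-supplied substitution cost; `edit_distance` and `hamming` are hypothetical helpers rather than tanat functions.

```python
def hamming(a, b):
    """Stand-in substitution cost: 0.0 when equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def edit_distance(seq_a, seq_b, cost, indel_cost=1.0, normalize=False):
    n, m = len(seq_a), len(seq_b)
    if n == 0 and m == 0:
        return 0.0  # no edits needed
    # dp[i][j] = cheapest alignment of seq_a[:i] with seq_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost  # delete everything
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost  # insert everything
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + cost(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + indel_cost,
                dp[i][j - 1] + indel_cost,
            )
    raw = dp[n][m]
    return raw / max(n, m) if normalize else raw


raw = edit_distance("kitten", "sitting", hamming)
scaled = edit_distance("ABC", "ABD", hamming, normalize=True)
```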

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of EditSettings

__init__(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.EditSettings(*, entity_metric: EntityMetric = 'hamming', indel_cost: float = 1.0, normalize: bool = False)[source]#

Bases: object

Settings for EditSequenceMetric.

Parameters:

entity_metric – Entity-level substitution cost metric. Default: "hamming".
indel_cost – Cost per insertion or deletion. Must be > 0. Default: 1.0.
normalize – When True, divide the raw edit distance by max(len_a, len_b) to obtain a value in [0, 1]. Default: False.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_metric: EntityMetric = 'hamming'[source]#

indel_cost: float = 1.0[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

normalize: bool = False[source]#

class tanat.metric.EntityMetric(settings: Any = None)[source]#

Bases: SettingsMixin, Registrable, ABC

Abstract base for entity-level distance metrics.

Computes a scalar distance between two Entity objects.

IS_SYMMETRIC: bool = True[source]#: Set to True when dist(a, b) == dist(b, a) for all inputs. Subclasses that implement a directional distance must set this to False so that the full n² kernel is used instead.

NUMBA_OPTIM: bool = False[source]#: Subclasses that provide prepare_batch_data / distance_kernel / prepare_cross_batch_data set this to True to opt into the Numba fast path.

abstractmethod validate_entity(ent_a: Entity, ent_b: Entity | None = None) → None[source]#

Validate one or two entities against this metric’s requirements.

Called from __call__() and from validate_composition().

Implementations should call _validate_entity_instance() first for the type check, then add metric-specific checks.

Parameters:

ent_a – Primary entity.
ent_b – Optional second entity (None → probe single entity only).

Raises:

TypeError – Wrong argument type or incompatible feature dtype.
KeyError – Required feature absent from the entity.

class tanat.metric.HammingEntityMetric(entity_feature: str | None = None, cost: dict[tuple, float] | None = None, mismatch_cost: float = 1.0)[source]#

Bases: EntityMetric

Categorical Hamming distance between two entities.

Returns 0.0 when both entities share the same value for the configured feature, and mismatch_cost (default 1.0) when they differ. A custom cost dict enables partial costs.

Example:

hamming = HammingEntityMetric()
hamming(ent_a, ent_b)                           # 0.0 or 1.0

hamming = HammingEntityMetric(
    entity_feature="status",
    cost={("A", "B"): 0.5},
    mismatch_cost=0.8,
)
hamming(ent_a, ent_b)                           # looks up in cost dict
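
The lookup logic can be sketched on raw feature values. `hamming_cost` is a hypothetical stand-alone function, not the library's kernel; it assumes the cost dict is checked in both key orders and that mismatch_cost is the fallback for unlisted pairs.

```python
def hamming_cost(val_a, val_b, cost=None, mismatch_cost=1.0):
    """Categorical distance with an optional pairwise cost table."""
    if val_a == val_b:
        return 0.0
    cost = cost or {}
    # Key order is assumed not to matter: try (a, b), then (b, a).
    if (val_a, val_b) in cost:
        return cost[(val_a, val_b)]
    if (val_b, val_a) in cost:
        return cost[(val_b, val_a)]
    return mismatch_cost


table = {("A", "B"): 0.5}
listed = hamming_cost("B", "A", cost=table, mismatch_cost=0.8)
unlisted = hamming_cost("A", "C", cost=table, mismatch_cost=0.8)
```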

IS_SYMMETRIC: bool = True[source]#: Set to True when dist(a, b) == dist(b, a) for all inputs. Subclasses that implement a directional distance must set this to False so that the full n² kernel is used instead.

NUMBA_OPTIM: bool = True[source]#: Subclasses that provide prepare_batch_data / distance_kernel / prepare_cross_batch_data set this to True to opt into the Numba fast path.

SETTINGS_CLASS[source]#: alias of HammingSettings

__init__(entity_feature: str | None = None, cost: dict[tuple, float] | None = None, mismatch_cost: float = 1.0) → None[source]#

property distance_kernel: Callable[source]#: Numba-compiled entity distance kernel (simple or weighted).

prepare_batch_data(pool: SequencePool) → tuple[source]#

Extract and encode the categorical feature for Numba batch computation.

Returns:

(arrays, lengths, context)

prepare_cross_batch_data(pool_rows, pool_cols) → tuple[source]#

Encode two pools with a shared vocabulary for cross-distance.

Returns:

(arrays_rows, lengths_rows, arrays_cols, lengths_cols, context)

validate_entity(ent_a: Entity, ent_b: Entity | None = None) → None[source]#: Verify the configured feature exists and is categorical.

class tanat.metric.HammingSettings(*, entity_feature: str | None = None, cost: dict[tuple, float] | None = None, mismatch_cost: float = 1.0)[source]#

Bases: EntityMetricSettings

Settings for HammingEntityMetric.

Parameters:

entity_feature – Name of the categorical feature to compare. None → the first categorical entity feature from the pool/entity metadata is used.
cost – Pairwise cost lookup. Keys are (val_a, val_b) tuples; order does not matter (both (A, B) and (B, A) are checked). Conflicting entries are rejected at construction. Default: None (every mismatch uses mismatch_cost).
mismatch_cost – Default cost applied when the pair is not in cost and values differ (default: 1.0).

__init__(*args: Any, **kwargs: Any) → None[source]#

cost: dict[tuple, float] | None = None[source]#

entity_feature: str | None = None[source]#

mismatch_cost: float = 1.0[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

classmethod validate_cost_symmetry(v)[source]#: Reject cost dicts with conflicting asymmetric entries.

class tanat.metric.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Longest Common Prefix distance between two sequences.

Scans the sequences from the start and counts consecutive positions where the two entities are equal (i.e. their entity distance ≤ equality_threshold). The scan stops at the first mismatch.

Three output modes are available:

"length" → raw prefix length (not a proper distance).
"distance" → len_a + len_b − 2·LCP (always ≥ 0).
"normalized" → 1 − 2·LCP / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

Both empty → 0.0 (for all modes).
One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.

Example:

lcp = LCPSequenceMetric(mode="normalized")
d   = lcp(seq_a, seq_b)
dm  = lcp.compute_matrix(pool)
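
The prefix scan and the three output modes can be illustrated with a short sketch. `lcp_metric` and `hamming` are hypothetical helpers operating on plain Python sequences; they are not tanat's implementation.

```python
def hamming(a, b):
    """Stand-in entity cost: 0.0 when equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def lcp_metric(seq_a, seq_b, cost, equality_threshold=0.0, mode="distance"):
    n, m = len(seq_a), len(seq_b)
    if n == 0 and m == 0:
        return 0.0  # both empty: 0.0 for all modes
    prefix = 0
    for a, b in zip(seq_a, seq_b):
        if cost(a, b) > equality_threshold:
            break  # the scan stops at the first mismatch
        prefix += 1
    if mode == "length":
        return float(prefix)
    if mode == "distance":
        return float(n + m - 2 * prefix)
    if mode == "normalized":
        return 1.0 - 2.0 * prefix / (n + m)
    raise ValueError(f"unknown mode: {mode!r}")


length = lcp_metric("ABCD", "ABXY", hamming, mode="length")
distance = lcp_metric("ABCD", "ABXY", hamming, mode="distance")
normalized = lcp_metric("ABCD", "ABXY", hamming, mode="normalized")
```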

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of LCPSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')[source]#

Bases: object

Settings for LCPSequenceMetric.

Parameters:

entity_metric – Entity-level metric. Default: "hamming".
equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Must be ≥ 0. Default: 0.0.
mode –
Output mode.
- "length" → raw LCP length (not a distance, can be > 1).
- "distance" → additive distance: len_a + len_b − 2·LCP (default).
- "normalized" → Jaccard-like distance: 1 − 2·LCP / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_metric: EntityMetric = 'hamming'[source]#

equality_threshold: float = 0.0[source]#

mode: Literal['length', 'distance', 'normalized'] = 'distance'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

class tanat.metric.LCSSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Longest Common Subsequence distance between two sequences.

Computes the LCS length using a space-optimised DP (2-row rolling array). Two entities are considered equal when their entity distance ≤ equality_threshold.

Three output modes are available:

"length" → raw LCS length (not a proper distance).
"distance" → len_a + len_b − 2·LCS (always ≥ 0).
"normalized" → 1 − 2·LCS / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

Both empty → 0.0 (for all modes).
One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.

Example:

lcs = LCSSequenceMetric(mode="normalized", equality_threshold=0.1)
d   = lcs(seq_a, seq_b)
dm  = lcs.compute_matrix(pool)
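
A two-row rolling DP for the LCS length, plus the mode conversion, might look like the sketch below. `lcs_metric` and `hamming` are hypothetical names used for illustration only, and sequences are assumed to be plain Python sequences.

```python
def hamming(a, b):
    """Stand-in entity cost: 0.0 when equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def lcs_metric(seq_a, seq_b, cost, equality_threshold=0.0, mode="distance"):
    n, m = len(seq_a), len(seq_b)
    if n == 0 and m == 0:
        return 0.0  # both empty: 0.0 for all modes
    prev = [0] * (m + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            if cost(a, b) <= equality_threshold:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr  # keep only the previous row
    common = prev[m]
    if mode == "length":
        return float(common)
    if mode == "distance":
        return float(n + m - 2 * common)
    if mode == "normalized":
        return 1.0 - 2.0 * common / (n + m)
    raise ValueError(f"unknown mode: {mode!r}")


length = lcs_metric("ABCBDAB", "BDCABA", hamming, mode="length")
distance = lcs_metric("ABCBDAB", "BDCABA", hamming, mode="distance")
```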

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of LCSSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.LCSSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')[source]#

Bases: object

Settings for LCSSequenceMetric.

Parameters:

entity_metric – Entity-level metric. Default: "hamming".
equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Default: 0.0.
mode –
Output mode.
- "length" → raw LCS length (not a proper distance).
- "distance" → additive distance: len_a + len_b − 2·LCS.
- "normalized" → Jaccard-like: 1 − 2·LCS / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_metric: EntityMetric = 'hamming'[source]#

equality_threshold: float = 0.0[source]#

mode: Literal['length', 'distance', 'normalized'] = 'distance'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

class tanat.metric.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Sequence metric by linear (position-wise) alignment of entities.

Aligns seq_a and seq_b rank-by-rank and applies the configured entity metric to each aligned pair. The resulting vector of entity distances is aggregated (mean, sum, …) to produce a single scalar sequence distance.

When sequences differ in length, padding_penalty is applied for each unmatched position of the longer sequence. If padding_penalty is None, only the overlapping prefix is used.

Empty-sequence behaviour:

Both empty → nan (distance is undefined).
One empty, padding_penalty is set → all positions are padded.
One empty, padding_penalty is None → nan with a warning suggesting to set padding_penalty.

Example:

hamming = HammingEntityMetric(entity_feature="status")
lp = LinearPairwiseSequenceMetric(entity_metric=hamming)

dist = lp(seq_a, seq_b)
dm   = lp.compute_matrix(pool)
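
The effect of padding_penalty is easiest to see on a toy example. The sketch below is a hypothetical re-implementation on plain Python sequences (`linear_pairwise` and `hamming` are invented names); it returns nan whenever there is nothing to aggregate and does not reproduce the library's warnings or exceptions.

```python
import math


def hamming(a, b):
    """Stand-in entity cost: 0.0 when equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def linear_pairwise(seq_a, seq_b, cost, agg_fun="mean", padding_penalty=None):
    # Rank-by-rank alignment over the overlapping prefix.
    dists = [cost(a, b) for a, b in zip(seq_a, seq_b)]
    if padding_penalty is not None:
        # One penalty per unmatched position of the longer sequence.
        dists.extend([padding_penalty] * abs(len(seq_a) - len(seq_b)))
    if not dists:
        return math.nan  # nothing to aggregate: distance is undefined
    if agg_fun == "sum":
        return sum(dists)
    if agg_fun == "mean":
        return sum(dists) / len(dists)
    raise ValueError(f"unsupported agg_fun: {agg_fun!r}")


padded = linear_pairwise("AB", "ABCD", hamming, padding_penalty=1.0)
overlap_only = linear_pairwise("AB", "ABCD", hamming)
```

In this toy case the two sequences agree on their overlap, so the distance is driven entirely by the padding penalty.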

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of LinearPairwiseSettings

__init__(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)[source]#

Bases: object

Settings for LinearPairwiseSequenceMetric.

Parameters:

entity_metric – Entity-level metric. Accepts a registration name (string) or an instance. Pydantic auto-resolves strings via Registrable.__get_pydantic_core_schema__. Default: "hamming".
agg_fun – Aggregation function applied to the vector of entity distances. One of "mean" (default) or "sum".
padding_penalty – Distance value used for unmatched positions when sequences have different lengths. None → unmatched positions are ignored (only the overlap is aggregated). When the overlap is empty (one sequence has length 0), None makes the distance undefined (nan in a matrix, ValueError on a direct call).

__init__(*args: Any, **kwargs: Any) → None[source]#

agg_fun: str = 'mean'[source]#

entity_metric: EntityMetric = 'hamming'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

padding_penalty: float | None = None[source]#

class tanat.metric.SequenceMetric(settings=None, storage: StorageOptions | dict | None = None)[source]#

Bases: SettingsMixin, Registrable, DisplayMixin, ABC

Abstract base for sequence-level distance metrics.

Computes a scalar distance between two Sequence objects and a full pairwise DistanceMatrix over a pool.

MEMMAP_SUPPORT: bool = False[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

__init__(settings=None, storage: StorageOptions | dict | None = None) → None[source]#

compute_cross_matrix(pool_rows: SequencePool, pool_cols: SequencePool) → ndarray[source]#

Compute an asymmetric (n × k) distance matrix between two pools.

Row i ↔ sequence i in pool_rows; column j ↔ sequence j in pool_cols. The result is not symmetric.

Validates both pools (type-check + composition probe), then delegates to _compute_cross_matrix_impl(). Subclasses override _compute_cross_matrix_impl() to use Numba kernels when available.

Parameters:

pool_rows – Pool whose sequences form the rows (n items).
pool_cols – Pool whose sequences form the columns (k items).

Returns:

float32 numpy array of shape (n, k).

compute_matrix(pool: SequencePool, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → DistanceMatrix[source]#

Compute the full pairwise distance matrix for pool.

Parameters:

pool – A SequencePool.
store_path – Storage directory (None → in-memory).
chunk_size – Rows per flush chunk (default 500).
resume – Skip already-computed chunks (default True).
dtype – Numpy dtype for the matrix (default "float32").

Returns:

A DistanceMatrix of shape (n, n).
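
The saving offered by a symmetric metric (see IS_SYMMETRIC on EntityMetric) can be illustrated with a plain nested-list sketch. `pairwise_matrix` is a hypothetical helper, not the library's chunked or memmap-backed routine, and it assumes the diagonal is left at zero.

```python
def pairwise_matrix(items, dist, symmetric=True):
    """Fill an n x n matrix, mirroring the upper triangle when symmetric."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        start = i + 1 if symmetric else 0
        for j in range(start, n):
            if i == j:
                continue  # assumption: self-distance stays 0.0
            d = dist(items[i], items[j])
            matrix[i][j] = d
            if symmetric:
                matrix[j][i] = d  # reuse the value instead of recomputing
    return matrix


calls = []


def counted_dist(a, b):
    calls.append((a, b))
    return float(abs(a - b))


result = pairwise_matrix([1, 4, 6], counted_dist)
n_calls_symmetric = len(calls)
```

With three items the symmetric path evaluates the distance three times instead of six.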

property entity_metric: EntityMetric[source]#

Resolve settings.entity_metric (string → instance or pass-through).

Returns:

The resolved EntityMetric.

Raises:

AttributeError – If the concrete settings class has no entity_metric field.

abstractmethod validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#

Composition compatibility check between this metric and the given sequence(s).

Called from __call__() with both sequences, and from compute_matrix() with a single sample sequence (seq_b=None). Subclasses that compose with an EntityMetric probe a sample entity to surface compatibility errors early.

Parameters:

seq_a – Primary sequence to probe.
seq_b – Optional second sequence.

Raises:

TypeError – If the entity feature has an incompatible dtype.
KeyError – If a required feature is absent.

class tanat.metric.SoftDTWSequenceMetric(entity_metric: EntityMetric | str = 'hamming', gamma: float = 1.0, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Soft Dynamic Time Warping distance between two sequences.

Replaces the min operator in the DTW recurrence with a differentiable soft-minimum parameterised by gamma:

\[\text{soft-min}(a, b, c; \gamma) = -\gamma \log\bigl( e^{-a/\gamma} + e^{-b/\gamma} + e^{-c/\gamma}\bigr)\]

As gamma → 0, SoftDTW converges to standard DTW. As gamma → ∞, it approaches the mean of all alignment costs.

Empty-sequence behaviour:

Both empty → nan (no alignment possible).
One empty → nan (no alignment possible).

References

Cuturi & Blondel (2017) — Soft-DTW: a Differentiable Loss Function for Time-Series, ICML.

Example:

sdtw = SoftDTWSequenceMetric(gamma=0.1)
d    = sdtw(seq_a, seq_b)
dm   = sdtw.compute_matrix(pool)
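
The soft-minimum and the resulting recurrence can be sketched as below. This is an illustration under stated assumptions rather than tanat's kernel: `soft_min`, `soft_dtw` and `hamming` are hypothetical helpers, a full DP table is used for clarity, and the log-sum-exp is shifted by the smallest argument for numerical stability.

```python
import math


def hamming(a, b):
    """Stand-in entity cost: 0.0 when equal, 1.0 otherwise."""
    return 0.0 if a == b else 1.0


def soft_min(values, gamma):
    """-gamma * log(sum(exp(-v / gamma))), computed stably."""
    finite = [v for v in values if v != math.inf]
    if not finite:
        return math.inf
    smallest = min(finite)
    total = sum(math.exp(-(v - smallest) / gamma) for v in finite)
    return smallest - gamma * math.log(total)


def soft_dtw(seq_a, seq_b, cost, gamma=1.0):
    if gamma <= 0:
        raise ValueError("gamma must be > 0")
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        return math.nan  # no alignment possible
    r = [[math.inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            r[i][j] = step + soft_min(
                (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma
            )
    return r[n][m]


sharp = soft_dtw("ABC", "ABD", hamming, gamma=0.01)  # close to plain DTW
smooth = soft_dtw("ABC", "ABD", hamming, gamma=1.0)
```

Because the soft-minimum never exceeds the true minimum, the value stays at or below the plain DTW cost (1.0 for this pair) and decreases as gamma grows.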

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of SoftDTWSettings

__init__(entity_metric: EntityMetric | str = 'hamming', gamma: float = 1.0, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.SoftDTWSettings(*, entity_metric: EntityMetric = 'hamming', gamma: float = 1.0)[source]#

Bases: object

Settings for SoftDTWSequenceMetric.

Parameters:

entity_metric – Entity-level distance metric. Default: "hamming".
gamma – Regularisation parameter for the soft-min operator. Must be > 0. Large values produce a smoother (mean-like) approximation; small values approach standard DTW. Default: 1.0.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_metric: EntityMetric = 'hamming'[source]#

gamma: float = 1.0[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

class tanat.metric.StaticMetric(cmp_fnct: Callable | None = None)[source]#

Bases: object

Class for comparing static attributes between sequences or trajectories.

The user supplies the comparison function for static information when creating the StaticMetric object. If none is given, the class falls back to a binary comparison function that returns 0 if all attribute values match and 1 otherwise. The comparison uses only the static attributes shared by the two compared objects; if they share none, an error is raised.
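The default behaviour described above could look roughly like the following sketch. The function name `binary_static_distance` is hypothetical, and the exception type is an assumption, since the text only says that an error is raised.

```python
def binary_static_distance(static1: dict, static2: dict) -> float:
    """Return 0.0 if all shared attributes match, 1.0 otherwise."""
    shared = static1.keys() & static2.keys()  # attributes present in both
    if not shared:
        # Assumption: the real error type is not specified in the docs.
        raise ValueError("no shared static attributes to compare")
    return 0.0 if all(static1[k] == static2[k] for k in shared) else 1.0
```

Like any user-supplied comparison function, it takes two dictionaries and reads attribute values with bracket access.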

Computes a scalar distance between two Trajectory or Sequence objects.

Warning:

The comparison function must have two parameters that are dictionaries.
Use brackets with attribute names to access the value of a static
feature.

Static metrics are objects that compute a distance between pairs of trajectories or sequences. One can be defined as follows.

Example:

# Comparison between two dictionaries representing
# the static attributes of two trajectories/sequences.
# Values of the attributes of static features are accessed
# using named brackets only.

# This function must have two parameters and return a
# float.

# In this example, it compares the ages of the individuals
def static_metric(static1, static2) -> float:
    return abs(float(static1["age"]) - float(static2["age"]))


# Definition of a metric based on static features
csmet = StaticMetric(static_metric)

# Evaluate the metrics between two trajectories
val = csmet(traj1, traj2)

Static metrics are most useful when embedded in an AggregationTrajectoryMetric. Either a comparison function for static attributes or a StaticMetric can be supplied so that the aggregated metric takes static features into account.

Example:

# Same function to compare the `age` static attribute
def static_metric(static1, static2) -> float:
    return abs(float(static1["age"]) - float(static2["age"]))

# When an aggregation trajectory metric is defined, the user can
# add an additional component based on the static features by
# providing the function used to compare static features and also
# a weight in the weighted sum of metrics.
metric = AggregationTrajectoryMetric(static_metric=static_metric, static_metric_weight=0.5)
value = metric(traj1, traj2)

Information

StaticMetric can compute a cross-distance matrix from a pool of trajectories
or sequences. However, the computation is not optimized for large
datasets.

Warning:

Unlike the other functions that compute a cross-distance matrix, it does not return a DistanceMatrix; the result is a plain numpy ndarray.

__init__(cmp_fnct: Callable | None = None) → None[source]#

Parameters:

cmp_fnct – Function that compares static data. It must take two dictionaries as parameters.

compute_cross_matrix(pool_rows: TrajectoryPool | SequencePool, pool_cols: TrajectoryPool | SequencePool) → ndarray[source]#

Compute cross distance matrix based on the static data between two pools.

Parameters:

pool_rows – Trajectory or Sequence pool for rows (N trajectories).
pool_cols – Trajectory or Sequence pool for columns (M trajectories).

Returns:

(N x M) matrix containing the pairwise distances computed from static data.

Return type:

ndarray

compute_matrix(pool: TrajectoryPool | SequencePool) → ndarray[source]#

class tanat.metric.StorageOptions(*, store_path: str | Path, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: object

Disk-backed storage options for distance matrix computation.

Parameters:

store_path –
Directory where the matrix file and metadata are stored (required). Accepts the same formats as resolve_path:
- a plain name (e.g. "distances") - resolved via workspace store,
- a relative or absolute path (e.g. "./distances", Path(...)).
chunk_size – Number of matrix rows computed per chunk before flushing to disk. Larger = fewer I/O ops, smaller = finer resume granularity. Default: 500.
resume – If True (default), skip chunks already computed. If False, delete and recompute from scratch.
dtype – Numpy dtype string for the matrix. Default: "float32".
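The interaction between chunk_size and resume can be pictured with a small sketch. It only models the row ranges, not the disk I/O. The helper names and the `done` set of finished chunk start rows are hypothetical; how tanat actually records completed chunks is not described here.

```python
def chunk_ranges(n_rows, chunk_size=500):
    """Yield (start, stop) row ranges covering n_rows in blocks of chunk_size."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


def pending_chunks(n_rows, chunk_size, done, resume=True):
    """Return the row ranges still to compute; `done` holds finished start rows."""
    ranges = list(chunk_ranges(n_rows, chunk_size))
    if not resume:
        return ranges  # recompute everything from scratch
    return [r for r in ranges if r[0] not in done]
```

A smaller chunk_size produces more ranges, so an interrupted run loses less work on resume, at the price of more flushes.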

__init__(*args: Any, **kwargs: Any) → None[source]#

chunk_size: int = 500[source]#

dtype: str = 'float32'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

resume: bool = True[source]#

store_path: str | Path[source]#

class tanat.metric.TrajectoryMetric(settings=None, storage: StorageOptions | dict | None = None)[source]#

Bases: SettingsMixin, Registrable, DisplayMixin, ABC

Abstract base for trajectory-level distance metrics.

Computes a scalar distance between two Trajectory objects and a full pairwise DistanceMatrix over a TrajectoryPool.

MEMMAP_SUPPORT: bool = False[source]#: Set to True in subclasses that implement disk-backed (memmap) computation.

__init__(settings=None, storage: StorageOptions | dict | None = None) → None[source]#

compute_cross_matrix(pool_rows: TrajectoryPool, pool_cols: TrajectoryPool) → ndarray[source]#

Compute a rectangular (n × k) distance matrix between two pools. The result is not symmetric in general.

Row i ↔ trajectory i in pool_rows; column j ↔ trajectory j in pool_cols.

Validates both pools, then delegates to _compute_cross_matrix_impl(). Subclasses override _compute_cross_matrix_impl() to use optimised kernels.

Parameters:

pool_rows – Pool whose trajectories form the rows (n items).
pool_cols – Pool whose trajectories form the columns (k items).

Returns:

float32 numpy array of shape (n, k).
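The row/column convention can be sketched with a naive double loop. This is an illustration only: `cross_matrix` and its `distance` callable are hypothetical stand-ins, and plain Python lists replace the pools.

```python
import numpy as np


def cross_matrix(rows, cols, distance):
    """Fill an (n, k) float32 array with distance(rows[i], cols[j])."""
    out = np.empty((len(rows), len(cols)), dtype=np.float32)
    for i, a in enumerate(rows):      # row i <-> item i of `rows`
        for j, b in enumerate(cols):  # column j <-> item j of `cols`
            out[i, j] = distance(a, b)
    return out
```

A subclass that overrides _compute_cross_matrix_impl() would presumably replace this loop with an optimised kernel while keeping the same output shape.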

compute_matrix(pool: TrajectoryPool, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → DistanceMatrix[source]#

Compute full pairwise trajectory distance matrix.

Storage kwargs are forwarded to StorageOptions.

Parameters:

pool – A TrajectoryPool.
store_path – Storage directory (None → in-memory).
chunk_size – Rows per flush chunk (default 500).
resume – Skip already-computed chunks (default True).
dtype – Numpy dtype for the matrix (default "float32").

Returns:

A DistanceMatrix.