tanat.metric.sequence package#

Subpackages#

tanat.metric.sequence.type package
- Subpackages
- Module contents

Submodules#

tanat.metric.sequence.base module#

SequenceMetric ABC: base class for all sequence-level distance metrics.

class tanat.metric.sequence.base.SequenceMetric(settings=None, storage: StorageOptions | dict | None = None)[source]#

Bases: SettingsMixin, Registrable, DisplayMixin, ABC

Abstract base for sequence-level distance metrics.

Computes a scalar distance between two Sequence objects and a full pairwise DistanceMatrix over a pool.

MEMMAP_SUPPORT: bool = False[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

__init__(settings=None, storage: StorageOptions | dict | None = None) → None[source]#

compute_cross_matrix(pool_rows: SequencePool, pool_cols: SequencePool) → ndarray[source]#

Compute an asymmetric (n × k) distance matrix between two pools.

Row i ↔ sequence i in pool_rows; column j ↔ sequence j in pool_cols. The result is not symmetric.

Validates both pools (type-check + composition probe), then delegates to _compute_cross_matrix_impl(). Subclasses override _compute_cross_matrix_impl() to use Numba kernels when available.

Parameters:

pool_rows – Pool whose sequences form the rows (n items).
pool_cols – Pool whose sequences form the columns (k items).

Returns:

float32 numpy array of shape (n, k).
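The row/column convention above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: `cross_matrix` and `mismatches` are hypothetical names, plain strings stand in for Sequence objects, and nested lists stand in for the float32 array.

```python
def mismatches(a, b):
    """Toy distance (assumption): number of differing positions in the overlap."""
    return float(sum(x != y for x, y in zip(a, b)))


def cross_matrix(rows, cols, distance):
    """Return an n x k nested list; entry [i][j] is distance(rows[i], cols[j])."""
    return [[distance(r, c) for c in cols] for r in rows]


pool_rows = ["AB", "AC", "BB"]   # n = 3
pool_cols = ["AB", "CC"]         # k = 2
matrix = cross_matrix(pool_rows, pool_cols, mismatches)
```

Because the two pools differ, the result is rectangular and has no diagonal of zeros to rely on.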

compute_matrix(pool: SequencePool, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → DistanceMatrix[source]#

Compute the full pairwise distance matrix for pool.

Parameters:

pool – A SequencePool.
store_path – Storage directory (None → in-memory).
chunk_size – Rows per flush chunk (default 500).
resume – Skip already-computed chunks (default True).
dtype – Numpy dtype for the matrix (default "float32").

Returns:

A DistanceMatrix of shape (n, n).

property entity_metric: EntityMetric[source]#

Resolve settings.entity_metric (string → instance or pass-through).

Returns: The resolved EntityMetric.
Raises: AttributeError – If the concrete settings class has no entity_metric field.

abstractmethod validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#

Composition compatibility check between this metric and the given sequence(s).

Called from __call__() with both sequences, and from compute_matrix() with a single sample sequence (seq_b=None). Subclasses that compose with an EntityMetric probe a sample entity to surface compatibility errors early.

Parameters:

seq_a – Primary sequence to probe.
seq_b – Optional second sequence.

Raises:

TypeError – If the entity feature has an incompatible dtype.
KeyError – If a required feature is absent.

Module contents#

Sequence metric sub-package.

class tanat.metric.sequence.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Chi-squared distance between the state-time distributions of two sequences.

Rather than comparing sequences element-by-element, Chi² compares the proportion of time (or event count) spent in each categorical state.

Event sequences: each event contributes a weight of 1.
Interval / State sequences: each entity contributes end − start as its weight.

The per-category proportions are computed independently for each sequence, then compared via the Chi-squared distance formula:

\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]

Note

Chi² does not use an entity metric; entity_metric is absent from its settings. The validate_composition method checks only that the requested feature is present.

Empty-sequence behaviour:

Both empty → 0.0 (identical empty distributions).
One empty → 1.0 (maximally different distributions).
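The formula and the empty-sequence rules can be sketched in plain Python as follows. This is an illustration only, not the library's kernel: `proportions` and `chi2_distance` are hypothetical helpers, and each sequence is assumed to be a list of (state, weight) pairs, with weight 1 for events and end − start for intervals.

```python
import math
from collections import Counter


def proportions(weighted_states):
    """Turn (state, weight) pairs into per-state proportions."""
    totals = Counter()
    for state, weight in weighted_states:
        totals[state] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # empty sequence: no mass in any state
    return {state: w / grand_total for state, w in totals.items()}


def chi2_distance(seq_a, seq_b):
    """sqrt(sum_j (p_aj - p_bj)^2 / (p_aj + p_bj)) over states seen in either sequence."""
    p_a, p_b = proportions(seq_a), proportions(seq_b)
    total = 0.0
    for state in set(p_a) | set(p_b):
        a, b = p_a.get(state, 0.0), p_b.get(state, 0.0)
        if a + b > 0:  # skip states absent from both
            total += (a - b) ** 2 / (a + b)
    return math.sqrt(total)


# Interval-style input: weights are durations.
seq_a = [("employed", 3.0), ("unemployed", 1.0)]
seq_b = [("employed", 1.0), ("unemployed", 3.0)]
d = chi2_distance(seq_a, seq_b)
```

With this formulation the two empty-sequence cases fall out of the formula itself: two empty inputs sum to 0, and one empty input leaves the other sequence's proportions, which sum to 1.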

Example:

chi2 = Chi2SequenceMetric(entity_feature="status")
d    = chi2(seq_a, seq_b)
dm   = chi2.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of Chi2Settings

__init__(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

prepare_batch_data(pool: SequencePool) → tuple[source]#

Build histogram arrays for all sequences in pool.

Returns: (hists, n_cats) where hists is a float32 numpy array of shape (n, n_cats) containing raw (unnormalised) weights, with rows ordered to match pool.unique_ids.

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#

Resolve and validate the target feature.

If entity_feature was not specified, resolves to the first entity feature of seq_a and stores it in target_feature. Then checks that the feature is present and categorical in every provided sequence.

Parameters:

seq_a – Primary sequence.
seq_b – Optional second sequence.

Raises:

KeyError – If the feature is absent from a sequence.
TypeError – If the feature is not categorical.

class tanat.metric.sequence.Chi2Settings(*, entity_feature: str | None = None)[source]#

Bases: object

Settings for Chi2SequenceMetric.

Parameters: entity_feature – Categorical feature name used as the histogram key (same semantics as HammingEntityMetric). None → resolved from the first entity feature of the sequence at validate_composition() time.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_feature: str | None = None[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

class tanat.metric.sequence.DTWSequenceMetric(entity_metric: EntityMetric | str = 'hamming', window: int | None = None, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Dynamic Time Warping distance between two sequences.

Uses a space-optimised 2-row DP. The Sakoe-Chiba band is applied when window is set, limiting the warping path to stay within window cells of the diagonal.

Empty-sequence behaviour:

Both empty → nan (no alignment possible).
One empty → nan (no alignment possible).

When normalize=True, divides the raw DTW cost by len_a + len_b (an approximation that does not require path backtracking).
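A minimal sketch of a two-row DTW with an optional Sakoe-Chiba band is shown below. It is written under assumptions and is not the library's kernel: `dtw_distance` and `hamming` are hypothetical names, sequences are plain Python lists, and the band is not widened when the lengths differ by more than `window` (in that case this sketch returns inf).

```python
import math


def hamming(x, y):
    """Toy entity distance (assumption): 0 if equal, 1 otherwise."""
    return 0.0 if x == y else 1.0


def dtw_distance(seq_a, seq_b, cost, window=None, normalize=False):
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        return math.nan  # no alignment possible
    inf = math.inf
    prev = [inf] * (m + 1)
    prev[0] = 0.0  # cost of aligning two empty prefixes
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        # Restrict columns to the band around the diagonal when window is set.
        lo, hi = (1, m) if window is None else (max(1, i - window), min(m, i + window))
        for j in range(lo, hi + 1):
            best = min(prev[j], curr[j - 1], prev[j - 1])
            curr[j] = cost(seq_a[i - 1], seq_b[j - 1]) + best
        prev = curr  # keep only the last row
    raw = prev[m]
    return raw / (n + m) if normalize else raw


d_raw = dtw_distance(list("ABC"), list("ABD"), hamming)
d_norm = dtw_distance(list("ABC"), list("ABD"), hamming, window=1, normalize=True)
```

Only two rows are alive at any time, so memory is O(m) rather than O(n × m); the price is that the warping path itself cannot be recovered, which is why the normalisation uses len_a + len_b instead of the path length.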

Example:

dtw = DTWSequenceMetric(window=3, normalize=True)
d   = dtw(seq_a, seq_b)
dm  = dtw.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of DTWSettings

__init__(entity_metric: EntityMetric | str = 'hamming', window: int | None = None, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.DTWSettings(*, entity_metric: EntityMetric = 'hamming', window: int | None = None, normalize: bool = False)[source]#

Bases: object

Settings for DTWSequenceMetric.

Parameters:

entity_metric – Entity-level distance metric. Default: "hamming".
window – Sakoe-Chiba band width (number of cells off the diagonal). None means no constraint (full DTW). Must be > 0 when set.
normalize – When True, divide the DTW cost by len_a + len_b (approximation that avoids O(n×m) backtracking). Default: False.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_metric: EntityMetric = 'hamming'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

normalize: bool = False[source]#

window: int | None = None[source]#

class tanat.metric.sequence.EditSequenceMetric(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Needleman-Wunsch edit distance between two sequences.

Computes the minimum-cost alignment between two sequences using a full O(n × m) DP matrix (Needleman-Wunsch). Substitution cost comes from the entity metric; insertions and deletions cost indel_cost each.

When normalize=True, the raw distance is divided by max(len_a, len_b). The result lies in [0, 1] provided that neither the substitution cost nor indel_cost exceeds 1.

Empty-sequence behaviour:

Both empty → 0.0 (no edits needed).
One empty → n × indel_cost, where n is the length of the non-empty sequence (all insertions/deletions).
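The recurrence described above can be sketched with a full DP table. This is an illustrative version under assumptions, not the library's code: `edit_distance` and `hamming` are hypothetical names, sequences are plain lists, and the normalised distance of two empty sequences is taken to be 0.0.

```python
def hamming(x, y):
    """Toy substitution cost (assumption): 0 if equal, 1 otherwise."""
    return 0.0 if x == y else 1.0


def edit_distance(seq_a, seq_b, sub_cost, indel_cost=1.0, normalize=False):
    n, m = len(seq_a), len(seq_b)
    # dp[i][j] = minimum cost of aligning seq_a[:i] with seq_b[:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost  # delete everything from seq_a
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost  # insert everything from seq_b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + indel_cost,
                dp[i][j - 1] + indel_cost,
            )
    raw = dp[n][m]
    if normalize:
        longest = max(n, m)
        return raw / longest if longest else 0.0
    return raw


d = edit_distance(list("kitten"), list("sitting"), hamming)
```

The border initialisation is what produces the empty-sequence values listed above: with one empty input the answer is read directly from the first row or column.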

Example:

edit = EditSequenceMetric(indel_cost=0.5, normalize=True)
d    = edit(seq_a, seq_b)
dm   = edit.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of EditSettings

__init__(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.EditSettings(*, entity_metric: EntityMetric = 'hamming', indel_cost: float = 1.0, normalize: bool = False)[source]#

Bases: object

Settings for EditSequenceMetric.

Parameters:

entity_metric – Entity-level substitution cost metric. Default: "hamming".
indel_cost – Cost per insertion or deletion. Must be > 0. Default: 1.0.
normalize – When True, divide the raw edit distance by max(len_a, len_b). The result lies in [0, 1] when substitution costs and indel_cost do not exceed 1. Default: False.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_metric: EntityMetric = 'hamming'[source]#

indel_cost: float = 1.0[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

normalize: bool = False[source]#

class tanat.metric.sequence.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Longest Common Prefix distance between two sequences.

Scans the sequences from the start and counts consecutive positions where the two entities are equal (i.e. their entity distance ≤ equality_threshold). The scan stops at the first mismatch.

Three output modes are available:

"length" → raw prefix length (not a proper distance).
"distance" → len_a + len_b − 2·LCP (always ≥ 0).
"normalized" → 1 − 2·LCP / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

Both empty → 0.0 (for all modes).
One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
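The prefix scan and the three output modes can be sketched as follows. This is a minimal illustration, not the library's implementation: `lcp_metric` and `hamming` are hypothetical names and sequences are plain lists.

```python
def hamming(x, y):
    """Toy entity distance (assumption): 0 if equal, 1 otherwise."""
    return 0.0 if x == y else 1.0


def lcp_metric(seq_a, seq_b, entity_distance, equality_threshold=0.0, mode="distance"):
    # Count matching positions from the start; stop at the first mismatch.
    lcp = 0
    for a, b in zip(seq_a, seq_b):
        if entity_distance(a, b) > equality_threshold:
            break
        lcp += 1
    total = len(seq_a) + len(seq_b)
    if mode == "length":
        return float(lcp)
    if mode == "distance":
        return float(total - 2 * lcp)
    if mode == "normalized":
        return 0.0 if total == 0 else 1.0 - 2.0 * lcp / total
    raise ValueError(f"unknown mode: {mode!r}")


a, b = list("ABCD"), list("ABXD")
results = {m: lcp_metric(a, b, hamming, mode=m) for m in ("length", "distance", "normalized")}
```

Note that the matching "D" at the last position does not count: anything after the first mismatch is ignored.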

Example:

lcp = LCPSequenceMetric(mode="normalized")
d   = lcp(seq_a, seq_b)
dm  = lcp.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of LCPSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')[source]#

Bases: object

Settings for LCPSequenceMetric.

Parameters:

entity_metric – Entity-level metric. Default: "hamming".
equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Must be ≥ 0. Default: 0.0.
mode –
Output mode.
- "length" → raw LCP length (not a distance, can be > 1).
- "distance" → additive distance: len_a + len_b − 2·LCP (default).
- "normalized" → Jaccard-like distance: 1 − 2·LCP / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_metric: EntityMetric = 'hamming'[source]#

equality_threshold: float = 0.0[source]#

mode: Literal['length', 'distance', 'normalized'] = 'distance'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

class tanat.metric.sequence.LCSSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Longest Common Subsequence distance between two sequences.

Computes the LCS length using a space-optimised DP (2-row rolling array). Two entities are considered equal when their entity distance ≤ equality_threshold.

Three output modes are available:

"length" → raw LCS length (not a proper distance).
"distance" → len_a + len_b − 2·LCS (always ≥ 0).
"normalized" → 1 − 2·LCS / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

Both empty → 0.0 (for all modes).
One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
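A two-row rolling-array LCS with threshold-based equality might look like the sketch below. It is an illustration under assumptions rather than the library's kernel: `lcs_length`, `lcs_metric`, and `hamming` are hypothetical names and sequences are plain lists.

```python
def hamming(x, y):
    """Toy entity distance (assumption): 0 if equal, 1 otherwise."""
    return 0.0 if x == y else 1.0


def lcs_length(seq_a, seq_b, entity_distance, equality_threshold=0.0):
    prev = [0] * (len(seq_b) + 1)  # row for the previous prefix of seq_a
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            if entity_distance(a, b) <= equality_threshold:
                curr.append(prev[j - 1] + 1)             # extend the common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))   # skip one element
        prev = curr
    return prev[-1]


def lcs_metric(seq_a, seq_b, entity_distance, equality_threshold=0.0, mode="distance"):
    lcs = lcs_length(seq_a, seq_b, entity_distance, equality_threshold)
    total = len(seq_a) + len(seq_b)
    if mode == "length":
        return float(lcs)
    if mode == "distance":
        return float(total - 2 * lcs)
    if mode == "normalized":
        return 0.0 if total == 0 else 1.0 - 2.0 * lcs / total
    raise ValueError(f"unknown mode: {mode!r}")


a, b = list("AGGTAB"), list("GXTXAYB")
length = lcs_metric(a, b, hamming, mode="length")
```

Unlike the prefix-based variant, matches may be separated by gaps, so a single early mismatch does not end the comparison.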

Example:

lcs = LCSSequenceMetric(mode="normalized", equality_threshold=0.1)
d   = lcs(seq_a, seq_b)
dm  = lcs.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of LCSSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.LCSSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')[source]#

Bases: object

Settings for LCSSequenceMetric.

Parameters:

entity_metric – Entity-level metric. Default: "hamming".
equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Default: 0.0.
mode –
Output mode.
- "length" → raw LCS length (not a proper distance).
- "distance" → additive distance: len_a + len_b − 2·LCS.
- "normalized" → Jaccard-like: 1 − 2·LCS / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_metric: EntityMetric = 'hamming'[source]#

equality_threshold: float = 0.0[source]#

mode: Literal['length', 'distance', 'normalized'] = 'distance'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

class tanat.metric.sequence.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Sequence metric by linear (position-wise) alignment of entities.

Aligns seq_a and seq_b rank-by-rank and applies the configured entity metric to each aligned pair. The resulting vector of entity distances is aggregated (mean, sum, …) to produce a single scalar sequence distance.

When sequences differ in length, padding_penalty is applied for each unmatched position of the longer sequence. If padding_penalty is None, only the overlapping prefix is used.

Empty-sequence behaviour:

Both empty → nan (distance is undefined).
One empty, padding_penalty is set → all positions are padded.
One empty, padding_penalty is None → nan with a warning suggesting to set padding_penalty.
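The alignment and padding rules can be sketched in a few lines. This is a simplified illustration, not the library's code: `linear_pairwise` and `hamming` are hypothetical names, sequences are plain lists, only "mean" and "sum" are handled, and the undefined case simply returns nan without emitting the warning mentioned above.

```python
import math


def hamming(x, y):
    """Toy entity distance (assumption): 0 if equal, 1 otherwise."""
    return 0.0 if x == y else 1.0


def linear_pairwise(seq_a, seq_b, entity_distance, agg_fun="mean", padding_penalty=None):
    # Rank-by-rank comparison over the overlapping prefix.
    distances = [entity_distance(a, b) for a, b in zip(seq_a, seq_b)]
    if padding_penalty is not None:
        # One penalty per unmatched position of the longer sequence.
        distances += [padding_penalty] * abs(len(seq_a) - len(seq_b))
    if not distances:
        return math.nan  # nothing to aggregate: distance is undefined
    total = sum(distances)
    if agg_fun == "sum":
        return total
    if agg_fun == "mean":
        return total / len(distances)
    raise ValueError(f"unsupported agg_fun: {agg_fun!r}")


short, long_ = list("AB"), list("ABCD")
overlap_only = linear_pairwise(short, long_, hamming)
padded_mean = linear_pairwise(short, long_, hamming, padding_penalty=1.0)
```

The two calls differ only in `padding_penalty`: without it the length difference is invisible to the metric, so sequences that share a prefix look identical.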

Example:

hamming = HammingEntityMetric(entity_feature="status")
lp = LinearPairwiseSequenceMetric(entity_metric=hamming)

dist = lp(seq_a, seq_b)
dm   = lp.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of LinearPairwiseSettings

__init__(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)[source]#

Bases: object

Settings for LinearPairwiseSequenceMetric.

Parameters:

entity_metric – Entity-level metric. Accepts a registration name (string) or an instance. Pydantic auto-resolves strings via Registrable.__get_pydantic_core_schema__. Default: "hamming".
agg_fun – Aggregation function applied to the vector of entity distances. One of "mean" (default) or "sum".
padding_penalty – Distance value used for unmatched positions when sequences have different lengths. None → unmatched positions are ignored (only the overlap is aggregated). When the overlap is empty (one sequence has length 0), None makes the distance undefined (nan in a matrix, ValueError on a direct call).

__init__(*args: Any, **kwargs: Any) → None[source]#

agg_fun: str = 'mean'[source]#

entity_metric: EntityMetric = 'hamming'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

padding_penalty: float | None = None[source]#

class tanat.metric.sequence.SequenceMetric(settings=None, storage: StorageOptions | dict | None = None)[source]#

Bases: SettingsMixin, Registrable, DisplayMixin, ABC

Abstract base for sequence-level distance metrics.

Computes a scalar distance between two Sequence objects and a full pairwise DistanceMatrix over a pool.

MEMMAP_SUPPORT: bool = False[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

__init__(settings=None, storage: StorageOptions | dict | None = None) → None[source]#

compute_cross_matrix(pool_rows: SequencePool, pool_cols: SequencePool) → ndarray[source]#

Compute an asymmetric (n × k) distance matrix between two pools.

Row i ↔ sequence i in pool_rows; column j ↔ sequence j in pool_cols. The result is not symmetric.

Validates both pools (type-check + composition probe), then delegates to _compute_cross_matrix_impl(). Subclasses override _compute_cross_matrix_impl() to use Numba kernels when available.

Parameters:

pool_rows – Pool whose sequences form the rows (n items).
pool_cols – Pool whose sequences form the columns (k items).

Returns:

float32 numpy array of shape (n, k).

compute_matrix(pool: SequencePool, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → DistanceMatrix[source]#

Compute the full pairwise distance matrix for pool.

Parameters:

pool – A SequencePool.
store_path – Storage directory (None → in-memory).
chunk_size – Rows per flush chunk (default 500).
resume – Skip already-computed chunks (default True).
dtype – Numpy dtype for the matrix (default "float32").

Returns:

A DistanceMatrix of shape (n, n).

property entity_metric: EntityMetric[source]#

Resolve settings.entity_metric (string → instance or pass-through).

Returns: The resolved EntityMetric.
Raises: AttributeError – If the concrete settings class has no entity_metric field.

abstractmethod validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#

Composition compatibility check between this metric and the given sequence(s).

Parameters:

seq_a – Primary sequence to probe.
seq_b – Optional second sequence.

Raises:

TypeError – If the entity feature has an incompatible dtype.
KeyError – If a required feature is absent.

class tanat.metric.sequence.SoftDTWSequenceMetric(entity_metric: EntityMetric | str = 'hamming', gamma: float = 1.0, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Soft Dynamic Time Warping distance between two sequences.

Replaces the min operator in the DTW recurrence with a differentiable soft-minimum parameterised by gamma:

\[\text{soft-min}(a, b, c; \gamma) = -\gamma \log\bigl( e^{-a/\gamma} + e^{-b/\gamma} + e^{-c/\gamma}\bigr)\]

As gamma → 0, SoftDTW converges to standard DTW. As gamma → ∞, it approaches the mean of all alignment costs.

Empty-sequence behaviour:

Both empty → nan (no alignment possible).
One empty → nan (no alignment possible).
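The soft-minimum and the resulting recurrence can be sketched as follows. This is an illustration under assumptions, not the library's kernel: `soft_min`, `soft_dtw`, and `hamming` are hypothetical names, sequences are plain lists, and a full table is used for clarity. The soft-min is evaluated in a shifted (log-sum-exp) form so that small gamma values do not underflow.

```python
import math


def hamming(x, y):
    """Toy entity distance (assumption): 0 if equal, 1 otherwise."""
    return 0.0 if x == y else 1.0


def soft_min(values, gamma):
    """-gamma * log(sum(exp(-v / gamma))), computed relative to the minimum."""
    lowest = min(values)
    if math.isinf(lowest):
        return lowest  # every predecessor is unreachable
    return lowest - gamma * math.log(sum(math.exp(-(v - lowest) / gamma) for v in values))


def soft_dtw(seq_a, seq_b, cost, gamma=1.0):
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        return math.nan  # no alignment possible
    inf = math.inf
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            smooth = soft_min((r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma)
            r[i][j] = cost(seq_a[i - 1], seq_b[j - 1]) + smooth
    return r[n][m]


near_hard = soft_dtw(list("ABC"), list("ABD"), hamming, gamma=0.01)
smooth = soft_dtw(list("ABC"), list("ABD"), hamming, gamma=1.0)
```

The soft-minimum never exceeds the true minimum, so the soft value is bounded above by the hard DTW cost (1 for this pair) and can become negative for larger gamma.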

References

Cuturi & Blondel (2017) — Soft-DTW: a Differentiable Loss Function for Time-Series, ICML.

Example:

sdtw = SoftDTWSequenceMetric(gamma=0.1)
d    = sdtw(seq_a, seq_b)
dm   = sdtw.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of SoftDTWSettings

__init__(entity_metric: EntityMetric | str = 'hamming', gamma: float = 1.0, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#: Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.SoftDTWSettings(*, entity_metric: EntityMetric = 'hamming', gamma: float = 1.0)[source]#

Bases: object

Settings for SoftDTWSequenceMetric.

Parameters:

entity_metric – Entity-level distance metric. Default: "hamming".
gamma – Regularisation parameter for the soft-min operator. Must be > 0. Large values produce a smoother (mean-like) approximation; small values approach standard DTW. Default: 1.0.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_metric: EntityMetric = 'hamming'[source]#

gamma: float = 1.0[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.