tanat.metric.sequence package

Submodules

tanat.metric.sequence.base module

SequenceMetric ABC: base class for all sequence-level distance metrics.

class tanat.metric.sequence.base.SequenceMetric(settings=None, storage: StorageOptions | dict | None = None)[source]#

Bases: SettingsMixin, Registrable, DisplayMixin, ABC

Abstract base for sequence-level distance metrics.

Computes a scalar distance between two Sequence objects and a full pairwise DistanceMatrix over a pool.

MEMMAP_SUPPORT: bool = False[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

__init__(settings=None, storage: StorageOptions | dict | None = None) None[source]#
compute_cross_matrix(pool_rows: SequencePool, pool_cols: SequencePool) ndarray[source]#

Compute an asymmetric (n × k) distance matrix between two pools.

Row i ↔ sequence i in pool_rows; column j ↔ sequence j in pool_cols. The result is not symmetric.

Validates both pools (type-check + composition probe), then delegates to _compute_cross_matrix_impl(). Subclasses override _compute_cross_matrix_impl() to use Numba kernels when available.

Parameters:
  • pool_rows – Pool whose sequences form the rows (n items).

  • pool_cols – Pool whose sequences form the columns (k items).

Returns:

float32 numpy array of shape (n, k).
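The row/column layout described above can be illustrated with a minimal sketch. It uses plain strings in place of SequencePool objects and a toy distance function; `cross_matrix` and `mismatch_count` are hypothetical names for illustration, not part of tanat.

```python
import numpy as np


def cross_matrix(rows, cols, dist):
    """Return a float32 (n, k) array with out[i, j] = dist(rows[i], cols[j])."""
    out = np.empty((len(rows), len(cols)), dtype=np.float32)
    for i, a in enumerate(rows):
        for j, b in enumerate(cols):
            out[i, j] = dist(a, b)
    return out


def mismatch_count(a, b):
    # Toy distance: differing positions in the overlap plus the length gap.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


rows = ["AAB", "ABB"]        # n = 2 sequences
cols = ["AAB", "BBB", "A"]   # k = 3 sequences
m = cross_matrix(rows, cols, mismatch_count)
print(m.shape)  # (2, 3)
```

Because the two pools differ, the result is rectangular in general and there is no symmetry to exploit, unlike the (n, n) case.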

compute_matrix(pool: SequencePool, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') DistanceMatrix[source]#

Compute the full pairwise distance matrix for pool.

Parameters:
  • pool – A SequencePool.

  • store_path – Storage directory (None → in-memory).

  • chunk_size – Rows per flush chunk (default 500).

  • resume – Skip already-computed chunks (default True).

  • dtype – Numpy dtype for the matrix (default "float32").

Returns:

A DistanceMatrix of shape (n, n).
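How store_path, chunk_size and resume could interact is sketched below with NumPy's memmap. This is an illustration under stated assumptions, not tanat's storage layer: the function name, the file names `matrix.dat` and `done_chunks.txt`, and the bookkeeping format are all hypothetical, and the sketch fills every cell instead of exploiting symmetry.

```python
import tempfile
from pathlib import Path

import numpy as np


def pairwise_to_memmap(items, dist, store_dir, chunk_size=500, resume=True,
                       dtype="float32"):
    """Fill an (n, n) matrix on disk, flushing after each block of rows."""
    n = len(items)
    store_dir = Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    data_path = store_dir / "matrix.dat"        # hypothetical file name
    done_path = store_dir / "done_chunks.txt"   # hypothetical file name

    reuse = resume and data_path.exists() and done_path.exists()
    matrix = np.memmap(data_path, dtype=dtype,
                       mode="r+" if reuse else "w+", shape=(n, n))
    done = {int(tok) for tok in done_path.read_text().split()} if reuse else set()
    if not reuse:
        done_path.write_text("")

    computed = 0
    for start in range(0, n, chunk_size):
        if start in done:
            continue  # this chunk was finished in an earlier run
        stop = min(start + chunk_size, n)
        for i in range(start, stop):
            for j in range(n):
                matrix[i, j] = dist(items[i], items[j])
        matrix.flush()
        with done_path.open("a") as fh:
            fh.write(f"{start}\n")  # record the chunk only after the flush
        computed += 1
    return matrix, computed


def abs_diff(a, b):
    return abs(a - b)


with tempfile.TemporaryDirectory() as tmp:
    items = [1.0, 4.0, 6.0]
    mm, first_run = pairwise_to_memmap(items, abs_diff, tmp, chunk_size=2)
    snapshot = np.array(mm)  # copy into memory before the directory is removed
    del mm
    mm2, second_run = pairwise_to_memmap(items, abs_diff, tmp, chunk_size=2)
    del mm2
```

In this sketch the second call finds both chunks already recorded and computes nothing, which is the behaviour the resume flag describes.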

property entity_metric: EntityMetric[source]#

Resolve settings.entity_metric (string → instance or pass-through).

Returns:

The resolved EntityMetric.

Raises:

AttributeError – If the concrete settings class has no entity_metric field.

abstractmethod validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Check that this metric is compatible with the given sequence(s).

Called from __call__() with both sequences, and from compute_matrix() with a single sample sequence (seq_b=None). Subclasses that compose with an EntityMetric probe a sample entity to surface compatibility errors early.

Parameters:
  • seq_a – Primary sequence to probe.

  • seq_b – Optional second sequence.

Raises:
  • TypeError – If the entity feature has an incompatible dtype.

  • KeyError – If a required feature is absent.

Module contents

Sequence metric sub-package.

class tanat.metric.sequence.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Chi-squared distance between the state-time distributions of two sequences.

Rather than comparing sequences element-by-element, Chi² compares the proportion of time (or event count) spent in each categorical state.

  • Event sequences: each event contributes a weight of 1.

  • Interval / State sequences: each entity contributes end − start as its weight.

The per-category proportions are computed independently for each sequence, then compared via the Chi-squared distance formula:

\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]

Note

Chi² does not use an entity metric; entity_metric is absent from its settings. The validate_composition method checks only that the requested feature is present.

Empty-sequence behaviour:

  • Both empty → 0.0 (identical empty distributions).

  • One empty → 1.0 (maximally different distributions).
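The formula and the empty-sequence rules above can be checked with a small pure-Python sketch. The helper names `proportions` and `chi2_distance` are hypothetical and work on plain lists of states; this is not the library's histogram code.

```python
import math
from collections import Counter


def proportions(states, weights=None):
    """Share of total weight per state; each element weighs 1 by default."""
    if weights is None:
        weights = [1.0] * len(states)
    totals = Counter()
    for state, weight in zip(states, weights):
        totals[state] += weight
    total = sum(totals.values())
    return {s: w / total for s, w in totals.items()} if total > 0 else {}


def chi2_distance(p_a, p_b):
    """sqrt(sum_j (p_aj - p_bj)^2 / (p_aj + p_bj)) with the empty-case rules."""
    if not p_a and not p_b:
        return 0.0
    if not p_a or not p_b:
        return 1.0
    acc = 0.0
    for state in set(p_a) | set(p_b):
        a, b = p_a.get(state, 0.0), p_b.get(state, 0.0)
        if a + b > 0:
            acc += (a - b) ** 2 / (a + b)
    return math.sqrt(acc)


# Event sequences: every event has weight 1.
p_events_1 = proportions(list("AABB"))   # A: 0.5, B: 0.5
p_events_2 = proportions(list("AAAB"))   # A: 0.75, B: 0.25
d_events = chi2_distance(p_events_1, p_events_2)

# Interval sequence: weights are durations (end - start).
p_interval = proportions(["A", "B"], weights=[3.0, 1.0])  # A: 0.75, B: 0.25
d_same_shape = chi2_distance(p_interval, p_events_2)
```

Here `d_same_shape` is zero even though one sequence has four events and the other two intervals, because only the per-state proportions are compared.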

Example:

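from tanat.metric.sequence import Chi2SequenceMetric

# seq_a, seq_b are Sequence objects and pool is a SequencePool, built elsewhere.
chi2 = Chi2SequenceMetric(entity_feature="status")
d = chi2(seq_a, seq_b)           # scalar distance
dm = chi2.compute_matrix(pool)   # DistanceMatrix of shape (n, n)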
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of Chi2Settings

__init__(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
prepare_batch_data(pool: SequencePool) tuple[source]#

Build histogram arrays for all sequences in pool.

Returns:

(hists, n_cats) where hists is a float32 numpy array of shape (n, n_cats) containing raw (unnormalised) weights, with rows ordered to match pool.unique_ids.

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Resolve and validate the target feature.

If entity_feature was not specified, resolves to the first entity feature of seq_a and stores it in target_feature. Then checks that the feature is present and categorical in every provided sequence.

Parameters:
  • seq_a – Primary sequence.

  • seq_b – Optional second sequence.

Raises:
  • KeyError – If the feature is absent from a sequence.

  • TypeError – If the feature is not categorical.

class tanat.metric.sequence.Chi2Settings(*, entity_feature: str | None = None)[source]#

Bases: object

Settings for Chi2SequenceMetric.

Parameters:

entity_feature – Categorical feature name used as the histogram key (same semantics as HammingEntityMetric). None → resolved from the first entity feature of the sequence at validate_composition() time.

__init__(*args: Any, **kwargs: Any) None[source]#
entity_feature: str | None = None[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

class tanat.metric.sequence.DTWSequenceMetric(entity_metric: EntityMetric | str = 'hamming', window: int | None = None, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Dynamic Time Warping distance between two sequences.

Uses a space-optimised 2-row DP. The Sakoe-Chiba band is applied when window is set, limiting the warping path to stay within window cells of the diagonal.

Empty-sequence behaviour:

  • Both empty → nan (no alignment possible).

  • One empty → nan (no alignment possible).

When normalize=True, divides the raw DTW cost by len_a + len_b (an approximation that does not require path backtracking).
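A minimal sketch of the two-row DP with an optional Sakoe-Chiba band follows. It operates on plain strings with a toy 0/1 entity cost (`hamming` here is a stand-in, not tanat's entity metric). Widening the band to at least the length difference, so that the final cell stays reachable, is an assumption of this sketch and may differ from the library's behaviour.

```python
import math


def hamming(x, y):
    # Toy entity cost: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def dtw(a, b, cost, window=None, normalize=False):
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return math.nan  # no alignment possible
    # Assumption: the band is widened to |n - m| so cell (n, m) is reachable.
    w = max(n, m) if window is None else max(window, abs(n - m))
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        # Only cells within w of the diagonal are filled.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            best = min(prev[j], curr[j - 1], prev[j - 1])
            curr[j] = cost(a[i - 1], b[j - 1]) + best
        prev = curr  # keep two rows only
    total = prev[m]
    return total / (n + m) if normalize else total


d_warp = dtw("AAB", "AB", hamming)                              # repeated A is absorbed
d_band = dtw("ABC", "ABD", hamming, window=1, normalize=True)   # raw cost 1 over 3 + 3
```

Dividing by len_a + len_b needs only the two lengths, which is why no path backtracking is required for the normalised value.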

Example:

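from tanat.metric.sequence import DTWSequenceMetric

# seq_a, seq_b are Sequence objects and pool is a SequencePool, built elsewhere.
dtw = DTWSequenceMetric(window=3, normalize=True)
d = dtw(seq_a, seq_b)           # scalar distance
dm = dtw.compute_matrix(pool)   # DistanceMatrix of shape (n, n)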
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of DTWSettings

__init__(entity_metric: EntityMetric | str = 'hamming', window: int | None = None, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.DTWSettings(*, entity_metric: EntityMetric = 'hamming', window: int | None = None, normalize: bool = False)[source]#

Bases: object

Settings for DTWSequenceMetric.

Parameters:
  • entity_metric – Entity-level distance metric. Default: "hamming".

  • window – Sakoe-Chiba band width (number of cells off the diagonal). None means no constraint (full DTW). Must be > 0 when set.

  • normalize – When True, divide the DTW cost by len_a + len_b (approximation that avoids O(n×m) backtracking). Default: False.

__init__(*args: Any, **kwargs: Any) None[source]#
entity_metric: EntityMetric = 'hamming'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

normalize: bool = False[source]#
window: int | None = None[source]#
class tanat.metric.sequence.EditSequenceMetric(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Needleman-Wunsch edit distance between two sequences.

Computes the minimum-cost alignment between two sequences using a full O(n × m) DP matrix (Needleman-Wunsch). Substitution cost comes from the entity metric; insertions and deletions cost indel_cost each.

When normalize=True, the raw distance is divided by max(len_a, len_b) so the result lies in [0, 1].

Empty-sequence behaviour:

  • Both empty → 0.0 (no edits needed).

  • One empty → n × indel_cost (all insertions/deletions).
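The recurrence and the empty-sequence cases can be sketched as a full-matrix DP over plain strings. The function name and the toy 0/1 substitution cost are illustrative assumptions, not the library's kernel.

```python
def hamming(x, y):
    # Toy substitution cost: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def edit_distance(a, b, sub_cost, indel_cost=1.0, normalize=False):
    n, m = len(a), len(b)
    # dp[i][j] = cheapest alignment of a[:i] with b[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost  # delete everything from a
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost  # insert everything from b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitute
                dp[i - 1][j] + indel_cost,                        # delete
                dp[i][j - 1] + indel_cost,                        # insert
            )
    raw = dp[n][m]
    longest = max(n, m)
    return raw / longest if normalize and longest > 0 else raw


d_raw = edit_distance("kitten", "sitting", hamming)
d_norm = edit_distance("kitten", "sitting", hamming, normalize=True)
d_one_empty = edit_distance("ABC", "", hamming, indel_cost=0.5)
```

With unit costs this reduces to the classic Levenshtein distance, so "kitten" versus "sitting" gives 3 edits, or 3/7 after normalisation.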

Example:

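from tanat.metric.sequence import EditSequenceMetric

# seq_a, seq_b are Sequence objects and pool is a SequencePool, built elsewhere.
edit = EditSequenceMetric(indel_cost=0.5, normalize=True)
d = edit(seq_a, seq_b)           # scalar distance
dm = edit.compute_matrix(pool)   # DistanceMatrix of shape (n, n)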
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of EditSettings

__init__(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.EditSettings(*, entity_metric: EntityMetric = 'hamming', indel_cost: float = 1.0, normalize: bool = False)[source]#

Bases: object

Settings for EditSequenceMetric.

Parameters:
  • entity_metric – Entity-level substitution cost metric. Default: "hamming".

  • indel_cost – Cost per insertion or deletion. Must be > 0. Default: 1.0.

  • normalize – When True, divide the raw edit distance by max(len_a, len_b) to obtain a value in [0, 1]. Default: False.

__init__(*args: Any, **kwargs: Any) None[source]#
entity_metric: EntityMetric = 'hamming'[source]#
indel_cost: float = 1.0[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

normalize: bool = False[source]#
class tanat.metric.sequence.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Longest Common Prefix distance between two sequences.

Scans the sequences from the start and counts consecutive positions where the two entities are equal (i.e. their entity distance ≤ equality_threshold). The scan stops at the first mismatch.

Three output modes are available:

  • "length" → raw prefix length (not a proper distance).

  • "distance" → len_a + len_b − 2·LCP (always ≥ 0).

  • "normalized" → 1 − 2·LCP / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

  • Both empty → 0.0 (for all modes).

  • One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
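The three modes and the empty-sequence cases can be sketched in a few lines of plain Python. `lcp_metric` and the toy `hamming` cost are hypothetical names used only for illustration.

```python
def hamming(x, y):
    # Toy entity cost: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def lcp_metric(a, b, cost, equality_threshold=0.0, mode="distance"):
    # Count matching positions from the start; stop at the first mismatch.
    lcp = 0
    for x, y in zip(a, b):
        if cost(x, y) > equality_threshold:
            break
        lcp += 1
    total = len(a) + len(b)
    if mode == "length":
        return float(lcp)
    if mode == "distance":
        return float(total - 2 * lcp)
    if mode == "normalized":
        return 0.0 if total == 0 else 1.0 - 2.0 * lcp / total
    raise ValueError(f"unknown mode: {mode!r}")


# "ABCD" and "ABXD" share the prefix "AB"; the later match on "D" is ignored.
length = lcp_metric("ABCD", "ABXD", hamming, mode="length")
distance = lcp_metric("ABCD", "ABXD", hamming, mode="distance")
normalized = lcp_metric("ABCD", "ABXD", hamming, mode="normalized")
```

The example gives a prefix length of 2, a distance of 8 − 4 = 4 and a normalised value of 0.5.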

Example:

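from tanat.metric.sequence import LCPSequenceMetric

# seq_a, seq_b are Sequence objects and pool is a SequencePool, built elsewhere.
lcp = LCPSequenceMetric(mode="normalized")
d = lcp(seq_a, seq_b)           # scalar distance
dm = lcp.compute_matrix(pool)   # DistanceMatrix of shape (n, n)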
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of LCPSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')[source]#

Bases: object

Settings for LCPSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Default: "hamming".

  • equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Must be ≥ 0. Default: 0.0.

  • mode

    Output mode.

    • "length" → raw LCP length (not a distance, can be > 1).

    • "distance" → additive distance: len_a + len_b − 2·LCP (default).

    • "normalized" → Jaccard-like distance: 1 − 2·LCP / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) None[source]#
entity_metric: EntityMetric = 'hamming'[source]#
equality_threshold: float = 0.0[source]#
mode: Literal['length', 'distance', 'normalized'] = 'distance'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

class tanat.metric.sequence.LCSSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Longest Common Subsequence distance between two sequences.

Computes the LCS length using a space-optimised DP (2-row rolling array). Two entities are considered equal when their entity distance ≤ equality_threshold.

Three output modes are available:

  • "length" → raw LCS length (not a proper distance).

  • "distance" → len_a + len_b − 2·LCS (always ≥ 0).

  • "normalized" → 1 − 2·LCS / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

  • Both empty → 0.0 (for all modes).

  • One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
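A sketch of the two-row rolling DP on plain strings is shown below, with the distance and normalised values derived from the LCS length. The names are hypothetical and the 0/1 `hamming` cost is a toy stand-in for an entity metric.

```python
def hamming(x, y):
    # Toy entity cost: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def lcs_length(a, b, cost, equality_threshold=0.0):
    prev = [0] * (len(b) + 1)  # DP row for the previous element of a
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if cost(x, y) <= equality_threshold:
                curr.append(prev[j - 1] + 1)            # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one element
        prev = curr  # only two rows are alive at any time
    return prev[len(b)]


a, b = "ABCBDAB", "BDCABA"
lcs = lcs_length(a, b, hamming)
distance = len(a) + len(b) - 2 * lcs
normalized = 1.0 - 2.0 * lcs / (len(a) + len(b))
```

For this classic pair the LCS has length 4 (for example "BCBA"), so the distance is 13 − 8 = 5 and the normalised value is 5/13.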

Example:

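from tanat.metric.sequence import LCSSequenceMetric

# seq_a, seq_b are Sequence objects and pool is a SequencePool, built elsewhere.
lcs = LCSSequenceMetric(mode="normalized", equality_threshold=0.1)
d = lcs(seq_a, seq_b)           # scalar distance
dm = lcs.compute_matrix(pool)   # DistanceMatrix of shape (n, n)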
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of LCSSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.LCSSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')[source]#

Bases: object

Settings for LCSSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Default: "hamming".

  • equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Default: 0.0.

  • mode

    Output mode.

    • "length" → raw LCS length (not a proper distance).

    • "distance" → additive distance: len_a + len_b − 2·LCS.

    • "normalized" → Jaccard-like: 1 − 2·LCS / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) None[source]#
entity_metric: EntityMetric = 'hamming'[source]#
equality_threshold: float = 0.0[source]#
mode: Literal['length', 'distance', 'normalized'] = 'distance'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

class tanat.metric.sequence.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Sequence metric by linear (position-wise) alignment of entities.

Aligns seq_a and seq_b rank-by-rank and applies the configured entity metric to each aligned pair. The resulting vector of entity distances is aggregated (mean, sum, …) to produce a single scalar sequence distance.

When sequences differ in length, padding_penalty is applied for each unmatched position of the longer sequence. If padding_penalty is None, only the overlapping prefix is used.

Empty-sequence behaviour:

  • Both empty → nan (distance is undefined).

  • One empty, padding_penalty is set → all positions are padded.

  • One empty, padding_penalty is None → nan, with a warning suggesting that padding_penalty be set.
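The alignment, padding and aggregation rules above can be sketched on plain strings as follows. The function name and the toy `hamming` cost are assumptions for illustration, and the sketch returns nan silently where the library is described as warning.

```python
import math


def hamming(x, y):
    # Toy entity cost: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def linear_pairwise(a, b, cost, agg_fun="mean", padding_penalty=None):
    # Rank-by-rank distances over the overlapping prefix.
    dists = [cost(x, y) for x, y in zip(a, b)]
    if padding_penalty is not None:
        # One penalty per unmatched position of the longer sequence.
        dists += [padding_penalty] * abs(len(a) - len(b))
    if not dists:
        return math.nan  # nothing to aggregate: distance undefined
    if agg_fun == "mean":
        return sum(dists) / len(dists)
    if agg_fun == "sum":
        return sum(dists)
    raise ValueError(f"unknown agg_fun: {agg_fun!r}")


overlap_only = linear_pairwise("ABCD", "ABX", hamming)                    # mean of [0, 0, 1]
padded_mean = linear_pairwise("ABCD", "ABX", hamming, padding_penalty=1.0)
padded_sum = linear_pairwise("ABCD", "ABX", hamming, agg_fun="sum",
                             padding_penalty=1.0)
```

Setting padding_penalty changes both the numerator and the number of aggregated positions, so the mean moves from 1/3 to 0.5 in this example.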

Example:

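from tanat.metric.sequence import LinearPairwiseSequenceMetric

# HammingEntityMetric must be imported as well; seq_a, seq_b and pool are built elsewhere.
hamming = HammingEntityMetric(entity_feature="status")
lp = LinearPairwiseSequenceMetric(entity_metric=hamming)

dist = lp(seq_a, seq_b)        # scalar distance
dm = lp.compute_matrix(pool)   # DistanceMatrix of shape (n, n)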
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of LinearPairwiseSettings

__init__(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)[source]#

Bases: object

Settings for LinearPairwiseSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Accepts a registration name (string) or an instance. Pydantic auto-resolves strings via Registrable.__get_pydantic_core_schema__. Default: "hamming".

  • agg_fun – Aggregation function applied to the vector of entity distances. One of "mean" (default) or "sum".

  • padding_penalty – Distance value used for unmatched positions when sequences have different lengths. None → unmatched positions are ignored (only the overlap is aggregated). When the overlap is empty (one sequence has length 0), None makes the distance undefined (nan in a matrix, ValueError on a direct call).

__init__(*args: Any, **kwargs: Any) None[source]#
agg_fun: str = 'mean'[source]#
entity_metric: EntityMetric = 'hamming'[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

padding_penalty: float | None = None[source]#
class tanat.metric.sequence.SequenceMetric(settings=None, storage: StorageOptions | dict | None = None)[source]#

Bases: SettingsMixin, Registrable, DisplayMixin, ABC

Abstract base for sequence-level distance metrics. The attributes and methods listed here are identical to those documented under tanat.metric.sequence.base.SequenceMetric above.

class tanat.metric.sequence.SoftDTWSequenceMetric(entity_metric: EntityMetric | str = 'hamming', gamma: float = 1.0, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Soft Dynamic Time Warping distance between two sequences.

Replaces the min operator in the DTW recurrence with a differentiable soft-minimum parameterised by gamma:

\[\text{soft-min}(a, b, c; \gamma) = -\gamma \log\bigl( e^{-a/\gamma} + e^{-b/\gamma} + e^{-c/\gamma}\bigr)\]

As gamma → 0, SoftDTW converges to standard DTW. As gamma → ∞, it approaches the mean of all alignment costs.

Empty-sequence behaviour:

  • Both empty → nan (no alignment possible).

  • One empty → nan (no alignment possible).
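The soft-minimum and the resulting recurrence can be sketched in pure Python. The soft-min is shifted by the smallest argument before exponentiating, a standard log-sum-exp stabilisation chosen for this sketch; function names and the toy `hamming` cost are hypothetical and not the library's kernel.

```python
import math


def hamming(x, y):
    # Toy entity cost: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def soft_min(values, gamma):
    """-gamma * log(sum(exp(-v / gamma))), shifted by min(values) for stability."""
    lo = min(values)
    if math.isinf(lo):
        return math.inf
    return lo - gamma * math.log(sum(math.exp(-(v - lo) / gamma) for v in values))


def soft_dtw(a, b, cost, gamma=1.0):
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return math.nan  # no alignment possible
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            step = soft_min((prev[j], curr[j - 1], prev[j - 1]), gamma)
            curr[j] = cost(a[i - 1], b[j - 1]) + step
        prev = curr
    return prev[m]


near_hard_min = soft_min((1.0, 2.0, 3.0), gamma=1e-3)   # very close to 1.0
sharp = soft_dtw("ABC", "ABD", hamming, gamma=0.01)     # close to the hard DTW cost of 1
smooth = soft_dtw("ABC", "ABD", hamming, gamma=1.0)     # smaller than the hard cost
```

Because the soft-min never exceeds the true minimum, the value is bounded above by the standard DTW cost and can be negative for similar sequences when gamma is large.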

References

Cuturi & Blondel (2017) — Soft-DTW: a Differentiable Loss Function for Time-Series, ICML.

Example:

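from tanat.metric.sequence import SoftDTWSequenceMetric

# seq_a, seq_b are Sequence objects and pool is a SequencePool, built elsewhere.
sdtw = SoftDTWSequenceMetric(gamma=0.1)
d = sdtw(seq_a, seq_b)           # scalar distance
dm = sdtw.compute_matrix(pool)   # DistanceMatrix of shape (n, n)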
MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of SoftDTWSettings

__init__(entity_metric: EntityMetric | str = 'hamming', gamma: float = 1.0, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#
validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.SoftDTWSettings(*, entity_metric: EntityMetric = 'hamming', gamma: float = 1.0)[source]#

Bases: object

Settings for SoftDTWSequenceMetric.

Parameters:
  • entity_metric – Entity-level distance metric. Default: "hamming".

  • gamma – Regularisation parameter for the soft-min operator. Must be > 0. Large values produce a smoother (mean-like) approximation; small values approach standard DTW. Default: 1.0.

__init__(*args: Any, **kwargs: Any) None[source]#
entity_metric: EntityMetric = 'hamming'[source]#
gamma: float = 1.0[source]#
model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.