tanat.metric.sequence package
Subpackages
- tanat.metric.sequence.type package
- Subpackages
- Module contents
Submodules
tanat.metric.sequence.base module
SequenceMetric ABC: base class for all sequence-level distance metrics.
- class tanat.metric.sequence.base.SequenceMetric(settings=None, storage: StorageOptions | dict | None = None)
Bases: SettingsMixin, Registrable, DisplayMixin, ABC
Abstract base for sequence-level distance metrics.
Computes a scalar distance between two Sequence objects and a full pairwise DistanceMatrix over a pool.
- MEMMAP_SUPPORT: bool = False
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
- __init__(settings=None, storage: StorageOptions | dict | None = None) → None
- compute_cross_matrix(pool_rows: SequencePool, pool_cols: SequencePool) → ndarray
Compute an asymmetric (n × k) distance matrix between two pools.
Row i ↔ sequence i in pool_rows; column j ↔ sequence j in pool_cols. The result is not symmetric.
Validates both pools (type-check + composition probe), then delegates to _compute_cross_matrix_impl(). Subclasses override _compute_cross_matrix_impl() to use Numba kernels when available.
- Parameters:
pool_rows – Pool whose sequences form the rows (n items).
pool_cols – Pool whose sequences form the columns (k items).
- Returns:
float32 numpy array of shape (n, k).
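The row/column layout described above can be sketched in plain Python. This is only an illustration of the indexing convention: `cross_matrix` and `length_gap` are hypothetical names, the inputs are ordinary strings rather than SequencePool objects, and the real method returns a float32 numpy array after validating both pools.

```python
def cross_matrix(rows, cols, distance):
    """Return an n x k nested list with entry [i][j] = distance(rows[i], cols[j])."""
    return [[float(distance(r, c)) for c in cols] for r in rows]


def length_gap(a, b):
    # Toy distance used only for this illustration: absolute length difference.
    return abs(len(a) - len(b))


rows = ["ab", "abcd"]           # n = 2 items -> 2 rows
cols = ["a", "abc", "abcde"]    # k = 3 items -> 3 columns
matrix = cross_matrix(rows, cols, length_gap)
```

Because the two pools differ, the result is rectangular and there is no diagonal of zeros to rely on, unlike the square matrix produced by compute_matrix().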
- compute_matrix(pool: SequencePool, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → DistanceMatrix
Compute the full pairwise distance matrix for pool.
- Parameters:
pool – A SequencePool.
store_path – Storage directory (None → in-memory).
chunk_size – Rows per flush chunk (default 500).
resume – Skip already-computed chunks (default True).
dtype – Numpy dtype for the matrix (default "float32").
- Returns:
A DistanceMatrix of shape (n, n).
- property entity_metric: EntityMetric
Resolve settings.entity_metric (string → instance or pass-through).
- Returns:
The resolved EntityMetric.
- Raises:
AttributeError – If the concrete settings class has no entity_metric field.
- abstractmethod validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None
Composition compatibility check between this metric and the given sequence(s).
Called from __call__() with both sequences, and from compute_matrix() with a single sample sequence (seq_b=None). Subclasses that compose with an EntityMetric probe a sample entity to surface compatibility errors early.
- Parameters:
seq_a – Primary sequence to probe.
seq_b – Optional second sequence.
- Raises:
TypeError – If the entity feature has an incompatible dtype.
KeyError – If a required feature is absent.
Module contents
Sequence metric sub-package.
- class tanat.metric.sequence.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Chi-squared distance between the state-time distributions of two sequences.
Rather than comparing sequences element-by-element, Chi² compares the proportion of time (or event count) spent in each categorical state.
- Event sequences: each event contributes a weight of 1.
- Interval / State sequences: each entity contributes end − start as its weight.
The per-category proportions are computed independently for each sequence, then compared via the Chi-squared distance formula, where p_{aj} is the proportion of sequence a's total weight that falls in category j:
\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]
Note
Chi² does not use an entity metric; entity_metric is absent from its settings. The validate_composition method checks only the target feature itself (that it is present and categorical).
Empty-sequence behaviour:
- Both empty → 0.0 (identical empty distributions).
- One empty → 1.0 (maximally different distributions).
Example:
    chi2 = Chi2SequenceMetric(entity_feature="status")
    d = chi2(seq_a, seq_b)
    dm = chi2.compute_matrix(pool)
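To make the formula concrete, here is a minimal pure-Python sketch that works on per-state weight dictionaries. It is not the library's implementation: `interval_weights` and `chi2_distance` are hypothetical helpers, entities are assumed to be plain `(state, start, end)` tuples, and skipping states absent from both sequences is an assumed convention.

```python
import math


def interval_weights(entities):
    """Sum (end - start) per state for (state, start, end) tuples."""
    weights = {}
    for state, start, end in entities:
        weights[state] = weights.get(state, 0.0) + (end - start)
    return weights


def chi2_distance(weights_a, weights_b):
    """Chi-squared distance between two raw per-state weight histograms."""
    total_a = sum(weights_a.values())
    total_b = sum(weights_b.values())
    distance_sq = 0.0
    for state in set(weights_a) | set(weights_b):
        # An empty sequence contributes a proportion of 0 to every state.
        p_a = weights_a.get(state, 0.0) / total_a if total_a else 0.0
        p_b = weights_b.get(state, 0.0) / total_b if total_b else 0.0
        if p_a + p_b > 0:  # skip states absent from both (assumed convention)
            distance_sq += (p_a - p_b) ** 2 / (p_a + p_b)
    return math.sqrt(distance_sq)


seq_a = interval_weights([("employed", 0, 6), ("unemployed", 6, 8)])
seq_b = interval_weights([("employed", 0, 3), ("unemployed", 3, 4)])
```

Here `seq_a` and `seq_b` spend the same share of time in each state (75% / 25%), so their distance is zero even though their durations differ. Comparing a non-empty histogram with an empty one reproduces the documented value of 1.0.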
- MEMMAP_SUPPORT: bool = True
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
- SETTINGS_CLASS
alias of Chi2Settings
- __init__(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None
- prepare_batch_data(pool: SequencePool) → tuple
Build histogram arrays for all sequences in pool.
- Returns:
(hists, n_cats) where hists is a float32 numpy array of shape (n, n_cats) containing raw (unnormalised) weights, with rows ordered to match pool.unique_ids.
- validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None
Resolve and validate the target feature.
If entity_feature was not specified, resolves to the first entity feature of seq_a and stores it in target_feature. Then checks that the feature is present and categorical in every provided sequence.
- Parameters:
seq_a – Primary sequence.
seq_b – Optional second sequence.
- Raises:
KeyError – If the feature is absent from a sequence.
TypeError – If the feature is not categorical.
- class tanat.metric.sequence.Chi2Settings(*, entity_feature: str | None = None)
Bases: object
Settings for Chi2SequenceMetric.
- Parameters:
entity_feature – Categorical feature name used as the histogram key (same semantics as HammingEntityMetric). None → resolved from the first entity feature of the sequence at validate_composition() time.
- class tanat.metric.sequence.DTWSequenceMetric(entity_metric: EntityMetric | str = 'hamming', window: int | None = None, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Dynamic Time Warping distance between two sequences.
Uses a space-optimised 2-row DP. The Sakoe-Chiba band is applied when window is set, limiting the warping path to stay within window cells of the diagonal.
Empty-sequence behaviour:
- Both empty → nan (no alignment possible).
- One empty → nan (no alignment possible).
When normalize=True, divides the raw DTW cost by len_a + len_b (an approximation that does not require path backtracking).
Example:
    dtw = DTWSequenceMetric(window=3, normalize=True)
    d = dtw(seq_a, seq_b)
    dm = dtw.compute_matrix(pool)
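The 2-row DP with an optional Sakoe-Chiba band can be sketched as follows. This is an illustrative stand-alone version, not tanat's kernel: `dtw_distance` and `hamming` are hypothetical names, sequences are plain Python sequences, and widening the band to at least `abs(n - m)` so that the final cell stays reachable is an assumption of this sketch.

```python
import math


def hamming(x, y):
    # Stand-in entity metric: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def dtw_distance(seq_a, seq_b, cost=hamming, window=None, normalize=False):
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        return math.nan  # no alignment possible
    # Assumption: widen the band so cell (n, m) is reachable for unequal lengths.
    band = max(n, m) if window is None else max(window, abs(n - m))
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty-vs-empty prefix
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(max(1, i - band), min(m, i + band) + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            curr[j] = step + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr  # only two rows are ever kept in memory
    total = prev[m]
    return total / (n + m) if normalize else total
```

For instance, `dtw_distance("aab", "ab")` is 0 because warping lets both leading `a` entities align with the single `a`, whereas `dtw_distance("abc", "abd")` is 1 and becomes 1/6 with `normalize=True`.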
- MEMMAP_SUPPORT: bool = True
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
- SETTINGS_CLASS
alias of DTWSettings
- __init__(entity_metric: EntityMetric | str = 'hamming', window: int | None = None, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None
- class tanat.metric.sequence.DTWSettings(*, entity_metric: EntityMetric = 'hamming', window: int | None = None, normalize: bool = False)
Bases: object
Settings for DTWSequenceMetric.
- Parameters:
entity_metric – Entity-level distance metric. Default: "hamming".
window – Sakoe-Chiba band width (number of cells off the diagonal). None means no constraint (full DTW). Must be > 0 when set.
normalize – When True, divide the DTW cost by len_a + len_b (approximation that avoids O(n×m) backtracking). Default: False.
- entity_metric: EntityMetric = 'hamming'
- class tanat.metric.sequence.EditSequenceMetric(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Needleman-Wunsch edit distance between two sequences.
Computes the minimum-cost alignment between two sequences using a full O(n × m) DP matrix (Needleman-Wunsch). Substitution cost comes from the entity metric; insertions and deletions cost indel_cost each.
When normalize=True, the raw distance is divided by max(len_a, len_b). The result lies in [0, 1] provided indel_cost and every substitution cost are at most 1.
Empty-sequence behaviour:
- Both empty → 0.0 (no edits needed).
- One empty → n × indel_cost, where n is the length of the non-empty sequence (all insertions/deletions).
Example:
    edit = EditSequenceMetric(indel_cost=0.5, normalize=True)
    d = edit(seq_a, seq_b)
    dm = edit.compute_matrix(pool)
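The alignment recurrence can be written as a short full-matrix DP. The sketch below is illustrative only: `edit_distance` and `hamming` are hypothetical names operating on plain Python sequences, with a 0/1 substitution cost standing in for the entity metric.

```python
def hamming(x, y):
    # Stand-in entity metric: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def edit_distance(seq_a, seq_b, sub_cost=hamming, indel_cost=1.0, normalize=False):
    n, m = len(seq_a), len(seq_b)
    if n == 0 and m == 0:
        return 0.0  # no edits needed
    # dp[i][j] = cheapest alignment of the first i items of a with the first j of b.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost  # delete everything from a
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost  # insert everything from b
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j - 1] + sub_cost(seq_a[i - 1], seq_b[j - 1]),
                dp[i - 1][j] + indel_cost,
                dp[i][j - 1] + indel_cost,
            )
    raw = dp[n][m]
    return raw / max(n, m) if normalize else raw
```

With unit costs, `edit_distance("kitten", "sitting")` gives the familiar value 3, and 3/7 once normalized. Comparing an empty sequence with a length-3 one at `indel_cost=0.5` gives 1.5, matching the n × indel_cost rule above.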
- MEMMAP_SUPPORT: bool = True
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
- SETTINGS_CLASS
alias of EditSettings
- __init__(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None
- class tanat.metric.sequence.EditSettings(*, entity_metric: EntityMetric = 'hamming', indel_cost: float = 1.0, normalize: bool = False)
Bases: object
Settings for EditSequenceMetric.
- Parameters:
entity_metric – Entity-level substitution cost metric. Default: "hamming".
indel_cost – Cost per insertion or deletion. Must be > 0. Default: 1.0.
normalize – When True, divide the raw edit distance by max(len_a, len_b); the value lies in [0, 1] when indel_cost and all substitution costs are at most 1. Default: False.
- entity_metric: EntityMetric = 'hamming'
- class tanat.metric.sequence.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Longest Common Prefix distance between two sequences.
Scans the sequences from the start and counts consecutive positions where the two entities are equal (i.e. their entity distance ≤ equality_threshold). The scan stops at the first mismatch.
Three output modes are available:
- "length" → raw prefix length (not a proper distance).
- "distance" → len_a + len_b − 2·LCP (always ≥ 0).
- "normalized" → 1 − 2·LCP / (len_a + len_b) ∈ [0, 1].
Empty-sequence behaviour:
- Both empty → 0.0 (for all modes).
- One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
Example:
    lcp = LCPSequenceMetric(mode="normalized")
    d = lcp(seq_a, seq_b)
    dm = lcp.compute_matrix(pool)
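The prefix scan and the three output modes can be sketched in a few lines. This is a simplified illustration, not the library code: `lcp_length`, `lcp_metric` and `hamming` are hypothetical names, and plain Python sequences stand in for Sequence objects.

```python
def hamming(x, y):
    # Stand-in entity metric: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def lcp_length(seq_a, seq_b, cost=hamming, equality_threshold=0.0):
    """Count leading positions whose entity distance is within the threshold."""
    length = 0
    for x, y in zip(seq_a, seq_b):
        if cost(x, y) > equality_threshold:
            break  # the scan stops at the first mismatch
        length += 1
    return length


def lcp_metric(seq_a, seq_b, mode="distance", cost=hamming, equality_threshold=0.0):
    lcp = lcp_length(seq_a, seq_b, cost, equality_threshold)
    total = len(seq_a) + len(seq_b)
    if mode == "length":
        return float(lcp)
    if mode == "distance":
        return float(total - 2 * lcp)
    if mode == "normalized":
        # Both empty: defined as 0.0 to avoid dividing by zero.
        return 0.0 if total == 0 else 1.0 - 2.0 * lcp / total
    raise ValueError(f"unknown mode: {mode!r}")
```

For `"abcd"` against `"abxd"` the common prefix has length 2, so the modes return 2.0, 4.0 and 0.5 respectively; the matching `d` at the last position is ignored because the scan has already stopped.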
- MEMMAP_SUPPORT: bool = True
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
- SETTINGS_CLASS
alias of LCPSettings
- __init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None
- class tanat.metric.sequence.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')
Bases: object
Settings for LCPSequenceMetric.
- Parameters:
entity_metric – Entity-level metric. Default: "hamming".
equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Must be ≥ 0. Default: 0.0.
mode – Output mode.
- "length" → raw LCP length (not a distance, can be > 1).
- "distance" → additive distance: len_a + len_b − 2·LCP (default, per the signature).
- "normalized" → Jaccard-like distance: 1 − 2·LCP / (len_a + len_b), in [0, 1].
- entity_metric: EntityMetric = 'hamming'
- class tanat.metric.sequence.LCSSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Longest Common Subsequence distance between two sequences.
Computes the LCS length using a space-optimised DP (2-row rolling array). Two entities are considered equal when their entity distance ≤ equality_threshold.
Three output modes are available:
- "length" → raw LCS length (not a proper distance).
- "distance" → len_a + len_b − 2·LCS (always ≥ 0).
- "normalized" → 1 − 2·LCS / (len_a + len_b) ∈ [0, 1].
Empty-sequence behaviour:
- Both empty → 0.0 (for all modes).
- One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
Example:
    lcs = LCSSequenceMetric(mode="normalized", equality_threshold=0.1)
    d = lcs(seq_a, seq_b)
    dm = lcs.compute_matrix(pool)
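A rolling two-row version of the LCS recurrence might look like the sketch below. It is an illustration under simplifying assumptions: `lcs_length`, `lcs_distance` and `hamming` are hypothetical names, and inputs are plain Python sequences rather than Sequence objects.

```python
def hamming(x, y):
    # Stand-in entity metric: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def lcs_length(seq_a, seq_b, cost=hamming, equality_threshold=0.0):
    """LCS length using two rows instead of the full (n+1) x (m+1) table."""
    prev = [0] * (len(seq_b) + 1)
    for x in seq_a:
        curr = [0]  # column 0: empty prefix of seq_b
        for j, y in enumerate(seq_b, start=1):
            if cost(x, y) <= equality_threshold:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one item
        prev = curr
    return prev[len(seq_b)]


def lcs_distance(seq_a, seq_b, normalized=False, **kwargs):
    lcs = lcs_length(seq_a, seq_b, **kwargs)
    total = len(seq_a) + len(seq_b)
    if not normalized:
        return float(total - 2 * lcs)
    return 0.0 if total == 0 else 1.0 - 2.0 * lcs / total
```

On the textbook pair `"ABCBDAB"` and `"BDCABA"` the LCS length is 4, so the additive distance is 7 + 6 − 8 = 5 and the normalized distance is 1 − 8/13.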
- MEMMAP_SUPPORT: bool = True
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
- SETTINGS_CLASS
alias of LCSSettings
- __init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None
- class tanat.metric.sequence.LCSSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')
Bases: object
Settings for LCSSequenceMetric.
- Parameters:
entity_metric – Entity-level metric. Default: "hamming".
equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Default: 0.0.
mode – Output mode.
- "length" → raw LCS length (not a proper distance).
- "distance" → additive distance: len_a + len_b − 2·LCS.
- "normalized" → Jaccard-like: 1 − 2·LCS / (len_a + len_b), in [0, 1].
- entity_metric: EntityMetric = 'hamming'
- class tanat.metric.sequence.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Sequence metric by linear (position-wise) alignment of entities.
Aligns seq_a and seq_b rank-by-rank and applies the configured entity metric to each aligned pair. The resulting vector of entity distances is aggregated (mean or sum, see agg_fun) to produce a single scalar sequence distance.
When sequences differ in length, padding_penalty is applied for each unmatched position of the longer sequence. If padding_penalty is None, only the overlapping prefix is used.
Empty-sequence behaviour:
- Both empty → nan (distance is undefined).
- One empty, padding_penalty is set → all positions are padded.
- One empty, padding_penalty is None → nan with a warning suggesting that padding_penalty be set.
Example:
    hamming = HammingEntityMetric(entity_feature="status")
    lp = LinearPairwiseSequenceMetric(entity_metric=hamming)
    dist = lp(seq_a, seq_b)
    dm = lp.compute_matrix(pool)
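The position-wise alignment and padding rule can be illustrated with a small stand-alone function. This is a sketch, not the library's code: `linear_pairwise` and `hamming` are hypothetical names working on plain Python sequences, and the undefined case simply returns `nan` without emitting the warning or exception the library may produce.

```python
import math


def hamming(x, y):
    # Stand-in entity metric: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def linear_pairwise(seq_a, seq_b, cost=hamming, agg_fun="mean", padding_penalty=None):
    if agg_fun not in ("mean", "sum"):
        raise ValueError(f"unsupported agg_fun: {agg_fun!r}")
    # Rank-by-rank entity distances over the overlapping prefix.
    distances = [cost(x, y) for x, y in zip(seq_a, seq_b)]
    if padding_penalty is not None:
        # One penalty per unmatched position of the longer sequence.
        distances.extend([padding_penalty] * abs(len(seq_a) - len(seq_b)))
    if not distances:
        return math.nan  # nothing to aggregate: distance is undefined
    total = sum(distances)
    return total / len(distances) if agg_fun == "mean" else total
```

For `"abc"` against `"ab"`, the overlap matches exactly, so the result is 0.0 when `padding_penalty` is `None` and 1/3 (mean) when `padding_penalty=1.0`, which shows how strongly the padding choice affects sequences of unequal length.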
- MEMMAP_SUPPORT: bool = True
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
- SETTINGS_CLASS
alias of LinearPairwiseSettings
- __init__(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None
- class tanat.metric.sequence.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)
Bases: object
Settings for LinearPairwiseSequenceMetric.
- Parameters:
entity_metric – Entity-level metric. Accepts a registration name (string) or an instance. Pydantic auto-resolves strings via Registrable.__get_pydantic_core_schema__. Default: "hamming".
agg_fun – Aggregation function applied to the vector of entity distances. One of "mean" (default) or "sum".
padding_penalty – Distance value used for unmatched positions when sequences have different lengths. None → unmatched positions are ignored (only the overlap is aggregated). When the overlap is empty (one sequence has length 0), None makes the distance undefined (nan in a matrix, ValueError on a direct call).
- entity_metric: EntityMetric = 'hamming'
- class tanat.metric.sequence.SequenceMetric(settings=None, storage: StorageOptions | dict | None = None)
Bases: SettingsMixin, Registrable, DisplayMixin, ABC
Abstract base for sequence-level distance metrics.
Computes a scalar distance between two Sequence objects and a full pairwise DistanceMatrix over a pool.
- MEMMAP_SUPPORT: bool = False
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
- __init__(settings=None, storage: StorageOptions | dict | None = None) → None
- compute_cross_matrix(pool_rows: SequencePool, pool_cols: SequencePool) → ndarray
Compute an asymmetric (n × k) distance matrix between two pools.
Row i ↔ sequence i in pool_rows; column j ↔ sequence j in pool_cols. The result is not symmetric.
Validates both pools (type-check + composition probe), then delegates to _compute_cross_matrix_impl(). Subclasses override _compute_cross_matrix_impl() to use Numba kernels when available.
- Parameters:
pool_rows – Pool whose sequences form the rows (n items).
pool_cols – Pool whose sequences form the columns (k items).
- Returns:
float32 numpy array of shape (n, k).
- compute_matrix(pool: SequencePool, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → DistanceMatrix
Compute the full pairwise distance matrix for pool.
- Parameters:
pool – A SequencePool.
store_path – Storage directory (None → in-memory).
chunk_size – Rows per flush chunk (default 500).
resume – Skip already-computed chunks (default True).
dtype – Numpy dtype for the matrix (default "float32").
- Returns:
A DistanceMatrix of shape (n, n).
- property entity_metric: EntityMetric
Resolve settings.entity_metric (string → instance or pass-through).
- Returns:
The resolved EntityMetric.
- Raises:
AttributeError – If the concrete settings class has no entity_metric field.
- abstractmethod validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None
Composition compatibility check between this metric and the given sequence(s).
Called from __call__() with both sequences, and from compute_matrix() with a single sample sequence (seq_b=None). Subclasses that compose with an EntityMetric probe a sample entity to surface compatibility errors early.
- Parameters:
seq_a – Primary sequence to probe.
seq_b – Optional second sequence.
- Raises:
TypeError – If the entity feature has an incompatible dtype.
KeyError – If a required feature is absent.
- class tanat.metric.sequence.SoftDTWSequenceMetric(entity_metric: EntityMetric | str = 'hamming', gamma: float = 1.0, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Soft Dynamic Time Warping distance between two sequences.
Replaces the min operator in the DTW recurrence with a differentiable soft-minimum parameterised by gamma:
\[\text{soft-min}(a, b, c; \gamma) = -\gamma \log\bigl( e^{-a/\gamma} + e^{-b/\gamma} + e^{-c/\gamma}\bigr)\]
As gamma → 0, SoftDTW converges to standard DTW. As gamma → ∞, it approaches the mean of all alignment costs.
Empty-sequence behaviour:
- Both empty → nan (no alignment possible).
- One empty → nan (no alignment possible).
References
Cuturi & Blondel (2017). Soft-DTW: a Differentiable Loss Function for Time-Series. ICML.
Example:
    sdtw = SoftDTWSequenceMetric(gamma=0.1)
    d = sdtw(seq_a, seq_b)
    dm = sdtw.compute_matrix(pool)
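The soft-minimum and the resulting recurrence can be sketched as below. This is an illustrative version, not tanat's kernel: `soft_min`, `soft_dtw` and `hamming` are hypothetical names, and the max-shift (log-sum-exp) stabilisation is a standard numerical trick assumed here rather than something the documentation states.

```python
import math


def hamming(x, y):
    # Stand-in entity metric: 0 when equal, 1 otherwise.
    return 0.0 if x == y else 1.0


def soft_min(a, b, c, gamma):
    """-gamma * log(exp(-a/gamma) + exp(-b/gamma) + exp(-c/gamma)), stabilised."""
    scaled = [-a / gamma, -b / gamma, -c / gamma]
    top = max(scaled)
    if top == -math.inf:
        return math.inf  # all three inputs are +inf
    # Shift by the largest exponent so exp() cannot overflow.
    return -gamma * (top + math.log(sum(math.exp(v - top) for v in scaled)))


def soft_dtw(seq_a, seq_b, cost=hamming, gamma=1.0):
    n, m = len(seq_a), len(seq_b)
    if n == 0 or m == 0:
        return math.nan  # no alignment possible
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            step = cost(seq_a[i - 1], seq_b[j - 1])
            # Same recurrence as DTW, with min replaced by soft_min.
            curr[j] = step + soft_min(prev[j], curr[j - 1], prev[j - 1], gamma)
        prev = curr
    return prev[m]
```

With a small `gamma` such as 0.01, `soft_dtw("abc", "abd")` is numerically indistinguishable from the hard DTW cost of 1. Since the soft-minimum never exceeds the true minimum, larger `gamma` values pull the result below the DTW cost, and it can even become negative.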
- MEMMAP_SUPPORT: bool = True
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
- SETTINGS_CLASS
alias of SoftDTWSettings
- __init__(entity_metric: EntityMetric | str = 'hamming', gamma: float = 1.0, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None
- class tanat.metric.sequence.SoftDTWSettings(*, entity_metric: EntityMetric = 'hamming', gamma: float = 1.0)
Bases: object
Settings for SoftDTWSequenceMetric.
- Parameters:
entity_metric – Entity-level distance metric. Default: "hamming".
gamma – Regularisation parameter for the soft-min operator. Must be > 0. Large values produce a smoother (mean-like) approximation; small values approach standard DTW. Default: 1.0.
- entity_metric: EntityMetric = 'hamming'