tanat.metric.sequence.type.linear_pairwise package
Submodules
tanat.metric.sequence.type.linear_pairwise.kernels module
Numba kernels for LinearPairwiseSequenceMetric.
All functions are @njit (no Python objects). They operate on int32-encoded feature arrays and float32 distance buffers.
- tanat.metric.sequence.type.linear_pairwise.kernels.compute_pairwise_matrix(result, start, end, arrays_a, lengths_a, arrays_b, lengths_b, dist_kernel, context, aggregator, padding_penalty, symmetric)
Unified chunk kernel for memmap paths.
Processes rows [start, end) of result. Behaviour mirrors compute_pairwise_matrix(): when symmetric is True, only the upper triangle of the chunk is computed and values are mirrored; when False, every cell in the chunk rows is computed.
The diagonal is set by this kernel when symmetric is True.
- Parameters:
result – The full (n × n) or (chunk × k) memmap/array.
start – First row index of the chunk (inclusive).
end – Last row index of the chunk (exclusive).
arrays_a – Encoded sequences for the row pool.
lengths_a – Sequence lengths for the row pool.
arrays_b – Encoded sequences for the column pool.
lengths_b – Sequence lengths for the column pool.
dist_kernel – Entity distance kernel.
context – Opaque context for the kernel.
aggregator – Aggregation kernel.
padding_penalty – Padding value (NaN = no padding).
symmetric – When True, exploit the upper-triangle + mirror optimisation (same-pool square matrix only). When False, compute every cell in the row range. Required for rectangular cross-pool chunks.
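The two modes described above can be illustrated with a plain NumPy sketch. This is not the Numba kernel itself: fill_chunk and pair_distance are hypothetical names introduced here, the encoded arrays are replaced by a toy distance function, and writing 0.0 on the diagonal is an assumption (the documentation only states that the kernel sets it).

```python
import numpy as np


def fill_chunk(result, start, end, n_cols, pair_distance, symmetric):
    """Fill rows [start, end) of `result` (hypothetical helper, not tanat's kernel)."""
    for i in range(start, end):
        if symmetric:
            # Assumption: self-distance is 0.0; the real kernel may use another value.
            result[i, i] = 0.0
            # Compute the upper triangle only, then mirror into the lower triangle.
            for j in range(i + 1, n_cols):
                d = pair_distance(i, j)
                result[i, j] = d
                result[j, i] = d
        else:
            # Rectangular (cross-pool) case: every cell of the row is computed.
            for j in range(n_cols):
                result[i, j] = pair_distance(i, j)


# Toy data: the "distance" between items i and j is |x[i] - x[j]|.
x = np.array([0.0, 1.0, 3.0, 6.0], dtype=np.float32)


def dist(i, j):
    return abs(x[i] - x[j])


# Symmetric mode, processed as two row chunks.
sym = np.full((4, 4), np.nan, dtype=np.float32)
fill_chunk(sym, 0, 2, 4, dist, symmetric=True)
fill_chunk(sym, 2, 4, 4, dist, symmetric=True)

# Non-symmetric mode over all rows, for comparison.
full = np.full((4, 4), np.nan, dtype=np.float32)
fill_chunk(full, 0, 4, 4, dist, symmetric=False)
```

Once both chunks have run, sym holds the same values as full while calling the distance function for only the pairs with i < j.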
- tanat.metric.sequence.type.linear_pairwise.kernels.compute_single_pair(arr_a, arr_b, len_a, len_b, dist_kernel, context, aggregator, padding_penalty)
Compute distance for a single pair of int32-encoded sequences.
Empty-sequence rules (mirrors the Python path):
Both empty → nan (undefined distance).
One empty + padding_penalty is nan → nan (undefined).
One empty + padding_penalty is set → all positions are padded.
Normal path:
Iterate over min(len_a, len_b) aligned positions and call dist_kernel for each pair.
If padding_penalty is not nan and lengths differ, append padding_penalty for each extra position of the longer sequence.
Aggregate the distance buffer with aggregator.
- Parameters:
arr_a – int32 array for the first sequence.
arr_b – int32 array for the second sequence.
len_a – Length of the first sequence.
len_b – Length of the second sequence.
dist_kernel – Numba-compiled entity-level distance function.
context – Opaque tuple forwarded to dist_kernel.
aggregator – Numba-compiled aggregation function (values, n) -> float32.
padding_penalty – float32 penalty for unmatched positions. Use np.nan to disable padding (only the overlap is aggregated; the result is undefined when one sequence is empty).
- Returns:
Aggregated float32 distance, or nan for undefined pairs.
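The rules above can be sketched in plain NumPy, without Numba, to make the control flow concrete. Everything here is illustrative: single_pair, hamming_kernel and mean_aggregator are hypothetical stand-ins, and the (a, b, context) call signature for the distance kernel is an assumption, since the documentation does not spell it out.

```python
import numpy as np


def hamming_kernel(a, b, context):
    # Hypothetical entity kernel: 0 when the codes match, 1 otherwise.
    return np.float32(0.0 if a == b else 1.0)


def mean_aggregator(values, n):
    # Matches the documented (values, n) -> float32 shape.
    return np.float32(values[:n].sum() / n)


def single_pair(arr_a, arr_b, len_a, len_b, dist_kernel, context,
                aggregator, padding_penalty):
    """Sketch of the documented rules; not the library's kernel."""
    pad = not np.isnan(padding_penalty)
    # Empty-sequence rules.
    if len_a == 0 and len_b == 0:
        return np.float32(np.nan)
    if (len_a == 0 or len_b == 0) and not pad:
        return np.float32(np.nan)
    overlap = min(len_a, len_b)
    longest = max(len_a, len_b)
    n = longest if pad else overlap
    buf = np.empty(n, dtype=np.float32)
    # Normal path: entity distance at each aligned position.
    for k in range(overlap):
        buf[k] = dist_kernel(arr_a[k], arr_b[k], context)
    # Penalise each unmatched position of the longer sequence.
    if pad:
        buf[overlap:longest] = padding_penalty
    return aggregator(buf, n)


a = np.array([1, 2, 3], dtype=np.int32)
b = np.array([1, 9, 0], dtype=np.int32)  # only the first 2 entries are used
nan32 = np.float32(np.nan)
one32 = np.float32(1.0)

no_pad = single_pair(a, b, 3, 2, hamming_kernel, (), mean_aggregator, nan32)
padded = single_pair(a, b, 3, 2, hamming_kernel, (), mean_aggregator, one32)
both_empty = single_pair(a, b, 0, 0, hamming_kernel, (), mean_aggregator, one32)
one_empty_pad = single_pair(a, b, 3, 0, hamming_kernel, (), mean_aggregator, one32)
one_empty_nan = single_pair(a, b, 3, 0, hamming_kernel, (), mean_aggregator, nan32)
```

Passing the lengths separately from the arrays reflects the documented signature, where sequences are presumably stored in fixed-width buffers and only the first len_a or len_b entries are meaningful.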
tanat.metric.sequence.type.linear_pairwise.metric module
LinearPairwiseSequenceMetric: align sequences position-by-position and aggregate entity distances.
- class tanat.metric.sequence.type.linear_pairwise.metric.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Sequence metric by linear (position-wise) alignment of entities.
Aligns seq_a and seq_b rank-by-rank and applies the configured entity metric to each aligned pair. The resulting vector of entity distances is aggregated (mean, sum, …) to produce a single scalar sequence distance.
When sequences differ in length, padding_penalty is applied for each unmatched position of the longer sequence. If padding_penalty is None, only the overlapping prefix is used.
Empty-sequence behaviour:
Both empty → nan (distance is undefined).
One empty, padding_penalty is set → all positions are padded.
One empty, padding_penalty is None → nan with a warning suggesting to set padding_penalty.
Example:
    hamming = HammingEntityMetric(entity_feature="status")
    lp = LinearPairwiseSequenceMetric(entity_metric=hamming)
    dist = lp(seq_a, seq_b)
    dm = lp.compute_matrix(pool)
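The alignment and padding rule described in this class can be written as a single formula. The notation is introduced here for illustration and does not come from the library: δ is the entity metric, p the padding penalty, and n_a, n_b the sequence lengths.

```latex
d(a, b) = \operatorname{agg}\bigl(
    \delta(a_1, b_1), \dots, \delta(a_m, b_m),
    \underbrace{p, \dots, p}_{|n_a - n_b| \text{ times}}
\bigr), \qquad m = \min(n_a, n_b)
```

When padding_penalty is None the p terms are dropped and only the m overlapping positions are aggregated. As a worked case, assuming a Hamming entity distance of 0 for equal entities and 1 otherwise, take a = (A, B, C) and b = (A, D). The aligned distances are (0, 1), giving a mean of 0.5 and a sum of 1. With padding_penalty = 1 the vector becomes (0, 1, 1), giving a mean of 2/3 and a sum of 2.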
- MEMMAP_SUPPORT: bool = True
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
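The class-level flag pattern described here can be sketched as follows. The class names are hypothetical and the check is a guess at the general shape, not tanat's actual base class; in particular, the default of False on the base class is an assumption inferred from "set to True in subclasses".

```python
class BaseMetric:
    # Assumption: the base class defaults to no memmap support.
    MEMMAP_SUPPORT = False

    def __init__(self, store_path=None):
        # Fail early, at construction time, rather than midway through a run.
        if store_path is not None and not self.MEMMAP_SUPPORT:
            raise NotImplementedError(
                f"{type(self).__name__} does not support disk-backed "
                "(memmap) computation; omit store_path."
            )
        self.store_path = store_path


class InMemoryOnlyMetric(BaseMetric):
    pass  # inherits MEMMAP_SUPPORT = False


class DiskBackedMetric(BaseMetric):
    MEMMAP_SUPPORT = True  # opts in to store_path


ok = DiskBackedMetric(store_path="dm.npy")

message = None
try:
    InMemoryOnlyMetric(store_path="dm.npy")
except NotImplementedError as exc:
    message = str(exc)
```

Checking the flag in the constructor means an unsupported configuration is rejected before any distances are computed.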
- SETTINGS_CLASS
alias of LinearPairwiseSettings
- __init__(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None
- class tanat.metric.sequence.type.linear_pairwise.metric.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)
Bases: object
Settings for LinearPairwiseSequenceMetric.
- Parameters:
entity_metric – Entity-level metric. Accepts a registration name (string) or an instance. Pydantic auto-resolves strings via Registrable.__get_pydantic_core_schema__. Default: "hamming".
agg_fun – Aggregation function applied to the vector of entity distances. One of "mean" (default) or "sum".
padding_penalty – Distance value used for unmatched positions when sequences have different lengths. None → unmatched positions are ignored (only the overlap is aggregated). When the overlap is empty (one sequence has length 0), None makes the distance undefined (nan in a matrix, ValueError on a direct call).
- entity_metric: EntityMetric = 'hamming'
Module contents
LinearPairwiseSequenceMetric package.
- class tanat.metric.sequence.type.linear_pairwise.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
- class tanat.metric.sequence.type.linear_pairwise.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)