tanat.metric.sequence.type.edit package

Submodules

tanat.metric.sequence.type.edit.kernels module

Numba kernels for EditSequenceMetric.

All functions are compiled with Numba's @njit, so they handle no Python objects. They operate on the int32-encoded feature arrays produced by the entity metric’s prepare_batch_data.

Implementation: 2-row rolling Needleman-Wunsch DP. Only two rows of the full (n+1) × (m+1) matrix are kept in memory at any time.

tanat.metric.sequence.type.edit.kernels.compute_edit_matrix(result, start, end, arrays_a, lengths_a, arrays_b, lengths_b, dist_kernel, context, indel_cost, normalize, symmetric)

Parallel edit-distance matrix kernel.

Processes the rows in the half-open range [start, end).
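To make the row-range contract concrete, here is a minimal pure-Python sketch of a worker that fills rows [start, end) of a distance matrix in place. It is not the tanat kernel: fill_rows and pair_distance are hypothetical names, the padded 2-D array plus lengths layout is an assumption, and so is the reading of symmetric as "compute only j > i and mirror the value".

```python
import numpy as np


def pair_distance(a, b):
    """Placeholder pair kernel: unit-cost edit distance on two 1-D arrays."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return float(prev[-1])


def fill_rows(result, start, end, arrays_a, lengths_a, arrays_b, lengths_b, symmetric):
    """Fill rows [start, end) of `result` in place (hypothetical worker)."""
    for i in range(start, end):
        a = arrays_a[i, :lengths_a[i]]  # strip the padding of row i
        # Assumption: symmetric means A and B are the same pool, so only the
        # upper triangle is computed and each value is mirrored.
        first = i + 1 if symmetric else 0
        for j in range(first, arrays_b.shape[0]):
            d = pair_distance(a, arrays_b[j, :lengths_b[j]])
            result[i, j] = d
            if symmetric:
                result[j, i] = d
```

Because each worker writes only the cells belonging to its own rows (plus their mirrors), disjoint row ranges never write to the same cell, which is what makes splitting the matrix by rows safe to parallelise in this sketch.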

tanat.metric.sequence.type.edit.kernels.compute_edit_pair(arr_a, arr_b, len_a, len_b, dist_kernel, context, indel_cost, normalize)

Compute Needleman-Wunsch edit distance for a single pair.

Uses a 2-row rolling DP (O(m) space, O(n×m) time).

Parameters:
  • arr_a – int32-encoded sequence A.

  • arr_b – int32-encoded sequence B.

  • len_a – Length of A.

  • len_b – Length of B.

  • dist_kernel – Numba entity distance kernel (substitution cost).

  • context – Opaque context tuple forwarded to dist_kernel.

  • indel_cost – Cost per insertion / deletion (float32).

  • normalize – When True, divide result by max(len_a, len_b).

Returns:

float32 edit distance.
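The two-row rolling DP described above can be sketched in plain Python with NumPy. This is an illustration under stated assumptions, not the Numba kernel itself: edit_pair and hamming_cost are hypothetical names, and the substitution cost is passed as an ordinary Python callable instead of a dist_kernel with a context tuple.

```python
import numpy as np


def hamming_cost(a: int, b: int) -> float:
    """Stand-in substitution cost: 0 for equal codes, 1 otherwise."""
    return 0.0 if a == b else 1.0


def edit_pair(arr_a, arr_b, len_a, len_b, sub_cost=hamming_cost,
              indel_cost=1.0, normalize=False) -> float:
    """Two-row rolling edit distance (illustrative, not the tanat kernel)."""
    if len_a == 0 and len_b == 0:
        return 0.0
    # Row 0 of the DP matrix: aligning an empty prefix of A with b[:j].
    prev = np.arange(len_b + 1, dtype=np.float32) * np.float32(indel_cost)
    curr = np.empty(len_b + 1, dtype=np.float32)
    for i in range(1, len_a + 1):
        curr[0] = i * indel_cost  # a[:i] against an empty prefix of B
        for j in range(1, len_b + 1):
            curr[j] = min(
                prev[j] + indel_cost,        # delete a[i-1]
                curr[j - 1] + indel_cost,    # insert b[j-1]
                prev[j - 1] + sub_cost(arr_a[i - 1], arr_b[j - 1]),
            )
        prev, curr = curr, prev  # reuse the two buffers instead of allocating
    dist = float(prev[len_b])
    if normalize:
        dist /= max(len_a, len_b)
    return dist
```

With unit costs, encoding "kitten" and "sitting" as integer code arrays and calling edit_pair on them gives 3.0, the classic Levenshtein distance for that pair.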

tanat.metric.sequence.type.edit.metric module

EditSequenceMetric: Needleman-Wunsch edit distance between sequences.

class tanat.metric.sequence.type.edit.metric.EditSequenceMetric(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')

Bases: SequenceMetric

Needleman-Wunsch edit distance between two sequences.

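Computes the minimum-cost alignment between two sequences with Needleman-Wunsch dynamic programming in O(n × m) time. The Numba kernel keeps only two rows of the DP matrix in memory rather than the full matrix. Substitution cost comes from the entity metric; each insertion or deletion costs indel_cost.

When normalize=True, the raw distance is divided by max(len_a, len_b). The result lies in [0, 1] provided that neither indel_cost nor any substitution cost exceeds 1.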

Empty-sequence behaviour:

  • Both empty → 0.0 (no edits needed).

  • One empty → n × indel_cost, where n is the length of the non-empty sequence (all insertions/deletions).
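As a worked example of the empty-sequence rules combined with normalization, the short sketch below evaluates the documented formula directly. one_empty_distance is a hypothetical helper written for illustration, not part of tanat. Since max(len_a, len_b) equals n when one sequence is empty, the normalized value reduces to indel_cost itself.

```python
def one_empty_distance(n: int, indel_cost: float, normalize: bool) -> float:
    """Distance between a length-n sequence and an empty one, per the rules above."""
    raw = n * indel_cost  # n insertions or deletions
    # With one side empty, max(len_a, len_b) == n; n == 0 is the both-empty case.
    return raw / n if normalize and n > 0 else raw


print(one_empty_distance(4, 0.5, normalize=False))  # 2.0
print(one_empty_distance(4, 0.5, normalize=True))   # 0.5
print(one_empty_distance(4, 2.0, normalize=True))   # 2.0
```

The last call shows that an indel_cost above 1 pushes the normalized distance above 1.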

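Example:

from tanat.metric.sequence.type.edit import EditSequenceMetric

# seq_a, seq_b and pool are assumed to have been built elsewhere.
edit = EditSequenceMetric(indel_cost=0.5, normalize=True)
d = edit(seq_a, seq_b)
dm = edit.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True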

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
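For readers unfamiliar with disk-backed computation, the sketch below shows one way parameters such as store_path, chunk_size, resume and dtype could fit together using numpy.memmap. It is a guess at the general pattern, not tanat's implementation: compute_to_memmap, row_fn and the ".progress.npy" sidecar file are all hypothetical.

```python
from pathlib import Path

import numpy as np


def compute_to_memmap(n, row_fn, store_path, chunk_size=500,
                      dtype="float32", resume=True):
    """Fill an n x n matrix on disk chunk by chunk (illustrative sketch)."""
    store_path = Path(store_path)
    # Hypothetical sidecar file recording which row chunks are finished.
    progress_path = Path(str(store_path) + ".progress.npy")
    n_chunks = -(-n // chunk_size)  # ceiling division
    can_resume = resume and store_path.exists() and progress_path.exists()
    out = np.memmap(store_path, dtype=dtype,
                    mode="r+" if can_resume else "w+", shape=(n, n))
    done = np.load(progress_path) if can_resume else np.zeros(n_chunks, dtype=bool)
    for c in range(n_chunks):
        if done[c]:
            continue  # chunk already written by a previous run
        start, end = c * chunk_size, min((c + 1) * chunk_size, n)
        for i in range(start, end):
            out[i, :] = row_fn(i)  # row_fn returns the n distances of row i
        out.flush()                # push the chunk to disk before marking it done
        done[c] = True
        np.save(progress_path, done)
    return out
```

Flushing before updating the progress file means an interrupted run can, at worst, recompute one chunk. A production version would also need to check that the stored shape, dtype and chunk size match the resumed request.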

SETTINGS_CLASS

alias of EditSettings

__init__(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.edit.metric.EditSettings(*, entity_metric: EntityMetric = 'hamming', indel_cost: float = 1.0, normalize: bool = False)

Bases: object

Settings for EditSequenceMetric.

Parameters:
  • entity_metric – Entity-level substitution cost metric. Default: "hamming".

  • indel_cost – Cost per insertion or deletion. Must be > 0. Default: 1.0.

  • normalize – When True, divide the raw edit distance by max(len_a, len_b) to obtain a value in [0, 1]. Default: False.

__init__(*args: Any, **kwargs: Any) → None

entity_metric: EntityMetric = 'hamming'

indel_cost: float = 1.0

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

normalize: bool = False
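The documented defaults and the indel_cost > 0 constraint can be illustrated with a standard-library stand-in. EditSettingsSketch is hypothetical and deliberately avoids Pydantic so that it runs anywhere; its model_dump only mimics the idea of dumping the settings to a dict and does not reproduce the real method's mode or dump_kwargs parameters.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EditSettingsSketch:
    """Stand-in for EditSettings; not the tanat class."""

    entity_metric: str = "hamming"
    indel_cost: float = 1.0
    normalize: bool = False

    def __post_init__(self):
        # Mirror the documented constraint: indel_cost must be > 0.
        if not self.indel_cost > 0:
            raise ValueError("indel_cost must be > 0")

    def model_dump(self) -> dict:
        """Return the settings as a plain dict (simplified stand-in)."""
        return asdict(self)


print(EditSettingsSketch(indel_cost=0.5).model_dump())
# {'entity_metric': 'hamming', 'indel_cost': 0.5, 'normalize': False}
```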

Module contents

EditSequenceMetric package.

class tanat.metric.sequence.type.edit.EditSequenceMetric(entity_metric: EntityMetric | str = 'hamming', indel_cost: float = 1.0, normalize: bool = False, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')

Bases: SequenceMetric

Documented above as tanat.metric.sequence.type.edit.metric.EditSequenceMetric; the description and members listed under this path are identical.

class tanat.metric.sequence.type.edit.EditSettings(*, entity_metric: EntityMetric = 'hamming', indel_cost: float = 1.0, normalize: bool = False)

Bases: object

Documented above as tanat.metric.sequence.type.edit.metric.EditSettings; the description and members listed under this path are identical.