tanat.metric.sequence.type.linear_pairwise package#

Submodules#

tanat.metric.sequence.type.linear_pairwise.kernels module#

Numba kernels for LinearPairwiseSequenceMetric.

All functions are @njit (no Python objects). They operate on int32-encoded feature arrays and float32 distance buffers.
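As a rough illustration of the input format, the sketch below maps labels to int32 codes and stores them with a separate lengths vector. The helper name encode_sequences, the padded 2-D layout, and the -1 filler are assumptions made for this example; the section above only states that the kernels receive int32-encoded arrays plus sequence lengths.

```python
import numpy as np


def encode_sequences(sequences):
    """Hypothetical helper: map labels to int32 codes and pad to a 2-D array."""
    # Stable label -> code vocabulary (sorted so codes are reproducible).
    labels = sorted({label for seq in sequences for label in seq})
    vocab = {label: code for code, label in enumerate(labels)}

    lengths = np.array([len(seq) for seq in sequences], dtype=np.int32)
    max_len = int(lengths.max()) if len(sequences) else 0

    # -1 fills unused slots (assumption); only the first lengths[i] entries
    # of row i are meaningful.
    arrays = np.full((len(sequences), max_len), -1, dtype=np.int32)
    for i, seq in enumerate(sequences):
        arrays[i, : len(seq)] = np.array([vocab[s] for s in seq], dtype=np.int32)
    return arrays, lengths, vocab


arrays, lengths, vocab = encode_sequences([["A", "B", "C"], ["A", "C"], []])
```

Keeping lengths separate from the codes lets a compiled loop stop at the true end of each sequence instead of testing for a sentinel value.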

tanat.metric.sequence.type.linear_pairwise.kernels.compute_pairwise_matrix(result, start, end, arrays_a, lengths_a, arrays_b, lengths_b, dist_kernel, context, aggregator, padding_penalty, symmetric)[source]#

Unified chunk kernel for memmap paths.

Processes rows [start, end) of result. Behaviour mirrors compute_pairwise_matrix(): when symmetric is True, only the upper triangle of the chunk is computed and values are mirrored; when False, every cell in the chunk rows is computed.

The diagonal is set by this kernel when symmetric is True.

Parameters:
  • result – The full (n × n) or (chunk × k) memmap/array.

  • start – First row index of the chunk (inclusive).

  • end – Last row index of the chunk (exclusive).

  • arrays_a – Encoded sequences for the row pool.

  • lengths_a – Sequence lengths for the row pool.

  • arrays_b – Encoded sequences for the column pool.

  • lengths_b – Sequence lengths for the column pool.

  • dist_kernel – Entity distance kernel.

  • context – Opaque context for the kernel.

  • aggregator – Aggregation kernel.

  • padding_penalty – Padding value (NaN = no padding).

  • symmetric – When True, exploit the upper-triangle + mirror optimisation (same-pool square matrix only). When False, compute every cell in the row range; this is required for rectangular cross-pool chunks.
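The two modes can be sketched in plain Python as follows. This is not the Numba kernel itself: fill_chunk and pair_fn are hypothetical names, and writing the diagonal by calling pair_fn(i, i) is an assumption, since the text above only says that the kernel sets the diagonal.

```python
import numpy as np


def fill_chunk(result, start, end, n_cols, pair_fn, symmetric):
    """Hypothetical pure-Python stand-in for the chunk kernel."""
    for i in range(start, end):
        if symmetric:
            # Diagonal cell (assumed here to be the self-distance).
            result[i, i] = pair_fn(i, i)
            # Upper triangle of row i, mirrored into column i.
            for j in range(i + 1, n_cols):
                d = pair_fn(i, j)
                result[i, j] = d
                result[j, i] = d
        else:
            # Rectangular case: every cell of the row is computed.
            for j in range(n_cols):
                result[i, j] = pair_fn(i, j)


values = np.array([0.0, 1.0, 3.0, 6.0], dtype=np.float32)


def pair_fn(i, j):
    return abs(values[i] - values[j])


sym = np.zeros((4, 4), dtype=np.float32)
full = np.zeros((4, 4), dtype=np.float32)
for start, end in [(0, 2), (2, 4)]:  # two row chunks
    fill_chunk(sym, start, end, 4, pair_fn, symmetric=True)
    fill_chunk(full, start, end, 4, pair_fn, symmetric=False)
```

Both matrices end up identical, but the symmetric pass calls pair_fn roughly half as often because each off-diagonal value is computed once and written twice.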

tanat.metric.sequence.type.linear_pairwise.kernels.compute_single_pair(arr_a, arr_b, len_a, len_b, dist_kernel, context, aggregator, padding_penalty)[source]#

Compute distance for a single pair of int32-encoded sequences.

Empty-sequence rules (mirrors the Python path):

  1. Both empty → nan (undefined distance).

  2. One empty + padding_penalty is nan → nan (undefined).

  3. One empty + padding_penalty is set → all positions are padded.

Normal path:

  1. Iterate over min(len_a, len_b) aligned positions and call dist_kernel for each pair.

  2. If padding_penalty is not nan and lengths differ, append padding_penalty for each extra position of the longer sequence.

  3. Aggregate the distance buffer with aggregator.

Parameters:
  • arr_a – int32 array for the first sequence.

  • arr_b – int32 array for the second sequence.

  • len_a – Length of the first sequence.

  • len_b – Length of the second sequence.

  • dist_kernel – Numba-compiled entity-level distance function.

  • context – Opaque tuple forwarded to dist_kernel.

  • aggregator – Numba-compiled aggregation function (values, n) -> float32.

  • padding_penalty – float32 penalty for unmatched positions. Use np.nan to disable padding (undefined result when lengths differ).

Returns:

Aggregated float32 distance, or nan for undefined pairs.
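The rules above can be condensed into a pure-Python sketch. It is not the compiled kernel: the (a, b, context) signature for dist_kernel and the two example kernels are assumptions. When padding is disabled and both sequences are non-empty, this sketch aggregates the overlap only, following the numbered normal-path steps.

```python
import math

import numpy as np


def hamming_kernel(a, b, context):
    """Example entity kernel (assumed signature): 0 if equal, else 1."""
    return np.float32(0.0 if a == b else 1.0)


def mean_aggregator(values, n):
    """Example aggregator matching the documented (values, n) -> float32."""
    return np.float32(values[:n].sum() / n)


def single_pair(arr_a, arr_b, len_a, len_b, dist_kernel, context,
                aggregator, padding_penalty):
    has_padding = not math.isnan(padding_penalty)

    # Empty-sequence rules 1 and 2: undefined distance.
    if len_a == 0 and len_b == 0:
        return np.float32(np.nan)
    if (len_a == 0 or len_b == 0) and not has_padding:
        return np.float32(np.nan)

    overlap = min(len_a, len_b)
    longest = max(len_a, len_b)
    buf = np.empty(longest, dtype=np.float32)

    # Normal path step 1: aligned positions.
    for k in range(overlap):
        buf[k] = dist_kernel(arr_a[k], arr_b[k], context)

    # Step 2: pad the tail of the longer sequence when a penalty is set.
    n = overlap
    if has_padding:
        buf[overlap:longest] = padding_penalty
        n = longest

    # Step 3: aggregate the first n buffer entries.
    return aggregator(buf, n)


a = np.array([0, 1, 2], dtype=np.int32)
b = np.array([0, 2], dtype=np.int32)
args = (hamming_kernel, (), mean_aggregator)

d_padded = single_pair(a, b, 3, 2, *args, np.float32(1.0))
d_overlap = single_pair(a, b, 3, 2, *args, np.float32(np.nan))
d_one_empty = single_pair(a, b, 3, 0, *args, np.float32(1.0))
d_both_empty = single_pair(a, b, 0, 0, *args, np.float32(1.0))
```

With a penalty of 1.0 the buffer is [0, 1, 1] and the mean is 2/3; without padding only [0, 1] is aggregated, giving 0.5.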

tanat.metric.sequence.type.linear_pairwise.metric module#

LinearPairwiseSequenceMetric: align sequences position-by-position and aggregate entity distances.

class tanat.metric.sequence.type.linear_pairwise.metric.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Sequence metric by linear (position-wise) alignment of entities.

Aligns seq_a and seq_b rank-by-rank and applies the configured entity metric to each aligned pair. The resulting vector of entity distances is aggregated (mean, sum, …) to produce a single scalar sequence distance.

When sequences differ in length, padding_penalty is applied for each unmatched position of the longer sequence. If padding_penalty is None, only the overlapping prefix is used.

Empty-sequence behaviour:

  • Both empty → nan (distance is undefined).

  • One empty, padding_penalty is set → all positions are padded.

  • One empty, padding_penalty is None → nan with a warning suggesting to set padding_penalty.

Example:

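from tanat.metric.sequence.type.linear_pairwise import LinearPairwiseSequenceMetric

# HammingEntityMetric must be imported as well; its module path is not shown in this section.
# seq_a, seq_b and pool are assumed to exist already.
hamming = HammingEntityMetric(entity_feature="status")
lp = LinearPairwiseSequenceMetric(entity_metric=hamming)

dist = lp(seq_a, seq_b)       # distance between two sequences
dm   = lp.compute_matrix(pool)  # pairwise distance matrix for a pool

MEMMAP_SUPPORT: bool = True[source]#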

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
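For readers unfamiliar with disk-backed matrices, the sketch below fills a NumPy memmap in row chunks. It only illustrates the general technique; compute_to_memmap and row_fn are hypothetical names, and nothing here reflects how store_path, chunk_size, or resume are handled internally.

```python
import os
import tempfile

import numpy as np


def compute_to_memmap(path, n, row_fn, chunk_size=500, dtype="float32"):
    """Hypothetical helper: fill an (n x n) on-disk matrix chunk by chunk."""
    result = np.memmap(path, dtype=dtype, mode="w+", shape=(n, n))
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        for i in range(start, end):
            result[i, :] = row_fn(i)
        result.flush()  # push each finished chunk to disk
    return result


values = np.array([0.0, 1.0, 3.0, 6.0, 10.0], dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "distances.dat")

dm = compute_to_memmap(
    path, len(values), lambda i: np.abs(values - values[i]), chunk_size=2
)

# The file can be reopened later without recomputing anything.
reloaded = np.memmap(path, dtype="float32", mode="r", shape=(5, 5))
```

Only the rows being written need to be resident in memory, which is the usual motivation for a chunked, disk-backed layout when n is large.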

SETTINGS_CLASS[source]#

alias of LinearPairwiseSettings

__init__(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.linear_pairwise.metric.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)[source]#

Bases: object

Settings for LinearPairwiseSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Accepts a registration name (string) or an instance. Pydantic auto-resolves strings via Registrable.__get_pydantic_core_schema__. Default: "hamming".

  • agg_fun – Aggregation function applied to the vector of entity distances. One of "mean" (default) or "sum".

  • padding_penalty – Distance value used for unmatched positions when sequences have different lengths. None → unmatched positions are ignored (only the overlap is aggregated). When the overlap is empty (one sequence has length 0), None makes the distance undefined (nan in a matrix, ValueError on a direct call).
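A small worked example may help show how agg_fun and padding_penalty interact. The numbers and the aggregate helper are illustrative assumptions, not library code: two sequences of lengths 5 and 3 whose overlapping entity distances are 0, 1, 1.

```python
# Entity distances over the 3 overlapping positions; 2 positions are unmatched.
overlap = [0.0, 1.0, 1.0]
unmatched = 2


def aggregate(distances, agg_fun):
    """Hypothetical aggregation over a list of entity distances."""
    if not distances:
        return float("nan")  # empty overlap and no padding: undefined
    total = sum(distances)
    return total / len(distances) if agg_fun == "mean" else total


results = {}
for penalty in (None, 0.5):
    # None ignores unmatched positions; a number appends one penalty each.
    padded = overlap if penalty is None else overlap + [penalty] * unmatched
    for agg_fun in ("mean", "sum"):
        results[(penalty, agg_fun)] = aggregate(padded, agg_fun)
```

With padding_penalty=None the mean is 2/3 and the sum is 2.0; with a penalty of 0.5 the vector becomes [0, 1, 1, 0.5, 0.5], giving a mean of 0.6 and a sum of 3.0.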

__init__(*args: Any, **kwargs: Any) None[source]#

agg_fun: str = 'mean'[source]#

entity_metric: EntityMetric = 'hamming'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

padding_penalty: float | None = None[source]#

Module contents#

LinearPairwiseSequenceMetric package.

class tanat.metric.sequence.type.linear_pairwise.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Sequence metric by linear (position-wise) alignment of entities.

Aligns seq_a and seq_b rank-by-rank and applies the configured entity metric to each aligned pair. The resulting vector of entity distances is aggregated (mean, sum, …) to produce a single scalar sequence distance.

When sequences differ in length, padding_penalty is applied for each unmatched position of the longer sequence. If padding_penalty is None, only the overlapping prefix is used.

Empty-sequence behaviour:

  • Both empty → nan (distance is undefined).

  • One empty, padding_penalty is set → all positions are padded.

  • One empty, padding_penalty is None → nan with a warning suggesting to set padding_penalty.

Example:

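from tanat.metric.sequence.type.linear_pairwise import LinearPairwiseSequenceMetric

# HammingEntityMetric must be imported as well; its module path is not shown in this section.
# seq_a, seq_b and pool are assumed to exist already.
hamming = HammingEntityMetric(entity_feature="status")
lp = LinearPairwiseSequenceMetric(entity_metric=hamming)

dist = lp(seq_a, seq_b)       # distance between two sequences
dm   = lp.compute_matrix(pool)  # pairwise distance matrix for a pool

MEMMAP_SUPPORT: bool = True[source]#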

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of LinearPairwiseSettings

__init__(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') None[source]#

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) None[source]#

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.linear_pairwise.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)[source]#

Bases: object

Settings for LinearPairwiseSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Accepts a registration name (string) or an instance. Pydantic auto-resolves strings via Registrable.__get_pydantic_core_schema__. Default: "hamming".

  • agg_fun – Aggregation function applied to the vector of entity distances. One of "mean" (default) or "sum".

  • padding_penalty – Distance value used for unmatched positions when sequences have different lengths. None → unmatched positions are ignored (only the overlap is aggregated). When the overlap is empty (one sequence has length 0), None makes the distance undefined (nan in a matrix, ValueError on a direct call).

__init__(*args: Any, **kwargs: Any) None[source]#

agg_fun: str = 'mean'[source]#

entity_metric: EntityMetric = 'hamming'[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

padding_penalty: float | None = None[source]#