tanat.metric.sequence.type.linear_pairwise package

Submodules

tanat.metric.sequence.type.linear_pairwise.kernels module

Numba kernels for LinearPairwiseSequenceMetric.

All functions are compiled with @njit (nopython mode, no Python objects). They operate on int32-encoded feature arrays and float32 distance buffers.

tanat.metric.sequence.type.linear_pairwise.kernels.compute_pairwise_matrix(result, start, end, arrays_a, lengths_a, arrays_b, lengths_b, dist_kernel, context, aggregator, padding_penalty, symmetric)

Unified chunk kernel for memmap paths.

Processes rows [start, end) of result. Behaviour mirrors compute_pairwise_matrix(): when symmetric is True, only the upper triangle of the chunk is computed and values are mirrored; when False, every cell in the chunk rows is computed.

The diagonal is set by this kernel when symmetric is True.

Parameters:

result – The full (n × n) or (chunk × k) memmap/array.
start – First row index of the chunk (inclusive).
end – Last row index of the chunk (exclusive).
arrays_a – Encoded sequences for the row pool.
lengths_a – Sequence lengths for the row pool.
arrays_b – Encoded sequences for the column pool.
lengths_b – Sequence lengths for the column pool.
dist_kernel – Entity distance kernel.
context – Opaque context for the kernel.
aggregator – Aggregation kernel.
padding_penalty – Padding value (NaN = no padding).
symmetric – When True, exploit the upper-triangle + mirror optimisation (same-pool square matrix only). When False, compute every cell in the row range; this is required for rectangular cross-pool chunks.
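
The row-chunk behaviour described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Numba kernel: fill_chunk and mismatch_count are hypothetical names, nested lists stand in for the memmap, and the dist_kernel, context, aggregator and padding_penalty arguments are collapsed into a single pair_dist callable.

```python
import math


def fill_chunk(result, start, end, seqs_a, seqs_b, pair_dist, symmetric):
    """Fill rows [start, end) of `result` in place (pure-Python stand-in)."""
    n_cols = len(seqs_b)
    for i in range(start, end):
        if symmetric:
            # Same-pool square matrix: compute the diagonal and the upper
            # triangle only, then mirror each value below the diagonal.
            for j in range(i, n_cols):
                d = pair_dist(seqs_a[i], seqs_b[j])
                result[i][j] = d
                result[j][i] = d
        else:
            # Rectangular cross-pool chunk: every cell of the row is computed.
            for j in range(n_cols):
                result[i][j] = pair_dist(seqs_a[i], seqs_b[j])


def mismatch_count(a, b):
    """Toy pair distance: number of mismatches over the aligned positions."""
    return float(sum(x != y for x, y in zip(a, b)))


pool = [[0, 1, 2], [0, 1, 1], [2, 2, 2]]
square = [[math.nan] * 3 for _ in range(3)]
# Two chunks covering rows [0, 2) and [2, 3) of the same square matrix.
fill_chunk(square, 0, 2, pool, pool, mismatch_count, symmetric=True)
fill_chunk(square, 2, 3, pool, pool, mismatch_count, symmetric=True)

other = [[0, 1, 2], [2, 2, 2]]
rect = [[math.nan] * 2 for _ in range(3)]
fill_chunk(rect, 0, 3, pool, other, mismatch_count, symmetric=False)
```

In this sketch the mirrored writes land in rows outside [start, end), which is why the symmetric path needs the full square matrix rather than a chunk-sized buffer. The real kernel may set the diagonal differently, for instance without calling the distance function.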

tanat.metric.sequence.type.linear_pairwise.kernels.compute_single_pair(arr_a, arr_b, len_a, len_b, dist_kernel, context, aggregator, padding_penalty)

Compute distance for a single pair of int32-encoded sequences.

Empty-sequence rules (mirrors the Python path):

Both empty → nan (undefined distance).
One empty + padding_penalty is nan → nan (undefined).
One empty + padding_penalty is set → all positions are padded.

Normal path:

Iterate over min(len_a, len_b) aligned positions and call dist_kernel for each pair.
If padding_penalty is not nan and lengths differ, append padding_penalty for each extra position of the longer sequence.
Aggregate the distance buffer with aggregator.

Parameters:

arr_a – int32 array for the first sequence.
arr_b – int32 array for the second sequence.
len_a – Length of the first sequence.
len_b – Length of the second sequence.
dist_kernel – Numba-compiled entity-level distance function.
context – Opaque tuple forwarded to dist_kernel.
aggregator – Numba-compiled aggregation function (values, n) -> float32.
padding_penalty – float32 penalty for unmatched positions. Use np.nan to disable padding (undefined result when lengths differ).

Returns:

Aggregated float32 distance, or nan for undefined pairs.
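
The empty-sequence rules and the normal path can be traced with a small pure-Python stand-in. It is a sketch, not the compiled kernel: single_pair, hamming and mean are hypothetical helpers, the context argument is omitted, and Python floats replace float32. When padding is disabled and the lengths differ, the sketch follows the normal-path steps and aggregates only the overlap, which is an assumption about the intended behaviour.

```python
import math


def single_pair(arr_a, arr_b, len_a, len_b, dist_kernel, aggregator, padding_penalty):
    """Pure-Python stand-in for the rules listed above (not the Numba kernel)."""
    pad = not math.isnan(padding_penalty)
    if len_a == 0 and len_b == 0:
        return math.nan  # both empty: undefined
    overlap = min(len_a, len_b)
    if overlap == 0 and not pad:
        return math.nan  # one empty, padding disabled: undefined
    # Entity distances over the aligned positions.
    values = [dist_kernel(arr_a[i], arr_b[i]) for i in range(overlap)]
    if pad:
        # One penalty per extra position of the longer sequence.
        values.extend([padding_penalty] * (max(len_a, len_b) - overlap))
    return aggregator(values, len(values))


def hamming(x, y):
    """Toy entity distance on integer codes."""
    return 0.0 if x == y else 1.0


def mean(values, n):
    """Aggregator with the (values, n) signature described above."""
    return sum(values[:n]) / n


a, b = [3, 5, 7, 7], [3, 6]
padded = single_pair(a, b, 4, 2, hamming, mean, 1.0)         # (0 + 1 + 1 + 1) / 4
unpadded = single_pair(a, b, 4, 2, hamming, mean, math.nan)  # (0 + 1) / 2
one_empty = single_pair(a, [], 4, 0, hamming, mean, 1.0)     # four padded positions
```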

tanat.metric.sequence.type.linear_pairwise.metric module

LinearPairwiseSequenceMetric: align sequences position-by-position and aggregate entity distances.

class tanat.metric.sequence.type.linear_pairwise.metric.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')

Bases: SequenceMetric

Sequence metric by linear (position-wise) alignment of entities.

Aligns seq_a and seq_b rank-by-rank and applies the configured entity metric to each aligned pair. The resulting vector of entity distances is aggregated (mean, sum, …) to produce a single scalar sequence distance.

When sequences differ in length, padding_penalty is applied for each unmatched position of the longer sequence. If padding_penalty is None, only the overlapping prefix is used.
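
The two paragraphs above can be written compactly. The notation is introduced here for illustration only and does not come from the library: δ is the entity metric, p the padding penalty, and agg the aggregation function.

```latex
\[
m = \min(|a|, |b|), \qquad M = \max(|a|, |b|)
\]
\[
d(a, b) =
\begin{cases}
\operatorname{agg}\bigl(\delta(a_1, b_1), \dots, \delta(a_m, b_m),
  \underbrace{p, \dots, p}_{M - m}\bigr) & \text{if } p \text{ is set,} \\[4pt]
\operatorname{agg}\bigl(\delta(a_1, b_1), \dots, \delta(a_m, b_m)\bigr)
  & \text{if } p \text{ is None.}
\end{cases}
\]
\[
\text{With agg = mean:}\quad
d(a, b) = \frac{1}{M}\Bigl(\sum_{i=1}^{m} \delta(a_i, b_i) + (M - m)\,p\Bigr)
\quad\text{or}\quad
d(a, b) = \frac{1}{m}\sum_{i=1}^{m} \delta(a_i, b_i).
\]
```

For instance, with Hamming entity distances, |a| = 4, |b| = 2, one mismatch in the overlap and p = 1, the mean is (1 + 2) / 4 = 0.75 with padding and 1 / 2 = 0.5 without.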

Empty-sequence behaviour:

Both empty → nan (distance is undefined).
One empty, padding_penalty is set → all positions are padded.
One empty, padding_penalty is None → nan, with a warning that suggests setting padding_penalty.

Example:

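from tanat.metric.sequence.type.linear_pairwise import LinearPairwiseSequenceMetric

# HammingEntityMetric, seq_a, seq_b and pool are assumed to be imported or built beforehand.
hamming = HammingEntityMetric(entity_feature="status")
lp = LinearPairwiseSequenceMetric(entity_metric=hamming)

dist = lp(seq_a, seq_b)        # distance between two sequences
dm = lp.compute_matrix(pool)   # pairwise distance matrix over a pool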

MEMMAP_SUPPORT: bool = True

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS

alias of LinearPairwiseSettings

__init__(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.linear_pairwise.metric.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)

Bases: object

Settings for LinearPairwiseSequenceMetric.

Parameters:

entity_metric – Entity-level metric. Accepts a registration name (string) or an instance. Pydantic auto-resolves strings via Registrable.__get_pydantic_core_schema__. Default: "hamming".
agg_fun – Aggregation function applied to the vector of entity distances. One of "mean" (default) or "sum".
padding_penalty – Distance value used for unmatched positions when sequences have different lengths. None → unmatched positions are ignored (only the overlap is aggregated). When the overlap is empty (one sequence has length 0), None makes the distance undefined (nan in a matrix, ValueError on a direct call).

__init__(*args: Any, **kwargs: Any) → None

agg_fun: str = 'mean'

entity_metric: EntityMetric = 'hamming'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

padding_penalty: float | None = None

Module contents

LinearPairwiseSequenceMetric package.

class tanat.metric.sequence.type.linear_pairwise.LinearPairwiseSequenceMetric(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')

Bases: SequenceMetric

Sequence metric by linear (position-wise) alignment of entities.

When sequences differ in length, padding_penalty is applied for each unmatched position of the longer sequence. If padding_penalty is None, only the overlapping prefix is used.

Empty-sequence behaviour:

Both empty → nan (distance is undefined).
One empty, padding_penalty is set → all positions are padded.
One empty, padding_penalty is None → nan, with a warning that suggests setting padding_penalty.

Example:

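from tanat.metric.sequence.type.linear_pairwise import LinearPairwiseSequenceMetric

# HammingEntityMetric, seq_a, seq_b and pool are assumed to be imported or built beforehand.
hamming = HammingEntityMetric(entity_feature="status")
lp = LinearPairwiseSequenceMetric(entity_metric=hamming)

dist = lp(seq_a, seq_b)        # distance between two sequences
dm = lp.compute_matrix(pool)   # pairwise distance matrix over a pool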

MEMMAP_SUPPORT: bool = True

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS

alias of LinearPairwiseSettings

__init__(entity_metric: EntityMetric | str = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.linear_pairwise.LinearPairwiseSettings(*, entity_metric: EntityMetric = 'hamming', agg_fun: str = 'mean', padding_penalty: float | None = None)

Bases: object

Settings for LinearPairwiseSequenceMetric.

Parameters:

entity_metric – Entity-level metric. Accepts a registration name (string) or an instance. Pydantic auto-resolves strings via Registrable.__get_pydantic_core_schema__. Default: "hamming".
agg_fun – Aggregation function applied to the vector of entity distances. One of "mean" (default) or "sum".
padding_penalty – Distance value used for unmatched positions when sequences have different lengths. None → unmatched positions are ignored (only the overlap is aggregated). When the overlap is empty (one sequence has length 0), None makes the distance undefined (nan in a matrix, ValueError on a direct call).

__init__(*args: Any, **kwargs: Any) → None

agg_fun: str = 'mean'

entity_metric: EntityMetric = 'hamming'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

padding_penalty: float | None = None