tanat.metric.sequence.type.lcs package

Submodules

tanat.metric.sequence.type.lcs.kernels module

Numba kernels for LCSSequenceMetric.

All functions are @njit (no Python objects). They operate on int32-encoded feature arrays produced by the entity metric’s prepare_batch_data.

Output-mode integer encoding (mirrors _MODE_MAP in metric.py):

  • 0 → length

  • 1 → distance

  • 2 → normalized

tanat.metric.sequence.type.lcs.kernels.compute_lcs_matrix(result, start, end, arrays_a, lengths_a, arrays_b, lengths_b, dist_kernel, context, threshold, mode, symmetric)

Parallel LCS matrix kernel.

Processes rows [start, end).
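
The row-range contract can be illustrated with a minimal pure-Python sketch. It is not the Numba kernel: the name fill_rows, the nested-list result and the pair_fn callable are hypothetical, and the meaning given to symmetric (both inputs are the same pool, so only the upper triangle is computed and then mirrored) is an assumption, not something this page confirms.

```python
from typing import Callable, List, Sequence


def fill_rows(
    result: List[List[float]],
    start: int,
    end: int,
    seqs_a: Sequence[Sequence],
    seqs_b: Sequence[Sequence],
    pair_fn: Callable[[Sequence, Sequence], float],
    symmetric: bool,
) -> None:
    """Fill rows [start, end) of result in place (hypothetical helper)."""
    for i in range(start, end):
        # Assumption: in the symmetric case seqs_a and seqs_b are the same
        # pool, so each row can start at the diagonal.
        j_start = i if symmetric else 0
        for j in range(j_start, len(seqs_b)):
            value = pair_fn(seqs_a[i], seqs_b[j])
            result[i][j] = value
            if symmetric:
                result[j][i] = value  # mirror into the lower triangle


# Usage: two row chunks over the same pool, with a toy pair function.
pool = ["ab", "abc", "a"]
matrix = [[0.0] * len(pool) for _ in pool]
toy_pair = lambda a, b: float(abs(len(a) - len(b)))
fill_rows(matrix, 0, 2, pool, pool, toy_pair, symmetric=True)
fill_rows(matrix, 2, 3, pool, pool, toy_pair, symmetric=True)
```

Splitting the work by row range means each chunk can be computed independently of the others, which is what makes the rows easy to distribute across workers.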

tanat.metric.sequence.type.lcs.kernels.compute_lcs_pair(arr_a, arr_b, len_a, len_b, dist_kernel, context, threshold, mode)

Compute LCS distance for a single pair of int32-encoded sequences.

Uses a space-optimised 2-row rolling DP, so memory grows with the length of one sequence rather than with the product of both lengths. Two entities are considered equal when their distance is ≤ threshold.

Parameters:
  • arr_a – int32-encoded sequence A.

  • arr_b – int32-encoded sequence B.

  • len_a – Length of A.

  • len_b – Length of B.

  • dist_kernel – Numba entity distance kernel.

  • context – Opaque context tuple forwarded to dist_kernel.

  • threshold – Equality threshold (float32).

  • mode – Integer output mode (0/1/2).

Returns:

float32 result.
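
The 2-row rolling DP with threshold-based equality can be sketched in plain Python as follows. This is an illustration of the technique only, not the library kernel: lcs_length and entity_dist are hypothetical names, the sequences are ordinary Python sequences rather than int32-encoded arrays, and the conversion to an output mode is left out.

```python
from typing import Callable, Sequence


def lcs_length(
    seq_a: Sequence,
    seq_b: Sequence,
    entity_dist: Callable[[object, object], float],
    threshold: float = 0.0,
) -> int:
    """Return the LCS length using two rolling rows (hypothetical helper)."""
    len_a, len_b = len(seq_a), len(seq_b)
    prev = [0] * (len_b + 1)  # DP row for the prefix of A of length i - 1
    curr = [0] * (len_b + 1)  # DP row for the prefix of A of length i
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            # Two entities match when their distance is within the threshold.
            if entity_dist(seq_a[i - 1], seq_b[j - 1]) <= threshold:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        # Swap the rows; index 0 is never written, so it stays 0.
        prev, curr = curr, prev
    return prev[len_b]


# Usage with a Hamming-style entity distance (0 if equal, 1 otherwise).
hamming = lambda x, y: 0.0 if x == y else 1.0
n_common = lcs_length("ABCBDAB", "BDCABA", hamming)  # 4

# Usage with a numeric distance and a non-zero threshold.
abs_diff = lambda x, y: abs(x - y)
n_close = lcs_length([1.0, 2.0, 3.0], [1.2, 2.9], abs_diff, threshold=0.5)  # 2
```

Only two rows of length len_b + 1 are kept at any time, which is where the space saving over a full len_a × len_b table comes from.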

tanat.metric.sequence.type.lcs.metric module

LCSSequenceMetric: Longest Common Subsequence distance between sequences.

class tanat.metric.sequence.type.lcs.metric.LCSSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')

Bases: SequenceMetric

Longest Common Subsequence distance between two sequences.

Computes the LCS length using a space-optimised DP (2-row rolling array). Two entities are considered equal when their entity distance is ≤ equality_threshold.

Three output modes are available:

  • "length" → raw LCS length (not a proper distance).

  • "distance" → len_a + len_b − 2·LCS (always ≥ 0).

  • "normalized" → 1 − 2·LCS / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

  • Both empty → 0.0 (for all modes).

  • One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
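
The three output modes and the empty-sequence rules above can be written as a small pure-Python function. This is a sketch derived from the formulas in this section, not the library's code; lcs_output is a hypothetical name and the LCS length is assumed to have been computed already.

```python
def lcs_output(lcs_len: int, len_a: int, len_b: int, mode: str = "distance") -> float:
    """Convert an LCS length into the requested output (hypothetical helper)."""
    if len_a == 0 and len_b == 0:
        return 0.0  # both sequences empty: 0.0 for all modes
    if mode == "length":
        return float(lcs_len)
    if mode == "distance":
        return float(len_a + len_b - 2 * lcs_len)
    if mode == "normalized":
        # The denominator is non-zero here: at least one sequence is non-empty.
        return 1.0 - 2.0 * lcs_len / (len_a + len_b)
    raise ValueError(f"unknown mode: {mode!r}")


# Usage: sequences of length 7 and 6 sharing an LCS of length 4.
d = lcs_output(4, 7, 6, "distance")           # 7 + 6 - 2*4 = 5.0
d_norm = lcs_output(4, 7, 6, "normalized")    # 1 - 8/13
d_one_empty = lcs_output(0, 3, 0, "distance") # n = 3
```

With one empty sequence the LCS length is 0, so the general formulas already give n for "distance" and 1.0 for "normalized"; only the both-empty case needs an explicit branch to avoid dividing by zero.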

Example:

lcs = LCSSequenceMetric(mode="normalized", equality_threshold=0.1)
d = lcs(seq_a, seq_b)          # value for a single pair of sequences
dm = lcs.compute_matrix(pool)  # pairwise matrix over a pool of sequences

MEMMAP_SUPPORT: bool = True

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS

alias of LCSSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.lcs.metric.LCSSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')

Bases: object

Settings for LCSSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Default: "hamming".

  • equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Default: 0.0.

  • mode – Output mode.

    • "length" → raw LCS length (not a proper distance).

    • "distance" → additive distance: len_a + len_b − 2·LCS.

    • "normalized" → Jaccard-like: 1 − 2·LCS / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) → None

entity_metric: EntityMetric = 'hamming'

equality_threshold: float = 0.0

mode: Literal['length', 'distance', 'normalized'] = 'distance'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

Module contents

LCSSequenceMetric package.

class tanat.metric.sequence.type.lcs.LCSSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')

Bases: SequenceMetric

Longest Common Subsequence distance between two sequences.

Computes the LCS length using a space-optimised DP (2-row rolling array). Two entities are considered equal when their entity distance is ≤ equality_threshold.

Three output modes are available:

  • "length" → raw LCS length (not a proper distance).

  • "distance" → len_a + len_b − 2·LCS (always ≥ 0).

  • "normalized" → 1 − 2·LCS / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

  • Both empty → 0.0 (for all modes).

  • One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.

Example:

lcs = LCSSequenceMetric(mode="normalized", equality_threshold=0.1)
d = lcs(seq_a, seq_b)          # value for a single pair of sequences
dm = lcs.compute_matrix(pool)  # pairwise matrix over a pool of sequences

MEMMAP_SUPPORT: bool = True

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS

alias of LCSSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.lcs.LCSSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')

Bases: object

Settings for LCSSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Default: "hamming".

  • equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Default: 0.0.

  • mode – Output mode.

    • "length" → raw LCS length (not a proper distance).

    • "distance" → additive distance: len_a + len_b − 2·LCS.

    • "normalized" → Jaccard-like: 1 − 2·LCS / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) → None

entity_metric: EntityMetric = 'hamming'

equality_threshold: float = 0.0

mode: Literal['length', 'distance', 'normalized'] = 'distance'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.