tanat.metric.sequence.type.lcp package

Submodules

tanat.metric.sequence.type.lcp.kernels module

Numba kernels for LCPSequenceMetric.

All functions are @njit (no Python objects). They operate on int32-encoded feature arrays produced by the entity metric’s prepare_batch_data.

Output-mode integer encoding (mirrors _MODE_MAP in metric.py):

0 → length
1 → distance
2 → normalized

tanat.metric.sequence.type.lcp.kernels.compute_lcp_matrix(result, start, end, arrays_a, lengths_a, arrays_b, lengths_b, dist_kernel, context, threshold, mode, symmetric)

Parallel LCP matrix kernel.

Processes rows [start, end).

Parameters:

result – Full (n × n) or (chunk × k) memmap/array.
start – First row index (inclusive).
end – Row index at which processing stops (exclusive).
arrays_a – Encoded sequences for the row pool.
lengths_a – Lengths for the row pool.
arrays_b – Encoded sequences for the column pool.
lengths_b – Lengths for the column pool.
dist_kernel – Entity distance kernel.
context – Opaque context tuple.
threshold – Equality threshold: two entities count as equal when their entity distance is ≤ threshold.
mode – Integer output mode (0/1/2).
symmetric – When True, exploit upper-triangle + mirror.
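
The row-range and symmetric behaviour described above can be illustrated with a serial, pure-Python sketch. It is not the library's kernel, which is Numba-compiled and parallel. The name fill_lcp_rows, the zero-padded list-of-lists layout, the exact-match equality test, and the fixed "distance" output are all assumptions made for this illustration.

```python
def lcp_distance(a, b):
    """LCP distance (len_a + len_b - 2*lcp) using exact equality."""
    lcp = 0
    for x, y in zip(a, b):
        if x != y:  # first mismatch ends the common prefix
            break
        lcp += 1
    return len(a) + len(b) - 2 * lcp


def fill_lcp_rows(result, start, end, arrays_a, lengths_a,
                  arrays_b, lengths_b, symmetric):
    """Fill rows [start, end) of result (hypothetical serial stand-in)."""
    n_cols = len(arrays_b)
    for i in range(start, end):
        a = arrays_a[i][:lengths_a[i]]  # drop padding (assumed layout)
        # In symmetric mode only the upper triangle is computed.
        first_col = i if symmetric else 0
        for j in range(first_col, n_cols):
            b = arrays_b[j][:lengths_b[j]]
            value = float(lcp_distance(a, b))
            result[i][j] = value
            if symmetric:
                result[j][i] = value  # mirror into the lower triangle


# Three sequences, zero-padded to a common width: [1, 2, 3], [1, 2], [4].
arrays = [[1, 2, 3], [1, 2, 0], [4, 0, 0]]
lengths = [3, 2, 1]

result = [[0.0] * 3 for _ in range(3)]
# Process the matrix in two row chunks, as a chunked driver might.
fill_lcp_rows(result, 0, 2, arrays, lengths, arrays, lengths, True)
fill_lcp_rows(result, 2, 3, arrays, lengths, arrays, lengths, True)
for row in result:
    print(row)
```

Because each chunk only touches its own rows plus their mirrored cells, the two calls together produce the same matrix as a single non-symmetric pass over rows [0, 3).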

tanat.metric.sequence.type.lcp.kernels.compute_lcp_pair(arr_a, arr_b, len_a, len_b, dist_kernel, context, threshold, mode)

Compute LCP distance for a single pair of int32-encoded sequences.

Scans aligned positions until the first mismatch (entity distance > threshold) and counts the common prefix length. Then applies the requested output mode:

0 (length) → raw prefix count.
1 (distance) → len_a + len_b − 2 × lcp.
2 (normalized) → 1 − 2 × lcp / (len_a + len_b).

Parameters:

arr_a – int32-encoded sequence A.
arr_b – int32-encoded sequence B.
len_a – Length of A.
len_b – Length of B.
dist_kernel – Numba entity distance kernel.
context – Opaque context tuple forwarded to dist_kernel.
threshold – Equality threshold (float32).
mode – Integer output mode (0/1/2).

Returns:

float32 result.
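
As a reading aid, the scan and the three output modes can be written as a pure-Python reference. This is a sketch, not the Numba kernel itself: lcp_pair_reference and hamming_kernel are hypothetical names, and the dist_kernel(x, y, context) call signature is an assumption, since the documentation above only says that context is forwarded to dist_kernel.

```python
MODE_LENGTH, MODE_DISTANCE, MODE_NORMALIZED = 0, 1, 2


def hamming_kernel(x, y, context):
    """Toy entity distance: 0.0 if codes match, else 1.0 (context unused)."""
    return 0.0 if x == y else 1.0


def lcp_pair_reference(arr_a, arr_b, len_a, len_b,
                       dist_kernel, context, threshold, mode):
    """Pure-Python stand-in for the documented pair computation."""
    lcp = 0
    for k in range(min(len_a, len_b)):
        # A mismatch is an entity distance strictly above the threshold.
        if dist_kernel(arr_a[k], arr_b[k], context) > threshold:
            break
        lcp += 1

    total = len_a + len_b
    if mode == MODE_LENGTH:
        return float(lcp)
    if mode == MODE_DISTANCE:
        return float(total - 2 * lcp)
    if mode == MODE_NORMALIZED:
        # Two empty sequences are documented as 0.0; guard the division.
        return 0.0 if total == 0 else 1.0 - 2.0 * lcp / total
    raise ValueError(f"unknown mode: {mode}")


a = [1, 2, 3, 4]
b = [1, 2, 5]
for mode in (MODE_LENGTH, MODE_DISTANCE, MODE_NORMALIZED):
    value = lcp_pair_reference(a, b, len(a), len(b),
                               hamming_kernel, (), 0.0, mode)
    print(mode, value)
```

For these inputs the common prefix is [1, 2], so the three modes give 2, 4 + 3 − 2·2 = 3, and 1 − 4/7 = 3/7.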

tanat.metric.sequence.type.lcp.metric module

LCPSequenceMetric: Longest Common Prefix distance between sequences.

class tanat.metric.sequence.type.lcp.metric.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')

Bases: SequenceMetric

Longest Common Prefix distance between two sequences.

Scans the sequences from the start and counts consecutive positions where the two entities are equal (i.e. their entity distance ≤ equality_threshold). The scan stops at the first mismatch.

Three output modes are available:

"length" → raw prefix length (not a proper distance).
"distance" → len_a + len_b − 2·LCP (always ≥ 0).
"normalized" → 1 − 2·LCP / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

Both empty → 0.0 (for all modes).
One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
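
The bounds and the empty-sequence conventions follow from the formulas alone: the prefix can never be longer than the shorter sequence, so 2·LCP ≤ len_a + len_b, which keeps "distance" non-negative and "normalized" inside [0, 1]. The helper below is a hypothetical illustration (lcp_outputs is not part of tanat) that evaluates the three outputs from the lengths and a given prefix length.

```python
def lcp_outputs(len_a, len_b, lcp):
    """Return the three documented outputs for given lengths and prefix length."""
    if not 0 <= lcp <= min(len_a, len_b):
        raise ValueError("lcp must lie in [0, min(len_a, len_b)]")
    total = len_a + len_b
    return {
        "length": float(lcp),
        "distance": float(total - 2 * lcp),
        # The ratio is undefined for two empty sequences; 0.0 is the documented value.
        "normalized": 0.0 if total == 0 else 1.0 - 2.0 * lcp / total,
    }


print(lcp_outputs(0, 0, 0))  # both empty
print(lcp_outputs(5, 0, 0))  # one empty, n = 5
print(lcp_outputs(4, 4, 4))  # identical sequences
print(lcp_outputs(6, 2, 2))  # shorter sequence is a full prefix of the longer
```

The last case shows that a sequence and one of its proper prefixes are still at a non-zero distance, because the unmatched tail of the longer sequence counts.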

Example:

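from tanat.metric.sequence.type.lcp import LCPSequenceMetric

lcp = LCPSequenceMetric(mode="normalized")
d   = lcp(seq_a, seq_b)          # result for one pair of sequences
dm  = lcp.compute_matrix(pool)   # pairwise matrix over a sequence pool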

MEMMAP_SUPPORT: bool = True

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS

alias of LCPSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.lcp.metric.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')

Bases: object

Settings for LCPSequenceMetric.

Parameters:

entity_metric – Entity-level metric. Default: "hamming".
equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Must be ≥ 0. Default: 0.0.
mode –
Output mode. Default: "distance".
- "length" → raw LCP length (not a distance, can be > 1).
- "distance" → additive distance: len_a + len_b − 2·LCP.
- "normalized" → Jaccard-like distance: 1 − 2·LCP / (len_a + len_b), in [0, 1].
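
To see what equality_threshold changes, the sketch below counts the common prefix under a numeric entity distance at three thresholds. Both abs_diff and prefix_length are hypothetical helpers written for this illustration; they are not tanat functions, and the real entity metrics operate on encoded features.

```python
def abs_diff(x, y):
    """Hypothetical numeric entity distance (not a tanat metric)."""
    return abs(x - y)


def prefix_length(seq_a, seq_b, entity_distance, equality_threshold):
    """Count leading positions whose entity distance is <= the threshold."""
    if equality_threshold < 0:
        raise ValueError("equality_threshold must be >= 0")
    count = 0
    for x, y in zip(seq_a, seq_b):
        if entity_distance(x, y) > equality_threshold:
            break  # first mismatch stops the scan
        count += 1
    return count


a = [10, 20, 30]
b = [10, 22, 35]
for threshold in (0.0, 2.0, 5.0):
    print(threshold, prefix_length(a, b, abs_diff, threshold))
```

With a threshold of 0.0 only the first position matches. Raising it to 2.0 also accepts the pair (20, 22), and 5.0 accepts all three positions, so a looser threshold can only lengthen the prefix.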

__init__(*args: Any, **kwargs: Any) → None

entity_metric: EntityMetric = 'hamming'

equality_threshold: float = 0.0

mode: Literal['length', 'distance', 'normalized'] = 'distance'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

Module contents

LCPSequenceMetric package.

class tanat.metric.sequence.type.lcp.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')

Bases: SequenceMetric

Longest Common Prefix distance between two sequences.

Scans the sequences from the start and counts consecutive positions where the two entities are equal (i.e. their entity distance ≤ equality_threshold). The scan stops at the first mismatch.

Three output modes are available:

"length" → raw prefix length (not a proper distance).
"distance" → len_a + len_b − 2·LCP (always ≥ 0).
"normalized" → 1 − 2·LCP / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

Both empty → 0.0 (for all modes).
One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.

Example:

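from tanat.metric.sequence.type.lcp import LCPSequenceMetric

lcp = LCPSequenceMetric(mode="normalized")
d   = lcp(seq_a, seq_b)          # result for one pair of sequences
dm  = lcp.compute_matrix(pool)   # pairwise matrix over a sequence pool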

MEMMAP_SUPPORT: bool = True

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS

alias of LCPSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.lcp.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')

Bases: object

Settings for LCPSequenceMetric.

Parameters:

entity_metric – Entity-level metric. Default: "hamming".
equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Must be ≥ 0. Default: 0.0.
mode –
Output mode. Default: "distance".
- "length" → raw LCP length (not a distance, can be > 1).
- "distance" → additive distance: len_a + len_b − 2·LCP.
- "normalized" → Jaccard-like distance: 1 − 2·LCP / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) → None

entity_metric: EntityMetric = 'hamming'

equality_threshold: float = 0.0

mode: Literal['length', 'distance', 'normalized'] = 'distance'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.