tanat.metric.sequence.type.lcp package

Submodules

tanat.metric.sequence.type.lcp.kernels module

Numba kernels for LCPSequenceMetric.

All functions are @njit (no Python objects). They operate on int32-encoded feature arrays produced by the entity metric’s prepare_batch_data.

Output-mode integer encoding (mirrors _MODE_MAP in metric.py):

  • 0 → length

  • 1 → distance

  • 2 → normalized
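
The integer encoding above can be pictured as a small lookup table. The following is a minimal sketch, assuming the mapping is a plain dict keyed by the public mode strings; the real _MODE_MAP lives in metric.py and is not shown in this reference, and resolve_mode is a hypothetical helper introduced only for illustration.

```python
# Hypothetical mirror of _MODE_MAP; the real definition is in metric.py.
MODE_MAP = {"length": 0, "distance": 1, "normalized": 2}


def resolve_mode(mode: str) -> int:
    """Translate a public mode string into the integer passed to the kernels."""
    try:
        return MODE_MAP[mode]
    except KeyError:
        # Fail early with the list of accepted values.
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(MODE_MAP)}"
        ) from None
```

Passing an integer rather than a string keeps the kernel arguments free of Python objects, which is what the @njit constraint stated above requires.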

tanat.metric.sequence.type.lcp.kernels.compute_lcp_matrix(result, start, end, arrays_a, lengths_a, arrays_b, lengths_b, dist_kernel, context, threshold, mode, symmetric)

Parallel LCP matrix kernel.

Processes rows [start, end).

Parameters:
  • result – Full (n × n) or (chunk × k) memmap/array.

  • start – First row index (inclusive).

  • end – Last row index (exclusive).

  • arrays_a – Encoded sequences for the row pool.

  • lengths_a – Lengths for the row pool.

  • arrays_b – Encoded sequences for the column pool.

  • lengths_b – Lengths for the column pool.

  • dist_kernel – Entity distance kernel.

  • context – Opaque context tuple.

  • threshold – Equality threshold.

  • mode – Integer output mode (0/1/2).

  • symmetric – When True, exploit upper-triangle + mirror.
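
The row-range and symmetric parameters can be illustrated with a simplified, serial sketch in pure Python. It is not the Numba kernel: it assumes ragged Python lists instead of padded int32 arrays with separate lengths, exact equality instead of dist_kernel and threshold, and a fixed additive-distance output. The names lcp_length and fill_rows are hypothetical.

```python
def lcp_length(a, b):
    """Count leading positions where a and b hold identical codes."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def fill_rows(result, start, end, seqs_a, seqs_b, symmetric):
    """Fill rows [start, end) of result with the additive LCP distance."""
    for i in range(start, end):
        # In symmetric mode only the upper triangle (j >= i) is computed.
        j0 = i if symmetric else 0
        for j in range(j0, len(seqs_b)):
            lcp = lcp_length(seqs_a[i], seqs_b[j])
            d = float(len(seqs_a[i]) + len(seqs_b[j]) - 2 * lcp)
            result[i][j] = d
            if symmetric:
                result[j][i] = d  # mirror into the lower triangle


pool = [[1, 2, 3], [1, 2, 4, 5], [7]]
dm = [[0.0] * len(pool) for _ in pool]
fill_rows(dm, 0, 2, pool, pool, symmetric=True)  # first chunk: rows 0 and 1
fill_rows(dm, 2, 3, pool, pool, symmetric=True)  # second chunk: row 2
```

Because each call only touches rows in [start, end) plus their mirrored entries, the matrix can be filled chunk by chunk, which is consistent with the chunk_size and resume options on the metric class.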

tanat.metric.sequence.type.lcp.kernels.compute_lcp_pair(arr_a, arr_b, len_a, len_b, dist_kernel, context, threshold, mode)

Compute LCP distance for a single pair of int32-encoded sequences.

Scans aligned positions until the first mismatch (entity distance > threshold) and counts the common prefix length. Then applies the requested output mode:

  • 0 (length) → raw prefix count.

  • 1 (distance) → len_a + len_b − 2 × lcp.

  • 2 (normalized) → 1 − 2 × lcp / (len_a + len_b).

Parameters:
  • arr_a – int32-encoded sequence A.

  • arr_b – int32-encoded sequence B.

  • len_a – Length of A.

  • len_b – Length of B.

  • dist_kernel – Numba entity distance kernel.

  • context – Opaque context tuple forwarded to dist_kernel.

  • threshold – Equality threshold (float32).

  • mode – Integer output mode (0/1/2).

Returns:

float32 result.
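
The scan and the three output modes described above can be sketched in plain Python. This is an illustrative reference, not the Numba kernel: it assumes the entity distance is a two-argument callable (the real dist_kernel also receives context, and its exact signature is not documented here), it returns a Python float rather than float32, and the both-empty shortcut follows the behaviour documented for LCPSequenceMetric. The name lcp_pair is hypothetical.

```python
def lcp_pair(arr_a, arr_b, len_a, len_b, dist, threshold, mode):
    """Pure-Python sketch of the LCP pair computation."""
    # Two empty sequences give 0.0 in every mode (avoids division by zero).
    if len_a == 0 and len_b == 0:
        return 0.0

    # Scan aligned positions until the first mismatch.
    lcp = 0
    for k in range(min(len_a, len_b)):
        if dist(arr_a[k], arr_b[k]) > threshold:
            break
        lcp += 1

    if mode == 0:  # length
        return float(lcp)
    if mode == 1:  # distance
        return float(len_a + len_b - 2 * lcp)
    return 1.0 - 2.0 * lcp / (len_a + len_b)  # normalized


def hamming(x, y):
    """Assumed entity distance: 0.0 for identical codes, 1.0 otherwise."""
    return 0.0 if x == y else 1.0


a, b = [1, 2, 3, 9], [1, 2, 4]
length = lcp_pair(a, b, len(a), len(b), hamming, 0.0, 0)
distance = lcp_pair(a, b, len(a), len(b), hamming, 0.0, 1)
normalized = lcp_pair(a, b, len(a), len(b), hamming, 0.0, 2)
```

Here the common prefix is [1, 2], so length is 2.0, distance is 4 + 3 − 2 × 2 = 3.0, and normalized is 1 − 4/7.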

tanat.metric.sequence.type.lcp.metric module

LCPSequenceMetric: Longest Common Prefix distance between sequences.

class tanat.metric.sequence.type.lcp.metric.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')

Bases: SequenceMetric

Longest Common Prefix distance between two sequences.

Scans the sequences from the start and counts consecutive positions where the two entities are equal (i.e. their entity distance ≤ equality_threshold). The scan stops at the first mismatch.

Three output modes are available:

  • "length" → raw prefix length (not a proper distance).

  • "distance" → len_a + len_b − 2·LCP (always ≥ 0).

  • "normalized" → 1 − 2·LCP / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

  • Both empty → 0.0 (for all modes).

  • One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
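
The stated bounds follow from LCP ≤ min(len_a, len_b) ≤ (len_a + len_b) / 2, so 2·LCP never exceeds len_a + len_b. A small brute-force check makes this concrete; lcp_outputs is a hypothetical helper that only applies the formulas above and does not call the library.

```python
def lcp_outputs(len_a, len_b, lcp):
    """Return (length, distance, normalized) for a given prefix length."""
    if len_a == 0 and len_b == 0:
        return 0.0, 0.0, 0.0  # both empty: 0.0 in every mode
    total = len_a + len_b
    return float(lcp), float(total - 2 * lcp), 1.0 - 2.0 * lcp / total


# Exhaustively check every feasible (len_a, len_b, lcp) for short sequences.
for len_a in range(6):
    for len_b in range(6):
        for lcp in range(min(len_a, len_b) + 1):
            _, dist, norm = lcp_outputs(len_a, len_b, lcp)
            assert dist >= 0.0
            assert 0.0 <= norm <= 1.0

one_empty = lcp_outputs(4, 0, 0)  # length n = 4 against an empty sequence
```

For the one-empty case with n = 4, the helper yields length 0.0, distance 4.0 and normalized 1.0, matching the behaviour listed above.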

Example:

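from tanat.metric.sequence.type.lcp import LCPSequenceMetric

# seq_a, seq_b and pool are assumed to be existing tanat sequence objects.
lcp = LCPSequenceMetric(mode="normalized")
d   = lcp(seq_a, seq_b)          # value for a single pair
dm  = lcp.compute_matrix(pool)   # matrix over a pool of sequences

MEMMAP_SUPPORT: bool = True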

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
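
The early failure described above can be sketched as a class-level flag checked in the constructor. This is an assumption about the general pattern, not the library's actual base class: BaseMetricSketch, InMemoryOnly and DiskBacked are hypothetical names, the StorageOptions path is omitted, and the error message is illustrative.

```python
class BaseMetricSketch:
    """Hypothetical base class illustrating the MEMMAP_SUPPORT guard."""

    MEMMAP_SUPPORT = False

    def __init__(self, store_path=None):
        # Reject disk-backed storage up front if the subclass cannot honour it.
        if store_path is not None and not self.MEMMAP_SUPPORT:
            raise NotImplementedError(
                f"{type(self).__name__} does not support "
                "disk-backed (memmap) computation"
            )
        self.store_path = store_path


class InMemoryOnly(BaseMetricSketch):
    pass  # inherits MEMMAP_SUPPORT = False


class DiskBacked(BaseMetricSketch):
    MEMMAP_SUPPORT = True  # opts in, as LCPSequenceMetric does
```

Checking the flag in the constructor means the error surfaces before any computation starts, rather than partway through a long matrix run.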

SETTINGS_CLASS

alias of LCPSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.lcp.metric.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')

Bases: object

Settings for LCPSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Default: "hamming".

  • equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Must be ≥ 0. Default: 0.0.

  • mode

    Output mode.

    • "length" → raw LCP length (not a distance, can be > 1).

    • "distance" → additive distance: len_a + len_b − 2·LCP (default).

    • "normalized" → Jaccard-like distance: 1 − 2·LCP / (len_a + len_b), in [0, 1].
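
The effect of equality_threshold is easiest to see with a numeric entity distance. The sketch below assumes absolute difference as the entity distance, purely for illustration (the documented default is "hamming"); prefix_length is a hypothetical helper, not part of the library.

```python
def prefix_length(a, b, threshold):
    """Count leading positions whose entity distance is <= threshold."""
    n = 0
    for x, y in zip(a, b):
        if abs(x - y) > threshold:  # assumed entity distance: |x - y|
            break
        n += 1
    return n


a = [1.0, 2.0, 3.0, 4.0]
b = [1.0, 2.4, 3.1, 9.0]
strict = prefix_length(a, b, 0.0)  # only exact matches count as equal
loose = prefix_length(a, b, 0.5)   # near matches also count as equal
```

With a threshold of 0.0 the scan stops at the second position, giving a prefix of 1; raising the threshold to 0.5 lets the second and third positions through, giving a prefix of 3.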

__init__(*args: Any, **kwargs: Any) → None

entity_metric: EntityMetric = 'hamming'

equality_threshold: float = 0.0

mode: Literal['length', 'distance', 'normalized'] = 'distance'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.

Module contents

LCPSequenceMetric package.

class tanat.metric.sequence.type.lcp.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')

Bases: SequenceMetric

Longest Common Prefix distance between two sequences.

Scans the sequences from the start and counts consecutive positions where the two entities are equal (i.e. their entity distance ≤ equality_threshold). The scan stops at the first mismatch.

Three output modes are available:

  • "length" → raw prefix length (not a proper distance).

  • "distance" → len_a + len_b − 2·LCP (always ≥ 0).

  • "normalized" → 1 − 2·LCP / (len_a + len_b) ∈ [0, 1].

Empty-sequence behaviour:

  • Both empty → 0.0 (for all modes).

  • One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.

Example:

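from tanat.metric.sequence.type.lcp import LCPSequenceMetric

# seq_a, seq_b and pool are assumed to be existing tanat sequence objects.
lcp = LCPSequenceMetric(mode="normalized")
d   = lcp(seq_a, seq_b)          # value for a single pair
dm  = lcp.compute_matrix(pool)   # matrix over a pool of sequences

MEMMAP_SUPPORT: bool = True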

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS

alias of LCPSettings

__init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None

Probe the first entity of each sequence through the entity metric.

class tanat.metric.sequence.type.lcp.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')

Bases: object

Settings for LCPSequenceMetric.

Parameters:
  • entity_metric – Entity-level metric. Default: "hamming".

  • equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Must be ≥ 0. Default: 0.0.

  • mode

    Output mode.

    • "length" → raw LCP length (not a distance, can be > 1).

    • "distance" → additive distance: len_a + len_b − 2·LCP (default).

    • "normalized" → Jaccard-like distance: 1 − 2·LCP / (len_a + len_b), in [0, 1].

__init__(*args: Any, **kwargs: Any) → None

entity_metric: EntityMetric = 'hamming'

equality_threshold: float = 0.0

mode: Literal['length', 'distance', 'normalized'] = 'distance'

model_dump(*, mode='python', **dump_kwargs)

Dump settings to a dict via Pydantic serialization.