tanat.metric.sequence.type.lcp package
Submodules
tanat.metric.sequence.type.lcp.kernels module
Numba kernels for LCPSequenceMetric.
All functions are @njit (no Python objects). They operate on int32-encoded
feature arrays produced by the entity metric’s prepare_batch_data.
- Output-mode integer encoding (mirrors _MODE_MAP in metric.py):
  0 → length
  1 → distance
  2 → normalized
- tanat.metric.sequence.type.lcp.kernels.compute_lcp_matrix(result, start, end, arrays_a, lengths_a, arrays_b, lengths_b, dist_kernel, context, threshold, mode, symmetric)
Parallel LCP matrix kernel.
Processes rows [start, end).
- Parameters:
result – Full (n × n) or (chunk × k) memmap/array.
start – First row index (inclusive).
end – Last row index (exclusive).
arrays_a – Encoded sequences for the row pool.
lengths_a – Lengths for the row pool.
arrays_b – Encoded sequences for the column pool.
lengths_b – Lengths for the column pool.
dist_kernel – Entity distance kernel.
context – Opaque context tuple.
threshold – Equality threshold.
mode – Integer output mode (0/1/2).
symmetric – When True, compute only the upper triangle and mirror each value into the lower triangle.
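The roles of start, end and symmetric can be illustrated with a minimal pure-Python sketch. This is not the Numba kernel. The helper names lcp_length and fill_rows are hypothetical, entities are compared with plain equality instead of dist_kernel and threshold, only the "distance" output is computed, and the way the mirror is written is an assumption about what "upper-triangle + mirror" means.

```python
def lcp_length(seq_a, seq_b):
    """Count leading positions where both sequences hold the same entity."""
    count = 0
    for a, b in zip(seq_a, seq_b):
        if a != b:
            break
        count += 1
    return count


def fill_rows(result, start, end, pool_a, pool_b, symmetric):
    """Fill rows [start, end) of result in place (hypothetical helper)."""
    for i in range(start, end):
        # ASSUMPTION: in symmetric mode only columns j >= i are computed.
        first_col = i if symmetric else 0
        for j in range(first_col, len(pool_b)):
            lcp = lcp_length(pool_a[i], pool_b[j])
            value = float(len(pool_a[i]) + len(pool_b[j]) - 2 * lcp)
            result[i][j] = value
            if symmetric:
                result[j][i] = value  # mirror into the lower triangle


pool = [[1, 2, 3], [1, 2, 4, 5], [7]]
n = len(pool)

# Symmetric fill, processed as two row chunks: [0, 2) and [2, 3).
matrix = [[0.0] * n for _ in range(n)]
fill_rows(matrix, 0, 2, pool, pool, symmetric=True)
fill_rows(matrix, 2, 3, pool, pool, symmetric=True)

# Full fill without the symmetry shortcut, for comparison.
full = [[0.0] * n for _ in range(n)]
fill_rows(full, 0, n, pool, pool, symmetric=False)
```

Both fills produce the same matrix here. The symmetric variant evaluates roughly half as many pairs, which is only valid when the row pool and the column pool are the same.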
- tanat.metric.sequence.type.lcp.kernels.compute_lcp_pair(arr_a, arr_b, len_a, len_b, dist_kernel, context, threshold, mode)
Compute LCP distance for a single pair of int32-encoded sequences.
Scans aligned positions until the first mismatch (entity distance > threshold) and counts the common prefix length. Then applies the requested output mode:
  0 (length) → raw prefix count.
  1 (distance) → len_a + len_b − 2 × lcp.
  2 (normalized) → 1 − 2 × lcp / (len_a + len_b).
- Parameters:
arr_a – int32-encoded sequence A.
arr_b – int32-encoded sequence B.
len_a – Length of A.
len_b – Length of B.
dist_kernel – Numba entity distance kernel.
context – Opaque context tuple forwarded to dist_kernel.
threshold – Equality threshold (float32).
mode – Integer output mode (0/1/2).
- Returns:
float32 result.
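As a reading aid, the scan and the three output modes can be written as a plain-Python reference. This is a sketch under stated assumptions, not the library's kernel: the name lcp_pair_reference, the call convention dist_kernel(x, y, context), the use of -1 as padding, and the guard for two empty sequences are all assumptions made for illustration.

```python
def lcp_pair_reference(arr_a, arr_b, len_a, len_b, dist_kernel, context,
                       threshold, mode):
    """Plain-Python stand-in for the pair computation (illustrative only)."""
    lcp = 0
    for k in range(min(len_a, len_b)):
        # ASSUMPTION: the entity kernel is called as dist_kernel(x, y, context).
        if dist_kernel(arr_a[k], arr_b[k], context) > threshold:
            break  # first mismatch ends the common prefix
        lcp += 1

    total = len_a + len_b
    if mode == 0:  # length
        return float(lcp)
    if mode == 1:  # distance
        return float(total - 2 * lcp)
    # normalized; the guard avoids 0 / 0 when both sequences are empty
    return 0.0 if total == 0 else 1.0 - 2.0 * lcp / total


def hamming_kernel(x, y, context):
    """Hypothetical entity kernel: 0.0 when the codes match, else 1.0."""
    return 0.0 if x == y else 1.0


# Trailing -1 values stand for padding past each sequence's length (assumption).
arr_a = [3, 1, 4, 1, 5, -1]
arr_b = [3, 1, 4, 2, -1, -1]

length = lcp_pair_reference(arr_a, arr_b, 5, 4, hamming_kernel, (), 0.0, 0)
distance = lcp_pair_reference(arr_a, arr_b, 5, 4, hamming_kernel, (), 0.0, 1)
normalized = lcp_pair_reference(arr_a, arr_b, 5, 4, hamming_kernel, (), 0.0, 2)
```

The two sequences share a prefix of three entities, so the distance is 5 + 4 − 2 × 3 = 3 and the normalized value is 1 − 6 / 9.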
tanat.metric.sequence.type.lcp.metric module
LCPSequenceMetric: Longest Common Prefix distance between sequences.
- class tanat.metric.sequence.type.lcp.metric.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Longest Common Prefix distance between two sequences.
Scans the sequences from the start and counts consecutive positions where the two entities are equal (i.e. their entity distance ≤ equality_threshold). The scan stops at the first mismatch.
Three output modes are available:
  "length" → raw prefix length (not a proper distance).
  "distance" → len_a + len_b − 2·LCP (always ≥ 0).
  "normalized" → 1 − 2·LCP / (len_a + len_b) ∈ [0, 1].
Empty-sequence behaviour:
  Both empty → 0.0 (for all modes).
  One empty (length n vs 0) → length: 0.0, distance: n, normalized: 1.0.
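The empty-sequence values follow from the mode formulas with LCP = 0, except for the normalized mode with two empty sequences, where the ratio is undefined and has to be special-cased. The sketch below checks this with a hypothetical helper, apply_mode, which is not part of the package; how the library implements the guard internally is not shown in this page.

```python
def apply_mode(lcp, len_a, len_b, mode):
    """Turn a prefix length into the requested output (hypothetical helper)."""
    total = len_a + len_b
    if mode == "length":
        return float(lcp)
    if mode == "distance":
        return float(total - 2 * lcp)
    if mode == "normalized":
        # Guard: with two empty sequences the ratio would be 0 / 0.
        return 0.0 if total == 0 else 1.0 - 2.0 * lcp / total
    raise ValueError(f"unknown mode: {mode!r}")


modes = ("length", "distance", "normalized")

# Both sequences empty: every mode yields 0.0.
both_empty = {m: apply_mode(0, 0, 0, m) for m in modes}

# One sequence of length n = 4 against an empty one: the prefix length is 0.
one_empty = {m: apply_mode(0, 4, 0, m) for m in modes}
```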
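Example:
    from tanat.metric.sequence.type.lcp import LCPSequenceMetric

    # seq_a, seq_b and pool are assumed to exist already
    lcp = LCPSequenceMetric(mode="normalized")
    d = lcp(seq_a, seq_b)
    dm = lcp.compute_matrix(pool)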
- MEMMAP_SUPPORT: bool = True
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
- SETTINGS_CLASS
alias of LCPSettings
- __init__(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') -> None
- class tanat.metric.sequence.type.lcp.metric.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')
Bases: object
Settings for LCPSequenceMetric.
- Parameters:
entity_metric – Entity-level metric. Default: "hamming".
equality_threshold – Two entities are considered equal when their entity distance is ≤ this threshold. Must be ≥ 0. Default: 0.0.
mode – Output mode.
  "length" → raw LCP length (not a distance, can be > 1).
  "distance" → additive distance: len_a + len_b − 2·LCP (default).
  "normalized" → Jaccard-like distance: 1 − 2·LCP / (len_a + len_b), in [0, 1].
- entity_metric: EntityMetric = 'hamming'
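To see how equality_threshold changes the result, consider an entity distance that is not binary. The sketch below uses a hypothetical absolute-difference distance on plain floats and a hypothetical helper, prefix_length. Neither is part of the package; they only mirror the rule "equal when entity distance ≤ threshold".

```python
def prefix_length(seq_a, seq_b, entity_distance, equality_threshold):
    """Count leading positions whose entity distance is within the threshold."""
    count = 0
    for a, b in zip(seq_a, seq_b):
        if entity_distance(a, b) > equality_threshold:
            break
        count += 1
    return count


def abs_diff(a, b):
    """Hypothetical entity distance for numeric entities."""
    return abs(a - b)


seq_a = [1.0, 2.0, 3.0, 4.0]
seq_b = [1.0, 2.25, 3.5, 9.0]

# Threshold 0.0: only exact matches count, so the prefix stops after position 0.
strict = prefix_length(seq_a, seq_b, abs_diff, 0.0)

# Threshold 0.5: differences of 0.25 and 0.5 are tolerated, 5.0 is not.
loose = prefix_length(seq_a, seq_b, abs_diff, 0.5)
```

With the default "hamming" entity metric and a threshold of 0.0, the rule reduces to exact equality of the entities.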
Module contents
LCPSequenceMetric package.
- class tanat.metric.sequence.type.lcp.LCPSequenceMetric(entity_metric: EntityMetric | str = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance', *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
- class tanat.metric.sequence.type.lcp.LCPSettings(*, entity_metric: EntityMetric = 'hamming', equality_threshold: float = 0.0, mode: Literal['length', 'distance', 'normalized'] = 'distance')