tanat.metric.sequence.type.chi2 package#

Submodules#

tanat.metric.sequence.type.chi2.kernels module#

Numba kernels for Chi2SequenceMetric.

Chi² operates on pre-computed histogram arrays (one float32 row per sequence), built in Python via Polars. The Numba kernels perform only the pairwise chi-squared distance computations, which cost O(n² × V) for n sequences and V categories (V = n_cats).

Histogram layout: hists[i] is a float32 vector of raw (unnormalised) weights of length n_cats for sequence i. The kernel normalises internally (division by sum(hists[i])).

tanat.metric.sequence.type.chi2.kernels.compute_chi2_cross_matrix(result, hists_rows, hists_cols, n_cats)[source]#

Parallel Chi2 cross-matrix kernel.

Computes the full asymmetric matrix between two histogram sets.
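The following NumPy sketch illustrates what a cross-matrix of this kind contains. It is not the Numba kernel: the function name `chi2_cross_matrix_reference` is hypothetical, and it assumes that entry (i, j) holds the distance between `hists_rows[i]` and `hists_cols[j]` and that categories absent from both histograms contribute nothing to the sum.

```python
import numpy as np


def chi2_cross_matrix_reference(hists_rows, hists_cols):
    """Reference (non-Numba) chi-squared cross-matrix, for illustration only."""
    rows = np.asarray(hists_rows, dtype=np.float64)
    cols = np.asarray(hists_cols, dtype=np.float64)

    def normalise(h):
        # Divide each row by its total; all-zero rows stay all-zero.
        totals = h.sum(axis=1, keepdims=True)
        return np.divide(h, totals, out=np.zeros_like(h), where=totals > 0)

    p = normalise(rows)[:, None, :]  # shape (n_rows, 1, n_cats)
    q = normalise(cols)[None, :, :]  # shape (1, n_cols, n_cats)
    num = (p - q) ** 2
    den = p + q
    # Assumption: terms with a zero denominator are treated as 0.
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.sqrt(terms.sum(axis=2)).astype(np.float32)


rows = [[2, 0], [0, 0]]
cols = [[1, 0], [0, 3], [0, 0]]
cross = chi2_cross_matrix_reference(rows, cols)  # shape (2, 3)
```

The result is rectangular, with one row per entry of `hists_rows` and one column per entry of `hists_cols`, so it has no symmetry to exploit.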

tanat.metric.sequence.type.chi2.kernels.compute_chi2_matrix(result, start, end, hists, n_cats, symmetric)[source]#

Parallel Chi2 matrix kernel.

Processes rows [start, end).
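As a rough illustration of row-chunked processing, the plain-Python sketch below fills only rows [start, end) of a preallocated result matrix. The name `chi2_rows_reference` is hypothetical, and the meaning given to `symmetric` here (compute only the entries above the diagonal and mirror them) is an assumption, not something this page confirms about the real kernel.

```python
import numpy as np


def chi2_rows_reference(result, start, end, hists, symmetric=True):
    """Fill rows [start, end) of `result` with chi-squared distances (sketch)."""
    h = np.asarray(hists, dtype=np.float64)
    totals = h.sum(axis=1, keepdims=True)
    # Normalise each row to proportions; all-zero rows stay all-zero.
    p = np.divide(h, totals, out=np.zeros_like(h), where=totals > 0)
    n = p.shape[0]
    for i in range(start, end):
        # Assumed semantics: in symmetric mode only j > i is computed.
        first_col = i + 1 if symmetric else 0
        for j in range(first_col, n):
            den = p[i] + p[j]
            mask = den > 0  # skip categories absent from both rows
            d = np.sqrt(np.sum((p[i][mask] - p[j][mask]) ** 2 / den[mask]))
            result[i, j] = d
            if symmetric:
                result[j, i] = d  # mirror into the lower triangle
    return result


hists = np.array([[1, 2, 0], [0, 1, 1], [0, 0, 0], [4, 0, 0]], dtype=np.float32)

# Two chunks in symmetric mode ...
sym = np.zeros((4, 4))
chi2_rows_reference(sym, 0, 2, hists, symmetric=True)
chi2_rows_reference(sym, 2, 4, hists, symmetric=True)

# ... give the same matrix as one full pass without mirroring.
full = np.zeros((4, 4))
chi2_rows_reference(full, 0, 4, hists, symmetric=False)
```

Because each call touches only its own row range (plus mirrored entries in symmetric mode), chunks can be computed one at a time, which is the property a chunked or resumable computation would rely on.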

tanat.metric.sequence.type.chi2.kernels.compute_chi2_pair(hist_a, hist_b, n_cats)[source]#

Compute chi-squared distance between two (unnormalised) histograms.

\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]

where p_{aj} and p_{bj} are the proportions (sum-normalised weights) of category j in sequences a and b.

Parameters:

hist_a – float32 weight array of length n_cats for sequence A.
hist_b – float32 weight array of length n_cats for sequence B.
n_cats – Number of categories (length of the histogram vectors).

Returns:

float32 chi-squared distance.
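A minimal NumPy sketch of the formula above, for readers who want to check values by hand. The name `chi2_pair_reference` is hypothetical. The handling of empty histograms (0.0 when both are empty, 1.0 when one is) follows the empty-sequence behaviour documented for Chi2SequenceMetric; skipping categories where both proportions are zero is an assumption made to avoid division by zero.

```python
import numpy as np


def chi2_pair_reference(hist_a, hist_b):
    """Chi-squared distance between two raw (unnormalised) histograms (sketch)."""
    a = np.asarray(hist_a, dtype=np.float64)
    b = np.asarray(hist_b, dtype=np.float64)
    sum_a, sum_b = a.sum(), b.sum()
    if sum_a == 0 and sum_b == 0:
        return np.float32(0.0)  # both empty
    if sum_a == 0 or sum_b == 0:
        return np.float32(1.0)  # exactly one empty
    p = a / sum_a  # proportions for sequence A
    q = b / sum_b  # proportions for sequence B
    den = p + q
    mask = den > 0  # assumption: ignore categories absent from both
    return np.float32(np.sqrt(np.sum((p[mask] - q[mask]) ** 2 / den[mask])))


same_shape = chi2_pair_reference([1, 1], [2, 2])  # equal proportions
disjoint = chi2_pair_reference([3, 0], [0, 5])    # no shared category
one_empty = chi2_pair_reference([0, 0], [0, 5])
both_empty = chi2_pair_reference([0, 0], [0, 0])
```

Because the weights are normalised first, `[1, 1]` and `[2, 2]` are at distance 0. Under the formula as written, two non-empty histograms with no shared category are at distance √2, so distances between non-empty sequences can exceed 1.0.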

tanat.metric.sequence.type.chi2.metric module#

Chi2SequenceMetric: Chi-squared distance between state-time distributions.

class tanat.metric.sequence.type.chi2.metric.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Chi-squared distance between the state-time distributions of two sequences.

Rather than comparing sequences element-by-element, Chi² compares the proportion of time (or event count) spent in each categorical state.

Event sequences: each event contributes a weight of 1.
Interval / State sequences: each entity contributes end − start as its weight.

The per-category proportions are computed independently for each sequence, then compared via the Chi-squared distance formula:

\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]

Note

Chi² does not use an entity metric; entity_metric is absent from its settings. The validate_composition method checks only that the requested feature is present and categorical.

Empty-sequence behaviour:

Both empty → 0.0 (identical empty distributions).
One empty → 1.0 (maximally different distributions).

Example:

chi2 = Chi2SequenceMetric(entity_feature="status")
d    = chi2(seq_a, seq_b)
dm   = chi2.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of Chi2Settings

__init__(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

prepare_batch_data(pool: SequencePool) → tuple[source]#

Build histogram arrays for all sequences in pool.

Returns:

(hists, n_cats), where hists is a float32 numpy array of shape (n, n_cats) containing raw (unnormalised) weights, n is the number of sequences in pool, and rows are ordered to match pool.unique_ids.
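The sketch below shows the shape of such a histogram array using plain Python and NumPy rather than Polars. The function `build_histograms` and its input layout (each sequence as a list of `(state, weight)` pairs) are hypothetical; the weights follow the rule stated for this metric, 1 per event and end − start per interval.

```python
import numpy as np


def build_histograms(sequences, categories):
    """Accumulate raw per-category weights, one float32 row per sequence (sketch)."""
    index = {cat: k for k, cat in enumerate(categories)}
    hists = np.zeros((len(sequences), len(categories)), dtype=np.float32)
    for i, seq in enumerate(sequences):
        for state, weight in seq:
            hists[i, index[state]] += weight  # raw weight, not normalised
    return hists, len(categories)


sequences = [
    [("A", 1.0), ("B", 1.0), ("A", 1.0)],  # event sequence: weight 1 per event
    [("B", 2.5)],                          # interval sequence: weight = end - start
    [],                                    # empty sequence: all-zero row
]
hists, n_cats = build_histograms(sequences, ["A", "B"])
```

An empty sequence yields an all-zero row, which is why any code that consumes the array has to handle a zero row total when normalising.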

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#

Resolve and validate the target feature.

If entity_feature was not specified, it is resolved to the first entity feature of seq_a and stored in target_feature. The method then checks that the feature is present and categorical in every provided sequence.

Parameters:

seq_a – Primary sequence.
seq_b – Optional second sequence.

Raises:

KeyError – If the feature is absent from a sequence.
TypeError – If the feature is not categorical.

class tanat.metric.sequence.type.chi2.metric.Chi2Settings(*, entity_feature: str | None = None)[source]#

Bases: object

Settings for Chi2SequenceMetric.

Parameters:

entity_feature – Categorical feature name used as the histogram key (same semantics as HammingEntityMetric). If None, it is resolved from the first entity feature of the sequence when validate_composition() is called.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_feature: str | None = None[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.

Module contents#

Chi2SequenceMetric package.

class tanat.metric.sequence.type.chi2.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Chi-squared distance between the state-time distributions of two sequences.

Rather than comparing sequences element-by-element, Chi² compares the proportion of time (or event count) spent in each categorical state.

Event sequences: each event contributes a weight of 1.
Interval / State sequences: each entity contributes end − start as its weight.

The per-category proportions are computed independently for each sequence, then compared via the Chi-squared distance formula:

\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]

Note

Chi² does not use an entity metric; entity_metric is absent from its settings. The validate_composition method checks only that the requested feature is present and categorical.

Empty-sequence behaviour:

Both empty → 0.0 (identical empty distributions).
One empty → 1.0 (maximally different distributions).

Example:

chi2 = Chi2SequenceMetric(entity_feature="status")
d    = chi2(seq_a, seq_b)
dm   = chi2.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#: Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#: alias of Chi2Settings

__init__(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

prepare_batch_data(pool: SequencePool) → tuple[source]#

Build histogram arrays for all sequences in pool.

Returns:

(hists, n_cats), where hists is a float32 numpy array of shape (n, n_cats) containing raw (unnormalised) weights, n is the number of sequences in pool, and rows are ordered to match pool.unique_ids.

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#

Resolve and validate the target feature.

Parameters:

seq_a – Primary sequence.
seq_b – Optional second sequence.

Raises:

KeyError – If the feature is absent from a sequence.
TypeError – If the feature is not categorical.

class tanat.metric.sequence.type.chi2.Chi2Settings(*, entity_feature: str | None = None)[source]#

Bases: object

Settings for Chi2SequenceMetric.

Parameters:

entity_feature – Categorical feature name used as the histogram key (same semantics as HammingEntityMetric). If None, it is resolved from the first entity feature of the sequence when validate_composition() is called.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_feature: str | None = None[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#: Dump settings to a dict via Pydantic serialization.