tanat.metric.sequence.type.chi2 package#

Submodules#

tanat.metric.sequence.type.chi2.kernels module#

Numba kernels for Chi2SequenceMetric.

Chi² operates on pre-computed histogram arrays (one float32 row per sequence), built in Python via Polars. The Numba kernels perform only the O(n² × V) pairwise chi-squared distance computations, where n is the number of sequences and V the number of categories (n_cats).

Histogram layout: hists[i] is a float32 vector of raw (unnormalised) weights of length n_cats for sequence i. The kernel normalises internally (division by sum(hists[i])).

tanat.metric.sequence.type.chi2.kernels.compute_chi2_cross_matrix(result, hists_rows, hists_cols, n_cats)[source]#

Parallel Chi2 cross-matrix kernel.

Computes the full asymmetric matrix between two histogram sets.
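The cross-matrix computation can be illustrated in plain NumPy. This is a minimal sketch, not the Numba kernel itself: the name chi2_cross_matrix is hypothetical, it returns a new array instead of writing into a preallocated result, and it assumes every histogram has a positive sum.

```python
import numpy as np


def chi2_cross_matrix(hists_rows, hists_cols):
    """Return an (n_rows, n_cols) float32 matrix of chi-squared distances.

    Hypothetical NumPy stand-in for the cross-matrix kernel.
    Assumes every histogram has a strictly positive sum.
    """
    rows = np.asarray(hists_rows, dtype=np.float64)
    cols = np.asarray(hists_cols, dtype=np.float64)

    # Normalise raw weights to proportions, one histogram per row.
    p = rows / rows.sum(axis=1, keepdims=True)
    q = cols / cols.sum(axis=1, keepdims=True)

    # Broadcast to shape (n_rows, n_cols, n_cats).
    diff2 = (p[:, None, :] - q[None, :, :]) ** 2
    denom = p[:, None, :] + q[None, :, :]

    # Categories absent from both histograms contribute nothing.
    terms = np.divide(diff2, denom, out=np.zeros_like(diff2), where=denom > 0)
    return np.sqrt(terms.sum(axis=2)).astype(np.float32)
```

Entry [i, j] of the returned matrix compares hists_rows[i] with hists_cols[j], so the matrix is rectangular whenever the two sets differ in size.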

tanat.metric.sequence.type.chi2.kernels.compute_chi2_matrix(result, start, end, hists, n_cats, symmetric)[source]#

Parallel Chi2 matrix kernel.

Processes rows [start, end).
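Row-range processing can be sketched as follows. The function chi2_rows is a hypothetical pure-Python stand-in; in particular, the meaning given here to symmetric (compute only j > i and mirror the value) is an assumption, not something the kernel documentation states. Histograms are assumed to have positive sums.

```python
import numpy as np


def chi2_rows(result, start, end, hists, symmetric=True):
    """Fill rows [start, end) of a preallocated square distance matrix.

    Hypothetical sketch; assumes no histogram is empty.
    """
    h = np.asarray(hists, dtype=np.float64)
    p = h / h.sum(axis=1, keepdims=True)  # proportions per sequence
    n = p.shape[0]
    for i in range(start, end):
        # Assumption: symmetric=True computes the upper triangle only
        # and mirrors each value into the lower triangle.
        first_col = i + 1 if symmetric else 0
        for j in range(first_col, n):
            denom = p[i] + p[j]
            mask = denom > 0  # skip categories absent from both
            d = np.sqrt(np.sum((p[i][mask] - p[j][mask]) ** 2 / denom[mask]))
            result[i, j] = d
            if symmetric:
                result[j, i] = d
```

Calling it once per chunk, for example with (0, 500), then (500, 1000), fills the same result array incrementally, which is the pattern the chunk_size parameter of the metric suggests.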

tanat.metric.sequence.type.chi2.kernels.compute_chi2_pair(hist_a, hist_b, n_cats)[source]#

Compute chi-squared distance between two (unnormalised) histograms.

\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]

where p values are proportions (sum-normalised weights).

Parameters:
  • hist_a – float32 weight array of length n_cats for sequence A.

  • hist_b – float32 weight array of length n_cats for sequence B.

  • n_cats – Number of categories (length of the histogram vectors).

Returns:

float32 chi-squared distance.
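A minimal NumPy sketch of the formula above, for illustration only. The name chi2_pair is hypothetical, and handling empty histograms inside this function (both empty gives 0.0, one empty gives 1.0) is an assumption borrowed from the empty-sequence behaviour documented for Chi2SequenceMetric.

```python
import numpy as np


def chi2_pair(hist_a, hist_b):
    """Chi-squared distance between two raw (unnormalised) histograms."""
    a = np.asarray(hist_a, dtype=np.float64)
    b = np.asarray(hist_b, dtype=np.float64)
    sum_a, sum_b = a.sum(), b.sum()

    # Assumed conventions for empty histograms (see lead-in).
    if sum_a == 0.0 and sum_b == 0.0:
        return np.float32(0.0)
    if sum_a == 0.0 or sum_b == 0.0:
        return np.float32(1.0)

    # Sum-normalise raw weights into proportions.
    p_a, p_b = a / sum_a, b / sum_b

    # Categories absent from both histograms contribute nothing.
    denom = p_a + p_b
    mask = denom > 0.0
    total = np.sum((p_a[mask] - p_b[mask]) ** 2 / denom[mask])
    return np.float32(np.sqrt(total))
```

Because the weights are normalised first, rescaling a histogram does not change the distance: chi2_pair([1, 2, 3], [2, 4, 6]) is zero.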

tanat.metric.sequence.type.chi2.metric module#

Chi2SequenceMetric: Chi-squared distance between state-time distributions.

class tanat.metric.sequence.type.chi2.metric.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Chi-squared distance between the state-time distributions of two sequences.

Rather than comparing sequences element-by-element, Chi² compares the proportion of time (or event count) spent in each categorical state.

  • Event sequences: each event contributes a weight of 1.

  • Interval / State sequences: each entity contributes end - start as its weight.

The per-category proportions are computed independently for each sequence, then compared via the Chi-squared distance formula:

\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]

Note

Chi² does not use an entity metric; entity_metric is absent from its settings. The validate_composition method checks only that the requested feature is present and categorical.

Empty-sequence behaviour:

  • Both empty → 0.0 (identical empty distributions).

  • One empty → 1.0 (maximally different distributions).

Example:

chi2 = Chi2SequenceMetric(entity_feature="status")
d    = chi2(seq_a, seq_b)
dm   = chi2.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of Chi2Settings

__init__(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

prepare_batch_data(pool: SequencePool) → tuple[source]#

Build histogram arrays for all sequences in pool.

Returns:

(hists, n_cats) where hists is a float32 numpy array of shape (n, n_cats) containing raw (unnormalised) weights, with rows ordered to match pool.unique_ids.
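The histogram layout can be illustrated with a small pure-Python and NumPy sketch. The library builds these arrays via Polars; the function build_histograms and its input format (lists of (state, start, end) tuples, with end set to None for events) are hypothetical and chosen only to show how the weights accumulate.

```python
import numpy as np


def build_histograms(sequences, categories):
    """Build raw (unnormalised) weight histograms, one row per sequence.

    Hypothetical input format: each sequence is a list of
    (state, start, end) tuples; end is None for a point event.
    """
    index = {cat: k for k, cat in enumerate(categories)}
    hists = np.zeros((len(sequences), len(categories)), dtype=np.float32)
    for i, seq in enumerate(sequences):
        for state, start, end in seq:
            # Events weigh 1; intervals and states weigh their duration.
            weight = 1.0 if end is None else end - start
            hists[i, index[state]] += weight
    return hists, len(categories)
```

An empty sequence yields an all-zero row, which is the case the empty-sequence rules of the metric have to cover.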

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#

Resolve and validate the target feature.

If entity_feature was not specified, resolves to the first entity feature of seq_a and stores it in target_feature. Then checks that the feature is present and categorical in every provided sequence.

Parameters:
  • seq_a – Primary sequence.

  • seq_b – Optional second sequence.

Raises:
  • KeyError – If the feature is absent from a sequence.

  • TypeError – If the feature is not categorical.

class tanat.metric.sequence.type.chi2.metric.Chi2Settings(*, entity_feature: str | None = None)[source]#

Bases: object

Settings for Chi2SequenceMetric.

Parameters:

entity_feature – Categorical feature name used as the histogram key (same semantics as HammingEntityMetric). None → resolved from the first entity feature of the sequence at validate_composition() time.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_feature: str | None = None[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.

Module contents#

Chi2SequenceMetric package.

class tanat.metric.sequence.type.chi2.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')[source]#

Bases: SequenceMetric

Chi-squared distance between the state-time distributions of two sequences.

Rather than comparing sequences element-by-element, Chi² compares the proportion of time (or event count) spent in each categorical state.

  • Event sequences: each event contributes a weight of 1.

  • Interval / State sequences: each entity contributes end - start as its weight.

The per-category proportions are computed independently for each sequence, then compared via the Chi-squared distance formula:

\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]

Note

Chi² does not use an entity metric; entity_metric is absent from its settings. The validate_composition method checks only that the requested feature is present and categorical.

Empty-sequence behaviour:

  • Both empty → 0.0 (identical empty distributions).

  • One empty → 1.0 (maximally different distributions).

Example:

chi2 = Chi2SequenceMetric(entity_feature="status")
d    = chi2(seq_a, seq_b)
dm   = chi2.compute_matrix(pool)

MEMMAP_SUPPORT: bool = True[source]#

Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.

SETTINGS_CLASS[source]#

alias of Chi2Settings

__init__(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None[source]#

prepare_batch_data(pool: SequencePool) → tuple[source]#

Build histogram arrays for all sequences in pool.

Returns:

(hists, n_cats) where hists is a float32 numpy array of shape (n, n_cats) containing raw (unnormalised) weights, with rows ordered to match pool.unique_ids.

validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None[source]#

Resolve and validate the target feature.

If entity_feature was not specified, resolves to the first entity feature of seq_a and stores it in target_feature. Then checks that the feature is present and categorical in every provided sequence.

Parameters:
  • seq_a – Primary sequence.

  • seq_b – Optional second sequence.

Raises:
  • KeyError – If the feature is absent from a sequence.

  • TypeError – If the feature is not categorical.

class tanat.metric.sequence.type.chi2.Chi2Settings(*, entity_feature: str | None = None)[source]#

Bases: object

Settings for Chi2SequenceMetric.

Parameters:

entity_feature – Categorical feature name used as the histogram key (same semantics as HammingEntityMetric). None → resolved from the first entity feature of the sequence at validate_composition() time.

__init__(*args: Any, **kwargs: Any) → None[source]#

entity_feature: str | None = None[source]#

model_dump(*, mode='python', **dump_kwargs)[source]#

Dump settings to a dict via Pydantic serialization.