tanat.metric.sequence.type.chi2 package
Submodules
tanat.metric.sequence.type.chi2.kernels module
Numba kernels for Chi2SequenceMetric.
Chi² operates on pre-computed histogram arrays (one float32 row per sequence), built in Python via Polars. The Numba kernels perform only the pairwise chi-squared distance computations, which cost O(n² × V) for n sequences and V categories (n_cats).
Histogram layout: hists[i] is a float32 vector of raw (unnormalised)
weights of length n_cats for sequence i. The kernel normalises
internally (division by sum(hists[i])).
- tanat.metric.sequence.type.chi2.kernels.compute_chi2_cross_matrix(result, hists_rows, hists_cols, n_cats)
Parallel Chi2 cross-matrix kernel.
Computes the full asymmetric matrix of distances between two histogram sets, hists_rows and hists_cols.
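For orientation, the cross-matrix computation can be sketched in plain NumPy. This is a reference sketch only, not the Numba kernel: the function name `chi2_cross_matrix_reference` is hypothetical, it returns a new array instead of filling `result`, and it assumes that every histogram has a positive sum and that categories absent from both histograms contribute zero.

```python
import numpy as np


def chi2_cross_matrix_reference(hists_rows, hists_cols):
    """Chi-squared distances between every row histogram and every column histogram."""
    rows = np.asarray(hists_rows, dtype=np.float64)
    cols = np.asarray(hists_cols, dtype=np.float64)
    # Normalise raw weights to proportions (assumes each row sums to > 0).
    p = rows / rows.sum(axis=1, keepdims=True)  # shape (n_rows, n_cats)
    q = cols / cols.sum(axis=1, keepdims=True)  # shape (n_cols, n_cats)
    diff2 = (p[:, None, :] - q[None, :, :]) ** 2
    denom = p[:, None, :] + q[None, :, :]
    # Assumption: a category with zero weight in both histograms adds nothing.
    terms = np.divide(diff2, denom, out=np.zeros_like(diff2), where=denom > 0)
    return np.sqrt(terms.sum(axis=2))  # shape (n_rows, n_cols)


rows = np.array([[1, 1], [2, 0]], dtype=np.float32)
cols = np.array([[3, 3], [0, 5], [1, 0]], dtype=np.float32)
cross = chi2_cross_matrix_reference(rows, cols)
```

Here `cross[i, j]` holds the distance between `rows[i]` and `cols[j]`, so the output is rectangular whenever the two sets differ in size.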
- tanat.metric.sequence.type.chi2.kernels.compute_chi2_matrix(result, start, end, hists, n_cats, symmetric)
Parallel Chi2 matrix kernel.
Processes rows in the half-open range [start, end).
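The row-range interface suggests that the matrix is filled chunk by chunk. The sketch below illustrates that idea in plain Python and NumPy. It is not the library kernel: `fill_rows` is a hypothetical name, and the meaning given to `symmetric` here (compute only the upper triangle and mirror it) is an assumption, since the source does not document that flag.

```python
import numpy as np


def fill_rows(result, start, end, hists, symmetric):
    """Fill rows [start, end) of a pairwise chi-squared distance matrix in place."""
    p = hists / hists.sum(axis=1, keepdims=True)  # assumes every row sums to > 0
    n = p.shape[0]
    for i in range(start, min(end, n)):
        # Assumed semantics: with symmetric=True only j > i is computed, then mirrored.
        first_col = i + 1 if symmetric else 0
        for j in range(first_col, n):
            denom = p[i] + p[j]
            mask = denom > 0  # skip categories absent from both histograms
            d = np.sqrt(np.sum((p[i][mask] - p[j][mask]) ** 2 / denom[mask]))
            result[i, j] = d
            if symmetric:
                result[j, i] = d


hists = np.array([[1, 1], [2, 0], [0, 3]], dtype=np.float32)
result = np.zeros((3, 3), dtype=np.float32)
chunk_size = 2
for start in range(0, len(hists), chunk_size):  # chunks [0, 2) and [2, 4)
    fill_rows(result, start, start + chunk_size, hists, symmetric=True)
```

Because each call touches only its own rows (plus the mirrored cells), chunks can be processed one at a time, which is what makes a resumable or disk-backed computation practical.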
- tanat.metric.sequence.type.chi2.kernels.compute_chi2_pair(hist_a, hist_b, n_cats)
Compute chi-squared distance between two (unnormalised) histograms.
\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]
where the p values are proportions (sum-normalised weights).
- Parameters:
hist_a – float32 weight array of length n_cats for sequence A.
hist_b – float32 weight array of length n_cats for sequence B.
n_cats – Number of categories (length of the histogram vectors).
- Returns:
float32 chi-squared distance.
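The formula above can be checked against a small NumPy reference. This is a sketch under stated assumptions rather than the Numba kernel itself: `chi2_pair_reference` is a hypothetical name, both histograms are assumed to have a positive sum, and categories with zero proportion in both histograms are assumed to contribute nothing to the sum.

```python
import numpy as np


def chi2_pair_reference(hist_a, hist_b):
    """Chi-squared distance between two raw (unnormalised) weight vectors."""
    a = np.asarray(hist_a, dtype=np.float64)
    b = np.asarray(hist_b, dtype=np.float64)
    p_a = a / a.sum()  # proportions for sequence A (assumes sum > 0)
    p_b = b / b.sum()  # proportions for sequence B (assumes sum > 0)
    denom = p_a + p_b
    mask = denom > 0  # assumption: skip categories absent from both histograms
    return float(np.sqrt(np.sum((p_a[mask] - p_b[mask]) ** 2 / denom[mask])))


same_shape = chi2_pair_reference([1, 3], [2, 6])
different = chi2_pair_reference([1, 1], [2, 0])
```

Since the weights are normalised first, `[1, 3]` and `[2, 6]` describe the same distribution and their distance is zero; only the relative weights matter.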
tanat.metric.sequence.type.chi2.metric module
Chi2SequenceMetric: Chi-squared distance between state-time distributions.
- class tanat.metric.sequence.type.chi2.metric.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Chi-squared distance between the state-time distributions of two sequences.
Rather than comparing sequences element-by-element, Chi² compares the proportion of time (or event count) spent in each categorical state.
Event sequences: each event contributes a weight of 1.
Interval / State sequences: each entity contributes end − start as its weight.
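The two weighting rules can be illustrated with a small histogram builder. The library builds its histograms with Polars; the sketch below uses plain Python and NumPy instead, and the function name `build_histogram`, the `(category, start, end)` tuple layout, and the `kind` argument are all hypothetical.

```python
import numpy as np


def build_histogram(entities, categories, kind):
    """Accumulate raw weights per category.

    entities: list of (category, start, end) tuples (hypothetical layout).
    kind: "event" (weight 1 per entity) or "interval" (weight end - start).
    """
    index = {cat: k for k, cat in enumerate(categories)}
    hist = np.zeros(len(categories), dtype=np.float32)
    for cat, start, end in entities:
        weight = 1.0 if kind == "event" else float(end - start)
        hist[index[cat]] += weight
    return hist


cats = ["employed", "unemployed"]
spells = [("employed", 0, 6), ("unemployed", 6, 8), ("employed", 8, 12)]
interval_hist = build_histogram(spells, cats, "interval")  # weights 10 and 2
event_hist = build_histogram(spells, cats, "event")  # weights 2 and 1
```

The same three entities give proportions of 10/12 and 2/12 when weighted by duration, but 2/3 and 1/3 when counted as events, so the choice of sequence type changes the distribution being compared.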
The per-category proportions are computed independently for each sequence, then compared via the Chi-squared distance formula:
\[d(a, b) = \sqrt{\sum_j \frac{(p_{aj} - p_{bj})^2}{p_{aj} + p_{bj}}}\]
Note
Chi² does not use an entity metric; entity_metric is absent from its settings. The validate_composition method checks only the requested feature: that it is present and categorical in each sequence.
Empty-sequence behaviour:
Both empty → 0.0 (identical empty distributions).
One empty → 1.0 (maximally different distributions).
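The empty-sequence rules amount to two early returns before the formula is applied. The following is a minimal sketch, assuming an empty sequence is represented by an all-zero histogram; `chi2_with_empty_rule` is a hypothetical name and this is not the library's implementation.

```python
import numpy as np


def chi2_with_empty_rule(hist_a, hist_b):
    """Chi-squared distance with the documented empty-sequence conventions."""
    a = np.asarray(hist_a, dtype=np.float64)
    b = np.asarray(hist_b, dtype=np.float64)
    # Assumption: an empty sequence yields a histogram whose weights sum to zero.
    empty_a = a.sum() == 0
    empty_b = b.sum() == 0
    if empty_a and empty_b:
        return 0.0  # both empty: identical empty distributions
    if empty_a or empty_b:
        return 1.0  # exactly one empty: fixed distance of 1.0
    p_a, p_b = a / a.sum(), b / b.sum()
    denom = p_a + p_b
    mask = denom > 0  # skip categories absent from both histograms
    return float(np.sqrt(np.sum((p_a[mask] - p_b[mask]) ** 2 / denom[mask])))


both_empty = chi2_with_empty_rule([0, 0], [0, 0])
one_empty = chi2_with_empty_rule([0, 0], [1, 2])
```

Handling the empty cases up front also avoids a division by zero during normalisation.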
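Example:
from tanat.metric.sequence.type.chi2 import Chi2SequenceMetric

# seq_a and seq_b are existing sequences; pool is an existing SequencePool.
chi2 = Chi2SequenceMetric(entity_feature="status")
d = chi2(seq_a, seq_b)  # distance between two sequences
dm = chi2.compute_matrix(pool)  # distances over the whole pool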
- MEMMAP_SUPPORT: bool = True
Set to True in subclasses that implement disk-backed (memmap) computation. When False, passing store_path or an instance-level StorageOptions raises NotImplementedError early with a clear message.
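The behaviour described for this flag follows a common class-attribute guard pattern. The sketch below shows that pattern in isolation; the classes `BaseMetricSketch` and `MemmapMetricSketch` are hypothetical and do not reproduce the actual `SequenceMetric` base class or its `StorageOptions` handling.

```python
class BaseMetricSketch:
    """Hypothetical base class: disk-backed storage is opt-in per subclass."""

    MEMMAP_SUPPORT = False

    def __init__(self, store_path=None):
        # Fail early, before any computation, if storage was requested
        # but this class does not implement it.
        if store_path is not None and not self.MEMMAP_SUPPORT:
            raise NotImplementedError(
                f"{type(self).__name__} does not support disk-backed (memmap) computation."
            )
        self.store_path = store_path


class MemmapMetricSketch(BaseMetricSketch):
    """Hypothetical subclass that declares memmap support."""

    MEMMAP_SUPPORT = True


metric = MemmapMetricSketch(store_path="distances.npy")
```

Checking the flag in the constructor means an unsupported configuration is rejected at instantiation rather than partway through a long matrix computation.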
- SETTINGS_CLASS
alias of Chi2Settings
- __init__(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32') → None
- prepare_batch_data(pool: SequencePool) → tuple
Build histogram arrays for all sequences in pool.
- Returns:
(hists, n_cats) where hists is a float32 numpy array of shape (n, n_cats) containing raw (unnormalised) weights, with rows ordered to match pool.unique_ids.
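The shape and row-ordering contract of the return value can be sketched as follows. This is an illustration only: `stack_histograms` and its dictionary input are hypothetical, the real method takes a `SequencePool` and uses Polars, and giving a zero row to an id with no recorded weights is an assumption.

```python
import numpy as np


def stack_histograms(weights_by_id, unique_ids, categories):
    """Stack per-sequence raw weights into an (n, n_cats) float32 array.

    weights_by_id: {sequence_id: {category: raw_weight}} (hypothetical input).
    Row k of the result corresponds to unique_ids[k].
    """
    index = {cat: k for k, cat in enumerate(categories)}
    hists = np.zeros((len(unique_ids), len(categories)), dtype=np.float32)
    for row, seq_id in enumerate(unique_ids):
        # Assumption: an id with no recorded weights keeps an all-zero row.
        for cat, weight in weights_by_id.get(seq_id, {}).items():
            hists[row, index[cat]] += weight
    return hists, len(categories)


weights = {"s2": {"b": 3.0}, "s1": {"a": 1.0, "b": 1.0}}
hists, n_cats = stack_histograms(weights, ["s1", "s2", "s3"], ["a", "b"])
```

Keeping the rows aligned with the id order is what lets entry `(i, j)` of the resulting distance matrix be mapped back to a pair of sequence ids.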
- validate_composition(seq_a: Sequence, seq_b: Sequence | None = None) → None
Resolve and validate the target feature.
If entity_feature was not specified, resolves to the first entity feature of seq_a and stores it in target_feature. Then checks that the feature is present and categorical in every provided sequence.
- Parameters:
seq_a – Primary sequence.
seq_b – Optional second sequence.
- Raises:
KeyError – If the feature is absent from a sequence.
TypeError – If the feature is not categorical.
- class tanat.metric.sequence.type.chi2.metric.Chi2Settings(*, entity_feature: str | None = None)
Bases: object
Settings for Chi2SequenceMetric.
- Parameters:
entity_feature – Categorical feature name used as the histogram key (same semantics as HammingEntityMetric). None → resolved from the first entity feature of the sequence at validate_composition() time.
Module contents
Chi2SequenceMetric package.
- class tanat.metric.sequence.type.chi2.Chi2SequenceMetric(entity_feature: str | None = None, *, store_path: str | Path | None = None, chunk_size: int = 500, resume: bool = True, dtype: str = 'float32')
Bases: SequenceMetric
Chi-squared distance between the state-time distributions of two sequences. This package-level entry repeats tanat.metric.sequence.type.chi2.metric.Chi2SequenceMetric; its attributes and methods are as documented under the metric module.
- class tanat.metric.sequence.type.chi2.Chi2Settings(*, entity_feature: str | None = None)
Bases: object
Settings for Chi2SequenceMetric. This package-level entry repeats tanat.metric.sequence.type.chi2.metric.Chi2Settings; its parameters are as documented under the metric module.