tanat.store.base package#

Submodules#

tanat.store.base.static module#

StaticStoreMixin: shared static-feature logic.

Both SequenceStore and TrajectoryStore follow the same pattern:

  • A physical static_features.arrow on disk (optional).

  • A virtual layer under tmp/<virtual_id>/static_features.arrow.

  • Merge at read time via horizontal concatenation.

  • Split physical / virtual at write / drop time.

This mixin extracts that shared logic. It is a pure mixin that relies on attributes provided by BaseStore (_root_path, _virtual, main_index, main_id_col).

class tanat.store.base.static.StaticStoreMixin[source]#

Bases: object

Mixin that manages static features (physical + virtual).

Expects the host class to provide:

  • self._root_path: Path

  • self._virtual: VirtualStore

  • self.main_index: pl.LazyFrame

  • self.main_id_col: str

FILE_STATIC_FEATURES: Final[str] = 'static_features.arrow'[source]#

add_static(virtual_id: str, df: DataFrame | LazyFrame | DataFrame, *, id_col: str) list[str][source]#

Add static feature columns into a virtual context via a LEFT JOIN.

The input df must contain the internal ID column named id_col. A LEFT JOIN against the main index guarantees that the output has exactly one row per ID (nulls for IDs absent from df), so no height validation is required by the virtual store.

Parameters:
  • virtual_id – Virtual context identifier.

  • df – DataFrame carrying id_col plus one or more feature columns.

  • id_col – Name of the ID column in df (already renamed to the internal store ID before this call).

Returns:

The list of feature column names written (excludes id_col).

drop_static(features: list[str], virtual_id: str | None = None) None[source]#

Permanently removes static-feature columns (physical and/or virtual).

get_static_data(virtual_id: str | None = None) LazyFrame | None[source]#

Return the static features with the ID column prepended.

static(virtual_id: str | None = None) LazyFrame | None[source]#

Return the static feature columns only (physical + virtual), without the ID column.

Virtual features take precedence: any physical column whose name is also present in the virtual context is silently shadowed, so the virtual value is always returned.

static_features(virtual_id: str | None = None) list[str][source]#

List of static-feature column names (physical + virtual when virtual_id is set).

tanat.store.base.store module#

BaseStore: abstract base class shared by SequenceStore and TrajectoryStore.

class tanat.store.base.store.BaseStore(root_path: str | Path)[source]#

Bases: ABC, StaticStoreMixin

Abstract base class for all stores.

Subclasses must define two class-level attributes:

  • _MAIN_INDEX_PROPERTY: str - name of the property that returns the main navigation index (e.g. "sequence_index" or "trajectory_index").

  • _MAIN_ID_PROPERTY: str - name of the property that returns the ID column name (e.g. "seq_id_col" or "traj_id_col").

__init__(root_path: str | Path) None[source]#

clear_virtual_context(virtual_id: str) None[source]#

Removes a virtual context directory and all its feature files.

property core: dict[source]#

Returns the static store facts written once at build time.

fork_virtual_context(source_virtual_id: str | None) str | None[source]#

Fork source_virtual_id into a new context, or return None if there is nothing to inherit.

Returns None immediately when source_virtual_id is None; otherwise delegates to fork_context().

property main_id_col: str[source]#

Name of the ID column in the main index.

property main_index: LazyFrame[source]#

Primary ID index (e.g. sequence_index or trajectory_index).

property root_path: Path[source]#

Root directory of this store.

static write_metadata_json(metadata, path: Path, filename: str = 'metadata.json') None[source]#

Writes metadata to path / filename.

Works with any metadata object that implements to_json_dict().
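
The duck-typed contract above (any object with a to_json_dict() method) can be sketched with the standard library. The class `DemoMetadata` and the function `write_metadata_json_sketch` are hypothetical stand-ins; the real method may format or write the file differently.

```python
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class DemoMetadata:
    """Hypothetical metadata object; only to_json_dict() matters here."""

    name: str
    n_rows: int

    def to_json_dict(self) -> dict:
        return asdict(self)


def write_metadata_json_sketch(metadata, path: Path, filename: str = "metadata.json") -> None:
    # Assumption: the metadata is serialised as plain JSON at path / filename.
    (path / filename).write_text(json.dumps(metadata.to_json_dict(), indent=2))


with tempfile.TemporaryDirectory() as tmp:
    write_metadata_json_sketch(DemoMetadata("demo", 3), Path(tmp))
    loaded = json.loads((Path(tmp) / "metadata.json").read_text())
```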

tanat.store.base.utils module#

Shared Arrow I/O utilities for the store layer.

All functions are stateless and operate directly on paths / LazyFrames.

tanat.store.base.utils.atomic_write(lf: LazyFrame, path: Path) None[source]#

Write a LazyFrame to an Arrow file atomically: the data is written to a temporary file, which is then renamed to path.
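
The write-then-rename pattern can be illustrated with the standard library alone. The sketch below writes raw bytes rather than an Arrow file, and `atomic_write_bytes` is a hypothetical name; it shows the general technique, not this function's actual code.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(data: bytes, path: Path) -> None:
    """Hypothetical helper: write to a sibling temp file, then rename over path."""
    # The temp file lives in the target directory so the rename stays on
    # one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(data)
        # os.replace swaps the file in a single step, overwriting any old copy.
        os.replace(tmp_name, path)
    except BaseException:
        # Remove the partial temp file, then re-raise the original error.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "static_features.arrow"
    atomic_write_bytes(b"first", target)
    atomic_write_bytes(b"second", target)
    content = target.read_bytes()
    leftovers = [p.name for p in Path(tmp).iterdir() if p.name != target.name]
```

Readers therefore see either the old file or the new one, never a half-written file.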

tanat.store.base.utils.check_no_reserved_names(cols: list[str], reserved: frozenset[str], *, context: str = 'reserved') None[source]#

Raise ValueError if any column in cols collides with reserved.

Used at both the store layer (internal schema names) and the pool layer (user-facing structural names such as the ID or time columns).

Parameters:
  • cols – Incoming column names to validate.

  • reserved – Set of forbidden names.

  • context – Human-readable description of the reserved set, shown in the error message (e.g. "internal store columns" or "time columns").

Raises:

ValueError – If any name in cols is in reserved.
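
A minimal sketch of such a collision check, assuming a simple set intersection. The function name `check_no_reserved_sketch`, the sample column names, and the error message wording are illustrative only.

```python
def check_no_reserved_sketch(
    cols: list[str], reserved: frozenset[str], *, context: str = "reserved"
) -> None:
    # Names present in both collections are collisions.
    clashes = sorted(set(cols) & reserved)
    if clashes:
        # Hypothetical message format; the real wording may differ.
        raise ValueError(f"Columns {clashes} collide with {context}.")


RESERVED = frozenset({"seq_id", "time"})  # illustrative reserved set

# No collision: returns silently.
check_no_reserved_sketch(["age", "sex"], RESERVED)

# Collision: the error names the offending column and the context.
try:
    check_no_reserved_sketch(["age", "time"], RESERVED, context="time columns")
except ValueError as exc:
    message = str(exc)
else:
    message = ""
```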

tanat.store.base.utils.drop_columns_from_file(path: Path, columns: list[str]) bool[source]#

Removes columns from an IPC file on disk.

Columns not present in the file are silently ignored. If no columns remain after the drop, the file is deleted.

Returns:

True if the file was modified, False otherwise.

tanat.store.base.utils.get_column_names(df: DataFrame | LazyFrame | DataFrame) list[str][source]#

Return column names without triggering a PerformanceWarning on LazyFrames.

Parameters:

df – Any supported tabular format.

Returns:

Ordered list of column names.

tanat.store.base.utils.hconcat_datasets(*datasets: LazyFrame | None) LazyFrame | None[source]#

Horizontally concatenates non-None, non-empty LazyFrames.

Returns None when no data is available.

tanat.store.base.utils.hconcat_physical_virtual(physical_lf: LazyFrame | None, virtual_lf: LazyFrame | None) LazyFrame | None[source]#

Horizontal concat of physical and virtual feature frames.

Virtual columns take precedence: any physical column whose name is also present in virtual_lf is silently shadowed via select, so the virtual value is always returned. This avoids DuplicateError when a feature has been overridden in the virtual context after a save() baked it into the physical file.

Parameters:
  • physical_lf – Physical feature frame (may be None).

  • virtual_lf – Virtual feature frame (may be None).

Returns:

A merged LazyFrame, or None when both inputs are absent / empty.

tanat.store.base.utils.infer_features(df: DataFrame | LazyFrame | DataFrame, *, exclude: set[str]) list[str][source]#

Return all column names from df that are not in exclude.

Parameters:
  • df – Input DataFrame or LazyFrame.

  • exclude – Structural column names to exclude.

Returns:

Ordered list of feature column names.

Raises:

ValueError – If no feature columns remain after exclusion.
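
On a plain list of column names, the exclusion logic amounts to an order-preserving filter. `infer_feature_names` and the sample column names are hypothetical; the real function accepts a DataFrame or LazyFrame rather than a list.

```python
def infer_feature_names(columns: list[str], *, exclude: set[str]) -> list[str]:
    # Preserve the original column order while filtering.
    features = [c for c in columns if c not in exclude]
    if not features:
        raise ValueError("No feature columns remain after exclusion.")
    return features


names = infer_feature_names(["seq_id", "time", "age", "sex"], exclude={"seq_id", "time"})

# Excluding every column is an error.
try:
    infer_feature_names(["seq_id"], exclude={"seq_id"})
    raised = False
except ValueError:
    raised = True
```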

tanat.store.base.utils.normalise_to_lazyframe(df: DataFrame | LazyFrame | DataFrame) LazyFrame[source]#

Convert a DataFrame (pandas, Polars eager, or Polars lazy) to a LazyFrame.

Parameters:

df – Input data in any supported tabular format.

Returns:

A polars.LazyFrame wrapping df.

tanat.store.base.utils.scan_if_exists(path: Path) LazyFrame | None[source]#

Scans an IPC file if it exists, otherwise returns None.

tanat.store.base.utils.validate_required_columns(df: DataFrame | LazyFrame | DataFrame, required: set[str]) None[source]#

Raise ValueError if any column in required is missing from df.

Parameters:
  • df – Input DataFrame or LazyFrame.

  • required – Set of column names that must be present.

Raises:

ValueError – If any column in required is absent from df.

tanat.store.base.virtual module#

VirtualStore: manages temporary feature engineering in ./tmp/<virtual_id>/.

class tanat.store.base.virtual.VirtualStore(root_path: Path)[source]#

Bases: object

Manages temporary (virtual) feature storage under <root>/tmp/<virtual_id>/.

The VirtualStore is used by stores (SequenceStore, TrajectoryStore) to handle ephemeral feature engineering contexts:

  • Creating / clearing virtual contexts

  • Time-index overrides (sequence store only, e.g. for time-index conversions)

  • Reading / writing virtual feature files

  • Listing virtual feature names

__init__(root_path: Path) None[source]#

add_entity_features(virtual_id: str, df: DataFrame | LazyFrame | DataFrame, *, expected_height: int) list[str][source]#

Add positional entity features into the virtual context.

Validates that the input row count matches expected_height before writing. The virtual context directory is created if absent. Existing columns with the same name are always replaced.

Parameters:
  • virtual_id – Virtual context identifier.

  • df – Feature-only DataFrame (no ID column), positionally aligned with the entity rows in the store.

  • expected_height – Expected row count (validated against the time index length).

Returns:

The list of column names written.

Raises:

ValueError – If df height does not match expected_height.

add_static_features(virtual_id: str, df: DataFrame | LazyFrame | DataFrame) list[str][source]#

Add pre-aligned static features into the virtual context.

No height validation is performed: the caller (StaticStoreMixin) is responsible for producing a correctly shaped frame via a LEFT JOIN against the main index. The virtual context directory is created if absent. Existing columns with the same name are always replaced.

Parameters:
  • virtual_id – Virtual context identifier.

  • df – Feature-only DataFrame already aligned to the main index (one row per ID, nulls for absent IDs).

Returns:

The list of column names written.

clear() None[source]#

Removes the entire ./tmp/ directory tree.

clear_context(virtual_id: str) None[source]#

Removes a single virtual context directory.

create(virtual_id: str) None[source]#

Initialise a virtual context directory for virtual_id.

Does not raise if the directory already exists.

drop_features(virtual_id: str, features: list[str], is_static: bool = False) None[source]#

Physically removes feature columns from a virtual context.

Columns not present in the virtual file are silently ignored.

Parameters:
  • virtual_id – The virtual context identifier.

  • features – Column names to remove.

  • is_static – If True, operate on the static feature file; otherwise on the entity feature file.

exists(virtual_id: str) bool[source]#

Returns True if the virtual context directory exists.

features(virtual_id: str, is_static: bool = False) LazyFrame | None[source]#

Scans the virtual feature file; returns None when absent.

fork_context(source_virtual_id: str) str | None[source]#

Copy source_virtual_id into a fresh context; return None if nothing to inherit.

Returns None when the source directory is absent or empty, preserving the invariant that _virtual_id is None means “no virtual content”. The caller is responsible for the source_virtual_id is None guard (see fork_virtual_context()). Callers that unconditionally need a real UUID (e.g. temporal-conversion helpers that are about to write) should chain with new_context():

    uuid = self.fork_virtual_context(virtual_id) or self._virtual.new_context()

Parameters:

source_virtual_id – Existing non-null context UUID to inherit from.

Returns:

The new virtual context identifier (a UUID string), or None.
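
The fork-or-create idiom can be illustrated with a standard-library sketch. The free functions `new_context_sketch` and `fork_context_sketch` are hypothetical stand-ins that take the tmp/ root explicitly; they assume a context is a plain directory copied with shutil.copytree, which may not match the actual implementation.

```python
import shutil
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def new_context_sketch(tmp_root: Path) -> str:
    # Create an empty directory named after a fresh UUID.
    virtual_id = str(uuid.uuid4())
    (tmp_root / virtual_id).mkdir(parents=True)
    return virtual_id


def fork_context_sketch(tmp_root: Path, source_virtual_id: str) -> Optional[str]:
    source = tmp_root / source_virtual_id
    # Nothing to inherit when the source is absent or empty.
    if not source.is_dir() or not any(source.iterdir()):
        return None
    virtual_id = str(uuid.uuid4())
    shutil.copytree(source, tmp_root / virtual_id)
    return virtual_id


with tempfile.TemporaryDirectory() as root:
    tmp_root = Path(root) / "tmp"
    source_id = new_context_sketch(tmp_root)
    empty_fork = fork_context_sketch(tmp_root, source_id)  # source is still empty
    (tmp_root / source_id / "static_features.arrow").write_bytes(b"demo")
    forked_id = fork_context_sketch(tmp_root, source_id)
    forked_files = sorted(p.name for p in (tmp_root / forked_id).iterdir())
    # Chain with new_context when a real UUID is required unconditionally.
    fallback_id = fork_context_sketch(tmp_root, "missing") or new_context_sketch(tmp_root)
    fallback_exists = (tmp_root / fallback_id).is_dir()
```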

new_context() str[source]#

Create a fresh empty virtual context directory and return its UUID.

Use this when you need a real UUID to write into and there is no source context to inherit from (or fork_context() returned None).

Returns:

A new UUID string; the corresponding tmp/<uuid>/ directory is guaranteed to exist.

time_index(virtual_id: str) LazyFrame | None[source]#

Scan the virtual time-index override for virtual_id.

Returns None when no time_index.arrow has been written for this context.

Parameters:

virtual_id – Virtual context identifier.

Returns:

A polars.LazyFrame of the override time-index rows, or None if absent.

write_time_index(virtual_id: str, time_index_lf: LazyFrame) None[source]#

Write a time-index override into the virtual context.

Creates the context directory if it does not already exist, then writes time_index_lf as time_index.arrow inside tmp/<virtual_id>/.

Parameters:
  • virtual_id – Virtual context identifier.

  • time_index_lf – LazyFrame containing the new time index columns.

Module contents#

Package stub.