vxdf package

Submodules

vxdf.auth module

Centralised helpers for authentication credentials.

This module avoids importing heavy cloud SDKs at import time; it only touches os.environ and lightweight stdlib modules. Third-party cloud libraries still perform their own credential resolution, but friendly errors are surfaced early so users know what to configure.

vxdf.auth.ensure_aws_credentials() → None

Raise AuthenticationError if AWS credentials cannot be resolved.

We rely on boto3’s default credential chain (env vars, AWS CLI config, IAM roles). To fail fast with a clearer error, credential resolution is tested early so the user gets an actionable message before a download starts.

vxdf.auth.ensure_gcp_credentials() → None: Raise AuthenticationError if Application Default Credentials are missing.

vxdf.auth.get_openai_api_key(cli_key: str | None = None) → str

Return an OpenAI API key from CLI flag, env var, or user config.

Lookup order (first match wins):

1. cli_key argument passed by the caller.
2. Environment variable OPENAI_API_KEY.
3. openai.api_key field in ~/.vxdf/config.toml.

Raises: AuthenticationError – If no key is found by any method.
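The lookup order can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: resolve_openai_key is a hypothetical name, the config file is assumed to have been parsed into a nested mapping already, the exception class is a local stand-in, and treating an empty string as "not set" is an assumption.

```python
import os
from typing import Mapping, Optional


class AuthenticationError(Exception):
    """Local stand-in for vxdf.errors.AuthenticationError."""


def resolve_openai_key(
    cli_key: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
    config: Optional[Mapping[str, Mapping[str, str]]] = None,
) -> str:
    """Return the first non-empty key: CLI argument, then env var, then config."""
    candidates = (
        cli_key,                                          # 1. explicit argument
        environ.get("OPENAI_API_KEY"),                    # 2. environment variable
        (config or {}).get("openai", {}).get("api_key"),  # 3. parsed config.toml
    )
    for key in candidates:
        if key:  # assumption: empty strings count as missing
            return key
    raise AuthenticationError(
        "No OpenAI API key found; pass one explicitly, set OPENAI_API_KEY, "
        "or add openai.api_key to the config file."
    )
```

Passing environ and config as arguments keeps the sketch easy to test without touching the real environment or home directory.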

vxdf.auth.prompt_and_save_openai_key() → str | None

Interactively prompt the user for an OpenAI API key and save it.

Returns the key if one was entered, otherwise None. No prompt is shown if stdin is not a TTY (e.g., running in CI).
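The TTY check described above can be sketched as follows. This is an illustration only: prompt_for_key is a hypothetical name, the injectable ask callable exists purely for testability, and the saving step is left out because the storage details are not specified here.

```python
import getpass
import io
import sys
from typing import Callable, Optional, TextIO


def prompt_for_key(
    stdin: Optional[TextIO] = None,
    ask: Callable[[str], str] = getpass.getpass,
) -> Optional[str]:
    """Prompt only when attached to a terminal; return None otherwise."""
    stream = sys.stdin if stdin is None else stdin
    if stream is None or not stream.isatty():
        return None  # non-interactive (e.g. CI): never block waiting for input
    key = ask("OpenAI API key: ").strip()
    return key or None  # an empty answer counts as "no key entered"


# A StringIO object is not a TTY, so no prompt is shown here.
result = prompt_for_key(stdin=io.StringIO("ignored"))
```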

vxdf.cli module

Command-line utilities for working with VXDF files.

Usage examples:

    # Show basic info about a VXDF file
    python -m vxdf info sample.vxdf

    # List document IDs in a VXDF file
    python -m vxdf list sample.vxdf

    # Extract a document by ID to stdout (pretty-printed JSON)
    python -m vxdf get sample.vxdf doc_123 > doc.json

    # Pack JSON lines into a VXDF file
    # (expects each line to be a JSON object with id, text, vector)
    python -m vxdf pack input.jsonl output.vxdf --embedding-dim 768 --compression zlib
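As an illustration of the input the pack command expects, the sketch below writes a small newline-delimited JSON file using only the standard library. The write_jsonl helper and the sample records are hypothetical, and the vector length is assumed to need to match the --embedding-dim value.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, Iterable

EMBEDDING_DIM = 768  # assumed to match the --embedding-dim passed to `pack`
REQUIRED_FIELDS = {"id", "text", "vector"}


def write_jsonl(path: Path, rows: Iterable[Dict[str, Any]]) -> int:
    """Write one JSON object per line and return the number of rows written."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for row in rows:
            missing = REQUIRED_FIELDS - row.keys()
            if missing:
                raise ValueError(f"record is missing fields: {sorted(missing)}")
            fh.write(json.dumps(row) + "\n")
            count += 1
    return count


# Placeholder vectors; real ones would come from an embedding model.
records = [
    {"id": "doc_1", "text": "First document.", "vector": [0.0] * EMBEDDING_DIM},
    {"id": "doc_2", "text": "Second document.", "vector": [0.1] * EMBEDDING_DIM},
]
out_path = Path(tempfile.mkdtemp()) / "input.jsonl"
written = write_jsonl(out_path, records)
```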

vxdf.cli.build_parser() → ArgumentParser

vxdf.cli.cmd_convert(args: Namespace) → None

vxdf.cli.cmd_get(args: Namespace) → None

vxdf.cli.cmd_info(args: Namespace) → None

vxdf.cli.cmd_list(args: Namespace) → None

vxdf.cli.cmd_merge(args: Namespace) → None

vxdf.cli.cmd_pack(args: Namespace) → None: Pack newline-delimited JSON into a VXDF file.

vxdf.cli.cmd_split(args: Namespace) → None

vxdf.cli.cmd_update(args: Namespace) → None

vxdf.cli.main(argv: List[str] | None = None) → None

vxdf.errors module

Centralised VXDF exception hierarchy.

All public APIs raise these exceptions instead of bare ValueError, KeyError, etc. This makes it straightforward for calling code to handle specific failure modes.

exception vxdf.errors.AuthenticationError

Bases: VXDFError

Authentication credentials are missing or invalid.

exception vxdf.errors.ChecksumMismatchError

Bases: VXDFError

File checksum does not match footer value (file may be corrupted).

exception vxdf.errors.ChunkNotFoundError

Bases: VXDFError

Requested document ID not present in the offset index.

exception vxdf.errors.CompressionError

Bases: VXDFError

Error occurred during compression or decompression.

exception vxdf.errors.DuplicateDocumentIDError

Bases: VXDFError

Attempted to add a chunk with a duplicate document ID.

exception vxdf.errors.EncryptionError

Bases: VXDFError

Chunk is encrypted but key is missing or decryption failed.

exception vxdf.errors.InvalidChunkError

Bases: VXDFError

Chunk data failed validation (e.g., wrong embedding dimension).

exception vxdf.errors.InvalidFooterError

Bases: VXDFError

Footer or end marker is missing or malformed.

exception vxdf.errors.InvalidHeaderError

Bases: VXDFError

Header is missing or malformed.

exception vxdf.errors.MissingDependencyError

Bases: VXDFError

Optional library needed for this operation is not installed.

exception vxdf.errors.NetworkError

Bases: VXDFError

Network request failed (e.g., download error, timeout).

exception vxdf.errors.VXDFError

Bases: Exception

Base class for all VXDF-related errors.
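A shared base class lets callers handle one failure mode specifically while still catching everything VXDF-related at the application boundary. The sketch below uses local stand-in classes that mirror the documented names so that it runs without the package; lookup_or_none, run_safely and fake_get_chunk are hypothetical helpers.

```python
from typing import Any, Callable, Dict, Optional


class VXDFError(Exception):
    """Stand-in for vxdf.errors.VXDFError."""


class ChunkNotFoundError(VXDFError):
    """Stand-in for vxdf.errors.ChunkNotFoundError."""


class ChecksumMismatchError(VXDFError):
    """Stand-in for vxdf.errors.ChecksumMismatchError."""


def lookup_or_none(
    get_chunk: Callable[[str], Dict[str, Any]], doc_id: str
) -> Optional[Dict[str, Any]]:
    """Treat a missing ID as a normal outcome; other VXDF errors propagate."""
    try:
        return get_chunk(doc_id)
    except ChunkNotFoundError:
        return None


def run_safely(action: Callable[[], Any]) -> str:
    """Catch any VXDF-related failure via the shared base class."""
    try:
        action()
        return "ok"
    except VXDFError as exc:
        return f"VXDF failure: {type(exc).__name__}"


# Hypothetical in-memory stand-in for a reader's get_chunk method.
_store = {"doc_1": {"id": "doc_1", "text": "hello"}}


def fake_get_chunk(doc_id: str) -> Dict[str, Any]:
    if doc_id not in _store:
        raise ChunkNotFoundError(doc_id)
    return _store[doc_id]
```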

vxdf.ingest module

High-level helpers for converting common data files into VXDF.

This module is intentionally dependency-light: heavy optional deps are imported lazily so that users who only need core VXDF functionality are not forced to install PDF or ML libraries.
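Lazy importing of optional dependencies is commonly done along the following lines. This is a sketch, not the module's actual code: require and parse_json_text are hypothetical helpers, and MissingDependencyError is a local stand-in for the class in vxdf.errors.

```python
import importlib
from types import ModuleType
from typing import Any


class MissingDependencyError(Exception):
    """Local stand-in for vxdf.errors.MissingDependencyError."""


def require(module_name: str, feature: str) -> ModuleType:
    """Import module_name on demand, raising a friendly error if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"{feature} requires the optional package '{module_name}'; "
            "install it and retry."
        ) from exc


def parse_json_text(text: str) -> Any:
    # The import happens only when this function is called, not when the
    # enclosing module is imported. The stdlib json module stands in for a
    # heavy optional dependency here.
    json_mod = require("json", "JSON parsing")
    return json_mod.loads(text)
```

Users who never call the feature never pay the import cost, and those who do get an actionable message instead of a bare ImportError.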

vxdf.ingest.convert(input_path: str | Path, output_path: str | Path, *, model: str = 'all-MiniLM-L6-v2', compression: str = 'none', openai_key: str | None = None, recursive: bool = False, show_progress: bool = True, resume: bool = False, workers: int = 1, detect_pii: bool = True, pii_patterns: List[str] | None = None) → None

Convert input_path to a VXDF file at output_path.

Each row / paragraph becomes a chunk with id/text/vector.

vxdf.ingest.detect_type(path: str | Path) → str

vxdf.merge_split module

Utilities to merge multiple VXDF files and split a large VXDF.

Phase-1 advanced CLI helpers.

Public functions

merge(output_path, input_paths, *, dedupe='skip', show_progress=True)
split(input_path, *, size_bytes=None, chunks_per_file=None, show_progress=True)

vxdf.merge_split.merge(output_path: str | Path, input_paths: Sequence[str | Path], *, dedupe: str = 'skip', show_progress: bool = True) → None

Merge input_paths into output_path.

If dedupe is 'skip' (the default), duplicate IDs after the first occurrence are skipped. 'error' aborts on duplicates. 'firstwins' keeps the first chunk seen (same behaviour as 'skip', but explicit).
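The three dedupe policies can be illustrated on in-memory chunks. The sketch below is an assumption-laden model rather than the library's code: merge_chunks is a hypothetical helper operating on lists of dictionaries instead of files, and raising a DuplicateDocumentIDError stand-in for 'error' is assumed.

```python
from typing import Any, Dict, Iterable, Iterator, Set


class DuplicateDocumentIDError(Exception):
    """Local stand-in for vxdf.errors.DuplicateDocumentIDError."""


def merge_chunks(
    sources: Iterable[Iterable[Dict[str, Any]]], dedupe: str = "skip"
) -> Iterator[Dict[str, Any]]:
    """Yield chunks from all sources in order, applying the dedupe policy."""
    if dedupe not in {"skip", "firstwins", "error"}:
        raise ValueError(f"unknown dedupe policy: {dedupe!r}")
    seen: Set[str] = set()
    for source in sources:
        for chunk in source:
            doc_id = chunk["id"]
            if doc_id in seen:
                if dedupe == "error":
                    raise DuplicateDocumentIDError(doc_id)
                continue  # 'skip' and 'firstwins': keep the first occurrence
            seen.add(doc_id)
            yield chunk


first = [{"id": "x", "text": "a1"}, {"id": "y", "text": "a2"}]
second = [{"id": "y", "text": "b1"}, {"id": "z", "text": "b2"}]
merged_texts = [c["text"] for c in merge_chunks([first, second])]
```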

vxdf.merge_split.split(input_path: str | Path, *, size_bytes: int | None = None, chunks_per_file: int | None = None, show_progress: bool = True) → None

Split input_path into numbered shards.

Exactly one of size_bytes or chunks_per_file must be provided. Output files are written as <stem>-partNN.vxdf beside the input file.
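The argument rule and the shard naming scheme can be sketched as below. Both helpers are hypothetical; the two-digit zero padding and numbering from 1 are assumptions read off the <stem>-partNN.vxdf pattern, not confirmed behaviour.

```python
from pathlib import Path
from typing import List, Optional, Union


def check_split_args(
    size_bytes: Optional[int], chunks_per_file: Optional[int]
) -> None:
    """Raise ValueError unless exactly one of the two limits is given."""
    if (size_bytes is None) == (chunks_per_file is None):
        raise ValueError("Provide exactly one of size_bytes or chunks_per_file.")


def shard_paths(
    input_path: Union[str, Path], n_shards: int, start: int = 1
) -> List[Path]:
    """Build <stem>-partNN.vxdf paths in the same directory as the input file."""
    src = Path(input_path)
    return [
        src.with_name(f"{src.stem}-part{i:02d}.vxdf")  # assumed 2-digit padding
        for i in range(start, start + n_shards)
    ]


paths = shard_paths("data/corpus.vxdf", 2)
```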

vxdf.pii module

Lightweight regex-based PII detector used during ingestion.

The detector is deliberately simple and dependency-free so it can run on any machine with the Python stdlib. It searches for well-formatted patterns such as email addresses, US Social-Security numbers, credit-card numbers, IPv4/IPv6 addresses and US phone numbers.

The goal is good-enough recall for an initial tag-only pass; downstream tools or stricter detectors can refine the result.

vxdf.pii.compile_patterns(patterns: Iterable[str]) → List[Pattern[str]]: Compile a list of string patterns into regex Pattern objects.

vxdf.pii.contains_pii(text: str, *, patterns: List[Pattern[str]] | None = None) → bool

Return True if text matches any of the PII regexes.

This helper is intentionally fast; it stops at the first match.
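A simplified model of the two helpers is shown below to make the first-match behaviour concrete. The function names mirror the documented ones, but the bodies and the two example patterns are assumptions, not the module's built-in pattern set.

```python
import re
from typing import Iterable, List, Optional

# Illustrative patterns only; the real defaults are not reproduced here.
EXAMPLE_PATTERNS = [
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",  # email address
    r"\b\d{3}-\d{2}-\d{4}\b",                            # US SSN such as 123-45-6789
]


def compile_patterns(patterns: Iterable[str]) -> List[re.Pattern]:
    """Compile string patterns into regex Pattern objects."""
    return [re.compile(p) for p in patterns]


def contains_pii(
    text: str, *, patterns: Optional[List[re.Pattern]] = None
) -> bool:
    """Return True as soon as any pattern matches."""
    active = compile_patterns(EXAMPLE_PATTERNS) if patterns is None else patterns
    # any() consumes the generator lazily, so scanning stops at the first match.
    return any(p.search(text) for p in active)
```

Callers that scan many chunks would likely compile the patterns once and pass them in, rather than recompiling on every call.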

vxdf.reader module

class vxdf.reader.VXDFReader(file_path: str)

Bases: object

Reads and parses a VXDF file.

close() → None: Closes the file handle.

property embedding_dim: int | None

get_chunk(doc_id: str) → Dict[str, Any]: Retrieves a single chunk by its document ID.

iter_chunks() → Iterator[Dict[str, Any]]: Yields all data chunks in the order they appear in the file.

property vxdf_version: str | None

vxdf.update module

Incremental update helpers: append new documents to an existing VXDF.

This is phase-1 functionality covering the incremental updates scenario.

Usage (Python):

    from vxdf.update import update
    update("corpus.vxdf", "new_docs", recursive=True)

CLI:

    python -m vxdf update corpus.vxdf new_docs/ -r

By default duplicate document IDs are skipped; use --dedupe overwrite to replace existing chunks or --dedupe error to abort on duplicates.

vxdf.update.update(vxdf_path: str | Path, input_path: str | Path, *, model: str = 'all-MiniLM-L6-v2', dedupe: str = 'skip', openai_key: str | None = None, recursive: bool = False, show_progress: bool = True) → None

Append new documents from input_path into an existing VXDF.

Parameters:

vxdf_path – Path to an existing .vxdf file (will be modified in-place using an atomic temp-file swap).
input_path – File or directory containing new data (same formats as vxdf.ingest.convert).
model – Embedding model name. (Intended to be ignored when documents already contain a vector, but that case is not currently supported.)
dedupe – How to handle duplicate IDs – skip (default), overwrite (replace old chunk), or error.
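The in-place modification via an atomic temp-file swap mentioned for vxdf_path typically follows the pattern below. This is a general sketch of the technique, not the library's code; atomic_write_bytes is a hypothetical helper and the byte strings are placeholders.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, data: bytes) -> None:
    """Write data to a temp file beside target, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(
        dir=str(target.parent), prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        # os.replace is atomic when both paths are on the same filesystem,
        # which is why the temp file is created next to the target.
        os.replace(tmp_name, str(target))
    except BaseException:
        # On failure, remove the temp file and leave the original untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


workdir = Path(tempfile.mkdtemp())
target = workdir / "corpus.vxdf"
target.write_bytes(b"old contents")
atomic_write_bytes(target, b"old contents + new chunks")
```

Readers of the target path see either the old file or the complete new one, never a partially written file.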

vxdf.writer module

class vxdf.writer.VXDFWriter(file_path: str, embedding_dim: int, *, compression: str = 'none', fields: Dict[str, str] | None = None)

Bases: object

Writes data to a VXDF file.

add_chunk(chunk_data: Dict[str, Any]) → None

Adds a single data chunk to the file.

Parameters: chunk_data (dict) – A dictionary containing the data for one chunk (e.g., id, text, vector).

close() → None: Finalizes the VXDF file by writing the offset index and footer.

Module contents

VXDF Python library.

High-level public API re-exports for easy import:

from vxdf import VXDFWriter, VXDFReader

class vxdf.VXDFReader(file_path: str)

Bases: object

Reads and parses a VXDF file.

close() → None: Closes the file handle.

property embedding_dim: int | None

get_chunk(doc_id: str) → Dict[str, Any]: Retrieves a single chunk by its document ID.

iter_chunks() → Iterator[Dict[str, Any]]: Yields all data chunks in the order they appear in the file.

property vxdf_version: str | None

class vxdf.VXDFWriter(file_path: str, embedding_dim: int, *, compression: str = 'none', fields: Dict[str, str] | None = None)

Bases: object

Writes data to a VXDF file.

add_chunk(chunk_data: Dict[str, Any]) → None

Adds a single data chunk to the file.

Parameters: chunk_data (dict) – A dictionary containing the data for one chunk (e.g., id, text, vector).

close() → None: Finalizes the VXDF file by writing the offset index and footer.