vxdf package

Submodules

vxdf.auth module

Centralised helpers for authentication credentials.

This module avoids importing heavy cloud SDKs at import-time; it only touches os.environ and lightweight stdlib modules. Third-party cloud libraries still perform their own credential resolution, but we surface friendly errors early so users know what to configure.

vxdf.auth.ensure_aws_credentials() → None

Raise AuthenticationError if AWS credentials cannot be resolved.

We rely on boto3’s default credential chain (env vars, AWS CLI config, IAM roles). To fail fast with a clearer error, we test credential resolution early so the user gets an actionable message before a download starts.

vxdf.auth.ensure_gcp_credentials() → None

Raise AuthenticationError if Application Default Credentials are missing.

vxdf.auth.get_openai_api_key(cli_key: str | None = None) → str

Return an OpenAI API key from CLI flag, env var, or user config.

Lookup order (first match wins):

1. cli_key argument passed by the caller.
2. Environment variable OPENAI_API_KEY.
3. openai.api_key field in ~/.vxdf/config.toml.

Raises:

AuthenticationError – If no key is found by any method.
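The lookup order above can be sketched as follows. This is a minimal illustration, not the library's implementation: resolve_openai_key, its env and config_path parameters, and the local AuthenticationError class are hypothetical stand-ins, and the config file is assumed to hold an [openai] table with an api_key entry.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

try:
    import tomllib  # stdlib TOML parser, Python 3.11+
except ModuleNotFoundError:  # older interpreters: the config lookup is skipped
    tomllib = None


class AuthenticationError(Exception):
    """Local stand-in for vxdf.errors.AuthenticationError."""


def resolve_openai_key(
    cli_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config_path: Optional[Path] = None,
) -> str:
    """Return the first key found: argument, env var, then config file."""
    env = os.environ if env is None else env
    if config_path is None:
        config_path = Path.home() / ".vxdf" / "config.toml"

    # 1. An explicit argument from the caller wins.
    if cli_key:
        return cli_key

    # 2. Environment variable.
    env_key = env.get("OPENAI_API_KEY")
    if env_key:
        return env_key

    # 3. openai.api_key in the TOML config file, if the file exists.
    if tomllib is not None and config_path.is_file():
        with config_path.open("rb") as fh:
            config = tomllib.load(fh)
        file_key = config.get("openai", {}).get("api_key")
        if file_key:
            return file_key

    raise AuthenticationError(
        "No OpenAI API key found: pass one explicitly, set OPENAI_API_KEY, "
        "or add openai.api_key to ~/.vxdf/config.toml."
    )
```

Passing env and config_path explicitly keeps the sketch easy to test without touching the real environment or home directory.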

vxdf.auth.prompt_and_save_openai_key() → str | None

Interactively prompt the user for an OpenAI API key and save it.

Returns the key if one was entered, otherwise None. No prompt is shown if stdin is not a TTY (e.g., running in CI).
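The TTY guard described above might look like the sketch below. prompt_for_key and its stdin parameter are hypothetical names, the prompt text is invented for illustration, and saving the key to the config file is deliberately left out.

```python
import getpass
import sys
from typing import Optional, TextIO


def prompt_for_key(stdin: Optional[TextIO] = None) -> Optional[str]:
    """Ask for a key only when attached to an interactive terminal."""
    stdin = sys.stdin if stdin is None else stdin
    # Non-interactive runs (CI, pipes) must never block on a prompt.
    if not stdin.isatty():
        return None
    key = getpass.getpass("OpenAI API key (leave blank to skip): ").strip()
    # An empty answer is treated as "no key entered".
    return key or None
```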

vxdf.cli module

Command-line utilities for working with VXDF files.

Usage examples:

# Show basic info about a VXDF file
python -m vxdf info sample.vxdf

# List document IDs in a VXDF file
python -m vxdf list sample.vxdf

# Extract a document by ID to stdout (pretty-printed JSON)
python -m vxdf get sample.vxdf doc_123 > doc.json

# Pack JSON lines into a VXDF file (expects each line to be a JSON object with id, text, vector)
python -m vxdf pack input.jsonl output.vxdf --embedding-dim 768 --compression zlib
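For reference, an input file in the shape the pack command expects (one JSON object per line with id, text and vector) could be produced with a short script like this. write_jsonl is a hypothetical helper, and the zero vectors are placeholders for real embeddings.

```python
import json
from pathlib import Path
from typing import Dict, Iterable


def write_jsonl(path: Path, records: Iterable[Dict]) -> None:
    """Write one compact JSON object per line."""
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


EMBEDDING_DIM = 768  # should match the --embedding-dim value given to pack

# Placeholder records; real vectors would come from an embedding model.
records = [
    {"id": f"doc_{i}", "text": f"Example document {i}", "vector": [0.0] * EMBEDDING_DIM}
    for i in range(3)
]

if __name__ == "__main__":
    write_jsonl(Path("input.jsonl"), records)
```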

vxdf.cli.build_parser() → ArgumentParser
vxdf.cli.cmd_convert(args: Namespace) → None
vxdf.cli.cmd_get(args: Namespace) → None
vxdf.cli.cmd_info(args: Namespace) → None
vxdf.cli.cmd_list(args: Namespace) → None
vxdf.cli.cmd_merge(args: Namespace) → None
vxdf.cli.cmd_pack(args: Namespace) → None

Pack newline-delimited JSON into a VXDF file.

vxdf.cli.cmd_split(args: Namespace) → None
vxdf.cli.cmd_update(args: Namespace) → None
vxdf.cli.main(argv: List[str] | None = None) → None

vxdf.errors module

Centralised VXDF exception hierarchy.

All public APIs raise these exceptions instead of bare ValueError, KeyError, etc. This makes it straightforward for calling code to handle specific failure modes.

exception vxdf.errors.AuthenticationError

Bases: VXDFError

Authentication credentials are missing or invalid.

exception vxdf.errors.ChecksumMismatchError

Bases: VXDFError

File checksum does not match footer value (file may be corrupted).

exception vxdf.errors.ChunkNotFoundError

Bases: VXDFError

Requested document ID not present in the offset index.

exception vxdf.errors.CompressionError

Bases: VXDFError

Error occurred during compression or decompression.

exception vxdf.errors.DuplicateDocumentIDError

Bases: VXDFError

Attempted to add a chunk with a duplicate document ID.

exception vxdf.errors.EncryptionError

Bases: VXDFError

Chunk is encrypted but key is missing or decryption failed.

exception vxdf.errors.InvalidChunkError

Bases: VXDFError

Chunk data failed validation (e.g., wrong embedding dimension).

exception vxdf.errors.InvalidFooterError

Bases: VXDFError

Footer or end marker is missing / malformed.

exception vxdf.errors.InvalidHeaderError

Bases: VXDFError

Header is missing or malformed.

exception vxdf.errors.MissingDependencyError

Bases: VXDFError

Optional library needed for this operation is not installed.

exception vxdf.errors.NetworkError

Bases: VXDFError

Network request failed (e.g., download error, timeout).

exception vxdf.errors.VXDFError

Bases: Exception

Base class for all VXDF-related errors.
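Because every exception derives from VXDFError, callers can handle one failure mode specifically and still catch the rest through the base class. The sketch below uses local stand-in classes so it runs without the package installed; lookup and describe are hypothetical helpers, not part of the library.

```python
class VXDFError(Exception):
    """Stand-in for vxdf.errors.VXDFError."""


class ChunkNotFoundError(VXDFError):
    """Stand-in for vxdf.errors.ChunkNotFoundError."""


class ChecksumMismatchError(VXDFError):
    """Stand-in for vxdf.errors.ChecksumMismatchError."""


def lookup(index: dict, doc_id: str) -> dict:
    """Hypothetical lookup that raises a VXDF-specific error, not KeyError."""
    try:
        return index[doc_id]
    except KeyError:
        raise ChunkNotFoundError(f"no chunk with id {doc_id!r}") from None


def describe(index: dict, doc_id: str) -> str:
    try:
        return lookup(index, doc_id)["text"]
    except ChunkNotFoundError:
        # Specific failure mode: recoverable, so fall back to a default.
        return "<missing>"
    except VXDFError as exc:
        # Any other library error is caught via the shared base class.
        return f"<error: {exc}>"
```

The order of the except clauses matters: the specific subclass has to come before the base class, otherwise the base-class handler would catch everything.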

vxdf.ingest module

High-level helpers for converting common data files into VXDF.

This module is intentionally dependency-light: heavy optional deps are imported lazily so that users who only need core VXDF functionality are not forced to install PDF or ML libraries.
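The lazy-import approach can be illustrated with a small helper. This is a sketch of the general pattern, assuming a hypothetical require function and a local stand-in for MissingDependencyError; the stdlib json module plays the role of a heavy optional dependency so the example runs anywhere.

```python
import importlib
from types import ModuleType


class MissingDependencyError(Exception):
    """Local stand-in for vxdf.errors.MissingDependencyError."""


def require(module_name: str, install_hint: str) -> ModuleType:
    """Import module_name on first use, with an actionable error if absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"{module_name!r} is required for this operation; {install_hint}"
        ) from exc


def parse_json_text(text: str):
    # The import happens here, at call time, not when this file is imported.
    json = require("json", "it ships with the standard library")
    return json.loads(text)
```

Users who never call the function that needs the optional library never pay its import cost, and those who do get a clear message instead of a bare ImportError.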

vxdf.ingest.convert(input_path: str | Path, output_path: str | Path, *, model: str = 'all-MiniLM-L6-v2', compression: str = 'none', openai_key: str | None = None, recursive: bool = False, show_progress: bool = True, resume: bool = False, workers: int = 1, detect_pii: bool = True, pii_patterns: List[str] | None = None) → None

Convert input_path to a VXDF file at output_path.

Each row / paragraph becomes a chunk with id/text/vector.

vxdf.ingest.detect_type(path: str | Path) → str

vxdf.merge_split module

Utilities to merge multiple VXDF files and split a large VXDF.

Phase-1 advanced CLI helpers.

Public functions

merge(output_path, input_paths, *, dedupe="skip", show_progress=True)
split(input_path, *, size_bytes=None, chunks_per_file=None, show_progress=True)

vxdf.merge_split.merge(output_path: str | Path, input_paths: Sequence[str | Path], *, dedupe: str = 'skip', show_progress: bool = True) → None

Merge input_paths into output_path.

dedupe controls how duplicate document IDs are handled: skip (default) keeps the first occurrence of an ID and skips later ones; firstwins behaves the same as skip but states the intent explicitly; error aborts on duplicates.
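The three dedupe policies can be sketched over in-memory chunks. merge_chunks is a hypothetical helper operating on plain dicts rather than VXDF files, and the assumption that the error policy raises DuplicateDocumentIDError is not confirmed by the source.

```python
from typing import Dict, Iterable, Iterator


class DuplicateDocumentIDError(Exception):
    """Local stand-in for vxdf.errors.DuplicateDocumentIDError."""


def merge_chunks(
    sources: Iterable[Iterable[Dict]], dedupe: str = "skip"
) -> Iterator[Dict]:
    """Yield chunks from each source in order, applying the dedupe policy."""
    if dedupe not in {"skip", "firstwins", "error"}:
        raise ValueError(f"unknown dedupe policy: {dedupe!r}")
    seen = set()
    for source in sources:
        for chunk in source:
            doc_id = chunk["id"]
            if doc_id in seen:
                if dedupe == "error":
                    raise DuplicateDocumentIDError(doc_id)
                continue  # skip / firstwins: keep the first occurrence only
            seen.add(doc_id)
            yield chunk
```

Since merge_chunks is a generator, nothing is validated or read until the result is iterated.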

vxdf.merge_split.split(input_path: str | Path, *, size_bytes: int | None = None, chunks_per_file: int | None = None, show_progress: bool = True) → None

Split input_path into numbered shards.

Exactly one of size_bytes or chunks_per_file must be provided. Output files are written as <stem>-partNN.vxdf beside the input file.
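The argument rule and the shard naming scheme can be sketched as below. check_split_args and shard_paths are hypothetical helpers; the source gives only the <stem>-partNN.vxdf pattern, so starting the numbering at 01 is an assumption, as is the use of ValueError.

```python
from pathlib import Path
from typing import List, Optional


def check_split_args(
    size_bytes: Optional[int], chunks_per_file: Optional[int]
) -> None:
    """Require exactly one of the two limits."""
    # Both None or both set compare equal here, and both cases are invalid.
    if (size_bytes is None) == (chunks_per_file is None):
        raise ValueError("provide exactly one of size_bytes or chunks_per_file")


def shard_paths(input_path: Path, count: int) -> List[Path]:
    """Name shards <stem>-partNN.vxdf next to the input file."""
    # Assumption: numbering starts at 01; the source shows only the NN pattern.
    return [
        input_path.with_name(f"{input_path.stem}-part{n:02d}.vxdf")
        for n in range(1, count + 1)
    ]
```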

vxdf.pii module

Lightweight regex-based PII detector used during ingestion.

The detector is deliberately simple and dependency-free so it can run on any machine with the Python stdlib. It searches for well-formatted patterns such as email addresses, US Social-Security numbers, credit-card numbers, IPv4/IPv6 addresses and US phone numbers.

The goal is good-enough recall for an initial tag-only pass; downstream tools or stricter detectors can refine the result.

vxdf.pii.compile_patterns(patterns: Iterable[str]) → List[Pattern[str]]

Compile a list of string patterns into regex Pattern objects.

vxdf.pii.contains_pii(text: str, *, patterns: List[Pattern[str]] | None = None) → bool

Return True if text matches any of the PII regexes.

This helper is intentionally fast; it stops at the first match.
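A regex-based check of this kind can be sketched with the standard re module. The functions below mirror the documented names but are local illustrations, and the two patterns are examples only; the library's built-in pattern set is not shown in the source.

```python
import re
from typing import Iterable, List, Optional

# Illustrative patterns only, not the library's actual defaults.
EXAMPLE_PATTERNS = [
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}",  # email address
    r"\b\d{3}-\d{2}-\d{4}\b",  # US Social Security number, dashed form
]


def compile_patterns(patterns: Iterable[str]) -> List[re.Pattern]:
    """Compile string patterns once so they can be reused across chunks."""
    return [re.compile(p) for p in patterns]


def contains_pii(text: str, *, patterns: Optional[List[re.Pattern]] = None) -> bool:
    if patterns is None:
        patterns = compile_patterns(EXAMPLE_PATTERNS)
    # any() short-circuits, so scanning stops at the first pattern that matches.
    return any(p.search(text) for p in patterns)
```

Compiling the patterns once and passing them in avoids recompiling for every chunk during ingestion.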

vxdf.reader module

class vxdf.reader.VXDFReader(file_path: str)

Bases: object

Reads and parses a VXDF file.

close() → None

Closes the file handle.

property embedding_dim: int | None

get_chunk(doc_id: str) → Dict[str, Any]

Retrieves a single chunk by its document ID.

iter_chunks() → Iterator[Dict[str, Any]]

Yields all data chunks in the order they appear in the file.

property vxdf_version: str | None

vxdf.update module

Incremental update helpers: append new documents to an existing VXDF.

This is phase-1 functionality covering the incremental updates scenario.

Usage (Python):

from vxdf.update import update

update("corpus.vxdf", "new_docs", recursive=True)

CLI:

python -m vxdf update corpus.vxdf new_docs/ -r

By default duplicate document IDs are skipped; use --dedupe overwrite to replace existing chunks or --dedupe error to abort on duplicates.

vxdf.update.update(vxdf_path: str | Path, input_path: str | Path, *, model: str = 'all-MiniLM-L6-v2', dedupe: str = 'skip', openai_key: str | None = None, recursive: bool = False, show_progress: bool = True) → None

Append new documents from input_path into an existing VXDF.

Parameters:
  • vxdf_path – Path to an existing .vxdf file (will be modified in-place using an atomic temp-file swap).

  • input_path – File or directory containing new data (same formats as vxdf.ingest.convert).

  • model – Embedding model name. Skipping embedding for documents that already contain a vector is not currently supported.

  • dedupe – How to handle duplicate IDs – skip (default), overwrite (replace old chunk), or error.
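The atomic temp-file swap mentioned for vxdf_path is a common pattern that can be sketched with the standard library. atomic_write_bytes is a hypothetical helper showing the general technique, not the library's own code: the new content goes to a temporary file in the same directory, which then replaces the original in one step.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, data: bytes) -> None:
    """Replace target with data so readers never see a partial file."""
    # The temp file lives in the same directory so os.replace stays on one
    # filesystem, which is what makes the final rename atomic.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, target)
    except BaseException:
        # On failure the original file is untouched; discard the temp file.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

If the process dies midway, the existing file is still intact and only a stray temporary file may be left behind.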

vxdf.writer module

class vxdf.writer.VXDFWriter(file_path: str, embedding_dim: int, *, compression: str = 'none', fields: Dict[str, str] | None = None)

Bases: object

Writes data to a VXDF file.

add_chunk(chunk_data: Dict[str, Any]) → None

Adds a single data chunk to the file.

Parameters:

chunk_data (dict) – A dictionary containing the data for one chunk (e.g., id, text, vector).

close() → None

Finalizes the VXDF file by writing the offset index and footer.

Module contents

VXDF Python library.

High-level public API re-exports for easy import:

from vxdf import VXDFWriter, VXDFReader

class vxdf.VXDFReader(file_path: str)

Bases: object

Reads and parses a VXDF file.

close() → None

Closes the file handle.

property embedding_dim: int | None

get_chunk(doc_id: str) → Dict[str, Any]

Retrieves a single chunk by its document ID.

iter_chunks() → Iterator[Dict[str, Any]]

Yields all data chunks in the order they appear in the file.

property vxdf_version: str | None

class vxdf.VXDFWriter(file_path: str, embedding_dim: int, *, compression: str = 'none', fields: Dict[str, str] | None = None)

Bases: object

Writes data to a VXDF file.

add_chunk(chunk_data: Dict[str, Any]) → None

Adds a single data chunk to the file.

Parameters:

chunk_data (dict) – A dictionary containing the data for one chunk (e.g., id, text, vector).

close() → None

Finalizes the VXDF file by writing the offset index and footer.