vxdf package
Submodules
vxdf.auth module
Centralised helpers for authentication credentials.
This module avoids importing heavy cloud SDKs at import-time; it only touches os.environ and lightweight stdlib modules. Third-party cloud libraries still perform their own credential resolution, but we surface friendly errors early so users know what to configure.
- vxdf.auth.ensure_aws_credentials() None
Raise AuthenticationError if AWS credentials cannot be resolved.
We rely on boto3's default credential chain (env vars, AWS CLI config, IAM roles). To fail fast with a clearer error, we test resolution early so the user gets an actionable message before a download starts.
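The fail-fast check described above can be sketched roughly as follows. This is a minimal illustration, not the library's actual implementation: check_aws_credentials and the local AuthenticationError class are hypothetical stand-ins, and the only boto3 call used is Session.get_credentials(), which returns None when no provider in the default chain succeeds.

```python
class AuthenticationError(Exception):
    """Hypothetical stand-in for vxdf.errors.AuthenticationError."""


def check_aws_credentials() -> None:
    """Raise early if boto3's default chain cannot resolve credentials."""
    # Imported lazily so that importing this module stays cheap.
    import boto3

    # get_credentials() walks env vars, shared config files and IAM roles;
    # it returns None if none of them yields credentials.
    if boto3.Session().get_credentials() is None:
        raise AuthenticationError(
            "No AWS credentials found. Set AWS_ACCESS_KEY_ID and "
            "AWS_SECRET_ACCESS_KEY, run 'aws configure', or attach an IAM role."
        )
```

Calling such a check before starting a download means the user sees one actionable message instead of a failure from deep inside the SDK.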
- vxdf.auth.ensure_gcp_credentials() None
Raise AuthenticationError if Application Default Credentials are missing.
- vxdf.auth.get_openai_api_key(cli_key: str | None = None) str
Return an OpenAI API key from CLI flag, env var, or user config.
Lookup order (first match wins):
1. cli_key argument passed by the caller.
2. Environment variable OPENAI_API_KEY.
3. openai.api_key field in ~/.vxdf/config.toml.
- Raises:
AuthenticationError – If no key is found by any method.
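The lookup order above can be illustrated with a small sketch. The names here are assumptions for illustration only: resolve_openai_key is a hypothetical helper, AuthenticationError is a local stand-in, and the config is passed in as an already-parsed mapping, whereas the real function presumably reads ~/.vxdf/config.toml itself.

```python
import os
from typing import Mapping, Optional


class AuthenticationError(Exception):
    """Hypothetical stand-in for vxdf.errors.AuthenticationError."""


def resolve_openai_key(
    cli_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    config: Optional[dict] = None,
) -> str:
    """Return the first key found: CLI argument, env var, then config."""
    env = os.environ if env is None else env
    config = {} if config is None else config

    # 1. Explicit argument from the caller wins.
    if cli_key:
        return cli_key

    # 2. Environment variable.
    env_key = env.get("OPENAI_API_KEY")
    if env_key:
        return env_key

    # 3. [openai] api_key entry of the (already parsed) user config.
    section = config.get("openai")
    if isinstance(section, dict) and section.get("api_key"):
        return str(section["api_key"])

    raise AuthenticationError(
        "No OpenAI API key found: pass one explicitly, set OPENAI_API_KEY, "
        "or add openai.api_key to ~/.vxdf/config.toml."
    )
```

Passing env and config as arguments keeps the precedence logic easy to test without touching the real environment or home directory.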
vxdf.cli module
Command-line utilities for working with VXDF files.
Usage examples:
# Show basic info about a VXDF file
python -m vxdf info sample.vxdf
# List document IDs in a VXDF file
python -m vxdf list sample.vxdf
# Extract a document by ID to stdout (pretty-printed JSON)
python -m vxdf get sample.vxdf doc_123 > doc.json
# Pack JSON lines into a VXDF file (expects each line to be a JSON object with id, text, vector)
python -m vxdf pack input.jsonl output.vxdf --embedding-dim 768 --compression zlib
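The pack command expects one JSON object per line with id, text and vector fields. A minimal sketch of producing such an input file might look like this; write_jsonl is a hypothetical helper, and checking the vector length against the 768 from the example above is an assumption about what the packer would otherwise reject.

```python
import json
from typing import Iterable, Sequence, Tuple

EMBEDDING_DIM = 768  # matches --embedding-dim in the example above


def write_jsonl(
    path: str,
    records: Iterable[Tuple[str, str, Sequence[float]]],
    dim: int = EMBEDDING_DIM,
) -> int:
    """Write (id, text, vector) records as JSON lines; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as fh:
        for doc_id, text, vector in records:
            # Catch dimension mistakes before handing the file to the packer.
            if len(vector) != dim:
                raise ValueError(
                    f"{doc_id}: expected {dim} dimensions, got {len(vector)}"
                )
            row = {"id": doc_id, "text": text, "vector": list(vector)}
            fh.write(json.dumps(row) + "\n")
            count += 1
    return count
```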
- vxdf.cli.build_parser() ArgumentParser
vxdf.errors module
Centralised VXDF exception hierarchy.
All public APIs raise these exceptions instead of bare ValueError, KeyError, etc. This makes it straightforward for calling code to handle specific failure modes.
- exception vxdf.errors.AuthenticationError
Bases: VXDFError
Authentication credentials are missing or invalid.
- exception vxdf.errors.ChecksumMismatchError
Bases: VXDFError
File checksum does not match footer value (file may be corrupted).
- exception vxdf.errors.ChunkNotFoundError
Bases: VXDFError
Requested document ID not present in the offset index.
- exception vxdf.errors.CompressionError
Bases: VXDFError
Error occurred during compression or decompression.
- exception vxdf.errors.DuplicateDocumentIDError
Bases: VXDFError
Attempted to add a chunk with a duplicate document ID.
- exception vxdf.errors.EncryptionError
Bases: VXDFError
Chunk is encrypted but key is missing or decryption failed.
- exception vxdf.errors.InvalidChunkError
Bases: VXDFError
Chunk data failed validation (e.g., wrong embedding dimension).
Bases: VXDFError
Footer or end marker is missing / malformed.
- exception vxdf.errors.MissingDependencyError
Bases: VXDFError
Optional library needed for this operation is not installed.
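Because every exception listed above derives from VXDFError, calling code can handle one specific failure and fall back to the base class for the rest. The sketch below uses local stand-in classes so that it runs without the package installed; it assumes VXDFError itself derives from Exception, and describe_failure is a hypothetical helper.

```python
from typing import Callable


class VXDFError(Exception):
    """Stand-in for vxdf.errors.VXDFError (base class assumed)."""


class ChunkNotFoundError(VXDFError):
    """Stand-in for vxdf.errors.ChunkNotFoundError."""


class ChecksumMismatchError(VXDFError):
    """Stand-in for vxdf.errors.ChecksumMismatchError."""


def describe_failure(action: Callable[[], object]) -> str:
    """Run action and map VXDF failures to a short message."""
    try:
        action()
    except ChunkNotFoundError as exc:
        # Specific handler first: a missing ID is often recoverable.
        return f"missing document: {exc}"
    except VXDFError as exc:
        # Catch-all for every other library-level failure.
        return f"VXDF failure: {exc}"
    return "ok"
```

In real code the stand-in classes would be replaced by imports from vxdf.errors.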
vxdf.ingest module
High-level helpers for converting common data files into VXDF.
This module is intentionally dependency-light: heavy optional deps are imported lazily so that users who only need core VXDF functionality are not forced to install PDF or ML libraries.
- vxdf.ingest.convert(input_path: str | Path, output_path: str | Path, *, model: str = 'all-MiniLM-L6-v2', compression: str = 'none', openai_key: str | None = None, recursive: bool = False, show_progress: bool = True, resume: bool = False, workers: int = 1, detect_pii: bool = True, pii_patterns: List[str] | None = None) None
Convert input_path to a VXDF file at output_path.
Each row / paragraph becomes a chunk with id/text/vector.
vxdf.merge_split module
Utilities to merge multiple VXDF files and split a large VXDF.
Phase-1 advanced CLI helpers.
Public functions
merge(output_path, input_paths, *, dedupe="skip", show_progress=True)
split(input_path, *, size_bytes=None, chunks_per_file=None, show_progress=True)
- vxdf.merge_split.merge(output_path: str | Path, input_paths: Sequence[str | Path], *, dedupe: str = 'skip', show_progress: bool = True) None
Merge input_paths into output_path.
If dedupe is skip (default), duplicate IDs after the first occurrence are skipped. error aborts on duplicates. firstwins keeps the first seen (same as skip, but explicit).
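The three dedupe modes can be illustrated on plain dictionaries. This is only a sketch of the policy, not the library's merge code: merge_chunks is a hypothetical helper, chunks are modelled as dicts with an id key, and ValueError stands in for whatever exception the real implementation raises.

```python
from typing import Dict, Iterable, List


def merge_chunks(
    sources: Iterable[Iterable[Dict[str, object]]],
    dedupe: str = "skip",
) -> List[Dict[str, object]]:
    """Concatenate chunk streams, applying the duplicate-ID policy."""
    if dedupe not in ("skip", "firstwins", "error"):
        raise ValueError(f"unknown dedupe mode: {dedupe!r}")

    seen = set()
    merged: List[Dict[str, object]] = []
    for chunks in sources:
        for chunk in chunks:
            doc_id = chunk["id"]
            if doc_id in seen:
                if dedupe == "error":
                    # Stand-in for the library's duplicate-ID error.
                    raise ValueError(f"duplicate document ID: {doc_id}")
                continue  # skip / firstwins: keep the first occurrence
            seen.add(doc_id)
            merged.append(chunk)
    return merged
```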
- vxdf.merge_split.split(input_path: str | Path, *, size_bytes: int | None = None, chunks_per_file: int | None = None, show_progress: bool = True) None
Split input_path into numbered shards.
Exactly one of size_bytes or chunks_per_file must be provided. Output files are written as <stem>-partNN.vxdf beside the input file.
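The argument rule and the shard naming can be sketched as below. Both helpers are hypothetical, and reading NN as a two-digit, zero-padded counter is an assumption; the documentation does not say how the numbering starts or how it is padded.

```python
from pathlib import Path
from typing import Optional, Union


def check_split_args(
    size_bytes: Optional[int] = None,
    chunks_per_file: Optional[int] = None,
) -> None:
    """Enforce that exactly one splitting criterion is given."""
    # Both None or both set are equally invalid.
    if (size_bytes is None) == (chunks_per_file is None):
        raise ValueError("provide exactly one of size_bytes or chunks_per_file")


def shard_path(input_path: Union[str, Path], index: int) -> Path:
    """Return <stem>-partNN.vxdf next to the input file (NN zero-padded)."""
    path = Path(input_path)
    return path.with_name(f"{path.stem}-part{index:02d}.vxdf")
```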
vxdf.pii module
Lightweight regex-based PII detector used during ingestion.
The detector is deliberately simple and dependency-free so it can run on any machine with the Python stdlib. It searches for well-formatted patterns such as email addresses, US Social-Security numbers, credit-card numbers, IPv4/IPv6 addresses and US phone numbers.
The goal is good enough recall for an initial tag-only pass; downstream tools or stricter detectors can refine the result.
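A tag-only regex pass of the kind described above might look like the following sketch. The patterns and the tag_pii name are illustrative assumptions, not the module's actual rules, and they cover only three of the listed categories.

```python
import re
from typing import List

# Deliberately simple patterns: they favour recall over precision.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def tag_pii(text: str) -> List[str]:
    """Return the sorted names of every pattern that matches the text."""
    return sorted(name for name, pattern in PATTERNS.items() if pattern.search(text))
```

A pass like this only labels a chunk; it does not redact anything, which leaves refinement to downstream tools.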
vxdf.reader module
vxdf.update module
Incremental update helpers: append new documents to an existing VXDF.
This is phase-1 functionality covering the incremental updates scenario.
- Usage (Python):
from vxdf.update import update
update("corpus.vxdf", "new_docs", recursive=True)
- CLI:
python -m vxdf update corpus.vxdf new_docs/ -r
By default duplicate document IDs are skipped; use --dedupe overwrite to
replace existing chunks or --dedupe error to abort on duplicates.
- vxdf.update.update(vxdf_path: str | Path, input_path: str | Path, *, model: str = 'all-MiniLM-L6-v2', dedupe: str = 'skip', openai_key: str | None = None, recursive: bool = False, show_progress: bool = True) None
Append new documents from input_path into an existing VXDF.
- Parameters:
vxdf_path – Path to an existing .vxdf file (will be modified in-place using an atomic temp-file swap).
input_path – File or directory containing new data (same formats as vxdf ingest convert).
model – Embedding model name (ignored if documents contain vector already – not currently supported).
dedupe – How to handle duplicate IDs – skip (default), overwrite (replace old chunk), or error.
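The atomic temp-file swap mentioned for vxdf_path follows a common pattern: write the new contents to a temporary file in the same directory, then rename it over the original. The sketch below shows that general pattern with a hypothetical atomic_rewrite helper; it is not taken from the library and omits everything specific to the VXDF format.

```python
import os
import tempfile
from pathlib import Path
from typing import Union


def atomic_rewrite(path: Union[str, Path], new_bytes: bytes) -> None:
    """Replace the file at path so readers never see a partial write."""
    target = Path(path)
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=str(target.parent), prefix=target.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as fh:
            fh.write(new_bytes)
            fh.flush()
            os.fsync(fh.fileno())  # push the data to disk before the swap
        os.replace(tmp_name, target)  # atomic replacement of the original
    except BaseException:
        # Leave the original untouched and clean up the temp file.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process dies before os.replace runs, the original file is still intact, which is the property that makes an in-place update safe.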
vxdf.writer module
- class vxdf.writer.VXDFWriter(file_path: str, embedding_dim: int, *, compression: str = 'none', fields: Dict[str, str] | None = None)
Bases: object
Writes data to a VXDF file.
Module contents
VXDF Python library.
High-level public API re-exports for easy import:
from vxdf import VXDFWriter, VXDFReader
- class vxdf.VXDFReader(file_path: str)
Bases: object
Reads and parses a VXDF file.
- class vxdf.VXDFWriter(file_path: str, embedding_dim: int, *, compression: str = 'none', fields: Dict[str, str] | None = None)
Bases: object
Writes data to a VXDF file.