vxdf package
Submodules
vxdf.auth module
Centralised helpers for authentication credentials.
This module avoids importing heavy cloud SDKs at import-time; it only touches os.environ and lightweight stdlib modules. Third-party cloud libraries still perform their own credential resolution, but we surface friendly errors early so users know what to configure.
- vxdf.auth.ensure_aws_credentials() → None
Raise AuthenticationError if AWS credentials cannot be resolved.
We rely on boto3's default credential chain (env vars, AWS CLI config, IAM roles). To fail fast with a clearer error, we test resolution early so the user gets an actionable message before a download starts.
- vxdf.auth.ensure_gcp_credentials() → None
Raise AuthenticationError if Application Default Credentials are missing.
- vxdf.auth.get_openai_api_key(cli_key: str | None = None) → str
Return an OpenAI API key from CLI flag, env var, or user config.
Lookup order (first match wins):
1. cli_key argument passed by the caller.
2. Environment variable OPENAI_API_KEY.
3. openai.api_key field in ~/.vxdf/config.toml.
- Raises:
AuthenticationError – If no key is found by any method.
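The lookup order above can be sketched as a small stand-alone function. This is only an illustration of the precedence rule, not the library's implementation: the name resolve_openai_key, the env and config parameters, and the local AuthenticationError class are all hypothetical, and the config argument is assumed to hold the already-parsed contents of ~/.vxdf/config.toml.

```python
import os


class AuthenticationError(Exception):
    """Local stand-in for vxdf.errors.AuthenticationError (assumption)."""


def resolve_openai_key(cli_key=None, env=None, config=None):
    """Return the first key found: argument, then env var, then config."""
    env = os.environ if env is None else env
    config = {} if config is None else config

    # 1. An explicit argument takes precedence over everything else.
    if cli_key:
        return cli_key

    # 2. Fall back to the OPENAI_API_KEY environment variable.
    env_key = env.get("OPENAI_API_KEY")
    if env_key:
        return env_key

    # 3. Finally, look at the openai.api_key field of the parsed config.
    file_key = config.get("openai", {}).get("api_key")
    if file_key:
        return file_key

    raise AuthenticationError(
        "No OpenAI API key found: pass one explicitly, set OPENAI_API_KEY, "
        "or add openai.api_key to ~/.vxdf/config.toml"
    )
```

Passing env and config explicitly keeps the sketch easy to exercise without touching the real environment or home directory.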
vxdf.cli module
Command-line utilities for working with VXDF files.
Usage examples:
# Show basic info about a VXDF file
python -m vxdf info sample.vxdf
# List document IDs in a VXDF file
python -m vxdf list sample.vxdf
# Extract a document by ID to stdout (pretty-printed JSON)
python -m vxdf get sample.vxdf doc_123 > doc.json
# Pack JSON lines into a VXDF file (expects each line to be a JSON object with id, text, vector)
python -m vxdf pack input.jsonl output.vxdf --embedding-dim 768 --compression zlib
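As a rough illustration of the input the pack command expects, the following sketch writes a small JSON Lines file in which every line is an object with id, text, and vector keys. The helper names (validate_record, write_jsonl) and the toy embedding dimension are assumptions made for the example; they are not part of the vxdf package.

```python
import json
import tempfile
from pathlib import Path

EMBEDDING_DIM = 4  # toy value for illustration; the pack command in this section uses 768


def validate_record(record, dim):
    """Check that a record has the id/text/vector keys and the right vector length."""
    missing = {"id", "text", "vector"} - record.keys()
    if missing:
        raise ValueError(f"record is missing keys: {sorted(missing)}")
    if len(record["vector"]) != dim:
        raise ValueError(f"expected {dim} vector values, got {len(record['vector'])}")


def write_jsonl(records, path, dim):
    """Write one JSON object per line after validating each record."""
    with open(path, "w", encoding="utf-8") as fh:
        for record in records:
            validate_record(record, dim)
            fh.write(json.dumps(record) + "\n")


records = [
    {"id": "doc_1", "text": "First document.", "vector": [0.1, 0.2, 0.3, 0.4]},
    {"id": "doc_2", "text": "Second document.", "vector": [0.5, 0.6, 0.7, 0.8]},
]
out_path = Path(tempfile.mkdtemp()) / "input.jsonl"
write_jsonl(records, out_path, EMBEDDING_DIM)
```

The resulting file could then be passed as the first argument to the pack command, with the embedding dimension flag set to match the vector length.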
- vxdf.cli.build_parser() → ArgumentParser
vxdf.errors module
Centralised VXDF exception hierarchy.
All public APIs raise these exceptions instead of bare ValueError, KeyError, etc. This makes it straightforward for calling code to handle specific failure modes.
- exception vxdf.errors.AuthenticationError
Bases:
VXDFError
Authentication credentials are missing or invalid.
- exception vxdf.errors.ChecksumMismatchError
Bases:
VXDFError
File checksum does not match footer value (file may be corrupted).
- exception vxdf.errors.ChunkNotFoundError
Bases:
VXDFError
Requested document ID not present in the offset index.
- exception vxdf.errors.CompressionError
Bases:
VXDFError
Error occurred during compression or decompression.
- exception vxdf.errors.DuplicateDocumentIDError
Bases:
VXDFError
Attempted to add a chunk with a duplicate document ID.
- exception vxdf.errors.EncryptionError
Bases:
VXDFError
Chunk is encrypted but key is missing or decryption failed.
- exception vxdf.errors.InvalidChunkError
Bases:
VXDFError
Chunk data failed validation (e.g., wrong embedding dimension).
Bases:
VXDFError
Footer or end marker is missing / malformed.
- exception vxdf.errors.MissingDependencyError
Bases:
VXDFError
Optional library needed for this operation is not installed.
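Because every exception listed above derives from VXDFError, calling code can handle one specific failure mode first and then fall back to the base class. The sketch below uses local stand-in classes so that it runs on its own; that VXDFError itself derives from Exception is an assumption, and describe_failure is a hypothetical helper.

```python
class VXDFError(Exception):
    """Stand-in for the library's base exception (assumed to derive from Exception)."""


class ChunkNotFoundError(VXDFError):
    """Stand-in for vxdf.errors.ChunkNotFoundError."""


class ChecksumMismatchError(VXDFError):
    """Stand-in for vxdf.errors.ChecksumMismatchError."""


def describe_failure(operation):
    """Run operation() and translate VXDF errors into short messages."""
    try:
        operation()
    except ChunkNotFoundError as exc:
        # The most specific handler comes first.
        return f"missing document: {exc}"
    except VXDFError as exc:
        # Any other library error is caught through the shared base class.
        return f"VXDF failure ({type(exc).__name__}): {exc}"
    return "ok"
```

Errors that are not part of the hierarchy, such as a plain ValueError, are deliberately left to propagate.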
vxdf.ingest module
High-level helpers for converting common data files into VXDF.
This module is intentionally dependency-light: heavy optional deps are imported lazily so that users who only need core VXDF functionality are not forced to install PDF or ML libraries.
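One common way to defer heavy optional imports is to resolve them inside the function that needs them and convert an ImportError into a friendlier message. The following is a minimal sketch of that pattern, not the module's actual code; lazy_import is a hypothetical helper and MissingDependencyError is re-declared locally as a stand-in.

```python
import importlib


class MissingDependencyError(Exception):
    """Local stand-in for vxdf.errors.MissingDependencyError (assumption)."""


def lazy_import(module_name, purpose):
    """Import module_name only when first needed, with an actionable error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise MissingDependencyError(
            f"{module_name!r} is required for {purpose}; install it and retry"
        ) from exc
```

With this shape, importing the surrounding module stays cheap, and the cost of the optional dependency is paid only by callers who use the feature.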
- vxdf.ingest.convert(input_path: str | Path, output_path: str | Path, *, model: str = 'all-MiniLM-L6-v2', compression: str = 'none', openai_key: str | None = None, recursive: bool = False, show_progress: bool = True, resume: bool = False, workers: int = 1, detect_pii: bool = True, pii_patterns: List[str] | None = None) → None
Convert input_path to a VXDF file at output_path.
Each row / paragraph becomes a chunk with id/text/vector.
vxdf.merge_split module
Utilities to merge multiple VXDF files and split a large VXDF.
Phase-1 advanced CLI helpers.
Public functions
merge(out_file, *inputs, dedupe="skip", show_progress=True)
split(in_file, *, size_bytes=None, chunks_per_file=None, show_progress=True)
- vxdf.merge_split.merge(output_path: str | Path, input_paths: Sequence[str | Path], *, dedupe: str = 'skip', show_progress: bool = True) → None
Merge input_paths into output_path.
dedupe controls how duplicate document IDs are handled:
skip (default) – duplicate IDs after the first occurrence are skipped.
error – aborts on duplicates.
firstwins – keeps the first seen (same as skip, but explicit).
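The three dedupe policies can be illustrated on in-memory lists of chunk dictionaries. This is a simplified sketch of the semantics described above and does not read or write VXDF files; merge_chunks is a hypothetical function, and DuplicateDocumentIDError is declared locally as a stand-in (which exception the real merge raises is not stated here).

```python
class DuplicateDocumentIDError(Exception):
    """Local stand-in for vxdf.errors.DuplicateDocumentIDError (assumption)."""


def merge_chunks(sources, dedupe="skip"):
    """Merge several chunk lists, applying a duplicate-ID policy."""
    if dedupe not in {"skip", "error", "firstwins"}:
        raise ValueError(f"unknown dedupe policy: {dedupe!r}")

    seen = set()
    merged = []
    for chunks in sources:
        for chunk in chunks:
            doc_id = chunk["id"]
            if doc_id in seen:
                if dedupe == "error":
                    raise DuplicateDocumentIDError(doc_id)
                # "skip" and "firstwins" both keep the first occurrence.
                continue
            seen.add(doc_id)
            merged.append(chunk)
    return merged
```

Since skip and firstwins behave identically, the choice between them only affects how explicit the calling code is about its intent.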
- vxdf.merge_split.split(input_path: str | Path, *, size_bytes: int | None = None, chunks_per_file: int | None = None, show_progress: bool = True) → None
Split input_path into numbered shards.
Exactly one of size_bytes or chunks_per_file must be provided. Output files are written as <stem>-partNN.vxdf beside the input file.
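The argument rule and the shard naming scheme can be sketched as follows. Both helpers are hypothetical, and the sketch assumes that NN means a two-digit, zero-padded shard number; the starting index is left to the caller because it is not specified here.

```python
from pathlib import Path


def check_split_args(size_bytes=None, chunks_per_file=None):
    """Enforce that exactly one of the two limits is provided."""
    if (size_bytes is None) == (chunks_per_file is None):
        raise ValueError("provide exactly one of size_bytes or chunks_per_file")


def shard_path(input_path, index):
    """Build '<stem>-partNN.vxdf' next to the input file (zero padding assumed)."""
    path = Path(input_path)
    return path.with_name(f"{path.stem}-part{index:02d}.vxdf")
```

For example, shard_path("data/corpus.vxdf", 1) yields data/corpus-part01.vxdf under these assumptions.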
vxdf.pii module
Lightweight regex-based PII detector used during ingestion.
The detector is deliberately simple and dependency-free so it can run on any machine with the Python stdlib. It searches for well-formatted patterns such as email addresses, US Social-Security numbers, credit-card numbers, IPv4/IPv6 addresses and US phone numbers.
The goal is good enough recall for an initial tag-only pass; downstream tools or stricter detectors can refine the result.
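A tag-only regex pass of this kind might look like the sketch below. The patterns are illustrative and intentionally loose; they are not the ones shipped in vxdf.pii, and the names PATTERNS and detect_pii are assumptions for the example.

```python
import re

# Illustrative patterns only; a real detector would cover more formats.
PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def detect_pii(text, patterns=PATTERNS):
    """Return the sorted names of all patterns that match somewhere in text."""
    return sorted(name for name, rx in patterns.items() if rx.search(text))
```

The function only tags which categories appear in a piece of text; it neither redacts nor validates the matches, which is consistent with leaving refinement to downstream tools.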
vxdf.reader module
vxdf.update module
Incremental update helpers: append new documents to an existing VXDF.
This is phase-1 functionality covering the incremental updates scenario.
- Usage (Python):
from vxdf.update import update
update("corpus.vxdf", "new_docs", recursive=True)
- CLI:
python -m vxdf update corpus.vxdf new_docs/ -r
By default, duplicate document IDs are skipped; use --dedupe overwrite to replace existing chunks or --dedupe error to abort on duplicates.
- vxdf.update.update(vxdf_path: str | Path, input_path: str | Path, *, model: str = 'all-MiniLM-L6-v2', dedupe: str = 'skip', openai_key: str | None = None, recursive: bool = False, show_progress: bool = True) → None
Append new documents from input_path into an existing VXDF.
- Parameters:
vxdf_path – Path to an existing .vxdf file (will be modified in-place using an atomic temp-file swap).
input_path – File or directory containing new data (same formats as vxdf ingest convert).
model – Embedding model name (ignored if documents contain vector already – not currently supported).
dedupe – How to handle duplicate IDs – skip (default), overwrite (replace old chunk), or error.
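The atomic temp-file swap mentioned for vxdf_path is a general technique: write the new contents to a temporary file in the same directory, then rename it over the original. The sketch below shows one way to do this with the standard library; atomic_rewrite is a hypothetical helper and is not taken from the vxdf source.

```python
import os
import tempfile
from pathlib import Path


def atomic_rewrite(path, new_bytes):
    """Replace the file at path with new_bytes via a temp file and rename."""
    path = Path(path)
    # Create the temp file beside the target so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_bytes)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before swapping
        os.replace(tmp_name, path)  # readers see either the old or the new file
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

If the process is interrupted before os.replace runs, the original file is left untouched, which is the main reason to prefer this over rewriting in place.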
vxdf.writer module
- class vxdf.writer.VXDFWriter(file_path: str, embedding_dim: int, *, compression: str = 'none', fields: Dict[str, str] | None = None)
Bases:
object
Writes data to a VXDF file.
Module contents
VXDF Python library.
High-level public API re-exports for easy import:
from vxdf import VXDFWriter, VXDFReader
- class vxdf.VXDFReader(file_path: str)
Bases:
object
Reads and parses a VXDF file.
- class vxdf.VXDFWriter(file_path: str, embedding_dim: int, *, compression: str = 'none', fields: Dict[str, str] | None = None)
Bases:
object
Writes data to a VXDF file.