Database Adapters

Backend-specific features, capabilities, and configuration for vector databases.

Overview

CrossVector supports 4 vector database backends:

Backend	Nested Metadata	Metadata-Only Search	License	Recommended For
AstraDB	Yes	Yes	Proprietary	Cloud-hosted, serverless, auto-scaling
ChromaDB	Via Dot Notation	Yes	Apache 2.0	Prototyping, simple deployments, cloud/local
Milvus	Yes	Yes	Apache 2.0	Large-scale, distributed, high-performance
PgVector	Full JSONB	Yes	PostgreSQL	Existing PostgreSQL infrastructure, ACID

*Note: Milvus supports metadata-only via query() method, but recommended to always provide vector for optimal performance.

AstraDB

DataStax Astra DB - Serverless Cassandra with vector search.

Features

Full nested metadata - Complete JSON document support
Metadata-only search - Filter without vector similarity
Universal operators - All 10 operators supported
Scalable - Serverless auto-scaling
Managed - Fully hosted service

Installation

pip install crossvector[astradb]

Configuration

Environment Variables:

ASTRA_DB_APPLICATION_TOKEN="AstraCS:xxx"
ASTRA_DB_API_ENDPOINT="https://xxx.apps.astra.datastax.com"
# Note: Collection name uses VECTOR_COLLECTION_NAME (shared setting)

Programmatic:

from crossvector.dbs.astradb import AstraDBAdapter

db = AstraDBAdapter(
    token="AstraCS:xxx",
    api_endpoint="https://xxx.apps.astra.datastax.com",
    keyspace="default_keyspace"
)

Schema

AstraDB accepts flexible primary key field names:

# All three forms are equivalent - use your preferred convention

# Form 1: pk (recommended - cleaner)
doc = engine.create({
    "pk": "doc-123",
    "text": "Document content",
    "category": "tech",
    "author": {"name": "John", "role": "admin"}
})

# Form 2: id (common alternative)
doc = engine.create({
    "id": "doc-123",
    "text": "Document content",
    "category": "tech"
})

# Form 3: _id (legacy AstraDB style)
doc = engine.create({
    "_id": "doc-123",
    "text": "Document content",
    "category": "tech"
})

# Form 4: Dynamic (auto-generated if not provided)
doc = engine.create({
    "text": "Document content",
    "category": "tech"
    # id is auto-generated based on PRIMARY_KEY_MODE setting
})

Behind the scenes: - CrossVector extracts pk, id, or _id from input (in priority order) - All are stored as _id in AstraDB (internal requirement) - Retrieved documents have id field for consistency - Other fields become metadata

Nested Metadata

Full JSON document support with dynamic and nested queries:

from crossvector.querydsl.q import Q

# Create with nested metadata (using pk field)
doc = engine.create({
    "pk": "article-1",
    "text": "Deep learning guide",
    "author": {
        "name": "Alice",
        "profile": {"verified": True, "tier": "premium"}
    },
    "post": {
        "stats": {"views": 5000, "likes": 200}
    }
})

# Query deep nesting with double underscore notation
results = engine.search(
    "machine learning",
    where=Q(author__profile__verified=True) & Q(post__stats__views__gte=1000)
)

Capabilities

engine = VectorEngine(db=AstraDBAdapter(), embedding=...)

# Metadata-only search
results = engine.search(
    query=None,
    where=Q(status="published")
)

# All operators
results = engine.search(
    "query",
    where=(
        Q(category="tech") &
        Q(score__gte=0.8) &
        Q(tags__in=["python", "ai"]) &
        ~Q(archived=True)
    )
)

Performance

Collection limits: 10M+ documents per collection
Throughput: High (serverless auto-scaling)
Latency: ~10-50ms typical
Cost: Pay-per-request pricing

Best Practices

# Use metadata-only for fast filtering
results = engine.search(query=None, where={"status": {"$eq": "active"}})

# Leverage nested metadata
metadata = {
    "user": {"id": "user123", "tier": "premium"},
    "content": {"type": "article", "category": "tech"}
}

# Batch operations for efficiency
engine.bulk_create(docs, batch_size=100)

ChromaDB

Open-source embedding database with Python-first API.

Features

Nested metadata via dot notation - Access nested fields using dot syntax (e.g., user.role)
Metadata-only search - Filter without vector similarity
Multiple deployment modes - Cloud, HTTP, or local persistence
Strict config validation - Prevents conflicting settings
Explicit imports - Clear dependency management
Lazy initialization - Optimal resource usage
All 10 operators - eq, ne, gt, gte, lt, lte, in, nin, and, or supported
In-memory/persistent - Multiple storage backends
Open source - Apache 2.0 license

Installation

# Local/in-memory
pip install crossvector[chroma]

# ChromaDB Cloud
pip install crossvector[chroma-cloud]

Configuration

Environment Variables:

# ChromaDB Cloud (priority 1)
CHROMA_API_KEY="your-api-key"
CHROMA_TENANT="tenant-name"
CHROMA_DATABASE="database-name"

# Self-hosted HTTP (priority 2, requires no CHROMA_PERSIST_DIR)
CHROMA_HOST="localhost"
CHROMA_PORT="8000"

# Local persistence (priority 3, requires no CHROMA_HOST)
CHROMA_PERSIST_DIR="./chroma_data"

Important: Cannot set both CHROMA_HOST and CHROMA_PERSIST_DIR. Choose one deployment mode: - Cloud: Set CHROMA_API_KEY - HTTP: Set CHROMA_HOST (not CHROMA_PERSIST_DIR) - Local: Set CHROMA_PERSIST_DIR (not CHROMA_HOST)

Programmatic:

from crossvector.dbs.chroma import ChromaAdapter

# Cloud mode
db = ChromaAdapter()  # Uses CHROMA_API_KEY from env

# HTTP mode
db = ChromaAdapter()  # Uses CHROMA_HOST from env

# Local mode
db = ChromaAdapter()  # Uses CHROMA_PERSIST_DIR from env

Configuration Validation:

CrossVector enforces strict configuration validation:

# Valid: Cloud only
CHROMA_API_KEY="..."

# Valid: HTTP only
CHROMA_HOST="localhost"

# Valid: Local only
CHROMA_PERSIST_DIR="./data"

# Invalid: Conflicting settings
CHROMA_HOST="localhost"
CHROMA_PERSIST_DIR="./data"
# Raises: MissingConfigError with helpful message

Schema

ChromaDB automatically flattens nested metadata using dot notation:

Input (nested structure):

metadata = {
    "user": {
        "name": "John",
        "role": "admin",
        "profile": {
            "verified": True
        }
    }
}

Stored as (flattened with dots):

{
    "user.name": "John",
    "user.role": "admin",
    "user.profile.verified": True
}

Access via dot notation:

from crossvector.querydsl.q import Q

# Query nested fields using double underscore (converts to dot notation)
results = engine.search(
    "query",
    where=Q(user__role="admin") & Q(user__profile__verified=True)
)

# Internally compiled to: {"user.role": {"$eq": "admin"}, "user.profile.verified": {"$eq": True}}

Nested Metadata Support

ChromaDB supports nested metadata through automatic dot notation flattening:

from crossvector.querydsl.q import Q

# Nested queries work via dot notation
results = engine.search(
    "query",
    where=Q(user__role="admin") & Q(user__profile__verified=True)
)

# Compiled to: {"user.role": {"$eq": "admin"}, "user.profile.verified": {"$eq": True}}

How it works: - Double underscore __ in Q objects maps to dot notation . in storage - Arbitrarily deep nesting is supported - Queries are automatically flattened to match storage format

Capabilities

engine = VectorEngine(db=ChromaDBAdapter(), embedding=...)

# Metadata-only search
results = engine.search(
    query=None,
    where=Q(category="tech")
)

# All operators
results = engine.search(
    "query",
    where=(
        Q(category="tech") &
        Q(score__gte=0.8) &
        Q(status__in=["active", "pending"])
    )
)

# Wrapper requirement
# Multiple conditions automatically wrapped in $and

Performance

Collection limits: 100K+ documents recommended
Throughput: High (in-memory)
Latency: <10ms (in-memory), 20-50ms (persistent)
Cost: Free (self-hosted)

Best Practices

# Use flat metadata structure for best compatibility
metadata = {
    "category": "tech",
    "author_name": "John",  # Flat instead of author.name
    "author_role": "admin"
}

# Choose deployment mode explicitly
# Option 1: Cloud (managed)
CHROMA_API_KEY="..."

# Option 2: Self-hosted HTTP server
CHROMA_HOST="localhost"

# Option 3: Local persistence (development)
CHROMA_PERSIST_DIR="./chroma_data"

# Don't mix deployment modes - causes MissingConfigError
# Don't do: CHROMA_HOST + CHROMA_PERSIST_DIR

# Batch operations for efficiency
engine.bulk_create(docs, batch_size=100)

# Leverage lazy initialization
db = ChromaAdapter()  # Client created only when first used

Milvus

High-performance distributed vector database.

Features

Full nested metadata - JSON field support (via dynamic fields)
Metadata-only search - Query without vector via query() method (with supports_metadata_only=True)
All 10 operators - eq, ne, gt, gte, lt, lte, in, nin, and, or supported
High performance - Distributed architecture
Scalable - Horizontal scaling
Lazy initialization - Optimal resource usage

Installation

pip install crossvector[milvus]

Configuration

Environment Variables:

MILVUS_HOST="localhost"
MILVUS_PORT="19530"
MILVUS_USER="username"  # Optional
MILVUS_PASSWORD="password"  # Optional
MILVUS_DB_NAME="default"  # Optional

Programmatic:

from crossvector.dbs.milvus import MilvusAdapter

db = MilvusAdapter(
    host="localhost",
    port=19530,
    user="username",
    password="password",
    db_name="default"
)

Schema

Milvus uses boolean expression filters:

# Query compiles to Milvus expression
Q(category="tech") & Q(score__gt=0.8)
# => '(category == "tech") and (score > 0.8)'

Q(status__in=["active", "pending"])
# => 'status in ["active", "pending"]'

Metadata-Only Search Support

Milvus supports metadata-only search (no vector required):

# Correct - Metadata-only query
results = engine.search(query=None, where=Q(category="tech"), limit=10)

# Also valid - Vector + filter
results = engine.search("query text", where=Q(category="tech"))

Check support:

if engine.supports_metadata_only:
    # Can search without vector
    results = engine.search(query=None, where=filters)
else:
    # Need to provide vector
    results = engine.search(vector, where=filters)

Nested Metadata

Full support via JSON field:

from crossvector.querydsl.q import Q

# Nested queries
results = engine.search(
    "query",
    where=Q(user__role="admin") & Q(post__stats__views__gte=1000)
)

# Compiles to: '(user["role"] == "admin") and (post["stats"]["views"] >= 1000)'

Capabilities

engine = VectorEngine(db=MilvusAdapter(), embedding=...)

# Metadata-only search
results = engine.search(
    query=None,
    where=Q(category="tech") & Q(score__gte=0.8)
)

# Vector + filter
results = engine.search(
    "query text",
    where=Q(status="published") & Q(priority__in=[1, 2, 3])
)

# All operators
results = engine.search(
    "query",
    where=(
        Q(status="published") &
        Q(priority__in=[1, 2, 3]) &
        Q(score__gt=0.5) &
        ~Q(archived=True)
    )
)

Performance

Collection limits: Billions of vectors
Throughput: Very high (distributed)
Latency: <10ms (optimized indexes)
Cost: Free (self-hosted), pay-as-you-go (Zilliz Cloud)

Best Practices

# Use metadata-only for fast filtering
if engine.supports_metadata_only:
    results = engine.search(query=None, where=filters, limit=100)

# Combine vector and metadata
results = engine.search("query", where=Q(status="active"))

# Use nested metadata
metadata = {
    "user": {"id": 123, "tier": "premium"},
    "content": {"type": "video", "duration": 600}
}

# Index metadata fields for performance
# (Configure in Milvus collection schema)

# Batch operations
engine.bulk_create(docs, batch_size=1000)

PgVector

PostgreSQL extension for vector similarity search.

Features

Full nested metadata - JSONB support with #>> operator
Metadata-only search - Filter without vector similarity
All 10 operators - Supported with numeric casting
ACID transactions - Full PostgreSQL guarantees
Mature ecosystem - PostgreSQL tooling

Installation

pip install crossvector[pgvector]

PostgreSQL Setup

-- Enable pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create table (handled automatically by adapter)

Configuration

Environment Variables:

VECTOR_COLLECTION_NAME="vectordb"
PGVECTOR_HOST="localhost"
PGVECTOR_PORT="5432"
PGVECTOR_USER="postgres"
PGVECTOR_PASSWORD="password"

Programmatic:

from crossvector.dbs.pgvector import PgVectorAdapter

db = PgVectorAdapter(
    dbname="vectordb",
    host="localhost",
    port=5432,
    user="postgres",
    password="password"
)

Schema

PgVector stores metadata as JSONB:

CREATE TABLE vector_db (
    id TEXT PRIMARY KEY,
    vector vector(1536),
    text TEXT,
    metadata JSONB,
    created_timestamp DOUBLE PRECISION,
    updated_timestamp DOUBLE PRECISION
);

JSONB Features

Nested Metadata Access

Uses #>> operator for nested paths:

from crossvector.querydsl.q import Q

# Simple field
Q(category="tech")
# => "metadata->>'category' = 'tech'"

# Nested field
Q(user__role="admin")
# => "metadata #>> '{user,role}' = 'admin'"

# Deep nesting
Q(post__stats__views__gte=1000)
# => "(metadata #>> '{post,stats,views}')::numeric >= 1000"

Numeric Casting

Automatic casting for numeric comparisons:

# Text stored as string, but compared numerically
Q(score__gt=0.8)
# => "(metadata->>'score')::numeric > 0.8"

Q(price__lte=100)
# => "(metadata->>'price')::numeric <= 100"

Capabilities

engine = VectorEngine(db=PgVectorAdapter(), embedding=...)

# Metadata-only search
results = engine.search(
    query=None,
    where=Q(status="published")
)

# Nested metadata
results = engine.search(
    "query",
    where=Q(user__profile__verified=True) & Q(user__stats__posts__gte=10)
)

# Numeric comparisons (auto-cast)
results = engine.search(
    "query",
    where=Q(score__gte=0.8) & Q(price__lt=100)
)

# All operators
results = engine.search(
    "query",
    where=(
        Q(category="tech") &
        Q(level__in=["beginner", "intermediate"]) &
        Q(rating__gte=4.0) &
        ~Q(archived=True)
    )
)

Performance

Collection limits: Millions of vectors (PostgreSQL limits)
Throughput: High (ACID overhead)
Latency: 10-50ms typical
Cost: Free (self-hosted PostgreSQL)

Indexing

-- Create IVFFlat index for faster vector search
CREATE INDEX ON vector_db
USING ivfflat (vector vector_cosine_ops)
WITH (lists = 100);

-- Create GIN index for metadata queries
CREATE INDEX ON vector_db USING GIN (metadata);

-- Create index on specific nested field
CREATE INDEX ON vector_db ((metadata->>'category'));

Best Practices

# Use nested metadata with JSONB
metadata = {
    "user": {"id": 123, "role": "admin"},
    "content": {"type": "article", "tags": ["python", "ai"]}
}

# Numeric fields work with string or number
metadata = {"score": "0.95"}  # Auto-cast in comparisons
metadata = {"score": 0.95}    # Direct numeric

# Index frequently queried fields
# CREATE INDEX ON vector_db ((metadata->>'category'));

# Batch operations with transactions
engine.bulk_create(docs, batch_size=500)

# Use metadata-only for fast filtering
results = engine.search(query=None, where={"status": {"$eq": "active"}})

Comparison Matrix

Feature Comparison

Feature	AstraDB	ChromaDB	Milvus	PgVector
Nested Metadata	Full JSON	Via Dot Notation	Full JSON	Full JSONB
Metadata-Only Search	Yes	Yes	Yes	Yes
Numeric Casting	Yes	Limited	Yes	Auto
Transaction Support	No	No	No	ACID
Horizontal Scaling	Auto	No	Yes	Read replicas
Managed Service	Yes	Cloud	Zilliz Cloud	Self-host
Open Source	No	Yes	Yes	Yes

Operator Support

All backends support the same 10 operators:

Operator	AstraDB	ChromaDB	Milvus	PgVector
`$eq`	Yes	Yes	Yes	Yes
`$ne`	Yes	Yes	Yes	Yes
`$gt`	Yes	Yes	Yes	Yes
`$gte`	Yes	Yes	Yes	Yes
`$lt`	Yes	Yes	Yes	Yes
`$lte`	Yes	Yes	Yes	Yes
`$in`	Yes	Yes	Yes	Yes
`$nin`	Yes	Yes	Yes	Yes
`and` (&)	Yes	Yes	Yes	Yes
`or` (\|)	Yes	Yes	Yes	Yes

Use Case Recommendations

Choose AstraDB if

Need managed serverless solution
Want full nested metadata support
Require high scalability
Prefer pay-as-you-go pricing

Choose ChromaDB if

Want simple setup (in-memory)
Building prototype/MVP
Prefer open source
Need multiple deployment options

Choose Milvus if

Need maximum performance
Have large-scale deployment (billions of vectors)
Want distributed architecture
Need full JSON nested metadata

Choose PgVector if

Already using PostgreSQL
Need ACID transactions
Want full SQL capabilities
Prefer mature, stable ecosystem

Switching Backends

Same code works across all backends:

from crossvector import VectorEngine
from crossvector.embeddings.gemini import GeminiEmbeddingAdapter
from crossvector.querydsl.q import Q

# Create embedding adapter (same for all)
embedding = GeminiEmbeddingAdapter()

# Choose backend (interchangeable)
if backend == "astradb":
    from crossvector.dbs.astradb import AstraDBAdapter
    db = AstraDBAdapter()
elif backend == "chroma":
    from crossvector.dbs.chroma import ChromaDBAdapter
    db = ChromaDBAdapter()
elif backend == "milvus":
    from crossvector.dbs.milvus import MilvusAdapter
    db = MilvusAdapter()
else:  # pgvector
    from crossvector.dbs.pgvector import PgVectorAdapter
    db = PgVectorAdapter()

# Same API for all backends
engine = VectorEngine(db=db, embedding=embedding)

# Same operations
doc = engine.create("Document text", category="tech")
results = engine.search("query", where=Q(category="tech"), limit=10)

Only consideration: Check engine.supports_metadata_only for Milvus (it's now supported, but verify with your deployment).

Next Steps

Embedding Adapters - Embedding providers
API Reference - Complete API documentation
Query DSL - Advanced filtering
Configuration - Settings reference