Skip to content

Schema and Data Models

Data structures and schemas used in CrossVector.

VectorDocument

The primary data model representing a document with embeddings.

Fields

from crossvector import VectorDocument

doc = VectorDocument(
    id: str | int,                    # Primary key
    vector: List[float],               # Embedding vector
    text: str = None,                  # Original text (optional)
    metadata: Dict[str, Any] = None,   # Arbitrary metadata
    created_timestamp: float = None,   # Creation timestamp
    updated_timestamp: float = None    # Last update timestamp
)

id (required)

Primary key identifier. Can be string or integer depending on PK strategy.

doc = VectorDocument(id="doc-123", ...)
doc = VectorDocument(id=42, ...)

vector (required)

Embedding vector as list of floats. Dimension must match embedding model.

doc = VectorDocument(
    id="doc-1",
    vector=[0.1, 0.2, 0.3, ...]  # 1536 dims for text-embedding-3-small
)

text (optional)

Original text content. Required if VectorEngine.store_text=True.

doc = VectorDocument(
    id="doc-1",
    vector=[...],
    text="This is the original document text"
)

metadata (optional)

Arbitrary metadata dictionary. Supports nested structures (backend-dependent).

doc = VectorDocument(
    id="doc-1",
    vector=[...],
    metadata={
        "category": "tech",
        "tags": ["python", "ai"],
        "author": {
            "name": "John",
            "role": "admin"
        },
        "score": 0.95,
        "featured": True
    }
)

created_timestamp (optional)

Unix timestamp for document creation. Auto-populated on insert.

import time
doc = VectorDocument(
    id="doc-1",
    vector=[...],
    created_timestamp=time.time()
)

updated_timestamp (optional)

Unix timestamp for last update. Auto-updated on modification.

doc = VectorDocument(
    id="doc-1",
    vector=[...],
    updated_timestamp=time.time()
)

Properties

pk

Alias for id property.

doc = VectorDocument(id="doc-123", vector=[...])
print(doc.pk)  # "doc-123"

Methods

Constructor Classmethods

from_text()

Create document from text string with optional metadata.

VectorDocument.from_text(
    text: str,
    **kwargs
) -> VectorDocument

Example:

doc = VectorDocument.from_text(
    "My document text",
    category="tech",
    priority=1
)
# Creates: VectorDocument(text="...", metadata={"category": "tech", "priority": 1})

from_dict()

Create document from dictionary.

VectorDocument.from_dict(
    data: Dict[str, Any]
) -> VectorDocument

Example:

doc = VectorDocument.from_dict({
    "id": "doc-123",
    "text": "Content",
    "metadata": {"key": "value"},
    "vector": [0.1, 0.2, ...]
})

from_kwargs()

Create document from keyword arguments.

VectorDocument.from_kwargs(**kwargs) -> VectorDocument

Example:

doc = VectorDocument.from_kwargs(
    id="doc-123",
    text="Content",
    vector=[...],
    metadata={"key": "value"}
)

from_any()

Auto-detect input format and create document.

VectorDocument.from_any(
    doc: str | Dict | VectorDocument
) -> VectorDocument

Examples:

# From string
doc = VectorDocument.from_any("Text content")

# From dict
doc = VectorDocument.from_any({"text": "Content", "metadata": {...}})

# From VectorDocument (returns copy)
doc = VectorDocument.from_any(existing_doc)

Data Export Methods

to_vector()

Extract vector in various formats.

to_vector(
    require: bool = False,
    output_format: Literal["dict", "json", "str", "list"] = "list"
) -> Any

Parameters:

  • require: Raise error if vector missing
  • output_format: Desired format:
  • "list" (default): Python list of floats
  • "dict": {"vector": [...]} wrapper
  • "json": JSON string representation
  • "str": String representation

Examples:

vector = doc.to_vector()  # [0.1, 0.2, ...]
vector = doc.to_vector(output_format="dict")  # {"vector": [0.1, 0.2, ...]}
vector = doc.to_vector(output_format="json")  # '[0.1, 0.2, ...]'
vector = doc.to_vector(require=False)  # None if missing

to_metadata()

Extract metadata dictionary.

to_metadata(sanitize: bool = True) -> Dict[str, Any]

Parameters:

  • sanitize: Remove None values

Example:

metadata = doc.to_metadata()
# {"category": "tech", "score": 0.95}

metadata = doc.to_metadata(sanitize=False)
# {"category": "tech", "score": 0.95, "optional": None}

to_storage_dict()

Convert to database storage format.

to_storage_dict(
    store_text: bool = False,
    use_dollar_vector: bool = False
) -> Dict[str, Any]

Parameters:

  • store_text: Include text field
  • use_dollar_vector: Use $vector key (AstraDB format)

Examples:

# Standard format
storage = doc.to_storage_dict()
# {"id": "doc-1", "vector": [...], "metadata": {...}}

# With text
storage = doc.to_storage_dict(store_text=True)
# {"id": "doc-1", "vector": [...], "text": "...", "metadata": {...}}

# AstraDB format
storage = doc.to_storage_dict(use_dollar_vector=True)
# {"_id": "doc-1", "$vector": [...], "metadata": {...}}

Metadata Schema

Metadata can contain arbitrary JSON-serializable data. Different backends support different levels of nesting.

Flat Metadata (All Backends)

metadata = {
    "category": "tech",
    "author": "John Doe",
    "score": 0.95,
    "published": True,
    "tags": ["python", "ai"],
    "count": 42
}

Nested Metadata (Backend Support)

Backend Nested Support Query Format
AstraDB Full {"user.role": {"$eq": "admin"}}
PgVector Full {"user.role": {"$eq": "admin"}}
ChromaDB Via dot notation {"user.role": {"$eq": "admin"}} (auto-flattened)
Milvus Full {"user.role": {"$eq": "admin"}}

Example with nested metadata:

doc = VectorDocument(
    id="doc-1",
    vector=[...],
    metadata={
        "user": {
            "name": "John",
            "role": "admin",
            "verified": True
        },
        "post": {
            "title": "My Post",
            "stats": {
                "views": 1000,
                "likes": 50
            }
        }
    }
)

# Query nested fields
from crossvector.querydsl.q import Q
results = engine.search(
    "query",
    where=Q(user__role="admin") & Q(post__stats__views__gte=500)
)

Metadata Types

Supported Types

CrossVector supports standard JSON types in metadata:

metadata = {
    "string": "text value",
    "integer": 42,
    "float": 3.14,
    "boolean": True,
    "null": None,
    "array": [1, 2, 3],
    "object": {"nested": "value"}
}

Type Casting (Backend-Specific)

Some backends require explicit type casting for numeric comparisons:

PgVector (automatic numeric casting):

# Text stored as string, but compared numerically
metadata = {"price": "99.99"}  # Stored as text
where = {"price": {"$gt": 50}}  # Cast to numeric for comparison

Other Backends:

Store numbers as actual numeric types when using comparison operators:

# Correct
metadata = {"price": 99.99, "count": 42}

# Incorrect for numeric comparisons
metadata = {"price": "99.99", "count": "42"}

Primary Key Strategies

Configure primary key generation in VectorEngine settings.

Strategy: uuid

Generate UUID v4 strings.

from crossvector.settings import CrossVectorSettings

settings = CrossVectorSettings(PRIMARY_KEY_MODE="uuid")
# Generated IDs: "a1b2c3d4-e5f6-7890-abcd-ef1234567890"

Strategy: hash_text

Hash document text using SHA256.

settings = CrossVectorSettings(PRIMARY_KEY_MODE="hash_text")
# Generated IDs: "5f4dcc3b5aa765d61d8327deb882cf99"

doc = engine.create("Hello world")
# doc.id = hash("Hello world")

Note: Requires text field to be present.

Strategy: hash_vector

Hash embedding vector using SHA256.

settings = CrossVectorSettings(PRIMARY_KEY_MODE="hash_vector")
# Generated IDs: "7b8e4d2a9c1f3e5d6a0b4c8e2f7d9a1b"

doc = engine.create(vector=[0.1, 0.2, ...])
# doc.id = hash(vector)

Strategy: int64

Generate random 64-bit integers.

settings = CrossVectorSettings(PRIMARY_KEY_MODE="int64")
# Generated IDs: 7234567890123456789

Strategy: auto

Use backend's native auto-generation (if supported).

settings = CrossVectorSettings(PRIMARY_KEY_MODE="auto")
# Backend-specific ID generation

Strategy: custom

Provide custom ID factory function.

from crossvector import VectorEngine
from crossvector.settings import CrossVectorSettings

def my_id_factory() -> str:
    return f"doc-{int(time.time())}"

settings = CrossVectorSettings(
    PRIMARY_KEY_MODE="custom",
    PK_FACTORY=my_id_factory
)

engine = VectorEngine(db=..., embedding=..., settings=settings)
doc = engine.create("Text")
# doc.id = "doc-1234567890"

Factory signature:

def pk_factory() -> str | int:
    """Generate unique primary key."""
    pass

Input Formats

VectorEngine accepts multiple input formats.

String Input

Creates document with text only. Embedding generated automatically.

doc = engine.create("My document text")
# VectorDocument(id=auto, text="...", vector=auto, metadata={})

Dict Input

Flexible dictionary with any combination of fields.

# Minimal
doc = engine.create({"text": "Content"})

# With metadata
doc = engine.create({
    "text": "Content",
    "metadata": {"key": "value"}
})

# With ID
doc = engine.create({
    "id": "custom-id",
    "text": "Content",
    "metadata": {...}
})

# With pre-computed vector
doc = engine.create({
    "text": "Content",
    "vector": [0.1, 0.2, ...],
    "metadata": {...}
})

VectorDocument Input

Direct VectorDocument instance.

from crossvector import VectorDocument

doc = VectorDocument(
    id="doc-123",
    text="Content",
    metadata={"key": "value"}
)
created = engine.create(doc)

Kwargs Input

Metadata fields as keyword arguments.

doc = engine.create(
    text="Content",
    category="tech",
    priority=1,
    featured=True
)
# metadata = {"category": "tech", "priority": 1, "featured": True}

Validation Rules

Required Fields

  • For creation: Either text or vector must be provided
  • For search: id is required
  • For update: id is required

Field Constraints

from pydantic import ValidationError

try:
    doc = VectorDocument(
        # Missing id
        vector=[0.1, 0.2],
        text="Content"
    )
except ValidationError as e:
    print(e)

Vector Dimension

Vector dimension must match embedding model:

# text-embedding-3-small: 1536 dimensions
doc = VectorDocument(
    id="doc-1",
    vector=[...],  # Must be length 1536
    text="Content"
)

Raises: InvalidFieldError if dimension mismatch


Serialization

JSON Serialization

VectorDocument can be serialized to JSON:

import json

doc = VectorDocument(
    id="doc-1",
    text="Content",
    vector=[0.1, 0.2, 0.3],
    metadata={"key": "value"}
)

# To JSON string
json_str = json.dumps(doc.model_dump())

# From JSON string
data = json.loads(json_str)
doc = VectorDocument(**data)

Database Format

Different backends expect different formats:

Standard (PgVector, Milvus, ChromaDB):

{
    "id": "doc-1",
    "vector": [0.1, 0.2, ...],
    "text": "Content",
    "metadata": {"key": "value"}
}

AstraDB:

{
    "_id": "doc-1",
    "$vector": [0.1, 0.2, ...],
    "text": "Content",
    "metadata": {"key": "value"}
}

Use to_storage_dict() to get correct format:

storage = doc.to_storage_dict(
    store_text=engine.store_text,
    use_dollar_vector=(engine.db.__class__.__name__ == "AstraDBAdapter")
)

Examples

Basic Document Creation

from crossvector import VectorDocument, VectorEngine

# Create with text
doc = VectorDocument.from_text(
    "Python is a programming language",
    category="tech",
    language="python"
)

# Store in database
engine = VectorEngine(db=..., embedding=...)
created = engine.create(doc)

print(created.id)  # Auto-generated
print(created.vector[:5])  # [0.123, 0.456, ...]
print(created.metadata)  # {"category": "tech", "language": "python"}

Document with Nested Metadata

doc = VectorDocument(
    id="post-123",
    text="My blog post about AI",
    vector=[...],
    metadata={
        "post": {
            "title": "Introduction to AI",
            "category": "technology",
            "tags": ["ai", "ml", "python"]
        },
        "author": {
            "name": "Jane Doe",
            "role": "contributor",
            "verified": True
        },
        "stats": {
            "views": 1500,
            "likes": 89,
            "shares": 12
        }
    }
)

# Query nested
from crossvector.querydsl.q import Q
results = engine.search(
    "AI tutorials",
    where=Q(author__verified=True) & Q(stats__views__gte=1000)
)

Batch Document Creation

docs = [
    {"text": f"Document {i}", "metadata": {"index": i, "batch": "A"}}
    for i in range(100)
]

created = engine.bulk_create(docs, batch_size=50)
print(f"Created {len(created)} documents")

Next Steps