Every RAG application needs somewhere to store embeddings. Pick the wrong database and you're paying 10x too much, watching queries time out at scale, or rewriting your entire data layer six months in. The vector database ecosystem has matured fast — seven serious production options as of March 2026, each with genuine trade-offs that matter at different scales.
This guide cuts through the marketing: how the algorithms work, what each database does better than the others, real benchmark numbers, a cost breakdown, and a clear decision framework. The running example throughout is a document search platform for a law firm — 50 million document embeddings, complex metadata filters (jurisdiction, date range, case type), and sub-200ms retrieval latency requirements.
Why Vector Databases Differ from Traditional Databases
A traditional database stores rows and columns, then retrieves them by exact match or range scan on an index like a B-tree. "Find all cases filed after 2020" is a B-tree query. "Find the 10 cases most semantically similar to this contract dispute" is something else entirely.
Vector databases answer nearest neighbor queries: given a query vector q, return the k vectors in the corpus most similar to q by cosine distance or Euclidean distance. With 50 million 1536-dimensional vectors (OpenAI's text-embedding-3-small), a brute-force comparison requires 50 million dot products per query. At 1536 floats each, that's roughly 300 GB of data to scan. No amount of clever B-tree work helps here.
The solution is approximate nearest neighbor (ANN) search: trade a small amount of recall accuracy for a massive speedup.
Key Insight: Vector databases don't guarantee they'll find the true nearest neighbor — they find a very close one, usually within 95–99.5% recall, while scanning only a fraction of the corpus. That trade-off is what makes searching 50 million vectors in milliseconds feasible.
The Three ANN Algorithms Powering Vector Search
Understanding HNSW, IVF, and FLAT helps you interpret every benchmark and configuration decision you'll encounter.
Figure: HNSW hierarchical graph traversal across layers for vector search
FLAT is the baseline: store all vectors and scan every one. 100% recall, zero approximation error, but query time scales linearly with corpus size. Useful for datasets under 100K vectors or when recall is non-negotiable.
IVF (Inverted File Index) clusters vectors into groups using k-means, then at query time searches only the most relevant clusters. Building the index is fast, memory usage is modest, and filtered searches (like "only search documents from 2023") perform well because IVF's cluster structure interacts naturally with pre-filtering. The downside: you need to decide the number of clusters upfront, and performance degrades faster than HNSW at very high dimensions.
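To make IVF concrete, here is a minimal pure-Python sketch — a naive k-means plus cluster-probing search on toy random data, not how any production engine implements it:

```python
import math
import random

random.seed(0)

def l2(a, b):
    return math.dist(a, b)  # Euclidean distance (Python 3.8+)

def kmeans(vectors, k, iters=10):
    """Naive k-means: returns centroids and each vector's cluster assignment."""
    centroids = [list(c) for c in random.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: l2(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign

def ivf_search(query, vectors, centroids, assign, nprobe=2, topk=3):
    """Scan only the nprobe clusters whose centroids lie closest to the query."""
    k = len(centroids)
    probed = set(sorted(range(k), key=lambda c: l2(query, centroids[c]))[:nprobe])
    candidates = [i for i, a in enumerate(assign) if a in probed]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:topk]

vectors = [[random.gauss(0, 1) for _ in range(8)] for _ in range(200)]
centroids, assign = kmeans(vectors, k=8)
hits = ivf_search(vectors[0], vectors, centroids, assign)  # vector 0 ranks itself first
```

The `nprobe` parameter is the knob: probing more clusters raises recall and costs proportionally more scanning, which is exactly the recall-vs-QPS trade-off in the benchmark table below.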
HNSW (Hierarchical Navigable Small World) builds a multi-layer graph. The top layer has sparse long-range connections between landmark nodes, middle layers progressively add more nodes, and the bottom layer contains every vector with dense local connections. A query starts at the top, greedily navigates toward the nearest neighbor, then descends to denser layers for refinement. HNSW is the default in most production systems because it delivers the best recall-to-latency trade-off, supports incremental inserts without index rebuilds, and scales well in high dimensions.
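The full layered traversal is hard to compress into a few lines, but the core move HNSW repeats at every layer — greedy graph descent toward the query — can be sketched in pure Python. This is a single-layer toy (real HNSW adds the layer hierarchy and an ef-sized candidate beam):

```python
import math
import random

random.seed(1)

def dist(a, b):
    return math.dist(a, b)  # Euclidean distance

def build_nsw(vectors, m=5):
    """Insert vectors one at a time, linking each to its m nearest predecessors."""
    neighbors = {0: set()}
    for i in range(1, len(vectors)):
        nearest = sorted(neighbors, key=lambda j: dist(vectors[i], vectors[j]))[:m]
        neighbors[i] = set(nearest)
        for j in nearest:
            neighbors[j].add(i)  # make links bidirectional
    return neighbors

def greedy_search(query, vectors, neighbors, entry=0):
    """Hill-climb through the graph: move to any closer neighbor, stop at a local minimum."""
    current = entry
    while True:
        best = min(neighbors[current] | {current}, key=lambda j: dist(query, vectors[j]))
        if best == current:
            return current
        current = best

vectors = [[random.gauss(0, 1) for _ in range(4)] for _ in range(300)]
graph = build_nsw(vectors)
found = greedy_search(vectors[42], vectors, graph)  # usually lands on or very near 42
```

The returned node is always a local minimum of the graph — the beam search and upper layers in real HNSW exist precisely to stop that local minimum from being a bad one.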
The catch: HNSW consumes significant memory. Each node stores its graph connections alongside the vector, roughly 50–100 bytes per connection on top of the raw vector bytes. At 50 million 768-dimensional vectors (float32), the raw data is about 150 GB. HNSW metadata can add another 20–40 GB.
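Back-of-envelope sizing like this is worth scripting before provisioning hardware. The per-link byte cost below is a rough assumption chosen to land in the 20–40 GB overhead range, not a measured constant:

```python
def hnsw_memory_gb(n_vectors, dims, m=24, bytes_per_link=12, dtype_bytes=4):
    """Rough HNSW sizing: raw float32 vectors plus bottom-layer graph links.
    Assumes ~2*m links per node; bytes_per_link is a rough allowance for
    the neighbor id plus index structure overhead (an assumption)."""
    raw_gb = n_vectors * dims * dtype_bytes / 1e9
    graph_gb = n_vectors * 2 * m * bytes_per_link / 1e9
    return raw_gb, graph_gb

raw_gb, graph_gb = hnsw_memory_gb(50_000_000, 768)
print(f"raw vectors: {raw_gb:.0f} GB, graph overhead: {graph_gb:.0f} GB")
# → raw vectors: 154 GB, graph overhead: 29 GB
```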
| Algorithm | Build Speed | Query Speed | Recall | Memory | Filtered Search |
|---|---|---|---|---|---|
| FLAT | Instant | Slow (linear) | 100% | Low (vectors only) | Excellent |
| IVF | Fast | Fast | 85–95% | Low-medium | Good |
| HNSW | Slow | Very fast | 95–99.5% | High (graph overhead) | Fair (degrades with tight filters) |
Benchmark data from ann-benchmarks.com on 1M vectors at 768 dimensions: HNSW achieves 95% recall at approximately 1,200 QPS, while IVF reaches 85% recall at roughly 2,000 QPS. The recall-to-QPS frontier shifts as filters get more restrictive — as you narrow the filter, HNSW's traversal advantage erodes because graph pruning relies on exploring similar neighborhoods, which breaks when many neighbors are excluded.
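Recall itself is simple to measure once you have ground-truth neighbors from an exact FLAT scan — a sketch with hypothetical ID lists:

```python
def recall_at_k(approx_ids, true_ids):
    """Fraction of the true top-k neighbors that the ANN index actually returned."""
    return len(set(approx_ids) & set(true_ids)) / len(true_ids)

# Hypothetical results: the ANN index returned 9 of the 10 true nearest neighbors.
true_top10 = [3, 17, 42, 56, 61, 70, 88, 91, 95, 99]
ann_top10  = [3, 17, 42, 56, 61, 70, 88, 91, 95, 12]
print(recall_at_k(ann_top10, true_top10))  # → 0.9
```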
Cosine Similarity: The Math Behind Every Query
Before comparing databases, it helps to understand what they're computing. Given a query vector $q$ and a document vector $d$, cosine similarity measures how aligned they are in direction:

$$\text{cosine}(q, d) = \frac{q \cdot d}{\|q\| \, \|d\|}$$

Where:
- $q \cdot d$ is the dot product of the query and document vectors
- $\|q\|$ is the L2 norm (magnitude) of the query vector
- $\|d\|$ is the L2 norm of the document vector
- The result is bounded between $-1$ and $1$, where $1$ means identical direction
In Plain English: Cosine similarity ignores how long the vectors are and only cares about their direction. Two contract documents that both heavily reference "breach of fiduciary duty" will point in similar directions in embedding space — even if one is a 1-page summary and the other is a 200-page filing. This is exactly what the law firm's search system needs.
When embeddings are pre-normalized (unit length, as most embedding APIs return), the dot product equals cosine similarity directly. This is why ANN indexes compute dot products internally rather than the full cosine formula.
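A quick pure-Python check of that equivalence, using toy 4-dimensional vectors as stand-ins for real embeddings:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def normalize(v):
    norm = math.hypot(*v)  # L2 norm
    return [x / norm for x in v]

# Toy 4-dimensional stand-ins for real embedding vectors
q = normalize([0.3, 0.9, 0.2, 0.1])
d = normalize([0.2, 0.8, 0.3, 0.4])

dot = sum(x * y for x, y in zip(q, d))
assert abs(cosine(q, d) - dot) < 1e-12  # identical once both are unit length
print(round(dot, 4))  # → 0.9362
```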
The Seven Contenders
Figure: Full RAG retrieval pipeline from text input through vector search to LLM context
Pinecone: Managed Simplicity with Dedicated Read Nodes
Pinecone is a fully managed vector database — you never touch infrastructure. No servers to provision, no indexes to rebuild, no capacity planning for hot replicas. You get an API, a Python client, and a billing dashboard.
In December 2025, Pinecone launched Dedicated Read Nodes (DRN), a major architectural addition that changed the cost equation significantly. The previous "pod tier" model is gone. Now there are two options: serverless on-demand (pay-per-request, best for variable traffic) and Dedicated Read Nodes (hourly per-node pricing, best for sustained high-QPS workloads).
Serverless pricing is based on Read Units (RUs): roughly $16 per million RUs at the Standard plan, with storage at $0.33/GB/month. For a 50M vector index queried at moderate rates, expect $300–$800/month. The billing can surprise teams — AI startups have reported monthly bills jumping from $50 to $3,000 as usage scaled, which exposed a fundamental issue: pay-per-query pricing makes Pinecone serverless a poor fit for sustained high-throughput workloads.
DRNs fix this. They keep data warm in memory plus local SSD for predictable latency. One customer sustained 600 QPS at P50 latency of 45ms and P99 of 96ms across 135 million vectors. In a load test, they reached 2,200 QPS at P50 of 60ms. DRN pricing is hourly per node — better cost predictability for teams with consistent demand patterns.
Pro Tip: Pinecone's namespace feature lets you partition vectors within a single index. For the law firm, each client matter becomes its own namespace, enabling multi-tenant isolation without multiplying infrastructure costs.
Pinecone supports metadata filtering using a MongoDB-style filter syntax. Filtering is competitive but not best-in-class — teams doing heavy pre-filtering sometimes see latency degrade more than with Qdrant at equivalent recall targets.
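A sketch of what such a filter looks like — the $-prefixed operators ($and, $in, $gte, $eq) are part of Pinecone's documented filter language, while the field names are invented for the law-firm example:

```python
# Hypothetical law-firm metadata filter in Pinecone's MongoDB-style syntax.
# The field names (jurisdiction, filing_year, case_type) are invented here.
legal_filter = {
    "$and": [
        {"jurisdiction": {"$in": ["CA", "NY"]}},
        {"filing_year": {"$gte": 2020}},
        {"case_type": {"$eq": "contract_dispute"}},
    ]
}

# Passed alongside the query vector, roughly:
# index.query(vector=query_vec, top_k=10, filter=legal_filter, namespace="matter_0042")
```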
Qdrant: Open-Source Rust Performance
Qdrant is written entirely in Rust, which gives it something most vector databases can't match: genuinely predictable latency without garbage collection pauses. Java-based systems (looking at you, Elasticsearch) suffer periodic GC pauses that blow out p99 latency; Qdrant doesn't.
The filtering story is where Qdrant genuinely differentiates. It indexes payload fields (metadata) in a separate BTree structure and uses them to guide HNSW traversal rather than post-filtering. Qdrant added the ACORN algorithm in 2025, which improves filtered HNSW quality significantly — maintaining near-full recall even with highly selective filters by adaptively switching between graph traversal and brute-force based on filter selectivity.
Recent 2025 additions make the full feature set competitive: native BM25 IDF calculation on the server side (since v1.15.2), Maximal Marginal Relevance (MMR) for diversity-aware results, score-boosting reranking, and full-text filtering with multilingual tokenization. Hybrid search (dense + sparse) has been supported since v1.9 via named vectors — a collection can hold both a dense HNSW index and a sparse inverted index, queried simultaneously and fused at retrieval time.
Benchmarks from Qdrant's vector-db-benchmark at 50M vectors show 41 QPS at 99% recall under heavy filtering conditions. At lighter filter ratios on 1M vectors, Qdrant achieves around 326 QPS at high recall — the number varies significantly with filter selectivity.
One practical consideration: Qdrant ships built-in quantization options. Scalar quantization (int8) reduces memory by 4x with less than 1% recall loss; binary quantization cuts it by 32x with about 5% recall loss. At 50M vectors, the difference between storing float32 (300 GB) and int8 (75 GB) is the difference between a $4,000/month server cluster and a $1,000/month one.
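Scalar quantization is simple enough to sketch in pure Python — a symmetric, per-vector scheme for illustration; Qdrant's internals differ:

```python
def quantize_int8(vector):
    """Symmetric scalar quantization: map floats into [-127, 127] with a per-vector scale."""
    scale = max(abs(x) for x in vector) / 127 or 1.0  # guard the all-zero vector
    return [round(x / scale) for x in vector], scale

def dequantize(qvec, scale):
    return [q * scale for q in qvec]

v = [0.12, -0.87, 0.44, 0.05]
qv, scale = quantize_int8(v)
restored = dequantize(qv, scale)
max_err = max(abs(a - b) for a, b in zip(v, restored))

assert all(-127 <= q <= 127 for q in qv)  # each value fits in one byte instead of four
assert max_err <= scale / 2 + 1e-12       # rounding error bounded by half a quantization step
```

The 4x memory saving is exactly the one-byte-versus-four-bytes difference per dimension; the bounded rounding error is why recall loss stays under 1% in practice.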
Qdrant is self-hosted by default (Docker, Kubernetes) with a managed cloud at qdrant.tech. License is Apache 2.0.
Weaviate: Hybrid Search with BlockMax WAND
Weaviate approaches vector search as a hybrid-first problem: it's built to combine dense vector search with keyword (BM25) search in a single query, without managing two separate systems.
The hybrid mode runs both searches in parallel, then fuses the ranked lists. Relative Score Fusion (the default since v1.24) normalizes raw similarity scores from both searches and combines them with a weighted sum, which better respects score magnitude than Reciprocal Rank Fusion.
In 2025, Weaviate reached v1.35 and shipped several meaningful improvements. BlockMax WAND — now generally available — organizes the inverted index in blocks and skips irrelevant ones, delivering up to 10x speedup in keyword searches. Rotational quantization (RQ) is now the default compression method, improving recall over standard binary quantization. Replica movement hit GA for live shard rebalancing. MUVERA encoding enables multi-vector search (one document represented by multiple vectors) natively.
Common Pitfall: Weaviate's modular vectorizer architecture makes it tempting to attach OpenAI, Cohere, or sentence-transformer modules directly to the database. This is convenient for prototyping but creates tight coupling in production — if your vectorizer API is down, so is your search. Keep vectorization in your application layer and pass pre-computed vectors to Weaviate.
Weaviate consumes more memory and CPU than Qdrant at equivalent scale. Below 50 million vectors it runs efficiently; beyond that, capacity planning requires real care. License is BSD 3-clause open source.
Milvus: Cloud-Native at Billion Scale
Milvus is the largest open-source vector database by GitHub stars (40,000+ as of early 2026), governed by the LF AI & Data Foundation. It was built from the start for cloud-native deployments: its architecture separates storage, compute, and coordination into independent microservices backed by Kafka, MinIO, and etcd.
Milvus 2.5 (released December 2024) introduced native full-text search with Sparse-BM25 technology — a sparse vector implementation of the BM25 algorithm. In internal benchmarks against Elasticsearch on the same 1M-vector workload, Milvus 2.5 returned results in 6ms versus Elasticsearch's 200ms — roughly a 30x improvement — with unified storage eliminating the need for separate keyword and vector infrastructure. Milvus 2.6, now GA on Zilliz Cloud as of early 2026, adds hot/cold tiering for cost-efficient archival and marks a shift the team describes as moving from "vector search + glue code" to a full retrieval engine.
The scaling story is genuinely different from the others. Where Qdrant or Weaviate handle millions to tens of millions of vectors comfortably, Milvus is routinely deployed at hundreds of millions to billions of vectors in production — search companies, e-commerce platforms, and genomics research.
The trade-off is operational complexity. Running Milvus in production means running Kafka, MinIO, etcd, and the Milvus coordination plane together. The Kubernetes operator simplifies this, but it's still more moving parts than Qdrant or Pinecone. For the law firm's 50M vectors, Milvus is likely over-engineered unless they expect to grow to 500M+ vectors or want the full-text hybrid search performance numbers.
Zilliz Cloud is the managed offering. License is Apache 2.0.
LanceDB: The Multimodal Contender
LanceDB is the newest serious contender in this space and was missing from most vector database comparisons written before 2025. It deserves its own section now.
LanceDB is built on the Lance columnar format, which stores vectors alongside their associated data (images, text, metadata) in a single unified file format rather than keeping vectors and raw data in separate systems. This columnar layout means vector search, filtering, and data retrieval happen in one scan — no join between a vector index and an object store. The format is Apache 2.0 open source.
Production adoption tells the real story: Midjourney, Runway, and Character.ai run LanceDB at scale, searching tens of billions of vectors and managing petabytes of training data. LanceDB raised a $30M Series A in June 2025 to build what they call the Multimodal Lakehouse — unified infrastructure for text, images, audio, and video in a single retrieval platform.
The January 2026 update added native DuckDB SQL integration, turning DuckDB into a SQL compute engine over Lance datasets with vector search exposed as SQL table functions. You can write SELECT * FROM docs WHERE vector_search(embedding, :query_vec, limit=10) AND year > 2022 in plain SQL. That's a fundamentally different interface from any other vector database here.
LanceDB supports hybrid search (dense + sparse BM25) natively, handles multiple vector types per record, and ships first-class integrations with VoyageAI and multimodal embedding models.
Key Insight: If your use case involves AI training data management alongside retrieval (curating datasets, filtering training examples, feature engineering), LanceDB's unified format eliminates the data pipeline complexity that plagues systems using separate vector stores and data lakes.
For pure text retrieval at 10–100M vectors without multimodal needs, Qdrant or Weaviate are still better fits. LanceDB's advantage shows up when you need to search AND manage the underlying data in the same system.
ChromaDB: Prototyping Speed
ChromaDB is the fastest path from zero to working vector search. The Python client gets you running in a handful of lines:

```python
import chromadb

client = chromadb.Client()  # in-memory; use chromadb.PersistentClient(path=...) for disk
collection = client.create_collection("case_law")
collection.add(documents=["contract dispute..."], ids=["case_001"])
results = collection.query(query_texts=["breach of fiduciary duty"], n_results=5)
```
ChromaDB reached v1.5.5 in March 2026, a significant version history jump from its early 0.x days. The core interface is stable, the Python-first experience is excellent, and for prototyping it's unmatched in simplicity. But it's still not a production system. Chroma uses an in-memory HNSW index persisted to disk, with no horizontal scaling, no native hybrid search, and no filtered HNSW. Teams routinely hit performance walls at 5–10 million vectors and migrate to Qdrant or Pinecone.
ChromaDB is MIT licensed, self-hosted. A cloud offering exists but Chroma's primary value is in local development.
pgvector and pgvectorscale: When You're Already on Postgres
Pgvector is a PostgreSQL extension that adds vector storage and HNSW/IVFFlat indexing to any Postgres instance. If your law firm's case metadata already lives in Postgres — case IDs, client names, filing dates, jurisdiction codes — pgvector lets you run vector search without a separate system.
Timescale's pgvectorscale takes this further with StreamingDiskANN, a disk-based ANN index inspired by Microsoft's DiskANN research. Unlike HNSW, StreamingDiskANN stores part of the index on disk, dramatically reducing memory requirements for large datasets while maintaining competitive recall.
The performance headline is striking: according to Timescale's pgvectorscale benchmarks, on a dataset of 50 million Cohere embeddings at 768 dimensions, PostgreSQL with pgvector + pgvectorscale achieves 28x lower p95 latency and 16x higher query throughput than Pinecone's storage-optimized (s1) index at 99% recall — and does so at 75% less cost when self-hosted on AWS EC2. In filter-heavy benchmarks at the same scale and recall target, pgvectorscale reached 471 QPS against Qdrant's 41 QPS — an 11x gap that shows how well StreamingDiskANN holds up under restrictive filters.
The transactional consistency story is pgvector's unique advantage: vector search and metadata queries participate in the same ACID transactions. No dual-write, no eventual consistency between your relational data and your vector index.
Common Pitfall: Postgres's query planner wasn't designed for filtered vector search. When you add a WHERE clause to a vector similarity query, Postgres sometimes chooses a full sequential scan over the HNSW index, wiping out all performance gains. The fix: use SET enable_seqscan = off for vector queries, or switch to pgvectorscale's StreamingDiskANN index, which handles filtered queries more reliably than vanilla HNSW in pgvector.
Beyond 100 million vectors, the architectural cracks show. Postgres wasn't designed with vector-first storage layouts, index rebuilds are memory-intensive and block writes, and the shared-buffer pool that makes Postgres fast for relational queries competes directly with your 300 GB of vectors.
Full Comparison Matrix
Figure: Vector database selection decision tree for choosing between Pinecone, Qdrant, Weaviate, Milvus, LanceDB, Chroma, and pgvector
| Database | Type | License | Index | Hybrid Search | Max Practical Scale | Filtering | Best For |
|---|---|---|---|---|---|---|---|
| Pinecone | Managed cloud | Proprietary | HNSW | Basic | Billions (DRN) | Good | Zero-ops production |
| Qdrant | Self-hosted / cloud | Apache 2.0 | HNSW + ACORN | v1.9+ (sparse) | 100M–1B | Excellent | Filtered self-hosted production |
| Weaviate | Self-hosted / cloud | BSD 3 | HNSW | Native BM25+dense | 50M–500M | Good | Hybrid document search |
| Milvus | Self-hosted / cloud | Apache 2.0 | HNSW + IVF | v2.5+ (Sparse-BM25) | Billions+ | Good | Planet-scale retrieval |
| LanceDB | Self-hosted / cloud | Apache 2.0 | IVF-PQ / HNSW | Native | 10B+ (with enterprise) | Good | Multimodal + training data |
| ChromaDB | Self-hosted | MIT | HNSW | No | < 10M | Basic | Prototyping |
| pgvector + pgvectorscale | Postgres extension | PostgreSQL | HNSW / DiskANN | No | < 100M | Excellent (SQL) | Existing Postgres stacks |
Hybrid Search: Why Dense + Sparse Often Wins
Pure vector search sometimes fails on legal queries that rely on specific citations. When a litigator searches for "Chevron deference after Loper Bright v. Raimondo," the semantic vector captures the general topic but may surface broadly relevant administrative law cases rather than the specific post-2024 cases explicitly grappling with the Loper Bright decision. Keyword matching catches the citation directly.
Hybrid search combines a dense vector index with a sparse BM25 index, then fuses the results. The fusion step matters: Reciprocal Rank Fusion (RRF) combines rankings based on position, rewarding documents that rank well in either list. Relative Score Fusion (Weaviate's default from v1.24) normalizes raw similarity scores and combines them with a weighted sum, which better respects score magnitude.
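Both fusion strategies fit in a few lines of pure Python — a sketch with made-up scores, not any database's exact implementation:

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score each doc by 1/(k + rank), summed over all lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def relative_score_fusion(dense, sparse, alpha=0.5):
    """Min-max normalize each score list, then blend with a weighted sum."""
    def norm(scored):
        lo, hi = min(scored.values()), max(scored.values())
        span = (hi - lo) or 1.0
        return {d: (s - lo) / span for d, s in scored.items()}
    nd, ns = norm(dense), norm(sparse)
    docs = set(nd) | set(ns)
    fused = {d: alpha * nd.get(d, 0.0) + (1 - alpha) * ns.get(d, 0.0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

# Made-up scores: cosine similarities from the dense side, BM25 scores from the sparse side
dense_hits  = {"case_A": 0.92, "case_B": 0.88, "case_C": 0.70}
sparse_hits = {"case_B": 14.1, "case_D": 12.5, "case_A": 3.2}

print(rrf([["case_A", "case_B", "case_C"], ["case_B", "case_D", "case_A"]]))
print(relative_score_fusion(dense_hits, sparse_hits))
```

Note how RRF ignores score magnitudes entirely (only positions matter), while relative score fusion lets a dominant BM25 score pull a citation match to the top — the distinction the Weaviate default exploits.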
The databases split cleanly on how they implement this:
- Weaviate: Most mature native implementation with BM25F (field-weighted BM25). BlockMax WAND (GA in 2025) makes the keyword side 10x faster, and MUVERA encoding handles multi-vector documents natively.
- Milvus 2.5+: Sparse-BM25 natively stores BM25 scores as sparse vectors alongside dense embeddings. The 30x latency advantage over Elasticsearch in internal tests makes this compelling for teams moving off ELK stacks.
- Qdrant v1.9+: Named vectors allow a collection to hold both a dense HNSW index and a sparse inverted index. Since v1.15.2, Qdrant handles IDF computation server-side, which was previously a missing piece that required client-side BM25.
- LanceDB: Native hybrid search via a single SQL-like query interface, with DuckDB integration for composable retrieval.
- Pinecone: Supports sparse-dense hybrid via their proprietary sparse encoding. Works, but you're locked into their sparse format rather than standard BM25/SPLADE.
- ChromaDB and plain pgvector: No native hybrid search without external tooling.
For more on building full retrieval pipelines that combine these components, see Agentic RAG: Self-Correcting Retrieval Systems.
Cost at Production Scale
Cost is often the deciding factor. The estimates below are for 50M 768-dim vectors with moderate query rates (not burst traffic) as of March 2026. Self-hosted costs assume AWS EC2 with memory-optimized instances.
Figure: Vector database estimated monthly cost at 50M vectors as of 2026
| Database | Monthly Cost (50M vectors, moderate QPS) | Cost Model |
|---|---|---|
| Pinecone serverless | $300–$800 | Pay-per-request (volatile at scale) |
| Pinecone DRN | $1,500+ | Hourly per node |
| Qdrant self-hosted | $400–$600 | Fixed infrastructure (EC2 + int8 quantization) |
| Weaviate self-hosted | $500–$900 | Fixed infrastructure |
| Milvus self-hosted | $600–$1,200 | Fixed infra + operational overhead |
| pgvector + pgvectorscale | $200–$400 | Shares Postgres instance cost |
| LanceDB self-hosted | $200–$500 | S3 + compute (columnar, very storage-efficient) |
The self-hosted break-even vs. managed serverless sits around 80–100 million queries per month. Below that threshold, Pinecone's serverless pricing is competitive and you avoid operational overhead. Above it, any self-hosted option will save money, often dramatically.
Pro Tip: Qdrant with int8 scalar quantization reduces the memory footprint of 50M vectors from ~150 GB to ~38 GB. This single configuration change drops you from needing a memory-optimized r7g.8xlarge (256 GB RAM, ~$1,200/mo) to an r7g.2xlarge (64 GB RAM, ~$450/mo). Run the numbers on quantization before provisioning infrastructure.
When to Use Each Database
The decision isn't just about raw performance. It's about operational complexity, cost structure, and which problems you're most likely to hit at your scale.
Use ChromaDB when you're still proving the concept. Get semantic search working in an afternoon, validate the use case, then pick the right production database. Plan to migrate when you hit 5 million vectors or need complex metadata filtering.
Use pgvector + pgvectorscale when your data already lives in Postgres and your scale stays under 50 million vectors. The operational simplicity of one system, plus transactional consistency, often outweighs the performance advantages of dedicated systems. pgvectorscale's StreamingDiskANN makes this a real production option in 2026 — the 75% cost advantage over Pinecone at 50M vectors is hard to ignore if you're already a Postgres shop.
Use Pinecone when you need production-grade latency (sub-100ms P99), your engineering team doesn't want to operate vector databases, and your budget can absorb managed service pricing. For bursty or unpredictable traffic, serverless is the right tier. For sustained high-QPS production workloads (user-facing assistants, recommendation engines), Dedicated Read Nodes are the better choice. Model the cost at your expected query volume before committing.
Use Qdrant when you need filtering excellence on a self-hosted system. The ACORN algorithm (2025) makes filtered HNSW genuinely competitive at tight filter ratios. For the law firm, Qdrant running on a couple of EC2 instances with int8 quantization is probably the best performance-per-dollar option — and for firms with data residency requirements, self-hosting is often non-negotiable anyway.
Use Weaviate when hybrid search quality is your primary concern. If "find cases semantically similar to this clause AND mentioning this specific statute" is your core query pattern, Weaviate's BlockMax WAND + Relative Score Fusion combination is the most mature native hybrid implementation available in 2026. The modules system also makes cross-encoder reranking easy to add as a post-retrieval step.
Use Milvus when you're building at scale that makes other options sweat. Milvus 2.5's 30x latency advantage over Elasticsearch for hybrid search is a strong migration argument for teams currently running ELK stacks for document retrieval. And when you're approaching billion-scale vectors, Milvus's separation of storage and compute makes it the only realistic open-source option.
Use LanceDB when you're working with multimodal data (images, video, audio alongside text) or when you need unified vector retrieval and training data management. Midjourney and Runway chose it for these exact reasons. The DuckDB SQL interface also makes it appealing for data science teams who prefer SQL to client library APIs.
Understanding the underlying vector representations is essential for getting the most from any of these databases. For a deep dive into how embeddings are generated and why they enable semantic search, see Text Embeddings Explained: From Intuition to Production-Ready Search.
Conclusion
The vector database market looks very different in March 2026 than it did two years ago. Pinecone retired pod tiers for Dedicated Read Nodes. Qdrant's ACORN algorithm solved the filtered HNSW problem that once made it uncompetitive on selective queries. Milvus 2.5 made hybrid search 30x faster than Elasticsearch. LanceDB emerged as a genuine option for multimodal and AI training workloads. pgvectorscale made Postgres competitive at 50M vectors.
For the law firm scenario: start with pgvectorscale if you're already on Postgres — the cost savings and ACID transactions are compelling at 50M vectors. If you need data residency guarantees and filtering performance, Qdrant with int8 quantization delivers the best performance-per-dollar. If operational overhead is a dealbreaker, Pinecone DRN handles the scale with predictable latency.
The database is only one piece of the puzzle. Retrieval quality depends heavily on your embedding model, chunking strategy, reranking step, and the queries your users actually send. A well-configured pgvectorscale instance with a strong embedding model will outperform a poorly-tuned Pinecone deployment on most real-world recall benchmarks.
For a complete picture of how vector search fits into a full RAG pipeline — including self-correcting retrieval, query reformulation, and relevance evaluation — RAG Explained: Retrieval-Augmented Generation for LLMs walks through the end-to-end architecture. And for teams building autonomous systems that dynamically select retrieval strategies, Agentic RAG: Self-Correcting Retrieval Systems covers how agents loop over retrieval to improve answer quality.
Interview Questions
What is the fundamental difference between how a vector database indexes data versus how a relational database does it?
Relational databases use B-trees and hash indexes optimized for exact-match and range queries on structured columns. Vector databases use graph-based or cluster-based indexes (HNSW, IVF, DiskANN) optimized for approximate nearest neighbor search in high-dimensional embedding space. The key word is approximate: vector databases trade a small amount of recall accuracy for orders-of-magnitude speedups in query time.
Explain the recall-versus-QPS trade-off in ANN search and how you'd configure it in production.
Recall measures the fraction of true nearest neighbors returned; QPS is queries per second. You tune both by adjusting ef — HNSW's search expansion factor. Higher ef probes more graph nodes, improving recall but reducing QPS. In production, set ef to achieve your recall target (typically 95–99%) and then size your infrastructure to hit your QPS requirement. If p99 latency degrades under load, your ef is probably too aggressive relative to the available hardware.
When would you choose pgvector + pgvectorscale over a dedicated vector database like Qdrant or Pinecone?
Choose pgvectorscale when your structured metadata already lives in PostgreSQL, your scale stays under 50 million vectors, and you want ACID transactional consistency between vector searches and relational queries. pgvectorscale's StreamingDiskANN achieved 28x lower p95 latency than Pinecone's storage-optimized index at 50M vectors in Timescale's benchmarks, at 75% less cost. The operational simplicity of one system often outweighs the performance gap at moderate scale.
What makes HNSW more suitable for high-dimensional data than IVF?
HNSW's graph-based structure navigates via similarity at every step, maintaining effectiveness as dimensions grow. IVF clusters with k-means, and k-means cluster quality degrades with the curse of dimensionality — high-dimensional spaces have increasingly uniform distances, making cluster boundaries less meaningful. HNSW also supports incremental inserts without rebuilding the index, while IVF requires periodic re-clustering as the corpus grows.
How does hybrid search improve retrieval quality, and which databases support it natively in 2026?
Hybrid search combines dense vector search (semantic similarity) with sparse BM25 keyword search, then fuses the ranked results. It improves quality on queries involving domain-specific terminology, citations, or proper nouns that semantic vectors don't cluster reliably. In 2026, Weaviate (BlockMax WAND + RSF), Milvus 2.5+ (Sparse-BM25), Qdrant v1.9+ (named vector hybrid), LanceDB, and Pinecone (proprietary sparse encoding) all support native hybrid search. ChromaDB and plain pgvector do not.
A law firm notices their vector search returns semantically similar cases but misses cases containing a specific statute number. What's happening and how do you fix it?
This is the classic failure mode of pure dense vector search: embedding models tokenize statute numbers into subword tokens that don't cluster near the exact string match. The fix is hybrid search — add a BM25 index so exact-match keyword searches run in parallel with the dense vector search, then fuse the results. Alternatively, a reranking step using a cross-encoder model can catch these misses post-retrieval by scoring each candidate against the full query text.
What is vector quantization, and why does it matter for production deployments?
Vector quantization compresses embeddings by reducing their numerical precision. Scalar quantization converts float32 vectors to int8, reducing memory by 4x with under 1% recall loss. Binary quantization goes further to 32x compression with roughly 5% recall loss. At 50 million vectors, enabling int8 quantization in Qdrant can cut infrastructure costs from $1,200/month to around $450/month while maintaining acceptable retrieval quality. Qdrant, Milvus, Weaviate (rotational quantization), and Pinecone (handled internally) all support some form of quantization.
Why would a company like Midjourney choose LanceDB over Pinecone or Qdrant for production?
Midjourney works with multimodal data — images, prompts, and embeddings — at tens of billions of vectors. LanceDB's columnar Lance format stores vectors alongside the raw data in a single unified system, eliminating the need for a separate object store and vector index. This means a single scan handles both retrieval and data access, reducing the pipeline complexity of maintaining synchronized systems. For AI training data management (filtering, feature engineering, materialized views), LanceDB's Multimodal Lakehouse features also replace separate data lake tooling.