
Keyword search has a fundamental problem. Users do not think in keywords. They think in concepts, questions, and intentions. They type "why does my app crash on startup" and expect results about initialization errors, dependency conflicts, and memory allocation failures. Keyword search returns results containing the exact words "app," "crash," and "startup." Often useless.
Semantic search closes this gap. It understands what users mean, not just what they type. Building one that works in production is surprisingly approachable. Building one that works well requires understanding a few things most tutorials skip.
Your content needs to become vectors before it can be searched semantically. This pipeline has three stages: chunking, embedding, and storage. Each stage has decisions that directly impact search quality.
Chunking breaks your content into searchable units. This is where most implementations go wrong on the first attempt. The instinct is to chunk by fixed token count. Split every 500 tokens. Clean. Predictable. And often terrible for search quality.
Why? Because fixed-size chunks ignore content structure. A 500-token chunk might contain the end of one section and the beginning of another. The resulting embedding captures a blend of two topics, matching neither well.
Better approach: chunk by content structure. For documentation, chunk by section. For articles, chunk by heading. For conversations, chunk by topic or turn. For code, chunk by function or class. The chunk boundaries should align with semantic boundaries.
Overlap between chunks preserves context that crosses boundaries. A 50-100 token overlap between consecutive chunks means that information near a boundary appears in both chunks. This prevents the "falls between chunks" problem where relevant content is missed because it was split across two chunks that individually do not match well.
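To make this concrete, here is a minimal chunking sketch: sections are the semantic units, and a token overlap is carried across boundaries within long sections. The whitespace split is a stand-in for whatever tokenizer your embedding model actually uses.

```python
def chunk_sections(sections, max_tokens=500, overlap=75):
    """Chunk by section, splitting long sections with a token overlap."""
    chunks = []
    for heading, text in sections:
        tokens = text.split()  # stand-in for your embedding model's tokenizer
        start = 0
        while start < len(tokens):
            end = min(start + max_tokens, len(tokens))
            chunks.append({"heading": heading, "text": " ".join(tokens[start:end])})
            if end == len(tokens):
                break
            start = end - overlap  # repeat the trailing tokens in the next chunk
    return chunks
```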
Each chunk gets embedded independently. Use a consistent embedding model. The same model for indexing and querying. Different models produce incompatible vector spaces.
Store the embeddings in your vector database with metadata: the original text, source document identifier, section heading, timestamp, and any other attributes you might filter on later.
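A sketch of the embed-and-store step, using sentence-transformers as one example model family. The record layout is illustrative; the actual upsert call depends on your vector database.

```python
import time
from sentence_transformers import SentenceTransformer  # one example model family

model = SentenceTransformer("all-MiniLM-L6-v2")  # same model must embed queries later

def index_chunks(chunks, source_id):
    vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)
    return [
        {
            "vector": vector,
            "text": chunk["text"],        # original text, returned with results
            "source": source_id,          # document identifier for filtering
            "heading": chunk["heading"],  # section heading for filtering
            "indexed_at": time.time(),    # timestamp for freshness filters
        }
        for chunk, vector in zip(chunks, vectors)
    ]
```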
Your vector database needs an index to search efficiently. For most applications, HNSW (Hierarchical Navigable Small World) is the right choice. It provides excellent recall (finding relevant results) with fast query times.
HNSW has tuning parameters. The two that matter most: ef_construction (how thorough the index build is) and M (how many connections each node maintains). Higher values improve recall but increase memory usage and build time.
Start with the defaults. Benchmark with your actual queries. If recall is too low (relevant results are missing), increase ef_construction and M. If search is too slow, decrease them. The defaults are reasonable for datasets up to a few million vectors.
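For illustration, here is how those two parameters appear in hnswlib, assuming the records from the earlier sketch. M=16 and ef_construction=200 are the library's defaults; the query-time ef knob is the first thing to raise if recall is low.

```python
import hnswlib
import numpy as np

dim = 384  # matches the embedding model above
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=100_000, ef_construction=200, M=16)  # library defaults
index.add_items(np.array([r["vector"] for r in records]), np.arange(len(records)))

index.set_ef(50)  # query-time search depth: raise for recall, lower for speed
labels, distances = index.knn_query(query_vector, k=10)
```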
For datasets beyond a few million vectors, consider IVF (Inverted File Index) or hybrid approaches. IVF partitions the space into clusters and searches only relevant clusters. Faster at scale but requires more tuning.
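A comparable IVF sketch with faiss, assuming query_vectors is a float32 (n, dim) array. nlist and nprobe are the tuning knobs the paragraph above alludes to; the values here are illustrative.

```python
import faiss
import numpy as np

dim = 384
nlist = 1024  # number of clusters; IVF needs enough training vectors per cluster
vectors = np.asarray([r["vector"] for r in records], dtype="float32")

quantizer = faiss.IndexFlatIP(dim)  # inner product on normalized vectors = cosine
index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(vectors)  # learns cluster centroids before any vectors can be added
index.add(vectors)

index.nprobe = 16  # clusters searched per query: more = better recall, slower
scores, ids = index.search(query_vectors, 10)
```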
Here is a truth that pure vector search advocates do not like to admit: keyword search is still better for exact matches.
A user searches for "error code ERR_SSL_PROTOCOL_ERROR." Vector search might return results about SSL errors generally. Keyword search returns results containing that exact error code. For this query, keyword search wins.
A user searches for "how to fix authentication problems after updating." Vector search returns results about auth troubleshooting even if they use different terminology. Keyword search misses results that say "login issues following upgrade." For this query, vector search wins.
Hybrid search combines both. Run a vector search and a keyword search on the same query. Merge the results using reciprocal rank fusion or a learned scoring function. The hybrid consistently outperforms either approach alone.
Reciprocal rank fusion is the simple approach. Each result gets a score based on its rank in each search. Rank 1 in vector search gets a high score. Rank 1 in keyword search gets a high score. A result that ranks high in both searches gets the highest combined score. Results that rank high in only one search get moderate scores.
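In code, the whole technique is a few lines. The constant k=60 comes from the original RRF paper and works well without tuning:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc ids; higher combined score = better."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = reciprocal_rank_fusion([vector_results, keyword_results])
```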
Implement hybrid search from the start. Do not plan to add it later. The quality difference between pure vector search and hybrid search is noticeable to users on day one.
You have built your pipeline. Chunks are embedded. The index is built. Hybrid search is working. Now comes the part that separates a demo from a product: relevance optimization.
Start by building a test set. Fifty to a hundred real queries with manually labeled relevant documents. These are your ground truth. Every optimization you make gets measured against this test set.
Mean Reciprocal Rank (MRR) measures where the first relevant result appears. An MRR of 1.0 means the first result is always relevant. An MRR of 0.5 means the first relevant result appears, on average, in second position. Track this as your primary metric.
Normalized Discounted Cumulative Gain (nDCG) measures the quality of the entire results list, not just the first result. It penalizes relevant results that appear too far down the list. Track this as your secondary metric.
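Both metrics are short enough to compute yourself against the test set. A sketch with binary relevance labels, where each query maps to a set of relevant document ids:

```python
import math

def mean_reciprocal_rank(ranked, relevant):
    """ranked: query -> ordered doc ids; relevant: query -> set of relevant ids."""
    total = 0.0
    for query, results in ranked.items():
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in relevant[query]:
                total += 1.0 / rank
                break  # only the first relevant result counts
    return total / len(ranked)

def ndcg_at_k(results, relevant, k=10):
    """Binary-relevance nDCG@k for a single query's ranked results."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(results[:k], start=1)
              if doc_id in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0
```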
With metrics in place, optimize systematically. Change your chunking strategy. Re-run the test set. Did MRR improve? Change your embedding model. Re-run the test set. Did nDCG improve? Adjust your hybrid search weights. Re-run the test set. Measure everything.
Raw user queries are often suboptimal for search. Short queries lack context. Long queries contain noise. Ambiguous queries match too many things.
Query expansion adds related terms to improve recall. The user searches for "React hooks." Query expansion adds "useState, useEffect, custom hooks, React functional components." More terms mean more potential matches.
Query rewriting transforms the user's query into a better search query. An AI model rewrites "why is my page slow" into "web page performance optimization techniques." The rewritten query matches documentation better than the original.
Hypothetical Document Embeddings (HyDE) take this further. Generate a hypothetical answer to the user's query. Embed that hypothetical answer instead of the query. The hypothetical answer is more similar to actual documents than the short query is. Counterintuitive but effective.
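A HyDE sketch, reusing the embedding model and HNSW index from earlier. generate_hypothetical_answer is a placeholder, not a real API: it stands for a call to whatever LLM you use, prompted to write a plausible answer to the query.

```python
def hyde_search(query, k=10):
    # Placeholder: prompt your LLM to write a short, plausible answer to the query.
    hypothetical = generate_hypothetical_answer(query)
    # Embed the generated answer, not the query, and search with that vector.
    vector = model.encode([hypothetical], normalize_embeddings=True)
    labels, distances = index.knn_query(vector, k=k)
    return labels[0]
```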
These techniques add latency. An AI rewrite adds 200-500ms. HyDE adds similar latency. For interactive search, this may be too slow. For search where quality matters more than speed, they are worth it.
Search quality degrades silently. New content gets added that is poorly chunked. User behavior changes. The distribution of queries shifts. Without monitoring, you will not notice until users complain.
Track click-through rates on search results. If users consistently skip the first result and click the third, your ranking is off. Track zero-result queries. These represent gaps in your content or failures in your search pipeline. Track query abandonment. Users who search, see results, and leave without clicking found nothing useful.
Set up weekly reviews of the worst-performing queries. The queries with the lowest click-through rates. The queries that return no results. The queries where users reformulate multiple times. These are your optimization targets.
Do not over-engineer the first version. Chunk your content by section. Embed with a standard model. Store in a vector database. Add basic keyword search for hybrid capability. Deploy.
Test with real users. Collect real queries. Build your test set from real data. Then optimize.
The first version will be good enough to be useful. Iterations will make it great. But you cannot iterate on something that does not exist yet. Ship the basics. Improve from there.
