Transform Search Accuracy! The Frontline of Index Strategy - Dense/Sparse/Hybrid Comparison

Solve search omissions and noise through strategic use of FAISS, Milvus, Pinecone, and Weaviate. Learn combination strategies for BM25, SPLADE, and DPR.

Katsuya Ito
CEO
12 min

In RAG systems, the index strategy largely determines search accuracy. This article compares traditional keyword search (sparse), semantic vector search (dense), and hybrid search that combines the two, with practical implementation notes for each.

Why Do Search Accuracy Problems Occur?

Many RAG systems experience the following issues:

1. Limitations of Exact Matching: Keyword search misses documents that express the same idea in different words

2. Ambiguity in Semantic Search: Vector search tends to miss specific proper nouns

3. Lack of Context: A single search method cannot take the context of a query into account

Types and Characteristics of Index Strategies

1. Sparse Search (BM25)

Features: Keyword matching, statistical weighting

Strengths: Proper nouns, technical terms, exact string matching

Weaknesses: Synonyms, expression variations, semantic similarity

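python
from rank_bm25 import BM25Okapi
import jieba

# BM25 index over pre-segmented text.
# Note: jieba is a Chinese word segmenter. For Japanese text, swap in a
# morphological analyzer such as MeCab or Janome at the two jieba.cut calls.
def create_bm25_index(documents):
    # Word segmentation: BM25Okapi expects a list of token lists
    tokenized_docs = [list(jieba.cut(doc)) for doc in documents]
    return BM25Okapi(tokenized_docs)

# `documents` is a list of raw text strings
bm25 = create_bm25_index(documents)

# Execute search: one score per document, in corpus order
query = "machine learning accuracy improvement"
tokenized_query = list(jieba.cut(query))
scores = bm25.get_scores(tokenized_query)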
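
To make the "statistical weighting" concrete, here is a minimal from-scratch sketch of BM25 scoring. It uses the Lucene-style IDF formula, which is an assumption on our part: libraries such as rank_bm25 use slightly different IDF variants, so absolute scores will not match. The function name `bm25_scores` and the toy corpus are illustrative only.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with Lucene-style BM25."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            # Rare terms get a larger IDF weight
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Length normalization: b controls how strongly long documents are penalized
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy corpus, already tokenized by whitespace
docs = [
    "faiss builds vector indexes".split(),
    "bm25 ranks documents by keyword overlap".split(),
    "hybrid search combines bm25 and vector search".split(),
]
scores = bm25_scores("bm25 keyword".split(), docs)
```

The second document matches both query terms and scores highest; the first shares no term with the query and scores zero, which is exactly the exact-match behavior that makes BM25 strong on proper nouns and weak on synonyms.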

2. Dense Search (Vector Search)

Features: Semantic similarity, contextual understanding

Strengths: Synonym search, semantic similarity, multilingual support

Weaknesses: Proper nouns, numerical values, exact string matching

python
import faiss
from sentence_transformers import SentenceTransformer

# Dense search implementation.
# The model name is an example: use any embedding model that covers your
# languages (BAAI/bge-m3 is multilingual).
model = SentenceTransformer('BAAI/bge-m3')
# Normalize so that inner product equals cosine similarity
embeddings = model.encode(documents, normalize_embeddings=True)

# Build FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # Inner product
index.add(embeddings.astype('float32'))

# Execute search
query_embedding = model.encode([query], normalize_embeddings=True)
k = 5
scores, indices = index.search(query_embedding.astype('float32'), k)
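
An inner-product index only ranks by cosine similarity when the vectors have unit length. The following standard-library sketch illustrates why; the two-dimensional vectors and helper names are made up for the example.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

query_vec = [3.0, 4.0]
doc_vec = [6.0, 8.0]  # same direction as the query, twice the length

raw_ip = inner_product(query_vec, doc_vec)  # grows with vector length
normalized_ip = inner_product(l2_normalize(query_vec), l2_normalize(doc_vec))
cosine = cosine_similarity(query_vec, doc_vec)
```

Here `raw_ip` is 50.0, while `normalized_ip` and `cosine` both come out at 1.0 (up to floating-point rounding). Without normalization, longer vectors would be favored regardless of direction.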

3. Hybrid Search

Features: Combination of Sparse and Dense

Benefits: Leverages strengths of both methods

Implementation: Weighted score fusion

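python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# Import paths vary by LangChain version; these follow the
# langchain / langchain_community split.
# `documents` must be a list of LangChain Document objects, and FAISS needs
# an embedding model object rather than precomputed vectors.
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-m3")

# Build hybrid retriever
bm25_retriever = BM25Retriever.from_documents(documents)
vector_store = FAISS.from_documents(documents, embedding_model)
vector_retriever = vector_store.as_retriever()

# Ensemble retriever
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # BM25: 40%, Vector: 60%
)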
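
For readers who want to see weighted score fusion without a framework, here is a minimal sketch. BM25 scores and cosine similarities live on different scales, so this version min-max normalizes each score set before applying the 40/60 weights. The normalization choice, function names, and sample scores are assumptions for illustration. To the best of our knowledge, LangChain's EnsembleRetriever fuses ranks (weighted Reciprocal Rank Fusion) rather than raw scores, so its output can differ from this sketch.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}

def weighted_fusion(sparse_scores, dense_scores, sparse_weight=0.4, dense_weight=0.6):
    """Combine two score sets; a document missing from one set contributes 0 there."""
    sparse_norm = min_max_normalize(sparse_scores)
    dense_norm = min_max_normalize(dense_scores)
    fused = {}
    for doc_id in set(sparse_norm) | set(dense_norm):
        fused[doc_id] = (
            sparse_weight * sparse_norm.get(doc_id, 0.0)
            + dense_weight * dense_norm.get(doc_id, 0.0)
        )
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores from a BM25 retriever and a vector retriever
keyword_scores = {"doc_a": 12.0, "doc_b": 4.0, "doc_c": 0.0}
vector_scores = {"doc_a": 0.62, "doc_b": 0.80, "doc_d": 0.50}
ranking = weighted_fusion(keyword_scores, vector_scores)
```

In this example `doc_b` ranks first: it is the best dense match and still has some keyword support, while `doc_a` wins on keywords alone and lands second.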

Vector Database Comparison

FAISS (Facebook AI Similarity Search)

Features:

  • Pros: High speed, diverse index types, free
  • Cons: Difficult distributed processing, limited metadata features
  • Use Cases: Single machine, prototypes
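
python
import faiss

# IVFPQ (memory efficiency focused): 100 inverted lists, 8 sub-quantizers
# of 8 bits each. `dimension` must be divisible by 8, and the index must be
# trained on sample vectors (ivfpq_index.train(...)) before add().
quantizer = faiss.IndexFlatL2(dimension)
ivfpq_index = faiss.IndexIVFPQ(quantizer, dimension, 100, 8, 8)

# HNSW (speed focused): 32 graph neighbors per node, no training needed
hnsw_index = faiss.IndexHNSWFlat(dimension, 32)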

Milvus

Features:

  • Pros: Distributed processing, scalability, rich indexes
  • Cons: Complex setup, high resource consumption
  • Use Cases: Large-scale data, production environments

python
from pymilvus import connections, Collection

# Milvus connection
connections.connect("default", host="localhost", port="19530")

# Open an existing collection (creating one requires a schema) and load it
collection = Collection("rag_collection")
collection.load()

# Execute search: `data` is a list of query vectors
search_params = {"metric_type": "IP", "params": {"nprobe": 10}}
results = collection.search(
    data=query_embedding.tolist(),
    anns_field="embeddings",
    param=search_params,
    limit=5
)

Pinecone

Features:

  • Pros: Managed service, easy setup
  • Cons: Cloud dependency, cost
  • Use Cases: Rapid development, reduced operational burden

python
from pinecone import Pinecone

# Initialize Pinecone (client v3 and later; pinecone.init() was removed)
pc = Pinecone(api_key="your-api-key")

# Index operations
index = pc.Index("rag-index")

# Insert vectors as (id, values, metadata) tuples
index.upsert(vectors=[
    (doc_id, vector.tolist(), meta)
    for doc_id, vector, meta in zip(ids, embeddings, metadata)
])

# Execute search with a single query vector
results = index.query(
    vector=query_embedding[0].tolist(),
    top_k=5,
    include_metadata=True
)

Weaviate

Features:

  • Pros: GraphQL API, schema flexibility, AI integration
  • Cons: Learning curve, complex performance tuning
  • Use Cases: Complex schemas, multimodal

python
import weaviate

# Weaviate connection (v3 Python client API)
client = weaviate.Client("http://localhost:8080")

# Execute search with a single query vector
result = (
    client.query.get("Document", ["content", "title"])
    .with_near_vector({"vector": query_embedding[0].tolist()})
    .with_limit(5)
    .do()
)

Advanced Index Optimization

Hierarchical Indexing

Hierarchical approach utilizing document structure:

python
class HierarchicalIndex:
    def __init__(self, embed_fn):
        # embed_fn: callable mapping a text string to an embedding vector
        self.embed_fn = embed_fn
        self.document_index = {}   # Document level
        self.section_index = {}    # Section level
        self.paragraph_index = {}  # Paragraph level

    def build_hierarchy(self, documents):
        # Each doc is assumed to expose .id, .text and .sections;
        # each section .id, .text and .paragraphs; each paragraph .id and .text
        for doc in documents:
            # Document level index
            self.document_index[doc.id] = self.embed_fn(doc.text)

            # Section level
            for section in doc.sections:
                self.section_index[section.id] = self.embed_fn(section.text)

                # Paragraph level
                for paragraph in section.paragraphs:
                    self.paragraph_index[paragraph.id] = self.embed_fn(paragraph.text)
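
One way to query such a hierarchy is coarse-to-fine: rank documents first, then rank only the sections that belong to the best documents. The sketch below assumes each index is a plain dict of ID to vector and that a `section_parent` mapping (hypothetical, not part of the class above) links each section to its document. Vectors are tiny hand-written examples.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def coarse_to_fine_search(query_vec, document_index, section_index, section_parent,
                          top_docs=2, top_sections=3):
    """Rank documents first, then rank only the sections of the best documents."""
    best_docs = sorted(
        document_index,
        key=lambda doc_id: dot(query_vec, document_index[doc_id]),
        reverse=True,
    )[:top_docs]
    # Only sections whose parent document survived the coarse step are scored
    candidates = [sid for sid in section_index if section_parent[sid] in best_docs]
    ranked = sorted(
        candidates,
        key=lambda sid: dot(query_vec, section_index[sid]),
        reverse=True,
    )
    return ranked[:top_sections]

document_index = {"law_1": [1.0, 0.0], "law_2": [0.0, 1.0], "law_3": [0.7, 0.7]}
section_index = {
    "law_1/art_1": [0.9, 0.1],
    "law_1/art_2": [0.6, 0.4],
    "law_2/art_1": [0.1, 0.9],
    "law_3/art_1": [0.8, 0.6],
}
section_parent = {sid: sid.split("/")[0] for sid in section_index}

hits = coarse_to_fine_search([1.0, 0.0], document_index, section_index, section_parent)
```

The coarse step discards `law_2` entirely, so its sections are never scored. That saves work on large corpora, at the risk of missing a relevant section inside a document that scored poorly as a whole.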

Time-Aware Indexing

Considering temporal weighting:

python
import datetime
import math

def time_weighted_search(query_embedding, results, decay_factor=0.1):
    # Each result is assumed to expose .score and .timestamp (a naive datetime).
    # query_embedding is kept for interface compatibility and is not used here.
    current_time = datetime.datetime.now()
    weighted_results = []

    for result in results:
        # Apply exponential temporal decay based on age in days
        time_diff = (current_time - result.timestamp).days
        time_weight = math.exp(-decay_factor * time_diff)

        final_score = result.score * time_weight
        weighted_results.append((result, final_score))

    return sorted(weighted_results, key=lambda x: x[1], reverse=True)
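
The decay factor is easier to choose when expressed as a half-life, the age at which a document's weight drops to 50%. A short sketch of the conversion follows, with helper names of our own choosing.

```python
import math

def decay_factor_from_half_life(half_life_days):
    """Decay factor for which exp(-decay_factor * days) halves every half_life_days."""
    return math.log(2) / half_life_days

def time_weight(age_days, decay_factor):
    return math.exp(-decay_factor * age_days)

# A decay factor of 0.1 per day corresponds to a half-life of about 6.9 days
default_half_life = math.log(2) / 0.1

# After 30 days the weight is exp(-3), roughly 0.05
weight_after_30_days = time_weight(30, 0.1)

# For slowly aging content, derive the factor from the desired half-life instead
yearly_factor = decay_factor_from_half_life(365)
```

A factor of 0.1 per day is therefore quite aggressive: a month-old document keeps only about 5% of its score. For case law, where older precedents often stay relevant, a half-life measured in years may be a more sensible starting point.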

Search Strategy Optimization

Reciprocal Rank Fusion (RRF)

Effectively fusing multiple search results:

python
def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked lists of document IDs, best first
    fused_scores = {}

    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
            fused_scores[doc_id] += 1 / (k + rank)

    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
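
Written as a formula, the RRF score of a document $d$ over a set of rankings $R$ is the following, where $\mathrm{rank}_r(d)$ is the 1-based position of $d$ in ranking $r$ and documents absent from a ranking contribute nothing. The worked numbers assume $k = 60$.

```latex
\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}

% Worked example with k = 60 and two rankings:
% ranked 1st in one list, absent from the other:
\frac{1}{60 + 1} \approx 0.0164
% ranked 3rd in both lists:
\frac{1}{60 + 3} + \frac{1}{60 + 3} = \frac{2}{63} \approx 0.0317
```

A document that both retrievers agree on, even at a modest rank, outscores a document that only one retriever puts at the top. Because only ranks are used, no score normalization between BM25 and vector similarity is needed.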

Query Expansion

Improving search accuracy through query expansion:

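python
from transformers import pipeline

# Query expansion implementation
def expand_query(query, model_name="rinna/japanese-gpt-neox-3.6b"):
    # Loading the model is slow; in practice, create the pipeline once and reuse it
    generator = pipeline("text-generation", model=model_name)

    prompt = f"Keywords related to question '{query}':"
    # max_new_tokens bounds the generated part only; return_full_text=False
    # drops the prompt from the output
    expanded = generator(
        prompt, max_new_tokens=50, num_return_sequences=1, return_full_text=False
    )

    return expanded[0]['generated_text']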

INDX Practical Case Studies

Law Firm C

Challenge: Poor case law search accuracy (45% precision)

Solution:

  • BM25 (for statutes & case numbers) + Dense (for content search)
  • Hierarchical indexing (law → article → clause)
  • Temporal weighting (prioritizing recent cases)

Implementation:

python
# Legal-specialized index (structural sketch: the retrievers, the temporal
# weighting and the fusion function are injected components)
class LegalHybridRetriever:
    def __init__(self, statute_bm25, case_dense, temporal_weights, fuse_results):
        self.statute_bm25 = statute_bm25          # For statutes
        self.case_dense = case_dense              # For cases
        self.temporal_weights = temporal_weights  # Recency weighting
        self.fuse_results = fuse_results          # e.g. reciprocal rank fusion

    def search(self, query):
        statute_results = self.statute_bm25.search(query)
        case_results = self.case_dense.search(query)

        # Apply temporal weighting
        weighted_results = self.temporal_weights.apply(
            statute_results + case_results
        )

        return self.fuse_results(weighted_results)

Results: Precision 45% → 78%, User satisfaction 90%
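
The case study does not say how precision was measured. A common choice for retrieval is precision@k averaged over a set of test queries, sketched below. The function names, case IDs, and relevance judgments are invented for illustration and are not data from the project.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k

def mean_precision_at_k(runs, k):
    """Average precision@k over (retrieved_ids, relevant_ids) pairs."""
    return sum(precision_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

# Two hypothetical test queries with hand-labeled relevant cases
runs = [
    (["case_12", "case_7", "case_3", "case_9"], {"case_12", "case_3"}),
    (["case_5", "case_1", "case_8", "case_2"], {"case_5", "case_1", "case_8"}),
]
score = mean_precision_at_k(runs, k=4)
```

Here the two queries score 0.5 and 0.75, giving a mean of 0.625. Tracking this number on a fixed query set before and after each index change keeps comparisons like "45% → 78%" reproducible.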

Manufacturing Company D

Challenge: Multilingual technical specification search

Solution:

  • Multilingual Dense search (Japanese/English/Chinese support)
  • Integration with technical terminology dictionary
  • Domain-specific fine-tuning

Results: Search time 30s → 3s, Accuracy 60% → 85%

Performance Optimization

Index Compression

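python
import faiss

# Product Quantization (PQ): 8 sub-quantizers x 8 bits = 8 bytes per vector.
# `dimension` must be divisible by 8; train the index before adding vectors.
pq_index = faiss.IndexPQ(dimension, 8, 8)

# Scalar Quantization (SQ): 8 bits per dimension, so `dimension` bytes per vector
sq_index = faiss.IndexScalarQuantizer(dimension, faiss.ScalarQuantizer.QT_8bit)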
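
To see what these settings buy, it helps to estimate the raw storage per index type. The sketch below counts only the stored vector codes and ignores IDs, codebooks, and graph or inverted-list overhead. The corpus size of one million vectors and the dimension of 1024 are assumptions chosen for the example.

```python
def index_size_bytes(num_vectors, dimension, kind, pq_subvectors=8, pq_bits=8):
    """Approximate storage for the vector codes only."""
    if kind == "flat":    # float32: 4 bytes per dimension
        per_vector = dimension * 4
    elif kind == "sq8":   # 8-bit scalar quantization: 1 byte per dimension
        per_vector = dimension
    elif kind == "pq":    # pq_subvectors codes of pq_bits bits each
        per_vector = pq_subvectors * pq_bits // 8
    else:
        raise ValueError(f"unknown index kind: {kind}")
    return num_vectors * per_vector

n, d = 1_000_000, 1024
sizes = {kind: index_size_bytes(n, d, kind) for kind in ("flat", "sq8", "pq")}
```

For these numbers the flat index needs about 4.1 GB, SQ8 about 1.0 GB, and PQ with 8 one-byte codes about 8 MB. The PQ figure is a 512x reduction, which comes at a real cost in recall, so compressed indexes are usually paired with re-ranking on the original vectors.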

Parallel Processing Optimization

python
import concurrent.futures

def parallel_search(queries, retrievers):
    # Pairs queries[i] with retrievers[i] and runs the searches concurrently;
    # results come back in the same order as the inputs
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(retriever.search, query)
            for query, retriever in zip(queries, retrievers)
        ]
        results = [future.result() for future in futures]
    return results
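
In a hybrid setup the more common pattern is the reverse: send one query to several retrievers at once, then fuse the ranked lists. The sketch below shows that fan-out with stand-in retrievers; `fan_out_search`, `fake_bm25`, and `fake_dense` are hypothetical names, and real retrievers would query a BM25 index and a vector index.

```python
import concurrent.futures

def fan_out_search(query, retrievers, max_workers=4):
    """Run one query against every retriever concurrently.

    `retrievers` maps a name to a callable that takes the query and
    returns a ranked list of document IDs.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            name: executor.submit(search, query)
            for name, search in retrievers.items()
        }
        # result() blocks until each search finishes and re-raises its exceptions
        return {name: future.result() for name, future in futures.items()}

# Stand-in retrievers for illustration only
def fake_bm25(query):
    return ["doc_a", "doc_b"]

def fake_dense(query):
    return ["doc_b", "doc_c"]

results = fan_out_search("contract termination", {"bm25": fake_bm25, "dense": fake_dense})
```

Threads suit this case because retriever calls mostly wait on I/O (network round-trips to a vector database) or run in native code; total latency then approaches that of the slowest retriever instead of the sum of all of them.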

Conclusion

Index strategy is one of the main factors behind a RAG system's success. Rather than relying on a single method, choose between Sparse, Dense, and Hybrid approaches according to the use case, and combine them with advanced techniques such as hierarchical indexing and temporal weighting to achieve significant accuracy improvements. At INDX, we design and implement index strategies tailored to each client's data characteristics and requirements.
