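Index Strategies That Change Retrieval Accuracy: The State of the Art

Dense vs. Sparse vs. Hybrid: An In-Depth Comparison

In a RAG system, the indexing strategy is one of the main factors that determines retrieval accuracy. This article takes a practical look at traditional keyword search (sparse), semantic search (dense), and hybrid search, which combines the two.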

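Why Do Retrieval Accuracy Problems Occur?

Many RAG systems run into the following problems:

1. Limits of exact matching: keyword search cannot cope with variation in wording

2. Ambiguity of semantic search: vector search tends to miss specific proper nouns

3. Lack of context: a single retrieval method cannot take the surrounding context into account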

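Types of Index Strategies and Their Characteristics

1. Sparse Retrieval (BM25)

Characteristics: keyword matching, statistical term weighting

Strengths: proper nouns, technical terms, exact string matches

Weaknesses: synonyms, variation in wording, semantic similarity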

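python

from janome.tokenizer import Tokenizer
from rank_bm25 import BM25Okapi

# The original used jieba, which segments Chinese. Janome is swapped in here
# as one pure-Python option for Japanese (MeCab or SudachiPy would also work).
tokenizer = Tokenizer()

def tokenize(text):
    # wakati=True yields surface forms only (word segmentation)
    return list(tokenizer.tokenize(text, wakati=True))

def create_bm25_index(documents):
    # documents: list of raw text strings
    tokenized_docs = [tokenize(doc) for doc in documents]
    return BM25Okapi(tokenized_docs)

# Build the index, then score every document against the query
bm25 = create_bm25_index(documents)
query = "機械学習 精度向上"
scores = bm25.get_scores(tokenize(query))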
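To make the "statistical term weighting" concrete, the following is a minimal pure-Python sketch of BM25 scoring. It assumes the common non-negative IDF variant and the parameter values `k1=1.5` and `b=0.75`; `rank_bm25` differs in some details, so treat this as an illustration of the idea rather than a reimplementation of the library. The function name `bm25_scores` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Score every document against the query with a BM25-style formula."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in tokenized_docs for term in set(doc))

    scores = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms receive a larger inverse-document-frequency weight
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates, and longer documents are penalized
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy, already-tokenized documents (hypothetical data)
docs = [
    ["machine", "learning", "accuracy"],
    ["deep", "learning", "models"],
    ["cooking", "recipes"],
]
scores = bm25_scores(["machine", "learning"], docs)
```

The first document matches both query terms and scores highest, the second matches only the more common term, and the third scores zero because it shares no terms with the query. That last case is exactly the weakness listed above: a document that uses a synonym gets no credit.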

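2. Dense Retrieval (Vector Search)

Characteristics: semantic similarity, contextual understanding

Strengths: synonym search, semantic similarity, multilingual support

Weaknesses: proper nouns, numbers, exact string matches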

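python

import faiss
from sentence_transformers import SentenceTransformer

# Model name as given in the original article; confirm it is available on the
# Hugging Face Hub before use, and substitute a Japanese-capable model if not.
model = SentenceTransformer('BAAI/bge-large-ja')
# documents: list of text strings. Normalize so that inner product equals cosine similarity.
embeddings = model.encode(documents, normalize_embeddings=True)

# Build the FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # exact inner-product search
index.add(embeddings.astype('float32'))

# Run the search
query_embedding = model.encode([query], normalize_embeddings=True)
k = 5
scores, indices = index.search(query_embedding.astype('float32'), k)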
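An inner-product index only behaves like cosine similarity when the vectors are unit length. The sketch below shows the effect with plain Python lists; the helper names and the two-dimensional toy vectors are assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k_inner_product(query_vec, doc_vecs, k=2):
    """Return (doc_index, score) pairs ranked by inner product of normalized vectors."""
    q = l2_normalize(query_vec)
    scored = []
    for i, vec in enumerate(doc_vecs):
        d = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), i))
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:k]]

# Document 1 has a large magnitude but points in a different direction
doc_vecs = [[1.0, 0.0], [10.0, 10.0], [0.0, 3.0]]
query_vec = [2.0, 0.1]

hits = top_k_inner_product(query_vec, doc_vecs, k=2)
# Raw inner products, without normalization, for comparison
raw = [sum(a * b for a, b in zip(query_vec, vec)) for vec in doc_vecs]
```

With normalization, document 0 (nearly the same direction as the query) ranks first. On raw inner products, document 1 wins purely because its vector is longer, which is usually not what a semantic search is meant to reward.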

3. Hybrid Retrieval

Characteristics: combines sparse and dense retrieval

Advantage: draws on the strengths of both methods

Implementation: weighted score fusion

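python

# Import paths differ across LangChain releases; these follow the
# langchain / langchain-community split and may need adjusting.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS

# Build the two retrievers
# documents: list of LangChain Document objects
# embeddings: a LangChain Embeddings instance (not a NumPy array)
bm25_retriever = BM25Retriever.from_documents(documents)
vector_store = FAISS.from_documents(documents, embeddings)
vector_retriever = vector_store.as_retriever()

# Ensemble retriever
hybrid_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, vector_retriever],
    weights=[0.4, 0.6]  # BM25: 40%, vector: 60%
)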
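For readers who want to see weighted score fusion without a framework, here is a minimal sketch. It assumes each retriever returns a `{doc_id: score}` dictionary and uses min-max normalization to put BM25 and cosine scores on a comparable scale; both choices are assumptions for illustration. As far as I know, LangChain's `EnsembleRetriever` fuses by rank rather than by raw score, so this is not a description of its internals.

```python
def min_max_normalize(scores):
    """Rescale scores to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def weighted_fusion(sparse_scores, dense_scores, sparse_weight=0.4, dense_weight=0.6):
    """Combine normalized sparse and dense scores with fixed weights."""
    sparse = min_max_normalize(sparse_scores)
    dense = min_max_normalize(dense_scores)
    fused = {}
    for doc_id in set(sparse) | set(dense):
        # A document missing from one result list contributes 0 for that list
        fused[doc_id] = (sparse_weight * sparse.get(doc_id, 0.0)
                         + dense_weight * dense.get(doc_id, 0.0))
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores from a BM25 retriever and a vector retriever
sparse_scores = {"doc_a": 12.0, "doc_b": 3.0, "doc_c": 0.0}
dense_scores = {"doc_b": 0.82, "doc_c": 0.80, "doc_d": 0.40}

ranking = weighted_fusion(sparse_scores, dense_scores)
order = [doc_id for doc_id, _ in ranking]
```

Here `doc_b` comes out on top because it is the only document that both retrievers rate reasonably well, which is the behavior hybrid retrieval is meant to provide.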

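Vector Database Comparison

FAISS (Facebook AI Similarity Search)

Characteristics:

• Pros: fast, wide choice of index types, free
• Cons: hard to distribute across machines, limited metadata support
• Best suited for: single-machine deployments, prototypes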

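python

import faiss

# dimension: embedding size; it must be divisible by the number of sub-quantizers (8 here)
# IVFPQ (memory-efficient): 100 inverted lists, 8 sub-quantizers, 8 bits each
quantizer = faiss.IndexFlatL2(dimension)
ivfpq_index = faiss.IndexIVFPQ(quantizer, dimension, 100, 8, 8)
# IVFPQ must be trained on sample vectors (ivfpq_index.train(...)) before add()

# HNSW (speed-oriented): 32 graph neighbors per node, no training required
hnsw_index = faiss.IndexHNSWFlat(dimension, 32)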
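The "memory-efficient" label on IVFPQ can be checked with simple arithmetic. The sketch below compares the storage for raw float32 vectors with the PQ codes produced by 8 sub-quantizers at 8 bits each, as in the snippet above. The dimension of 1024 and the corpus size of one million are assumed values, and the estimate ignores codebooks, vector IDs, and inverted-list overhead.

```python
def flat_bytes_per_vector(dimension):
    """Uncompressed storage: one 4-byte float32 per dimension."""
    return dimension * 4

def pq_bytes_per_vector(m, nbits):
    """PQ code size: m sub-quantizers, nbits each, rounded up to whole bytes."""
    return (m * nbits + 7) // 8

dimension = 1024       # assumed embedding size
n_vectors = 1_000_000  # assumed corpus size

flat_total = flat_bytes_per_vector(dimension) * n_vectors
pq_total = pq_bytes_per_vector(8, 8) * n_vectors
ratio = flat_total / pq_total
```

Under these assumptions the raw vectors need about 4.1 GB while the PQ codes need about 8 MB. The price is approximation error in the distances, so recall should be measured before adopting such aggressive compression.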

Milvus

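Characteristics:

• Pros: distributed processing, scalability, rich set of index types
• Cons: complex setup, high resource consumption
• Best suited for: large-scale data, production environments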

python

from pymilvus import connections, Collection

# Connect to Milvus
connections.connect("default", host="localhost", port="19530")

# Open an existing collection (this does not create one) and load it for search
collection = Collection("rag_collection")
collection.load()

# Run the search
search_params = {"metric_type": "IP", "params": {"nprobe": 10}}
results = collection.search(
    data=query_embedding.tolist(),  # list of query vectors
    anns_field="embeddings",
    param=search_params,
    limit=5
)

Pinecone

Characteristics:

• Pros: managed service, easy setup
• Cons: cloud dependency, cost
• Best suited for: rapid development, reducing operational load

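python

from pinecone import Pinecone

# Initialize the client. The original called pinecone.init(api_key=..., environment=...),
# which belongs to the legacy client and was removed in later releases.
pc = Pinecone(api_key="your-api-key")

# Open the index
index = pc.Index("rag-index")

# Upsert vectors as (id, values, metadata) tuples
index.upsert(vectors=[
    (doc_id, vector.tolist(), meta)
    for doc_id, vector, meta in zip(ids, embeddings, metadata)
])

# Run the search (query_embedding has shape (1, dimension), so take row 0)
results = index.query(
    vector=query_embedding[0].tolist(),
    top_k=5,
    include_metadata=True
)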

Weaviate

Characteristics:

• Pros: GraphQL API, flexible schema, AI integrations
• Cons: learning curve, complex performance tuning
• Best suited for: complex schemas, multimodal data

python

import weaviate

# Connect to Weaviate (v3 Python client API, as in the original)
client = weaviate.Client("http://localhost:8080")

# Run the search; query_embedding must be a flat list of floats
result = (
    client.query.get("Document", ["content", "title"])
    .with_near_vector({"vector": query_embedding})
    .with_limit(5)
    .do()
)

Advanced Index Optimization

Hierarchical Index

A hierarchical approach that exploits document structure:

python

class HierarchicalIndex:
    def __init__(self, embed_fn):
        # embed_fn: callable mapping text to an embedding vector
        # (hypothetical hook; the original left the embedding methods undefined)
        self.embed_fn = embed_fn
        self.document_index = {}   # document level
        self.section_index = {}    # section level
        self.paragraph_index = {}  # paragraph level (not populated in this excerpt)

    def build_hierarchy(self, documents):
        # Each document is assumed to expose .id, .text, and .sections,
        # and each section .id and .text
        for doc in documents:
            # Document-level index
            self.document_index[doc.id] = self.embed_fn(doc.text)

            # Section-level index
            for section in doc.sections:
                self.section_index[section.id] = self.embed_fn(section.text)
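The class above only builds the indexes. One plausible way to query them is coarse-to-fine: rank documents first, then rank only the sections that belong to the best documents. The sketch below assumes a `section_parent` mapping from section ID to document ID, which the original class does not store, and uses a plain dot product on toy vectors; all names here are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def hierarchical_search(query_vec, document_index, section_index,
                        section_parent, top_docs=1, top_sections=2):
    """Two-stage search: pick the best documents, then their best sections."""
    # Stage 1: rank documents by similarity to the query
    ranked_docs = sorted(document_index,
                         key=lambda doc_id: dot(query_vec, document_index[doc_id]),
                         reverse=True)
    selected = set(ranked_docs[:top_docs])

    # Stage 2: rank only the sections whose parent document was selected
    candidates = [sec_id for sec_id, parent in section_parent.items()
                  if parent in selected]
    ranked_sections = sorted(candidates,
                             key=lambda sec_id: dot(query_vec, section_index[sec_id]),
                             reverse=True)
    return ranked_sections[:top_sections]

# Toy two-dimensional embeddings (hypothetical data)
document_index = {"doc1": [1.0, 0.0], "doc2": [0.0, 1.0]}
section_index = {
    "doc1-s1": [0.9, 0.1],
    "doc1-s2": [0.6, 0.4],
    "doc2-s1": [0.1, 0.9],
}
section_parent = {"doc1-s1": "doc1", "doc1-s2": "doc1", "doc2-s1": "doc2"}

result = hierarchical_search([1.0, 0.2], document_index, section_index, section_parent)
```

The second stage compares the query against far fewer vectors than a flat search over all sections would. The trade-off is that a relevant section inside a document rejected in the first stage can never be returned.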

Time-Aware Index

Weighting results by recency:

python

import datetime
import math

def time_weighted_search(query_embedding, results, decay_factor=0.1):
    # query_embedding is kept in the signature but is not used here.
    # Each result is assumed to expose .timestamp (datetime) and .score (float).
    current_time = datetime.datetime.now()
    weighted_results = []

    for result in results:
        # Apply exponential decay by age in days
        time_diff = (current_time - result.timestamp).days
        time_weight = math.exp(-decay_factor * time_diff)

        final_score = result.score * time_weight
        weighted_results.append((result, final_score))

    return sorted(weighted_results, key=lambda x: x[1], reverse=True)
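The decay factor is easier to reason about as a half-life, the age at which a result's weight drops to 0.5. The following sketch converts between the two; the helper names are made up for this example, and the 365-day half-life is only an illustrative value.

```python
import math

def decay_factor_from_half_life(half_life_days):
    """Decay factor such that the weight halves every half_life_days."""
    return math.log(2) / half_life_days

def time_weight(age_days, decay_factor):
    """Exponential recency weight, the same formula as in the function above."""
    return math.exp(-decay_factor * age_days)

# Half-life implied by the default decay_factor=0.1: roughly one week
default_half_life = math.log(2) / 0.1

# Weight of a one-year-old result under the default: effectively zero
one_year_default = time_weight(365, 0.1)

# Weight of the same result with an assumed one-year half-life: 0.5
one_year_slow = time_weight(365, decay_factor_from_half_life(365))
```

With the default of 0.1 per day, anything older than a few weeks is effectively removed from the ranking. For material that stays relevant for years, a much smaller factor derived from a long half-life is likely a better starting point.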

Optimizing the Retrieval Strategy

Reciprocal Rank Fusion (RRF)

Fusing multiple result lists effectively:

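python

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked doc_id lists, best first
    # k: smoothing constant that dampens the influence of top ranks
    fused_scores = {}

    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            if doc_id not in fused_scores:
                fused_scores[doc_id] = 0
            # Each list contributes 1 / (k + rank) for the document
            fused_scores[doc_id] += 1 / (k + rank)

    # (doc_id, score) pairs, highest fused score first
    return sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)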
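A small worked example shows how RRF behaves when two retrievers disagree. The compact `rrf` helper below applies the same formula in self-contained form; the document IDs and the two rankings are invented for illustration.

```python
from collections import defaultdict

def rrf(rankings, k=60):
    """Sum 1 / (k + rank) over every ranking in which a document appears."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, 1):
            fused[doc_id] += 1 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Hypothetical outputs of a BM25 retriever and a dense retriever
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = rrf([bm25_ranking, dense_ranking])
order = [doc_id for doc_id, _ in fused]
# doc_b: 1/62 + 1/61, doc_a: 1/61 + 1/63, doc_d: 1/62, doc_c: 1/63
```

Because only ranks are used, the BM25 and cosine scores never need to be put on the same scale. Documents that appear in both lists (`doc_b`, `doc_a`) end up ahead of those found by only one retriever.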

Query Expansion

Improving retrieval accuracy by expanding the query:

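python

from transformers import pipeline

# Query expansion with a text-generation model
def expand_query(query, model_name="rinna/japanese-gpt-neox-3.6b"):
    # Loading the pipeline is expensive; in practice, create it once and reuse it
    generator = pipeline("text-generation", model=model_name)

    # Japanese prompt meaning: Keywords related to the question "{query}":
    prompt = f"質問「{query}」に関連するキーワード:"
    # max_new_tokens limits only the generated part (max_length also counts the prompt)
    expanded = generator(prompt, max_new_tokens=50, num_return_sequences=1)

    # generated_text includes the prompt, so strip it off
    return expanded[0]['generated_text'][len(prompt):].strip()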
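Generating expansions with a language model is one option. A lighter-weight alternative, swapped in here purely for comparison, is to expand from a hand-maintained synonym dictionary. The sketch below assumes the query is already tokenized; the function name and the dictionary contents are hypothetical.

```python
def expand_with_synonyms(query_tokens, synonym_dict):
    """Append known synonyms after each query token, without duplicates."""
    expanded = []
    seen = set()
    for token in query_tokens:
        for term in [token, *synonym_dict.get(token, [])]:
            if term not in seen:
                seen.add(term)
                expanded.append(term)
    return expanded

# Hypothetical domain dictionary
synonyms = {
    "ml": ["machine learning"],
    "accuracy": ["precision", "performance"],
}

expanded_terms = expand_with_synonyms(["ml", "accuracy", "improvement"], synonyms)
```

A dictionary is deterministic and cheap to run, but it only covers terms someone has entered. A generative model covers more ground at the cost of latency and occasional off-topic keywords.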

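Case Studies at INDX

Law Firm C

Challenge: insufficient accuracy in case-law search (precision of 45%)

Solution:

• BM25 (for statute text and case numbers) + dense retrieval (for content search)
• Hierarchical index (law → article → item)
• Time-based weighting (prioritizing newer precedents)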

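Implementation code:

python

# Legal-domain hybrid retriever.
# The three components are hypothetical and injected by the caller; each
# retriever's search() is assumed to return a list of (doc_id, score) pairs.
class LegalHybridRetriever:
    def __init__(self, statute_bm25, case_dense, temporal_weights):
        self.statute_bm25 = statute_bm25          # for statutes
        self.case_dense = case_dense              # for case law
        self.temporal_weights = temporal_weights

    def search(self, query):
        statute_results = self.statute_bm25.search(query)
        case_results = self.case_dense.search(query)

        # Apply time-based weighting
        weighted_results = self.temporal_weights.apply(
            statute_results + case_results
        )

        return self.fuse_results(weighted_results)

    def fuse_results(self, results):
        # Placeholder fusion: sort by score, highest first
        return sorted(results, key=lambda pair: pair[1], reverse=True)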

Result: precision 45% → 78%; user satisfaction reached 90%
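The case study does not say how precision was measured. One common definition is precision at k: the share of the top k retrieved items that a reviewer judged relevant. The sketch below assumes that definition, and the case IDs are invented.

```python
def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    # Divides by the number of items actually returned (at most k)
    return hits / len(top_k)

# Hypothetical search result and relevance judgments
retrieved = ["case_1", "case_2", "case_3", "case_4", "case_5"]
relevant = {"case_1", "case_3", "case_4", "case_9"}

p_at_5 = precision_at_k(retrieved, relevant, k=5)
```

Averaging this value over a fixed set of test queries, before and after an index change, gives a figure comparable to the percentages quoted above.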

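Manufacturer D

Challenge: multilingual search over technical specifications

Solution:

• Multilingual dense retrieval (Japanese, English, Chinese)
• Combination with a technical-term dictionary
• Domain-specific fine-tuning

Result: search time 30 s → 3 s; accuracy 60% → 85%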

Performance Optimization

Index Compression

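python

import faiss

# Product Quantization (PQ): 8 sub-quantizers x 8 bits = 8 bytes per vector
pq_index = faiss.IndexPQ(dimension, 8, 8)

# Scalar Quantization (SQ): 8 bits (1 byte) per dimension
sq_index = faiss.IndexScalarQuantizer(dimension, faiss.ScalarQuantizer.QT_8bit)

# Both index types must be trained (index.train(vectors)) before vectors are added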
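To show what 8-bit scalar quantization does to a vector, here is a simplified pure-Python sketch. It assumes a single fixed value range of [-1, 1] for all dimensions, whereas FAISS learns the ranges from training data, so this illustrates the principle and not FAISS's implementation. The function names are hypothetical.

```python
def sq8_encode(vec, lo, hi):
    """Map each float in [lo, hi] to one byte (0..255)."""
    scale = (hi - lo) / 255
    return bytes(min(255, max(0, round((x - lo) / scale))) for x in vec)

def sq8_decode(codes, lo, hi):
    """Map each byte back to an approximate float."""
    scale = (hi - lo) / 255
    return [lo + code * scale for code in codes]

vec = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes = sq8_encode(vec, -1.0, 1.0)
restored = sq8_decode(codes, -1.0, 1.0)

# The worst-case rounding error is half of one quantization step
max_error = max(abs(a - b) for a, b in zip(vec, restored))
half_step = (1.0 - (-1.0)) / 255 / 2
```

Each value shrinks from 4 bytes to 1 byte, a 4x reduction, while the reconstruction error stays within half a quantization step (about 0.004 for this range).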

Parallel Search

python

import concurrent.futures

def parallel_search(queries, retrievers):
    # Pairs queries[i] with retrievers[i] and runs the searches concurrently;
    # results come back in the same order as the inputs
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
        futures = [
            executor.submit(retriever.search, query)
            for query, retriever in zip(queries, retrievers)
        ]
        results = [future.result() for future in futures]
    return results
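In a hybrid setup the more common need is to send one query to several retrievers at once. The sketch below shows that fan-out pattern with stand-in retriever functions; `fan_out_search`, `fake_bm25`, and `fake_dense` are hypothetical names used only for this example.

```python
import concurrent.futures

def fan_out_search(query, retrievers, max_workers=4):
    """Run one query against every retriever concurrently; return {name: results}."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            name: executor.submit(search_fn, query)
            for name, search_fn in retrievers.items()
        }
        # result() blocks until the corresponding search has finished
        return {name: future.result() for name, future in futures.items()}

# Stand-in retrievers that return fixed results
def fake_bm25(query):
    return [f"bm25 hit for {query}"]

def fake_dense(query):
    return [f"dense hit for {query}"]

results = fan_out_search("index strategy", {"bm25": fake_bm25, "dense": fake_dense})
```

Threads help here mainly when the retrievers wait on I/O, such as calls to a remote vector database. For CPU-bound scoring in pure Python, the speed-up is usually limited.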

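Summary

Index strategy is one of the factors that most affects whether a RAG system succeeds. Rather than relying on a single method, choose among sparse, dense, and hybrid retrieval according to the use case, and combine them with more advanced techniques such as hierarchical and time-aware indexing. Together, these can improve accuracy substantially. At INDX, we design and implement index strategies tailored to each client's data characteristics and requirements.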
