Evolve to "Searchable AI" with Metadata - Boost Search Accuracy with Attribute Information

Solve the challenge of finding the information you need in large datasets by attaching and using attributes such as title, author, and tags. Learn how to implement metadata filtering with Weaviate and LlamaIndex.

Katsuya Ito
CEO
7 min

Introduction

Efficiently finding the information you need in large amounts of data is a crucial challenge for modern AI systems. Simple vector search alone can understand document content, but it cannot use important attribute information such as when a document was created, who wrote it, or which category it belongs to.

This article explains methods to significantly improve AI search accuracy by utilizing metadata (attribute information), with implementation examples.

What is Metadata?

Metadata is "data about data." In the context of document search, this includes information such as:

  • Basic Information: Title, author, creation date, update date
  • Classification Information: Category, tags, department, project
  • Technical Information: File format, language, word count
  • Business Information: Importance level, access scope, approval status
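
For a single document, these four groups of attributes might be captured as one flat record. The sketch below is illustrative; the field names are assumptions, not a fixed schema:

```python
# Hypothetical metadata record for one document; field names are illustrative
document_metadata = {
    # Basic information
    "title": "Effective Project Management Techniques",
    "author": "John Smith",
    "createdAt": "2024-01-15",
    # Classification information
    "category": "Process Improvement",
    "tags": ["project management", "teamwork"],
    # Technical information
    "format": "pdf",
    "language": "en",
    # Business information
    "importance": "high",
    "accessScope": "internal",
}
```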

Limitations of Traditional Search

Vector Search Only

```python
# Simple vector search example
query = "About project management"
results = vector_db.similarity_search(query, k=10)
```

This method:

  • Finds content-related documents
  • But cannot distinguish between old and new information
  • Cannot filter to specific authors or departments
  • Doesn't consider business importance
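
A common workaround is to filter the retrieved results in application code. The sketch below (assuming each result carries a metadata dict) shows why this is fragile: anything relevant that ranked below the top-k was already discarded before the filter ran:

```python
def post_filter(results, min_date):
    """Filter already-retrieved results by publication date.
    Documents outside the top-k are lost even if they would match."""
    return [r for r in results if r["metadata"]["publishedAt"] >= min_date]

results = [
    {"content": "Old plan", "metadata": {"publishedAt": "2019-03-01"}},
    {"content": "Current roadmap", "metadata": {"publishedAt": "2024-05-10"}},
]
recent = post_filter(results, "2024-01-01")
# Only the 2024 document survives; the filter can only shrink the
# result set, never bring back documents the vector search dropped.
```

Metadata filtering applied inside the search engine avoids this by constraining candidates before ranking.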

The Power of Metadata Filtering

Implementation Example with Weaviate

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Add a document with metadata
doc_data = {
    "content": "About the basic principles of project management...",
    "title": "Effective Project Management Techniques",
    "author": "John Smith",
    "category": "Process Improvement",
    "publishedAt": "2024-01-15T00:00:00Z",
    "tags": ["project management", "teamwork", "efficiency"],
    "importance": "high",
}

client.data_object.create(doc_data, "Document")

# Search utilizing metadata
result = (
    client.query
    .get("Document", ["content", "title", "author"])
    .with_near_text({"concepts": ["project management"]})
    .with_where({
        "operator": "And",
        "operands": [
            {
                "path": ["publishedAt"],
                "operator": "GreaterThan",
                # Weaviate date values must be RFC 3339 timestamps
                "valueDate": "2023-01-01T00:00:00Z",
            },
            {
                "path": ["importance"],
                "operator": "Equal",
                "valueText": "high",
            },
        ],
    })
    .with_limit(5)
    .do()
)
```

Implementation Example with LlamaIndex

```python
from llama_index import VectorStoreIndex, Document, StorageContext
from llama_index.vector_stores import WeaviateVectorStore
from llama_index.vector_stores.types import (
    FilterOperator,
    MetadataFilter,
    MetadataFilters,
)
from llama_index.retrievers import VectorIndexRetriever
from llama_index.query_engine import RetrieverQueryEngine

# Create documents with metadata
documents = [
    Document(
        text="In project management...",
        metadata={
            "title": "Project Management Guide",
            "author": "Jane Doe",
            "category": "Management",
            "publishedAt": "2024-02-20",
            "tags": ["project", "leadership"],
        },
    )
]

# Create the index backed by Weaviate
vector_store = WeaviateVectorStore(
    weaviate_client=client,
    index_name="Documents",
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)

# Retriever with metadata filters
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="author", value="Jane Doe"),
            # Range operators require a vector store that supports them
            MetadataFilter(
                key="publishedAt", operator=FilterOperator.GTE, value="2024-01-01"
            ),
        ]
    ),
)

query_engine = RetrieverQueryEngine(retriever=retriever)
response = query_engine.query("What are efficient project management methods?")
```

Practical Metadata Design

Hierarchical Category Design

```typescript
interface DocumentMetadata {
  // Basic information
  title: string;
  author: {
    name: string;
    department: string;
    role: string;
  };

  // Time information
  createdAt: Date;
  updatedAt: Date;
  validUntil?: Date;

  // Classification information
  category: {
    primary: string;      // "Technical", "Sales", "HR"
    secondary?: string;   // "Development", "Infrastructure", "Design"
    tertiary?: string;    // "Frontend", "Backend"
  };

  // Business information
  confidentiality: "public" | "internal" | "confidential";
  importance: "low" | "medium" | "high" | "critical";
  status: "draft" | "review" | "approved" | "archived";

  // Related information
  tags: string[];
  relatedProjects?: string[];
  targetAudience?: string[];
}
```

Dynamic Metadata Extraction

```python
import re
from typing import Dict, List

class MetadataExtractor:
    def __init__(self):
        self.category_keywords = {
            "Technical": ["API", "database", "programming", "system"],
            "Sales": ["customer", "revenue", "proposal", "contract"],
            "HR": ["recruitment", "evaluation", "training", "labor"],
        }

    def extract_metadata(self, content: str, filename: str) -> Dict:
        metadata = {}

        # Extract information from the filename and content
        metadata["title"] = self._extract_title(filename, content)

        # Extract date information
        metadata["dates"] = self._extract_dates(content)

        # Classify content
        metadata["category"] = self._classify_content(content)

        # Extract important keywords
        metadata["tags"] = self._extract_keywords(content)

        return metadata

    def _extract_title(self, filename: str, content: str) -> str:
        # Use the first non-empty line as the title; fall back to the filename
        for line in content.splitlines():
            if line.strip():
                return line.strip()
        return filename.rsplit(".", 1)[0]

    def _extract_dates(self, content: str) -> List[str]:
        # Collect ISO-style dates such as 2024-01-15
        return re.findall(r"\d{4}-\d{2}-\d{2}", content)

    def _classify_content(self, content: str) -> str:
        scores = {}
        for category, keywords in self.category_keywords.items():
            score = sum(1 for keyword in keywords if keyword in content)
            scores[category] = score

        return max(scores.items(), key=lambda x: x[1])[0] if scores else "Other"

    def _extract_keywords(self, content: str) -> List[str]:
        # Simple example: extract capitalized words, deduplicated in order
        keywords = re.findall(r"[A-Z][a-zA-Z]+", content)
        return list(dict.fromkeys(keywords))[:10]  # first 10 unique keywords
```

Advanced Search Patterns

Composite Filtering

```python
from typing import Dict, List

from elasticsearch import Elasticsearch

elasticsearch_client = Elasticsearch("http://localhost:9200")

# Search combining multiple conditions
def advanced_search(query: str, filters: Dict) -> List[Dict]:
    search_params = {
        "query": {
            "bool": {
                "must": [
                    {"match": {"content": query}}
                ],
                "filter": []
            }
        }
    }

    # Date range filter
    if "date_range" in filters:
        search_params["query"]["bool"]["filter"].append({
            "range": {
                "publishedAt": {
                    "gte": filters["date_range"]["start"],
                    "lte": filters["date_range"]["end"]
                }
            }
        })

    # Author filter
    if "authors" in filters:
        search_params["query"]["bool"]["filter"].append({
            "terms": {"author": filters["authors"]}
        })

    # Category filter
    if "categories" in filters:
        search_params["query"]["bool"]["filter"].append({
            "terms": {"category": filters["categories"]}
        })

    return elasticsearch_client.search(
        index="documents",
        body=search_params
    )

# Usage example
results = advanced_search(
    query="machine learning implementation cases",
    filters={
        "date_range": {"start": "2023-01-01", "end": "2024-12-31"},
        "authors": ["John Smith", "Jane Doe"],
        "categories": ["Technical", "AI"]
    }
)
```

Weighted Search

```python
def weighted_search(query: str, weights: Dict[str, float]) -> List[Dict]:
    search_query = {
        "query": {
            "function_score": {
                "query": {"match": {"content": query}},
                "functions": [
                    {
                        "filter": {"term": {"importance": "critical"}},
                        "weight": weights.get("critical", 3.0)
                    },
                    {
                        "filter": {"term": {"importance": "high"}},
                        "weight": weights.get("high", 2.0)
                    },
                    {
                        "filter": {"range": {"publishedAt": {"gte": "2024-01-01"}}},
                        "weight": weights.get("recent", 1.5)
                    }
                ],
                "score_mode": "sum"
            }
        }
    }

    return elasticsearch_client.search(
        index="documents",
        body=search_query
    )
```

Performance Optimization

Index Design

```python
# Efficient mapping design in Elasticsearch
mapping = {
    "mappings": {
        "properties": {
            "content": {
                "type": "text",
                "analyzer": "english"
            },
            "title": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            },
            "author": {
                "type": "keyword"  # For filtering
            },
            "category": {
                "type": "keyword"
            },
            "publishedAt": {
                "type": "date"
            },
            "tags": {
                "type": "keyword"
            },
            "importance": {
                "type": "keyword"
            }
        }
    }
}
```

Caching Strategy

```python
import json
from functools import lru_cache

import redis

redis_client = redis.Redis(host='localhost', port=6379)

# lru_cache adds a fast in-process layer; Redis is shared across processes
@lru_cache(maxsize=1000)
def cached_metadata_search(query_hash: str, filters_hash: str):
    cache_key = f"search:{query_hash}:{filters_hash}"
    cached_result = redis_client.get(cache_key)

    if cached_result:
        return json.loads(cached_result)

    # Execute actual search
    result = perform_search(query_hash, filters_hash)

    # Cache result (1 hour)
    redis_client.setex(
        cache_key,
        3600,
        json.dumps(result)
    )

    return result
```
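
The cache key above assumes the query and filters have already been hashed. One way to derive stable hashes with the standard library is sketched below; the exact scheme (SHA-256 truncated to 16 hex characters) is an assumption, not part of the original design:

```python
import hashlib
import json

def stable_hash(value) -> str:
    """Hash a query string or a filters dict into a stable cache-key part.
    sort_keys ensures the same filters always produce the same hash,
    regardless of dict insertion order."""
    payload = value if isinstance(value, str) else json.dumps(value, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

query_hash = stable_hash("project progress")
filters_hash = stable_hash({"authors": ["Jane Doe"], "date_range": {"start": "2024-01-01"}})
cache_key = f"search:{query_hash}:{filters_hash}"
```

Hashing also keeps keys short and uniform even when filters contain long value lists.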

Real-world Implementation Results

Improved Search Accuracy

Results after implementing metadata filtering:

  • Improved Relevance: Over 90% of searches return expected results in top 3
  • Reduced Search Time: Average search time reduced by 60%
  • User Satisfaction: 85% of users reported "easier to find desired information"

Specific Improvement Examples

```text
[Before]
Query: "project progress"
Results: Old 2019 materials and irrelevant documents from other departments ranked high

[After]
Query: "project progress"
Filters:
- Date: 2024 and later
- Department: Development
- Importance: High
Results: Latest development department project management guide ranked high
```
Conclusion

Utilizing metadata significantly improves AI system search capabilities. Key points include:

1. Proper Metadata Design: Define attribute information aligned with business requirements

2. Efficient Implementation: Build filtering with tools such as Weaviate or LlamaIndex

3. Continuous Improvement: Review metadata based on user feedback

By combining these methods, evolution from simple "similar document search" to "truly necessary information search" becomes possible.

Tags

Metadata
Weaviate
LlamaIndex
Filtering