Flexible Structured Data Search with SQL×RAG Integration
Solve the inability to cross-search databases and documents through natural language DB queries + RAG fusion. Practical explanation of LlamaIndex/LangChain SQL agent implementation.
Table of Contents
Flexible Structured Data Search with SQL×RAG Integration
The inability to cross-search structured data stored in databases and unstructured document-based data is a challenge faced by many enterprises. This article provides a detailed explanation of how to solve this challenge by fusing natural language DB query generation with RAG (Retrieval-Augmented Generation), along with implementation examples using LlamaIndex and LangChain.
Traditional Challenge: Data Silo Problem
Many enterprises face the following data silo problems:
- •Structured Data: Sales data, customer information, inventory data stored in databases
- •Unstructured Data: Documents, reports, emails, chat logs
Integrating and searching these data types has been technically challenging, requiring users to navigate multiple systems.
SQL×RAG Fusion Approach
Architecture Overview
The SQL×RAG approach processes queries through the following flow:
1. Natural Language Query Analysis: Analyze user questions to determine whether structured or unstructured data is needed
2. SQL Query Generation: When structured data is needed, LLM automatically generates SQL queries
3. RAG Search: When unstructured data is needed, retrieve documents through vector search
4. Result Integration: Combine results from both data sources to generate final answers
Implementation with LlamaIndex
1from llama_index import SQLDatabase, VectorStoreIndex, ServiceContext
2from llama_index.query_engine import SQLJoinQueryEngine
3from llama_index.objects import SQLTableNodeMapping, ObjectIndex
4
5# Database connection setup
6sql_database = SQLDatabase.from_uri("postgresql://user:pass@localhost/db")
7
8# Document index creation
9documents = SimpleDirectoryReader("./docs").load_data()
10vector_index = VectorStoreIndex.from_documents(documents)
11
12# SQL table metadata setup
13table_node_mapping = SQLTableNodeMapping(sql_database)
14table_schema_objs = [
15 (table_node_mapping.from_objects([sql_database.get_single_table_info("sales")]))
16]
17
18# Integrated query engine creation
19obj_index = ObjectIndex.from_objects(table_schema_objs)
20query_engine = SQLJoinQueryEngine(
21 sql_database=sql_database,
22 obj_index=obj_index,
23 vector_query_engine=vector_index.as_query_engine()
24)
25
26# Query execution example
27response = query_engine.query(
28 "Analyze the top 10 products by sales last month, including product reviews"
29)
SQL Agent Implementation with LangChain
1from langchain.agents import create_sql_agent
2from langchain.agents.agent_toolkits import SQLDatabaseToolkit
3from langchain.sql_database import SQLDatabase
4from langchain.llms import OpenAI
5from langchain.vectorstores import Chroma
6from langchain.embeddings import OpenAIEmbeddings
7
8class HybridSQLRAGAgent:
9 def __init__(self, db_uri: str, docs_path: str):
10 # SQL environment setup
11 self.db = SQLDatabase.from_uri(db_uri)
12 self.sql_toolkit = SQLDatabaseToolkit(db=self.db, llm=OpenAI())
13
14 # RAG environment setup
15 embeddings = OpenAIEmbeddings()
16 self.vectorstore = Chroma.from_documents(
17 documents=self.load_documents(docs_path),
18 embedding=embeddings
19 )
20
21 # Agent creation
22 self.agent = create_sql_agent(
23 llm=OpenAI(),
24 toolkit=self.sql_toolkit,
25 verbose=True,
26 agent_type="openai-functions"
27 )
28
29 def query(self, question: str):
30 # First determine if SQL or RAG related
31 if self._needs_database(question):
32 sql_result = self.agent.run(question)
33
34 if self._needs_documents(question):
35 docs = self.vectorstore.similarity_search(question, k=5)
36 doc_context = "\n".join([doc.page_content for doc in docs])
37
38 # Integrate results and generate final answer
39 return self._synthesize_response(sql_result, doc_context, question)
Implementation Best Practices
1. Schema Design and Metadata Management
It's important to provide database schema information in a format that LLM can easily understand:
1# Adding table descriptions
2table_info = {
3 "sales": {
4 "description": "Sales data table",
5 "columns": {
6 "product_id": "Product ID (joins with products.id)",
7 "sale_date": "Sale datetime",
8 "amount": "Sale amount",
9 "customer_id": "Customer ID (joins with customers.id)"
10 }
11 }
12}
2. Security Measures
To prevent SQL injection attacks, implement the following measures:
1from sqlalchemy import text
2
3class SecureSQLGenerator:
4 def __init__(self, allowed_tables: list):
5 self.allowed_tables = allowed_tables
6
7 def validate_query(self, sql_query: str) -> bool:
8 # Check for dangerous keywords
9 dangerous_keywords = ["DROP", "DELETE", "UPDATE", "INSERT", "ALTER"]
10 if any(keyword in sql_query.upper() for keyword in dangerous_keywords):
11 return False
12
13 # Only allow access to permitted tables
14 referenced_tables = self.extract_table_names(sql_query)
15 return all(table in self.allowed_tables for table in referenced_tables)
3. Performance Optimization
Query optimization is crucial for large-scale databases:
1class OptimizedSQLRAG:
2 def __init__(self):
3 self.query_cache = {}
4 self.embedding_cache = {}
5
6 def cached_query(self, sql_query: str):
7 if sql_query in self.query_cache:
8 return self.query_cache[sql_query]
9
10 result = self.execute_sql(sql_query)
11 self.query_cache[sql_query] = result
12 return result
Practical Use Cases
1. Sales Analysis Reports
"Create an analysis report for the top 5 products by sales this quarter, along with customer feedback"
This question involves:
- •SQL: Extract top 5 products by sales this quarter
- •RAG: Search customer feedback documents related to products
- •Integration: Generate analysis report combining quantitative and qualitative data
2. Customer Support
"We're seeing increased inquiries about returns for Product A. Can you explain the specific reasons and solutions?"
This question involves:
- •SQL: Product A return data and trend analysis
- •RAG: Search related information from product specifications, FAQs, and past support history
Conclusion
The SQL×RAG integration approach enables cross-search and analysis of structured and unstructured data. While it can be implemented relatively easily using LlamaIndex and LangChain, careful attention to security and performance is essential.
This approach becomes a powerful tool for supporting data-driven decision making and is expected to provide significant value in enterprise DX initiatives.