Flexible Structured Data Search with SQL×RAG Integration

The inability to cross-search structured data stored in databases and unstructured document-based data is a challenge faced by many enterprises. This article provides a detailed explanation of how to solve this challenge by fusing natural language DB query generation with RAG (Retrieval-Augmented Generation), along with implementation examples using LlamaIndex and LangChain.

Traditional Challenge: Data Silo Problem

Many enterprises face the following data silo problems:

•Structured Data: Sales data, customer information, inventory data stored in databases
•Unstructured Data: Documents, reports, emails, chat logs

Integrating and searching these data types has been technically challenging, requiring users to navigate multiple systems.

SQL×RAG Fusion Approach

Architecture Overview

The SQL×RAG approach processes queries through the following flow:

1. Natural Language Query Analysis: Analyze user questions to determine whether structured or unstructured data is needed

2. SQL Query Generation: When structured data is needed, LLM automatically generates SQL queries

3. RAG Search: When unstructured data is needed, retrieve documents through vector search

4. Result Integration: Combine results from both data sources to generate final answers

Implementation with LlamaIndex

python

1from llama_index import SQLDatabase, VectorStoreIndex, ServiceContext
2from llama_index.query_engine import SQLJoinQueryEngine
3from llama_index.objects import SQLTableNodeMapping, ObjectIndex
4
5# Database connection setup
6sql_database = SQLDatabase.from_uri("postgresql://user:pass@localhost/db")
7
8# Document index creation
9documents = SimpleDirectoryReader("./docs").load_data()
10vector_index = VectorStoreIndex.from_documents(documents)
11
12# SQL table metadata setup
13table_node_mapping = SQLTableNodeMapping(sql_database)
14table_schema_objs = [
15    (table_node_mapping.from_objects([sql_database.get_single_table_info("sales")]))
16]
17
18# Integrated query engine creation
19obj_index = ObjectIndex.from_objects(table_schema_objs)
20query_engine = SQLJoinQueryEngine(
21    sql_database=sql_database,
22    obj_index=obj_index,
23    vector_query_engine=vector_index.as_query_engine()
24)
25
26# Query execution example
27response = query_engine.query(
28    "Analyze the top 10 products by sales last month, including product reviews"
29)

SQL Agent Implementation with LangChain

python

1from langchain.agents import create_sql_agent
2from langchain.agents.agent_toolkits import SQLDatabaseToolkit
3from langchain.sql_database import SQLDatabase
4from langchain.llms import OpenAI
5from langchain.vectorstores import Chroma
6from langchain.embeddings import OpenAIEmbeddings
7
8class HybridSQLRAGAgent:
9    def __init__(self, db_uri: str, docs_path: str):
10        # SQL environment setup
11        self.db = SQLDatabase.from_uri(db_uri)
12        self.sql_toolkit = SQLDatabaseToolkit(db=self.db, llm=OpenAI())
13        
14        # RAG environment setup
15        embeddings = OpenAIEmbeddings()
16        self.vectorstore = Chroma.from_documents(
17            documents=self.load_documents(docs_path),
18            embedding=embeddings
19        )
20        
21        # Agent creation
22        self.agent = create_sql_agent(
23            llm=OpenAI(),
24            toolkit=self.sql_toolkit,
25            verbose=True,
26            agent_type="openai-functions"
27        )
28    
29    def query(self, question: str):
30        # First determine if SQL or RAG related
31        if self._needs_database(question):
32            sql_result = self.agent.run(question)
33            
34        if self._needs_documents(question):
35            docs = self.vectorstore.similarity_search(question, k=5)
36            doc_context = "\n".join([doc.page_content for doc in docs])
37            
38        # Integrate results and generate final answer
39        return self._synthesize_response(sql_result, doc_context, question)

Implementation Best Practices

1. Schema Design and Metadata Management

It's important to provide database schema information in a format that LLM can easily understand:

python

1# Adding table descriptions
2table_info = {
3    "sales": {
4        "description": "Sales data table",
5        "columns": {
6            "product_id": "Product ID (joins with products.id)",
7            "sale_date": "Sale datetime",
8            "amount": "Sale amount",
9            "customer_id": "Customer ID (joins with customers.id)"
10        }
11    }
12}

2. Security Measures

To prevent SQL injection attacks, implement the following measures:

python

1from sqlalchemy import text
2
3class SecureSQLGenerator:
4    def __init__(self, allowed_tables: list):
5        self.allowed_tables = allowed_tables
6        
7    def validate_query(self, sql_query: str) -> bool:
8        # Check for dangerous keywords
9        dangerous_keywords = ["DROP", "DELETE", "UPDATE", "INSERT", "ALTER"]
10        if any(keyword in sql_query.upper() for keyword in dangerous_keywords):
11            return False
12            
13        # Only allow access to permitted tables
14        referenced_tables = self.extract_table_names(sql_query)
15        return all(table in self.allowed_tables for table in referenced_tables)

3. Performance Optimization

Query optimization is crucial for large-scale databases:

python

1class OptimizedSQLRAG:
2    def __init__(self):
3        self.query_cache = {}
4        self.embedding_cache = {}
5        
6    def cached_query(self, sql_query: str):
7        if sql_query in self.query_cache:
8            return self.query_cache[sql_query]
9            
10        result = self.execute_sql(sql_query)
11        self.query_cache[sql_query] = result
12        return result

Practical Use Cases

1. Sales Analysis Reports

"Create an analysis report for the top 5 products by sales this quarter, along with customer feedback"

This question involves:

•SQL: Extract top 5 products by sales this quarter
•RAG: Search customer feedback documents related to products
•Integration: Generate analysis report combining quantitative and qualitative data

2. Customer Support

"We're seeing increased inquiries about returns for Product A. Can you explain the specific reasons and solutions?"

This question involves:

•SQL: Product A return data and trend analysis
•RAG: Search related information from product specifications, FAQs, and past support history

Conclusion

The SQL×RAG integration approach enables cross-search and analysis of structured and unstructured data. While it can be implemented relatively easily using LlamaIndex and LangChain, careful attention to security and performance is essential.

This approach becomes a powerful tool for supporting data-driven decision making and is expected to provide significant value in enterprise DX initiatives.

Flexible Structured Data Search with SQL×RAG Integration

Table of Contents

Flexible Structured Data Search with SQL×RAG Integration

Traditional Challenge: Data Silo Problem

SQL×RAG Fusion Approach

Architecture Overview

Implementation with LlamaIndex

SQL Agent Implementation with LangChain

Implementation Best Practices

1. Schema Design and Metadata Management

2. Security Measures

3. Performance Optimization

Practical Use Cases

1. Sales Analysis Reports

2. Customer Support

Conclusion

Tags