1. Introduction
In the world of data, insights are only as accessible as the language used to retrieve them. For domain experts, analysts or everyday users, asking a database a simple question often hits a fundamental barrier: the requirement to speak SQL (Structured Query Language). This technical translation layer slows down workflows and limits who can interact directly with critical information.
What if you can eliminate that barrier? Effortlessly asking your database a question in plain English and instantly receiving the corresponding SQL query.
User query: "Show me all customers from United Kingdom"
SQL command: SELECT * FROM customers WHERE state = 'United Kingdom';
This is the promise of Natural Language to SQL translation, a dynamic and challenging field at the intersection of Natural Language Processing (NLP) and database management.
In this post, I'm diving into a hands-on project that does just that: transforming casual user requests into executable SQL commands, using a Jupyter Notebook that leverages SpaCy, a powerful and widely adopted NLP library, to analyze and structure user input. Below are the key features I have addressed on the aforementioned notebook:
- Accept user inputs in natural language.
- Extract important entities such as tables, fields, conditions, etc.
- Generate SQL commands based on parsed input.
2. Creating Database
For this experiment, I used SQLite as the database because it's lightweight and perfect for prototyping. Additionally, I used a public Online Retail dataset. Below are the steps I followed to create and populate my database:2.1 Downloading dataset
Downloaded the dataset from a public repository and loaded it into a pandas dataframe.
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
file_path = "Online_Retail.xlsx"
urllib.request.urlretrieve(data_url, file_path)
df = pd.read_excel(file_path)
df.head(5) # Viewing the first 5 rows in table format
2.2 Database connection
Established connection to the database server and created the table structure.
conn = sqlite3.connect('retail_data.db')
cursor = conn.cursor()
cursor.execute('''
CREATE TABLE IF NOT EXISTS retail (
InvoiceNo TEXT,
StockCode TEXT,
Description TEXT,
Quantity INTEGER,
InvoiceDate TEXT,
UnitPrice REAL,
CustomerID TEXT,
Country TEXT
)
''')
Inserted the data into the table and verified the insertion by executing a simple SQL command: "SELECT * FROM retail LIMIT 5"
df.to_sql('retail', conn, if_exists='replace', index=False)
cursor.execute("SELECT * FROM retail LIMIT 5")
rows = cursor.fetchall()
for row in rows:
print(row)
conn.close()
3. Text extraction using NLP
To demonstrate text extraction using NLP, consider this query: "Show me all customers from United Kingdom". Using SpaCy's English model , the sentence is processed to extract key components. The analysis identifies "Show" as the main action, "customers" as the target data and "from" as establishing a location relationship. These extracted components: action, target and filtering condition, provide all the necessary information to construct an SQL query.
Below is an example of a code I executed for this scenario and the corresponding results:
# Loading a model from spacy
nlp = spacy.load('en_core_web_sm')
# Defining and processing the user_query
user_query = "Show me all customers from United Kingdom"
doc = nlp(user_query)
# Extracting the entities and keywords
for token in doc:
print(f"Token: {token.text}, POS: {token.pos_}, Dependency: {token.dep_}")
>
Token: Show, POS: VERB, Dependency: ROOT
Token: me, POS: PRON, Dependency: dative
Token: all, POS: DET, Dependency: det
Token: customers, POS: NOUN, Dependency: dobj
Token: from, POS: ADP, Dependency: prep
Token: United, POS: PROPN, Dependency: compound
Token: Kingdom, POS: PROPN, Dependency: pobj
4. Pipeline
Below is a function that constructs and returns an SQL command based on a user query with optional conditions. The function analyzes natural language input and automatically generates the appropriate SELECT statement with WHERE clauses.The function operates in two main stages.
4.1 Identifies key informtion
It uses named entity recognition to identify key information: geographical locations become country filters (GPE), numbers set quantity conditions (CARDINAL), person names match customer IDs (PERSON) and dates filter temporal data (DATE). Special handling distinguishes between invoice numbers and actual dates.
4.2 Analyse relationships
It analyzes token relationships to determine which columns to select and what conditions to apply. It identifies adjective and noun pairs for column selection, maps common terms to database column names through a field mapping dictionary and creates conditions based on grammatical dependencies like prepositional objects.
Function code
The function then assembles the complete SQL query, defaulting to SELECT * if no specific columns are identified. All conditions are combined with AND operators in the WHERE clause. For demonstration purposes, the LIMIT is preconfigured to 5.
def generate_sql_query(user_query, table):
nlp = spacy.load('en_core_web_sm')
doc = nlp(user_query)
table = table
columns = []
conditions = []
field_mapping = {
"customer": "CustomerID",
"date": "Date",
"description": "Description",
"item": "Quantity",
"invoice": "InvoiceNo"
}
# Extracting entities and keywords as conditions
for ent in doc.ents:
if ent.label_ == "GPE": # Geographical location
conditions.append(f"Country = '{ent.text}'")
elif ent.label_ == "CARDINAL": # Extracts quantity
conditions.append(f"Quantity >= {ent.text}")
elif ent.label_ == "PERSON": # Extracts customerID
conditions.append(f"CustomerID = '{ent.text}'")
elif ent.label_ == "DATE": # Extracts dates, but since spaCy recognises InvoiceNo as Date, so it is handled separately
if "InvoiceNo" in user_query or ent.text.isdigit():
conditions.append(f"InvoiceNo = '{ent.text}'")
else:
conditions.append(f"Date = '{ent.text}'")
# Extracting tokens and checking whether the token is an adjective describing/modifying a noun
for token in doc:
if token.dep_ == "amod" and token.head.pos_ == "NOUN":
columns.append(token.head.text)
elif token.text.lower() in field_mapping:
columns.append(field_mapping[token.text.lower()])
elif token.dep_ == "nummod" and token.head.text.lower() in field_mapping:
conditions.append(f"{field_mapping[token.head.text.lower()]} = {token.text}")
elif token.dep_ == "pobj" and token.head.text.lower() == "description":
conditions.append(f"Description LIKE '%{token.text}%'")
elif token.dep_ == "pobj" and token.head.text.lower() == "customer":
conditions.append(f"CustomerID = '{token.text}'")
# Constructing the SQL query
columns_str = ', '.join(columns) if columns else '*'
query = f"SELECT {columns_str} FROM {table}"
if conditions:
query += " WHERE " + " AND ".join(conditions)
query += " LIMIT 5" # Setting the limit to 5, for demo purposes
return query
Testing
For testing purposes, I implemented a database connection function to execute the generated queries:
def execute_sql_query(user_query):
sql_query = generate_sql_query(user_query)
print(f"Generated SQL Query: {sql_query}\n")
# Connect to the SQLite database
conn = sqlite3.connect('retail_data.db')
cursor = conn.cursor()
# Execute the generated SQL query
try:
cursor.execute(sql_query)
rows = cursor.fetchall()
for row in rows:
print(row)
except sqlite3.Error as e:
print(f"An error occurred: {e}")
# Close the connection
conn.close()
Test 1
user_query_1 = "Show customers from United Kingdom who bought 50 or more items"
execute_sql_query(user_query_1)
>
Generated SQL Query: SELECT * FROM retail WHERE Country = 'United Kingdom' AND Quantity >= 50 LIMIT 5
('536371', '22086', "PAPER CHAIN KIT 50'S CHRISTMAS ", 80, '2010-12-01 09:00:00', 2.55, 13748.0, 'United Kingdom')
('536376', '21733', 'RED HANGING HEART T-LIGHT HOLDER', 64, '2010-12-01 09:32:00', 2.55, 15291.0, 'United Kingdom')
('536378', '21212', 'PACK OF 72 RETROSPOT CAKE CASES', 120, '2010-12-01 09:37:00', 0.42, 14688.0, 'United Kingdom')
('536378', '85071B', 'RED CHARLIE+LOLA PERSONAL DOORSIGN', 96, '2010-12-01 09:37:00', 0.38, 14688.0, 'United Kingdom')
('536382', '22381', 'TOY TIDY PINK POLKADOT', 50, '2010-12-01 09:45:00', 1.85, 16098.0, 'United Kingdom')
Test 2
user_query_2 = "Show the description where InvoiceNo is 536365"
execute_sql_query(user_query_2)
>
Generated SQL Query: SELECT Description FROM retail WHERE InvoiceNo = '536365' LIMIT 5
('WHITE HANGING HEART T-LIGHT HOLDER',)
('WHITE METAL LANTERN',)
('CREAM CUPID HEARTS COAT HANGER',)
('KNITTED UNION FLAG HOT WATER BOTTLE',)
('RED WOOLLY HOTTIE WHITE HEART.',)
Conclusion
Using a pretrained spaCy model for an NLP task, such as generating SQL queries from user input works well for basic tasks like identifying countries. However, as shown in this example:
user_query = "Show items on invoice 536365 "
-----------
>
Token: Show, POS: VERB, Dependency: ROOT
Token: items, POS: NOUN, Dependency: dobj
Token: on, POS: ADP, Dependency: prep
Token: invoice, POS: NOUN, Dependency: pobj
Token: 536365, POS: NUM, Dependency: nummod
Entity: 536365, Label: DATE
There are limitations. In this case, the model mistakenly treated the InvoiceNo as a DATE entity instead of a CARDINAL. This happens because spaCy struggles to understand the context in complex queries and can misinterpret numbers. As such, using a pretrained spaCy model for such use-cases is not efficient.
You can also find an end-to-end demo of deploying this work on an EC2 instance.