Natural-Language Queries for Spark: Using LangChain to Run SQL on DataFrames

Khuyen Tran

Have you ever wanted to ask a question about your PySpark data in plain English instead of writing SQL?

LangChain’s Spark SQL Toolkit enables natural language data querying by:

Translating your requests into SQL
Executing them against your Spark cluster
Returning the results in a readable format

This makes it much easier to work with large-scale data while still leveraging Spark’s powerful distributed computing capabilities.

To demonstrate, we’ll create a simple DataFrame and use LangChain’s Spark SQL tool to query it.

from pyspark.sql import SparkSession, Row

# Create sample data and DataFrame
data = [Row(name="Alice", age=30), Row(name="Bob", age=25)]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(data)
df.write.saveAsTable("people")
df.show()

This creates a table people accessible via SQL.

Next, we’ll set up the key components that enable natural language querying of our Spark data. Here are the steps:

Initialize the Spark SQL tool which provides the interface to our Spark database.
Initialize a language model.
Initialize the Spark SQL toolkit, which connects the language model with the Spark database.
Create an agent executor that combines a language model with the Spark SQL toolkit.

# Initialize Spark SQL tool
spark_sql = SparkSQL(schema="default")

# Initialize LLM
llm = ChatOpenAI(temperature=0)

# Initialize toolkit
toolkit = SparkSQLToolkit(db=spark_sql, llm=llm)

# Create agent executor
agent_executor = create_spark_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

For a hands-on guide on how to build coordinated AI agents with LangGraph, check out Building Coordinated AI Agents with LangGraph: A Hands-On Tutorial.

Now we can ask the agent to query the data.

agent_executor.run("What is the average age of people in the table?")

> Entering new AgentExecutor chain...
Action: list_tables_sql_db
Action Input:
Observation: people
Thought:I can query the "people" table for the average age.
Action: query_sql_db
Action Input: SELECT AVG(age) FROM people
Observation: [('27.5',)]
Thought:The average age of people in the table is 27.5.
Final Answer: 27.5

> Finished chain.

The answer for the average age is correct.

The output shows that the agent:

Looked up the available tables
Queried the people table for the average age
Got the result
Answered the question with the result

Let’s try another question.

agent_executor.run("Who is the oldest person in the table?")

> Entering new AgentExecutor chain...
Action: list_tables_sql_db
Action Input:
Observation: people
Thought:I should query the "people" table to find the oldest person.
Action: schema_sql_db
Action Input: people
Observation: CREATE TABLE spark_catalog.default.people (
  name STRING,
  age BIGINT)
;

/*
3 rows from people table:
name    age
Alice   30
Bob     25
*/
Thought:I should write a query to select the oldest person from the "people" table.
Action: query_sql_db
Action Input: SELECT name, age FROM people ORDER BY age DESC LIMIT 1
Observation: [('Alice', '30')]
Thought:I now know the final answer
Final Answer: Alice

> Finished chain.

The answer for the oldest person is also correct.

Related Resources

For deeper exploration of LangChain and Spark integrations:

Multi-Agent Systems: Building Coordinated AI Agents with LangGraph: A Hands-On Tutorial for advanced AI agent coordination with LangChain
Data Science Workflows: Build Data Science Workflows with DeepSeek and LangChain for comprehensive data analysis pipelines
Private AI Solutions: Private AI Workflows with LangChain and Ollama for local model deployment and privacy-focused implementations
Official Documentation: LangChain Spark SQL Tool for complete API reference and advanced configuration options

5 Essential Itertools for Data Science

July 26, 2025

Stop Writing SQL for AI Agents: Build Direct Database Access with FastMCP

July 10, 2025

Transform Any PDF into Searchable AI Data with Docling

July 1, 2025

Natural-Language Queries for Spark: Using LangChain to Run SQL on DataFrames

Table of Contents

Natural-Language Queries for Spark: Using LangChain to Run SQL on DataFrames

Khuyen Tran

Related Resources

Related Posts

Leave a Comment Cancel Reply

Stay up-to-date with
data skills using
CodeCut

Drop a line

Get in touch

Follow Us on Social Media

Natural-Language Queries for Spark: Using LangChain to Run SQL on DataFrames

Table of Contents

Natural-Language Queries for Spark: Using LangChain to Run SQL on DataFrames

Khuyen Tran

Related Resources

Related Posts

Leave a Comment Cancel Reply

Stay up-to-date with data skills using CodeCut

Drop a line

Get in touch

Follow Us on Social Media

Work with Khuyen Tran

Work with Khuyen Tran

Stay up-to-date with
data skills using
CodeCut