Setting Up Natural Language Queries for On-Chain Data Analytics
Learn how to query blockchain data using plain English instead of complex SQL or GraphQL, enabling faster insights for developers and analysts.
Natural language query (NLQ) systems translate human questions into structured queries that a blockchain indexer can understand. Instead of writing SELECT * FROM ethereum.transactions WHERE block_number > 15000000, you can ask "Show me the latest high-value transactions on Ethereum." This is powered by large language models (LLMs) that parse intent and map it to on-chain data schemas. Services like Chainscore, The Graph's Natural Language API, and Flipside Crypto's AI Assistant provide these interfaces, abstracting the underlying complexity of subgraphs or raw RPC calls.
Setting up NLQ typically involves connecting to an API endpoint with your query and receiving structured JSON data. For example, using Chainscore's API, you might send a POST request with a prompt like "Get the top 10 NFT collections by volume on Polygon last week." The system identifies the required data points—collection addresses, sale volumes, timestamps—and executes the corresponding query against its indexed dataset. You need an API key from the provider and basic knowledge of making HTTP requests from your application or a script.
For developers, integrating NLQ starts with choosing a provider based on supported chains, data freshness, and cost. Most offer a free tier for testing. A basic integration in Python using the requests library looks like this:
```python
import requests

api_key = 'YOUR_API_KEY'
url = 'https://api.chainscore.io/v1/nl-query'
headers = {'Authorization': f'Bearer {api_key}'}
data = {'query': 'What is the current TVL of Aave v3 on Arbitrum?'}

response = requests.post(url, json=data, headers=headers)
print(response.json())
```
The response will contain the Total Value Locked figure and often additional context like the underlying query used.
Key considerations when using NLQ systems include query specificity and cost management. Vague questions may return ambiguous results, so refining prompts (e.g., "TVL in USD" vs. "TVL") improves accuracy. Since each query consumes API credits, it's efficient to batch related questions. Furthermore, always verify critical data points, as LLMs can occasionally "hallucinate" or misinterpret niche terminology. For production use, implement caching for frequent queries to reduce latency and cost.
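As a sketch of that caching step, the snippet below wraps the REST call from the earlier example in a small in-memory cache with a time-to-live; the endpoint and payload shape follow that example, and the five-minute TTL is an arbitrary illustration.

```python
import time
import requests

API_URL = 'https://api.chainscore.io/v1/nl-query'
CACHE_TTL_SECONDS = 300  # illustrative: cache answers for 5 minutes
_cache = {}  # question -> (expiry_timestamp, response_json)

def cached_nl_query(question, api_key):
    """Serve repeat questions from memory; otherwise hit the NLQ endpoint."""
    now = time.time()
    hit = _cache.get(question)
    if hit and hit[0] > now:
        return hit[1]
    response = requests.post(
        API_URL,
        json={'query': question},
        headers={'Authorization': f'Bearer {api_key}'},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    _cache[question] = (now + CACHE_TTL_SECONDS, data)
    return data
```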
The primary use cases are exploratory data analysis, dashboard prototyping, and automated reporting. An analyst can quickly investigate wallet activity or protocol metrics without writing SQL. A developer can embed NLQ features into an application to let users ask questions about their portfolio. This technology significantly lowers the barrier to on-chain analytics, but it complements rather than replaces deep, custom querying for complex, high-frequency data needs.
Prerequisites and Setup
Before you can query on-chain data with natural language, you need to configure your development environment and understand the core components involved.
Natural language querying for blockchain data requires a specific technical stack. You will need a working knowledge of JavaScript/TypeScript and Node.js (v18 or later). Familiarity with Ethereum concepts like smart contracts, addresses, and common data standards (ERC-20, ERC-721) is essential. For the initial setup, ensure you have npm or yarn installed to manage dependencies. This guide uses the Chainscore API as the primary interface, which translates English questions into structured queries for networks like Ethereum, Arbitrum, and Base.
The first step is to obtain your API credentials. Navigate to the Chainscore Developer Portal and create an account to generate an API key. This key authenticates your requests and manages rate limits. Store it securely using environment variables. Create a .env file in your project root and add your key: CHAINSCORE_API_KEY=your_key_here. You can then install the necessary Node.js client library using npm install @chainscore/chainscore.
With the library installed, you can initialize the client in your application. Import the Chainscore class and instantiate it with your API key. The client is your gateway to submitting queries. For example, a basic initialization script looks like this:
```javascript
import { Chainscore } from '@chainscore/chainscore';

const chainscore = new Chainscore({
  apiKey: process.env.CHAINSCORE_API_KEY,
});
```
This client object will be used to call methods like chainscore.query("Your question here") to fetch on-chain data.
Your queries will target specific blockchain networks. You must configure the chain ID for your data source. Common IDs include 1 for Ethereum Mainnet and 42161 for Arbitrum One. The Chainscore API supports multiple chains, allowing you to analyze data across ecosystems. When formulating a question, clarity is key. Instead of "show me some swaps," ask "What were the top 5 Uniswap V3 swaps by USD volume on Arbitrum in the last 24 hours?" This specificity helps the engine return accurate, actionable results.
Finally, consider the output format. The API returns data in structured JSON, which you can integrate into dashboards, trading bots, or research tools. You may want to set up a simple Express server or a script to periodically fetch and process this data. Understanding these prerequisites—the development environment, API access, client setup, and query formulation—ensures a smooth start to building powerful on-chain analytics with natural language.
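If you prefer a plain script over an Express server, a minimal polling loop against the REST endpoint from the overview (rather than the Node client above) might look like the following; the question and ten-minute interval are placeholders.

```python
import os
import time
import requests

API_URL = 'https://api.chainscore.io/v1/nl-query'
API_KEY = os.environ['CHAINSCORE_API_KEY']
QUESTION = 'What is the current TVL of Aave v3 on Arbitrum?'

def fetch_answer():
    response = requests.post(
        API_URL,
        json={'query': QUESTION},
        headers={'Authorization': f'Bearer {API_KEY}'},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == '__main__':
    while True:
        try:
            print(time.strftime('%Y-%m-%dT%H:%M:%S'), fetch_answer())
        except requests.RequestException as err:
            print('NLQ refresh failed:', err)
        time.sleep(600)  # poll every 10 minutes
```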
Core System Components
Tools and frameworks that translate plain English questions into executable queries for blockchain data, enabling analytics without SQL.
This guide explains the architectural components and data flow required to build a system that translates natural language questions into executable queries for on-chain data.
A natural language query (NLQ) system for blockchain data requires a multi-layered architecture to bridge the gap between human language and structured on-chain information. The core components are: a user interface for input, a natural language processing (NLP) engine to interpret intent, a query translation layer that maps concepts to on-chain data schemas, and a data indexing and retrieval layer that executes the query against a processed dataset. This architecture abstracts the complexity of raw blockchain data, allowing users to ask questions like "What was the total volume on Uniswap V3 yesterday?" without writing SQL or interacting with low-level RPC calls.
The data flow begins when a user submits a question. The NLP engine, typically powered by a large language model (LLM) like GPT-4 or an open-source alternative, performs intent classification and named entity recognition. It identifies key entities (e.g., "Uniswap V3", "volume", "yesterday") and the user's goal (e.g., aggregate a metric). The output is a structured intermediate representation, often in JSON, that defines the what, where, and when of the query. For reliability, this stage may use few-shot prompting with examples of successful translations to guide the model.
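For illustration, the intermediate representation for "What was the total volume on Uniswap V3 yesterday?" might look like the dictionary below; the field names and resolved time window are an example schema, not a standard.

```python
# Example output of the NLP stage for "What was the total volume on Uniswap V3 yesterday?"
# Field names and the resolved time window are illustrative.
parsed_intent = {
    "action": "aggregate",          # the user's goal
    "metric": "volume_usd",         # what to measure
    "protocol": "uniswap-v3",       # where to measure it
    "chain": "ethereum",
    "timeframe": {                  # "yesterday", resolved to an explicit window
        "start": "2024-01-14T00:00:00Z",
        "end": "2024-01-15T00:00:00Z",
    },
}
```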
The query translation layer is the critical bridge. It takes the structured intent and maps it to a specific data schema and query language. If your indexed data is in a PostgreSQL data warehouse, this layer generates SQL. If you're querying a subgraph on The Graph, it constructs a GraphQL query. This requires a schema definition that the system understands—a metadata layer describing available tables, fields, and their relationships to blockchain concepts (e.g., mapping 'swap' to a dex.trades table). Tools like LangChain or LlamaIndex are often used here to create this abstraction and manage context.
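A minimal sketch of that mapping, assuming the intermediate representation above and a Dune-style dex.trades table (table and column names are illustrative):

```python
# Metadata layer: map logical metrics to warehouse tables and columns.
SCHEMA_MAP = {
    "volume_usd": {
        "table": "dex.trades",
        "column": "amount_usd",
        "protocol_column": "project",
    },
}

def intent_to_sql(intent):
    """Translate a parsed intent into SQL using the metadata layer."""
    meta = SCHEMA_MAP[intent["metric"]]
    return (
        f"SELECT SUM({meta['column']}) AS total_volume_usd\n"
        f"FROM {meta['table']}\n"
        f"WHERE {meta['protocol_column']} = '{intent['protocol']}'\n"
        f"  AND block_time >= TIMESTAMP '{intent['timeframe']['start']}'\n"
        f"  AND block_time <  TIMESTAMP '{intent['timeframe']['end']}'"
    )

# e.g. print(intent_to_sql(parsed_intent)) with the example intent above;
# adjust the timestamp literals to your warehouse's SQL dialect.
```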
Finally, the generated query executes against a pre-indexed data layer. Querying raw Ethereum blocks via an RPC is too slow for analytics. Instead, systems rely on indexed data providers like The Graph, Dune Analytics, Covalent, or custom pipelines using Apache Spark or ClickHouse. The result is returned, often formatted by the LLM into a human-readable answer (e.g., "The total volume was $125.4M"). For production systems, implementing a caching layer for frequent queries and query validation to prevent malformed or expensive operations is essential for performance and cost control.
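A simple version of that validation step, which rejects anything other than bounded read-only statements (the keyword list and row cap are illustrative), is sketched below.

```python
import re

FORBIDDEN = re.compile(r"\b(insert|update|delete|drop|alter|grant|truncate)\b", re.IGNORECASE)

def validate_generated_sql(sql, max_rows=10_000):
    """Allow only read-only statements and cap result size before execution."""
    statement = sql.strip().rstrip(';')
    if not statement.lower().startswith('select'):
        raise ValueError('Only SELECT statements are allowed')
    if FORBIDDEN.search(statement):
        raise ValueError('Write or DDL keywords are not permitted')
    # Append a LIMIT if the model did not include one, to bound query cost.
    if not re.search(r"\blimit\s+\d+\b", statement, re.IGNORECASE):
        statement = f"{statement} LIMIT {max_rows}"
    return statement
```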
When implementing this architecture, key decisions include choosing between a general-purpose LLM API (easier setup, ongoing cost) and a fine-tuned open-source model (more control, higher initial effort), and selecting your data index. For Ethereum, using Dune's SQL engine (DuneSQL) or The Graph's subgraphs provides robust, community-vetted schemas. A practical first step is to prototype using the OpenAI API with LangChain's SQL Agent, connecting it to a public Dune API endpoint or a mirrored dataset to handle translation and execution automatically and demonstrate the end-to-end flow.
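As a starting point for that prototype, the sketch below wires LangChain's SQL agent to a mirrored on-chain dataset in Postgres; the connection string is a placeholder, and import paths can shift between LangChain releases.

```python
from langchain_openai import ChatOpenAI
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent

# Placeholder connection string for a mirrored on-chain dataset.
db = SQLDatabase.from_uri("postgresql+psycopg2://user:password@localhost:5432/onchain")
llm = ChatOpenAI(model="gpt-4", temperature=0)

# The agent inspects the schema, drafts SQL, executes it, and summarizes the result.
agent = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)

result = agent.invoke({"input": "What was the total Uniswap V3 swap volume in USD yesterday?"})
print(result["output"])
```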
The Graph vs. Dune Analytics for Query Generation
Key differences between the two primary platforms for querying and analyzing blockchain data.
| Feature | The Graph | Dune Analytics |
|---|---|---|
| Core Architecture | Decentralized indexing protocol with subgraphs | Centralized analytics platform with community queries |
| Query Language | GraphQL | SQL (DuneSQL, a Trino-based dialect) |
| Data Freshness | Near real-time (block-by-block) | Typically 5-15 minute delay |
| Data Model | Custom subgraph schema defined by the developer | Pre-defined, unified "Spellbook" abstractions |
| Deployment Model | Self-host a graph-node or publish to The Graph Network | Fully hosted, no self-deployment |
| Cost for Heavy Usage | Query fees in GRT on the decentralized network, or a paid plan | Free tier plus paid team plans for advanced features |
| Primary Use Case | Building applications that require custom, real-time data | Ad-hoc analysis, dashboards, and reporting |
| Developer Skill Required | GraphQL and subgraph development (AssemblyScript mappings) | SQL and familiarity with Dune's table structure |
Implementation Code Examples
Building a Basic NLQ Agent
This example uses OpenAI's API and The Graph to query Ethereum DeFi data. You'll need an OpenAI API key and a GraphQL endpoint for a subgraph.
```python
import openai
import requests
import json

# Configuration
OPENAI_API_KEY = 'your_key_here'
GRAPHQL_ENDPOINT = 'https://api.thegraph.com/subgraphs/name/uniswap/uniswap-v3'

openai.api_key = OPENAI_API_KEY

def nlq_to_graphql(user_query):
    """Uses an LLM to convert natural language to a GraphQL query."""
    prompt = f"""Convert this on-chain data question into a precise GraphQL query for The Graph.
Schema Context: The subgraph has Pool, Swap, and Token entities with fields like id, token0, token1, amountUSD, timestamp.
Question: {user_query}
Return ONLY the GraphQL query, no explanation.
"""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content

def execute_graphql_query(query):
    """Executes the generated GraphQL query."""
    response = requests.post(
        GRAPHQL_ENDPOINT,
        json={'query': query}
    )
    return response.json()

# Example usage
question = "What was the total swap volume in USD on Uniswap V3 in the last 24 hours?"
graphql_query = nlq_to_graphql(question)
print("Generated Query:", graphql_query)

data = execute_graphql_query(graphql_query)
print("Result:", json.dumps(data, indent=2))
```
Note: This is a simplified proof-of-concept. Production systems require robust error handling, schema validation, and prompt engineering to improve reliability.
Prompt Engineering for Intent and Entity Recognition
A guide to designing natural language prompts that accurately extract user intent and on-chain entities for data queries.
In on-chain analytics, prompt engineering transforms ambiguous user questions into structured queries a system can execute. The core challenge is intent recognition—determining if a user wants a wallet's balance, an NFT's floor price, or a token's trading volume—and entity recognition—identifying the specific addresses, token symbols, or contract names involved. A well-designed prompt acts as an instruction set for a language model, guiding it to parse natural language and output a standardized query format like GraphQL or SQL.
Start by defining the core user intents your system supports. Common intents in DeFi and NFT analytics include: get_balance, get_transactions, get_token_price, get_pool_stats, and get_holder_distribution. For each intent, specify the required and optional entities. For example, the get_balance intent requires an address entity and a token_symbol or chain entity. Document these in a schema that maps intents to their expected parameters and data sources, such as The Graph subgraphs or direct RPC calls.
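One way to document that mapping is a small schema dictionary like the sketch below; the structure is illustrative rather than a fixed standard.

```python
# Map each supported intent to its required/optional entities and data source.
INTENT_SCHEMA = {
    "get_balance": {
        "required": ["address"],
        "optional": ["token_symbol", "chain"],
        "source": "rpc",
    },
    "get_transactions": {
        "required": ["address"],
        "optional": ["chain", "limit"],
        "source": "subgraph",
    },
    "get_pool_stats": {
        "required": ["pool_address"],
        "optional": ["chain"],
        "source": "subgraph",
    },
}

def missing_entities(intent, entities):
    """Return required entities the LLM failed to extract, so the system can re-prompt."""
    spec = INTENT_SCHEMA[intent]
    return [name for name in spec["required"] if name not in entities]
```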
Craft your system prompt to explicitly instruct the LLM on this schema. Use few-shot prompting with clear examples to demonstrate the input-output mapping. For instance:
User: "What's the ETH balance for vitalik.eth?"
System: {"intent": "get_balance", "entities": {"address": "vitalik.eth", "token_symbol": "ETH"}}
User: "Show me the last 10 transactions for the Uniswap V3 factory."
System: {"intent": "get_transactions", "entities": {"contract_address": "0x1F98431c8aD98523631AE4a59f267346ea31F984"}, "parameters": {"limit": 10}}
This teaches the model the expected JSON structure.
Handle ambiguity and errors by designing prompts that encourage the model to ask clarifying questions or return confidence scores. If a user asks "What's the price of APE?", the prompt should instruct the model to recognize APE could refer to ApeCoin (0x4d224452801aced8b2f0aebe155379bb5d594381) or the NFT collection Bored Ape Yacht Club. Your prompt might include a disambiguation step: "If a token symbol is ambiguous, return a list of possible contract addresses for the user to choose from."
Finally, integrate the parsed output into your data pipeline. The structured JSON from the LLM should trigger a specific query function. For the get_pool_stats intent for a DEX like Uniswap, you would use the extracted pool_address to call a subgraph query fetching totalValueLocked, volumeUSD, and feesUSD. Always validate extracted entities—confirm an address is valid on the specified chain using libraries like ethers.js or viem before querying to prevent errors and enhance security.
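The libraries named above are JavaScript; to stay consistent with the other examples here, an equivalent check in Python with web3.py (the RPC URL is a placeholder) looks roughly like this.

```python
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://eth-mainnet.example.com"))  # placeholder mainnet RPC

def resolve_entity(address_or_ens):
    """Validate a raw address or resolve an ENS name before querying."""
    if Web3.is_address(address_or_ens):
        return Web3.to_checksum_address(address_or_ens)
    resolved = w3.ens.address(address_or_ens)  # e.g. "vitalik.eth"
    if resolved is None:
        raise ValueError(f"Could not resolve entity: {address_or_ens}")
    return resolved
```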
Common Issues and Troubleshooting
Resolve frequent challenges when setting up and using natural language to query on-chain data. This guide covers API errors, query parsing, and performance optimization.
Queries that return empty or inaccurate results are often caused by ambiguous phrasing or by referencing data the system does not support. The system maps your words to specific on-chain entities and metrics.
Common causes and fixes:
- Vague Entity Reference: "Show me Vitalik's wallet." Be specific: "Show me the Ethereum balance for address 0xd8dA6BF26964aF9D7eEd9e03E53415D37aA96045."
- Unsupported Metric: "How happy are NFT holders?" Use quantifiable on-chain data: "Show me the average holding time for Bored Ape Yacht Club NFTs."
- Timeframe Issues: "Recent transactions" is ambiguous. Specify: "Show transactions from the last 24 hours."
Debugging Steps:
- Check the system's query log to see how your prompt was interpreted into SQL.
- Simplify your query to a basic template: `[Action] [Metric] for [Entity] on [Chain] over [Timeframe]`.
- Consult the data schema to confirm the metric (e.g., `token_balance`, `tx_count`) and entity type (e.g., `address`, `contract`, `collection`) exist.
Essential Resources and Tools
Tools and frameworks that let developers query on-chain data using natural language instead of raw SQL or GraphQL. These resources focus on translating prompts into verifiable queries, validating results, and integrating analytics into production workflows.
Design Patterns for Reliable Natural Language On-Chain Queries
Beyond tools, reliable natural language analytics depends on prompt and validation patterns that reduce hallucinations and errors.
Recommended practices:
- Constrain prompts with explicit schemas: chain, timeframe, metric definition
- Use read-only SQL generation with no write permissions
- Add automated checks for row counts, null values, and outliers
Common pattern:
- NL prompt → draft query
- Deterministic query execution
- Result validation and summary generation
Example:
- Prompt: "Weekly DEX volume on Base"
- Validation: ensure all volume fields are non-negative and timestamped
These patterns apply across Dune, Flipside, and API-based tools and are critical for production use.
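A sketch of the validation step for the "Weekly DEX volume on Base" example above, with illustrative field names:

```python
def validate_volume_rows(rows):
    """Reject empty results, missing timestamps, or negative volumes."""
    if not rows:
        raise ValueError("Query returned no rows")
    for row in rows:
        if not row.get("week"):
            raise ValueError("Row is missing its timestamp bucket")
        volume = row.get("volume_usd")
        if volume is None or volume < 0:
            raise ValueError(f"Invalid volume for {row.get('week')}: {volume}")
    return rows

# Example: validate_volume_rows([{"week": "2024-01-08", "volume_usd": 1250000.0}])
```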
This entry is a conceptual resource rather than a single product, aimed at developers building trustworthy NL analytics systems.
Frequently Asked Questions
Common questions and troubleshooting for developers using natural language to query on-chain data.
Natural language queries (NLQs) allow you to ask questions about blockchain data in plain English instead of writing complex SQL or GraphQL. For example, you can ask "What was the total volume on Uniswap V3 on Arbitrum last week?" instead of manually joining tables for pools, swaps, and timestamps. Systems like Chainscore's AI agent parse your intent, translate it into a structured query against indexed blockchain data, and return the results. This significantly reduces the time spent on data exploration and ad-hoc analysis, especially for complex metrics like user retention, protocol revenue, or cross-contract interactions.
Conclusion and Next Steps
You have now configured a system to query on-chain data using natural language. This guide covered the essential components from data indexing to AI-powered query translation.
Setting up natural language queries transforms blockchain analytics from a developer-centric task to an accessible tool for researchers, analysts, and product teams. The core architecture you've implemented involves three layers: a data indexing layer (using services like The Graph or Subsquid), a vector database (like Pinecone or Weaviate) for semantic search on indexed schemas, and an LLM orchestration layer (via LangChain or LlamaIndex) to translate user questions into executable queries, such as GraphQL or SQL. This setup allows questions like "Show me the top 10 NFT minters on Ethereum last week" to be processed automatically.
For production deployment, several critical next steps are required. First, implement rigorous query validation and sanitization to prevent malicious or malformed inputs from reaching your indexers. Second, establish a feedback loop where users can flag incorrect LLM translations, which you can use to fine-tune your prompts or create a curated set of few-shot examples. Third, consider cost optimization by caching frequent query patterns and implementing usage limits, as LLM API calls and complex indexer queries can become expensive at scale. Tools like Prometheus for monitoring and Grafana for dashboards are essential for tracking system performance and costs.
To extend your system's capabilities, explore integrating more data sources. Consider adding real-time data streams from WebSocket connections to nodes for sub-second analytics, or off-chain data from oracles like Chainlink to enrich context. You can also build specialized agents that not only query data but execute actions based on the results, such as triggering a smart contract or sending an alert. The foundational pattern you've built—translating intent into structured query—is the first step toward more autonomous, intelligent blockchain applications. Continue experimenting with different LLM models (like Claude 3 or GPT-4) and fine-tuning strategies to improve accuracy for your specific use case.