Setting Up Natural Language Queries for On-Chain Data Analytics
Learn how to query blockchain data using plain English instead of complex SQL or GraphQL, enabling faster insights for developers and analysts.
Natural language query (NLQ) systems translate human questions into structured queries that a blockchain indexer can understand. Instead of writing SELECT * FROM ethereum.transactions WHERE block_number > 15000000, you can ask "Show me the latest high-value transactions on Ethereum." This is powered by large language models (LLMs) that parse intent and map it to on-chain data schemas. Services like Chainscore, The Graph's Natural Language API, and Flipside Crypto's AI Assistant provide these interfaces, abstracting the underlying complexity of subgraphs or raw RPC calls.
Setting up NLQ typically involves connecting to an API endpoint with your query and receiving structured JSON data. For example, using Chainscore's API, you might send a POST request with a prompt like "Get the top 10 NFT collections by volume on Polygon last week." The system identifies the required data points—collection addresses, sale volumes, timestamps—and executes the corresponding query against its indexed dataset. You need an API key from the provider and basic knowledge of making HTTP requests from your application or a script.
For developers, integrating NLQ starts with choosing a provider based on supported chains, data freshness, and cost. Most offer a free tier for testing. A basic integration in Python using the requests library looks like this:
```python
import requests

api_key = 'YOUR_API_KEY'
url = 'https://api.chainscore.io/v1/nl-query'
headers = {'Authorization': f'Bearer {api_key}'}
data = {'query': 'What is the current TVL of Aave v3 on Arbitrum?'}

response = requests.post(url, json=data, headers=headers)
print(response.json())
```
The response will contain the Total Value Locked figure and often additional context like the underlying query used.
Key considerations when using NLQ systems include query specificity and cost management. Vague questions may return ambiguous results, so refining prompts (e.g., "TVL in USD" vs. "TVL") improves accuracy. Since each query consumes API credits, it's efficient to batch related questions. Furthermore, always verify critical data points, as LLMs can occasionally "hallucinate" or misinterpret niche terminology. For production use, implement caching for frequent queries to reduce latency and cost.
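As a sketch of that caching step, the snippet below wraps the REST call from the earlier example in a small in-memory cache with a time-to-live; the endpoint and payload shape follow that example, and the five-minute TTL is an arbitrary illustration.

```python
import time
import requests

API_URL = 'https://api.chainscore.io/v1/nl-query'
CACHE_TTL_SECONDS = 300  # illustrative: cache answers for 5 minutes
_cache = {}  # question -> (expiry_timestamp, response_json)

def cached_nl_query(question, api_key):
    """Serve repeat questions from memory; otherwise hit the NLQ endpoint."""
    now = time.time()
    hit = _cache.get(question)
    if hit and hit[0] > now:
        return hit[1]
    response = requests.post(
        API_URL,
        json={'query': question},
        headers={'Authorization': f'Bearer {api_key}'},
        timeout=30,
    )
    response.raise_for_status()
    data = response.json()
    _cache[question] = (now + CACHE_TTL_SECONDS, data)
    return data
```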
The primary use cases are exploratory data analysis, dashboard prototyping, and automated reporting. An analyst can quickly investigate wallet activity or protocol metrics without writing SQL. A developer can embed NLQ features into an application to let users ask questions about their portfolio. This technology significantly lowers the barrier to on-chain analytics, but it complements rather than replaces deep, custom querying for complex, high-frequency data needs.
Prerequisites and Setup
Before you can query on-chain data with natural language, you need to configure your development environment and understand the core components involved.
Natural language querying for blockchain data requires a specific technical stack. You will need a working knowledge of JavaScript/TypeScript and Node.js (v18 or later). Familiarity with Ethereum concepts like smart contracts, addresses, and common data standards (ERC-20, ERC-721) is essential. For the initial setup, ensure you have npm or yarn installed to manage dependencies. This guide uses the Chainscore API as the primary interface, which translates English questions into structured queries for networks like Ethereum, Arbitrum, and Base.
The first step is to obtain your API credentials. Navigate to the Chainscore Developer Portal and create an account to generate an API key. This key authenticates your requests and manages rate limits. Store it securely using environment variables. Create a .env file in your project root and add your key: CHAINSCORE_API_KEY=your_key_here. You can then install the necessary Node.js client library using npm install @chainscore/chainscore.
With the library installed, you can initialize the client in your application. Import the Chainscore class and instantiate it with your API key. The client is your gateway to submitting queries. For example, a basic initialization script looks like this:
```javascript
import { Chainscore } from '@chainscore/chainscore';

const chainscore = new Chainscore({
  apiKey: process.env.CHAINSCORE_API_KEY,
});
```
This client object will be used to call methods like chainscore.query("Your question here") to fetch on-chain data.
Your queries will target specific blockchain networks. You must configure the chain ID for your data source. Common IDs include 1 for Ethereum Mainnet and 42161 for Arbitrum One. The Chainscore API supports multiple chains, allowing you to analyze data across ecosystems. When formulating a question, clarity is key. Instead of "show me some swaps," ask "What were the top 5 Uniswap V3 swaps by USD volume on Arbitrum in the last 24 hours?" This specificity helps the engine return accurate, actionable results.
Finally, consider the output format. The API returns data in structured JSON, which you can integrate into dashboards, trading bots, or research tools. You may want to set up a simple Express server or a script to periodically fetch and process this data. Understanding these prerequisites—the development environment, API access, client setup, and query formulation—ensures a smooth start to building powerful on-chain analytics with natural language.
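If you prefer a plain script over an Express server, a minimal polling loop against the REST endpoint from the overview (rather than the Node client above) might look like the following; the question and ten-minute interval are placeholders.

```python
import os
import time
import requests

API_URL = 'https://api.chainscore.io/v1/nl-query'
API_KEY = os.environ['CHAINSCORE_API_KEY']
QUESTION = 'What is the current TVL of Aave v3 on Arbitrum?'

def fetch_answer():
    response = requests.post(
        API_URL,
        json={'query': QUESTION},
        headers={'Authorization': f'Bearer {API_KEY}'},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

if __name__ == '__main__':
    while True:
        try:
            print(time.strftime('%Y-%m-%dT%H:%M:%S'), fetch_answer())
        except requests.RequestException as err:
            print('NLQ refresh failed:', err)
        time.sleep(600)  # poll every 10 minutes
```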
Core System Components
Tools and frameworks that translate plain English questions into executable queries for blockchain data, enabling analytics without SQL.
This guide explains the architectural components and data flow required to build a system that translates natural language questions into executable queries for on-chain data.
A natural language query (NLQ) system for blockchain data requires a multi-layered architecture to bridge the gap between human language and structured on-chain information. The core components are: a user interface for input, a natural language processing (NLP) engine to interpret intent, a query translation layer that maps concepts to on-chain data schemas, and a data indexing and retrieval layer that executes the query against a processed dataset. This architecture abstracts the complexity of raw blockchain data, allowing users to ask questions like "What was the total volume on Uniswap V3 yesterday?" without writing SQL or interacting with low-level RPC calls.
The data flow begins when a user submits a question. The NLP engine, typically powered by a large language model (LLM) like GPT-4 or an open-source alternative, performs intent classification and named entity recognition. It identifies key entities (e.g., "Uniswap V3", "volume", "yesterday") and the user's goal (e.g., aggregate a metric). The output is a structured intermediate representation, often in JSON, that defines the what, where, and when of the query. For reliability, this stage may use few-shot prompting with examples of successful translations to guide the model.
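For illustration, the intermediate representation for "What was the total volume on Uniswap V3 yesterday?" might look like the dictionary below; the field names and resolved time window are an example schema, not a standard.

```python
# Example output of the NLP stage for "What was the total volume on Uniswap V3 yesterday?"
# Field names and the resolved time window are illustrative.
parsed_intent = {
    "action": "aggregate",          # the user's goal
    "metric": "volume_usd",         # what to measure
    "protocol": "uniswap-v3",       # where to measure it
    "chain": "ethereum",
    "timeframe": {                  # "yesterday", resolved to an explicit window
        "start": "2024-01-14T00:00:00Z",
        "end": "2024-01-15T00:00:00Z",
    },
}
```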
The query translation layer is the critical bridge. It takes the structured intent and maps it to a specific data schema and query language. If your indexed data is in a PostgreSQL data warehouse, this layer generates SQL. If you're querying a subgraph on The Graph, it constructs a GraphQL query. This requires a schema definition that the system understands—a metadata layer describing available tables, fields, and their relationships to blockchain concepts (e.g., mapping 'swap' to a dex.trades table). Tools like LangChain or LlamaIndex are often used here to create this abstraction and manage context.
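A minimal sketch of that mapping, assuming the intermediate representation above and a Dune-style dex.trades table (table and column names are illustrative):

```python
# Metadata layer: map logical metrics to warehouse tables and columns.
SCHEMA_MAP = {
    "volume_usd": {
        "table": "dex.trades",
        "column": "amount_usd",
        "protocol_column": "project",
    },
}

def intent_to_sql(intent):
    """Translate a parsed intent into SQL using the metadata layer."""
    meta = SCHEMA_MAP[intent["metric"]]
    return (
        f"SELECT SUM({meta['column']}) AS total_volume_usd\n"
        f"FROM {meta['table']}\n"
        f"WHERE {meta['protocol_column']} = '{intent['protocol']}'\n"
        f"  AND block_time >= TIMESTAMP '{intent['timeframe']['start']}'\n"
        f"  AND block_time <  TIMESTAMP '{intent['timeframe']['end']}'"
    )

# e.g. print(intent_to_sql(parsed_intent)) with the example intent above;
# adjust the timestamp literals to your warehouse's SQL dialect.
```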
Finally, the generated query executes against a pre-indexed data layer. Querying raw Ethereum blocks via an RPC is too slow for analytics. Instead, systems rely on indexed data providers like The Graph, Dune Analytics, Covalent, or custom pipelines using Apache Spark or ClickHouse. The result is returned, often formatted by the LLM into a human-readable answer (e.g., "The total volume was $125.4M"). For production systems, implementing a caching layer for frequent queries and query validation to prevent malformed or expensive operations is essential for performance and cost control.
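A simple version of that validation step, which rejects anything other than bounded read-only statements (the keyword list and row cap are illustrative), is sketched below.

```python
import re

FORBIDDEN = re.compile(r"\b(insert|update|delete|drop|alter|grant|truncate)\b", re.IGNORECASE)

def validate_generated_sql(sql, max_rows=10_000):
    """Allow only read-only statements and cap result size before execution."""
    statement = sql.strip().rstrip(';')
    if not statement.lower().startswith('select'):
        raise ValueError('Only SELECT statements are allowed')
    if FORBIDDEN.search(statement):
        raise ValueError('Write or DDL keywords are not permitted')
    # Append a LIMIT if the model did not include one, to bound query cost.
    if not re.search(r"\blimit\s+\d+\b", statement, re.IGNORECASE):
        statement = f"{statement} LIMIT {max_rows}"
    return statement
```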
When implementing this architecture, key decisions include choosing between a general-purpose LLM API (easier setup, ongoing cost) and a fine-tuned open-source model (more control, higher initial effort), and selecting your data index. For Ethereum, using Dune's SQL engine (DuneSQL) or The Graph's subgraphs provides robust, community-vetted schemas. A practical first step is to prototype using the OpenAI API with LangChain's SQL Agent, connecting it to a public Dune API endpoint or a mirrored dataset to handle translation and execution automatically and demonstrate the end-to-end flow.
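As a starting point for that prototype, the sketch below wires LangChain's SQL agent to a mirrored on-chain dataset in Postgres; the connection string is a placeholder, and import paths can shift between LangChain releases.

```python
from langchain_openai import ChatOpenAI
from langchain_community.utilities import SQLDatabase
from langchain_community.agent_toolkits import create_sql_agent

# Placeholder connection string for a mirrored on-chain dataset.
db = SQLDatabase.from_uri("postgresql+psycopg2://user:password@localhost:5432/onchain")
llm = ChatOpenAI(model="gpt-4", temperature=0)

# The agent inspects the schema, drafts SQL, executes it, and summarizes the result.
agent = create_sql_agent(llm, db=db, agent_type="openai-tools", verbose=True)

result = agent.invoke({"input": "What was the total Uniswap V3 swap volume in USD yesterday?"})
print(result["output"])
```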
The Graph vs. Dune Analytics for Query Generation
Key differences between the two primary platforms for querying and analyzing blockchain data.
| Feature | The Graph | Dune Analytics |
|---|---|---|
| Core Architecture | Decentralized indexing protocol with subgraphs | Centralized analytics platform with community queries |
| Query Language | GraphQL | SQL (DuneSQL, a Trino-based dialect) |
| Data Freshness | Near real-time (block-by-block) | Typically 5-15 minute delay |
| Data Model | Custom subgraph schema defined by the developer | Pre-defined, unified "Spellbook" abstractions |
| Deployment Model | Self-host a graph-node or publish to The Graph Network | Fully hosted, no self-deployment |
| Cost for Heavy Usage | Query fees in GRT on the decentralized network, or a paid plan | Free tier plus paid team plans for advanced features |
| Primary Use Case | Building applications that require custom, real-time data | Ad-hoc analysis, dashboards, and reporting |
| Developer Skill Required | GraphQL and subgraph development (AssemblyScript mappings) | SQL and familiarity with Dune's table structure |
Implementation Code Examples
Building a Basic NLQ Agent
This example uses OpenAI's API and The Graph to query Ethereum DeFi data. You'll need an OpenAI API key and a GraphQL endpoint for a subgraph.
```python
import openai
import requests
import json

# Configuration
OPENAI_API_KEY = 'your_key_here'
GRAPHQL_ENDPOINT = 'https://api.thegraph.com/subgraphs/name/uniswap/uniswap-v3'

openai.api_key = OPENAI_API_KEY

def nlq_to_graphql(user_query):
    """Uses an LLM to convert natural language to a GraphQL query."""
    prompt = f"""Convert this on-chain data question into a precise GraphQL query for The Graph.
Schema Context: The subgraph has Pool, Swap, and Token entities with fields like id, token0, token1, amountUSD, timestamp.
Question: {user_query}
Return ONLY the GraphQL query, no explanation.
"""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0
    )
    return response.choices[0].message.content

def execute_graphql_query(query):
    """Executes the generated GraphQL query."""
    response = requests.post(
        GRAPHQL_ENDPOINT,
        json={'query': query}
    )
    return response.json()

# Example usage
question = "What was the total swap volume in USD on Uniswap V3 in the last 24 hours?"
graphql_query = nlq_to_graphql(question)
print("Generated Query:", graphql_query)

data = execute_graphql_query(graphql_query)
print("Result:", json.dumps(data, indent=2))
```
Note: This is a simplified proof-of-concept. Production systems require robust error handling, schema validation, and prompt engineering to improve reliability.
Prompt Engineering for Intent and Entity Recognition
A guide to designing natural language prompts that accurately extract user intent and on-chain entities for data queries.
In on-chain analytics, prompt engineering transforms ambiguous user questions into structured queries a system can execute. The core challenge is intent recognition—determining if a user wants a wallet's balance, an NFT's floor price, or a token's trading volume—and entity recognition—identifying the specific addresses, token symbols, or contract names involved. A well-designed prompt acts as an instruction set for a language model, guiding it to parse natural language and output a standardized query format like GraphQL or SQL.
Start by defining the core user intents your system supports. Common intents in DeFi and NFT analytics include: get_balance, get_transactions, get_token_price, get_pool_stats, and get_holder_distribution. For each intent, specify the required and optional entities. For example, the get_balance intent requires an address entity and a token_symbol or chain entity. Document these in a schema that maps intents to their expected parameters and data sources, such as The Graph subgraphs or direct RPC calls.
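One way to document that mapping is a small schema dictionary like the sketch below; the structure is illustrative rather than a fixed standard.

```python
# Map each supported intent to its required/optional entities and data source.
INTENT_SCHEMA = {
    "get_balance": {
        "required": ["address"],
        "optional": ["token_symbol", "chain"],
        "source": "rpc",
    },
    "get_transactions": {
        "required": ["address"],
        "optional": ["chain", "limit"],
        "source": "subgraph",
    },
    "get_pool_stats": {
        "required": ["pool_address"],
        "optional": ["chain"],
        "source": "subgraph",
    },
}

def missing_entities(intent, entities):
    """Return required entities the LLM failed to extract, so the system can re-prompt."""
    spec = INTENT_SCHEMA[intent]
    return [name for name in spec["required"] if name not in entities]
```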
Craft your system prompt to explicitly instruct the LLM on this schema. Use few-shot prompting with clear examples to demonstrate the input-output mapping. For instance:
User: "What's the ETH balance for vitalik.eth?"
System: {"intent": "get_balance", "entities": {"address": "vitalik.eth", "token_symbol": "ETH"}}
User: "Show me the last 10 transactions for the Uniswap V3 factory."
System: {"intent": "get_transactions", "entities": {"contract_address": "0x1F98431c8aD98523631AE4a59f267346ea31F984"}, "parameters": {"limit": 10}}
This teaches the model the expected JSON structure.
Handle ambiguity and errors by designing prompts that encourage the model to ask clarifying questions or return confidence scores. If a user asks "What's the price of APE?", the prompt should instruct the model to recognize APE could refer to ApeCoin (0x4d224452801aced8b2f0aebe155379bb5d594381) or the NFT collection Bored Ape Yacht Club. Your prompt might include a disambiguation step: "If a token symbol is ambiguous, return a list of possible contract addresses for the user to choose from."
Finally, integrate the parsed output into your data pipeline. The structured JSON from the LLM should trigger a specific query function. For the get_pool_stats intent for a DEX like Uniswap, you would use the extracted pool_address to call a subgraph query fetching totalValueLocked, volumeUSD, and feesUSD. Always validate extracted entities—confirm an address is valid on the specified chain using libraries like ethers.js or viem before querying to prevent errors and enhance security.
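The libraries named above are JavaScript; to stay consistent with the other examples here, an equivalent check in Python with web3.py (the RPC URL is a placeholder) looks roughly like this.

```python
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("https://eth-mainnet.example.com"))  # placeholder mainnet RPC

def resolve_entity(address_or_ens):
    """Validate a raw address or resolve an ENS name before querying."""
    if Web3.is_address(address_or_ens):
        return Web3.to_checksum_address(address_or_ens)
    resolved = w3.ens.address(address_or_ens)  # e.g. "vitalik.eth"
    if resolved is None:
        raise ValueError(f"Could not resolve entity: {address_or_ens}")
    return resolved
```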
Common Issues and Troubleshooting
Resolve frequent challenges when setting up and using natural language to query on-chain data. This guide covers API errors, query parsing, and performance optimization.
Queries that return empty or inaccurate results are often caused by ambiguous phrasing or by referencing data the system does not support. The system maps your words to specific on-chain entities and metrics.
Common causes and fixes:
- Vague Entity Reference: "Show me Vitalik's wallet." Be specific: "Show me the Ethereum balance for address 0xd8dA6BF26964aF9D7eEd9e03E53415D37aA96045."
- Unsupported Metric: "How happy are NFT holders?" Use quantifiable on-chain data: "Show me the average holding time for Bored Ape Yacht Club NFTs."
- Timeframe Issues: "Recent transactions" is ambiguous. Specify: "Show transactions from the last 24 hours."
Debugging Steps:
- Check the system's query log to see how your prompt was interpreted into SQL.
- Simplify your query to a basic template: `[Action] [Metric] for [Entity] on [Chain] over [Timeframe]`.
- Consult the data schema to confirm the metric (e.g., `token_balance`, `tx_count`) and entity type (e.g., `address`, `contract`, `collection`) exist.
Essential Resources and Tools
Tools and frameworks that let developers query on-chain data using natural language instead of raw SQL or GraphQL. These resources focus on translating prompts into verifiable queries, validating results, and integrating analytics into production workflows.
Design Patterns for Reliable Natural Language On-Chain Queries
Beyond tools, reliable natural language analytics depends on prompt and validation patterns that reduce hallucinations and errors.
Recommended practices:
- Constrain prompts with explicit schemas: chain, timeframe, metric definition
- Use read-only SQL generation with no write permissions
- Add automated checks for row counts, null values, and outliers
Common pattern:
- NL prompt → draft query
- Deterministic query execution
- Result validation and summary generation
Example:
- Prompt: "Weekly DEX volume on Base"
- Validation: ensure all volume fields are non-negative and timestamped
These patterns apply across Dune, Flipside, and API-based tools and are critical for production use.
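A sketch of the validation step for the "Weekly DEX volume on Base" example above, with illustrative field names:

```python
def validate_volume_rows(rows):
    """Reject empty results, missing timestamps, or negative volumes."""
    if not rows:
        raise ValueError("Query returned no rows")
    for row in rows:
        if not row.get("week"):
            raise ValueError("Row is missing its timestamp bucket")
        volume = row.get("volume_usd")
        if volume is None or volume < 0:
            raise ValueError(f"Invalid volume for {row.get('week')}: {volume}")
    return rows

# Example: validate_volume_rows([{"week": "2024-01-08", "volume_usd": 1250000.0}])
```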
This entry is a conceptual resource rather than a single product, aimed at developers building trustworthy NL analytics systems.
Frequently Asked Questions
Common questions and troubleshooting for developers using natural language to query on-chain data.
Natural language queries (NLQs) allow you to ask questions about blockchain data in plain English instead of writing complex SQL or GraphQL. For example, you can ask "What was the total volume on Uniswap V3 on Arbitrum last week?" instead of manually joining tables for pools, swaps, and timestamps. Systems like Chainscore's AI agent parse your intent, translate it into a structured query against indexed blockchain data, and return the results. This significantly reduces the time spent on data exploration and ad-hoc analysis, especially for complex metrics like user retention, protocol revenue, or cross-contract interactions.
Conclusion and Next Steps
You have now configured a system to query on-chain data using natural language. This guide covered the essential components from data indexing to AI-powered query translation.
Setting up natural language queries transforms blockchain analytics from a developer-centric task to an accessible tool for researchers, analysts, and product teams. The core architecture you've implemented involves three layers: a data indexing layer (using services like The Graph or Subsquid), a vector database (like Pinecone or Weaviate) for semantic search on indexed schemas, and an LLM orchestration layer (via LangChain or LlamaIndex) to translate user questions into executable queries, such as GraphQL or SQL. This setup allows questions like "Show me the top 10 NFT minters on Ethereum last week" to be processed automatically.
For production deployment, several critical next steps are required. First, implement rigorous query validation and sanitization to prevent malicious or malformed inputs from reaching your indexers. Second, establish a feedback loop where users can flag incorrect LLM translations, which you can use to fine-tune your prompts or create a curated set of few-shot examples. Third, consider cost optimization by caching frequent query patterns and implementing usage limits, as LLM API calls and complex indexer queries can become expensive at scale. Tools like Prometheus for monitoring and Grafana for dashboards are essential for tracking system performance and costs.
To extend your system's capabilities, explore integrating more data sources. Consider adding real-time data streams from WebSocket connections to nodes for sub-second analytics, or off-chain data from oracles like Chainlink to enrich context. You can also build specialized agents that not only query data but execute actions based on the results, such as triggering a smart contract or sending an alert. The foundational pattern you've built—translating intent into structured query—is the first step toward more autonomous, intelligent blockchain applications. Continue experimenting with different LLM models (like Claude 3 or GPT-4) and fine-tuning strategies to improve accuracy for your specific use case.