Setting Up a Blockchain Indexing and Query Service
A practical guide to building a service that extracts, transforms, and serves structured blockchain data for applications.
Blockchain indexing is the process of extracting, transforming, and loading (ETL) raw on-chain data into a structured, queryable database. Unlike a standard blockchain node, which serves sequential block data, an indexing service answers specific questions such as "What are the token balances for this address?" or "List all NFT transfers in the last hour." This is essential for dApps, analytics dashboards, and wallets that require fast, complex queries. Popular solutions include The Graph for decentralized indexing and Subsquid for high-performance data lakes.
The core architecture involves three main components. First, a data extraction layer connects to an RPC node (e.g., using eth_getLogs) to capture events and transaction data. Second, a transformation layer, often written in TypeScript or Rust, decodes raw hexadecimal data using Application Binary Interfaces (ABIs) and applies business logic. Third, a database (like PostgreSQL) stores the processed data, and a GraphQL or REST API serves it to end-users. This pipeline turns raw blockchain logs into meaningful entities like User, Swap, or Transfer.
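As a minimal sketch of the extraction and decoding steps, assuming the viem client library (introduced later under Prerequisites), a placeholder token address, and an RPC_URL environment variable, the code below pulls ERC-20 Transfer logs for a block range and maps them into flat records ready for storage:

```typescript
// Extraction-layer sketch: fetch and decode ERC-20 Transfer logs for a fixed
// block range with viem. The token address and block numbers are placeholders,
// and RPC_URL is assumed to point at a provider such as Alchemy or QuickNode.
import { createPublicClient, http, parseAbiItem } from "viem";
import { mainnet } from "viem/chains";

const client = createPublicClient({
  chain: mainnet,
  transport: http(process.env.RPC_URL),
});

const transferEvent = parseAbiItem(
  "event Transfer(address indexed from, address indexed to, uint256 value)"
);

export async function extractTransfers(
  token: `0x${string}`,
  fromBlock: bigint,
  toBlock: bigint
) {
  // eth_getLogs under the hood; viem decodes each log against the ABI item.
  const logs = await client.getLogs({
    address: token,
    event: transferEvent,
    fromBlock,
    toBlock,
    strict: true,
  });

  return logs.map((log) => ({
    id: `${log.transactionHash}-${log.logIndex}`, // deterministic primary key
    from: log.args.from,
    to: log.args.to,
    value: log.args.value, // bigint; store as NUMERIC in PostgreSQL
    blockNumber: log.blockNumber,
  }));
}
```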
To build a basic indexer for ERC-20 transfers, start by defining a schema. Using a tool like The Graph, you create a schema.graphql file with a Transfer entity containing fields like id, from, to, value, and timestamp. The mapping logic in mapping.ts subscribes to the Transfer event, decodes the log data using the ERC-20 ABI, and saves a new Transfer entity to the store. The service then automatically indexes historical data and stays synced with new blocks, providing a queryable endpoint.
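A hedged sketch of what that handler could look like is below. Subgraph mappings are written in AssemblyScript, a TypeScript subset; the import paths under generated/ are produced by `graph codegen` and depend on your subgraph.yaml, and the Bytes-typed id assumes a matching id: Bytes! field in schema.graphql.

```typescript
// mapping.ts sketch: handler for the ERC-20 Transfer event.
// Import paths under generated/ come from `graph codegen` and depend on how
// the data source is named in subgraph.yaml; treat them as placeholders.
import { Transfer as TransferEvent } from "../generated/ERC20/ERC20";
import { Transfer } from "../generated/schema";

export function handleTransfer(event: TransferEvent): void {
  // txHash + logIndex is a unique, deterministic entity id (id: Bytes! in the schema).
  const id = event.transaction.hash.concatI32(event.logIndex.toI32());
  const transfer = new Transfer(id);
  transfer.from = event.params.from;
  transfer.to = event.params.to;
  transfer.value = event.params.value;
  transfer.timestamp = event.block.timestamp;
  transfer.save();
}
```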
For custom or high-performance needs, a self-hosted indexer using Subsquid or TrueBlocks offers more control. A Subsquid squid uses a processor to batch-fetch blocks, apply transformations in-memory, and upsert data to the target database. This approach can process thousands of blocks per second. Critical optimizations include using RPC providers with high rate limits (like Alchemy or QuickNode), implementing database indexing on frequently queried columns, and designing schemas to avoid the N+1 query problem in GraphQL.
Deploying your indexer requires infrastructure decisions. For production, run the indexer as a resilient service using Docker containers orchestrated with Kubernetes or a managed service. Implement monitoring for sync status, RPC errors, and database latency. Always plan for reorg handling—your service must be able to revert data when a blockchain reorganization occurs. Cost management is also key; indexing full historical data for Ethereum mainnet can require terabytes of storage and significant RPC calls, so consider starting with a multi-chain data provider like Covalent or Goldsky for initial prototyping.
The final step is exposing the data via an API. A GraphQL API generated from your schema allows clients to request exactly the data they need. For example, a query to get the latest large Uniswap swaps might filter by protocol, pool, and amount. Secure your endpoint with API keys and consider implementing query cost analysis to prevent abusive queries. With your indexing service live, developers can build fast applications without the complexity of directly parsing blockchain data, unlocking use cases in DeFi analytics, NFT marketplaces, and on-chain governance tools.
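For example, a client might fetch the latest large swaps with a query like the sketch below; the entity and filter names (swaps, pool, amountUSD_gt) are hypothetical and depend on the schema you expose, and the x-api-key header is just one possible auth scheme.

```typescript
// Hypothetical client-side query against the indexer's GraphQL endpoint.
// Entity and filter names depend entirely on the schema you publish.
const query = `
  query LargeSwaps($minUsd: String!) {
    swaps(
      first: 10
      orderBy: timestamp
      orderDirection: desc
      where: { amountUSD_gt: $minUsd }
    ) {
      id
      pool
      amountUSD
      timestamp
    }
  }
`;

export async function fetchLargeSwaps(endpoint: string, apiKey: string) {
  const res = await fetch(endpoint, {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": apiKey },
    body: JSON.stringify({ query, variables: { minUsd: "100000" } }),
  });
  const { data } = await res.json();
  return data.swaps;
}
```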
Prerequisites
Essential tools and knowledge required before building a blockchain indexing and query service.
Building a robust indexing service requires a foundational understanding of blockchain architecture and modern development tools. You should be comfortable with core concepts like blocks, transactions, smart contract events, and the structure of a Merkle Patricia Trie. Familiarity with the target blockchain's execution client (e.g., Geth, Erigon) and its JSON-RPC API is crucial for data extraction. This guide assumes proficiency in a backend programming language like Go, TypeScript (Node.js), or Rust, which are commonly used for high-performance data pipelines.
Your development environment must be properly configured. Install the latest LTS version of Node.js (v20+) and a package manager like npm or yarn. For Go-based projects, ensure you have Go 1.21+ installed. You will need Docker and Docker Compose to run local blockchain nodes, databases, and other dependencies in isolated containers. A code editor like VS Code with relevant extensions and a terminal for command-line operations are also essential.
A local or testnet blockchain node is a non-negotiable prerequisite for development and testing. For Ethereum-based chains, run a local development chain with Anvil (from Foundry) or Hardhat Network, or point your indexer at a public testnet node; Ganache has been sunset and Geth's light sync mode is deprecated, so neither is a good fit for new projects. For other ecosystems, use the official tooling, such as solana-test-validator for Solana. Securing testnet tokens (e.g., Sepolia ETH, Solana devnet SOL) from a faucet will be necessary for deploying test contracts and generating transaction data to index.
You will need a database to store and query the indexed data efficiently. PostgreSQL is the industry standard for relational blockchain data. For time-series data or high-write throughput, consider TimescaleDB (a PostgreSQL extension) or ClickHouse. Understanding basic SQL and schema design for blockchain data—such as tables for blocks, transactions, logs, and traces—is required before writing any indexing logic.
Finally, set up a project scaffolding. Initialize a new repository with git and create a basic project structure. For Node.js, initialize with npm init and install critical libraries like ethers.js v6 or viem for Ethereum interaction, and a database driver like pg for PostgreSQL. For Go, set up your go.mod file and import packages such as go-ethereum and pgx. This prepared workspace will allow you to focus on the indexing logic in subsequent steps.
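A quick smoke test, sketched below with the viem and pg packages named above and assumed RPC_URL and DATABASE_URL environment variables, confirms the workspace can reach both the chain and the database before any indexing logic is written:

```typescript
// Smoke test: confirm the workspace can reach both the RPC node and PostgreSQL
// before writing any indexing logic. Assumes RPC_URL and DATABASE_URL are set.
import { createPublicClient, http } from "viem";
import { mainnet } from "viem/chains";
import { Pool } from "pg";

async function main(): Promise<void> {
  const client = createPublicClient({ chain: mainnet, transport: http(process.env.RPC_URL) });
  const pool = new Pool({ connectionString: process.env.DATABASE_URL });

  const head = await client.getBlockNumber();        // latest chain height via RPC
  const { rows } = await pool.query("SELECT now()"); // round trip to the database

  console.log(`RPC reachable, head block: ${head}`);
  console.log(`Database reachable, server time: ${rows[0].now}`);
  await pool.end();
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```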
How a Blockchain Indexing Service Works
A blockchain indexing service transforms raw, sequential on-chain data into a structured, queryable database, enabling efficient access to historical and real-time blockchain information.
Blockchains like Ethereum store data as a series of blocks containing transactions and event logs. Querying this data directly from a node is slow and impractical for applications needing historical data, complex filtering, or aggregated results. An indexing service solves this by processing the chain, extracting relevant data (like token transfers, NFT mints, or DAO votes), and storing it in a format optimized for fast queries, typically in a database like PostgreSQL. This process is essential for dApps, analytics dashboards, and wallets that need to display user balances, transaction history, or protocol metrics efficiently.
The core architecture involves three main components. First, a synchronization process (or indexer) connects to a blockchain node via RPC, ingests new blocks, and decodes them using Application Binary Interfaces (ABIs). Second, a transformation layer applies custom logic to map raw data into a structured schema, such as creating a table for ERC20Transfers. Third, a query layer, often a GraphQL or REST API, exposes this structured data to end-users. Popular frameworks like The Graph abstract much of this complexity by letting developers define subgraphs that specify which data to index and how to transform it.
For a custom setup, you typically start by defining your data schema. What specific events or contract calls do you need to track? For example, to index Uniswap V3 pool activity, you would listen for Swap, Mint, and Burn events. You then write a handler function for each event type that processes the log data and saves it to your database. A critical consideration is handling chain reorganizations (reorgs), where the canonical chain changes. Your indexer must be able to detect reorgs, invalidate data from orphaned blocks, and re-index the new chain segment to maintain data accuracy.
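A common pattern, sketched below with illustrative table names and an illustrative revert helper, is to store each block's hash and check that the next block's parentHash still links to it before indexing:

```typescript
// Reorg handling sketch: before indexing block N, confirm its parentHash still
// links to the hash stored for N-1; if not, revert derived data and re-index.
// The indexed_blocks and transfers tables are illustrative.
import { Pool } from "pg";
import { createPublicClient, http } from "viem";
import { mainnet } from "viem/chains";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const client = createPublicClient({ chain: mainnet, transport: http(process.env.RPC_URL) });

export async function handleBlock(blockNumber: bigint): Promise<void> {
  const block = await client.getBlock({ blockNumber });

  const { rows } = await pool.query(
    "SELECT hash FROM indexed_blocks WHERE number = $1",
    [(blockNumber - 1n).toString()]
  );

  if (rows.length > 0 && rows[0].hash !== block.parentHash) {
    // The chain we had indexed was orphaned: revert N-1 and re-index it (this
    // recurses further back if the reorg is deeper), then continue with N.
    await revertFrom(blockNumber - 1n);
    await handleBlock(blockNumber - 1n);
  }

  // ...decode this block's logs and upsert entities here...

  await pool.query(
    `INSERT INTO indexed_blocks (number, hash) VALUES ($1, $2)
     ON CONFLICT (number) DO UPDATE SET hash = EXCLUDED.hash`,
    [blockNumber.toString(), block.hash]
  );
}

async function revertFrom(blockNumber: bigint): Promise<void> {
  // Delete everything derived from blocks >= blockNumber (illustrative tables).
  await pool.query("DELETE FROM transfers WHERE block_number >= $1", [blockNumber.toString()]);
  await pool.query("DELETE FROM indexed_blocks WHERE number >= $1", [blockNumber.toString()]);
}
```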
Performance and scalability are major challenges. As blockchain data grows, a naive indexing approach can fall behind the chain tip. Strategies to mitigate this include using block subscription (websockets) instead of polling, implementing batch processing of blocks, and optimizing database queries with proper indexing. For production services, you often need a high-availability RPC endpoint (like from Alchemy or Infura) and may deploy the indexer across multiple instances with coordinated checkpointing to avoid duplicate work.
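A minimal subscription sketch using viem's websocket transport follows; the WS_RPC_URL variable is a placeholder, and in production the handler would enqueue the new height for a worker rather than log it:

```typescript
// Subscribe to new heads over websockets instead of polling.
import { createPublicClient, webSocket } from "viem";
import { mainnet } from "viem/chains";

const wsClient = createPublicClient({
  chain: mainnet,
  transport: webSocket(process.env.WS_RPC_URL),
});

const unwatch = wsClient.watchBlockNumber({
  onBlockNumber: (blockNumber) => {
    // Hand the new height to the ingestion pipeline (queue, worker, etc.).
    console.log(`new head: ${blockNumber}`);
  },
  onError: (err) => console.error("subscription error", err),
});

// Call unwatch() during shutdown to close the subscription cleanly.
```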
Finally, the query service must be secure and efficient. Use query timeouts, rate limiting, and authentication to protect your API. For complex analytical queries, consider using a columnar data warehouse like ClickHouse alongside your operational database. The end goal is to provide developers with a reliable, fast, and accurate data layer that abstracts away the complexities of direct blockchain interaction, enabling them to build responsive applications focused on user experience rather than data plumbing.
Indexing Approach Comparison
A comparison of core methodologies for building a blockchain indexing and query service.
| Feature | Full Node + Custom Scripts | Subsquid | The Graph |
|---|---|---|---|
| Development Speed | Slow (weeks-months) | Fast (days-weeks) | Fast (days-weeks) |
| Data Freshness | < 1 sec | < 10 sec | ~1 min (indexer sync) |
| Query Language | Custom API / SQL | GraphQL | GraphQL |
| Historical Data | Full (self-indexed) | Full | Full |
| Real-time Updates | Yes | Yes (near real-time) | Yes (after indexer sync) |
| Multi-chain Support | Manual integration | Yes | Yes |
| Decentralized Network | No | Optional (SQD Network) | Yes |
| Hosting Responsibility | Self-managed | Self-managed / Cloud | Indexer-managed |
| Typical Cost (per month) | $200-500+ (infra) | $50-200 (cloud) | Query fees (GRT) |
Implementation Tutorials
Understanding the Stack
A blockchain indexing service transforms raw on-chain data into a queryable format. The core components are:
- Indexer Node: A service that connects to a blockchain node (e.g., an Ethereum Geth client) to ingest new blocks and transactions.
- Data Processor: Logic that extracts, transforms, and normalizes data from raw logs and transaction receipts based on smart contract ABIs.
- Database: A structured store (commonly PostgreSQL or TimescaleDB) where processed data is written for fast querying.
- Query API: A GraphQL or REST endpoint (like The Graph's subgraphs or a custom API) that serves the indexed data to applications.
This architecture decouples data ingestion from querying, enabling complex queries that would be impractical to serve directly from an RPC node.
Designing Your Data Schema
A well-designed data schema is the blueprint for your indexing service, dictating query performance, developer experience, and long-term maintainability.
The first step in building a blockchain indexing service is defining your data schema. This schema acts as a contract between your data ingestion pipeline and your GraphQL or REST API, determining what data is available and how it can be queried. Unlike a traditional database schema, a blockchain indexer's schema must model on-chain entities like Block, Transaction, Event, and TokenTransfer, as well as derived entities like Pool, Position, or User. The design choices you make here—such as normalization level, field types, and relationships—directly impact query speed and complexity.
A common approach is to start with a normalized schema, separating data into distinct, related tables (e.g., a transactions table and a separate events table linked by a foreign key). This reduces data duplication and is efficient for storage. However, for blockchain data, a denormalized or hybrid approach is often better for read performance. For example, you might embed frequently accessed event data directly within a transaction record to avoid expensive joins. Consider your primary query patterns: if users often request "all events for transaction X," denormalizing those events into an array field can provide sub-millisecond response times.
When defining fields, use specific, scalar GraphQL types or database column types that match the on-chain data. For an Event entity, include fields like id: ID!, address: String!, topics: [String!]!, and data: String!. For derived data, such as a Uniswap V3 Position, calculate and store computed values like liquidity, feesEarned, or currentValueUSD in the schema to avoid on-the-fly calculations during queries. Tools like The Graph use the GraphQL Schema Definition Language (extended with directives such as @entity) to define these entities and their relationships in a standardized way.
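For teams exposing a REST API or writing a custom pipeline rather than a subgraph, the same shapes can be captured as shared TypeScript types; the sketch below is only an illustration of the fields discussed above:

```typescript
// Illustrative shared types for a custom pipeline: the raw event shape and a
// derived Position entity with pre-computed fields.
export interface RawEvent {
  id: string; // txHash-logIndex
  address: string; // emitting contract
  topics: string[]; // indexed parameters; topics[0] is the event signature hash
  data: string; // non-indexed parameters, hex encoded
  blockNumber: bigint;
}

export interface Position {
  id: string;
  owner: string;
  liquidity: bigint; // computed at index time
  feesEarned: bigint; // computed at index time
  currentValueUSD: string; // decimal string; avoid floats for monetary values
}
```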
Finally, plan for schema evolution. Blockchains are upgraded, and new token standards emerge. Your schema should be extensible to accommodate new event types or entity relationships without breaking existing queries. Use techniques like adding optional fields, implementing interface types (e.g., a Token interface for ERC-20 and ERC-721), or using a meta-schema table to track custom event decodings. A robust, thoughtfully designed schema reduces technical debt and ensures your indexing service can scale with the ecosystem it serves.
Optimizing Query Performance
A guide to configuring and scaling your blockchain indexing service for low-latency, high-throughput queries.
A high-performance blockchain indexing service is built on a data pipeline that efficiently ingests, transforms, and serves data. The core components are the indexer, which processes raw blockchain data, and the query engine, which serves the indexed data via an API like GraphQL. Performance bottlenecks typically occur at the data ingestion layer, during complex joins in the database, or in the API response serialization. The goal is to minimize the time between a new block being mined and it being queryable, known as indexing latency, while ensuring query latency for end-users remains under 100ms for common requests.
Database selection and schema design are foundational. For complex, relational on-chain data, a traditional SQL database like PostgreSQL is often preferred for its powerful JOIN operations and ACID compliance. Use indexes strategically on foreign keys and frequently filtered columns (e.g., block_number, sender_address, timestamp). For time-series data like token transfers or gas prices, consider specialized databases like TimescaleDB. Avoid overly normalized schemas that require excessive joins for common queries; a partially denormalized schema can dramatically speed up read operations at the cost of some storage.
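A migration sketch applying these ideas follows; the table, column, and index names are illustrative:

```typescript
// Migration sketch: a transfers table with indexes on the columns that are
// filtered most often.
import { Pool } from "pg";

export async function migrate(pool: Pool): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS transfers (
      id              TEXT PRIMARY KEY,          -- txHash-logIndex
      block_number    BIGINT NOT NULL,
      sender_address  TEXT NOT NULL,
      recipient       TEXT NOT NULL,
      value           NUMERIC(78, 0) NOT NULL,   -- wide enough for uint256
      block_timestamp TIMESTAMPTZ NOT NULL
    );
    CREATE INDEX IF NOT EXISTS transfers_block_number_idx ON transfers (block_number);
    CREATE INDEX IF NOT EXISTS transfers_sender_idx ON transfers (sender_address);
    -- Composite index for the common "activity for an address over time" query.
    CREATE INDEX IF NOT EXISTS transfers_sender_time_idx
      ON transfers (sender_address, block_timestamp DESC);
  `);
}
```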
Optimize the indexing process itself. Process blocks in batches rather than individually to reduce database write overhead. Implement real-time listeners for new blocks while running a separate historical sync to backfill data. Use a message queue (e.g., RabbitMQ, Apache Kafka) to decouple block ingestion from data processing, allowing you to scale workers horizontally. For EVM chains, leverage eth_getLogs filters with block ranges to fetch event logs efficiently, and parse transaction receipts only for transactions relevant to your tracked contracts.
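One way to realize the queue-based decoupling is a small producer/consumer pair; the sketch below uses kafkajs, with the broker address, topic, and consumer group id as assumptions:

```typescript
// Decouple block ingestion from processing with a message queue (kafkajs shown).
import { Kafka } from "kafkajs";

const kafka = new Kafka({ clientId: "indexer", brokers: ["localhost:9092"] });
const producer = kafka.producer();

// Producer side: the block listener publishes each new block number.
export async function publishBlock(blockNumber: bigint): Promise<void> {
  await producer.send({
    topic: "blocks-to-index",
    messages: [{ key: blockNumber.toString(), value: blockNumber.toString() }],
  });
}

// Consumer side: workers in one consumer group share partitions, so adding
// instances scales processing horizontally without duplicating work.
export async function runWorker(): Promise<void> {
  const consumer = kafka.consumer({ groupId: "block-processors" });
  await consumer.connect();
  await consumer.subscribe({ topics: ["blocks-to-index"] });
  await consumer.run({
    eachMessage: async ({ message }) => {
      const blockNumber = BigInt(message.value!.toString());
      // ...fetch receipts/logs for blockNumber, decode, and batch-write to the DB...
    },
  });
}

export async function start(): Promise<void> {
  await producer.connect();
  await runWorker();
}
```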
At the query layer, use a GraphQL gateway like Apollo Server or Hasura, which can automatically optimize queries, cache responses, and merge data from multiple sources. Implement query complexity limits and depth limiting to prevent overly expensive requests from degrading service for all users. Use DataLoader, a batching and caching utility, to solve the N+1 query problem common in GraphQL, where a request for a list of items triggers separate database calls for each item's related data.
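A minimal DataLoader sketch (the accounts table and field names are illustrative) shows how per-item lookups collapse into one SQL round trip:

```typescript
// Collapse per-item lookups into one SQL round trip with DataLoader.
// Resolvers call accountLoader.load(address); keys issued in the same tick are
// batched into a single query.
import DataLoader from "dataloader";
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export const accountLoader = new DataLoader<string, { address: string; balance: string } | null>(
  async (addresses) => {
    const { rows } = await pool.query(
      "SELECT address, balance FROM accounts WHERE address = ANY($1)",
      [addresses]
    );
    const byAddress = new Map(rows.map((r) => [r.address, r] as [string, any]));
    // DataLoader requires results in the same order as the requested keys.
    return addresses.map((a) => byAddress.get(a) ?? null);
  }
);
```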
Caching is critical at multiple levels. Use an in-memory cache like Redis for frequently accessed data, such as the latest block number or metadata for popular tokens. Implement database connection pooling to avoid the overhead of establishing new connections for each query. For publicly accessible APIs, consider a CDN for caching static or semi-static query results. Monitor performance with metrics for p95 query latency, indexing lag, and database connection pool usage using tools like Prometheus and Grafana.
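A sketch of that first cache layer with ioredis follows; the key name and five-second TTL are arbitrary choices:

```typescript
// Cache the latest indexed block number in Redis so status endpoints do not
// hit PostgreSQL on every request.
import Redis from "ioredis";
import { Pool } from "pg";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function getLatestIndexedBlock(): Promise<number> {
  const cached = await redis.get("indexer:latest_block");
  if (cached !== null) return Number(cached);

  const { rows } = await pool.query("SELECT MAX(number) AS latest FROM indexed_blocks");
  const latest = Number(rows[0].latest ?? 0);

  // Expire after 5 seconds: stale enough to absorb bursts, fresh enough for dashboards.
  await redis.set("indexer:latest_block", latest.toString(), "EX", 5);
  return latest;
}
```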
Finally, load test your service using tools like k6 or Locust to simulate real-world traffic. Test under load with complex queries joining multiple entities, such as "all swaps for a given token pair in the last 24 hours, with USD volume." Use the results to identify slow queries, which you can then optimize by adding composite database indexes or creating materialized views that pre-compute expensive aggregations. Regularly re-index and vacuum your database to maintain performance as the dataset grows into the terabyte range.
Resources and Tools
Tools and frameworks for setting up a blockchain indexing and query service. These resources cover hosted APIs, decentralized indexers, and custom data pipelines used in production Web3 applications.
Custom Indexing Stack: ETL + Database
A custom indexing stack gives full control over data models, performance, and cost, at the expense of higher operational complexity. This approach is common for analytics-heavy or compliance-focused applications.
Typical architecture:
- RPC provider (self-hosted or managed) for raw blockchain access
- ETL jobs written in Python, TypeScript, or Rust to process blocks
- PostgreSQL, ClickHouse, or BigQuery for storage
- Optional message queues (Kafka, SQS) for scalability
Key design decisions:
- Event-based vs block-based indexing
- Handling chain reorganizations and finality
- Backfilling historical data efficiently (see the checkpointed backfill sketch at the end of this section)
Advantages:
- Complete flexibility over schemas and queries
- Easier integration with traditional data warehouses
Disadvantages:
- Requires ongoing maintenance
- More engineering time upfront
This model is best for teams with data engineering expertise and long-term indexing requirements.
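To make the backfilling decision above concrete, here is a minimal resumable backfill sketch. The backfill_checkpoint table, chunk size, and fetchAndStore helper are assumptions; fetchAndStore could reuse the log extraction shown earlier in this guide.

```typescript
// Resumable backfill sketch: progress is checkpointed in the database so the
// job can stop and restart without re-processing ranges.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const CHUNK = 2_000; // blocks per request; tune to your RPC provider's limits

export async function backfill(startBlock: number, endBlock: number): Promise<void> {
  // Resume from the last committed checkpoint, or from the requested start.
  const { rows } = await pool.query("SELECT last_block FROM backfill_checkpoint WHERE id = 1");
  let from = rows.length > 0 ? Number(rows[0].last_block) + 1 : startBlock;

  while (from <= endBlock) {
    const to = Math.min(from + CHUNK - 1, endBlock);
    await fetchAndStore(from, to);
    await pool.query(
      `INSERT INTO backfill_checkpoint (id, last_block) VALUES (1, $1)
       ON CONFLICT (id) DO UPDATE SET last_block = EXCLUDED.last_block`,
      [to]
    );
    from = to + 1;
  }
}

async function fetchAndStore(fromBlock: number, toBlock: number): Promise<void> {
  // Placeholder: fetch logs/blocks for the range, decode them, and batch-insert.
}
```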
Frequently Asked Questions
Common questions and troubleshooting for developers setting up and operating blockchain indexing and query services.
What is a blockchain indexer?
A blockchain indexer is a specialized service that processes raw blockchain data into a structured, queryable format. It listens for new blocks and transactions, extracts relevant data (like token transfers or event logs), and stores it in a database optimized for fast reads (e.g., PostgreSQL).
How is an indexer different from an RPC node?
An RPC node (like Geth or Erigon) provides direct, low-level access to the blockchain state. It can fetch raw data on-demand but is not optimized for complex historical queries or aggregations. The key difference is purpose: an RPC node serves the current state and simple lookups, while an indexer transforms historical data for analytical queries. For example, answering "Show all NFT mints for this collection in the last month" is efficient for an indexer but extremely slow for a standard RPC node.
Conclusion and Next Steps
You have successfully set up a foundational blockchain indexing and query service. This guide covered the core components: ingesting data, processing it, and exposing it via an API.
Your indexing pipeline is now operational, but this is just the beginning. To move from a proof-of-concept to a production-ready service, you must focus on robustness and scalability. Implement comprehensive error handling for RPC node disconnections and chain reorganizations. Add monitoring with tools like Prometheus and Grafana to track ingestion lag, database performance, and API latency. For high-throughput chains, consider sharding your database or using a columnar data store like ClickHouse for analytical queries.
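A small sketch of the ingestion-lag metric using the prom-client library follows; the metric name, port, and the two helper functions are assumptions:

```typescript
// Expose an indexing-lag gauge for Prometheus to scrape.
import http from "node:http";
import client from "prom-client";

const lagGauge = new client.Gauge({
  name: "indexer_head_lag_blocks",
  help: "Blocks between the chain head and the last indexed block",
});

async function chainHead(): Promise<bigint> {
  return 0n; // placeholder: latest block number from the RPC client
}

async function lastIndexedBlock(): Promise<bigint> {
  return 0n; // placeholder: MAX(number) from the indexed_blocks table
}

async function updateLag(): Promise<void> {
  const head = await chainHead();
  const indexed = await lastIndexedBlock();
  lagGauge.set(Number(head - indexed));
}

setInterval(() => updateLag().catch(console.error), 10_000);

// Prometheus scrapes this endpoint; alert when the lag stays high.
http
  .createServer(async (_req, res) => {
    res.setHeader("Content-Type", client.register.contentType);
    res.end(await client.register.metrics());
  })
  .listen(9464);
```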
The next logical step is to expand your data model. Start by indexing more complex on-chain entities. For example, instead of just tracking token transfers, build tables for DeFi positions (like Uniswap V3 liquidity), NFT attributes, or governance proposal states. Use The Graph's subgraph manifest as a reference for entity definitions. You can also integrate off-chain data, such as price feeds from Chainlink or Pyth oracles, to enrich your on-chain context, enabling more powerful analytics.
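As a sketch of the oracle-enrichment idea, the snippet below reads a Chainlink AggregatorV3 feed with viem; the feed address is a placeholder you would replace with the pair you need from Chainlink's documentation:

```typescript
// Read a Chainlink AggregatorV3 price feed to attach USD values to indexed rows.
// FEED_ADDRESS is a placeholder; look up the correct feed before using this.
import { createPublicClient, http, parseAbi } from "viem";
import { mainnet } from "viem/chains";

const client = createPublicClient({ chain: mainnet, transport: http(process.env.RPC_URL) });

const aggregatorAbi = parseAbi([
  "function latestRoundData() view returns (uint80 roundId, int256 answer, uint256 startedAt, uint256 updatedAt, uint80 answeredInRound)",
  "function decimals() view returns (uint8)",
]);

const FEED_ADDRESS = "0x0000000000000000000000000000000000000000"; // placeholder

export async function fetchUsdPrice(): Promise<number> {
  const [roundData, decimals] = await Promise.all([
    client.readContract({ address: FEED_ADDRESS, abi: aggregatorAbi, functionName: "latestRoundData" }),
    client.readContract({ address: FEED_ADDRESS, abi: aggregatorAbi, functionName: "decimals" }),
  ]);
  const answer = roundData[1]; // price scaled by 10^decimals
  return Number(answer) / 10 ** decimals;
}
```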
Finally, optimize your query service for real-world use. Implement query batching and caching strategies using Redis to reduce database load. Secure your GraphQL endpoint with rate limiting and authentication, especially if you plan to offer a public API. To stay current, set up a process to handle protocol upgrades and hard forks automatically. Continuously test your indexer against alternative services like Covalent, Alchemy's Enhanced APIs, or The Graph to validate data accuracy and completeness.