Setting Up a Blockchain Indexer and Data Pipeline
A practical guide to building a system that transforms raw blockchain data into structured, queryable information for applications.
A blockchain indexer is a specialized service that listens to a blockchain network, processes its raw data—blocks, transactions, logs—and organizes it into a structured database. This process, known as indexing, is essential because querying data directly from a node via RPC calls is slow and inefficient for complex queries. An indexer acts as a middle layer, enabling fast, complex queries like "show all NFT transfers for this wallet" or "calculate total DEX volume in the last 24 hours." Popular indexing solutions include The Graph for decentralized subgraphs, Covalent for unified APIs, and custom-built solutions using tools like Subsquid or TrueBlocks.
The core architecture of a data pipeline involves several key components. First, an ingestion layer connects to blockchain nodes via RPC (e.g., using ethers.js or web3.py) to fetch new blocks. This data is then passed to a processing layer, where smart contract events are decoded using Application Binary Interfaces (ABIs) and business logic is applied to transform the data. Finally, the processed data is stored in a persistence layer, typically a relational database like PostgreSQL or a time-series database. This separation of concerns ensures scalability and maintainability.
To build a basic indexer for Ethereum, you can start with a script using the ethers library. The following example listens for new blocks and extracts transaction details:
```javascript
const { ethers } = require('ethers');

const provider = new ethers.providers.JsonRpcProvider('YOUR_RPC_URL');

provider.on('block', async (blockNumber) => {
  const block = await provider.getBlockWithTransactions(blockNumber);
  block.transactions.forEach((tx) => {
    console.log(`Tx Hash: ${tx.hash}, From: ${tx.from}, Value: ${ethers.utils.formatEther(tx.value)} ETH`);
    // Process and save to your database here
  });
});
```
This simple pipeline can be extended to decode event logs by providing the contract ABI.
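For instance, the sketch below extends the block listener with ethers v5's Interface to decode standard ERC-20 Transfer logs. The RPC URL is a placeholder, and the hand-off to your processing layer is left as a comment.

```typescript
import { ethers } from 'ethers';

const provider = new ethers.providers.JsonRpcProvider('YOUR_RPC_URL');

// Human-readable ABI fragment for the standard ERC-20 Transfer event.
const erc20Abi = ['event Transfer(address indexed from, address indexed to, uint256 value)'];
const iface = new ethers.utils.Interface(erc20Abi);

provider.on('block', async (blockNumber: number) => {
  // Fetch all logs in this block whose first topic matches the Transfer signature.
  const logs = await provider.getLogs({
    fromBlock: blockNumber,
    toBlock: blockNumber,
    topics: [iface.getEventTopic('Transfer')],
  });

  for (const log of logs) {
    try {
      const parsed = iface.parseLog(log);
      console.log(
        `Transfer of ${parsed.args.value.toString()} from ${parsed.args.from} ` +
          `to ${parsed.args.to} (token ${log.address})`
      );
      // Hand the decoded event to your processing layer here.
    } catch {
      // Some tokens (e.g., ERC-721) reuse the Transfer topic with different
      // indexing; logs that don't match the ERC-20 fragment are skipped.
    }
  }
});
```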
Handling chain reorganizations (reorgs) and ensuring data integrity are critical challenges. A reorg occurs when the canonical chain changes, invalidating previously processed blocks. A robust indexer must track block finality—often waiting for a certain number of confirmations—and have a mechanism to roll back and re-process data from a reorged block. Using a database transaction to store indexed data allows for atomic updates and rollbacks. For production systems, consider using a dedicated indexing framework that handles these complexities, such as Subsquid, which provides a processor abstraction with built-in state management and reorg handling.
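As a rough illustration of the confirmation-depth approach (not Subsquid's own mechanism), the sketch below only indexes blocks a fixed number of confirmations behind the head and re-checks that the last indexed block is still canonical before advancing. The persistence helpers are hypothetical stand-ins for real database writes wrapped in transactions.

```typescript
import { ethers } from 'ethers';

const CONFIRMATIONS = 12; // assumed finality depth; tune per chain
const provider = new ethers.providers.JsonRpcProvider('YOUR_RPC_URL');

// Hypothetical persistence hooks; a real indexer would run these inside DB transactions.
async function rollbackFromBlock(fromNumber: number): Promise<void> {
  console.log(`reorg detected: rolling back data from block ${fromNumber}`);
}
async function saveBlock(block: { number: number; transactions: unknown[] }): Promise<void> {
  console.log(`indexed block ${block.number} (${block.transactions.length} txs)`);
}

// Index only blocks at least CONFIRMATIONS deep, and verify that the last block
// we indexed is still on the canonical chain before moving forward.
async function indexSafeBlocks(lastIndexed: { number: number; hash: string }) {
  const head = await provider.getBlockNumber();
  const safeHead = head - CONFIRMATIONS;

  const canonical = await provider.getBlock(lastIndexed.number);
  if (canonical.hash !== lastIndexed.hash) {
    await rollbackFromBlock(lastIndexed.number);
    return; // resume from the rolled-back height on the next pass
  }

  for (let n = lastIndexed.number + 1; n <= safeHead; n++) {
    const block = await provider.getBlockWithTransactions(n);
    await saveBlock(block);
  }
}
```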
Optimizing your data pipeline is necessary for performance. Key strategies include batch processing to reduce database write operations, implementing data pruning policies for old or irrelevant data, and using materialized views for expensive aggregate queries. For high-throughput chains, you may need to shard your database or use a horizontally scalable store. The end goal is to provide a reliable API or data endpoint that serves your application's needs with low latency, turning the chaotic stream of blockchain data into a powerful asset for analytics, dashboards, and smart contract interactions.
Prerequisites and System Requirements
A robust blockchain indexer requires a specific technical stack. This guide details the essential hardware, software, and knowledge needed before you start building.
Building a blockchain indexer is a resource-intensive process. The core requirement is a full node for the target blockchain, such as an Ethereum execution client (Geth, Nethermind) or a Bitcoin Core node. This node must be fully synced, which can require significant storage—often 1-2 TB for mainnets. You'll need a machine with at least 16 GB of RAM, a multi-core CPU, and fast SSD storage to handle the I/O load of continuous block processing and RPC queries. For production systems, consider using cloud instances with scalable compute and block storage.
Your software environment must support high-throughput data processing. A modern Linux distribution (Ubuntu 22.04 LTS, Debian 11) is standard. You'll need docker and docker-compose for containerized services, git for version control, and a package manager like apt. The core of your pipeline will be written in a language like Go, TypeScript (with Node.js), or Rust, chosen for performance and existing blockchain libraries. Essential tools include a database client (e.g., psql for PostgreSQL) and a process manager like pm2 or systemd for running services.
Key conceptual knowledge is required to design an effective pipeline. You must understand blockchain data structures: blocks, transactions, logs, and receipts. For EVM chains, familiarity with event logs and ABI encoding is critical. You should be comfortable with database design for time-series and relational data (PostgreSQL, TimescaleDB), message queues (Apache Kafka, RabbitMQ) for decoupling ingestion, and basic DevOps for monitoring and deployment. Experience with the target chain's RPC methods (eth_getBlockByNumber, eth_getLogs) is non-negotiable.
Set up your development environment by first installing the core dependencies. Clone a reference indexer repository, such as the TrueBlocks core or a Subsquid template, to examine a working structure. Configure your .env file with database connection strings and RPC endpoints. Use a testnet RPC URL (e.g., from Alchemy or Infura) for initial development to avoid syncing a full node immediately. Run docker-compose up to spin up your database and any auxiliary services, ensuring all containers communicate on a shared network.
Before writing any indexing logic, validate your RPC connection. Use curl or a library like ethers.js to call eth_blockNumber. Test database connectivity by creating a simple schema with tables for blocks and transactions. A minimal proof-of-concept should ingest the last 100 blocks, parse transactions, and store them. This test confirms your stack is operational and helps you gauge performance. Document any bottlenecks in data fetching or writing at this early stage.
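A minimal proof-of-concept along these lines might look like the following ethers v5 script, which connects to a placeholder RPC endpoint, ingests the latest 100 blocks, and reports rough throughput; the database insert is left as a comment.

```typescript
import { ethers } from 'ethers';

// 'YOUR_RPC_URL' is a placeholder testnet endpoint (e.g., from Alchemy or Infura).
const provider = new ethers.providers.JsonRpcProvider('YOUR_RPC_URL');

async function proofOfConcept() {
  const head = await provider.getBlockNumber(); // equivalent to eth_blockNumber
  console.log(`connected, chain head at block ${head}`);

  const started = Date.now();
  let txCount = 0;

  for (let n = head - 99; n <= head; n++) {
    const block = await provider.getBlockWithTransactions(n);
    txCount += block.transactions.length;
    // In a real run, insert the block and its transactions into your schema here.
  }

  const seconds = (Date.now() - started) / 1000;
  console.log(`ingested 100 blocks, ${txCount} txs in ${seconds.toFixed(1)}s`);
}

proofOfConcept().catch(console.error);
```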
For a production-ready pipeline, plan for resilience and observability. Implement retry logic with exponential backoff for RPC calls. Use a dedicated logging service (Loki, ELK stack) and metrics collection (Prometheus, Grafana) to monitor block processing lag, database health, and error rates. Establish an alerting system for chain reorgs or RPC failures. Your requirements document should specify backup strategies, disaster recovery procedures, and a plan for handling chain-specific edge cases, like Ethereum's uncle blocks or Solana's forks.
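One way to implement the retry logic is a small wrapper like the sketch below; the attempt count and delays are assumptions to tune against your provider's rate limits.

```typescript
// Generic retry wrapper with exponential backoff and jitter for flaky RPC calls.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100;
      console.warn(`RPC call failed (attempt ${attempt}), retrying in ${Math.round(delay)}ms`);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: wrap any provider call, e.g.
// const block = await withRetry(() => provider.getBlock(blockNumber));
```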
Indexing Solution Comparison
Key technical and operational differences between popular blockchain indexing approaches.
| Feature | The Graph (Subgraphs) | Covalent | Self-Hosted (e.g., Subsquid) |
|---|---|---|---|
| Data Freshness | < 1 sec | ~15 sec | User-defined |
| Query Cost Model | GRT-based query fees | Pay-as-you-go API credits | Infrastructure & dev time |
| Data Schema Control | Subgraph manifest | Unified API schema | Full control |
| Multi-Chain Support | 40+ chains via Substreams | 200+ chains | Chain-specific setup required |
| Historical Data Access | From deployment block | Full history | From synced block |
| Decentralization | Decentralized network | Centralized service | Centralized by user |
| Development Overhead | Medium (learn subgraphs) | Low (use API) | High (build pipeline) |
| Smart Contract Event Support | | | |
| Raw Transaction Decoding | | | |
Pipeline Architecture Overview
A robust data pipeline is the backbone of any blockchain indexer, transforming raw on-chain data into structured, queryable information. This overview details the core components and data flow.
A blockchain indexer pipeline is a multi-stage system designed to ingest, process, and serve decentralized ledger data. It begins with a data ingestion layer that connects to blockchain nodes via RPC (Remote Procedure Call) to capture raw block data, logs, and transaction receipts. This layer must be resilient to node failures and handle chain reorganizations. For Ethereum-based chains, tools like ethers.js, web3.py, or direct geth/erigon RPC calls are commonly used. The primary challenge here is maintaining synchronization with the chain tip while managing the volume of data, which can exceed several gigabytes per day for active networks.
Once ingested, raw data moves to the transformation and enrichment layer. This is where the core indexing logic resides. Smart contract events are decoded using Application Binary Interfaces (ABIs), transactions are parsed for specific function calls, and related data is joined. For example, a Uniswap swap event is decoded, and the involved token amounts and pool addresses are extracted and formatted. This stage often employs a message queue (like Apache Kafka or RabbitMQ) or a streaming framework (like Apache Flink) to handle data flow, allowing for parallel processing and fault tolerance. Data is typically written to an intermediate storage, such as PostgreSQL or a time-series database.
The final stage is the serving and query layer, which provides an interface for applications to access the indexed data. This often involves populating an optimized database like PostgreSQL with well-defined schemas, or exposing a GraphQL endpoint (using Hasura or The Graph) for flexible queries. Performance is critical; common optimizations include database indexing on frequently queried fields (like block_number, address), materialized views for complex aggregations, and caching layers (Redis) for popular queries. The architecture must support low-latency reads for dApp frontends and batch analytics for researchers, making the separation of read and write workloads a key design consideration.
Step 1: Ingesting Data from an RPC Node
The first step in building a blockchain indexer is establishing a reliable connection to the source of truth: a blockchain node. This guide covers how to connect to an RPC endpoint, fetch raw block data, and handle the initial parsing.
A blockchain indexer's primary function is to transform raw, sequential block data into a structured, queryable database. This process begins by connecting to a Remote Procedure Call (RPC) node. You can use a public provider like Alchemy, Infura, or QuickNode, or run your own node software such as Geth for Ethereum or Erigon for higher performance. The RPC interface, typically using JSON-RPC over HTTP or WebSocket, provides methods like eth_getBlockByNumber and eth_getLogs to retrieve block headers, transactions, and event logs.
For robust data ingestion, you must implement a block polling or subscription mechanism. A simple approach uses a loop to call eth_blockNumber to get the latest block, then processes blocks sequentially from your last indexed height. For real-time indexing, a WebSocket subscription to newHeads events is more efficient. Your ingestion service must handle reorgs (chain reorganizations) by verifying the parent hash of each new block and having logic to roll back and re-index if a deeper reorg is detected. Error handling for rate limits and node disconnections is critical.
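The polling variant could look roughly like this sketch, which walks blocks sequentially and flags a reorg when a block's parentHash does not link to the previously processed block. The poll interval and the one-block rewind are simplified assumptions; a production indexer would keep walking back until it finds a stored ancestor and roll back the database accordingly.

```typescript
import { ethers } from 'ethers';

const provider = new ethers.providers.JsonRpcProvider('YOUR_RPC_URL');
const POLL_INTERVAL_MS = 4_000; // assumed; roughly one Ethereum slot

async function pollLoop(startBlock: number) {
  let nextNumber = startBlock;
  let previousHash: string | null = null;

  while (true) {
    const head = await provider.getBlockNumber();

    while (nextNumber <= head) {
      const block = await provider.getBlock(nextNumber);

      // Parent-hash check: if the chain of hashes breaks, we crossed a reorg.
      if (previousHash !== null && block.parentHash !== previousHash) {
        console.warn(`reorg at block ${nextNumber}; rewinding`);
        nextNumber -= 1; // simplified: also roll back stored data in a real indexer
        previousHash = null;
        break;
      }

      console.log(`processing block ${block.number} (${block.transactions.length} txs)`);
      previousHash = block.hash;
      nextNumber += 1;
    }

    await new Promise((resolve) => setTimeout(resolve, POLL_INTERVAL_MS));
  }
}
```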
The raw JSON-RPC response for a block contains hex-encoded data. Your initial parsing layer should decode this into native types (integers, addresses, byte arrays). For an Ethereum block, key fields to extract include: number, hash, parentHash, timestamp, the list of transactions, and logsBloom. Each transaction object contains from, to, input data, value, and gas fields. This parsed data forms the base layer for all subsequent transformation and enrichment steps in your indexing pipeline.
Step 2: Processing and Transforming Data
After ingesting raw blockchain data, you must process and transform it into a structured, queryable format. This step defines your data model and business logic.
The core of your indexer is the transformation logic that converts raw logs and transactions into application-specific data. This is where you define your data schema and implement the business logic that populates it. For Ethereum-based chains, you'll write handlers for specific event signatures and function calls. A common pattern is to use a state machine or a set of reducer functions that process blocks sequentially, updating an internal state representation (like user balances or pool reserves) with each new transaction.
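As a toy illustration of the reducer pattern, the sketch below folds decoded ERC-20 Transfer events into an in-memory balance map, block by block. The event shape is an assumption, and a real indexer would persist snapshots rather than keep state only in memory.

```typescript
// Decoded event shape assumed for illustration.
interface TransferEvent {
  blockNumber: number;
  from: string;
  to: string;
  value: bigint;
}

type BalanceState = Map<string, bigint>;

// Reducer: apply one Transfer to the running balance state.
function applyTransfer(state: BalanceState, ev: TransferEvent): BalanceState {
  const fromBal = state.get(ev.from) ?? 0n;
  const toBal = state.get(ev.to) ?? 0n;
  state.set(ev.from, fromBal - ev.value);
  state.set(ev.to, toBal + ev.value);
  return state;
}

// Process one block's worth of decoded events in order.
function processBlockEvents(state: BalanceState, events: TransferEvent[]): BalanceState {
  return events.reduce(applyTransfer, state);
}
```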
Efficient data modeling is critical. You must decide between a normalized schema (separate tables for entities and relationships) and a denormalized schema (fewer joins, faster reads). For example, indexing Uniswap V3 requires tracking Pool, Swap, Mint, and Burn entities. Your transformation code parses the Swap event, calculates derived fields like price impact, and writes records to corresponding database tables. Using an idempotent design ensures re-processing blocks from a fork doesn't corrupt your dataset.
Implement robust error handling and data validation. Your pipeline should log and quarantine malformed data instead of failing silently. Use schema validation libraries like Pydantic (Python) or Zod (TypeScript) to enforce data types and constraints before insertion. For performance, batch database writes and consider using an in-memory cache (like Redis) for frequently accessed state, such as the latest block number processed or current token prices, to avoid redundant on-chain calls.
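For example, a Zod schema for decoded swap records might look like the following sketch; the field names and the quarantine behavior are illustrative assumptions.

```typescript
import { z } from 'zod';

// Validate decoded swap records before they reach the database; anything that
// fails validation is quarantined for inspection instead of crashing the pipeline.
const SwapRecord = z.object({
  txHash: z.string().regex(/^0x[0-9a-fA-F]{64}$/),
  pool: z.string().regex(/^0x[0-9a-fA-F]{40}$/),
  amount0: z.string(), // raw int256 values kept as strings to avoid precision loss
  amount1: z.string(),
  blockNumber: z.number().int().nonnegative(),
});

type SwapRecordT = z.infer<typeof SwapRecord>;

function validateSwap(raw: unknown): SwapRecordT | null {
  const result = SwapRecord.safeParse(raw);
  if (!result.success) {
    console.warn('quarantining malformed swap record', result.error.issues);
    return null; // hypothetical: write to a dead-letter table instead of dropping
  }
  return result.data;
}
```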
Here's a simplified Python example using the web3.py library to process a Uniswap V3 Swap event:
```python
from web3 import Web3
import json

# RPC_URL, pool_address, store_swap_event, and log_error are assumed to be
# defined elsewhere in your configuration and persistence code.
w3 = Web3(Web3.HTTPProvider(RPC_URL))

# ABI fragment for the Uniswap V3 Swap event
swap_abi = json.loads(
    '[{"anonymous":false,"inputs":[{"indexed":true,"name":"sender","type":"address"},'
    '{"indexed":true,"name":"recipient","type":"address"},'
    '{"indexed":false,"name":"amount0","type":"int256"},'
    '{"indexed":false,"name":"amount1","type":"int256"},'
    '{"indexed":false,"name":"sqrtPriceX96","type":"uint160"},'
    '{"indexed":false,"name":"liquidity","type":"uint128"},'
    '{"indexed":false,"name":"tick","type":"int24"}],"name":"Swap","type":"event"}]'
)
event_contract = w3.eth.contract(address=pool_address, abi=swap_abi)

# Process a block
def process_block(block_number):
    block = w3.eth.get_block(block_number, full_transactions=True)
    for tx in block.transactions:
        receipt = w3.eth.get_transaction_receipt(tx.hash)
        for log in receipt.logs:
            if log.address != pool_address:
                continue  # skip logs emitted by other contracts
            try:
                event = event_contract.events.Swap().process_log(log)
                # Transform and store event data
                store_swap_event(event["args"])
            except Exception:
                # Log from the pool that doesn't match the Swap ABI
                log_error(log)
```
After transformation, you have structured data ready for querying. The next step is designing the serving layer—the APIs or databases that expose this data to your application. Common choices include PostgreSQL for relational data, TimescaleDB for time-series metrics, or GraphQL interfaces for flexible client queries. The quality of your processed data directly determines the reliability and performance of your entire application.
Step 3: Storing Data in PostgreSQL
This section details how to structure your database and write the code to persistently store processed blockchain data, moving it from memory to a durable, queryable system.
After extracting and transforming blockchain data, you need a reliable storage layer. PostgreSQL is a robust, open-source relational database ideal for this task due to its ACID compliance, JSON support, and powerful querying capabilities. For an indexer, you'll typically create tables that mirror on-chain entities. Common tables include blocks (storing block number, hash, and timestamp), transactions (with hash, from/to addresses, value, and gas data), logs (for smart contract events), and token_transfers. This structured schema allows for complex joins and efficient analysis that raw JSON storage cannot easily provide.
Define your schema using SQL migrations for version control and reproducibility. A basic blocks table might look like:
```sql
CREATE TABLE blocks (
    id SERIAL PRIMARY KEY,
    number BIGINT UNIQUE NOT NULL,
    hash VARCHAR(66) UNIQUE NOT NULL,
    timestamp TIMESTAMP NOT NULL,
    parent_hash VARCHAR(66)
);
```
Use tools like node-postgres for Node.js or an ORM like Prisma or Drizzle to manage connections and queries. Your indexer's processing loop will now include a final step: taking the normalized data object for each block and performing an INSERT or UPSERT operation into the corresponding tables, ensuring data integrity with transactions.
Optimize for performance and reliability. Use bulk inserts to reduce database round-trips when processing multiple blocks or events. Implement connection pooling to handle concurrent queries efficiently. Add database indexes on frequently queried columns like block_number, transaction_hash, and address to speed up read operations, which are critical for your API layer. Always wrap related inserts (e.g., a block and all its transactions) in a single database transaction to maintain consistency—if one part fails, the entire block's data can be rolled back, preventing partial saves.
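A sketch of this pattern with node-postgres is shown below. It assumes the blocks table from the migration above plus a companion transactions table, with column names chosen for illustration.

```typescript
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Persist a block and its transactions atomically: if any insert fails, the
// whole block is rolled back. Table and column names are illustrative.
async function saveBlock(block: {
  number: number;
  hash: string;
  parentHash: string;
  timestamp: number;
  transactions: { hash: string; from: string; to: string | null; value: string }[];
}) {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    await client.query(
      `INSERT INTO blocks (number, hash, parent_hash, timestamp)
       VALUES ($1, $2, $3, to_timestamp($4))
       ON CONFLICT (number) DO NOTHING`,
      [block.number, block.hash, block.parentHash, block.timestamp]
    );

    for (const tx of block.transactions) {
      await client.query(
        `INSERT INTO transactions (hash, block_number, from_address, to_address, value)
         VALUES ($1, $2, $3, $4, $5)
         ON CONFLICT (hash) DO NOTHING`,
        [tx.hash, block.number, tx.from, tx.to, tx.value]
      );
    }

    await client.query('COMMIT');
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```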
Step 4: Building a Query API
Transform your indexed blockchain data into a structured, accessible API for applications to consume.
With raw data stored in your database, the next step is to expose it through a dedicated Query API. This layer sits between your application and the database, providing a secure, performant, and developer-friendly interface. A well-designed API abstracts the complexity of the underlying data schema, allowing frontend dApps, analytics dashboards, and other services to request specific datasets—like a user's NFT holdings, recent DEX trades, or protocol TVL—without writing SQL. Common patterns include GraphQL for flexible queries or a REST API for simpler endpoints. The choice often depends on your data's relational complexity and client needs.
GraphQL is particularly powerful for blockchain data due to its ability to fetch nested, related data in a single request. For example, a query for a Wallet could simultaneously request its transactions, and within each transaction, the tokenTransfers. This efficiency prevents the "over-fetching" common in REST. Setting up a GraphQL server with a library like Apollo Server or GraphQL Yoga involves defining a schema that maps your PostgreSQL tables to GraphQL types and resolvers. Resolvers are functions that contain the logic to fetch the requested data from your database, often using an ORM like Prisma or TypeORM to build the queries safely.
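A minimal Apollo Server 4 setup along these lines might look like the sketch below, wiring resolvers directly to node-postgres; the GraphQL types and column aliases are assumptions based on the Step 3 schema rather than a fixed convention. The nested Block.transactions resolver is what lets clients fetch a block and its transactions in one request.

```typescript
import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Minimal schema mapping the blocks/transactions tables to GraphQL types.
const typeDefs = `#graphql
  type Transaction {
    hash: String!
    fromAddress: String!
    toAddress: String
    value: String!
  }

  type Block {
    number: Int!
    hash: String!
    timestamp: String!
    transactions: [Transaction!]!
  }

  type Query {
    block(number: Int!): Block
  }
`;

const resolvers = {
  Query: {
    block: async (_: unknown, { number }: { number: number }) => {
      const { rows } = await pool.query(
        'SELECT number::int AS number, hash, timestamp::text AS timestamp FROM blocks WHERE number = $1',
        [number]
      );
      return rows[0] ?? null;
    },
  },
  Block: {
    // Nested resolver: fetch a block's transactions only when the client asks for them.
    transactions: async (block: { number: number }) => {
      const { rows } = await pool.query(
        'SELECT hash, from_address AS "fromAddress", to_address AS "toAddress", value FROM transactions WHERE block_number = $1',
        [block.number]
      );
      return rows;
    },
  },
};

const server = new ApolloServer({ typeDefs, resolvers });
startStandaloneServer(server, { listen: { port: 4000 } }).then(({ url }) => {
  console.log(`GraphQL API ready at ${url}`);
});
```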
Implementing the API requires careful attention to performance and security. Key optimizations include:
- Query Complexity Limits: Prevent overly complex GraphQL queries that could overload the database.
- Pagination: Use cursor-based or offset pagination for large datasets (e.g., listing all transactions); a cursor-based sketch follows this list.
- Caching: Implement a caching layer (e.g., Redis) for frequently accessed, immutable data like historical token prices or past blocks.
- Rate Limiting: Protect your API from abuse by limiting requests per API key or IP address.

Security is paramount: always sanitize inputs and use parameterized queries to prevent SQL injection, even when using an ORM.
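As referenced above, here is one possible cursor-based pagination helper over the transactions table from Step 3; the composite (block_number, hash) cursor is an assumption that keeps pages stable while new blocks arrive.

```typescript
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Cursor-based pagination: the cursor is the last (block_number, hash) pair the
// client saw, so page boundaries stay stable while new rows arrive at the head.
async function listTransactions(
  limit = 50,
  cursor?: { blockNumber: number; hash: string }
) {
  const { rows } = cursor
    ? await pool.query(
        `SELECT * FROM transactions
         WHERE (block_number, hash) < ($1, $2)
         ORDER BY block_number DESC, hash DESC
         LIMIT $3`,
        [cursor.blockNumber, cursor.hash, limit]
      )
    : await pool.query(
        `SELECT * FROM transactions
         ORDER BY block_number DESC, hash DESC
         LIMIT $1`,
        [limit]
      );

  const last = rows[rows.length - 1];
  const nextCursor = last ? { blockNumber: last.block_number, hash: last.hash } : null;
  return { rows, nextCursor };
}
```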
A practical implementation involves creating specific endpoints or GraphQL queries for common application needs. For a DeFi dashboard, you might build a /protocols/{id}/metrics endpoint that returns Total Value Locked (TVL), user counts, and fee revenue over time by aggregating your deposit_events and withdrawal_events tables. For an NFT marketplace, a GraphQL query could fetch a user's collection, including metadata and current listings from other tables. Your API should return data in a clean, standardized JSON format, making it easy for developers to integrate.
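A REST variant of the metrics endpoint could be sketched with Express as below; the deposit_events/withdrawal_events columns (timestamp, amount, protocol_id) are hypothetical and would need to match your actual schema.

```typescript
import express from 'express';
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const app = express();

// Daily net flow for one protocol, aggregated from deposit and withdrawal events.
app.get('/protocols/:id/metrics', async (req, res) => {
  try {
    const { rows } = await pool.query(
      `SELECT date_trunc('day', timestamp) AS day,
              SUM(amount) AS net_flow
       FROM (
         SELECT timestamp, amount FROM deposit_events WHERE protocol_id = $1
         UNION ALL
         SELECT timestamp, -amount FROM withdrawal_events WHERE protocol_id = $1
       ) flows
       GROUP BY 1
       ORDER BY 1`,
      [req.params.id]
    );
    res.json({ protocolId: req.params.id, metrics: rows });
  } catch (err) {
    res.status(500).json({ error: 'failed to load metrics' });
  }
});

app.listen(3000, () => console.log('metrics API listening on :3000'));
```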
Finally, document your API thoroughly. Use tools like Swagger/OpenAPI for REST or GraphQL's built-in introspection to auto-generate documentation. Include examples of requests and responses, authentication methods (like using an API key header), and error codes. A clear README or hosted docs page (using Postman or GraphQL Playground) drastically improves adoption. This completes the data pipeline: from raw blockchain events to a reliable, queryable service that powers user-facing applications with real-time, historical on-chain intelligence.
Essential Resources and Tools
These resources cover the core components required to build a production-grade blockchain indexer and data pipeline, from raw node access to transformation, storage, and query layers.
Frequently Asked Questions
Common technical questions and solutions for developers building and maintaining blockchain indexers and data pipelines.
What is a blockchain indexer and how does it work?
A blockchain indexer is a specialized service that processes raw blockchain data into a structured, queryable format. It works by subscribing to new blocks, extracting and decoding data from transactions and smart contract logs, and storing it in a database optimized for fast queries.
Core components include:
- Ingestion Layer: Connects to node RPC endpoints (e.g., Erigon, Geth) to fetch blocks.
- Processing Engine: Decodes ABI-encoded data, handles event logs, and normalizes data.
- Storage Database: Typically a time-series (PostgreSQL/TimescaleDB) or columnar database for efficient analytics.
- API Layer: Serves the processed data via GraphQL or REST.
Unlike a standard node, an indexer transforms on-chain state into relational models, enabling complex queries like "show me all NFT transfers for this wallet in the last month" that would be inefficient or impossible via direct RPC calls.