introduction
DEVELOPER GUIDE

How to Build a Custom Blockchain Indexer from Scratch

A practical guide to designing, implementing, and deploying a custom indexer to query on-chain data efficiently, bypassing the limitations of general-purpose solutions.

A blockchain indexer is a specialized service that transforms raw, sequential blockchain data into a structured, queryable database. While nodes provide direct access to blocks and transactions, they are inefficient for complex queries like "show all NFT transfers for this wallet" or "calculate total DEX volume last week." General indexers like The Graph offer a hosted solution, but building your own provides full control over data schema, real-time processing logic, and cost optimization for specific application needs. This guide walks through the core architecture and implementation steps.

The indexing pipeline follows an ETL (Extract, Transform, Load) pattern. First, you extract data by connecting to a node RPC endpoint (e.g., using Ethers.js or Viem) and listening for new blocks. For each block, you fetch transactions, receipts, and logs. The critical step is filtering; you should only decode events and calls relevant to your target contracts to avoid unnecessary computation. For example, if indexing Uniswap V3, you would filter for logs from the factory and pool addresses.
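
As a minimal sketch of the extract-and-filter step (assuming ethers.js v6, an RPC_URL environment variable, and a placeholder target address; swap in the factory or pool contracts you actually care about):

javascript
const { ethers } = require('ethers');

const provider = new ethers.JsonRpcProvider(process.env.RPC_URL);
const TARGET_ADDRESS = '0x...'; // placeholder: the contract you want to index

// Extract only the logs emitted by the target contract over a block range
async function extractLogs(fromBlock, toBlock) {
  return provider.getLogs({
    address: TARGET_ADDRESS,
    fromBlock,
    toBlock,
  });
}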

Next, you transform the raw data into your application's domain model. This involves parsing complex event logs and transaction inputs using the contract ABI. For a DeFi indexer, you might transform a Swap event into a structured record with fields for amountIn, amountOut, sender, and poolAddress. This is also where you handle data enrichment, such as fetching token decimals or calculating derived USD values using price oracles.
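
For illustration, here is a hedged sketch that decodes a Uniswap V2-style Swap log into such a record using ethers.js; the ABI fragment and field mapping are assumptions, so adapt them to the contracts you index:

javascript
const { ethers } = require('ethers');

// Assumed ABI fragment for a Uniswap V2-style Swap event
const iface = new ethers.Interface([
  'event Swap(address indexed sender, uint amount0In, uint amount1In, uint amount0Out, uint amount1Out, address indexed to)'
]);

// Transform a raw log into the application's domain model
function transformSwapLog(log) {
  const parsed = iface.parseLog({ topics: log.topics, data: log.data });
  if (!parsed) return null; // not a Swap event from this ABI
  const { sender, amount0In, amount1In, amount0Out, amount1Out } = parsed.args;
  return {
    poolAddress: log.address,
    sender,
    amountIn: (amount0In + amount1In).toString(),   // one side is zero in a V2 swap
    amountOut: (amount0Out + amount1Out).toString(),
    blockNumber: log.blockNumber,
    txHash: log.transactionHash,
  };
}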

Finally, you load the transformed data into a persistent database. The choice of database is crucial for query performance. PostgreSQL is a robust, relational choice for complex joins, while TimescaleDB (a PostgreSQL extension) excels with time-series metrics. For massive-scale, append-only data, Apache Cassandra or ScyllaDB offer high write throughput. Schema design should mirror your query patterns; create indexes on frequently filtered columns like user_address, block_timestamp, or token_id.

To ensure reliability, your indexer must handle chain reorganizations (reorgs). A simple approach is to index blocks with a confirmation delay (e.g., 10 blocks). When a reorg is detected via the RPC, you must invalidate and re-index data from the fork point. Implement robust error handling and idempotent operations so reprocessing the same block does not create duplicate records. For production, run the indexer as a background service, monitored with logging and alerting for stalled syncs.

Deploying the indexer involves setting up the service, database, and connection pools. Use a process manager like PM2 or containerize with Docker. For Ethereum mainnet, consider using a dedicated node provider (Alchemy, Infura) with archive access. The final step is exposing the data via an API, typically a GraphQL or REST server that translates application queries into efficient database calls, completing your custom data infrastructure.

prerequisites
FOUNDATION

Prerequisites and Setup

Before writing a single line of code for your blockchain indexer, you need the right tools and a clear architectural plan. This section covers the essential software, libraries, and initial project structure required to build a robust, scalable indexer from the ground up.

The core of a custom indexer is a reliable connection to a blockchain node. You will need access to a JSON-RPC endpoint for the network you intend to index, such as Ethereum, Polygon, or Arbitrum. For development, you can run a local node (e.g., Geth, Erigon) or use a service like Alchemy, Infura, or QuickNode. A local node offers the most control and data access but requires significant storage and synchronization time. For most projects, starting with a managed RPC provider is the fastest path to development.

Your primary programming environment will be Node.js (version 18 or later) or Python (3.9+). For this guide, we'll use Node.js due to its extensive Web3 ecosystem. You'll also need a package manager (npm or yarn) and a code editor like VS Code. The essential libraries include a Web3 client (ethers.js v6 or web3.js v4) for interacting with the blockchain, a database driver (we'll use pg for PostgreSQL), and a task runner or scheduler like Bull or node-cron for managing indexing jobs.

Project Structure

Start by initializing a new project and creating a logical directory structure. A typical setup includes folders for src/scripts (indexing logic), src/models (database schemas), src/utils (helper functions), and config (environment variables). Use dotenv to manage your RPC URL and database credentials securely. Your first script should test the RPC connection by fetching the latest block number using const blockNumber = await provider.getBlockNumber(). This verifies your foundational setup is working before diving into complex event listening and data processing.
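
A minimal connectivity check might look like this, assuming ethers.js v6 and an RPC_URL entry in your .env file (the file path is just a suggestion):

javascript
// src/scripts/check-connection.js
require('dotenv').config();
const { ethers } = require('ethers');

async function main() {
  const provider = new ethers.JsonRpcProvider(process.env.RPC_URL);
  // Fetch the latest block number to confirm the endpoint is reachable
  const blockNumber = await provider.getBlockNumber();
  console.log(`Connected. Latest block: ${blockNumber}`);
}

main().catch((err) => {
  console.error('RPC connection failed:', err);
  process.exit(1);
});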

key-concepts
FOUNDATIONAL KNOWLEDGE

Core Concepts for Indexer Development

Building a custom indexer requires understanding core architectural patterns, data sources, and performance tradeoffs. These concepts form the foundation for reliable, scalable blockchain data infrastructure.

04

Handling Chain Reorganizations

Blockchain reorgs, where canonical chains change, can corrupt an indexer's data. Mitigation strategies include:

  • Tracking Finality: Only process blocks beyond a finality depth (e.g., 64 blocks for Ethereum).
  • Reorg-Aware Processing: Design idempotent pipelines and maintain a block cursor table to track processed heights and hashes (see the sketch after this list).
  • Rollback Logic: Implement a mechanism to invalidate and re-process data from the reorg point, often using event sourcing patterns.
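
A minimal sketch of that cursor, assuming a PostgreSQL block_cursor table with height as its primary key and the pg driver introduced in the prerequisites:

javascript
const { Pool } = require('pg');
const pool = new Pool(); // connection details come from PG* environment variables

const FINALITY_DEPTH = 64; // roughly two epochs on Ethereum mainnet

// Record a block in the cursor only once it is FINALITY_DEPTH behind the chain tip
async function advanceCursor(block, chainTipHeight) {
  if (chainTipHeight - block.number < FINALITY_DEPTH) return false;
  await pool.query(
    `INSERT INTO block_cursor (height, hash) VALUES ($1, $2)
     ON CONFLICT (height) DO UPDATE SET hash = EXCLUDED.hash`,
    [block.number, block.hash]
  );
  return true;
}
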
05

Performance & Scaling

As chain activity grows, indexers must scale horizontally.

  • Parallel Processing: Shard indexing by block range or contract address (sketched after this list).
  • Database Optimization: Use connection pooling, read replicas, and materialized views for common queries.
  • Memory Management: Cache frequently accessed data like contract ABIs or token metadata.
  • Metrics & Monitoring: Instrument with Prometheus to track blocks/sec, database latency, and error rates.
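
A rough sketch of block-range sharding for historical backfills; processBlock stands in for your existing per-block handler, and concurrency should stay within your RPC provider's rate limits:

javascript
// Split a historical range into chunks and let a small worker pool drain them
async function backfillRange(startBlock, endBlock, chunkSize, concurrency, processBlock) {
  const queue = [];
  for (let from = startBlock; from <= endBlock; from += chunkSize) {
    queue.push([from, Math.min(from + chunkSize - 1, endBlock)]);
  }
  const workers = Array.from({ length: concurrency }, async () => {
    while (queue.length > 0) {
      const [from, to] = queue.shift();
      for (let n = from; n <= to; n++) {
        await processBlock(n); // blocks within a chunk are processed in order
      }
    }
  });
  await Promise.all(workers);
}
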
06

Testing & Validation

Ensuring data accuracy is non-negotiable. Implement a robust testing suite:

  • Unit Tests: Mock RPC calls to test parsing logic for specific event signatures (one synthetic-log example is sketched after this list).
  • Integration Tests: Run a local testnet (e.g., Hardhat, Anvil) and index it, comparing results against known state.
  • Data Integrity Checks: Run periodic sanity checks, like verifying that the sum of all DEX swap volumes matches an external API's reported total.
  • Differential Testing: Compare your indexer's output with a trusted reference like a block explorer's API.
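
For instance, a parsing unit test can round-trip a synthetic Transfer log without touching an RPC node at all; this sketch uses Node's built-in test runner and ethers.js, and the decode step stands in for whatever helper your indexer uses:

javascript
// test/parse.test.js -- run with: node --test
const test = require('node:test');
const assert = require('node:assert');
const { ethers } = require('ethers');

const iface = new ethers.Interface([
  'event Transfer(address indexed from, address indexed to, uint256 value)'
]);

test('decodes a synthetic Transfer log', () => {
  const from = '0x1111111111111111111111111111111111111111';
  const to = '0x2222222222222222222222222222222222222222';
  // Encode a fake log exactly as a node would return it, then decode it back
  const { data, topics } = iface.encodeEventLog(iface.getEvent('Transfer'), [from, to, 1000n]);
  const parsed = iface.parseLog({ data, topics });
  assert.strictEqual(parsed.args.from, ethers.getAddress(from));
  assert.strictEqual(parsed.args.to, ethers.getAddress(to));
  assert.strictEqual(parsed.args.value, 1000n);
});
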
architecture-overview
ARCHITECTURE GUIDE

Architecture Overview

How the core components of a custom indexer fit together: block ingestion, data processing, storage, and the query API.

Unlike a standard node that simply syncs blocks, an indexer listens for new blocks and transactions, extracts relevant data based on predefined logic (for example, specific smart contract events or transaction patterns), and stores the processed data in an optimized format like PostgreSQL or Elasticsearch. This architecture is essential for applications requiring fast, complex queries, such as DeFi dashboards, NFT explorers, or on-chain analytics platforms, where direct RPC calls to a node would be prohibitively slow.

The core architecture consists of several key components. First, a block ingestion layer connects to one or more blockchain nodes via RPC (e.g., using ethers.js for Ethereum) to fetch new blocks. A data processing worker then decodes transactions and logs, applying your custom logic to filter and transform the data. This processed data is written to a persistent datastore. Finally, a query API (often a GraphQL or REST server) exposes the indexed data to your application. For resilience, you'll need robust error handling, idempotent processing to handle re-orgs, and a system to track the last indexed block.

Start by defining your data schema based on what you need to query. For an NFT marketplace indexer, you might create tables for transfers, mints, and sales. Use a lightweight framework like Subsquid or The Graph's Subgraph for a declarative approach, or build directly with a library like ethers.js or viem for full control. Your ingestion service should poll for new blocks or subscribe to events. Crucially, process blocks idempotently: always check the block hash against your database to handle chain reorganizations gracefully, rolling back and re-indexing if necessary.

For the datastore, PostgreSQL with its JSONB support is a robust choice for structured event data. Implement batching for database writes to improve performance. The query layer should use database indexes on frequently queried fields like block_number, from_address, or token_id. For production, consider separating components into microservices: a dedicated fetcher, a pool of processors, and the API. This allows you to scale each part independently. Monitoring is critical; track metrics like blocks processed per second, processing latency, and database query performance.
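
A sketch of batched writes with the pg driver; the transfers table and its unique (tx_hash, log_index) constraint are illustrative:

javascript
const { Pool } = require('pg');
const pool = new Pool();

// Insert many decoded rows in a single statement instead of one INSERT per event
async function insertTransfers(rows) {
  if (rows.length === 0) return;
  const values = [];
  const placeholders = rows.map((r, i) => {
    values.push(r.blockNumber, r.txHash, r.logIndex, r.from, r.to, r.value);
    const o = i * 6;
    return `($${o + 1}, $${o + 2}, $${o + 3}, $${o + 4}, $${o + 5}, $${o + 6})`;
  });
  await pool.query(
    `INSERT INTO transfers (block_number, tx_hash, log_index, from_address, to_address, value)
     VALUES ${placeholders.join(', ')}
     ON CONFLICT (tx_hash, log_index) DO NOTHING`, // idempotent on re-processing
    values
  );
}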

Testing your indexer involves both unit tests for data transformation logic and integration tests against a local testnet (like Hardhat or Anvil). Use a seed of historical blocks to verify correct data extraction. Remember, maintaining a custom indexer requires ongoing operational overhead for node reliability, database management, and schema migrations. However, the payoff is unparalleled query speed and flexibility for your specific use case, freeing your application from the limitations of generic chain APIs.

step-connect-rpc
FOUNDATION

Step 1: Connecting to an RPC Node and Subscribing to Blocks

The first step in building a custom blockchain indexer is establishing a real-time connection to the network. This requires interacting with a node's RPC endpoint to subscribe to new blocks.

A blockchain indexer's primary data source is a node running the network's client software, such as Geth for Ethereum or Erigon for Ethereum and Polygon. These nodes expose a JSON-RPC API, a stateless interface that allows external applications to query chain data and subscribe to events. To begin, you need access to a node's RPC endpoint. Options include running your own node (most control over data and access), using a managed service like Alchemy or Infura, or connecting to a public endpoint (least reliable for production). The connection is typically made via WebSocket for real-time subscriptions or HTTP for one-off queries.

For real-time block processing, a WebSocket connection is essential. Unlike HTTP, WebSocket maintains a persistent, full-duplex connection, allowing the node to push new block headers to your application as soon as they are added to the chain. The core subscription method is eth_subscribe with the parameter "newHeads". Once subscribed, your client will receive a stream of block header objects containing critical metadata: number, hash, parentHash, timestamp, and the transactions root. This stream is the heartbeat of your indexer, triggering the data extraction logic for each new block.

Implementing this requires a WebSocket client library. In Node.js, you can use the ws library. The code establishes a connection, sends the subscription request, and listens for messages. Error handling for connection drops and subscription IDs is crucial for resilience. Here is a basic implementation outline:

javascript
const WebSocket = require('ws');
const ws = new WebSocket('wss://eth-mainnet.g.alchemy.com/v2/YOUR_KEY');

ws.on('open', () => {
  // Subscribe to new block headers
  ws.send(JSON.stringify({
    jsonrpc: '2.0',
    id: 1,
    method: 'eth_subscribe',
    params: ['newHeads']
  }));
});

ws.on('message', (data) => {
  const message = JSON.parse(data);
  if (message.id === 1) {
    // Subscription confirmation: result is the subscription ID
    console.log('Subscribed:', message.result);
  } else if (message.method === 'eth_subscription') {
    // Incoming block header pushed by the node
    const header = message.params.result;
    console.log('New block:', parseInt(header.number, 16));
  }
});

ws.on('error', (err) => console.error('WebSocket error:', err));

With the block header stream active, your indexer has a trigger but lacks the detailed data. The header alone does not contain transactions or logs. To index these, you must perform subsequent RPC calls for each new block number. The key methods are eth_getBlockByNumber (to get the full block with transaction hashes) and eth_getTransactionReceipt for each transaction hash to obtain logs. This pattern of subscribe-then-fetch is fundamental. Be mindful of RPC rate limits and implement queuing or batching to avoid being throttled by your node provider when fetching large amounts of data.
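
A hedged sketch of that subscribe-then-fetch loop with ethers.js v6 (a WebSocketProvider wraps the raw eth_subscribe call shown above; the per-receipt fetching here is naive, so add batching or a queue before pointing it at mainnet):

javascript
const { ethers } = require('ethers');

const provider = new ethers.WebSocketProvider('wss://eth-mainnet.g.alchemy.com/v2/YOUR_KEY');

provider.on('block', async (blockNumber) => {
  // Fetch the full block; block.transactions is the list of transaction hashes
  const block = await provider.getBlock(blockNumber);

  for (const txHash of block.transactions) {
    // Each receipt carries the logs emitted by that transaction
    const receipt = await provider.getTransactionReceipt(txHash);
    // TODO: filter receipt.logs for your target contracts, then transform and load
  }
});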

The choice of RPC provider significantly impacts performance and cost. For high-throughput chains, the speed of block propagation and the reliability of the WebSocket connection are critical. Providers often offer specialized "archive" nodes that retain full historical state, which is necessary for indexing past events. For this initial live-syncing step, a standard node suffices. The goal is to create a stable pipeline where every new block header reliably initiates your data processing workflow, forming the foundation for the extraction and transformation steps that follow.

step-parse-events
DATA EXTRACTION

Step 2: Parsing Transactions and Event Logs

After establishing a connection to an Ethereum node, the next step is to extract and decode the raw data from blocks. This involves parsing transaction receipts and their event logs, which contain the actionable state changes of smart contracts.

A block contains two primary data structures you need to parse: transactions and transaction receipts. The transaction object (eth_getTransactionByHash) holds the call data sent to a contract, like a function signature and arguments. The receipt (eth_getTransactionReceipt) is generated after execution and contains the outcome, including gas used, status (success/failure), and most importantly, the event logs. These logs are the immutable record of a contract's state changes, emitted via the LOG opcode.

Event logs are stored as topics and data. The first topic is always the keccak256 hash of the event signature (e.g., Transfer(address,address,uint256)). Subsequent topics are for indexed event parameters. Non-indexed parameters are stored in the data field, which requires ABI decoding. To parse this, you need the contract's Application Binary Interface (ABI). For a standard ERC-20 Transfer event, you would decode the log topics to get the from and to addresses (indexed) and the data field to get the value.

Here is a simplified Python example using Web3.py to fetch and decode logs for a known contract:

python
from web3 import Web3

w3 = Web3(Web3.HTTPProvider('YOUR_RPC_URL'))

# ERC-20 Transfer event ABI fragment
transfer_abi = {'anonymous': False, 'inputs': [{'indexed': True, 'name': 'from', 'type': 'address'}, {'indexed': True, 'name': 'to', 'type': 'address'}, {'indexed': False, 'name': 'value', 'type': 'uint256'}], 'name': 'Transfer', 'type': 'event'}
contract = w3.eth.contract(address='0x...', abi=[transfer_abi])

# keccak256 of the event signature, i.e. topic0 of every matching log
transfer_topic = w3.keccak(text='Transfer(address,address,uint256)')

# Get the receipt for a specific transaction
receipt = w3.eth.get_transaction_receipt('0xTxHash')

# Decode only the logs that match the Transfer signature
for log in receipt['logs']:
    if log['topics'] and log['topics'][0] == transfer_topic:
        event = contract.events.Transfer().process_log(log)
        print(f"Transfer: {event['args']['from']} -> {event['args']['to']} {event['args']['value']}")

Efficient parsing requires handling reorgs (chain reorganizations) and uncle blocks. A transaction in a block that gets orphaned will not have a receipt in the canonical chain. Your indexer must be able to discard or re-process data from orphaned blocks. Furthermore, logs can be anonymous (without a signature in the first topic) and contracts can use low-level log opcodes (LOG0-LOG4) directly, which may not conform to a standard ABI. For these cases, you must implement custom decoding logic based on the contract's bytecode analysis.

The output of this parsing stage is structured data ready for transformation. You should store the decoded event parameters, the block number, transaction hash, and log index. The log index is crucial for preserving the exact order of events within a transaction. This normalized data forms the foundation for building your indexed database, enabling fast queries for specific event types, addresses, or transaction histories that would be prohibitively slow to compute directly from an RPC node.

step-handle-forks
INDEXER RESILIENCE

Step 3: Handling Chain Reorganizations (Forks)

Chain reorganizations, where the canonical chain changes, are a fundamental challenge for any blockchain indexer. This step explains how to detect and handle forks to maintain data consistency.

A chain reorganization (reorg) occurs when a new, longer chain with more cumulative proof-of-work (or a heavier chain in proof-of-stake) replaces the previously accepted canonical chain. For your indexer, this means that blocks and transactions you have already processed may become invalid. Handling reorgs is critical for applications that require finality guarantees, such as tracking token balances or NFT ownership. The depth of a reorg—how many blocks are replaced—varies by chain; Ethereum mainnet reorgs are typically 1-2 blocks, while other chains may experience deeper forks.

To manage reorgs, your indexer must implement a rollback mechanism. This involves maintaining a data structure, often a simple list or stack, that tracks processed blocks in memory or a temporary cache before committing them to your primary database. When you receive a new block header, you must check its parent hash against your current chain tip. If it doesn't match, you have detected a fork. Your logic must then traverse backwards from your current tip to find the common ancestor—the last block shared by both chains.

Once the common ancestor is identified, you must revert all changes derived from the orphaned blocks. For a UTXO-based chain like Bitcoin, this means restoring spent outputs. For an account-based chain like Ethereum, you must revert state changes (balances, nonces, contract storage) and event logs. After the rollback, you then re-process the new chain of blocks from the common ancestor forward. Your database operations must be wrapped in transactions to ensure this atomic rollback and replay can occur without corrupting your dataset.

Here is a simplified pseudocode outline for a reorg handler in a polling-based indexer:

python
current_tip = get_latest_indexed_block()
new_block = fetch_block_from_rpc(height=current_tip.height + 1)

if new_block.parent_hash != current_tip.hash:
    # Fork detected
    old_chain, new_chain = find_fork_chains(current_tip, new_block)
    common_ancestor = find_common_ancestor(old_chain, new_chain)
    
    # Roll back every block after the common ancestor (the ancestor itself stays)
    for block_to_revert in reversed(old_chain[common_ancestor.height + 1:]):
        revert_block_operations(block_to_revert)
    
    # Re-process new chain
    for block_to_process in new_chain[common_ancestor.height + 1:]:
        process_block(block_to_process)
else:
    # Normal linear progression
    process_block(new_block)

For production systems, prefer a WebSocket subscription to new block headers (eth_subscribe with "newHeads") over constant polling; after a reorg, the node streams headers for the replacement chain, so comparing each header's parentHash against your stored tip detects the fork almost immediately. Additionally, design your data schema to support efficient reversion: append-only logs or versioned records can simplify this process. Testing is crucial: simulate reorgs on a testnet or a local development chain (such as Anvil or Ganache with evm_snapshot, evm_mine, and evm_revert) to validate your indexer's resilience.

step-database-schema
DATABASE DESIGN

Step 4: Designing a Query-Optimized Database Schema

A well-designed database schema is the engine of your indexer, determining query speed, scalability, and the complexity of your data models. This step focuses on translating blockchain data into efficient relational structures.

The primary goal is to structure raw blockchain data—blocks, transactions, logs, and traces—into tables optimized for your application's specific queries. Avoid the trap of creating a monolithic table that mirrors block JSON; this leads to expensive joins and slow aggregations. Instead, practice normalization to reduce data redundancy and denormalization where necessary for read performance. For an Ethereum indexer, core tables typically include blocks, transactions, logs (for events), and traces. Each table should have a primary key (like block_hash or a composite key) and indexes on foreign keys and frequently filtered columns.
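
As a starting point, a minimal pair of core tables might look like the following (column names and types are illustrative, following the data-type recommendations later in this section):

javascript
// migrations/001_core_tables.js -- illustrative schema, run via your migration tool of choice
const { Pool } = require('pg');

async function migrate() {
  const pool = new Pool();
  await pool.query(`
    CREATE TABLE IF NOT EXISTS blocks (
      hash            VARCHAR(66) PRIMARY KEY,
      number          BIGINT NOT NULL,
      block_timestamp TIMESTAMPTZ NOT NULL,
      is_canonical    BOOLEAN NOT NULL DEFAULT TRUE
    );
    CREATE TABLE IF NOT EXISTS logs (
      block_hash   VARCHAR(66) NOT NULL REFERENCES blocks(hash),
      block_number BIGINT NOT NULL,
      tx_hash      VARCHAR(66) NOT NULL,
      log_index    INTEGER NOT NULL,
      address      VARCHAR(42) NOT NULL,
      topic0       VARCHAR(66),
      data         BYTEA,
      PRIMARY KEY (tx_hash, log_index)
    );
  `);
  await pool.end();
}

migrate().catch(console.error);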

Indexing strategy is critical. While primary keys are indexed automatically, you must add secondary indexes on columns used in WHERE, JOIN, and ORDER BY clauses. For example, indexing transactions.from_address and logs.address is essential. However, over-indexing slows down writes, as each new block requires updating every index. Use partial indexes for common filters (e.g., WHERE topic0 matches the Transfer event's signature hash) and composite indexes for multi-column queries. For time-series analysis, consider partitioning large tables like logs by block_number to improve query performance on historical ranges.
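
Building on the logs table sketched above, a few illustrative index statements cover the secondary, partial, and composite cases (the long hex constant is the keccak256 hash of Transfer(address,address,uint256)):

javascript
// migrations/002_indexes.js -- illustrative index strategy for the logs table above
const { Pool } = require('pg');

async function migrate() {
  const pool = new Pool();
  await pool.query(`
    -- secondary index on a frequently filtered column
    CREATE INDEX IF NOT EXISTS idx_logs_address ON logs (address);
    -- partial index covering only ERC-20 Transfer events
    CREATE INDEX IF NOT EXISTS idx_logs_transfers ON logs (address, block_number)
      WHERE topic0 = '0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef';
    -- composite index for queries that filter by contract and event signature together
    CREATE INDEX IF NOT EXISTS idx_logs_address_topic0 ON logs (address, topic0);
  `);
  await pool.end();
}

migrate().catch(console.error);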

Design for complex queries from the start. If your dapp needs to quickly fetch "all NFT transfers for a specific collection," your logs table should efficiently filter by address (the contract) and the first topic (topic0). Storing decoded event data in separate columns, like from_address and to_address for a Transfer, avoids the need to parse raw log data on every query. This denormalization trades storage space for massive read-speed gains. Tools like pg_stat_statements in PostgreSQL can help identify slow queries post-deployment for further optimization.

Implement data integrity with foreign key constraints (e.g., log.transaction_hash references transactions.hash) to ensure relational consistency. Use appropriate data types: NUMERIC or DECIMAL for precise token amounts, VARCHAR for addresses and hashes, and BYTEA for raw binary data. Plan for chain reorganizations by including an is_canonical boolean flag on blocks and cascading updates or soft deletes. Your schema is not static; it will evolve. Use migration tools (like Flyway or Alembic) to manage schema changes without downtime as you add support for new contracts or data points.

COMPARISON

Indexer Performance Optimization Strategies

Trade-offs between different architectural and operational approaches for high-throughput blockchain indexing.

| Optimization Strategy | Batch Processing | Real-time Streaming | Hybrid Approach |
|---|---|---|---|
| Latency (Block to Index) | 5-30 seconds | < 1 second | 1-5 seconds |
| Throughput (TPS Handled) | 50,000 TPS | 10,000 - 20,000 TPS | 30,000 - 50,000 TPS |
| Database Write Load | High, bursty | Consistent, moderate | Moderate, smoothed |
| Memory/CPU Overhead | Low | Very High | Medium |
| Fault Tolerance | High (replay batches) | Medium (checkpointing) | High (hybrid recovery) |
| Implementation Complexity | Low | Very High | High |
| Best For | Analytics, historical queries | DeFi dashboards, alerts | General-purpose applications |

CUSTOM BLOCKCHAIN INDEXER

Common Issues and Troubleshooting

Building a custom indexer introduces unique technical challenges. This guide addresses frequent developer pain points, from data consistency to performance bottlenecks, with practical solutions.

Why is my indexer falling behind the chain tip? This is often caused by block processing bottlenecks or RPC node instability. Common culprits include:

  • Slow RPC Node: Public nodes can be rate-limited or slow. Use a dedicated node provider like Alchemy, Infura, or run your own archival node.
  • Inefficient Processing Logic: Processing blocks sequentially is slow. Implement parallel processing for independent transactions. For Ethereum, use eth_getBlockByNumber with true for full transaction objects to batch data fetching.
  • Database Write Speed: Writing each event individually is inefficient. Use batch inserts and ensure your database (e.g., PostgreSQL, TimescaleDB) is properly indexed on columns like block_number and address.

Monitoring: Implement logging for processed block heights and compare with the chain's latest block via eth_blockNumber.
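
A small lag check along these lines (assuming ethers.js and the block_cursor table sketched earlier) can feed that logging or alerting:

javascript
const { ethers } = require('ethers');
const { Pool } = require('pg');

const provider = new ethers.JsonRpcProvider(process.env.RPC_URL);
const pool = new Pool();

// Compare the chain tip with the last indexed height and warn when the gap grows
async function checkSyncLag(maxLagBlocks = 20) {
  const chainTip = await provider.getBlockNumber();
  const { rows } = await pool.query('SELECT MAX(height) AS height FROM block_cursor');
  const indexedHeight = Number(rows[0].height ?? 0);
  const lag = chainTip - indexedHeight;
  if (lag > maxLagBlocks) {
    console.warn(`Indexer lagging by ${lag} blocks (tip ${chainTip}, indexed ${indexedHeight})`);
  }
  return lag;
}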

conclusion-next-steps
KEY TAKEAWAYS

Conclusion and Next Steps

You have built a foundational blockchain indexer. This guide covered the core architecture, data extraction, and storage. The next steps involve scaling, optimization, and adding advanced features.

Building a custom indexer provides unprecedented control over your data pipeline. You are no longer constrained by the limitations, costs, or data models of third-party services. You can tailor the indexer to index specific contracts, events, or transaction patterns that are critical for your dApp's logic. This is essential for complex DeFi analytics, NFT market intelligence, or on-chain governance tracking where generic APIs fall short.

To scale your indexer, consider these architectural improvements: implement a message queue like RabbitMQ or Kafka to decouple block ingestion from processing, allowing you to scale workers horizontally. Use database connection pooling and batch inserts to handle high write throughput. For historical data, run a separate 'backfill' process with a higher concurrency limit, while the main indexer stays synchronized with the chain tip. Monitor performance with metrics for blocks processed per second and database write latency.

Enhance your indexer with advanced features. Add real-time alerts by publishing parsed events to a WebSocket server for instant frontend updates. Implement data enrichment by fetching token metadata from a provider like the CoinGecko API or resolving ENS names. For complex logic, consider using The Graph's subgraph manifest as a declarative configuration for your indexer, then execute it with your custom runtime to combine ease-of-use with your infrastructure.

Your next project could be building an indexer for a specific use case. For example, create a MEV indexer that tracks sandwich attacks and arbitrage opportunities by analyzing transaction order and gas prices. Or, build a cross-chain indexer that uses LayerZero or Wormhole message logs to create a unified view of asset flows across Ethereum, Arbitrum, and Polygon. These specialized indexers are valuable tools for researchers and traders.

Finally, ensure your indexer is robust and maintainable. Write comprehensive unit and integration tests using a local testnet like Hardhat or Anvil. Implement circuit breakers and alerting for RPC provider failures. Document your data schema and indexing logic thoroughly. The code from this guide is a starting point; production systems require careful attention to error handling, monitoring, and data integrity to provide reliable, actionable on-chain intelligence.