
How to Set Up a Blockchain Data Indexing Service for Developers

A technical guide to creating a queryable data layer for your dApp. Covers self-hosted solutions like The Graph and Subsquid versus managed services, with steps for schema design, mapping logic, and deployment.
DEVELOPER TUTORIAL

How to Set Up a Blockchain Data Indexing Service

A practical guide to building a service that extracts, transforms, and queries on-chain data for applications.

Blockchain data indexing is the process of extracting raw data from a blockchain, transforming it into a structured format, and making it efficiently queryable. Unlike a standard database, blockchains are optimized for consensus and security, not for complex queries. An indexing service solves this by listening to new blocks, parsing transaction logs and internal calls, and storing the decoded data in a relational database like PostgreSQL or a search engine like Elasticsearch. This enables applications to ask questions like "show me all NFT transfers for this wallet in the last month" without scanning the entire chain.

The core architecture involves three main components: a block ingestion service, a data processor, and a query API. The ingestion service, often built using a client library like Ethers.js or Viem, connects to a node RPC endpoint to subscribe to new blocks. The data processor contains the business logic to decode these blocks using Application Binary Interfaces (ABIs) for specific smart contracts. Finally, the query API exposes the normalized data. For Ethereum, you would typically index events like Transfer(address,address,uint256) and track function calls to contracts like Uniswap or Aave.
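The split between these components can be sketched as plain interfaces; the names below are illustrative rather than taken from any particular framework:

typescript
// Illustrative boundaries between the three pipeline components
interface BlockIngestor {
  // Subscribe to new blocks from an RPC endpoint and hand each one to the processor
  start(onBlock: (blockNumber: bigint) => Promise<void>): void;
}

interface DataProcessor {
  // Decode a block's logs and calls using contract ABIs, then persist the results
  processBlock(blockNumber: bigint): Promise<void>;
}

interface QueryApi {
  // Serve normalized data, e.g. all transfers for a wallet in a time window
  transfersForWallet(address: string, sinceUnix: number): Promise<unknown[]>;
}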

Start by setting up a project to listen to the chain. Using Node.js and Viem, you can create a simple poller. The following code snippet connects to a Sepolia testnet RPC and fetches the latest block, logging all transaction hashes. This is the foundation of your ingestion layer.

javascript
import { createPublicClient, http } from 'viem';
import { sepolia } from 'viem/chains';

// Connect to a Sepolia RPC endpoint over HTTP
const client = createPublicClient({
  chain: sepolia,
  transport: http('YOUR_RPC_URL')
});

async function getBlock() {
  // Fetch the latest block, including full transaction objects
  const blockNumber = await client.getBlockNumber();
  const block = await client.getBlock({ blockNumber, includeTransactions: true });
  console.log(`Block #${blockNumber} has ${block.transactions.length} txns`);
  block.transactions.forEach((tx) => console.log(tx.hash));
}

getBlock().catch(console.error);

After capturing raw blocks, you must decode the data. This requires the ABI of the smart contracts you're interested in. For example, to index ERC-20 token transfers, you would parse logs for the Transfer event. The processor should handle log decoding and state reconciliation, ensuring your database reflects the current chain state. A robust service must also handle chain reorganizations (reorgs) by tracking block finality and having a mechanism to revert and re-process orphaned blocks. Using a task queue like BullMQ can help manage the processing pipeline reliably.
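As a sketch of that decoding step, viem can fetch and decode logs that match the standard ERC-20 Transfer signature; the contract address below is a placeholder for the token you want to index:

typescript
import { createPublicClient, http, parseAbiItem } from 'viem';
import { sepolia } from 'viem/chains';

const client = createPublicClient({
  chain: sepolia,
  transport: http('YOUR_RPC_URL')
});

// Standard ERC-20 Transfer event signature
const transferEvent = parseAbiItem(
  'event Transfer(address indexed from, address indexed to, uint256 value)'
);

async function indexTransfers(fromBlock: bigint, toBlock: bigint) {
  // viem decodes each log's topics and data into a typed log.args object
  const logs = await client.getLogs({
    address: '0xYourTokenAddress', // placeholder: the token contract to index
    event: transferEvent,
    fromBlock,
    toBlock
  });

  for (const log of logs) {
    // Persist the decoded transfer (from, to, value) to your database here
    console.log(log.args.from, log.args.to, log.args.value, log.blockNumber);
  }
}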

For production, consider using specialized indexing frameworks to reduce boilerplate. The Graph lets you define subgraphs with GraphQL schemas and mapping scripts. Alternatively, Subsquid offers a more heavyweight ETL (Extract, Transform, Load) framework with TypeScript support and direct database access. These tools manage syncing, reorgs, and scaling, letting you focus on the data transformation logic. The choice between building a custom service and adopting a framework comes down to how much control you need versus how fast you want to ship.

Finally, expose the indexed data through a dedicated API. This could be a REST endpoint, a GraphQL server, or a gRPC service. Ensure your queries are performant by adding database indexes on frequently searched columns like block_number, from_address, and token_id. A complete indexing service turns the blockchain's immutable ledger into a powerful, queryable data layer, enabling everything from analytics dashboards and portfolio trackers to the complex logic of DeFi applications.
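As an illustration of the REST option, here is a minimal sketch using the express and pg packages; the transfers table and its columns are assumptions standing in for your own schema:

typescript
import express from 'express';
import { Pool } from 'pg';

const app = express();
// Assumes a PostgreSQL database that the indexer keeps populated
const db = new Pool({ connectionString: process.env.DATABASE_URL });

// Recent transfers sent by a wallet; relies on a database index on from_address
app.get('/transfers/:address', async (req, res) => {
  const { rows } = await db.query(
    `SELECT tx_hash, from_address, to_address, amount, block_number
       FROM transfers
      WHERE from_address = $1
      ORDER BY block_number DESC
      LIMIT 100`,
    [req.params.address.toLowerCase()]
  );
  res.json(rows);
});

app.listen(3000, () => console.log('Query API listening on :3000'));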

DEVELOPER GUIDE

Prerequisites for Building an Indexer

Essential technical foundations and infrastructure required to build a robust blockchain data indexing service.

Building a blockchain indexer is a complex data engineering task that requires a solid technical foundation. Before writing a single line of indexing logic, you must establish the core infrastructure that will ingest, process, and serve data. This involves making critical architectural decisions about your data pipeline, choosing the right database for your query patterns, and ensuring you have the necessary blockchain node access. The quality of your prerequisites directly determines the indexer's performance, reliability, and scalability.

The first and most critical prerequisite is reliable access to blockchain nodes. You cannot index data you cannot read. For Ethereum, this means running an archive node (like Geth or Erigon) or using a node provider service (such as Alchemy, Infura, or QuickNode). Archive nodes are essential because they provide full historical state, allowing you to replay and index past blocks. For other chains like Solana, you'll need an RPC endpoint with historical data capabilities. Ensure your node connection is stable and can handle the high volume of read requests your indexer will generate.
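Before committing to a provider, you can roughly probe whether an endpoint serves historical state by requesting a balance at an old block; most non-archive nodes reject state reads far behind the chain head. A minimal sketch with viem (the check is a heuristic, not a guarantee):

typescript
import { createPublicClient, http } from 'viem';
import { mainnet } from 'viem/chains';

const client = createPublicClient({
  chain: mainnet,
  transport: http('YOUR_RPC_URL')
});

// Non-archive nodes usually error on state reads this far behind the head
async function hasArchiveState(): Promise<boolean> {
  try {
    await client.getBalance({
      address: '0x0000000000000000000000000000000000000000',
      blockNumber: 1_000_000n
    });
    return true;
  } catch {
    return false;
  }
}

hasArchiveState().then((ok) => console.log('historical state available:', ok));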

Next, you must select a persistent data store optimized for time-series and complex query patterns. While a simple project might start with PostgreSQL, production-grade indexers often require specialized databases. Time-series databases like TimescaleDB (a PostgreSQL extension) are excellent for block and transaction data. For highly relational on-chain data (like DeFi protocols with many entity relationships), a traditional relational database is suitable. Graph databases like Neo4j can model complex relationships, while dedicated OLAP databases like ClickHouse offer blazing-fast analytics on massive datasets.

Your development environment must be configured with the necessary tools and languages. Core programming languages for indexers include TypeScript/JavaScript (using libraries like ethers.js or viem), Python (with Web3.py), Go, or Rust. You'll also need a framework to structure your indexing logic. Popular choices include Substreams (for high-performance streaming), The Graph (for subgraph development), or custom solutions built with application-specific databases. Familiarity with Docker and containerization is crucial for consistent deployment across environments.

Finally, plan for monitoring and observability from day one. Indexers must run 24/7 and process data correctly. Implement logging (using structured JSON logs with tools like Pino or Winston), metrics collection (with Prometheus for tracking blocks processed, latency, and error rates), and alerting (via PagerDuty or Slack integrations). You should also design a mechanism for handling chain reorganizations (reorgs) and data correction to maintain consistency. A well-architected foundation saves immense refactoring effort later.
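A minimal instrumentation sketch using the prom-client package (the metric names here are illustrative):

typescript
import client from 'prom-client';

// Counters the ingestion loop increments as it works (names are illustrative)
const blocksProcessed = new client.Counter({
  name: 'indexer_blocks_processed_total',
  help: 'Total number of blocks ingested and processed'
});
const processingErrors = new client.Counter({
  name: 'indexer_processing_errors_total',
  help: 'Total number of block processing failures'
});

export function recordBlock(): void {
  blocksProcessed.inc();
}

export function recordError(): void {
  processingErrors.inc();
}

// Expose this text on an HTTP /metrics route for Prometheus to scrape
export function metricsText(): Promise<string> {
  return client.register.metrics();
}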

BUILDING BLOCKS

Core Concepts: Schemas, Mappings, and Indexers

Learn the fundamental components for transforming raw blockchain data into structured, queryable information.

A blockchain data indexing service transforms raw, sequential on-chain data into a structured, queryable database. This process involves three core components: the schema, which defines the structure of your output data; the mapping, which contains the logic for processing events and transactions; and the indexer, the runtime engine that executes this logic. Think of the schema as your database's blueprint, the mapping as the set of instructions for building it, and the indexer as the construction crew that does the work. This architecture is central to platforms like The Graph, SubQuery, and Subsquid.

The schema is written in GraphQL Schema Definition Language (SDL) and defines the entities (data models) your indexer will create and store. For example, to index a DEX, you might define Pool, Swap, and LiquidityProvider entities. Each entity has typed fields like id: ID!, token0: String, or totalValueLocked: BigDecimal. This schema dictates the structure of your final GraphQL API, allowing you to query complex relationships, such as all swaps for a specific pool in the last 24 hours. It's the contract between your indexing logic and your application's data needs.

Mappings contain the application logic that processes blockchain data. Written in TypeScript, AssemblyScript, or Rust, a mapping is a set of handler functions that are triggered by specific on-chain events or function calls. For instance, a handleSwap function is called every time a Swap event is emitted on a Uniswap V3 pool. Inside this handler, you decode the event log data, perform any necessary calculations, and then create or update Swap and Pool entities in your store according to the rules defined in your schema. This is where raw data becomes meaningful information.

The indexer is the service that scans the blockchain, listens for new blocks, and executes your mapping functions. It manages the connection to an Ethereum node (or other data source), processes blocks in sequence, and handles the underlying database. Advanced indexers offer features like deterministic indexing (ensuring the same input always produces the same output), real-time updates, and historical data re-indexing. When you deploy your subgraph to The Graph's hosted service or a decentralized network, you are deploying this complete package—schema, mappings, and the indexer runtime—to a node operated by an Indexer.

Setting up a service follows a standard workflow:

  1. Define your schema based on the data your dApp needs.
  2. Write mappings to populate the entities.
  3. Generate TypeScript types from your schema for type-safe development.
  4. Configure a manifest file (subgraph.yaml or project.yaml) that links your data source (contract address, start block), the events to watch, and the handlers to call.
  5. Build and deploy the package to an indexing node.

This process turns the firehose of blockchain data into a precise, efficient API for your application.

ARCHITECTURE

Indexing Solution Comparison: The Graph vs. Subsquid vs. Managed Services

A technical comparison of self-hosted indexing protocols and third-party managed services for developers building data pipelines.

| Feature / Metric | The Graph (Subgraph) | Subsquid (Squid) | Managed Service (e.g., Covalent, Goldsky) |
| --- | --- | --- | --- |
| Core Architecture | Subgraph manifest + GraphQL API | TypeScript processor + GraphQL API | Proprietary API (REST/GraphQL) |
| Hosting Model | Decentralized Network or Self-Hosted | Self-Hosted (Docker) | Fully Managed SaaS |
| Query Language | GraphQL | GraphQL | Varies (GraphQL, REST) |
| Data Source | EVM & non-EVM chains via Substreams | Substrate, EVM, Fuel via Archives | Multi-chain via unified API |
| Development Overhead | High (Define schema, mappings, deploy) | Medium (Write processor, deploy infra) | Low (Call API, no infra) |
| Cost Model | GRT query fees or infra costs | Infrastructure hosting costs | Monthly subscription or pay-per-query |
| Time to First Query | Hours to days (development + sync) | Minutes to hours (processor + sync) | Minutes (API key activation) |
| Custom Logic Flexibility | High (in mappings) | Very High (full TypeScript) | Low (filter existing data) |

FOUNDATION

Step 1: Define Your GraphQL Schema

The GraphQL schema is the core contract of your indexing service, defining the data types and queries your API will expose to developers.

Your GraphQL schema, written in the Schema Definition Language (SDL), acts as the blueprint for your entire indexing service. It explicitly declares the entities (like Token, Transfer, Pool) you will index from the blockchain, the fields available on each entity (e.g., id, amount, timestamp), and the relationships between them. This schema is what developers will query against, so its design directly impacts the usability and performance of your API. Tools like The Graph's subgraph manifest or Subsquid's schema file are used to define this structure.

Start by modeling the core on-chain events you need to track. For an ERC-20 token indexer, your schema might define a Token entity with fields for id (the contract address), name, and symbol, and a Transfer entity linked to it. Crucially, you must define the relationships using GraphQL types: a Token can have a transfers field that is an array of Transfer objects, and a Transfer can have a token field pointing back. This creates a queryable graph of data. Always use ID! for unique identifiers and consider indexing fields with @index directives for faster queries.

Here is a simplified example schema for indexing token transfers:

graphql
type Token @entity {
  id: ID!
  name: String
  symbol: String
  transfers: [Transfer!]! @derivedFrom(field: "token")
}

type Transfer @entity {
  id: ID!
  from: Bytes!
  to: Bytes!
  amount: BigInt!
  timestamp: BigInt! @index
  token: Token!
}

The @derivedFrom directive creates a virtual reverse lookup, optimizing storage. The @index on timestamp enables efficient filtering. Define enums for known values (like transaction status) and use scalar types like BigInt and Bytes for blockchain-native data.

After defining your schema, you must generate type-safe data access classes or TypeScript interfaces. Running the Graph CLI (graph codegen) or Subsquid's squid-typeorm-codegen will create these from your schema automatically. This ensures your mapping code, which processes raw blockchain data, correctly populates the entities you've defined. Schema design is iterative; start with the minimum viable entities and expand as you add support for more contract events or complex DeFi logic.

CORE LOGIC

Step 2: Write Mapping Functions

Mapping functions are the heart of your indexer. They define how raw blockchain event data is transformed and saved to your GraphQL schema.

A mapping function is a handler written in AssemblyScript or TypeScript that executes in response to specific on-chain events or function calls. When a Transfer event is emitted or a swap() function is called on a smart contract you are indexing, the corresponding mapping function you've defined is triggered. Its job is to take the raw, low-level data from the blockchain—like log data and transaction receipts—and convert it into the structured, queryable entities defined in your schema.graphql file. This process is what populates your database with meaningful, typed data.

Your mapping logic is defined in a src/mapping.ts file (or similar). It typically exports functions that correspond to the event handlers declared in your subgraph manifest (subgraph.yaml). For example, if you are indexing ERC-20 transfers, you would have a function like handleTransfer(event: TransferEvent): void. Inside this function, you load or create entities, set their fields using data from event.params, and then save them using entity.save(). This is where you implement the business logic of your indexer, such as calculating derived stats or updating aggregate balances.

A critical pattern is using the ID field to load existing entities. For instance, to update a User entity's balance after a transfer, you would call User.load(event.params.to.toHexString()). If the entity exists, it is loaded; if not, you create a new one. This ensures data consistency. Mapping functions must be deterministic; given the same blockchain data, they must always produce the same entity state. Avoid using random numbers or accessing off-chain APIs directly, as this will cause indexing failures and subgraph syncing errors.

Here is a concrete example for handling an ERC-20 Transfer event:

typescript
import { BigInt } from '@graphprotocol/graph-ts'
import { Transfer as TransferEvent } from '../generated/MyToken/MyToken'
import { User, Transfer } from '../generated/schema'

export function handleTransfer(event: TransferEvent): void {
  // 1. Create a new Transfer entity keyed by tx hash + log index (unique per event)
  let transferId = event.transaction.hash.toHex() + '-' + event.logIndex.toString()
  let transfer = new Transfer(transferId)
  transfer.from = event.params.from.toHexString()
  transfer.to = event.params.to.toHexString()
  transfer.amount = event.params.value
  transfer.timestamp = event.block.timestamp
  transfer.save()

  // 2. Load the recipient's User entity, creating it on first sight
  let recipientId = event.params.to.toHexString()
  let recipient = User.load(recipientId)
  if (recipient == null) {
    recipient = new User(recipientId)
    recipient.balance = BigInt.fromI32(0)
  }
  recipient.balance = recipient.balance.plus(event.params.value)
  recipient.save()
}

This function creates a permanent record of the transfer and updates the aggregated balance for the receiving address; a complete handler would also debit the sender's balance in the same way.

Effective mapping functions are efficient and handle edge cases. Use BigInt for large numeric values like token amounts or ETH values. Set every non-nullable field before calling save(); optional fields you never set will simply be null when queried. Whenever your contract ABIs or schema change, regenerate the TypeScript bindings with the Graph CLI (graph codegen) so the event.params types stay correct. Your subgraph's data integrity and performance depend on the correctness and efficiency of these mapping functions.

PRODUCTION DEPLOYMENT

Step 3: Deploy and Host Your Indexer

This guide covers the essential steps to deploy your custom blockchain indexer to a production environment, ensuring reliability, scalability, and data availability.

After developing and testing your indexer logic, the next critical phase is production deployment. This involves moving from a local development environment to a hosted service that can run 24/7. The core components you need to deploy are your indexing logic (e.g., a SubQuery project, a subgraph, or a custom service using The Graph's graph-node) and a database (typically PostgreSQL) to store the indexed data. You must also configure a reliable RPC endpoint for the blockchain you are indexing, as this is your data source. Services like Alchemy, Infura, or a dedicated node provider are essential for consistent uptime.

For deployment, you have several hosting options. Managed services like The Graph's Hosted Service (for subgraphs) or SubQuery's Managed Service abstract away infrastructure management, offering a simpler path to deployment. For full control, you can deploy using cloud infrastructure on AWS, Google Cloud, or DigitalOcean. This involves setting up a virtual machine, installing Docker, and running your indexer and database as containers. A common setup uses docker-compose to orchestrate a graph-node container connected to a Postgres container and an Ethereum client like geth or erigon.

Configuration is key for a stable deployment. You must set environment variables for your database connection string (postgresql://USER:PASSWORD@HOST/DB_NAME), the blockchain RPC endpoint, and any API keys. For subgraphs, your subgraph.yaml manifest must point to the correct contract addresses and start block. It's crucial to ensure your indexer can handle chain reorganizations; setting the ETHEREUM_REORG_THRESHOLD environment variable for a graph-node helps manage this. Always run a synchronization test on a testnet or a small block range on mainnet to verify data accuracy before full deployment.

Once deployed, you need to make your indexed data accessible. If using The Graph, you will publish your subgraph to the decentralized network or your hosted service, which generates a GraphQL endpoint (e.g., https://api.thegraph.com/subgraphs/name/your-name/your-subgraph). For custom services, you'll expose a query API, often GraphQL or REST, built with frameworks like Express.js or Apollo Server. This endpoint is what dApp frontends or other services will query to retrieve your indexed data. Securing this API with rate limiting and monitoring is an important subsequent step.

Monitoring and maintenance are ongoing responsibilities. You should set up logging (e.g., using the GRAPH_LOG environment variable) and metrics collection to track indexing health, sync status, and query performance. Tools like Prometheus and Grafana can be integrated. Be prepared to update your indexer for smart contract upgrades or schema changes, which may require redeploying with a new version. For subgraphs, this means updating the manifest and redeploying; the Graph Node will handle re-indexing from the specified start block.

FRONTEND INTEGRATION

Step 4: Query Data from a Frontend Application

Connect your frontend to the indexed blockchain data using GraphQL queries and a client library.

With your subgraph deployed and an API endpoint secured, you can now query data from a web application. The primary method is using GraphQL, a query language for APIs that allows you to request specific data fields. Unlike REST, GraphQL enables you to get all required data in a single request, which is ideal for complex blockchain data structures. You will need a GraphQL client library; Apollo Client and urql are popular choices for React applications. First, install your chosen client and configure it to point to your subgraph's public or private HTTP endpoint, which you obtained from your hosting service like The Graph's Hosted Service or a decentralized network.

Constructing a GraphQL query requires understanding your subgraph's schema. For example, to fetch recent Swap events from a Uniswap-like DEX, your query might request fields like id, amount0In, amount1Out, sender, and the related pair and token details. Here is a basic query structure using the useQuery hook from Apollo Client in a React component:

javascript
import { gql, useQuery } from '@apollo/client';

const GET_SWAPS = gql`
  query GetRecentSwaps {
    swaps(first: 10, orderBy: timestamp, orderDirection: desc) {
      id
      amount0In
      amount1Out
      sender
      pair {
        token0 {
          symbol
        }
      }
    }
  }
`;

function RecentSwaps() {
  const { loading, error, data } = useQuery(GET_SWAPS);
  if (loading) return <p>Loading...</p>;
  if (error) return <p>Error: {error.message}</p>;
  return data.swaps.map((swap) => <p key={swap.id}>{swap.sender}</p>);
}

This query fetches the 10 most recent swaps, ordered by timestamp; the component renders loading and error states while the request resolves.

After executing the query, your application receives a JSON response matching the query's structure. You must handle the loading, error, and data states provided by your GraphQL client. For real-time updates, you can use GraphQL subscriptions if your indexer supports them, allowing your UI to reflect new blockchain events as they are indexed. For performance, implement pagination using arguments like first, skip, and orderBy. Always cache query results client-side to minimize network requests and improve user experience. Remember to manage API keys securely if using a paid or private service, typically by storing them in environment variables rather than in your frontend code.
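Building on the earlier example, pagination is typically driven by GraphQL variables; the page prop and field selection below are illustrative:

typescript
import { gql, useQuery } from '@apollo/client';

const GET_SWAPS_PAGE = gql`
  query GetSwapsPage($first: Int!, $skip: Int!) {
    swaps(first: $first, skip: $skip, orderBy: timestamp, orderDirection: desc) {
      id
      timestamp
    }
  }
`;

// Renders one page of ten swaps; changing the page prop refetches with new variables
function SwapsPage({ page }: { page: number }) {
  const { data, loading } = useQuery(GET_SWAPS_PAGE, {
    variables: { first: 10, skip: page * 10 }
  });
  if (loading || !data) return null;
  return <ul>{data.swaps.map((s: { id: string }) => <li key={s.id}>{s.id}</li>)}</ul>;
}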

BLOCKCHAIN DATA INDEXING

Frequently Asked Questions

Common questions and solutions for developers building or using blockchain indexing services.

What is blockchain data indexing, and why is it necessary?

Blockchain data indexing is the process of extracting, transforming, and structuring raw on-chain data into a queryable format. Native blockchain nodes store data in a way optimized for consensus and verification, not for complex queries. An indexer solves this by:

  • Parsing transaction logs and receipts to decode smart contract events.
  • Organizing data into relational databases (like PostgreSQL) or search engines (like Elasticsearch).
  • Enabling fast queries for user balances, NFT ownership, DeFi pool stats, or custom event histories.

Without an indexer, applications would need to process every block from genesis, a slow and resource-intensive operation. Services like The Graph, Subsquid, and Covalent provide generalized indexing, while many projects run custom indexers for specific logic.

BUILDING YOUR INDEXER

Conclusion and Next Steps

You've learned the core components of a blockchain data indexing service. This final section outlines key considerations for production deployment and suggests advanced topics to explore.

Setting up a robust indexing service requires moving beyond a local development environment. For production, you must address infrastructure reliability and data integrity. Key steps include: deploying your indexer to a cloud provider with auto-scaling capabilities, implementing a robust database like TimescaleDB for time-series data, setting up comprehensive monitoring with Prometheus and Grafana, and establishing a disaster recovery plan with regular database snapshots. Security is paramount; ensure your RPC endpoints are from trusted providers like Alchemy or Infura, and never hardcode private keys in your application code.

To extend your service's capabilities, consider integrating with specialized indexing protocols. The Graph Protocol allows you to publish subgraphs, making your indexed data queryable via GraphQL for other developers. For real-time event streaming, explore solutions like Apache Kafka or Amazon Kinesis to decouple data ingestion from processing. Implementing a caching layer with Redis can dramatically improve query performance for frequently accessed data, such as token prices or recent transaction batches. These architectural decisions will define your service's scalability and user experience.
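A read-through cache sketch for that pattern, assuming the ioredis package and a hypothetical fetchPrice function that queries your database:

typescript
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

// Read-through cache: serve from Redis if present, else compute and cache for 30s
async function cachedTokenPrice(
  token: string,
  fetchPrice: (t: string) => Promise<string> // hypothetical database lookup
): Promise<string> {
  const key = `price:${token}`;
  const cached = await redis.get(key);
  if (cached !== null) return cached;

  const fresh = await fetchPrice(token);
  await redis.set(key, fresh, 'EX', 30); // expire after 30 seconds
  return fresh;
}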

Your next development steps should focus on optimization and advanced features. Profile your indexer's performance to identify bottlenecks in block processing or database writes. Implement historical data backfilling strategies to sync from genesis or a specific block height efficiently. Explore indexing data from Layer 2 networks like Arbitrum or Optimism, which require handling specific bridge events and finality rules. Finally, document your API thoroughly using OpenAPI specifications and consider publishing client SDKs in multiple languages (JavaScript, Python, Go) to lower the integration barrier for other developers building on your indexed data.
