introduction
DEVELOPER GUIDE

How to Architect a Scalable On-Chain Data Pipeline

A practical guide to designing robust data pipelines for blockchain applications, covering architecture patterns, tooling, and scaling strategies.

An on-chain data pipeline is a system that extracts, transforms, and loads (ETL) data from blockchain networks for analysis, indexing, or application use. Unlike data in a traditional database, blockchain data is immutable, decentralized, and structured as a sequence of blocks containing transactions and event logs. The primary challenge is querying this data efficiently: direct RPC calls to nodes are slow for historical analysis and offer only limited filtering. A well-architected pipeline decouples data ingestion from consumption, enabling real-time dashboards, historical analytics, and performant backend services for dApps. Core components typically include a blockchain client, a data indexer, a transformation layer, and a queryable database.

The first architectural decision is choosing an indexing strategy. A full archival node provides complete historical data but requires significant storage and sync time. Services like Chainscore, The Graph, or Covalent offer indexed data via APIs, abstracting infrastructure management. For custom needs, you can run your own indexer using frameworks like Subsquid or TrueBlocks. The pipeline flow begins with an ingestor that streams raw block data, often via WebSocket subscriptions to node providers like Alchemy or Infura for real-time updates. This data, including transactions, receipts, and logs, is then parsed to filter for relevant smart contract events using their Application Binary Interface (ABI).
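
As a concrete starting point, the following minimal sketch uses viem (one of the client libraries referenced later in this guide) to subscribe to contract events over WebSocket; the RPC URL and contract address are placeholders for your own provider endpoint and target contract.

```typescript
// Minimal ingestion sketch using viem: subscribe to Transfer events over WebSocket.
// The RPC URL and token address below are placeholders, not real endpoints.
import { createPublicClient, webSocket, parseAbi } from 'viem';
import { mainnet } from 'viem/chains';

const client = createPublicClient({
  chain: mainnet,
  transport: webSocket('wss://eth-mainnet.example.com/ws'), // e.g. an Alchemy or Infura endpoint
});

const erc20Abi = parseAbi([
  'event Transfer(address indexed from, address indexed to, uint256 value)',
]);

client.watchContractEvent({
  address: '0x0000000000000000000000000000000000000000', // contract you want to index
  abi: erc20Abi,
  eventName: 'Transfer',
  onLogs: (logs) => {
    // Hand raw logs to the transformation stage (decode, normalize, persist).
    for (const log of logs) {
      console.log(log.blockNumber, log.transactionHash, log.args);
    }
  },
});
```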

After ingestion, raw data must be transformed into a structured format. This involves decoding hexadecimal event data into human-readable values using the contract ABI, calculating derived fields (like token price from a swap event), and handling data normalization (e.g., converting Wei to ETH). This transformation logic is often written in a general-purpose language like Python or TypeScript. The processed data is then loaded into a persistent store optimized for querying. PostgreSQL is a common choice for its relational model and JSON support, while TimescaleDB (a PostgreSQL extension) is ideal for time-series metrics. For massive-scale analytics, data lakes using Apache Parquet formats on AWS S3 queried by Trino or AWS Athena can be cost-effective.
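
The decode-and-normalize step might look like the sketch below, again using viem; the ERC-20 Transfer event and the 18-decimal assumption are illustrative.

```typescript
// Transformation sketch: decode a raw log with the contract ABI and normalize units.
// Assumes a raw log object with `data` and `topics` coming from the ingestion stage.
import { decodeEventLog, formatUnits, parseAbi } from 'viem';

const erc20Abi = parseAbi([
  'event Transfer(address indexed from, address indexed to, uint256 value)',
]);

export function transformTransferLog(raw: {
  data: `0x${string}`;
  topics: [`0x${string}`, ...`0x${string}`[]];
}) {
  const decoded = decodeEventLog({ abi: erc20Abi, data: raw.data, topics: raw.topics });
  const { from, to, value } = decoded.args as { from: string; to: string; value: bigint };
  return {
    from,
    to,
    // Normalize the raw uint256 amount (wei-style units) into a human-readable decimal string.
    amount: formatUnits(value, 18),
  };
}
```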

Scalability is critical as chain activity grows. Implement checkpointing to track the last processed block, allowing the pipeline to resume after failures. Use message queues (like Apache Kafka or RabbitMQ) to decouple ingestion from processing, enabling parallel consumer workers. For data partitioning, segment your database by chain ID, block number ranges, or contract address to improve query performance. Monitor pipeline health with metrics for block processing latency, error rates, and message queue depth. A robust pipeline should also handle chain reorganizations (reorgs) by being able to revert data from orphaned blocks, which requires tracking block finality and maintaining idempotent operations.
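
A minimal checkpointing sketch with node-postgres follows; the checkpoints table, its columns, and the DATABASE_URL environment variable are assumptions for illustration.

```typescript
// Checkpointing sketch: persist the last processed block so the pipeline can resume
// after a crash. Assumes a checkpoints(pipeline TEXT PRIMARY KEY, last_block BIGINT) table.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export async function loadCheckpoint(pipeline: string): Promise<bigint> {
  const res = await pool.query('SELECT last_block FROM checkpoints WHERE pipeline = $1', [pipeline]);
  return res.rows.length ? BigInt(res.rows[0].last_block) : 0n;
}

export async function saveCheckpoint(pipeline: string, block: bigint): Promise<void> {
  await pool.query(
    `INSERT INTO checkpoints (pipeline, last_block) VALUES ($1, $2)
     ON CONFLICT (pipeline) DO UPDATE SET last_block = EXCLUDED.last_block`,
    [pipeline, block.toString()],
  );
}
```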

Finally, expose the processed data to your application. This can be done via a dedicated REST or GraphQL API layer, or by allowing direct, secure read access to the analytics database. For real-time features, consider streaming updates through WebSockets or Server-Sent Events (SSE). Always backfill historical data for new contracts by replaying blocks from a specific starting height. By architecting with modular components—separating ingestion, transformation, storage, and serving—you create a pipeline that can scale with your data needs, swap out underlying technologies, and provide reliable, queryable on-chain data for any application. Open-source examples can be found in the Chainscore GitHub repository for practical reference.

prerequisites
PREREQUISITES AND CORE TECHNOLOGIES

How to Architect a Scalable On-Chain Data Pipeline

Building a robust data pipeline for blockchain analytics requires a foundational understanding of core technologies and architectural patterns. This guide outlines the essential prerequisites and components.

An on-chain data pipeline ingests, processes, and serves data from blockchain networks. Unlike data in a traditional database, blockchain data is immutable, decentralized, and structured as a chain of blocks containing transactions and smart contract logs. The primary challenge is handling the volume, velocity, and complexity of this data in a reliable and scalable way. Core prerequisites include proficiency in a backend language like Python or Go, familiarity with SQL and NoSQL databases, and a solid grasp of Ethereum concepts such as blocks, transactions, logs, and the EVM.

The first architectural decision is the data ingestion method. You can use a JSON-RPC client to query a node directly, but this is slow for historical data. For production systems, you need a more efficient approach. Archival nodes provide full historical state but are resource-intensive. Alternatively, specialized data providers like The Graph (for indexed subgraphs) or Chainscore (for real-time event streaming) abstract away node management and offer higher-level APIs. Your choice depends on your latency requirements, data freshness needs, and whether you need raw or pre-processed data.

Once data is ingested, it must be transformed and stored. A common pattern is to use an event-driven architecture. Raw block data is streamed into a message queue like Apache Kafka or Amazon Kinesis. Consumers then process these events: decoding smart contract ABIs, calculating derived metrics, and normalizing data into analytical schemas. The processed data is typically written to both a time-series database like TimescaleDB for metrics and a data warehouse like Snowflake or BigQuery for complex queries. This separation ensures performance for both real-time dashboards and batch analysis.
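
The hand-off from ingestion to the queue could look like this producer sketch using the kafkajs library; the broker address and the raw-blocks topic name are placeholders.

```typescript
// Event-driven ingestion sketch: publish each raw block to a Kafka topic so downstream
// consumers can decode and normalize it independently of the ingestor.
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'chain-ingestor', brokers: ['localhost:9092'] });
const producer = kafka.producer();
await producer.connect(); // connect once at startup (ESM top-level await)

export async function publishBlock(block: { number: bigint; hash: string }) {
  await producer.send({
    topic: 'raw-blocks',
    messages: [
      {
        // Keyed by block hash so retries/replays of the same block land on the same partition.
        key: block.hash,
        value: JSON.stringify({ number: block.number.toString(), hash: block.hash }),
      },
    ],
  });
}
```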

Scalability is achieved through partitioning and parallel processing. Since blockchain data is append-only and ordered by block number, it's ideal for sharding. You can partition your processing workload by block range across multiple workers. Tools like Apache Spark or Dagster can orchestrate these batch pipelines. For real-time streams, ensure your consumers are stateless and can be horizontally scaled. Database indexing is also critical; you must create indexes on frequently queried fields like block_number, transaction_hash, from_address, and to_address to maintain query performance as your dataset grows into terabytes.
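
Partitioning a backfill workload by block range can be as simple as the following sketch; the chunk size and block numbers are arbitrary examples.

```typescript
// Partitioning sketch: split a historical block range into fixed-size chunks so that
// multiple workers can backfill in parallel. Pure TypeScript, no external dependencies.
export function partitionBlockRange(start: number, end: number, chunkSize: number): Array<[number, number]> {
  const chunks: Array<[number, number]> = [];
  for (let from = start; from <= end; from += chunkSize) {
    chunks.push([from, Math.min(from + chunkSize - 1, end)]);
  }
  return chunks;
}

// Example: fan the chunks out to workers (a queue, worker threads, or separate processes).
const chunks = partitionBlockRange(17_000_000, 17_100_000, 10_000);
// => [[17000000, 17009999], [17010000, 17019999], ...]
```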

Finally, consider data integrity and monitoring. Implement idempotent data processing to handle retries without creating duplicates. Use data lineage tools to track the flow of data from source to destination. Set up comprehensive monitoring for pipeline health: track blocks processed per second, consumer lag in your message queue, and database query latency. A well-architected pipeline is not just about moving data; it's about providing a reliable, scalable, and maintainable foundation for all downstream analytics, DeFi dashboards, and on-chain applications.

data-source-selection
ARCHITECTURE FOUNDATION

Selecting and Evaluating Data Sources

The quality and reliability of your data sources determine the integrity of your entire on-chain data pipeline. This guide details the criteria and methods for selecting and evaluating sources like RPC nodes, indexers, and subgraphs.

An on-chain data pipeline begins with raw data ingestion. The primary source is an RPC (Remote Procedure Call) node, which provides direct access to a blockchain's state and transaction history. For Ethereum and EVM chains, you interact with nodes via the JSON-RPC API to fetch blocks, transactions, logs, and traces. The choice between running your own node (e.g., using Geth, Erigon) versus using a managed service (e.g., Alchemy, Infura, QuickNode) involves trade-offs in cost, latency, reliability, and rate limits. Self-hosted nodes offer maximum control and data sovereignty but require significant operational overhead.
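
At the lowest level, fetching a block over JSON-RPC is a single POST request, as in the sketch below; the endpoint URL is a placeholder for your own node or a managed provider.

```typescript
// Raw JSON-RPC sketch: fetch a block (with full transaction objects) directly from a node.
async function getBlockByNumber(rpcUrl: string, blockNumber: number) {
  const res = await fetch(rpcUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      jsonrpc: '2.0',
      id: 1,
      method: 'eth_getBlockByNumber',
      params: ['0x' + blockNumber.toString(16), true], // true = include full transactions
    }),
  });
  const { result } = await res.json();
  return result; // block header fields plus the transactions array
}
```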

For complex querying of historical data, specialized indexing services are essential. These services transform raw blockchain data into queryable formats. Key providers include:

  • The Graph: Uses subgraphs to index and serve data via GraphQL, ideal for dApp-specific event data.
  • Covalent: Offers a unified API to access historical transactions, balances, and log events across many chains.
  • Blockchain-specific indexers: Like Solana's Helius or Sui's Indexer, which provide optimized access to non-EVM data.

Evaluate these providers based on the chains supported, data freshness (indexing lag), query cost, and the complexity of the data schema you need to build.

When evaluating any data source, establish clear reliability and performance metrics. Monitor uptime and latency—a source with 99.9% SLA and sub-100ms response time is critical for real-time applications. Check data completeness; some archive nodes or indexers may not store full historical trace or state data. Rate limiting and cost models (per request, monthly tiers) directly impact scalability. Always implement fallback providers and retry logic in your pipeline to handle intermittent failures from any single source, ensuring high availability for your downstream applications.
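
A fallback setup might look like the following viem sketch; the provider URLs and retry counts are illustrative.

```typescript
// Fallback-provider sketch: route requests through a primary provider and fail over to a
// secondary one, with per-transport retries.
import { createPublicClient, fallback, http } from 'viem';
import { mainnet } from 'viem/chains';

const client = createPublicClient({
  chain: mainnet,
  transport: fallback([
    http('https://primary-provider.example.com', { retryCount: 3 }),
    http('https://secondary-provider.example.com', { retryCount: 3 }),
  ]),
});

// Any read now transparently retries and fails over between providers.
const latest = await client.getBlockNumber();
```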

ARCHITECTURE DECISION

Batch vs. Streaming Data Ingestion

A comparison of data ingestion strategies for on-chain data pipelines, detailing trade-offs in latency, cost, and complexity.

| Feature | Batch Ingestion | Streaming Ingestion | Hybrid (Lambda) |
| --- | --- | --- | --- |
| Data Latency | Minutes to hours | < 1 second | Seconds to minutes |
| Use Case Fit | Historical analysis, reporting | Real-time alerts, dashboards | Near-real-time analytics |
| Infrastructure Cost | Low to medium | High | Medium to high |
| Implementation Complexity | Low | High | Medium |
| Fault Tolerance | High (re-run jobs) | Medium (requires checkpointing) | High (batch fallback) |
| Data Freshness for Queries | Stale | Real-time | Near-real-time |
| Typical Tools | Airflow, Spark, BigQuery | Kafka, Flink, Faust | Spark Structured Streaming, Delta Lake |
| Best for Chain Re-orgs | Easy to handle | Complex to handle | Moderate complexity |

pipeline-architecture
CORE PIPELINE ARCHITECTURE AND COMPONENTS

How to Architect a Scalable On-Chain Data Pipeline

A robust on-chain data pipeline is essential for building performant dApps, analytics dashboards, and automated trading systems. This guide outlines the core architectural components and design patterns for a scalable, reliable pipeline.

An on-chain data pipeline ingests, processes, and serves data from blockchain networks. Unlike traditional data systems, it must handle event-driven data streams, manage finality delays, and process structured log data from smart contracts. The primary goal is to transform raw, low-level blockchain data—like transaction receipts and logs—into a queryable, application-ready format. A well-architected pipeline is modular, separating concerns like data ingestion, transformation, storage, and serving to ensure maintainability and scalability.

The architecture typically consists of four core layers. The Ingestion Layer connects to blockchain nodes via RPC (e.g., using providers like Alchemy, Infura, or a dedicated node) to stream new blocks and logs. The Processing/Transformation Layer decodes raw log data using contract ABIs, normalizes it, and applies business logic. This is often built with stream-processing frameworks like Apache Flink or Apache Kafka Streams. The Storage Layer persists the processed data, with choices ranging from PostgreSQL for relational data to time-series databases like TimescaleDB for metrics, or data lakes for raw archival.

For real-time applications, implementing an Indexing Strategy is critical. Instead of querying the chain directly for historical data, your pipeline should maintain derived data stores (indices) that are optimized for your query patterns. Common patterns include building event tables that map user addresses to their transactions, or aggregate tables that store pre-computed totals like daily trading volume for a specific DEX. Tools like The Graph offer a decentralized indexing protocol, while self-hosted solutions might use PostgreSQL with appropriate indexes.
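
An aggregate table for daily trading volume could be maintained with an upsert like the one below; the daily_volume table and its columns are illustrative rather than a prescribed schema.

```typescript
// Aggregate-table sketch: maintain a pre-computed daily volume table keyed by (pool, day).
// Assumes daily_volume(pool_address TEXT, day DATE, volume NUMERIC, PRIMARY KEY (pool_address, day)).
import { Pool } from 'pg';

const db = new Pool({ connectionString: process.env.DATABASE_URL });

export async function addSwapToDailyVolume(poolAddress: string, timestamp: Date, amountUsd: string) {
  // Note: a simple running increment assumes each swap event is processed exactly once;
  // pair it with idempotent raw-event storage if replays are possible.
  await db.query(
    `INSERT INTO daily_volume (pool_address, day, volume)
     VALUES ($1, ($2::timestamptz)::date, $3)
     ON CONFLICT (pool_address, day) DO UPDATE SET volume = daily_volume.volume + EXCLUDED.volume`,
    [poolAddress, timestamp.toISOString(), amountUsd],
  );
}
```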

Scalability challenges include handling chain reorganizations (reorgs), managing RPC rate limits, and ensuring data consistency. Design your ingestion to be idempotent and to track the latest processed block height persistently. Use message queues (e.g., Kafka, RabbitMQ) to buffer data between ingestion and processing, allowing components to scale independently. For high-throughput chains like Solana or Polygon, consider parallelizing ingestion by sharding data streams based on block numbers or program IDs.
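
Idempotent writes are straightforward when rows carry a natural key and duplicates are ignored, as in this sketch; the transfers table and its primary key are assumptions for illustration.

```typescript
// Idempotency sketch: insert decoded transfers with a natural key and ignore duplicates,
// so re-processing a block after a restart or reorg recovery yields the same state.
// Assumes a transfers table with PRIMARY KEY (transaction_hash, log_index).
import { Pool } from 'pg';

const db = new Pool({ connectionString: process.env.DATABASE_URL });

export async function insertTransfer(t: {
  transactionHash: string;
  logIndex: number;
  blockNumber: bigint;
  from: string;
  to: string;
  amount: string;
}) {
  await db.query(
    `INSERT INTO transfers (transaction_hash, log_index, block_number, from_address, to_address, amount)
     VALUES ($1, $2, $3, $4, $5, $6)
     ON CONFLICT (transaction_hash, log_index) DO NOTHING`,
    [t.transactionHash, t.logIndex, t.blockNumber.toString(), t.from, t.to, t.amount],
  );
}
```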

A practical implementation might start with a simple listener script using Ethers.js or Web3.py, but will quickly require more robust tooling. Frameworks like Apache Airflow or Prefect can orchestrate batch backfilling jobs, while Vector or Fluentd can handle log collection. The key is to instrument everything with metrics (using Prometheus) and logging to monitor pipeline health, latency, and any data gaps caused by node provider issues.

Ultimately, the choice between building versus using a managed service (like Chainscore, Covalent, or Goldsky) depends on your team's resources and data needs. Building offers maximum flexibility and cost control for unique requirements, while managed services provide reliability and faster time-to-market. Whichever path you choose, a clear separation of layers and a focus on idempotent, fault-tolerant processes will form the foundation of a scalable pipeline.

handling-reorgs
ARCHITECTING DATA PIPELINES

Handling Chain Reorganizations and Data Corrections

A guide to building resilient data pipelines that can handle blockchain reorganizations, ensuring data accuracy and system reliability.

A chain reorganization (reorg) occurs when a blockchain's consensus mechanism discards a previously accepted block in favor of a competing chain that its fork-choice rule prefers, typically a longer or heavier one. This is a normal occurrence on both Proof-of-Work and Proof-of-Stake networks, including Ethereum and Solana, but it invalidates any data your pipeline has already processed from the orphaned blocks. An unhandled reorg can corrupt your database with incorrect transaction data, token balances, and event logs, leading to downstream failures in applications. Architecting for reorgs is essential for any service providing accurate on-chain data, from block explorers to DeFi dashboards.

The core strategy involves implementing a buffer and confirmation system. Instead of processing blocks as soon as they are seen, your pipeline should wait for a sufficient number of subsequent blocks—the confirmation depth. A common heuristic is 15-20 blocks for Ethereum, which makes a reorg beyond that point statistically very unlikely; for absolute certainty you can wait for the finalized checkpoint, roughly two epochs (about 13 minutes). During this buffer period, you must store incoming block data in a temporary, mutable state. Only after a block is considered final should its data be written to your primary, immutable datastore. This approach trades minimal latency for critical data integrity.
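
A minimal sketch of the confirmation-depth rule, assuming a viem client and an illustrative depth of 15 blocks:

```typescript
// Confirmation-depth sketch: only hand blocks to the permanent store once they are
// CONFIRMATIONS blocks behind the chain head; newer blocks stay in a mutable buffer.
import { createPublicClient, http } from 'viem';
import { mainnet } from 'viem/chains';

const CONFIRMATIONS = 15n; // illustrative; tune per chain and risk tolerance
const client = createPublicClient({ chain: mainnet, transport: http('https://rpc.example.com') });

export async function highestSafeBlock(): Promise<bigint> {
  const head = await client.getBlockNumber();
  return head > CONFIRMATIONS ? head - CONFIRMATIONS : 0n;
}
```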

Your ingestion service needs to monitor the chain head continuously. When a reorg is detected—signaled by a decrease in block height or a change in parent hashes—you must roll back your processing. This involves:

  1. Identifying the fork point (the last common block between the old and new chain).
  2. Deleting or marking as invalid all data derived from blocks after the fork point from your primary store.
  3. Re-processing the new canonical blocks from the fork point forward.

Efficient design requires tracking block ancestry and maintaining idempotent data transformation jobs to handle this re-processing without side effects, as in the sketch below.
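
A minimal detection-and-rollback sketch, in which getStoredBlockHash, getCanonicalHash, and rollbackTo are assumed helpers backed by your datastore and RPC client:

```typescript
// Reorg-detection sketch: compare the parent hash of each incoming block with the hash
// stored for the previous height; a mismatch means the chain we followed was orphaned.
type BlockHeader = { number: bigint; hash: string; parentHash: string };

// Walk backwards until the stored hash and the node's canonical hash agree again.
export async function findForkPoint(
  fromHeight: bigint,
  getStoredBlockHash: (h: bigint) => Promise<string | null>,
  getCanonicalHash: (h: bigint) => Promise<string>, // e.g. via eth_getBlockByNumber
): Promise<bigint> {
  let height = fromHeight;
  while (height > 0n) {
    const stored = await getStoredBlockHash(height);
    if (stored !== null && stored === (await getCanonicalHash(height))) {
      return height; // last block both chains agree on
    }
    height -= 1n;
  }
  return 0n;
}

export async function handleIncomingBlock(
  incoming: BlockHeader,
  deps: {
    getStoredBlockHash: (h: bigint) => Promise<string | null>;
    getCanonicalHash: (h: bigint) => Promise<string>;
    rollbackTo: (forkPoint: bigint) => Promise<void>;
  },
): Promise<void> {
  const storedParent = await deps.getStoredBlockHash(incoming.number - 1n);
  if (storedParent !== null && storedParent !== incoming.parentHash) {
    const forkPoint = await findForkPoint(incoming.number - 1n, deps.getStoredBlockHash, deps.getCanonicalHash);
    // Delete or invalidate data above the fork point, then re-process forward.
    await deps.rollbackTo(forkPoint);
  }
}
```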

For scalable architectures, consider separating the chain follower from the data processor. The follower, using a provider like Chainscore's real-time streams, tracks the chain head and manages the reorg-aware buffer, emitting confirmed block events. The processor subscribes to these finalized events, transforming and loading the data. This decoupling allows the processor to fail and replay events without missing blocks. Use a message queue (e.g., Apache Kafka, Amazon SQS) with persistent offsets between these services to guarantee at-least-once delivery and processing during recovery scenarios.

Data correction logic must be baked into your data models. Use compound keys that include block number and transaction index, not just transaction hash, since the same transaction can reappear in a different block after a reorg and a hash alone is not guaranteed to be unique across chains. Implement soft deletes or versioned records in your database. For example, instead of overwriting an account balance, append a new record with the updated balance and block number; the current state is the record with the highest confirmed block number. This pattern, similar to an event ledger, makes correcting reorgs a matter of deleting the invalidated records rather than complex state reconstruction.
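
A versioned-balance sketch using node-postgres, with an illustrative balances table keyed by address and block number:

```typescript
// Versioned-record sketch: balances are appended per (address, block), never overwritten,
// so reorg corrections are simple deletes above the fork point.
// Assumes balances(address TEXT, block_number BIGINT, balance NUMERIC, PRIMARY KEY (address, block_number)).
import { Pool } from 'pg';

const db = new Pool({ connectionString: process.env.DATABASE_URL });

// Current state = the row with the highest confirmed block number.
export async function currentBalance(address: string): Promise<string | null> {
  const res = await db.query(
    `SELECT balance FROM balances WHERE address = $1 ORDER BY block_number DESC LIMIT 1`,
    [address],
  );
  return res.rows[0]?.balance ?? null;
}

// Reorg correction = drop every version recorded above the fork point.
export async function revertAbove(forkPoint: bigint): Promise<void> {
  await db.query(`DELETE FROM balances WHERE block_number > $1`, [forkPoint.toString()]);
}
```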

Finally, test your pipeline rigorously. Use development networks (testnets) and tools like Hardhat or Anvil to simulate reorgs by mining competing chains. Monitor key metrics: reorg detection latency, data rollback duration, and confirmation depth effectiveness. A robust pipeline isn't defined by avoiding reorgs—which is impossible—but by its ability to detect and correct them automatically, ensuring the served data consistently reflects the single, agreed-upon canonical chain.

schema-design
ARCHITECTURE GUIDE

Database Schema Design for Blockchain Data

A practical guide to designing scalable database schemas for ingesting, storing, and querying on-chain data efficiently.

Blockchain data presents unique challenges for traditional database design. Unlike conventional application data, on-chain information is immutable, append-only, and highly interconnected. A typical schema must handle entities like blocks, transactions, logs, token transfers, and internal traces. The primary goal is to structure this raw data into a queryable format that supports complex analytics, dashboards, and application backends without sacrificing performance as the dataset grows into the terabytes.

The foundation of any blockchain data pipeline is the extract, transform, load (ETL) process. You first extract raw data from a node's RPC endpoints (using eth_getBlockByNumber, eth_getTransactionReceipt). This data is then transformed—parsing hex values into decimals, decoding event logs using Application Binary Interfaces (ABIs), and flattening nested structures. Finally, it's loaded into your chosen database. Tools like The Graph's Subgraphs, Covalent, or custom indexers using ethers.js or viem automate much of this, but understanding the underlying flow is crucial for debugging and optimization.

A well-designed schema for Ethereum-like chains typically uses a star schema with fact and dimension tables. A central fact_transactions table would store core fields like hash, block_number, from_address, to_address, value, and gas_used. This table references dimension tables like dim_block (block details), dim_address (a curated list of addresses), and dim_token (ERC-20/721 contracts). Separating static dimensions from transactional facts improves query performance and reduces storage costs through normalization.

For query efficiency, strategic indexing is non-negotiable. You must create indexes on all foreign key columns and frequently filtered fields. Common examples include indexes on block_number (for time-range queries), from_address/to_address (for wallet activity), and transaction_hash (for quick lookups). For time-series aggregates, consider using database-specific features like PostgreSQL table partitioning on block_timestamp to automatically manage old data and speed up queries on recent blocks.
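
The migration sketch below ties the fact table, timestamp partitioning, and indexes together; all table and index names are illustrative, and the single monthly partition is just an example (real pipelines create partitions on a schedule).

```typescript
// Schema sketch (illustrative names): a fact table for transactions, partitioned by
// block_timestamp, plus the indexes discussed above. Run once as a migration.
import { Pool } from 'pg';

const db = new Pool({ connectionString: process.env.DATABASE_URL });

export async function migrate(): Promise<void> {
  await db.query(`
    CREATE TABLE IF NOT EXISTS fact_transactions (
      hash            TEXT           NOT NULL,
      block_number    BIGINT         NOT NULL,
      block_timestamp TIMESTAMPTZ    NOT NULL,
      from_address    TEXT           NOT NULL,
      to_address      TEXT,
      value           NUMERIC(78, 0) NOT NULL, -- raw uint256 in wei
      gas_used        BIGINT         NOT NULL,
      PRIMARY KEY (hash, block_timestamp)
    ) PARTITION BY RANGE (block_timestamp);

    -- Monthly partitions keep recent data hot and old data easy to detach or archive.
    CREATE TABLE IF NOT EXISTS fact_transactions_2024_01
      PARTITION OF fact_transactions
      FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

    CREATE INDEX IF NOT EXISTS idx_tx_block_number ON fact_transactions (block_number);
    CREATE INDEX IF NOT EXISTS idx_tx_from ON fact_transactions (from_address);
    CREATE INDEX IF NOT EXISTS idx_tx_to ON fact_transactions (to_address);
  `);
}
```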

Scalability requires planning for data volume. A full Ethereum archive node exceeds 12TB. You can reduce this by storing only the data your application needs—perhaps only ERC-20 transfers and specific smart contract events. Use columnar storage formats like Parquet in a data lake (e.g., AWS S3, Google Cloud Storage) for cost-effective historical analysis, and keep a hot, indexed relational database (e.g., PostgreSQL, TimescaleDB) for the most recent 3-6 months of data to serve low-latency API requests.

Finally, maintain data integrity and ease of use. Implement idempotent ETL jobs to handle re-orgs safely. Use schema migration tools (like Flyway or Liquibase) to manage changes. Provide clear views or materialized views for common queries, such as current_token_balances or daily_transaction_volume. By treating blockchain data as a specialized time-series dataset and applying these database principles, you build a robust foundation for any Web3 application.

tools-and-libraries
ARCHITECTURE

Essential Tools and Libraries

Building a scalable on-chain data pipeline requires specialized tools for indexing, querying, and transforming blockchain data. This guide covers the core components.

scaling-multi-chain
SCALING AND MULTI-CHAIN ARCHITECTURE

How to Architect a Scalable On-Chain Data Pipeline

Designing a robust data pipeline for blockchain applications requires a multi-layered approach to handle indexing, transformation, and cross-chain queries efficiently.

An on-chain data pipeline extracts, transforms, and loads (ETL) data from blockchain nodes into a queryable format. The core challenge is handling the volume and velocity of data across multiple chains. A scalable architecture typically involves three layers: an ingestion layer that subscribes to blockchain events via RPC nodes, a processing layer that decodes and normalizes this data, and a serving layer that exposes APIs for applications. For high throughput, you must design for idempotency, handle chain reorganizations, and implement efficient data schemas from the start.

The ingestion layer is your connection to the raw blockchain. Instead of polling RPC endpoints, use a subscription model with WebSocket connections to listen for new blocks and logs. Tools like Ethers.js v6 or viem are essential here. For multi-chain support, you need a dedicated ingestion service per target chain (Ethereum, Polygon, Arbitrum). Each service should log to a durable message queue like Apache Kafka or Amazon SQS to decouple ingestion from processing, ensuring no data loss during downstream failures. Always implement retry logic with exponential backoff for RPC calls.
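
A generic exponential-backoff wrapper is only a few lines; the attempt count and delays below are illustrative defaults.

```typescript
// Retry sketch: wrap any RPC call with exponential backoff and jitter before giving up
// and (in a full pipeline) routing the failed work to a dead-letter queue for inspection.
export async function withBackoff<T>(
  call: () => Promise<T>,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      const delay = baseDelayMs * 2 ** attempt + Math.random() * 100; // exponential + jitter
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}

// Example: const block = await withBackoff(() => client.getBlock({ blockNumber: 19_000_000n }));
```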

In the processing layer, raw block data is transformed into structured application data. This involves decoding event logs using ABI definitions, calculating derived states (like token balances), and handling edge cases. Use a stream-processing framework like Apache Flink or Apache Spark Streaming for stateful operations. For example, calculating a user's total DeFi exposure across protocols requires joining events from multiple contracts. Store the results in a time-series database like TimescaleDB or a columnar store like Apache Druid optimized for aggregations over large datasets.

The final serving layer provides low-latency access to the processed data. A GraphQL API powered by Hasura or Apollo Server offers flexible queries for front-end applications. For complex analytical queries, consider a dedicated OLAP database. To ensure scalability, implement data partitioning by date or chain ID and use read replicas. A critical best practice is to build idempotent pipelines; processing the same block twice should yield the same database state, which is vital for recovery after errors or chain reorgs.

Monitoring and reliability are non-negotiable. Implement comprehensive logging of block heights processed, RPC errors, and queue depths. Use alerts for processing lag or increased error rates. For a truly resilient multi-chain pipeline, design each chain's pipeline as an independent, fault-isolated service. This prevents a failure on one chain (like an RPC outage) from cascading to others. Your architecture should allow for adding new chains by replicating the ingestion and processing modules with chain-specific configurations, enabling scalable expansion across the ecosystem.

monitoring-reliability
MONITORING, ALERTING, AND ENSURING RELIABILITY

How to Architect a Scalable On-Chain Data Pipeline

Building a robust data pipeline for blockchain applications requires a deliberate architecture focused on observability and fault tolerance. This guide outlines the core components and strategies for ensuring your pipeline scales reliably under load.

A scalable on-chain data pipeline ingests raw blockchain data—blocks, transactions, logs—and transforms it into structured, queryable information for applications. The core architectural challenge is managing stateful stream processing at scale. Unlike traditional databases, blockchain data is immutable and append-only, but processing it involves tracking complex state like token balances or liquidity pool reserves over time. A robust pipeline typically consists of three layers: an ingestion layer (RPC nodes, specialized indexers), a processing/transformation layer (stream processors, indexing logic), and a serving layer (databases, APIs). Each layer must be designed for horizontal scaling and graceful degradation.

Reliability hinges on comprehensive monitoring and observability. You need metrics at every stage: RPC node health (latency, error rates, chain reorganization depth), data processing throughput (blocks/sec, events/sec), and data freshness (lag behind chain head). Tools like Prometheus for metrics collection and Grafana for dashboards are standard. For Ethereum-based chains, monitor the performance of heavy RPC methods such as eth_getLogs. Implement structured logging with correlation IDs to trace a single transaction's journey through your entire pipeline, which is critical for debugging failed or delayed data points.
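
A block-lag gauge using the prom-client library might look like the following sketch; the metric name, label, and port are illustrative.

```typescript
// Observability sketch: expose the pipeline's lag behind the chain head as a Prometheus gauge.
import client from 'prom-client';
import http from 'node:http';

const blockLag = new client.Gauge({
  name: 'pipeline_block_lag',
  help: 'Blocks between the chain head and the last block this pipeline processed',
  labelNames: ['chain'],
});

export function recordLag(chain: string, chainHead: bigint, lastProcessed: bigint): void {
  blockLag.set({ chain }, Number(chainHead - lastProcessed));
}

// Minimal /metrics endpoint for Prometheus to scrape.
http
  .createServer(async (_req, res) => {
    res.setHeader('Content-Type', client.register.contentType);
    res.end(await client.register.metrics());
  })
  .listen(9464);
```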

Proactive alerting prevents data staleness and application downtime. Set alerts for critical failures: RPC provider disconnection, processing queue backlogs exceeding a threshold (e.g., 100 blocks), or a sustained increase in data transformation errors. Use a tiered alerting system—PagerDuty or Opsgenie for critical issues, Slack channels for warnings. For example, an alert should trigger if your pipeline's block lag exceeds 50 blocks for more than 5 minutes, indicating a processing bottleneck or upstream issue that needs immediate investigation.

To ensure data correctness, implement data quality checks and idempotent processing. Write idempotent handlers that produce the same database state if the same block is processed multiple times, which is essential for handling chain reorgs. Run periodic integrity checks, such as comparing the total token supply derived from your processed data against a direct on-chain call to the contract's totalSupply() function. Services like Great Expectations or custom scripts can validate schema consistency and business logic invariants daily.
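
An example integrity check, assuming an indexed transfers table and using viem to read totalSupply() on-chain; table and column names are illustrative.

```typescript
// Integrity-check sketch: compare the total supply implied by indexed transfers with the
// token contract's totalSupply(). Mints are transfers from the zero address, burns are to it.
import { Pool } from 'pg';
import { createPublicClient, http, parseAbi } from 'viem';
import { mainnet } from 'viem/chains';

const db = new Pool({ connectionString: process.env.DATABASE_URL });
const client = createPublicClient({ chain: mainnet, transport: http('https://rpc.example.com') });

const erc20Abi = parseAbi(['function totalSupply() view returns (uint256)']);
const ZERO = '0x0000000000000000000000000000000000000000';

export async function checkTotalSupply(token: `0x${string}`): Promise<boolean> {
  const res = await db.query(
    `SELECT
       COALESCE(SUM(amount) FILTER (WHERE from_address = $2), 0)
       - COALESCE(SUM(amount) FILTER (WHERE to_address = $2), 0) AS derived
     FROM transfers WHERE token_address = $1`,
    [token, ZERO],
  );
  const derived = BigInt(res.rows[0].derived);

  const onChain = await client.readContract({ address: token, abi: erc20Abi, functionName: 'totalSupply' });
  return derived === onChain;
}
```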

Design for failure with retry logic, dead-letter queues (DLQs), and checkpointing. If processing a block fails, the event should be retried with exponential backoff before being moved to a DLQ for manual inspection. Use persistent checkpointing (e.g., storing the last processed block number in a durable store like Redis or PostgreSQL) to allow processors to restart from the last known good state. For high-throughput chains like Solana or Polygon, consider sharding your pipeline by block range or address prefix to parallelize the workload.

Finally, plan for cost optimization and vendor risk management. RPC calls are a major cost center. Cache static data and batch requests where possible. Avoid redundant polling by using WebSocket subscriptions for new block headers. Mitigate single-point failures by integrating multiple RPC providers (e.g., Alchemy, Infura, QuickNode) with automatic failover logic. Regularly test your failover procedure and load-test your pipeline to identify breaking points before they occur in production, ensuring consistent data delivery for your downstream dApps and analytics.

ON-CHAIN DATA PIPELINES

Frequently Asked Questions

Common technical questions and solutions for developers building scalable data infrastructure on the blockchain.

What is the difference between an indexer and a data pipeline?

An indexer is a specific type of data pipeline component that processes and organizes raw blockchain data into a queryable format, often for a single protocol or use case (e.g., The Graph subgraph). A data pipeline is the broader architecture that encompasses data ingestion, transformation, and delivery. It includes indexers, but also components for real-time streaming (e.g., Chainscore Streams), batch processing, storage (e.g., data lakes on Filecoin or Arweave), and serving layers (APIs, subgraphs). Think of an indexer as a specialized factory line, while the pipeline is the entire supply chain from raw material to finished product.
