Rollup monitoring is essential for maintaining the health, security, and performance of your Layer 2 network. Unlike monolithic chains, rollups introduce unique components like sequencers, provers, and bridges that require specialized oversight. A robust monitoring system tracks data availability, state commitment latency, transaction finality, and bridge security. Without it, you risk silent failures, degraded user experience, and potential security vulnerabilities. The goal is to achieve observability—not just collecting logs, but deriving actionable insights into system behavior.
Setting Up Rollup Monitoring Systems
A step-by-step tutorial for developers to implement comprehensive monitoring for rollup infrastructure, covering key metrics, tools, and alerting strategies.
The monitoring stack typically consists of three layers: data collection, processing/aggregation, and visualization/alerting. For collection, you'll need agents to scrape metrics from your sequencer node (e.g., Geth or Erigon fork), the prover service, and the bridge contracts. Key metrics include rollup_sequenced_batches, rollup_l1_submission_delay, prover_batch_proof_time, and bridge_total_value_locked. Tools like Prometheus are standard for pulling and storing this time-series data. Logs from these services should be aggregated using Loki or a similar service for tracing specific transaction journeys.
Here's a basic Prometheus configuration snippet to scrape a rollup node's metrics endpoint:
```yaml
scrape_configs:
  - job_name: 'rollup_sequencer'
    static_configs:
      - targets: ['sequencer-host:9090']
    metrics_path: '/metrics'
```
You must instrument your rollup node's code to expose these custom metrics. For an OP Stack chain, you would monitor the op_node and op_geth health endpoints. For a zkRollup like zkSync Era, you would track the server and prover components. The processing layer often uses Grafana for dashboards and Alertmanager to route alerts based on threshold rules, such as a sequencer being down for more than 5 minutes.
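As a sketch of such a threshold rule (using the standard Prometheus rule-file format and assuming the 'rollup_sequencer' job name from the scrape config above), a sequencer liveness alert might look like this:

```yaml
# rollup-alerts.yml -- referenced from prometheus.yml via `rule_files`
groups:
  - name: rollup-sequencer
    rules:
      - alert: SequencerIsDown
        # `up` is set to 0 by Prometheus whenever a scrape target is unreachable;
        # the job label must match the job_name in your scrape configuration.
        expr: up{job="rollup_sequencer"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Sequencer metrics endpoint has been unreachable for more than 5 minutes"
```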
Critical alerts should be configured for core failure scenarios. These include: SequencerIsDown, HighL1SubmissionDelay (e.g., >30 minutes), DataAvailabilityError from the DAC or L1, ProverQueueBacklog exceeding a safe limit, and BridgeActivityAnomaly indicating a potential exploit. Alerts should be routed to appropriate channels like PagerDuty, Slack, or OpsGenie. It's also crucial to monitor the economic security of the system by tracking the bond size of validators/provers and the challenge period status for optimistic rollups.
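Several of these scenarios can be expressed as simple threshold rules. The sketch below assumes rollup_l1_submission_delay (from the metric list above) is exported in seconds; prover_queue_depth and both thresholds are placeholders to adapt to your own stack:

```yaml
groups:
  - name: rollup-critical
    rules:
      - alert: HighL1SubmissionDelay
        # Assumes the gauge is reported in seconds; 1800s = 30 minutes.
        expr: rollup_l1_submission_delay > 1800
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "No batch has landed on L1 for more than 30 minutes"
      - alert: ProverQueueBacklog
        # prover_queue_depth is a hypothetical gauge; substitute whatever metric
        # your prover exposes for its count of unproven batches.
        expr: prover_queue_depth > 50
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prover backlog exceeds the configured safe limit"
```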
Finally, effective monitoring extends beyond infrastructure to the user experience. Implement synthetic transactions that periodically send test transfers through the bridge and measure the end-to-end confirmation time. Use blockchain explorers like Blockscout (for your rollup) and Etherscan (for L1) as external data sources to verify state consistency. By combining low-level system metrics with high-level application checks, you create a defense-in-depth monitoring strategy that can identify issues from the hardware layer all the way to the end-user transaction.
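A synthetic check can be as small as a script that sends a transaction, times its confirmation, and exports the result as a metric. The sketch below is a simplified L2-only variant (a true bridge probe would deposit on L1 and wait for the corresponding credit on L2, but the measurement pattern is the same); it assumes web3.py v6, prometheus_client, and a funded throwaway key supplied via environment variables, with all URLs, ports, and names as placeholders:

```python
# Synthetic probe: send a zero-value self-transfer on the L2 and record the
# end-to-end confirmation time as a Prometheus gauge.
# Assumes web3.py v6; the RPC URL, port, and private key are placeholders.
import os
import time

from prometheus_client import Gauge, start_http_server
from web3 import Web3

L2_RPC_URL = os.environ.get("L2_RPC_URL", "http://localhost:8545")
PROBE_KEY = os.environ["PROBE_PRIVATE_KEY"]  # funded test account, never a production key

E2E_CONFIRMATION_SECONDS = Gauge(
    "rollup_synthetic_tx_confirmation_seconds",
    "End-to-end confirmation time of a synthetic L2 transaction",
)

def run_probe() -> None:
    w3 = Web3(Web3.HTTPProvider(L2_RPC_URL))
    acct = w3.eth.account.from_key(PROBE_KEY)
    tx = {
        "to": acct.address,  # zero-value self-transfer
        "value": 0,
        "gas": 21_000,
        "gasPrice": w3.eth.gas_price,
        "nonce": w3.eth.get_transaction_count(acct.address),
        "chainId": w3.eth.chain_id,
    }
    signed = acct.sign_transaction(tx)
    start = time.time()
    tx_hash = w3.eth.send_raw_transaction(signed.rawTransaction)  # .raw_transaction in web3.py v7
    w3.eth.wait_for_transaction_receipt(tx_hash, timeout=300)
    E2E_CONFIRMATION_SECONDS.set(time.time() - start)

if __name__ == "__main__":
    start_http_server(9102)  # expose /metrics for Prometheus on this port
    while True:
        run_probe()
        time.sleep(300)  # one probe every five minutes
```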
Prerequisites and Setup
Essential tools and configurations required to build a robust monitoring system for rollup networks.
Effective rollup monitoring requires a foundational stack of tools and services before you begin writing custom alerts or dashboards. The core components are a blockchain node (either an execution client for the L1 or a sequencer RPC for the L2), a time-series database for storing metrics, and a visualization/alerting platform. For production systems, you'll need dedicated infrastructure for each component to ensure reliability and data isolation. Popular stacks include running a Geth or Erigon node for Ethereum, Prometheus for metrics collection, and Grafana for dashboards and alerting.
The first critical step is establishing reliable data ingestion. You must run or have access to a node with the appropriate RPC endpoints. For monitoring an Optimism or Arbitrum rollup, you need a connection to the sequencer's RPC (https://mainnet.optimism.io) and a connection to the L1 (e.g., Ethereum Mainnet) to track bridge contracts and dispute events. Use tools like Prometheus Node Exporter for system metrics and a custom exporter (often written in Go or Python) to query the node's JSON-RPC API and convert blockchain data into Prometheus metrics, such as rollup_block_height, pending_transactions, and gas_price.
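As a minimal sketch of such an exporter in Python (the RPC URL and listen port are placeholders; the same polling pattern extends to pending_transactions and gas_price):

```python
# Minimal custom exporter: poll the rollup's JSON-RPC endpoint and expose the
# chain head as the rollup_block_height gauge. URL and port are placeholders.
import time

import requests
from prometheus_client import Gauge, start_http_server

RPC_URL = "http://localhost:8545"

ROLLUP_BLOCK_HEIGHT = Gauge(
    "rollup_block_height",
    "Latest block number reported by the rollup RPC",
)

def poll_block_height() -> None:
    payload = {"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1}
    resp = requests.post(RPC_URL, json=payload, timeout=10)
    resp.raise_for_status()
    # eth_blockNumber returns a hex-encoded quantity, e.g. "0x1b4"
    ROLLUP_BLOCK_HEIGHT.set(int(resp.json()["result"], 16))

if __name__ == "__main__":
    start_http_server(9101)  # Prometheus scrapes this exporter on port 9101
    while True:
        poll_block_height()
        time.sleep(15)
```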
Configuration is key to a maintainable system. Your Prometheus scrape_configs must define jobs for your node exporter and custom blockchain exporter. A typical alert rule in Prometheus YAML might watch for a stalled sequencer: expr: increase(rollup_block_height[5m]) == 0. For visualizing this data, Grafana dashboards should be built to show real-time chain health, including blocks per second, transaction pool size, and bridge finalization delays. Always secure these endpoints; use firewalls, VPNs, or authentication proxies for Prometheus and Grafana interfaces exposed to the internet.
Beyond the base setup, consider integrating log aggregation with Loki or ELK Stack to parse node logs for errors, and set up alert managers like Alertmanager to route notifications to Slack, PagerDuty, or email. For teams not wanting to manage this infrastructure, third-party services like Chainstack, Blockdaemon, or Tenderly provide managed nodes with enhanced APIs and built-in monitoring features, which can significantly reduce initial setup time while providing production-ready reliability and uptime guarantees.
Setting Up Rollup Monitoring Systems
A practical guide to building observability for rollup infrastructure, covering essential metrics, data sources, and alerting strategies.
Rollup monitoring requires a multi-layered approach, as you must track both the health of the underlying L1 settlement layer and the internal state of the rollup's own execution environment. At a minimum, your system should monitor sequencer health, data availability, state commitment finality, and cross-chain messaging. For example, an Optimism or Arbitrum node operator needs to track the sequencer_pending_tx_count to detect transaction processing backlogs, while also verifying that batch submissions to Ethereum are succeeding and not exceeding gas limits. This dual-layer visibility is non-negotiable for maintaining user trust and system reliability.
The primary data sources for monitoring are node RPC endpoints, blockchain explorers, and dedicated indexers. You should instrument your rollup node's JSON-RPC API to collect metrics like eth_blockNumber propagation delay and net_peerCount. For Ethereum L1 dependencies, use services like Alchemy or Infura, or your own archival node, to monitor contract events from the rollup's Inbox and Bridge contracts. Prometheus is the industry-standard tool for scraping and storing these time-series metrics, while Grafana provides the visualization layer. A critical alert might trigger if the rollup_state_root_lag exceeds 100 blocks, indicating a potential halt in state progression.
Effective alerting separates operational noise from genuine incidents. Configure alerts based on thresholds (e.g., sequencer downtime > 2 minutes), absences (e.g., no new batches for 10 minutes), and anomalies (e.g., a 300% spike in failed transactions). Use a tool like Alertmanager to route alerts to Slack, PagerDuty, or email. For instance, a key alert for a zkSync Era validator would monitor the frequency of zkSync_proof_submissions to Ethereum; a missed window could stall withdrawals. Always include contextual information in alerts, such as the affected chain ID and the last known good block hash, to accelerate diagnosis.
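A minimal Alertmanager routing sketch for this kind of setup might look like the following; the receiver names, Slack webhook URL, and PagerDuty routing key are placeholders:

```yaml
# alertmanager.yml -- send critical alerts to PagerDuty, everything else to Slack
route:
  receiver: slack-ops              # default receiver
  group_by: ['alertname']
  group_wait: 30s
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
receivers:
  - name: slack-ops
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ  # placeholder webhook
        channel: '#rollup-alerts'
        send_resolved: true
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: YOUR_PAGERDUTY_ROUTING_KEY  # placeholder Events API v2 key
```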
Beyond basic uptime, you must monitor for economic security and data integrity. Track the rollup's bond or stake on the L1 to ensure it's sufficiently collateralized. Monitor the cost and latency of forced inclusion transactions, a user's escape hatch if the sequencer censors them. For optimistic rollups, alert on the challenge period status and any submitted fraud proofs. For ZK rollups, verify the validity proof submission latency and verification success rate. These metrics guard against liveness failures and ensure the system's cryptographic guarantees are functioning as designed.
Finally, implement structured logging and distributed tracing for deep diagnostics. Logs from your rollup node's geth or reth instance should be ingested into a system like Loki or Elasticsearch. Trace individual transaction journeys from user submission through mempool, sequencing, batch creation, L1 submission, and finalization. This trace data is invaluable when debugging issues like a transaction that is finalized on L1 but not appearing in the rollup's state. A robust monitoring setup is not a one-time task; it requires continuous refinement of dashboards and alerts as the network upgrades and usage patterns evolve.
Essential Monitoring Tools
A robust monitoring stack is critical for rollup security and performance. These tools provide the observability needed to track sequencer health, bridge activity, and fraud proofs.
Economic Security Dashboards
Monitor the economic security of the rollup, particularly for Optimistic Rollups. Track the total value bonded in the fraud proof system and the value locked in the bridge. A significant drop in bonded value relative to bridge TVL increases security risk. Dashboards should also track the challenger set's health and activity.
- Vital Statistic: Ratio of bonded ETH to bridge TVL.
- Alert Threshold: Bonded value falls below a predefined safety multiple of bridge TVL (see the example rule below).
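A sketch of such a rule, assuming a hypothetical rollup_bonded_eth gauge alongside the bridge_total_value_locked metric mentioned earlier (both denominated in ETH) and an illustrative 10% safety floor:

```yaml
groups:
  - name: economic-security
    rules:
      - alert: BondToTVLRatioLow
        # rollup_bonded_eth is a hypothetical gauge for total bonded stake;
        # both metrics are assumed to be denominated in ETH.
        expr: rollup_bonded_eth / bridge_total_value_locked < 0.10
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Bonded value has fallen below 10% of bridge TVL"
```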
Core Metrics by Rollup Type
Key performance indicators and operational data points to track for different rollup architectures.
| Metric / Event | ZK Rollups | Optimistic Rollups | Validiums |
|---|---|---|---|
| State Finality Time | ~10 min | ~7 days | ~10 min |
| Data Availability Layer | On-chain | On-chain | Off-chain (DAC/Celestia) |
| Proof/Dispute Submission Interval | Every batch | Only if fraud is suspected | Every batch |
| Primary Cost Driver | ZK proof generation | L1 calldata & bond posting | Off-chain data & ZK proof |
| Critical Monitoring Alert | Proof verification failure on L1 | State root challenge initiated | Data availability challenge or proof failure |
| Gas Fee Tracking Complexity | Medium (L1 verify + batch) | High (L1 dispute windows) | Medium (L1 verify + DA proof) |
| Sequencer Liveness Check | Required | Required | Required |
| Required Trust Assumption | Cryptographic (validity proof) | Economic (fraud proof bond) | Cryptographic + Data Committee |
Implementation: Setting Up Prometheus and Grafana
A step-by-step guide to deploying a robust monitoring stack for rollup node operators, enabling real-time visibility into system health, performance, and consensus metrics.
Effective rollup node operation requires comprehensive monitoring to ensure high availability, performance stability, and consensus participation. A Prometheus and Grafana stack provides this visibility by collecting, storing, and visualizing time-series metrics. Prometheus acts as the metrics collection and storage engine, pulling data from instrumented services like your rollup client (e.g., OP Stack, Arbitrum Nitro) and the underlying execution and consensus layer clients. Grafana serves as the visualization layer, allowing you to build dashboards that display key performance indicators (KPIs) such as block production latency, transaction throughput, peer counts, and system resource usage.
The first step is installing and configuring Prometheus. After downloading the latest release from the official website, you define a prometheus.yml configuration file. This file specifies which targets to scrape (your nodes) and how often. A crucial configuration is setting up service discovery for dynamic environments, though for a static setup, you list targets directly. For a rollup sequencer, you would typically scrape metrics from ports like :7300 for the rollup client's metrics endpoint, :6060 for the execution client (e.g., Geth), and :8080 for the consensus client (e.g., Lighthouse).
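Assembled into a static prometheus.yml, scraping the ports mentioned above might look like the sketch below; hostnames and job names are placeholders, and ports and paths should be adjusted to match your clients' flags:

```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'rollup_client'        # e.g. op-node metrics
    static_configs:
      - targets: ['localhost:7300']
  - job_name: 'execution_client'     # e.g. op-geth / Geth
    metrics_path: /debug/metrics/prometheus   # Geth serves Prometheus metrics on this path
    static_configs:
      - targets: ['localhost:6060']
  - job_name: 'consensus_client'     # e.g. Lighthouse
    static_configs:
      - targets: ['localhost:8080']
```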
Next, you must expose metrics from your rollup node software. Most modern rollup clients have built-in Prometheus support. For an OP Stack node, you enable metrics by setting the --metrics.enabled flag and specifying a port (--metrics.port=7300). Similarly, ensure your execution and consensus clients are configured to expose their metrics endpoints. The key is verifying that the /metrics HTTP endpoint on each service returns data. You can test this with a simple curl localhost:7300/metrics command. Prometheus will then periodically HTTP GET this endpoint to collect the data.
With data flowing into Prometheus, you deploy Grafana to create actionable dashboards. After installation, you add your Prometheus server as a data source within Grafana's UI. The power lies in crafting PromQL queries to extract meaningful insights. For example, to monitor sequencer health, you might track rollup_sequencer_blocks_proposed to ensure continuous block production, or increase(rollup_sequencer_tx_processed_total[5m]) to visualize transaction throughput. For system health, use node exporter metrics like node_memory_MemAvailable_bytes and node_cpu_seconds_total. Grafana allows you to plot these queries on graphs, set up alert rules based on thresholds, and organize them into a cohesive dashboard.
To move from monitoring to alerting, configure Prometheus Alertmanager. This involves defining alerting rule files whose conditions, when met, fire alerts. A critical rule for a rollup operator might use the expression up{job="rollup-node"} == 0 with a one-minute for: duration, which fires when the metrics endpoint has been unreachable for a minute. (The older ALERT SequencerDown IF up{job="rollup-node"} == 0 FOR 1m syntax is from Prometheus 1.x; current releases express the same rule in YAML rule files.) Alertmanager then handles routing, grouping, and silencing of these alerts, sending notifications via channels like email, Slack, or PagerDuty. This creates a proactive system where operators are notified of issues like high memory usage, stalled block production, or peer connection loss before they impact network service.
Finally, consider advanced configurations for production resilience. Run Prometheus and Grafana in Docker containers or orchestrate them with Kubernetes for easy management and scaling. Implement long-term storage for metrics by integrating Prometheus with remote write targets like Thanos or Cortex, which is essential for analyzing historical performance trends. Regularly update your dashboards and alerting rules to match new versions of your rollup software and incorporate community best practices. A well-tuned monitoring stack is not a set-and-forget tool but a critical component of operational excellence for any rollup node operator.
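On the Prometheus side, shipping samples to long-term storage is usually just a remote_write block; the receiver URL below is a placeholder for your Thanos Receive or Cortex/Mimir endpoint:

```yaml
# prometheus.yml (excerpt) -- forward samples to a long-term storage backend
remote_write:
  - url: https://thanos-receive.example.internal/api/v1/receive  # placeholder endpoint
    queue_config:
      max_samples_per_send: 5000
      capacity: 20000
```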
Code Snippets for Custom Metrics
Implement custom monitoring dashboards for rollups using Prometheus, Grafana, and the Chainscore API to track performance, security, and economic health.
Rollups require specialized monitoring beyond standard node metrics. Key custom metrics include sequencer health (block production latency, batch submission success rate), data availability layer status (DA submission latency, blob confirmation time), prover performance (proof generation time, success rate), and economic security (sequencer bond value, fraud proof/challenge window status). These metrics provide early warnings for liveness failures, congestion, and security degradation. Tools like Prometheus for metric collection and Grafana for visualization form the core of a robust monitoring stack.
To collect custom metrics, you need to instrument your rollup node software. Below is a Python example using the prometheus_client library to expose a gauge for sequencer batch submission latency. This script simulates measuring the time between batch creation and its successful inclusion on the L1.
```python
from prometheus_client import Gauge, start_http_server
import time
import random

# Define a custom Prometheus Gauge
BATCH_SUBMISSION_LATENCY = Gauge(
    'rollup_batch_submission_latency_seconds',
    'Latency of batch submission to L1 in seconds'
)

def simulate_batch_submission():
    """Simulates a batch submission and records its latency."""
    start_time = time.time()
    # Simulate network delay and L1 confirmation time
    time.sleep(random.uniform(2.0, 10.0))
    latency = time.time() - start_time
    # Set the gauge value
    BATCH_SUBMISSION_LATENCY.set(latency)
    print(f"Batch submitted with latency: {latency:.2f}s")

if __name__ == '__main__':
    # Start Prometheus metrics HTTP server on port 8000
    start_http_server(8000)
    print("Metrics server started on port 8000")
    # Simulate periodic batch submissions
    while True:
        simulate_batch_submission()
        time.sleep(30)
```
Run this script, and Prometheus will scrape metrics from http://localhost:8000. The rollup_batch_submission_latency_seconds gauge will then be available for graphing in Grafana.
For L1 state and on-chain data, integrate the Chainscore API. This provides verified metrics like sequencer bond balances, fraud proof window status, and bridge activity without requiring complex event indexing. The following snippet fetches the current economic security metrics for a specified rollup, which you can feed into your Prometheus instance.
```javascript
// Node.js example using axios to fetch Chainscore API data
const axios = require('axios');
const { Gauge, Registry } = require('prom-client');

// Create a custom Prometheus registry and gauge
const registry = new Registry();
const sequencerBondGauge = new Gauge({
  name: 'rollup_sequencer_bond_eth',
  help: 'Sequencer bond value in ETH',
  registers: [registry],
});

async function updateChainscoreMetrics() {
  try {
    // Replace with your actual API key and rollup identifier
    const response = await axios.get(
      'https://api.chainscore.dev/v1/rollups/optimism/metrics/economic-security',
      { headers: { 'x-api-key': 'YOUR_API_KEY' } }
    );
    const { sequencerBondEth } = response.data;
    // Update the Prometheus gauge with the live value
    sequencerBondGauge.set(parseFloat(sequencerBondEth));
    console.log(`Updated sequencer bond gauge: ${sequencerBondEth} ETH`);
  } catch (error) {
    console.error('Failed to fetch Chainscore metrics:', error.message);
  }
}

// Update metrics every 60 seconds
setInterval(updateChainscoreMetrics, 60000);

// Expose metrics endpoint for Prometheus
require('http').createServer(async (req, res) => {
  if (req.url === '/metrics') {
    res.setHeader('Content-Type', registry.contentType);
    res.end(await registry.metrics());
  }
}).listen(8080);
```
In Grafana, create dashboards using your custom Prometheus metrics. Key panels to build include: a time-series graph for rollup_batch_submission_latency_seconds with alerts for spikes over 30 seconds; a stat panel for rollup_sequencer_bond_eth with a warning threshold; and a heartbeat panel for prover status. Use Grafana Alerting to configure notifications to Slack, PagerDuty, or email when critical metrics breach thresholds, such as sequencer downtime or a significant drop in bond value. This end-to-end pipeline—custom export, external API integration, visualization, and alerting—creates a production-grade monitoring system tailored to your rollup's specific risks.
Troubleshooting Common Issues
Common problems encountered when setting up monitoring for rollups, with solutions for developers.
A failing sequencer health check typically indicates a connectivity or state issue. Common causes include:
- RPC Endpoint Issues: The monitoring service cannot reach your sequencer's RPC endpoint (http://localhost:8545). Verify the node is running and the port is open.
- Chain ID Mismatch: Your monitoring tool is configured for the wrong chain ID. Confirm the CHAIN_ID in your rollup config matches the one in your monitoring dashboard.
- Block Production Halted: The sequencer has stopped producing blocks. Check sequencer logs for errors and verify the batcher and proposer components are functioning.
- High Latency: Response time from the sequencer exceeds the health check threshold (often 5-10 seconds). This can be due to high load or system resource constraints.
First, run a manual curl command to the RPC endpoint: curl -X POST -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' http://localhost:8545. If this fails, the issue is with your node, not the monitor.
Recommended Alert Rules and Thresholds
Critical alert configurations for detecting anomalies in sequencer, prover, and bridge operations.
| Alert Type | Severity | Recommended Threshold | Action Required |
|---|---|---|---|
| Sequencer Liveness | Critical | | Immediate PagerDuty |
| Proving Latency | High | | Investigate within 1 hour |
| State Root Finality Delay | High | | Investigate within 1 hour |
| Bridge Deposit/Withdrawal Failure Rate | High | | Investigate within 2 hours |
| L1 Gas Price Spike | Medium | | Monitor and adjust batch size |
| RPC Error Rate (5xx) | Medium | | Check node health |
| Batch Submission Cost | Informational | | Review gas optimization |
External Resources and Documentation
Primary documentation and tooling references for designing, deploying, and operating rollup monitoring systems in production. These resources focus on node health, fault detection, data availability, and alerting.
Frequently Asked Questions
Common questions and troubleshooting steps for developers implementing rollup monitoring and alerting systems.
Monitoring a rollup requires observing two distinct layers. L1 (Ethereum) monitoring tracks the canonical state and security guarantees, focusing on:
- Batch/State root submissions to the L1 bridge contract.
- Challenge periods and fraud proof windows.
- Sequencer status via L1 contract calls.
L2 (Rollup) monitoring tracks the execution environment and user experience, including:
- Sequencer health (RPC endpoint availability, block production).
- Transaction lifecycle (queueing, execution, finality).
- Cross-chain message delivery (L1->L2 and L2->L1).
A complete system must correlate events across both layers to detect failures like a sequencer producing blocks but failing to post them to L1.
Conclusion and Next Steps
You have now configured a foundational monitoring system for your rollup. This guide covered the essential components: data collection, alerting, and visualization.
A robust monitoring stack is not a one-time setup but an evolving system. Your next step should be to define Service Level Objectives (SLOs) and Service Level Indicators (SLIs). For a rollup, key SLIs include sequencer liveness, batch submission latency, L1 confirmation time, and state root finality. Tools like Prometheus can calculate error budgets and alert you when you're at risk of violating an SLO, shifting monitoring from reactive to proactive management.
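One way to encode an SLI/SLO pair in Prometheus is a recording rule for the indicator plus an alert on the objective. The sketch below is illustrative only: it assumes the rollup_sequencer scrape job used earlier and a 99.9% availability objective over 30 days.

```yaml
groups:
  - name: rollup-slo
    rules:
      # SLI: fraction of time the sequencer scrape target was up over 30 days
      - record: sli:sequencer_availability:ratio_30d
        expr: avg_over_time(up{job="rollup_sequencer"}[30d])
      # Fire when the SLI drops below the 99.9% objective
      - alert: SequencerAvailabilitySLOBreached
        expr: sli:sequencer_availability:ratio_30d < 0.999
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "30-day sequencer availability is below the 99.9% SLO"
```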
To deepen your observability, integrate distributed tracing using Jaeger or Tempo. This is critical for debugging cross-layer transactions. You can instrument your sequencer, prover, and node software with OpenTelemetry to trace a user transaction from its submission on L2, through batch creation and proof generation, to its finalization on the L1. Correlating logs, metrics, and traces provides a complete picture of system behavior and failure points.
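A minimal OpenTelemetry setup in Python is sketched below; the collector endpoint, service name, and span names are placeholders, and it assumes the opentelemetry-sdk and OTLP exporter packages are installed with a Tempo, Jaeger, or OpenTelemetry Collector endpoint available:

```python
# Minimal OpenTelemetry tracing sketch: emit spans for batch pipeline stages.
# Assumes opentelemetry-sdk and opentelemetry-exporter-otlp are installed and an
# OTLP-capable collector (Tempo, Jaeger, etc.) is reachable at the endpoint below.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

provider = TracerProvider(resource=Resource.create({"service.name": "rollup-sequencer"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rollup.monitoring")

def sequence_and_submit_batch(tx_hashes: list) -> None:
    # One parent span per batch, with child spans for each pipeline stage.
    with tracer.start_as_current_span("batch.lifecycle") as span:
        span.set_attribute("batch.tx_count", len(tx_hashes))
        with tracer.start_as_current_span("batch.sequence"):
            pass  # call into your sequencing logic here
        with tracer.start_as_current_span("batch.l1_submission"):
            pass  # call into your batch submitter here
```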
Finally, consider automating responses to common alerts. Using the Prometheus Alertmanager with webhook integrations, you can create runbooks that automatically restart a stalled service, failover to a backup sequencer, or post detailed incident summaries to a team channel. The goal is to reduce mean time to resolution (MTTR). Regularly review and test your alerting rules to prevent alert fatigue and ensure they remain relevant as your rollup's architecture evolves.
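For the webhook piece, the Alertmanager side is a small receiver definition; both URLs below are placeholders for whatever automation service executes your runbooks:

```yaml
# alertmanager.yml -- forward a specific alert to an automation webhook
route:
  receiver: default-notifications
  routes:
    - matchers:
        - alertname="SequencerIsDown"
      receiver: runbook-automation
receivers:
  - name: default-notifications
    webhook_configs:
      - url: http://alert-archive.internal:9000/hooks/log  # placeholder catch-all
  - name: runbook-automation
    webhook_configs:
      - url: http://runbook-runner.internal:9000/hooks/restart-sequencer  # placeholder
        send_resolved: true
```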