
OPERATIONAL GUIDE

How to Organize Rollup Incident Response

A structured framework for development teams to prepare for and manage security incidents, downtime, or critical bugs in rollup environments.

Effective rollup incident response begins with preparation, not reaction. Unlike monolithic blockchains, rollups introduce unique failure modes across the sequencer, data availability layer, bridge contracts, and proving system. Your first step is to define a clear incident severity matrix. Categorize events by impact: SEV-1 for total network halt or fund loss, SEV-2 for degraded performance or high-risk bugs, and SEV-3 for non-critical issues. Assign an on-call rotation from core engineering and DevOps, ensuring 24/7 coverage with defined escalation paths to key stakeholders and external auditors like OpenZeppelin.
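The severity matrix works best when paging and alerting tools share it as a single source of truth. Below is a minimal sketch in Python; the tier definitions, response windows, and escalation targets are illustrative assumptions rather than a standard.

```python
# Minimal severity matrix shared by alerting and paging tooling.
# Tier definitions and escalation targets are illustrative assumptions;
# adapt them to your own on-call structure.
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    name: str
    description: str
    max_response_minutes: int
    escalate_to: tuple[str, ...]


SEVERITY_MATRIX = {
    "SEV-1": Severity("SEV-1", "Total network halt or loss of funds", 15,
                      ("core-engineering", "executive-team", "external-auditors")),
    "SEV-2": Severity("SEV-2", "Degraded performance or high-risk bug", 60,
                      ("on-call-sre", "protocol-leads")),
    "SEV-3": Severity("SEV-3", "Non-critical issue", 240,
                      ("engineering-triage",)),
}


def classify(network_halted: bool, funds_at_risk: bool, degraded: bool) -> Severity:
    """Map observed impact to a severity tier."""
    if network_halted or funds_at_risk:
        return SEVERITY_MATRIX["SEV-1"]
    if degraded:
        return SEVERITY_MATRIX["SEV-2"]
    return SEVERITY_MATRIX["SEV-3"]
```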

Establish a dedicated, private war room channel (e.g., in Slack or Discord) for immediate coordination. This channel should be pre-configured with critical alerts from monitoring tools. Essential monitoring targets include: sequencer health (block production, RPC latency), data availability submission success rates, L1 bridge contract event logs for suspicious withdrawals, and prover status for validity or fraud proofs. Tools like Tenderly, Blocknative, or custom alerting via Prometheus/Grafana are commonly used. Automate alerts to trigger based on predefined thresholds, such as sequencer downtime exceeding 5 minutes.
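The downtime threshold above can be enforced by a small watchdog. The sketch below polls eth_blockNumber over plain JSON-RPC and posts to a Slack incoming webhook; the RPC URL, webhook URL, and 30-second poll interval are placeholder assumptions.

```python
# Watchdog sketch: alert the war room if the sequencer stops producing blocks
# for longer than a threshold. The RPC URL, webhook URL, and poll interval
# are assumptions -- tune them to your own deployment.
import time
import requests

SEQUENCER_RPC = "https://rollup-rpc.example.com"        # placeholder endpoint
SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX"  # placeholder incoming webhook
STALL_THRESHOLD_S = 5 * 60

def latest_block() -> int:
    resp = requests.post(
        SEQUENCER_RPC,
        json={"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1},
        timeout=10,
    )
    resp.raise_for_status()
    return int(resp.json()["result"], 16)

def alert(message: str) -> None:
    requests.post(SLACK_WEBHOOK, json={"text": message}, timeout=10)

def watch() -> None:
    last_height = latest_block()
    last_change = time.monotonic()
    while True:
        time.sleep(30)
        try:
            height = latest_block()
        except requests.RequestException as exc:
            alert(f":rotating_light: Sequencer RPC unreachable: {exc}")
            continue
        if height > last_height:
            last_height, last_change = height, time.monotonic()
        elif time.monotonic() - last_change > STALL_THRESHOLD_S:
            alert(f":rotating_light: No new L2 block for over 5 minutes (stuck at {last_height})")

if __name__ == "__main__":
    watch()
```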

When an incident is declared, the first responder's role is to diagnose and contain. Immediately gather logs and metrics to identify the incident's scope: Is it isolated to the sequencer, or is the L1 bridge affected? For a halted sequencer, the response may involve failover to a backup or manual transaction ordering. If a bug is suspected in a smart contract, the priority is to pause the vulnerable module—most rollup bridges include pause functions controlled by a multisig or timelock. Document every action taken in a shared log, as this will be crucial for post-mortem analysis and communicating with users.
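Because pause functions usually sit behind a multisig, the first responder often prepares the transaction payload rather than sending it directly. The sketch below encodes calldata for a hypothetical parameterless pause() function using web3.py; the bridge address and the function itself are assumptions about your contracts.

```python
# Prepare the calldata for an emergency pause() so it can be proposed through
# the team's multisig. Sketch only: the bridge address and the existence of a
# parameterless pause() function are assumptions about your own contracts.
from web3 import Web3

BRIDGE_PROXY = "0x1111111111111111111111111111111111111111"  # hypothetical bridge address

# Function selector = first 4 bytes of keccak256 of the function signature.
selector = bytes(Web3.keccak(text="pause()"))[:4]

proposal = {
    "to": BRIDGE_PROXY,
    "value": 0,
    "data": "0x" + selector.hex(),   # calldata for pause()
    "description": "Emergency pause of bridge withdrawals (SEV-1)",
}
print(proposal)
```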

Communication is critical. Prepare templated announcements for different severity levels. For a SEV-1 incident, immediately update the public status page and post a concise alert on social media (e.g., "We are investigating an issue with the sequencer. Transactions are temporarily paused."). Provide regular, honest updates even if the root cause is unknown. For developers, maintain a real-time feed in your public Discord or Telegram support channel. Transparency builds trust; obscuring the severity or ETA for resolution often leads to greater community backlash and speculation.
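Templates are easiest to reach for under pressure when they are stored next to the severity matrix. A minimal sketch follows; the wording and placeholders are illustrative, not prescribed copy.

```python
# Pre-written announcement templates keyed by severity.
# Wording is illustrative; adapt to your own voice and channels.
ANNOUNCEMENT_TEMPLATES = {
    "SEV-1": ("We are investigating an issue affecting {component}. "
              "Transactions are temporarily paused. Next update by {next_update} UTC."),
    "SEV-2": ("We are seeing degraded performance on {component}. "
              "We will post an update by {next_update} UTC."),
    "SEV-3": "We are aware of a minor issue with {component} and are working on a fix.",
}

def render_announcement(severity: str, component: str, next_update: str) -> str:
    return ANNOUNCEMENT_TEMPLATES[severity].format(
        component=component, next_update=next_update
    )

print(render_announcement("SEV-1", "the sequencer", "14:30"))
```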

After resolution, conduct a blameless post-mortem within 72 hours. The report should detail the timeline, root cause, impact metrics (downtime, affected users, financial impact), and, most importantly, actionable follow-up items. Examples include: "Add circuit breaker to sequencer batch logic," "Increase test coverage for edge-case withdrawal proofs," or "Implement more granular bridge pausing." Publish a summarized version of this post-mortem publicly. This practice, adopted by teams like Optimism and Arbitrum, demonstrates accountability and helps the entire ecosystem learn from the event.

ROLLUP OPERATIONS

Prerequisites for Effective Incident Response

A structured, pre-defined response plan is critical for minimizing downtime and financial loss during a rollup incident. This guide outlines the essential components to establish before an emergency occurs.

Effective incident response for a rollup begins with clear ownership and communication channels. Designate a primary on-call engineer and establish escalation paths to senior developers or protocol architects. Define communication protocols using tools like Discord war rooms, Telegram groups, or PagerDuty to ensure rapid, coordinated action. Crucially, maintain a public-facing status page (e.g., using Statuspage or a GitHub issue) to provide transparent, real-time updates to users and dApp developers, managing community expectations and reducing panic during an outage.

Technical preparedness requires comprehensive monitoring and alerting. Implement observability stacks that track core sequencer health metrics: batch_submission_latency, L1_gas_spike_detection, state_root_commitment_delay, and RPC_endpoint_availability. Set up alerts for deviations from baseline performance using tools like Prometheus, Grafana, or Datadog. Additionally, maintain ready access to sequencer private keys, multi-sig wallets, and upgrade contracts on the L1. These must be securely stored but instantly accessible to authorized responders to execute emergency pauses, upgrades, or fund recovery.
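The metrics listed above can be exposed for Prometheus scraping with the standard Python client. The sketch below registers gauges matching those names; the collection logic is stubbed and would be wired to your sequencer and L1 node APIs.

```python
# Expose the core sequencer health metrics named above to Prometheus.
# Sketch: the collector is a stub to be wired to your node and L1 APIs.
import random
import time

from prometheus_client import Gauge, start_http_server

batch_submission_latency = Gauge(
    "batch_submission_latency_seconds", "Time since the last batch was submitted to L1"
)
state_root_commitment_delay = Gauge(
    "state_root_commitment_delay_seconds", "Time since the last state root commitment"
)
rpc_endpoint_availability = Gauge(
    "rpc_endpoint_availability", "1 if the public RPC endpoint is responding, else 0"
)

def collect() -> None:
    # Stub values; replace with real measurements from your infrastructure.
    batch_submission_latency.set(random.uniform(10, 120))
    state_root_commitment_delay.set(random.uniform(60, 600))
    rpc_endpoint_availability.set(1)

if __name__ == "__main__":
    start_http_server(9100)  # scrape target for Prometheus
    while True:
        collect()
        time.sleep(15)
```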

Documentation is the backbone of any response. Create and regularly update a runbook with step-by-step procedures for common failure scenarios. This should include:

  • Sequencer halt and restart procedures
  • Forced transaction inclusion via L1
  • Contract upgrade and pausing mechanisms
  • Cross-chain bridge pause/unpause commands

Test these procedures regularly in a testnet or devnet environment that mirrors mainnet configurations. Familiarity with these steps under non-stressful conditions prevents critical mistakes during a live incident.
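A runbook stays testable when the scenario-to-steps mapping is kept in a machine-readable form that drills and tooling can consume. The sketch below is illustrative; the scenario names and steps are placeholders, not real procedures.

```python
# Runbook-as-code sketch: map failure scenarios to ordered response steps so
# drills and tooling can consume them. Scenarios and steps are placeholders.
RUNBOOK = {
    "sequencer_halt": [
        "Confirm halt: compare eth_blockNumber on sequencer vs. replica RPC",
        "Check sequencer process and host metrics (CPU, disk, peers)",
        "Attempt controlled restart; if unhealthy, fail over to standby sequencer",
        "Announce status per SEV-1 template and schedule the next update",
    ],
    "bridge_vulnerability": [
        "Encode pause() calldata and open a multisig proposal",
        "Page signers and collect signatures",
        "Verify paused state on L1 before the public announcement",
    ],
    "forced_inclusion_needed": [
        "Confirm the sequencer is halted or censoring beyond the inclusion window",
        "Submit the transaction through the L1 inbox/portal contract",
        "Verify inclusion on L2 once derivation catches up",
    ],
}

def print_runbook(scenario: str) -> None:
    for i, step in enumerate(RUNBOOK[scenario], start=1):
        print(f"{i}. {step}")

print_runbook("sequencer_halt")
```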

Finally, establish legal and operational protocols. Define decision-making authority for invoking emergency measures, which may involve a decentralized multisig or a pre-authorized committee. Understand the regulatory and contractual implications of pausing a network or rolling back state. Coordinate with key ecosystem partners—such as major bridges (LayerZero, Wormhole), oracles (Chainlink), and DEXs—to ensure their systems are aligned with your response actions, preventing fragmented liquidity or arbitrage attacks during the recovery process.

OPERATIONAL FRAMEWORK

Key Concepts in Rollup Incident Response

A structured incident response plan is critical for rollup operators to minimize downtime and protect user funds during a protocol failure.

Effective rollup incident response begins with preparation. This involves establishing a clear on-call rotation for core engineering and DevOps teams, documented communication channels (e.g., private war rooms, public status pages), and pre-defined severity classifications. For example, a Severity 1 (S1) incident might be a sequencer halt causing a total liveness failure, while an S2 could be a critical bug in a bridge contract. Teams should maintain a runbook with immediate diagnostic commands, such as checking sequencer health via an RPC endpoint (curl -X POST https://rollup-rpc.example.com -H "Content-Type: application/json" --data '{"jsonrpc":"2.0","method":"eth_blockNumber","id":1}').

When an incident is detected, the first phase is identification and declaration. The on-call engineer confirms the issue—such as stalled block production, a spike in failed transactions, or an exploit alert—and formally declares an incident within the designated channel. This triggers the assembly of the incident commander and relevant responders. Clear, time-stamped logs are essential. The team must quickly determine the incident's scope: Is it isolated to the sequencer, a data availability layer, a bridge, or the underlying smart contracts? This assessment dictates the next steps and communication strategy.

The core of the response is containment and mitigation. The goal is to stop the damage and restore basic functionality. For a sequencer failure, this may involve failing over to a backup node or, in extreme cases, coordinating a temporary pause via the upgrade multisig. If a vulnerability is found in a bridge, the team might need to temporarily disable deposits. All actions must be executed through the protocol's governance or emergency security council, following pre-authorized multisig procedures. During this phase, internal communication is paramount to avoid conflicting actions.

Parallel to technical mitigation is stakeholder communication. A status page should be updated with confirmed facts, impacted services, and estimated time to resolution. Transparent, timely updates build trust. For major incidents affecting user funds, a post-mortem must be published, detailing the root cause, response timeline, and corrective actions. This document, following a template like those from Google or GitLab, is a cornerstone of operational maturity and is often required by ecosystem partners and auditors.

Finally, the post-incident review is where long-term improvements are made. The team conducts a blameless retrospective to analyze the response: Were detection times adequate? Were runbooks followed? What new monitoring or circuit breakers could prevent recurrence? Findings are translated into concrete action items, such as deploying additional sequencer health checks, improving alert granularity, or updating contract pausing mechanisms. This cycle of preparation, response, and learning fortifies the rollup's resilience against future incidents.

INCIDENT RESPONSE

Common Rollup Incident Types

Understanding common failure modes in rollups is the first step to building an effective response plan. This guide categorizes the primary technical and economic incidents that can affect L2 networks.

SEVERITY CLASSIFICATION

Incident Severity and Response Matrix

A framework for classifying rollup incidents by impact and defining the required response protocol.

SEV-1: Critical
  • Impact: Total network halt, loss of funds, or critical consensus failure.
  • Example scenarios: Sequencer downtime > 2 hours, invalid state root commitment, bridge exploit.
  • Initial response time: < 15 minutes
  • Communication protocol: Public announcement + dedicated status page + all social channels.
  • Escalation path: Immediate escalation to core engineering and executive team.

SEV-2: High
  • Impact: Major service degradation, partial downtime, or significant performance issues.
  • Example scenarios: Sequencer lag > 30 blocks, RPC endpoint failure, >50% fee spike.
  • Initial response time: < 1 hour
  • Communication protocol: Public status page update + core community channels (Discord/TG).
  • Escalation path: Escalation to on-call SRE and protocol leads within 1 hour.

SEV-3: Medium
  • Impact: Non-critical bug, minor performance degradation, or UI/UX issues.
  • Example scenarios: Explorer displaying stale data, minor RPC latency, non-critical contract bug.
  • Initial response time: < 4 hours
  • Communication protocol: Internal tracking + notification in developer channels.
  • Escalation path: Assigned to engineering team for next-business-day resolution.

SEV-4: Low
  • Impact: Cosmetic issues, documentation errors, or feature requests.
  • Example scenarios: Typos on website, incorrect API docs, non-blocking UI bug.
  • Initial response time: Next business day
  • Communication protocol: Internal ticketing system (e.g., Jira, Linear).
  • Escalation path: Routed to appropriate product or developer relations team.

ROLLUP OPERATIONS

Step-by-Step Incident Response Process

A structured framework for organizing and executing a rapid, effective response to security incidents or critical failures on a rollup network.

A formalized incident response process is critical for rollup operators to minimize downtime, financial loss, and reputational damage. Unlike monolithic blockchains, rollups introduce unique failure modes involving sequencers, data availability layers, and bridge contracts. The primary goals are containment, eradication, and recovery. This process should be documented, rehearsed via tabletop exercises, and integrated with on-chain governance mechanisms for protocol-level upgrades if necessary. Key stakeholders include the core dev team, sequencer operators, validators, and a designated communications lead.

The first phase is Preparation and Detection. This involves establishing monitoring for key metrics: sequencer liveness, transaction inclusion latency, state root divergence, and bridge fund balances. Tools like Prometheus, Grafana, and custom health checks for the rollup node's RPC endpoints are essential. Detection can come from automated alerts, user reports on social channels, or security monitoring services like Forta. Immediately upon detection, the incident commander is activated and a private, dedicated communication channel (e.g., a war room in Slack or Discord) is established to coordinate the response away from public view.
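Bridge fund balances are among the cheapest signals to watch, since a sudden drop in the L1 escrow is often the first visible sign of an exploit. The sketch below polls eth_getBalance over raw JSON-RPC; the bridge address, L1 endpoint, and 5% drop threshold are assumptions, and ERC-20 escrows would need separate balanceOf checks.

```python
# Alert if the L1 bridge escrow's native balance drops sharply between polls.
# The bridge address, L1 RPC URL, and 5% threshold are assumptions; ERC-20
# escrows would need separate balanceOf checks.
import time
import requests

L1_RPC = "https://l1-rpc.example.com"                    # placeholder L1 endpoint
BRIDGE = "0x1111111111111111111111111111111111111111"    # hypothetical bridge escrow
DROP_THRESHOLD = 0.05                                     # alert on a >5% drop

def bridge_balance_wei() -> int:
    resp = requests.post(
        L1_RPC,
        json={"jsonrpc": "2.0", "method": "eth_getBalance",
              "params": [BRIDGE, "latest"], "id": 1},
        timeout=10,
    )
    resp.raise_for_status()
    return int(resp.json()["result"], 16)

def watch() -> None:
    baseline = bridge_balance_wei()
    while True:
        time.sleep(60)
        current = bridge_balance_wei()
        if baseline and (baseline - current) / baseline > DROP_THRESHOLD:
            print(f"ALERT: bridge balance dropped {baseline} -> {current} wei")
        baseline = max(baseline, current)  # naive baseline; legitimate outflows need allowances

if __name__ == "__main__":
    watch()
```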

Next is Containment and Analysis. The immediate action is to assess the scope: Is the sequencer halted? Is the bridge paused? Are funds at risk? Short-term containment may involve using administrative functions, like pausing the L1CrossDomainMessenger contract in Optimism-style rollups, to prevent further malicious transactions. Simultaneously, the team conducts forensic analysis. This includes examining sequencer logs, analyzing the faulty batch or state transition, and tracing the incident's root cause. The analysis must differentiate between a software bug, a malicious exploit, or a failure in the underlying data availability layer (like Celestia or EigenDA).

The Eradication and Recovery phase involves deploying a fix. For a sequencer bug, this may mean patching the node software, restarting with a corrected version, and ensuring it re-syncs correctly. For a smart contract vulnerability, a governance-approved upgrade is required. Recovery must be carefully orchestrated: the fixed sequencer begins producing new blocks, verifiers confirm the new chain is valid, and the bridge is unpaused. A crucial step is ensuring the recovered chain's state is consistent and that no double-spends or incorrect state transitions occurred during the incident window. This often requires manual verification of critical state hashes.
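Part of that manual verification can be automated by comparing the same block across independent nodes. The sketch below checks the stateRoot reported by the recovered sequencer against an independently synced verifier; both endpoints are placeholders.

```python
# Cross-check the recovered sequencer against an independent verifier node by
# comparing state roots for the same block. Both endpoints are placeholders.
import requests

SEQUENCER_RPC = "https://sequencer-rpc.example.com"   # placeholder
VERIFIER_RPC = "https://verifier-rpc.example.com"     # placeholder, independently synced

def state_root(rpc_url: str, block_number: int) -> str:
    resp = requests.post(
        rpc_url,
        json={"jsonrpc": "2.0", "method": "eth_getBlockByNumber",
              "params": [hex(block_number), False], "id": 1},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["result"]["stateRoot"]

def check(block_number: int) -> None:
    a = state_root(SEQUENCER_RPC, block_number)
    b = state_root(VERIFIER_RPC, block_number)
    if a == b:
        print(f"block {block_number}: state roots match ({a})")
    else:
        print(f"block {block_number}: DIVERGENCE sequencer={a} verifier={b}")

if __name__ == "__main__":
    check(1_234_567)  # example block inside the incident window
```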

Finally, conduct Post-Incident Review. Document a detailed timeline, root cause, impact assessment, and corrective actions. This report should be public (following a responsible disclosure period) to maintain transparency. Implement the corrective actions, which may include code audits, improved monitoring rules, or updates to the incident response plan itself. For example, after the 2022 Optimism outage due to a Geth dependency bug, the team published a post-mortem and improved their node health-check systems. This phase closes the loop, turning the incident into a learning opportunity to strengthen the rollup's resilience.

ROLLUP INCIDENT RESPONSE

Essential Monitoring and Tooling

A structured approach to detecting, analyzing, and resolving issues in rollup environments. This framework covers the tools and processes needed for effective incident management.

OPERATIONAL GUIDE

Building an Incident Communication Plan

A structured communication plan is critical for managing security incidents, protocol failures, and network outages in rollup environments. This guide outlines the key components and steps for an effective response.

An incident response plan for a rollup must address the unique technical and social coordination challenges of Layer 2 systems. Unlike monolithic chains, a rollup incident can involve the sequencer, data availability layer, bridge contracts, and the underlying L1 settlement layer. Your plan should define clear severity tiers (e.g., Sev-1: Total network halt, Sev-2: Partial degradation) and assign specific on-call responsibilities for engineering, DevOps, and community teams. The first step is immediate internal triage using monitoring tools like block explorers, sequencer health checks, and bridge fund tracking to confirm the incident's scope.

Internal communication must be swift and structured. Use a dedicated, private channel (e.g., a war room in Slack or Discord) for the core response team. The first message should follow a standard template: Incident Title, Severity, Start Time, Affected Systems (Sequencer/Prover/Bridge), and Initial Impact. Designate a single Incident Commander to coordinate technical mitigation and a Communications Lead to manage external messaging. This separation prevents conflicting information and allows engineers to focus on resolution. All actions and discoveries should be logged in a shared document for post-mortem analysis.
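The first war-room message is easier to get right when it is generated from a fixed template. The sketch below mirrors the fields named above; the example values are hypothetical.

```python
# Generate the first war-room message from a fixed template so no field is
# forgotten under pressure. Field names mirror the template described above.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class IncidentDeclaration:
    title: str
    severity: str                 # e.g. "Sev-1"
    affected_systems: list[str]   # e.g. ["sequencer", "bridge"]
    initial_impact: str
    start_time: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def war_room_message(self) -> str:
        return (
            f"[{self.severity}] {self.title}\n"
            f"Start time: {self.start_time.isoformat(timespec='seconds')}\n"
            f"Affected systems: {', '.join(self.affected_systems)}\n"
            f"Initial impact: {self.initial_impact}\n"
            f"Incident Commander / Communications Lead: TBD"
        )

# Hypothetical example values for illustration only.
print(IncidentDeclaration(
    title="Sequencer stopped producing blocks",
    severity="Sev-1",
    affected_systems=["sequencer"],
    initial_impact="No new L2 blocks for 10 minutes; deposits delayed",
).war_room_message())
```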

External communication requires transparency balanced with precision. For a Sev-1 incident, immediately post a notice on the project's official status page and main social channel (e.g., Discord announcement or Twitter/X). The initial message should acknowledge the issue, state what is being investigated, and indicate when the next update will be provided—even if the root cause is unknown. Avoid technical speculation. Subsequent updates should follow at regular intervals (e.g., every 30-60 minutes) to maintain trust. For developers, use dedicated channels like a Telegram/Signal group for ecosystem partners and major dApp integrators.

The resolution and post-mortem phase is crucial for long-term resilience. Once the incident is mitigated, publish a preliminary "All Clear" notice, followed by a commitment to a detailed post-mortem report within a defined timeframe (typically 3-7 days). The public report should include: a timeline of events, root cause analysis, impact assessment (e.g., funds at risk, downtime duration), and a list of concrete corrective actions. For example, after a sequencer outage, actions might include implementing multi-sequencer failover or improving L1 gas price monitoring. This transparency is essential for maintaining validator, developer, and user trust in the rollup's operational integrity.

ROLLUP OPERATIONS

Frequently Asked Questions

Common questions and solutions for managing incidents in rollup environments, from sequencer downtime to state recovery.

What should we do first when the sequencer goes down?

Immediately verify the scope and cause. Check the sequencer's health endpoint (e.g., http://sequencer:7300/health), review logs for errors, and confirm whether the issue is isolated or network-wide; a minimal triage sketch follows the checklist below. Your primary goal is to prevent a state divergence between L1 and L2.

  1. Activate Monitoring Alerts: Ensure your team is notified via PagerDuty, Slack, or OpsGenie.
  2. Public Communication: Update your status page (e.g., using Statuspage.io) and post to social channels (Twitter, Discord) to inform users of degraded service.
  3. Failover Assessment: Determine if you have a hot standby sequencer ready to take over, or if you need to initiate manual recovery procedures.
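A quick way to answer the "isolated or network-wide" question is to compare the sequencer's health endpoint against an independently operated replica in one script. The health endpoint below is the one from the answer above; the replica URL is a placeholder.

```python
# Quick triage sketch: is the problem isolated to the sequencer, or network-wide?
# The health endpoint comes from the answer above; the replica URL is a placeholder.
import requests

SEQUENCER_HEALTH = "http://sequencer:7300/health"
REPLICA_RPC = "https://replica-rpc.example.com"   # placeholder, independently operated

def sequencer_healthy() -> bool:
    try:
        return requests.get(SEQUENCER_HEALTH, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def replica_block() -> int | None:
    try:
        resp = requests.post(
            REPLICA_RPC,
            json={"jsonrpc": "2.0", "method": "eth_blockNumber", "params": [], "id": 1},
            timeout=5,
        )
        return int(resp.json()["result"], 16)
    except (requests.RequestException, KeyError, ValueError):
        return None

if __name__ == "__main__":
    seq_ok, replica = sequencer_healthy(), replica_block()
    if not seq_ok and replica is not None:
        print("Sequencer unhealthy but replica serving blocks: issue looks isolated.")
    elif not seq_ok:
        print("Sequencer and replica both failing: likely network-wide incident.")
    else:
        print("Sequencer health endpoint OK; investigate other layers.")
```

If the replica is also failing, treat the event as network-wide and escalate according to the severity matrix above.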
INCIDENT RESPONSE

Conclusion and Next Steps

A structured incident response plan is not a one-time document but a living framework that must be tested and refined. This final section outlines how to operationalize your plan and where to focus your ongoing security efforts.

Implementing your rollup incident response plan requires moving from theory to practice. Start by conducting a tabletop exercise with your core team. Simulate a scenario like a sequencer failure or a critical vulnerability in a bridge contract. Walk through each step of your plan: initial detection via your monitoring dashboard, internal communication via your designated Slack channel or PagerDuty, and the execution of your documented mitigation procedures. This dry run will expose gaps in your processes, such as unclear role assignments or missing escalation contacts, before a real crisis occurs.

Continuous improvement is critical. After any exercise or real incident, hold a formal post-mortem analysis. Document the timeline, root cause, impact, and most importantly, the action items for process improvement. Tools like Jira or Linear can track these tasks. Publicly sharing a sanitized version of this analysis, as teams like Optimism and Arbitrum have done, builds community trust and contributes to ecosystem-wide security knowledge. Regularly update your contact lists, runbook procedures, and tool configurations to reflect new team members, upgraded infrastructure, and lessons learned.

Your technical foundation must evolve alongside your processes. Invest in robust monitoring that goes beyond basic uptime. Implement canary deployments for sequencer upgrades and critical smart contracts to detect issues in a controlled environment. Utilize fraud proof or validity proof alerting to detect invalid state transitions. Consider engaging with professional audit firms for regular security reviews and bug bounty platforms like Immunefi to crowdsource vulnerability discovery. The security landscape for rollups is dynamic; staying ahead requires proactive investment in both human processes and automated defenses.

Finally, engage with the broader ecosystem. Participate in security forums and working groups within your rollup stack's community, such as the OP Stack Security Council or the Arbitrum DAO. Sharing threat intelligence and coordinating on cross-chain vulnerability disclosures makes the entire network more resilient. The next step is to begin drafting your plan, starting with the incident severity matrix and communication tree outlined in this guide. For further reading, consult resources like the L2BEAT Risk Framework and the Ethereum Rollup Security Checklist.