Setting Up Node Alerting Systems

A technical guide for developers to implement a robust alerting system for blockchain nodes, covering metrics collection, dashboard creation, and notification setup.
INTRODUCTION

Setting Up Node Alerting Systems

A guide to building robust monitoring and alerting for blockchain nodes to ensure operational integrity and uptime.

Running a blockchain node—be it for Ethereum, Solana, or Cosmos—requires constant vigilance. Unlike traditional web servers, nodes must maintain consensus, process transactions, and stay synchronized with a global peer-to-peer network. Downtime or performance degradation can lead to missed blocks, slashing penalties, or an inability to submit transactions. An effective node alerting system is not a luxury; it's a critical component of operational infrastructure. It transforms passive monitoring into proactive incident response, allowing operators to address issues before they escalate into service outages or financial loss.

At its core, a node alerting system monitors key health metrics and triggers notifications when thresholds are breached. Essential metrics to track include node sync status, peer count, memory/CPU usage, disk space, and block height lag. For Proof-of-Stake networks, you must also monitor validator status, signing performance, and jail/unbonding events. Tools like Prometheus are commonly used to scrape these metrics from node clients (e.g., Geth, Prysm, Cosmos SDK) and their exposed endpoints. The scraped data is then visualized in dashboards using Grafana, providing a real-time overview of node health.

The alerting logic, often defined in a tool like Alertmanager (which pairs with Prometheus), determines when a condition becomes a problem. For example, you might set an alert to fire if the eth_syncing metric returns true for more than 5 minutes, or if available disk space falls below 20%. The key is to define actionable alerts—signals that require a specific human or automated response. Avoid alert fatigue by filtering out transient noise and focusing on symptoms that indicate a genuine threat to node functionality or security.

Once an alert is triggered, it needs to reach the right person through a reliable channel. Common notification integrations include Slack, Discord, Telegram, PagerDuty, and email. For critical, 24/7 operations, use an escalation policy that rotates on-call duties. A basic alert might post to a team channel, while a severe alert—like a validator being slashed—could trigger SMS and phone calls. The notification should include all relevant context: the alert name, affected node, metric values, and a link to the relevant dashboard for immediate investigation.

Implementing this stack involves several components. First, instrument your node to expose metrics, typically via a --metrics flag that serves an HTTP endpoint (Geth, for example, defaults to port 6060). Next, deploy Prometheus with a configuration file (prometheus.yml) to scrape these targets. Then, set up Alertmanager with routing rules to handle different alert severities and receivers. Finally, configure Grafana to use Prometheus as a data source and build dashboards. The following code snippet shows a minimal Prometheus alert rule for detecting an Ethereum node that has not finished syncing:

yaml
groups:
- name: node_alerts
  rules:
  - alert: NodeStillSyncing
    expr: eth_syncing == 1  # exact metric name depends on your client or exporter
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Node is still syncing after 5 minutes"

Beyond basic availability, consider advanced monitoring for MEV relays, RPC endpoint performance, and gas price fluctuations if your node serves transactions. Regularly test your alerting pipeline with controlled failures to ensure it works when needed. The goal is to create a resilient feedback loop: monitoring detects an anomaly, alerting notifies the team, and runbooks guide the remediation, ultimately minimizing Mean Time To Recovery (MTTR). A well-tuned alerting system is the foundation for trustworthy, autonomous node operation in the demanding Web3 environment.

PREREQUISITES

Setting Up Node Alerting Systems

Essential infrastructure and knowledge required to implement effective monitoring and alerting for blockchain nodes.

Before configuring alerts, you must have a production-ready node running. This means your node is fully synced with the target network (e.g., Ethereum Mainnet, Polygon PoS, Solana Mainnet Beta) and is actively validating transactions or producing blocks. Ensure your node software (like Geth, Erigon, Prysm, or Solana Labs client) is on a stable, recent release. The node should be deployed on a reliable cloud provider (AWS EC2, Google Cloud, DigitalOcean) or bare-metal server with sufficient resources—CPU, RAM, and SSD storage—to handle network load without performance degradation.

You will need administrative access to the server hosting your node. This includes SSH access and the ability to install system packages, modify configuration files, and run commands with sudo privileges. Familiarity with basic Linux command-line operations (systemctl, journalctl, cron) is essential. Your alerting system will interact with the node's JSON-RPC endpoint (typically on ports like 8545 for Ethereum) and its metrics exporter (like Prometheus Node Exporter on port 9100). Ensure these services are running and accessible, potentially behind a firewall with restricted inbound rules for security.

The core of any alerting setup is a monitoring stack. The most common and powerful combination is Prometheus for metrics collection and Alertmanager for routing notifications. You must install and configure Prometheus to scrape the node client's metrics endpoint as well as the server's system metrics (via Node Exporter). This involves defining a scrape_config in prometheus.yml that points to your targets. For example, a config for a Geth node might target http://localhost:6060/debug/metrics/prometheus. You should understand key client metrics like chain_head_block and p2p_peers, plus system-level CPU, memory, and disk metrics.
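
A scrape job along those lines might look like the sketch below; the 6060 port and the /debug/metrics/prometheus path assume Geth was started with --metrics, so adjust both for your client:

yaml
scrape_configs:
  - job_name: 'geth'
    metrics_path: /debug/metrics/prometheus   # Geth serves Prometheus metrics here when --metrics is enabled
    static_configs:
      - targets: ['localhost:6060']
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']           # Node Exporter for system metrics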

Define meaningful alert rules in Prometheus using its query language, PromQL. These rules are conditions that, when met, fire alerts to Alertmanager. For instance, an alert for a stalled chain might be: geth_chain_head_block{instance="localhost:6060"} - geth_chain_head_block{instance="localhost:6060"} offset 5m == 0. Other critical alerts include low peer count (p2p_peers < 5), high memory consumption (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10), and process downtime (up{job="geth"} == 0). Test your rules using the Prometheus expression browser before enabling them.
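
Wrapped into a rule file, the downtime and peer-count checks from that list might look like this sketch (the job label assumes a scrape job named geth, as in the example above):

yaml
groups:
  - name: prerequisite_alerts
    rules:
      - alert: NodeProcessDown
        expr: up{job="geth"} == 0   # Prometheus could not scrape the target
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Geth metrics endpoint is unreachable"
      - alert: LowPeerCount
        expr: p2p_peers < 5         # Geth's peer-count gauge
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node has fewer than 5 peers"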

Configure Alertmanager to receive alerts from Prometheus and send notifications to your team. You must set up at least one receiver, such as email, Slack, Telegram, or PagerDuty. This requires obtaining API tokens or webhook URLs from those services. Alertmanager allows you to group, silence, and throttle alerts to prevent notification fatigue. For high-severity alerts (like a validator being slashed), you may want immediate paging, while informational alerts (like a new client version) can be routed to a low-priority channel. Document your alerting hierarchy and escalation policies.

Finally, establish a runbook or documented procedure for responding to each type of alert. An alert is useless without a clear action plan. For example, the response to "Node Is Not Syncing" should include steps to check peer connections, restart the node service, and verify blockchain consensus. Automate remediation where possible using scripts triggered by alerts (e.g., a cron job that restarts a stuck process), but ensure manual oversight for critical actions. Regularly review and update your alert rules and runbooks as the network and your node's role evolve.

SYSTEM ARCHITECTURE

Setting Up Node Alerting Systems

A robust alerting system is critical for maintaining blockchain node health, ensuring high uptime, and preventing costly downtime or slashing penalties.

Node alerting systems monitor key performance indicators (KPIs) and trigger notifications when metrics deviate from expected baselines. Essential metrics to monitor include node sync status, peer count, block height lag, validator participation rate (for consensus nodes), disk space, memory usage, and CPU load. For Ethereum validators, tracking attestation effectiveness and proposal misses is crucial. Tools like Prometheus for metric collection and Grafana for visualization form the foundation of most professional monitoring stacks.

To implement alerting, you first need to expose node metrics. Most client software, such as Geth, Besu, Lighthouse, and Prysm, provides a metrics endpoint; the port varies by client (Geth defaults to 6060, while consensus clients commonly use ports such as 8080 or 5054). You can configure Prometheus to scrape these endpoints. Here's a basic prometheus.yml job configuration for an Ethereum execution client:

yaml
scrape_configs:
  - job_name: 'geth'
    static_configs:
      - targets: ['localhost:6060']

Once metrics are flowing, you define alerting rules in Prometheus that describe failure conditions, such as the node's head block (e.g., chain_head_block) falling more than 100 blocks behind a reference block number scraped from a trusted external source.

The final component is the Alertmanager, which handles alerts sent by Prometheus, deduplicates them, groups them, and routes them to the correct receiver such as Slack, Discord, Telegram, PagerDuty, or email. A critical best practice is to avoid alert fatigue by setting appropriate thresholds and using severity levels (warning, critical). For example, a warning might trigger for a peer count below 20, while a critical alert fires for a peer count of 0 or a disk usage above 90%.
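
Encoded as Prometheus rules, that two-tier approach might look roughly like the following (p2p_peers is Geth's peer gauge; substitute your client's equivalent):

yaml
groups:
  - name: peer_alerts
    rules:
      - alert: PeerCountLow
        expr: p2p_peers < 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Peer count has been below 20 for 10 minutes"
      - alert: PeerCountZero
        expr: p2p_peers == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node has no peers and is likely isolated from the network"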

Beyond basic infrastructure, consider monitoring chain-specific health signals. For a Cosmos SDK chain, alert on consensus_validator_power changes or slashing_signed_blocks_window misses. For Solana, monitor vote_distance and skipped_slots. Smart alerting should also watch for soft forks, network upgrades, and governance proposals that may require operator action. Integrating with tools like Healthchecks.io or Uptime Kuma can provide external heartbeat monitoring for your entire node service.

For automated remediation, you can connect alerts to scripts or infrastructure-as-code tools. A critical disk space alert could trigger an automated log cleanup script via a webhook. However, fully automated responses for consensus-related issues (like restarting a validator) carry risk and should be implemented with extreme caution, if at all. The goal of the alerting system is to provide timely, actionable intelligence to a human operator, forming the nervous system of a reliable node operation.
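
In Alertmanager terms, such a hook is simply another receiver. A hedged sketch, assuming a small internal HTTP service at localhost:5001 that runs the cleanup script, might be:

yaml
receivers:
  - name: 'disk-cleanup-webhook'
    webhook_configs:
      - url: 'http://localhost:5001/cleanup'   # hypothetical local service that prunes old logs
        send_resolved: true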

SETTING UP NODE ALERTING SYSTEMS

Core Components and Tools

Essential tools and services for monitoring blockchain node health, performance, and security. Proactive alerting prevents downtime and data loss.

INFRASTRUCTURE

Step 1: Deploy Prometheus and Exporters

This guide covers the initial deployment of Prometheus and the Node Exporter to establish the foundation for monitoring your blockchain infrastructure.

Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability. It works by scraping metrics from configured targets at regular intervals, evaluating rule expressions, and storing the resulting time-series data. For blockchain nodes, you need to deploy the core Prometheus server and at least one exporter—a small service that exposes metrics in a Prometheus-readable format. The most common exporter for system-level monitoring is the Node Exporter, which provides hardware and OS metrics like CPU usage, memory, disk I/O, and network statistics.

To deploy Prometheus, you can use Docker for simplicity or install it directly on your host. A typical prometheus.yml configuration file defines the scrape intervals, target endpoints (like your exporter), and alerting rules. For a Geth or Erigon node, you would also deploy a client-specific exporter, such as geth-exporter or erigon-exporter, which exposes chain-specific metrics like sync status, peer count, and transaction pool depth. These exporters run alongside your node and expose metrics on an HTTP endpoint (e.g., http://localhost:9100/metrics for Node Exporter).

Here is a basic Docker Compose snippet to run Prometheus and a Node Exporter:

yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prom_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    ports:
      - "9090:9090"
  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
volumes:
  prom_data:

After starting the services, verify Prometheus is scraping by accessing its web UI at http://your-server-ip:9090 and checking the Targets status page.

Configuration is critical. Your prometheus.yml must define a scrape job for each exporter. For example, to scrape the Node Exporter and a hypothetical Geth exporter, you would add:

yaml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'geth'
    static_configs:
      - targets: ['geth-exporter:9091']
    scrape_interval: 15s

The scrape_interval dictates how often Prometheus pulls data. For blockchain nodes, a 15-30 second interval balances detail with resource usage. Ensure your firewall allows traffic on the configured ports (9090, 9100, etc.).

Once deployed and configured, Prometheus will begin collecting time-series data. You can immediately start running queries in the Expression Browser to verify functionality. Try node_memory_MemAvailable_bytes to see available system memory or up to check if your scrape targets are healthy (value of 1). This data layer is the prerequisite for the next step: defining alerting rules in Prometheus that will trigger notifications when specific metric thresholds are breached, such as high memory usage or a node falling out of sync.

CONFIGURATION

Step 2: Define Alerting Rules

Alerting rules are the core logic that determines when your monitoring system should notify you. This step involves translating potential node failures into specific, actionable conditions.

An alerting rule is a conditional statement that evaluates metrics from your node. When the condition is true for a specified duration, it triggers an alert. Common rule types include threshold-based (e.g., CPU > 90% for 5 minutes) and state-based (e.g., validator status is 'jailed'). For blockchain nodes, critical metrics to monitor are peer count, block height synchronization lag, validator signing status, memory usage, and disk I/O latency. Tools like Prometheus use a domain-specific language (PromQL) to define these rules.

Effective rules require precise thresholds and for clauses to prevent noise. For example, a temporary CPU spike is normal, but sustained high usage indicates a problem. A well-defined rule might be: node_cpu_usage_percent > 85 for 5m. This means the alert only fires if the condition persists for five minutes. Similarly, for a Cosmos SDK chain, you'd monitor cosmos_consensus_latest_block_height - cosmos_node_latest_block_height > 10 to catch synchronization issues. Always base thresholds on your node's observed baseline performance.

Here is a basic example of a Prometheus rule definition in YAML format for a Geth execution client:

yaml
groups:
  - name: geth_alerts
    rules:
    - alert: GethHighMemoryUsage
      expr: process_resident_memory_bytes{job="geth"} > 1.5e9  # 1.5 GB
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High memory usage on Geth instance"
        description: "Geth memory usage is {{ $value }} bytes for over 10 minutes."

This rule triggers a warning if the Geth process uses more than 1.5 GB of RAM for ten consecutive minutes.

For validator nodes, missed block alerts are critical. Using the Prometheus metrics from the cosmos-sdk or similar, you can create a rule that fires if your validator misses signatures. Another essential alert is for peer count; falling below a minimum (e.g., less than 10 peers for 2 minutes) can indicate network isolation. Always add informative labels (severity: critical/warning) and annotations that include the failing metric value ({{ $value }}) to provide immediate context in notification messages.
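
A hedged sketch of the missed-signature rule follows (the peer-count rule is analogous to the earlier examples); validator_missed_blocks_total is a placeholder name, since the exact metric varies by client and exporter:

yaml
groups:
  - name: validator_alerts
    rules:
      - alert: ValidatorMissingSignatures
        # Placeholder metric - check what your consensus client or exporter actually exposes.
        expr: increase(validator_missed_blocks_total[10m]) > 3
        labels:
          severity: critical
        annotations:
          summary: "Validator missed {{ $value }} signing duties in the last 10 minutes"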

After defining rules, test them in a staging environment. Use tools like Prometheus's promtool to test rule files for syntax errors. You can also manually force metric values to verify alert triggering and routing. Document each rule's purpose and threshold rationale. This documentation is crucial for team onboarding and incident response, ensuring everyone understands why an alert fired and the appropriate remediation steps, such as restarting a service or checking network connectivity.

ALERTING PIPELINE

Step 3: Configure Alertmanager

Integrate Prometheus with Alertmanager to route, deduplicate, and dispatch notifications for your node's critical alerts.

Prometheus generates alerts, but Alertmanager is the dedicated service that handles them. It receives alerts from one or more Prometheus servers, performs grouping, inhibition, silencing, and routes them to the correct receiver like email, Slack, PagerDuty, or a webhook. This separation of concerns is a core design principle, allowing for sophisticated alert management independent of the metrics collection layer. You'll typically run Alertmanager as a separate service or container alongside your Prometheus instance.

Configuration is defined in an alertmanager.yml file. The primary sections are global for default settings, route for the routing tree, and receivers for notification integrations. A basic route defines a default receiver and can group alerts by label, such as severity or alertname. For example, grouping by cluster ensures all alerts from the same failing node are bundled into a single notification, preventing alert fatigue. The group_wait, group_interval, and repeat_interval parameters control timing and batching.

Here is a minimal alertmanager.yml example that routes alerts to a Slack webhook, grouping them by the alertname and instance labels:

yaml
global:
  slack_api_url: 'https://hooks.slack.com/services/...'
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'slack-notifications'
receivers:
- name: 'slack-notifications'
  slack_configs:
  - channel: '#node-alerts'
    title: '{{ .GroupLabels.alertname }} on {{ .GroupLabels.instance }}'
    text: '{{ range .Alerts }}{{ .Annotations.description }}{{ "\n" }}{{ end }}'

For production node monitoring, you should configure at least two critical receivers: a high-priority channel (like PagerDuty or SMS) for severity: critical alerts (e.g., node offline, consensus failure) and a low-priority channel (like Slack or email) for severity: warning alerts (e.g., high memory usage, peer count low). Use the match and match_re directives in your route tree to separate these flows. Alertmanager's inhibition rules can also suppress less important alerts when a critical one is firing, such as muting all disk-space warnings if the node is completely down.
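
Extending the route block shown above, a severity split might look like this sketch (the pagerduty-oncall receiver is a placeholder you would define under receivers; newer Alertmanager releases prefer matchers over match, but the idea is the same):

yaml
route:
  receiver: 'slack-notifications'      # default, low-priority path
  group_by: ['alertname', 'instance']
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-oncall'     # placeholder high-priority receiver
      group_wait: 10s
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'slack-notifications'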

After configuring the YAML file, start Alertmanager, usually with the command alertmanager --config.file=alertmanager.yml. You must then tell your Prometheus server where to send alerts by adding the alerting section to your prometheus.yml:

yaml
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093  # Default Alertmanager port

Finally, verify the integration by triggering a test alert from Prometheus and checking that it appears in the Alertmanager web UI (typically at http://localhost:9093) and is forwarded to your configured receiver.

ALERTING & MONITORING

Step 4: Build Grafana Dashboards

Transform raw node metrics into actionable visualizations and alerts to ensure operational reliability.

Grafana dashboards provide the visual interface for your monitoring stack, allowing you to query and display Prometheus metrics. A well-designed dashboard gives you a real-time, at-a-glance view of your node's health, including block production status, validator performance, network connectivity, and resource utilization. Start by installing Grafana on your monitoring server, connecting it to your Prometheus data source, and importing a foundational dashboard template specific to your consensus client (e.g., Lighthouse, Prysm, Teku).

Effective dashboards focus on key performance indicators (KPIs). Essential panels to create include: a graph of head_slot to track chain synchronization, a stat for validator_active status, a gauge for cpu_usage and memory_usage, and a graph of network_receive_bytes_total to monitor traffic. Use Grafana variables to make dashboards dynamic, allowing you to filter views by specific validator indices or node instances. Organize related metrics into logical rows and use color coding (green for healthy, red for critical) to make anomalies immediately visible.

The true power of Grafana lies in its alerting engine. You can configure alerts based on panel queries and send notifications to channels like Slack, Discord, or email. Critical alerts to set up include: validator_active == 0 (validator offline), head_slot increase < 1 over 2 epochs (sync stalled), memory_usage > 90%, and disk_free_bytes < 10GB. Configure alert rules with meaningful thresholds, evaluation intervals, and descriptive messages that include the affected validator index and the specific metric value.

For advanced monitoring, implement multi-level alerting. A warning alert for cpu_usage > 80% gives you an early heads-up, while a critical alert for cpu_usage > 95% triggers immediate action. Use Grafana's annotation feature to mark dashboard timelines with events like client upgrades, network hard forks, or manual restarts, providing crucial context during incident investigation. Regularly review and refine your dashboards and alerts based on false positives or missed incidents to improve signal-to-noise ratio.

To ensure resilience, consider running a dedicated Grafana instance separate from your validator nodes to avoid a single point of failure. Secure your Grafana instance with strong authentication, and use dashboard provisioning (YAML config files) to version-control and consistently deploy your dashboard setup across multiple monitoring environments. This creates a reproducible, professional-grade monitoring foundation for your staking operation.
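
A minimal provisioning sketch, assuming Grafana's standard provisioning directories and a Prometheus server at localhost:9090, pairs a data source file with a dashboard provider file:

yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true

# /etc/grafana/provisioning/dashboards/node-dashboards.yaml
apiVersion: 1
providers:
  - name: 'node-dashboards'
    folder: 'Node Monitoring'
    type: file
    options:
      path: /var/lib/grafana/dashboards   # JSON dashboard definitions stored in version control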

CORE METRICS

Critical Node Metrics to Monitor

Essential performance and health indicators for blockchain nodes, with typical critical thresholds for alerting.

| Metric | Description | Critical Threshold |
| --- | --- | --- |
| Block Height Lag | Number of blocks your node is behind the network tip | > 10 blocks |
| Peer Count | Number of active peer connections | < 10 peers |
| Memory Usage | RAM consumption by the node process | > 85% |
| CPU Utilization | Processor load from node operations | > 90% sustained |
| Disk I/O Latency | Average time for read/write operations | > 100 ms |
| Disk Space Free | Available storage on the data volume | < 20% |
| Validator Uptime (if applicable) | Percentage of signed blocks for validators | < 95% |
| P2P Inbound/Outbound Bandwidth | Network traffic to/from the node | > 90% of capacity |
| RPC/API Error Rate | Percentage of failed API requests | > 1% |
| Transaction Pool Size | Number of pending transactions in mempool | > 10,000 |
| Sync Status | Whether the node is fully synced with the chain | Not synced |
| Process Uptime | Time since the node process was last restarted | < 1 hour |

NODE ALERTING

Troubleshooting Common Issues

Common challenges and solutions for setting up reliable monitoring and alerting for blockchain nodes.

Missing metrics are typically a connectivity or configuration issue. First, verify Prometheus is scraping your node exporter correctly. Check the Prometheus server logs for errors and ensure the prometheus.yml configuration file has the correct target IP and port (e.g., localhost:9100 for node_exporter).

Common fixes:

  • Firewall Rules: Ensure the metrics port (e.g., 9100 for Node Exporter, 26660 for Cosmos/Tendermint, 6060 for Geth) is open on the node's firewall.
  • Target Status: In the Prometheus web UI (http://<your-server>:9090/targets), confirm the target is UP and not showing connection errors.
  • Service Health: Restart the node_exporter service: sudo systemctl restart node_exporter.
  • Grafana Data Source: In Grafana, verify the Prometheus data source is configured with the correct URL (e.g., http://localhost:9090).

NODE MONITORING

Frequently Asked Questions

Common questions and troubleshooting steps for setting up and managing node alerting systems.

What is the difference between a block height alert and a block production alert?

These alerts monitor different aspects of node health. A block height alert triggers when your node's latest block falls behind the network's canonical chain by a defined threshold (e.g., 10 blocks). It indicates your node has stalled or fallen out of sync.

A block production alert is specific to validator nodes on Proof-of-Stake networks like Ethereum, Solana, or Cosmos. It triggers when your validator misses an assigned slot to propose a block, which costs rewards and, on some networks, can contribute to jailing or slashing penalties. While a height alert means your node is offline or has stalled, a production alert means it is online but failing its core duty.
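
As an illustration of the first case, a hedged sketch of a height-lag rule, assuming your node exposes chain_head_block and a separate exporter publishes the network tip as network_head_block (a placeholder name):

yaml
groups:
  - name: faq_examples
    rules:
      - alert: BlockHeightLag
        # network_head_block is a placeholder for whatever reference feed you scrape.
        expr: network_head_block - chain_head_block > 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Node is {{ $value }} blocks behind the network tip"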