How to Build a Network Congestion Forecasting Model
A practical guide to predicting blockchain network congestion using on-chain data, statistical models, and machine learning.
Network congestion forecasting is a critical tool for developers, traders, and users to anticipate high gas fees and slow transaction times. By analyzing historical and real-time on-chain data, you can build models to predict periods of peak demand on networks like Ethereum, Solana, or Arbitrum. This guide outlines the core components: data collection, feature engineering, model selection, and deployment. We'll focus on practical implementation using Python and publicly available blockchain data from sources like Chainscore APIs, Dune Analytics, and The Graph.
The first step is data collection. You need granular, historical data on network activity. Key metrics include: gas_used, gas_price, transaction_count, block_time, and pending_transactions. For Ethereum, you can fetch this via the Ethereum JSON-RPC API or aggregated datasets. It's crucial to collect data at a high frequency (e.g., per block or per minute) to capture volatility. Preprocess this data by handling missing values, normalizing features, and creating time-series lags. For example, the average gas price from the last 10 blocks is often a strong predictor of the next block's congestion level.
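A minimal sketch of this preprocessing step with pandas, assuming a per-block DataFrame `df` with `number` and `gas_price` columns (the column names are illustrative, not a fixed schema):

```python
import pandas as pd

def add_lag_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add simple time-series lags and a rolling mean; assumes one row per block."""
    df = df.sort_values("number").copy()
    # Handle occasional missing values (e.g., dropped RPC responses) before deriving lags
    df["gas_price"] = df["gas_price"].ffill()
    # Lagged gas price from 1 and 10 blocks ago
    df["gas_price_lag_1"] = df["gas_price"].shift(1)
    df["gas_price_lag_10"] = df["gas_price"].shift(10)
    # Average gas price over the last 10 blocks, a common congestion predictor
    df["gas_price_ma_10"] = df["gas_price"].rolling(window=10).mean()
    # Drop the first rows, which have no history yet
    return df.dropna()
```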
Next, engineer features that capture the underlying drivers of congestion. Beyond raw metrics, create derived features such as:
- Transaction fee momentum (rate of change in average gas price)
- Mempool pressure (size and composition of pending transactions)
- Dominant activity (percentage of transactions from popular dApps like Uniswap)
- Time-based features (hour of day, day of week)

These features help the model learn patterns, such as weekly cycles or spikes following major NFT mints. Using a library like pandas, you can efficiently calculate these rolling statistics on your historical dataset.
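As a hedged sketch of those derived features, building on the lag/rolling columns above and assuming additional `pending_tx_count` and Unix `timestamp` columns (names are illustrative):

```python
import pandas as pd

def add_derived_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive momentum, mempool-pressure, and calendar features from raw per-block metrics."""
    df = df.copy()
    # Fee momentum: percentage change of the 10-block average gas price
    df["fee_momentum"] = df["gas_price_ma_10"].pct_change()
    # Mempool pressure: change in the number of pending transactions between rows
    df["mempool_pressure"] = df["pending_tx_count"].diff()
    # Calendar features derived from the block timestamp (UTC seconds)
    ts = pd.to_datetime(df["timestamp"], unit="s", utc=True)
    df["hour_of_day"] = ts.dt.hour
    df["day_of_week"] = ts.dt.dayofweek
    return df
```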
For modeling, start with simpler, interpretable time-series models like ARIMA or Prophet to establish a baseline. These models are effective for capturing trends and seasonality. For more complex, non-linear patterns, gradient boosting frameworks like XGBoost or LightGBM often yield better performance. Frame the problem as a regression task predicting a future metric (e.g., average_gas_price_in_30_minutes). Split your data into training and test sets, being careful to respect time order to avoid look-ahead bias. Evaluate models using metrics like Mean Absolute Error (MAE) or Root Mean Squared Error (RMSE).
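A baseline sketch using Prophet, assuming a per-minute DataFrame `df` with a Unix `timestamp` and an `avg_gas_price_gwei` column (column names are illustrative); Prophet only needs a `ds` timestamp column and a `y` target column:

```python
import numpy as np
import pandas as pd
from prophet import Prophet

# Build Prophet's expected schema from the collected data
history = pd.DataFrame({
    "ds": pd.to_datetime(df["timestamp"], unit="s"),
    "y": df["avg_gas_price_gwei"],
})

# Time-ordered split: hold out the final day to avoid look-ahead bias
cutoff = history["ds"].max() - pd.Timedelta(days=1)
train = history[history["ds"] <= cutoff]
test = history[history["ds"] > cutoff]

model = Prophet(daily_seasonality=True, weekly_seasonality=True)
model.fit(train)

# Predict the held-out horizon and score with MAE
forecast = model.predict(test[["ds"]])
mae = np.mean(np.abs(forecast["yhat"].values - test["y"].values))
```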
To operationalize your model, build a pipeline that:
1. Fetches the latest on-chain data via an API
2. Runs the preprocessing and feature engineering steps
3. Generates a forecast using your trained model
4. Outputs a result (e.g., "High Congestion Expected in 15 Minutes")

You can deploy this as a scheduled script, a serverless function, or a live dashboard. For real-time predictions, consider using Chainscore's Gas API or Blocknative's Mempool API for low-latency data feeds. Always include a confidence interval in your forecast to communicate uncertainty.
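A hedged sketch of this loop as a scheduled script; the helpers (`get_block_data`, `add_lag_features`, `add_derived_features`, `load_trained_model`), the feature names, and the 50 gwei threshold are assumptions carried over from earlier sketches, not fixed interfaces:

```python
import time

FEATURE_COLUMNS = ["gas_price_ma_10", "fee_momentum", "hour_of_day"]  # illustrative
HIGH_CONGESTION_GWEI = 50  # illustrative alert threshold

def run_forecast_cycle(model):
    """One pipeline iteration: fetch -> featurize -> predict -> report."""
    raw = get_block_data(num_blocks=100)                    # data-collection helper
    features = add_derived_features(add_lag_features(raw))  # feature-engineering helpers
    latest = features.iloc[[-1]]                            # most recent feature row
    predicted_gwei = float(model.predict(latest[FEATURE_COLUMNS])[0])
    if predicted_gwei > HIGH_CONGESTION_GWEI:
        print(f"High congestion expected: ~{predicted_gwei:.1f} gwei in 15 minutes")
    else:
        print(f"Normal conditions expected: ~{predicted_gwei:.1f} gwei")

if __name__ == "__main__":
    model = load_trained_model()  # assumed helper that loads your fitted regressor
    while True:
        run_forecast_cycle(model)
        time.sleep(60)  # adjust the cadence to your forecast horizon
```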
Continuously monitor and retrain your model. Network dynamics change with protocol upgrades (like EIP-1559), new dApp launches, and shifting user behavior. Implement a monitoring system to track prediction error over time and trigger model retraining when performance degrades. By building a robust forecasting model, you can create applications for optimal transaction batching, dynamic fee estimation in wallets, or risk management for DeFi protocols. The complete code for a basic Ethereum gas price forecaster is available in our GitHub repository.
Prerequisites
Before constructing a network congestion forecasting model, you need specific data sources, technical tools, and a foundational understanding of blockchain mechanics.
The core requirement for any forecasting model is high-quality, granular data. You will need access to historical blockchain data, including block timestamps, gas prices, transaction counts, and pending transaction mempool states. Reliable data providers like The Graph for indexed on-chain data, Etherscan's API for Ethereum-specific metrics, or Blocknative's Mempool API for real-time transaction streams are essential starting points. For accurate forecasting, aim for data at the block-by-block or sub-minute level to capture rapid network state changes.
On the technical side, proficiency in a data science stack is non-negotiable. You should be comfortable with Python and libraries like Pandas for data manipulation, NumPy for numerical operations, and Scikit-learn or TensorFlow/PyTorch for building machine learning models. Familiarity with time-series analysis libraries such as Prophet or statsmodels is highly beneficial. You'll also need a development environment capable of handling large datasets, which may involve using Jupyter Notebooks for exploration and scripting for production pipelines.
A solid conceptual understanding of blockchain fundamentals is crucial to interpret the data correctly. You must understand how gas markets work on networks like Ethereum, including the concepts of base fee, priority fee, and how users bid for block space. Knowledge of common network stressors—such as the impact of a popular NFT mint, a major DeFi protocol launch, or arbitrage bot activity during market volatility—will help you identify features for your model. This domain expertise allows you to move beyond raw data analysis to meaningful prediction of congestion events.
Key Concepts for Congestion Modeling
Building a reliable forecasting model requires understanding the core data sources, methodologies, and tools used to analyze blockchain network activity.
Gas Fee Mechanics
Understanding fee markets is critical for prediction. Focus on:
- Base Fee: The algorithmically determined minimum fee per unit of gas, which adjusts per block based on network utilization.
- Priority Fee (Tip): The extra fee users pay to validators to prioritize their transaction within a block.
- Gas Used vs. Gas Limit: The relationship between block space consumption and capacity drives base fee volatility. Models must track the gas target (typically 15M for Ethereum) and how sustained usage above it triggers fee increases.
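To make the base fee mechanics concrete, here is the per-block EIP-1559 adjustment (the base fee moves by at most 1/8, i.e. 12.5%, per block toward the gas target), sketched in Python:

```python
BASE_FEE_MAX_CHANGE_DENOMINATOR = 8  # per EIP-1559

def next_base_fee(parent_base_fee: int, gas_used: int, gas_target: int = 15_000_000) -> int:
    """Approximate the next block's base fee (in wei) from the parent block."""
    if gas_used == gas_target:
        return parent_base_fee
    delta = abs(gas_used - gas_target)
    change = parent_base_fee * delta // gas_target // BASE_FEE_MAX_CHANGE_DENOMINATOR
    if gas_used > gas_target:
        return parent_base_fee + max(change, 1)  # fee rises when blocks are over target
    return parent_base_fee - change              # fee falls when blocks are under target

# Example: a completely full 30M-gas block pushes a 20 gwei base fee up by 12.5%
print(next_base_fee(parent_base_fee=20_000_000_000, gas_used=30_000_000))  # 22.5 gwei
```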
Time-Series Analysis & Feature Engineering
Transform raw data into predictive signals.
- Create rolling averages (e.g., 10-block average gas price) to smooth volatility.
- Calculate rate-of-change metrics for pending transactions and base fee.
- Engineer cyclical features for time-of-day and day-of-week patterns in network activity.
- Incorporate external features like major NFT mint schedules or DeFi protocol launch announcements that historically cause congestion.
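One way to sketch the cyclical encoding and external-event flags mentioned above (the event list, 30-minute window, and column names are illustrative placeholders, not real schedules):

```python
import numpy as np
import pandas as pd

def add_cyclical_and_event_features(df: pd.DataFrame, event_timestamps: list) -> pd.DataFrame:
    """Encode hour-of-day cyclically and flag blocks near known high-activity events."""
    df = df.copy()
    ts = pd.to_datetime(df["timestamp"], unit="s", utc=True)
    # Sine/cosine encoding keeps 23:00 and 00:00 numerically close
    hour = ts.dt.hour + ts.dt.minute / 60.0
    df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
    df["hour_cos"] = np.cos(2 * np.pi * hour / 24)
    # Flag blocks within 30 minutes of a scheduled event (e.g., a known NFT mint)
    events = pd.to_datetime(pd.Series(event_timestamps), utc=True)
    df["near_event"] = ts.apply(
        lambda t: int(any(abs((t - e).total_seconds()) <= 1800 for e in events))
    )
    return df
```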
Modeling Approaches
Different techniques suit different forecasting horizons.
- Statistical models (ARIMA, GARCH) are effective for short-term (next block, next hour) predictions based on recent history.
- Machine Learning models (LSTMs, Gradient Boosting) can capture complex, non-linear patterns from multiple features for medium-term forecasts.
- Simulation models use agent-based modeling to simulate user and validator behavior under different network load scenarios for stress testing.
Backtesting & Validation
Rigorously test your model against historical data.
- Use walk-forward validation, retraining the model on expanding windows of past data and testing on subsequent unseen data.
- Define clear evaluation metrics: Mean Absolute Error (MAE) for accuracy, Mean Absolute Percentage Error (MAPE) for relative error, and directional accuracy (did it correctly predict fee increases/decreases?).
- Compare your model's performance against simple benchmarks like a naive "last value" predictor.
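A minimal walk-forward validation sketch using scikit-learn's TimeSeriesSplit together with a naive last-value baseline; the feature matrix `X` and target `y` are assumed to be time-ordered NumPy arrays, and the gradient-boosting model is just an example:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

def walk_forward_evaluate(X: np.ndarray, y: np.ndarray, n_splits: int = 5):
    """Train on expanding windows of past data, test on the subsequent unseen slice."""
    model_maes, naive_maes = [], []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        model_maes.append(mean_absolute_error(y[test_idx], preds))
        # Naive benchmark: predict that each value equals the previous observation
        naive = np.roll(y, 1)[test_idx]
        naive_maes.append(mean_absolute_error(y[test_idx], naive))
    return np.mean(model_maes), np.mean(naive_maes)

# The model is only worth deploying if its MAE clearly beats the naive baseline
```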
Step 1: Data Collection and Sources
The accuracy of any predictive model is fundamentally limited by the quality and relevance of its input data. For forecasting Ethereum network congestion, this step involves systematically gathering on-chain and off-chain data that directly influences gas prices and block space demand.
Effective congestion forecasting requires a multi-faceted data approach. The primary source is historical on-chain data, which you can query from an archival node or a provider like Chainscore, Alchemy, or Infura. Essential datasets include historical gas prices (average, priority fee, max fee), block metrics (size, gas used, base fee), and transaction volumes categorized by type (e.g., DEX swaps, NFT mints, bridge transactions). This data reveals patterns in network demand and fee mechanics.
To build a predictive model, you must also incorporate leading indicators and external signals. This includes pending transaction pools (mempool data) to gauge immediate demand, major event calendars for scheduled NFT drops or token launches, and Layer 2 activity (such as Arbitrum or Optimism transaction bursts that can spill over to Mainnet). Real-time data feeds for these signals are available via specialized mempool APIs and blockchain analytics platforms.
For practical implementation, here's a conceptual Python snippet using the Web3.py library and a hypothetical mempool API to collect a foundational dataset:
```python
from web3 import Web3
import requests
import pandas as pd

# Connect to an Ethereum node
w3 = Web3(Web3.HTTPProvider('YOUR_RPC_ENDPOINT'))

# Fetch recent block data
def get_block_data(num_blocks=100):
    blocks = []
    latest = w3.eth.block_number
    for i in range(latest - num_blocks, latest):
        block = w3.eth.get_block(i, full_transactions=False)
        blocks.append({
            'number': block.number,
            'gasUsed': block.gasUsed,
            'gasLimit': block.gasLimit,
            'baseFeePerGas': block.get('baseFeePerGas', 0),
            'timestamp': block.timestamp
        })
    return pd.DataFrame(blocks)

# Fetch pending transactions from a mempool API (example)
def get_mempool_data(api_url):
    response = requests.get(f"{api_url}/pending-txs")
    return response.json()  # Returns list of pending transactions with gas prices
```
This code collects historical block statistics and real-time mempool states, forming the core of your time-series dataset.
Data collection is not a one-time task but a continuous pipeline. You must establish reliable ETL (Extract, Transform, Load) processes to ingest this data at regular intervals—every block or every minute for high-frequency models. The key is to structure your data with clear timestamps, normalize values (e.g., converting gas prices to gwei), and handle chain reorganizations gracefully to ensure a clean, consistent dataset for the modeling phase.
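A small sketch of that normalization step (wei to gwei, timestamp indexing) with a simple reorg guard that only ingests blocks a few confirmations behind the chain head; the safety depth is an assumption, and the column names come from the collection snippet above:

```python
import pandas as pd

WEI_PER_GWEI = 10**9
REORG_SAFETY_DEPTH = 5  # only ingest blocks at least this many confirmations deep

def normalize_block_rows(raw: pd.DataFrame, chain_head: int) -> pd.DataFrame:
    """Convert units, index by timestamp, and skip blocks that may still be reorged."""
    df = raw[raw["number"] <= chain_head - REORG_SAFETY_DEPTH].copy()
    df["base_fee_gwei"] = df["baseFeePerGas"] / WEI_PER_GWEI
    df["gas_used_ratio"] = df["gasUsed"] / df["gasLimit"]
    df.index = pd.to_datetime(df["timestamp"], unit="s", utc=True)
    return df.sort_index()
```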
Step 2: Feature Engineering and Target Definition
This step transforms raw blockchain data into predictive features and defines the target variable for forecasting network congestion.
Feature engineering is the process of creating informative input variables from raw on-chain and mempool data. For a congestion model, effective features capture the state and momentum of network demand. Key data sources include block headers, pending transaction pools, and historical gas prices. You should extract metrics like the count of pending transactions, average gas price in the mempool, and the gas used in recent blocks. These raw metrics form the foundation for more sophisticated derived features.
To build predictive power, create lagged and aggregated features. For example, calculate the moving average of the base fee over the last 10 blocks or the rate of change in pending transactions over the last 5 minutes. Incorporating time-based features, such as the hour of the day or day of the week, can capture recurring patterns in network activity. For Ethereum, monitoring base_fee_per_gas and the gas used ratio (gas used relative to the gas target) from previous blocks, as defined by EIP-1559, is essential for predicting fee market shifts.
Defining the target variable clearly is critical. Your model's goal dictates this definition. For a binary classification model predicting 'high congestion,' you might set a threshold, such as 1 if the next block's base fee increases by more than 10% and 0 otherwise. For a regression model forecasting the exact base fee, the target is the numeric value of the base fee in Gwei for a future block (e.g., block number N+5). The forecast horizon (e.g., 1 block, 10 blocks) should match your intended use case, like optimizing transaction timing.
Ensure your feature and target data are aligned on the same time index, typically block height or timestamp. A common mistake is data leakage, where future information inadvertently influences a feature. Always calculate features using only data available at the time of prediction. For instance, features for predicting block N must be computed using data strictly from blocks N-1 and earlier, plus the current mempool state.
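A hedged sketch of both target definitions with leakage-safe alignment; the 10% threshold mirrors the example above, and the `number` and `base_fee_gwei` column names are illustrative:

```python
import pandas as pd

def add_targets(df: pd.DataFrame, horizon_blocks: int = 5) -> pd.DataFrame:
    """Create regression and classification targets aligned so features use only past data."""
    df = df.sort_values("number").copy()
    # Regression target: base fee (gwei) `horizon_blocks` ahead of the current row
    df["target_base_fee_gwei"] = df["base_fee_gwei"].shift(-horizon_blocks)
    # Classification target: 1 if the next block's base fee rises by more than 10%
    next_fee = df["base_fee_gwei"].shift(-1)
    df["target_high_congestion"] = (next_fee > df["base_fee_gwei"] * 1.10).astype(int)
    # Drop rows at the end of the series that have no future value to learn from
    return df.dropna(subset=["target_base_fee_gwei"])
```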
Finally, perform exploratory data analysis (EDA) on your engineered dataset. Calculate correlations between features and the target to identify strong predictors. Visualize distributions to check for outliers—extremely high gas price spikes may need special handling. This analysis validates your feature choices and informs any necessary normalization or scaling before model training in the next step.
Model Algorithm Comparison
Comparison of common algorithms for predicting network congestion metrics like gas prices and block space demand.
| Algorithm / Metric | LSTM (Long Short-Term Memory) | Prophet | Gradient Boosting (XGBoost/LightGBM) |
|---|---|---|---|
| Best For | Capturing long-term temporal dependencies in sequential data | Time series with strong seasonality (hourly/daily patterns) | Tabular data with exogenous features (e.g., NFT mints, DEX volume) |
| Training Speed | Slow (requires GPU for efficiency) | Fast | Fast to Moderate |
| Interpretability | Low (black-box model) | High (trend/seasonality components are explicit) | Moderate (feature importance available) |
| Handles Exogenous Features | Yes (as additional input sequences) | Limited (via added regressors) | Yes (native tabular features) |
| Multivariate Forecasting | Yes | No (univariate target) | Yes (with multi-output wrappers) |
| Typical Prediction Error (MAPE) | 5-12% | 8-15% | 7-14% |
| Implementation Complexity | High | Low | Moderate |
| Real-time Inference Speed | < 100ms | < 50ms | < 20ms |
Model Training and Evaluation
This step transforms your prepared blockchain data into a predictive model. We'll cover training a time-series model, evaluating its performance, and interpreting the results to forecast network congestion.
With your feature-engineered dataset from Step 2, you can now train a forecasting model. For time-series problems like congestion prediction, models such as Long Short-Term Memory (LSTM) networks, Prophet, or gradient-boosted trees (XGBoost, LightGBM) are effective. The choice depends on your data's characteristics: LSTMs excel at capturing complex temporal dependencies in sequential data, while tree-based models are often faster to train and can handle tabular data with mixed feature types well. We'll use a simple LSTM example with PyTorch to demonstrate the core training loop.
Here is a basic code skeleton for an LSTM model training pipeline. This example assumes you have sequences of historical gas prices and pending transaction counts as input features (X_train) and a target variable like the gas price 10 blocks in the future (y_train).
```python
import torch
import torch.nn as nn

class CongestionLSTM(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        lstm_out, _ = self.lstm(x)
        predictions = self.linear(lstm_out[:, -1, :])  # Use last sequence step
        return predictions

# Model, loss, optimizer
model = CongestionLSTM(input_size=5, hidden_size=50, num_layers=2, output_size=1)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    output = model(X_train)
    loss = criterion(output, y_train)
    loss.backward()
    optimizer.step()
```
After training, you must rigorously evaluate the model's performance on unseen data (your test set). Key metrics for regression forecasting include:
- Mean Absolute Error (MAE): The average absolute difference between predictions and actuals, easy to interpret (e.g., "off by 12 Gwei on average").
- Root Mean Squared Error (RMSE): Penalizes larger errors more heavily, useful if being very wrong is costly.
- Mean Absolute Percentage Error (MAPE): Expresses error as a percentage of the actual value, helpful for understanding relative scale.

Always compare these metrics against a simple baseline model, like predicting the last observed value (naïve forecast). If your complex model doesn't significantly outperform the baseline, its added complexity may not be justified.
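A short sketch of computing these metrics against the naïve baseline; `y_true` and `y_pred` are assumed to be 1-D NumPy arrays of actuals and model predictions on the test set:

```python
import numpy as np

def forecast_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute MAE, RMSE, MAPE, and a naive last-value baseline MAE for comparison."""
    errors = y_pred - y_true
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    mape = np.mean(np.abs(errors / y_true)) * 100  # assumes no zero actuals
    # Naive baseline: predict that each value equals the previous observed value
    naive_mae = np.mean(np.abs(y_true[1:] - y_true[:-1]))
    return {"mae": mae, "rmse": rmse, "mape_pct": mape, "naive_mae": naive_mae}
```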
Beyond aggregate metrics, analyze residuals (prediction errors) to diagnose model weaknesses. Plot residuals over time: patterns or trends indicate the model is missing systematic behavior in the data. Also, examine performance during specific congestion events (e.g., a major NFT mint or DeFi exploit). Did the model anticipate the gas price spike? If it consistently fails during volatility, you may need to incorporate features that signal impending high-activity events, such as social media sentiment analysis or on-chain contract deployment spikes.
Finally, consider the operational context. A model used for automatic transaction fee bidding requires low latency and high-frequency predictions, favoring simpler, faster models. A model for strategic planning can afford longer training times and complexity. Continuously retrain your model with new data, as blockchain usage patterns evolve. Tools like Weights & Biases or MLflow can help track experiments, model versions, and performance degradation over time, which is crucial for maintaining a reliable forecasting system in production.
Deployment and Practical Considerations
Transitioning your network congestion model from a prototype to a reliable, production-grade service requires careful planning around infrastructure, monitoring, and cost management.
Deploying a forecasting model involves more than just running a script. You need a robust infrastructure stack. For a Python-based model, containerizing it with Docker ensures consistency across environments. Use a cloud service like AWS SageMaker, Google Cloud AI Platform, or a dedicated server with a FastAPI or Flask backend to serve predictions via a REST API. The API should accept parameters like chain_id, time_horizon, and return a structured JSON response with the forecasted base_fee_percentile, estimated_confirmation_time, and model confidence intervals. Implement health checks and rate limiting to manage load and prevent abuse.
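A minimal FastAPI sketch of such an endpoint; the response fields mirror those described above, while `predict_congestion` and its returned keys are assumed helpers wrapping your trained model, not a real library API:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Congestion Forecast API")

class ForecastResponse(BaseModel):
    chain_id: int
    time_horizon_minutes: int
    base_fee_percentile: float
    estimated_confirmation_time_s: float
    confidence_low: float
    confidence_high: float

@app.get("/forecast", response_model=ForecastResponse)
def forecast(chain_id: int = 1, time_horizon: int = 15) -> ForecastResponse:
    # `predict_congestion` is an assumed helper that runs the trained model
    prediction = predict_congestion(chain_id=chain_id, horizon_minutes=time_horizon)
    return ForecastResponse(
        chain_id=chain_id,
        time_horizon_minutes=time_horizon,
        base_fee_percentile=prediction["percentile"],
        estimated_confirmation_time_s=prediction["confirmation_s"],
        confidence_low=prediction["ci_low"],
        confidence_high=prediction["ci_high"],
    )
```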
Continuous monitoring is critical for maintaining model accuracy and system reliability. Implement logging for every prediction request and the model's output using a service like Datadog or Grafana. Track key performance indicators (KPIs) such as prediction latency, API error rates, and model drift. Model drift occurs when the statistical properties of live blockchain data (e.g., gas price distributions post-EIP-1559) change, degrading forecast performance. Set up alerts for when prediction errors exceed a defined threshold, triggering a retraining pipeline. This pipeline should automatically fetch new historical data, retrain the model, and deploy the updated version using a CI/CD workflow.
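A hedged sketch of such a drift check, comparing rolling prediction error against a threshold and flagging retraining; the window size and threshold are illustrative assumptions:

```python
from collections import deque

class DriftMonitor:
    """Track recent absolute errors and flag retraining when the rolling MAE degrades."""

    def __init__(self, window: int = 500, mae_threshold_gwei: float = 10.0):
        self.errors = deque(maxlen=window)
        self.mae_threshold = mae_threshold_gwei

    def record(self, predicted_gwei: float, actual_gwei: float) -> None:
        self.errors.append(abs(predicted_gwei - actual_gwei))

    def should_retrain(self) -> bool:
        if len(self.errors) < self.errors.maxlen:
            return False  # wait for a full window before judging drift
        rolling_mae = sum(self.errors) / len(self.errors)
        return rolling_mae > self.mae_threshold
```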
For real-time predictions, your data pipeline must be efficient and cost-effective. Continuously pulling data from a node provider like Alchemy or Infura incurs RPC call costs. Optimize by subscribing to specific events (like new blocks) via WebSockets instead of polling. Cache frequent or computationally expensive queries, such as historical fee percentiles for the last 100 blocks. Consider the trade-off between prediction frequency and utility; updating forecasts every block may be overkill for user applications that only check every few minutes. Estimate and monitor your monthly infrastructure and data provider costs to ensure the service is sustainable.
Finally, integrate your forecasting service into practical applications. A Discord bot can alert a community when congestion is predicted to spike. A browser extension could display real-time fee estimates on Etherscan. For dApp developers, provide an SDK or simple library that abstracts the API calls. Document your API thoroughly with OpenAPI/Swagger and include code snippets for popular languages. Always include clear disclaimers that predictions are probabilistic estimates, not guarantees, to manage user expectations. The ultimate goal is to create a reliable, maintainable service that provides actionable intelligence for navigating network congestion.
Tools and Resources
These tools and datasets are commonly used to build network congestion forecasting models for blockchains like Ethereum, L2s, and high-throughput L1s, covering the concrete steps of data collection, feature engineering, modeling, and evaluation.
Frequently Asked Questions
Common questions and technical solutions for developers building predictive models for blockchain network congestion.
What data sources are needed for a congestion forecasting model?
Effective models rely on a combination of on-chain and off-chain data. Key sources include:
- On-chain Metrics: Pending transaction pool (mempool) size, average gas prices, block utilization percentage, and transaction failure rates.
- Network Activity: Transaction count per block, active wallet addresses, and contract interaction frequency.
- External Events: Major NFT mint schedules, token launch times (often found via social sentiment analysis), and scheduled protocol upgrades.
- Historical Data: Past congestion patterns correlated with time of day, day of week, and specific DeFi activity cycles.
For example, an Ethereum model might ingest mempool data from an Alchemy or Infura node, combine it with Dune Analytics queries for historical trends, and use a Twitter/X API stream to flag known upcoming events.
Conclusion and Next Steps
You now have the core components to build a functional network congestion forecasting model. This guide has covered data sourcing, feature engineering, and model training. The next steps involve production deployment and continuous improvement.
To operationalize your model, you need a reliable data pipeline. Use services like Chainscore's Block Feed API for real-time mempool data and The Graph for historical on-chain metrics. Implement a scheduler (e.g., using Celery or AWS Lambda) to run your feature engineering and prediction scripts at regular intervals, such as every block or every minute. Store predictions in a time-series database like InfluxDB or TimescaleDB for easy querying by your applications.
Integrate the forecast into user-facing products to provide tangible value. For a wallet, you could display a "Recommended Gas Price" based on the predicted congestion level for the next 5 blocks. A DeFi protocol could use the forecast to schedule low-priority treasury operations during predicted low-congestion periods. Monitor your model's performance by tracking metrics like Mean Absolute Error (MAE) against actual base fee or inclusion times, and set up alerts for significant prediction drift.
This model is a starting point. To improve accuracy, explore more sophisticated techniques. Incorporate macro-level features like overall NFT mint activity or major protocol launch calendars. Experiment with different model architectures; Gradient Boosting (XGBoost, LightGBM) often outperforms Random Forests for tabular data, while LSTM networks can capture longer-term temporal dependencies. Finally, consider building ensemble models that combine predictions from multiple approaches for greater robustness.