
How to Design a Node Data Archival and Pruning Policy

A technical guide for node operators to create a systematic policy for managing blockchain data growth, balancing storage costs with data availability.
INTRODUCTION


A strategic guide to managing blockchain node storage by balancing data availability, performance, and cost.

Running a full blockchain node requires significant and ever-growing storage. A data archival and pruning policy is a systematic plan for managing this data lifecycle. It defines what data to keep readily accessible, what to archive for later retrieval, and what to delete permanently. Without a policy, nodes can become bloated, leading to slower sync times, higher operational costs, and potential resource exhaustion. This guide outlines the key considerations and steps for designing an effective policy tailored to your node's purpose, whether it's for an RPC endpoint, an indexer, or personal use.

The first step is to define your node's operational requirements. Ask: What data does your application need immediate, low-latency access to? A DeFi indexer needs recent block data and event logs, while a block explorer might need full historical state. Next, consider your resource constraints: available disk space, budget for cloud storage, and acceptable sync/query performance. Finally, understand the data types on your chain: the blockchain itself (headers, transactions), the world state (account balances, contract storage), and receipts/event logs. Each has different importance and storage characteristics.

With requirements defined, you can choose a pruning mode. Most node clients, including Geth, Erigon, and Nethermind, offer several. Full archive nodes keep everything indefinitely. Pruned nodes delete old state data but keep all block bodies and receipts. Light clients only store block headers. For Ethereum, Geth's offline geth snapshot prune-state command can reduce a node from ~1TB+ to under 700GB by removing state trie nodes unreachable from the most recent 128 block states. The trade-off is the inability to query historical state without an archive service. Your policy should document the chosen mode and the exact client configuration flags.

Archival is the process of moving data you don't need on your primary disk to cheaper, slower storage. This is crucial for data you must retain but rarely access, like very old transaction histories for compliance. Design an archival pipeline: use tools to export data (e.g., parity db export for OpenEthereum, custom scripts for Erigon's etl), compress it, and upload it to cold storage like Amazon S3 Glacier or a dedicated archival service. Automate this process to run on a schedule, ensuring your primary disk never exceeds a defined threshold, such as 80% capacity.
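
As a concrete illustration, the sketch below wires the 80% threshold to an upload step. It assumes an already-exported data file, configured boto3 credentials, and placeholder paths and bucket names:

python
# Threshold-triggered archival sketch: compress an exported chain-data file
# and push it to Glacier when the data disk crosses 80% usage.
import gzip
import shutil
import boto3

DATA_DISK = '/var/lib/node'          # hypothetical node data mount
EXPORT_FILE = '/tmp/blocks-old.rlp'  # produced by your client's export tool
BUCKET = 'my-node-archive'           # hypothetical bucket name

def disk_usage_pct(path: str) -> float:
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100

if disk_usage_pct(DATA_DISK) > 80:
    # Compress the export before upload to cut cold-storage costs further.
    with open(EXPORT_FILE, 'rb') as src, gzip.open(EXPORT_FILE + '.gz', 'wb') as dst:
        shutil.copyfileobj(src, dst)
    boto3.client('s3').upload_file(
        EXPORT_FILE + '.gz', BUCKET, 'archive/blocks-old.rlp.gz',
        ExtraArgs={'StorageClass': 'DEEP_ARCHIVE'},  # Glacier Deep Archive tier
    )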

Your policy must be executable. Implement it through infrastructure-as-code using tools like Ansible, Terraform, or Docker Compose. For example, a Docker setup might include a cron job that runs a pruning script weekly and a health check that alerts you if disk usage exceeds 90%. For chains with state expiry (like a planned Ethereum upgrade), your policy must include procedures for accessing expired state from the archive. Test your archival and restoration process regularly; a backup you cannot restore is worthless. Document every step, script, and external service dependency clearly for your team.
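
A minimal health check along those lines, suitable for a cron schedule, might look like the following (the data directory and webhook URL are placeholders):

python
# Cron-friendly disk health check: alert via a webhook when usage exceeds 90%.
import json
import shutil
import urllib.request

DATA_DIR = '/var/lib/node'                   # hypothetical data directory
WEBHOOK = 'https://alerts.example.com/hook'  # hypothetical alert endpoint

usage = shutil.disk_usage(DATA_DIR)
pct = usage.used / usage.total * 100
if pct > 90:
    payload = json.dumps({'text': f'Node disk at {pct:.1f}% on {DATA_DIR}'}).encode()
    req = urllib.request.Request(WEBHOOK, data=payload,
                                 headers={'Content-Type': 'application/json'})
    urllib.request.urlopen(req)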

A good policy is not static. Monitor key metrics: disk usage growth rate, block processing speed, and RPC query latency. If performance degrades, revisit your pruning settings or archival frequency. Stay informed about client updates; new versions often introduce more efficient storage engines or pruning options. By treating data management as a core operational concern, you ensure your node remains reliable, cost-effective, and fit for its intended purpose over the long term.
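
A lightweight probe for two of those metrics, assuming a local JSON-RPC endpoint, could be as simple as:

python
# Spot-check RPC latency and block processing rate with web3.py.
import time
from web3 import Web3

w3 = Web3(Web3.HTTPProvider('http://localhost:8545'))

start = time.monotonic()
head = w3.eth.block_number                   # simple latency probe
latency_ms = (time.monotonic() - start) * 1000

time.sleep(60)
blocks_per_min = w3.eth.block_number - head  # crude processing-rate gauge

print(f'head={head} rpc_latency={latency_ms:.0f}ms blocks/min={blocks_per_min}')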

PREREQUISITES


A strategic approach to managing blockchain node storage by defining what data to keep, what to prune, and how to archive it for long-term access.

A node's data archival and pruning policy defines the rules for managing its local copy of the blockchain state and history. Unlike a full archival node that stores every block and state from genesis, a pruned node selectively deletes older data to conserve disk space, which is essential for running nodes on consumer hardware or in resource-constrained environments. The core trade-off is between storage footprint and historical data accessibility. A well-designed policy balances these needs, specifying parameters like the number of recent blocks to retain in full, the method for pruning state trie nodes, and the protocol for offloading pruned data to an external archive.

The first step is understanding your node's operational requirements. Ask: does your application need access to old transaction receipts, logs, or full state for arbitrary historical blocks? A DeFi indexer may need recent data only, while a block explorer or auditor requires full history. For Ethereum clients like Geth, you configure pruning via flags: --gcmode=archive for full storage, or --gcmode=full (typically combined with snap sync via --syncmode=snap) for state pruning, keeping only recent state. The --cache settings directly impact pruning performance and memory usage. Similarly, Cosmos SDK chains use application-specific pruning options set in app.toml, with strategies like default, nothing, everything, or custom.

Design your policy around concrete metrics. For a pruned Ethereum node, a common target is to keep the last 128 blocks of state (a "snapshot") and prune everything older. This requires about 550GB of SSD storage as of early 2025, compared to over 12TB for a full archive node. Your policy should define the pruning interval (how often garbage collection runs) and the archive trigger (e.g., automatically export pruned blocks to a separate service when disk usage exceeds 80%). Consider using clients like Erigon, with its more efficient "staged sync" and flat storage model, or Besu, with its Bonsai trie storage; each offers different pruning characteristics and storage footprints.
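
One way to keep those numbers concrete and machine-readable is to encode the policy as configuration your automation consumes; the values below are illustrative, not recommendations:

python
# Policy-as-data sketch: a small config object read by pruning/archival jobs.
from dataclasses import dataclass

@dataclass(frozen=True)
class PruningPolicy:
    recent_state_blocks: int    # state snapshots to keep hot
    prune_interval_blocks: int  # how often garbage collection runs
    archive_trigger_pct: int    # disk usage that triggers export to archive
    max_disk_gb: int            # hard capacity budget for chaindata

ETH_PRUNED = PruningPolicy(
    recent_state_blocks=128,
    prune_interval_blocks=10_000,
    archive_trigger_pct=80,
    max_disk_gb=1_000,
)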

Implementing the policy requires automation and monitoring. Use process managers like systemd or container orchestration to ensure your node restarts with correct pruning flags. Monitor key metrics: disk usage growth rate, chaindata directory size, and sync status. Set up alerts for when you approach storage limits. For archiving, you can configure your node to stream pruned blocks to decentralized storage like Arweave or Filecoin using services like Bundlr or Lighthouse, or to a centralized cloud bucket. The policy document should include the exact CLI commands, configuration file snippets, and cron jobs for these archival routines.

Finally, validate and test your policy in a staging environment before deploying to mainnet. Sync a testnet node from scratch with your proposed pruning settings and measure the final disk usage, sync time, and ability to serve historical queries. Use RPC calls like eth_getBlockByNumber for old blocks to ensure your archived data is retrievable. Remember that pruning is often irreversible; once state is deleted from a local node, recovering it requires a resync from genesis or a trusted archive. Your policy is a living document—review it quarterly as chain data growth and client software evolve to ensure it continues to meet your needs efficiently.
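
A small validation script in that spirit, assuming a local endpoint, probes a few historical heights and reports which ones the node can still serve:

python
# Post-pruning validation: probe historical heights for retrievability.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider('http://localhost:8545'))
head = w3.eth.block_number

for age in (100, 10_000, 1_000_000):
    height = max(head - age, 0)
    try:
        block = w3.eth.get_block(height)
        print(f'block {height}: OK ({len(block["transactions"])} txs)')
    except Exception as exc:  # pruned/missing data surfaces as an RPC error
        print(f'block {height}: unavailable ({exc})')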

KEY CONCEPTS FOR DATA LIFECYCLE


A strategic data lifecycle policy is critical for managing blockchain node storage, balancing performance, cost, and historical data access. This guide explains how to design an effective archival and pruning strategy.

Blockchain nodes accumulate vast amounts of data, including the full transaction history, state data, and consensus logs. Without management, this leads to unsustainable storage growth, slower synchronization times, and increased operational costs. A data lifecycle policy defines rules for which data to retain locally, which to prune, and which to archive for long-term storage. The primary goal is to maintain node health and performance while preserving the ability to serve the network and, if necessary, reconstruct the full historical chain. Key drivers for implementing a policy include the exponential growth of chain data (e.g., Ethereum's archive node requires over 12TB), the need for fast state sync for validators, and cost-effective cloud or hardware resource usage.

Designing a policy starts by classifying node data types and their access patterns. Block data (headers, transactions) is written once and read often for verification. State data (account balances, smart contract storage) is constantly updated and must be accessed with low latency for transaction execution. Historical state (the state at a past block) is rarely accessed but may be required for certain queries or dispute resolutions. Your policy must define retention periods for each class. For example, a full node might prune state older than 128 blocks (a common geth default) but keep all block data. An archive node retains everything indefinitely. A pruned node might only keep the most recent 10,000 blocks of all data to minimize footprint.

Pruning is the process of permanently deleting non-essential data from the active database. In Ethereum clients like Geth, you can enable pruning with flags like --gcmode=archive (no pruning) or --gcmode=full (state pruning). The command geth snapshot prune-state actively removes old state trie nodes. For Cosmos SDK chains, the pruning configuration in app.toml offers settings like default (keep the last 100 states plus every 500th state) or everything (prune all states except the current one). It's crucial to understand that pruning is often irreversible; pruned data can only be recovered by syncing from genesis or restoring from an archive. Therefore, pruning policies should be tested on testnets and aligned with your node's purpose: never prune a node that is serving historical RPC queries.

Archiving complements pruning by moving data to cheaper, cold storage before deletion. A robust archival strategy involves periodically exporting prunable data to files. For instance, you can use geth export to write a range of blocks to a file, then store it on AWS S3 Glacier or a local NAS. Erigon, for its part, already stores older headers and bodies as compressed snapshot segment files that are straightforward to copy off-box. The policy must specify the archival frequency (e.g., daily, or every 10,000 blocks), retention period (e.g., 1 year in cold storage), and a verification process (e.g., checksum validation) to ensure data integrity. Automate this pipeline with cron jobs or Kubernetes CronJobs to ensure consistency and reduce operational overhead.
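
For the checksum-validation piece, a hedged sketch might hash each exported file and write a manifest alongside it (the directory layout is an assumption):

python
# Integrity-manifest sketch: hash each archived export and record the digests
# so data can be verified after download.
import hashlib
import json
from pathlib import Path

ARCHIVE_DIR = Path('/archive/exports')  # hypothetical export directory

def sha256_of(path: Path) -> str:
    h = hashlib.sha256()
    with path.open('rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):  # 1 MiB chunks
            h.update(chunk)
    return h.hexdigest()

manifest = {p.name: sha256_of(p) for p in sorted(ARCHIVE_DIR.glob('*.gz'))}
(ARCHIVE_DIR / 'manifest.json').write_text(json.dumps(manifest, indent=2))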

Implement your policy by configuring your client and setting up automation. A basic implementation for a Geth execution client on a non-archive node includes the flags: --syncmode snap --gcmode full --txlookuplimit 2350000. This enables snapshot sync and state pruning, and limits the transaction index to roughly the last year of blocks (note that --txlookuplimit 0 indexes the entire chain). Monitor disk usage with tools like du and df, and set alerts for when usage exceeds 80% of capacity. For a comprehensive setup, use a stateful database like Pebble or RocksDB with tuned compaction filters to optimize pruning performance. Always document the recovery procedure: how to rebuild state from a pruned node using a trusted archive source, which is essential for disaster recovery and maintaining network trust.

Finally, validate and iterate on your policy. Measure its impact on key metrics: disk I/O, sync time, RPC query latency, and storage cost. Use a testnet or a mirrored mainnet environment to simulate long-term growth. Adjust retention windows based on actual query needs—if no one requests data older than 3 months, you can tighten the policy. Remember that network upgrades (like Ethereum's Dencun) or new client features can change data structures and pruning capabilities. Review and update your data lifecycle policy at least biannually to incorporate client improvements and evolving requirements, ensuring your node infrastructure remains efficient and reliable.

DATA LIFECYCLE

Blockchain Data Types and Retention Strategies

Comparison of storage characteristics and recommended retention policies for core blockchain data types.

Data Type | Size Impact | Access Frequency | Recommended Retention
--- | --- | --- | ---
Block Headers | ~80 bytes/block | High (consensus) | Full History
Transaction Data | ~250 bytes/tx avg | High (state exec) | Prune after finality
State Trie (World State) | GBs-TBs, grows with usage | Very High (every tx) | Latest snapshot only
Receipts & Logs | ~300 bytes/tx avg | Medium (indexing, queries) | Archive (external DB)
Historical State | Multi-TB over years | Low (audit, analytics) | Archive (cold storage)
Peer & Network Data | MBs per day | Low (debugging) | Rotate (7-30 days)
Consensus Messages (e.g., attestations) | Varies by protocol | Very Low post-finality | Prune aggressively

ARCHIVAL POLICY FOUNDATION

Step 1: Define Your Node's Service Level Objectives (SLOs)

Before configuring your node's data retention, you must define clear goals for what services it will provide. This determines your archival and pruning strategy.

A Service Level Objective (SLO) is a measurable target for the reliability or performance of a service your node offers. For archival policies, this translates to defining the historical data access you guarantee. Common SLO tiers include: Full Archive (all historical state and transactions), Full History (all blocks and receipts retained, but older state pruned), or Recent Data (only recent blocks and state, typically for validators). Your SLO directly dictates your storage requirements and the pruning flags you will use.

For example, an Ethereum node running an RPC endpoint for a block explorer requires a full archive to serve arbitrary historical queries. In contrast, a Cosmos SDK validator node might only need the last 100 blocks for consensus and can prune everything else. The Ethereum Execution Client Specifications and the Cosmos SDK documentation provide default pruning configurations aligned with different node roles. Define your SLOs in writing before proceeding.

Quantify your SLOs with specific metrics. Instead of "fast queries," define "95% of historical block queries return in under 2 seconds." Instead of "reliable state," commit to "zero unavailability of state for the last 10,000 blocks." These metrics will later help you choose between storage solutions like SSDs for performance versus HDD arrays for capacity, and configure parameters like --pruning, --pruning-keep-recent, and --pruning-interval on Cosmos SDK nodes, or the equivalent gcmode and pruning flags in clients like Geth and Erigon.
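
To make such an SLO testable, you can sample random historical heights and compute the observed 95th-percentile latency; a rough sketch, assuming a local endpoint:

python
# SLO spot-check: sample historical block queries and compute p95 latency.
import random
import time
from web3 import Web3

w3 = Web3(Web3.HTTPProvider('http://localhost:8545'))
head = w3.eth.block_number

samples = []
for _ in range(50):
    height = random.randint(0, head)
    start = time.monotonic()
    w3.eth.get_block(height)
    samples.append(time.monotonic() - start)

samples.sort()
p95 = samples[int(len(samples) * 0.95) - 1]
print(f'p95 historical query latency: {p95:.2f}s (SLO: < 2s)')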

NODE OPTIMIZATION

Step 2: Implement Pruning for State and History

A practical guide to designing and implementing a data pruning policy to manage your node's storage footprint without sacrificing essential functionality.

Pruning is the process of selectively deleting historical blockchain data that is no longer required for a node's core operations. Full archival nodes store the entire history, which for networks like Ethereum can exceed 15 TB. For most use cases, such as running an RPC endpoint, a validator, or a light client gateway, this is excessive. Pruning allows you to delete old block bodies and receipts while retaining the current world state and a recent window of block history, reducing storage needs by 80-95%. The key is to define a policy that balances storage savings with the data access needs of your applications.

The primary targets for pruning are block bodies (transaction lists) and receipts (transaction execution logs). These are large but often only needed for a limited historical lookback. Most clients allow you to specify a pruning retention window. For example, Geth's --gcmode flag supports archive and full modes, and the more granular --txlookuplimit keeps only the last N blocks of transaction indices. Erigon uses a staged sync paradigm where pruning is an integral, periodic operation. Besu offers --pruning-enabled with --pruning-blocks-retained to define the history window. A common starting policy is to retain the last 128,000 blocks (approx. 18 days on Ethereum) for transaction lookups.

Crucially, pruning must preserve the state trie. The state (account balances, contract storage, nonces) is essential for validating new blocks and processing transactions. Clients implement state pruning by traversing the Merkle Patricia Trie and removing nodes unreachable from the current state root. This is often a background process. When configuring your node, you must ensure state pruning is enabled (e.g., Geth's --gcmode=full) while setting appropriate cache sizes (--cache flags) to keep recent state data in memory for performance. Failing to prune state will lead to unbounded storage growth, even if block bodies are deleted.
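
Conceptually, state pruning is a mark-and-sweep over the trie; the toy sketch below shows the idea on an in-memory node store (real clients run this incrementally against a disk-backed database):

python
# Toy illustration of reachability-based state pruning: keep only trie nodes
# reachable from the current state root, delete the rest.
def prune_state(db: dict[bytes, list[bytes]], current_root: bytes) -> None:
    """db maps a node hash to the hashes of its child nodes."""
    reachable: set[bytes] = set()
    stack = [current_root]
    while stack:                       # mark phase: walk from the live root
        node = stack.pop()
        if node in reachable or node not in db:
            continue
        reachable.add(node)
        stack.extend(db[node])
    for node in list(db):              # sweep phase: drop unreachable nodes
        if node not in reachable:
            del db[node]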

Design your policy based on your node's role. A validator/consensus node needs recent blocks for attestations and a full current state, but can prune ancient history aggressively. An RPC node serving dApps may need a longer history window for eth_getLogs queries; analyze typical user requests to set --txlookuplimit. Use monitoring tools to track growth rates. Implement the policy via your client's startup flags or configuration file. After changes, perform a full resync with the new pruning flags for the cleanest result, as in-place pruning of an existing archive can be slow and I/O intensive.

Test your pruning configuration on a testnet or with a small retention window first. Verify core functions: new block processing, transaction sending, and historical queries for your defined window. Remember that pruned data is irrecoverable locally; services like Etherscan or dedicated archive nodes can fill gaps. For a robust setup, consider a hybrid architecture: a pruned primary node for low-latency operations paired with a separate, cost-optimized archive node (or use a service like Infura or Alchemy) for deep historical data needs, optimizing both performance and cost.

DATA LIFECYCLE MANAGEMENT

Step 3: Design an Archival Pipeline to Cold Storage

A systematic approach to moving historical blockchain data from live nodes to cost-effective, long-term storage while maintaining operational efficiency.

Node data archival is the process of selectively moving historical blockchain data from a node's active storage (like an SSD) to a separate, cheaper storage tier, often called cold storage. This is distinct from pruning, which permanently deletes data. The core challenge is designing a pipeline that balances cost, data accessibility, and the ability to rebuild or verify the chain. A typical archival policy might move all block data older than a certain epoch (e.g., 5,000 epochs on Ethereum) to an object storage service like AWS S3 Glacier or a decentralized network like Arweave or Filecoin, while keeping recent state and block headers on the live node for fast queries.

The architectural design involves several key components. First, you need a data extraction layer that can read and serialize historical blocks and state from your node client (Geth, Erigon, etc.) into a structured format like compressed .tar files or columnar formats like Parquet. Second, a storage abstraction layer is crucial to support multiple backends (S3, IPFS, local NAS) without changing core logic. Third, implement a verification and integrity layer to generate checksums (like SHA-256 hashes) for archived data and store them on-chain or in a manifest file, ensuring the data hasn't been corrupted during transfer or storage.

Here is a simplified conceptual workflow for an archival script using the Ethereum execution client geth and the RPC method debug_getBlockRlp to fetch raw block data:

python
# Sketch of block data extraction (assumes Geth's debug API is enabled and
# boto3 credentials are configured; the bucket name is a placeholder)
import gzip
import json
import boto3
from web3 import Web3

w3 = Web3(Web3.HTTPProvider('http://localhost:8545'))
s3 = boto3.client('s3')

BATCH_SIZE = 10_000
archive_cutoff_block = w3.eth.block_number - 100_000

# A production job would resume from the last archived height rather than
# restarting at block 1 on every run.
for batch_start in range(1, archive_cutoff_block, BATCH_SIZE):
    batch_end = min(batch_start + BATCH_SIZE, archive_cutoff_block)
    blocks = {}
    for block_num in range(batch_start, batch_end):
        resp = w3.provider.make_request('debug_getBlockRlp', [block_num])
        blocks[block_num] = resp['result']  # RLP-encoded block as a hex string
    payload = gzip.compress(json.dumps(blocks).encode())
    key = f'blocks/{batch_start}-{batch_end - 1}.json.gz'
    s3.put_object(Bucket='my-node-archive', Key=key, Body=payload,
                  StorageClass='DEEP_ARCHIVE')  # Glacier Deep Archive tier
    # Record the block range, object key, and payload hash in a database here.

This script would run periodically, extracting blocks beyond the cutoff, batching them, and pushing them to cold storage.

After archiving data, you must update your node's configuration to reflect that this historical data is no longer locally available. For Geth, you would use the --datadir.ancient flag to point to a directory containing the archived ancient chain data if you keep a local copy, or you would rely on external services for deep historical queries. The node will then operate in a pruned mode for local data but can still theoretically serve old data by fetching it from the cold storage pipeline, though with significant latency. It's critical to document the retrieval process and expected SLAs, as fetching from Glacier Deep Archive can take several hours.
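
Because Glacier-class objects must be restored before they can be downloaded, the retrieval runbook can include a step like the following boto3 sketch (bucket and key are placeholders):

python
# Retrieval sketch: request a restore of an archived batch from Glacier
# Deep Archive; bulk restores can take up to 48 hours.
import boto3

s3 = boto3.client('s3')
s3.restore_object(
    Bucket='my-node-archive',
    Key='blocks/1-10000.json.gz',
    RestoreRequest={
        'Days': 7,                                 # keep the copy warm for a week
        'GlacierJobParameters': {'Tier': 'Bulk'},  # cheapest, slowest tier
    },
)
# Poll s3.head_object until the Restore field reports ongoing-request="false",
# then download the object with s3.download_file as usual.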

Finally, establish a data lifecycle policy. Define clear triggers: archive blocks every 100,000 blocks or weekly. Set retention rules, perhaps keeping 1 year of data on warm storage and moving everything older to cold storage. Implement monitoring for the pipeline's health, tracking metrics like archival success rate, storage costs, and data retrieval times. This systematic approach transforms a growing, costly node dataset into a managed asset, drastically reducing operational expenses while preserving the full history of the chain for auditors, indexers, or future state inspections.

ARCHIVAL POLICY

Step 4: Optimize the RPC Layer for Hybrid Data

Design a node data archival and pruning policy to balance performance, cost, and data availability for your hybrid RPC infrastructure.

A data archival policy defines the rules for what blockchain data your node retains and for how long. This is critical for a hybrid RPC setup where you might serve recent data from a high-performance full node and historical data from a separate archival service. The core decision is the pruning window: the number of recent blocks you keep locally. For example, a Geth node can be configured with --gcmode=archive to keep everything, or the default --gcmode=full, which maintains a rolling window of recent state (roughly the last 128 blocks). The choice directly impacts your node's storage footprint, sync time, and the historical query latency you can support.

To design an effective policy, first analyze your application's data access patterns. Track RPC calls to identify the access frequency curve. Requests for the latest 1000 blocks might constitute 95% of your traffic, while older data is queried infrequently for analytics or dispute resolution. Your policy should match this: keep a hot cache of recent state (e.g., last 10k blocks) on fast SSDs for low-latency queries, while older data can be moved to a cold archive on cheaper object storage or delegated to a service like Chainnodes or QuickNode's Archive Plans. Implement tiered storage by using tools that can fetch missing historical data from your archive layer on-demand.
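
As a sketch of that analysis, the snippet below parses a request log (its format is an assumption) and measures how much traffic falls inside a candidate hot window:

python
# Access-pattern analysis: estimate the share of block queries that a given
# hot window would satisfy. Log format (one JSON object per line) is assumed.
import json

HOT_WINDOW = 10_000
head = 21_000_000  # chain head at analysis time (example value)

heights = []
with open('rpc-requests.log') as f:
    for line in f:
        entry = json.loads(line)
        if entry.get('method') != 'eth_getBlockByNumber':
            continue
        param = entry['params'][0]
        if isinstance(param, str) and param.startswith('0x'):
            heights.append(int(param, 16))  # skip tags like "latest"

if heights:
    hot = sum(1 for h in heights if head - h <= HOT_WINDOW)
    print(f'{hot / len(heights):.1%} of block queries hit the last {HOT_WINDOW} blocks')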

Your pruning strategy must also consider state trie access. Even pruned nodes must retain recent state to validate new blocks. However, accessing state from a very old block (pre-pruning window) requires the full historical state trie. If your applications need this, you must maintain a full archive node or integrate a service that provides one. For EVM chains, tools like Erigon's --prune flags offer granular control over pruning history, receipts, and call traces separately, allowing you to keep essential data like logs for longer than the full state.

Automate and monitor your policy. Use node client logs and metrics (e.g., eth_syncing, disk I/O) to alert when you're approaching storage limits. Script the archival process: once data ages beyond your hot window, compress and upload it to your cold storage. For retrieval, implement a fallback RPC endpoint in your infrastructure that routes historical queries (eth_getBlockByNumber, eth_getLogs for old blocks) to your archival service, ensuring seamless user experience. This hybrid approach optimizes costs while maintaining data availability guarantees.
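
The routing itself can be as simple as picking an endpoint by block age; a minimal sketch with placeholder endpoints:

python
# Hybrid routing sketch: serve recent blocks from the pruned hot node and
# delegate older heights to an archive endpoint.
from web3 import Web3

HOT = Web3(Web3.HTTPProvider('http://hot-node:8545'))              # pruned node
ARCHIVE = Web3(Web3.HTTPProvider('https://archive.example.com'))   # archive service
HOT_WINDOW = 10_000  # blocks served locally

def get_block(height: int):
    head = HOT.eth.block_number
    client = HOT if head - height <= HOT_WINDOW else ARCHIVE
    return client.eth.get_block(height)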

Finally, document and test your disaster recovery. Ensure you can rebuild your hot node from a snapshot and that your cold archive data is verifiable and accessible. A well-defined archival policy transforms your RPC layer from a costly, monolithic node into a scalable, cost-efficient system tailored to actual usage, which is essential for running reliable infrastructure in production.

CLIENT COMPARISON

Pruning Configuration for Major Ethereum Clients

Default and configurable data pruning options for Geth, Erigon, Nethermind, and Besu.

Pruning Feature | Geth | Erigon | Nethermind | Besu
--- | --- | --- | --- | ---
Prune Block History Flag | --prune.ancient | --prune.h.before | --Pruning.Mode=Hybrid | --pruning-enabled=true
Prune Mode Options | Full, Ancient Only | H, HC, HCT | Archive, Full, Memory | Archive, Full, Fast
Estimated Disk Space (Full Sync) | ~650 GB | ~1.2 TB | ~700 GB | ~800 GB
Manual Prune Command | geth snapshot prune-state | erigon prune | dotnet Nethermind.Runner --Pruning.Mode Full | besu --pruning-enabled=true --data-storage-format=BONSAI

NODE OPERATIONS

Frequently Asked Questions

Common questions and solutions for designing and managing blockchain node data archival and pruning policies.

What is the difference between pruning and archival, and which node type should I run?

Pruning is the process of permanently deleting historical blockchain data (like old transaction receipts or intermediate state tries) from a node's local storage to save disk space. Archival refers to maintaining a complete, unaltered copy of all historical data.

  • Full Archival Node: Stores the entire history, required for services like block explorers or indexers.
  • Pruned Node: Deletes data older than a configurable threshold (e.g., last 128 blocks).
  • Light Node: Stores only block headers, fetching other data on-demand from peers.

The choice impacts your node's resource footprint and functionality. Pruning can reduce a Geth node's disk usage from ~1TB to ~300GB, but you lose the ability to query old state directly.

IMPLEMENTATION STRATEGY

Conclusion and Next Steps

Designing an effective node data archival and pruning policy requires balancing performance, cost, and data availability. This section summarizes key takeaways and outlines practical steps for implementation.

A successful archival policy is defined by clear retention rules and pruning triggers. Your primary decisions involve what data to keep (e.g., full blocks, state tries, transaction receipts), for how long (based on contract finality or legal requirements), and the conditions for deletion (like disk usage thresholds or block height). For Ethereum execution clients like Geth, this is configured via flags such as --gcmode for garbage collection mode and --txlookuplimit to control historical transaction index retention. The goal is to maintain a node that serves your specific needs—whether for real-time RPC queries, block explorers, or light client support—without the bloat of indefinite full history.

Your next step is to implement and test the policy in a staging environment. Start by benchmarking your node's baseline performance and storage growth. Then, apply your chosen pruning configuration. For a Substrate-based chain, you might adjust the --state-pruning and --blocks-pruning options passed to the node at startup. After pruning, verify that essential chain operations, like block production for validators or historical data queries for RPC nodes, function correctly. Tools like Prometheus and Grafana are critical for monitoring key metrics: database size, sync status, and memory usage before and after the pruning cycle executes.

Finally, consider long-term data persistence strategies. For data you prune but may need later, establish an archival pipeline. This could involve syncing a separate, dedicated archival node that never prunes, or exporting pruned data to cheaper cold storage solutions like AWS S3 Glacier or Filecoin. Automate this process using scripts that leverage node RPC methods (e.g., debug_setHead for controlled state rollbacks in testing) or blockchain indexers like The Graph for structured querying of historical events. Regularly review and update your policy as network upgrades (like Ethereum's EIP-4444, which proposes historical data expiry) or your application's data requirements evolve.
