How to Scale Node Fleets

A technical guide for developers on scaling blockchain node infrastructure using automation, orchestration tools, and cloud-native patterns.
INFRASTRUCTURE

Introduction to Node Fleet Scaling

A guide to scaling blockchain node infrastructure for high-throughput applications, covering architecture patterns, automation, and best practices.

Node fleet scaling is the practice of managing multiple blockchain nodes as a unified, resilient infrastructure layer. Unlike running a single node for personal use, a fleet is designed for high availability, load balancing, and geographic distribution. This is critical for applications like indexers, RPC providers, MEV searchers, and exchanges that require consistent, low-latency access to blockchain data. The primary goal is to ensure that your application's uptime and performance are not dependent on the health of a single node instance.

A scalable node fleet architecture typically involves several key components. A load balancer (like Nginx or HAProxy) distributes incoming RPC requests across a pool of healthy nodes. Health checks continuously monitor node sync status and latency, automatically removing unhealthy instances from the pool. Synchronized node configurations are managed through infrastructure-as-code tools like Terraform or Ansible. For stateful chains, you must also plan for fast sync strategies and snapshot management to reduce the time needed to bring new nodes online.

Automation is non-negotiable for fleet management at scale. Use container orchestration with Docker and Kubernetes to deploy, update, and roll back node software consistently. Implement automated provisioning to spin up new nodes in different cloud regions or with different clients (e.g., Geth and Erigon for Ethereum) to increase client diversity. Monitoring stacks like Prometheus and Grafana are essential for tracking metrics such as block height lag, peer count, memory usage, and request error rates across the entire fleet.
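As a starting point for fleet-wide metrics, the sketch below shows a minimal Prometheus scrape configuration; the hostnames, port 6060, and the metrics path assume Geth nodes started with --metrics and should be adapted to your own hosts and clients.

```yaml
# prometheus.yml excerpt: scrape the built-in metrics endpoints of three Geth nodes.
# Hostnames, the port, and the metrics path are assumptions based on Geth defaults.
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: geth-fleet
    metrics_path: /debug/metrics/prometheus
    static_configs:
      - targets:
          - node-1.internal:6060
          - node-2.internal:6060
          - node-3.internal:6060
        labels:
          chain: ethereum-mainnet
```

In Kubernetes, the same result is usually achieved with service discovery (kubernetes_sd_configs) rather than a static target list.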

Consider the specific demands of the blockchain network. For high-throughput chains like Solana or Sui, node hardware requirements (CPU, RAM, SSD IOPS) are significant and directly impact scalability costs. For Ethereum after The Merge, you need to run both an execution client (e.g., Nethermind) and a consensus client (e.g., Lighthouse). Scaling this pair introduces complexity in keeping the two clients synchronized and coordinating their failover. Archive nodes, which store the full historical state, require far more storage and are often scaled separately from standard full nodes.
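To make the pairing concrete, here is a minimal docker-compose sketch running one Geth + Lighthouse pair that shares a JWT secret; the image tags, flags, the ./jwt.hex file, and the checkpoint-sync URL are assumptions to verify against current client documentation before use.

```yaml
# docker-compose.yml sketch: one execution/consensus pair sharing an engine-API JWT.
# Flags are abbreviated; ./jwt.hex is assumed to exist before startup.
services:
  geth:
    image: ethereum/client-go:stable
    command: >
      --http --http.addr=0.0.0.0
      --authrpc.addr=0.0.0.0 --authrpc.vhosts=*
      --authrpc.jwtsecret=/jwt/jwt.hex
    volumes:
      - geth-data:/root/.ethereum
      - ./jwt.hex:/jwt/jwt.hex:ro

  lighthouse:
    image: sigp/lighthouse:latest
    command: >
      lighthouse bn --network mainnet
      --execution-endpoint http://geth:8551
      --execution-jwt /jwt/jwt.hex
      --checkpoint-sync-url https://mainnet.checkpoint.sigp.io
    volumes:
      - lighthouse-data:/root/.lighthouse
      - ./jwt.hex:/jwt/jwt.hex:ro
    depends_on:
      - geth

volumes:
  geth-data:
  lighthouse-data:
```

Treat the pair as a single failover unit: if either container is unhealthy, the pair should be drained from the load balancer together.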

A successful scaling strategy also involves cost optimization and failure planning. Use a mix of cloud providers and bare-metal servers to avoid vendor lock-in and reduce latency. Implement graceful degradation so that if a primary chain experiences issues, your fleet can partially fail over to alternative data sources or chains. Finally, document your disaster recovery procedures, including how to rebuild the fleet from snapshots and how to verify data consistency across nodes after an incident.

PREREQUISITES FOR SCALING


Before you can scale a node fleet, you must establish a robust foundation. This guide outlines the core infrastructure and operational prerequisites for managing multiple blockchain nodes effectively.

Scaling a node fleet begins with infrastructure-as-code (IaC). You must define your node's configuration—including the client software (e.g., Geth, Erigon, Lighthouse), network settings, and security policies—as declarative code. Tools like Terraform, Ansible, or Pulumi allow you to version-control this setup and deploy identical nodes across multiple cloud providers or data centers. This ensures consistency, eliminates manual configuration drift, and enables you to spin up new nodes in minutes. Without IaC, managing a fleet becomes an error-prone, unscalable manual process.

You need a robust monitoring and alerting stack before adding nodes. Each node must expose metrics (e.g., via Prometheus) for block height, peer count, CPU/memory usage, and disk I/O. Centralize these metrics in a dashboard like Grafana. More critically, configure alerts for critical failures: a node falling behind the chain tip (head_slot lag), running out of disk space, or losing all peer connections. Setting this up for a single node first creates the template you will replicate across your entire fleet, turning operational data into actionable insights.
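A minimal alert-rules sketch is shown below; the metric names (beacon_head_slot, node_filesystem_avail_bytes, p2p_peers) and thresholds are assumptions based on common exporters and should be adjusted to the clients you actually run.

```yaml
# alert-rules.yml sketch: three starter alerts for a node fleet.
groups:
  - name: node-fleet
    rules:
      - alert: HeadNotAdvancing
        expr: increase(beacon_head_slot[5m]) == 0    # consensus head stuck
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Chain head has not advanced on {{ $labels.instance }}"
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes{mountpoint="/data"} / node_filesystem_size_bytes{mountpoint="/data"} < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }}"
      - alert: NoPeers
        expr: p2p_peers == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has lost all peer connections"
```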

Establish a secure and automated secret management system. Node operations require private keys, RPC endpoints, and API tokens. Hard-coding these into configuration files is a severe security risk. Instead, use a dedicated service like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault. Your deployment scripts should pull secrets dynamically at runtime. This practice not only secures your fleet but also simplifies secret rotation and access control, which are essential for maintaining security at scale.
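One pattern for pulling secrets at runtime is the Vault Agent Injector; the pod-template excerpt below is a sketch that assumes the injector is installed in your Kubernetes cluster, and the role name and secret path are placeholders.

```yaml
# Pod template excerpt: inject a JWT secret from Vault at runtime instead of
# baking it into the image. Role name and secret path are placeholders.
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/role: "node-fleet"
    vault.hashicorp.com/agent-inject-secret-jwt.hex: "kv/data/node-fleet/jwt"
spec:
  containers:
    - name: geth
      image: ethereum/client-go:stable
      args:
        - --authrpc.jwtsecret=/vault/secrets/jwt.hex   # file written by the Vault agent sidecar
```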

Plan your network topology and connectivity. A scalable fleet often spans multiple regions or cloud providers for redundancy. You must configure virtual private clouds (VPCs), subnets, and firewall rules to allow secure communication between nodes and clients. Consider using a private network like Chainscore's P2P Network to reduce reliance on public peer discovery and improve sync times. Ensure your nodes have stable, low-latency connections to blockchain networks and that your load balancers (if used) are configured for the specific RPC methods you will serve.

Finally, implement a continuous integration and deployment (CI/CD) pipeline. Automate the testing of configuration changes and client updates before they hit production. A typical pipeline builds a new machine image or container with the updated node client, deploys it to a single canary node, runs health checks, and only then rolls it out to the full fleet. This process, managed by tools like GitHub Actions, GitLab CI, or Jenkins, is the cornerstone of maintaining a healthy, up-to-date fleet without service disruption.
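A hedged GitHub Actions sketch of such a pipeline is below; the image name, deployment names, and health-check script are placeholders, and registry and cluster credentials are omitted for brevity.

```yaml
# .github/workflows/node-release.yml sketch: build, canary, verify, then full rollout.
name: node-client-release
on:
  push:
    branches: [main]

jobs:
  build-and-roll-out:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build and push node image
        run: |
          docker build -t registry.example.com/geth-node:${{ github.sha }} .
          docker push registry.example.com/geth-node:${{ github.sha }}

      - name: Deploy to canary node
        run: |
          kubectl set image deployment/geth-canary geth=registry.example.com/geth-node:${{ github.sha }}
          kubectl rollout status deployment/geth-canary --timeout=10m

      - name: Canary health check
        run: ./scripts/check-sync.sh geth-canary   # placeholder: fails if the canary lags the chain tip

      - name: Roll out to full fleet
        run: |
          kubectl set image deployment/geth-node geth=registry.example.com/geth-node:${{ github.sha }}
          kubectl rollout status deployment/geth-node --timeout=30m
```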

CORE SCALING STRATEGIES


Scaling a blockchain node fleet requires a systematic approach to handle increased load, ensure high availability, and maintain decentralization. This guide covers the key strategies for horizontal scaling, load balancing, and automation.

Horizontal scaling is the primary method for expanding node capacity. Instead of upgrading individual servers (vertical scaling), you add more nodes to the fleet. This approach improves redundancy and fault tolerance. For example, running multiple geth or erigon clients across different cloud regions or data centers distributes the network load and prevents a single point of failure. Tools like Kubernetes or Docker Swarm are essential for orchestrating these containerized node instances, allowing you to define the desired state of your fleet and automatically manage container deployment and health.
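A minimal Deployment sketch for such a pool is shown below; the image tag, flags, and replica count are placeholders, and persistent chain data is deliberately left out here (it is covered under stateful data management later in this guide).

```yaml
# deployment.yaml sketch: a pool of identical RPC nodes managed as one Deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: geth-node
spec:
  replicas: 3                 # scale the fleet by changing this single field
  selector:
    matchLabels:
      app: geth-node
  template:
    metadata:
      labels:
        app: geth-node
    spec:
      containers:
        - name: geth
          image: ethereum/client-go:stable
          args: ["--http", "--http.addr=0.0.0.0", "--http.api=eth,net,web3"]
          ports:
            - containerPort: 8545   # JSON-RPC
```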

Effective load balancing is critical for distributing requests across your node fleet. A common pattern involves placing a reverse proxy like Nginx or a dedicated load balancer (AWS ALB, Cloud Load Balancing) in front of your RPC endpoints. This balances the JSON-RPC query load and can route traffic based on health checks. For stateful operations requiring data consistency, implement session affinity (sticky sessions) so a user's requests are directed to the same node. For read-heavy workloads, you can configure the load balancer to direct eth_getBlockByNumber requests to a subset of nodes while reserving others for write operations like eth_sendRawTransaction.

Automation and Infrastructure as Code (IaC) are non-negotiable for managing a scalable fleet. Define your entire node infrastructure—virtual machines, security groups, and network configurations—using Terraform or Pulumi scripts. This ensures consistent, repeatable deployments. Implement a CI/CD pipeline to automate node client updates; when a new besu version is released, your pipeline can roll it out to canary nodes for testing before a full fleet deployment. Monitoring with Prometheus (for metrics like block sync status, peer count, memory usage) and alerting with Grafana or Alertmanager allows for proactive scaling decisions based on predefined thresholds.

Consider implementing a multi-cloud or hybrid architecture to mitigate provider risk and reduce latency. Deploying nodes across AWS, Google Cloud, and a bare-metal provider ensures your service remains online during a regional cloud outage. Use a global anycast network or a GeoDNS service to route end-users to the closest healthy node cluster, significantly improving response times for RPC calls. This strategy also aligns with blockchain's decentralized ethos by avoiding reliance on a single infrastructure vendor.

Finally, optimize individual node performance to get the most out of each instance. This includes using fast SSDs (NVMe) for chain data, tuning database settings (e.g., leveldb cache size for Geth), and selecting instance types with high network bandwidth. For archival nodes, consider separating the execution client from the consensus client and using a shared database or a read-replica setup. Regularly profile your nodes' resource usage to identify bottlenecks before they impact the entire fleet's performance.

NODE MANAGEMENT

Orchestration Tool Comparison

A comparison of leading tools for managing and scaling blockchain node fleets, focusing on core operational features.

Feature / Metric      Kubernetes          Docker Swarm      Nomad
Learning Curve        High                Low               Medium
Resource Overhead     ~300-500 MB/node    < 100 MB/node     ~100-200 MB/node

Other criteria to weigh when choosing between the three include container orchestration maturity, service discovery, horizontal auto-scaling, multi-cloud deployment, StatefulSet-style support for chain data, built-in load balancing, and native secret management.

NODE OPERATIONS

Step-by-Step: Scaling with Kubernetes

A practical guide to horizontally scaling blockchain node fleets using Kubernetes, covering strategies from manual pod scaling to full automation with the Horizontal Pod Autoscaler.

Scaling a node fleet horizontally involves adding or removing identical node instances (pods) to meet demand. In Kubernetes, this is managed through the Deployment resource. The spec.replicas field defines the desired number of running pods. To manually scale a deployment named geth-node to five instances, you use the command kubectl scale deployment/geth-node --replicas=5. Kubernetes' controller then works to match the actual state to this desired state, scheduling new pods across available worker nodes if resources permit. This approach is ideal for planned scaling events, like preparing for a network upgrade or a scheduled high-traffic period.

For dynamic, demand-based scaling, the Horizontal Pod Autoscaler (HPA) is essential. The HPA automatically adjusts the number of pods in a deployment based on observed CPU utilization, memory consumption, or custom metrics. A typical configuration for an RPC node might scale based on average CPU usage. You define a target, for example, 70% average CPU utilization across all pods. If usage consistently exceeds this threshold, the HPA increments the replica count. It uses the Metrics Server to gather resource data. Apply an HPA with a command like kubectl autoscale deployment/geth-node --cpu-percent=70 --min=3 --max=10.
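The same policy as the kubectl autoscale command above can be kept in version control as a manifest; a minimal sketch:

```yaml
# hpa.yaml sketch: keep geth-node between 3 and 10 replicas at ~70% average CPU.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: geth-node
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: geth-node
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```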

Effective scaling requires proper resource requests and limits. In your pod specification, resources.requests informs the scheduler about minimum needs (e.g., cpu: "2", memory: "4Gi"), while resources.limits (cpu: "4", memory: "8Gi") prevent a pod from consuming excessive cluster resources. Without these, the HPA cannot make accurate scaling decisions, and pods may be evicted or fail to schedule. For stateful nodes like archival Ethereum nodes, ensure persistent volume claims are correctly configured to allow new pods to attach to existing chain data, avoiding a full re-sync with each scale-up event.
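The container-spec excerpt below mirrors the request and limit values mentioned above; treat the numbers as starting points to tune against your own client's resource profile.

```yaml
# Container spec excerpt: requests guide scheduling and HPA math, limits cap usage.
containers:
  - name: geth
    image: ethereum/client-go:stable
    resources:
      requests:
        cpu: "2"
        memory: 4Gi
      limits:
        cpu: "4"
        memory: 8Gi
```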

Advanced scaling can leverage custom metrics via Prometheus and the Prometheus Adapter. This allows scaling based on application-level metrics such as eth_syncing status, pending transaction queue depth, or request latency. For instance, you can configure the HPA to scale up if the average HTTP request latency for JSON-RPC calls exceeds 500ms. This is more aligned with user experience than raw CPU usage. Setting this up involves installing the monitoring stack, exposing custom metrics, and configuring the adapter to make those metrics available to the Kubernetes API for the HPA to consume.
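With the adapter in place, the HPA metrics block can reference a per-pod metric instead of CPU; the sketch below assumes a hypothetical average-latency metric named jsonrpc_request_duration_seconds exposed through the Prometheus Adapter.

```yaml
# HPA metrics excerpt: scale on a per-pod custom metric rather than CPU.
metrics:
  - type: Pods
    pods:
      metric:
        name: jsonrpc_request_duration_seconds   # hypothetical metric name
      target:
        type: AverageValue
        averageValue: "500m"                      # 0.5s target average latency
```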

Always test scaling behavior in a staging environment. Use load testing tools like k6 or vegeta to simulate traffic spikes and verify that the HPA triggers scale-up events within a reasonable time (typically 30-60 seconds for metrics to propagate and pods to become ready). Similarly, test scale-down to ensure it doesn't occur too aggressively, which could interrupt service. Implement Pod Disruption Budgets (PDBs) to control the number of pods that can be down simultaneously during voluntary disruptions like node maintenance or cluster upgrades, ensuring high availability is maintained throughout the scaling process.
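A PodDisruptionBudget for the same deployment is a one-file addition; the sketch below keeps at least two pods serving during voluntary disruptions.

```yaml
# pdb.yaml sketch: never let voluntary disruptions take the pool below 2 ready pods.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: geth-node
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: geth-node
```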

SCALING NODE FLEETS

Infrastructure as Code with Terraform

A guide to managing and scaling blockchain node deployments using Terraform's declarative infrastructure management.

Managing a single blockchain node is straightforward, but operating a production-grade fleet introduces complexity. You must handle provisioning, configuration, security groups, load balancing, and monitoring across multiple cloud providers or regions. Infrastructure as Code (IaC) solves this by defining your infrastructure in version-controlled configuration files. Terraform, by HashiCorp, is the leading IaC tool that allows you to declare the desired state of your resources—like virtual machines, networks, and storage—in a human-readable format. This approach ensures consistency, enables collaboration, and provides a clear audit trail for all infrastructure changes.

The core of Terraform is the HashiCorp Configuration Language (HCL). A basic configuration for a node might define a cloud compute instance, its disk image, machine type, and firewall rules. Crucially, Terraform uses providers to interact with APIs from cloud platforms like AWS, Google Cloud, or Azure, as well as specialized services. For node operations, you would use the relevant cloud provider alongside the terraform-provider-ansible or similar tools to handle the subsequent software installation and systemd service configuration, creating a complete deployment pipeline from infrastructure to running software.
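For the configuration-management half of that pipeline, an Ansible playbook sketch like the one below can install and enable the node's systemd service once Terraform has provisioned the hosts; the geth.service.j2 template and the geth_nodes inventory group are assumptions.

```yaml
# playbook.yml sketch: configure Geth as a systemd service on freshly provisioned hosts.
- name: Configure Geth nodes
  hosts: geth_nodes
  become: true
  tasks:
    - name: Install the geth systemd unit
      ansible.builtin.template:
        src: geth.service.j2          # hypothetical unit template shipped with the playbook
        dest: /etc/systemd/system/geth.service
        mode: "0644"

    - name: Enable and start geth
      ansible.builtin.systemd:
        name: geth
        state: started
        enabled: true
        daemon_reload: true
```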

Scaling a node fleet efficiently requires leveraging Terraform's dynamic constructs. Instead of defining ten identical nodes individually, you use the count or for_each meta-arguments. These allow you to create multiple resource instances from a single block, parameterized by variables. For example, you can deploy nodes across multiple regions by iterating over a list of locations. This model makes it trivial to adjust fleet size: changing a single count value and running terraform apply will provision or destroy instances as needed, automatically reconciling the live infrastructure with your declared configuration.

State management is critical for team collaboration and safety. Terraform stores the mapping between your configuration and real-world resources in a state file. For individual use, this is local, but for team environments, you must use a remote backend like Terraform Cloud, AWS S3, or HashiCorp Consul. This backend locks the state during operations to prevent conflicts. A well-structured project uses Terraform modules to encapsulate reusable components—like a "blockchain-node" module—which can be versioned and shared across different environments (development, staging, production) ensuring identical base configurations.

Advanced scaling strategies involve auto-scaling groups (ASGs) on AWS or instance groups on GCP, which Terraform can configure. These services manage a pool of nodes, automatically adding or removing instances based on metrics like CPU load. Terraform defines the launch template and scaling policies, while the cloud service handles the runtime adjustments. For global distribution, you combine these with a load balancer resource (like an AWS ALB or NLB) that Terraform also provisions, distributing incoming RPC or P2P traffic evenly across your healthy node instances, increasing both capacity and fault tolerance.

To implement this, structure your project with clear separation: variables.tf for inputs like node count and instance type, outputs.tf for useful data like public IPs, and main.tf for core resources. Always run terraform plan to preview changes before applying. For node fleets, integrate monitoring from the start by having Terraform provision cloud monitoring dashboards and alert policies. This IaC approach transforms node fleet management from an error-prone, manual process into a reliable, automated, and scalable system.

MONITORING AND OBSERVABILITY


Scaling a node fleet requires robust tools for metrics, logs, and alerts. This guide covers essential tools for maintaining performance and reliability at scale.


Node-Specific Health Checks and Probes

Beyond system metrics, implement application-level health checks for blockchain-specific states.

  • Create readiness probes that check if the node is synced and has healthy peer connections.
  • Implement liveness probes that verify the node process is responding to RPC calls.
  • Monitor consensus participation for validator nodes (e.g., missed blocks in Cosmos, attestation effectiveness in Ethereum).

Use these checks in your orchestration system (Kubernetes, Docker Swarm) to automatically restart unhealthy nodes, as in the probe sketch below.
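This pod-spec excerpt is a hedged sketch; it assumes a Geth-style JSON-RPC endpoint on localhost:8545 and that curl is available in the node image (swap in wget or a small sidecar if it is not).

```yaml
# Pod spec excerpt: readiness gates on sync status, liveness on basic RPC responsiveness.
containers:
  - name: geth
    # image, args, resources omitted
    readinessProbe:
      exec:
        command:
          - sh
          - -c
          - |
            curl -sf -X POST -H 'Content-Type: application/json' \
              --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
              http://localhost:8545 | grep -q '"result":false'
      initialDelaySeconds: 60
      periodSeconds: 15
    livenessProbe:
      exec:
        command:
          - sh
          - -c
          - |
            curl -sf -X POST -H 'Content-Type: application/json' \
              --data '{"jsonrpc":"2.0","method":"web3_clientVersion","params":[],"id":1}' \
              http://localhost:8545
      initialDelaySeconds: 120
      periodSeconds: 30
```

Note that this readiness check also keeps a node out of the load-balanced pool during its initial sync, which is usually the desired behavior.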
Target uptime: 99.9%. Sync tolerance: < 2 blocks.
MANAGING STATEFUL NODE DATA


A guide to handling persistent data across distributed blockchain infrastructure.

Scaling a node fleet requires a strategy for managing stateful data—persistent information like the blockchain ledger, validator keys, and node configuration. Unlike stateless microservices, blockchain nodes maintain critical state that must be preserved across restarts, updates, and scaling events. The primary challenge is ensuring data consistency and high availability while adding or removing nodes. Common approaches include using network-attached storage (NAS), cloud block storage volumes, or distributed file systems like Ceph or GlusterFS to decouple compute from storage.

For automated scaling, you must implement a persistent volume (PV) and persistent volume claim (PVC) model, commonly used in Kubernetes. When a new node pod is scheduled, it should attach to an existing volume containing the synced chain data, rather than starting a fresh sync from genesis. This drastically reduces node provisioning time from days to minutes. Tools like the Rook operator can manage Ceph storage clusters directly within your Kubernetes environment, providing a cloud-native storage layer for your node data.

Data synchronization strategy is critical. For chains with large states (e.g., Ethereum's >1TB archive node), consider a tiered approach:

  • Hot nodes with full state for low-latency RPC
  • Warm nodes with pruned state for consensus
  • Node snapshots served from object storage (like S3) for bootstrapping

Automate snapshot creation and restoration using tools like geth snapshot or erigon snapshots. A StatefulSet in Kubernetes is the standard workload API for managing stateful applications, ensuring stable network identifiers and persistent storage.
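A StatefulSet sketch with a per-pod volume claim is shown below; the image tag, storage class, and volume size are placeholders, and a headless Service named geth is assumed to exist.

```yaml
# statefulset.yaml sketch: each replica keeps its own chain-data volume across restarts.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: geth
spec:
  serviceName: geth            # assumes a headless Service with this name
  replicas: 3
  selector:
    matchLabels:
      app: geth
  template:
    metadata:
      labels:
        app: geth
    spec:
      containers:
        - name: geth
          image: ethereum/client-go:stable
          volumeMounts:
            - name: chaindata
              mountPath: /root/.ethereum
  volumeClaimTemplates:
    - metadata:
        name: chaindata
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-nvme   # hypothetical storage class
        resources:
          requests:
            storage: 2Ti
```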

Implement robust backup and disaster recovery procedures. Even with redundant storage, regular, verified backups of validator signing keys and node configuration are non-negotiable. Automate encrypted backups to geographically separate object storage. For true resilience, design your fleet to survive the loss of an entire availability zone. This often means distributing nodes and their backing storage across multiple zones and using a load balancer or service mesh (like Istio) to direct traffic to healthy instances.

Monitor the performance and integrity of your storage layer as diligently as your nodes. Key metrics include I/O latency, throughput, volume usage, and storage class capacity. Set alerts for when a volume is nearing capacity or if I/O errors spike. For public RPC providers, implementing a cache layer (using Redis or a CDN) for common queries like eth_getBlockByNumber can significantly reduce read load on your primary stateful nodes, allowing a smaller fleet to handle more traffic.

MONTHLY COST COMPARISON

Cloud Provider Cost Analysis for Node Fleets

Estimated monthly costs for running a 100-node Ethereum validator fleet with 2 vCPUs, 8GB RAM, and 500GB storage per node.

Resource / Feature              AWS EC2      Google Cloud Compute Engine   Hetzner Cloud
Instance Type                   t3.large     e2-standard-2                 CPX31
Cost per Node (USD)             $61.64       $67.57                        $38.85
Estimated Fleet Cost (USD)      $6,164       $6,757                        $3,885
Egress Data Cost per GB         $0.09        $0.12                         $0.01
Block Storage (GB-Month)        $0.10        $0.17                         $0.04
Global Load Balancer (Hourly)   $0.025       $0.025

Sustained use and committed use discounts, where offered, can reduce these on-demand figures and are not reflected above.

NODE MANAGEMENT

Frequently Asked Questions

Common technical questions and solutions for scaling and managing blockchain node fleets, from infrastructure to automation.

Which node type should your fleet run? Understanding node types is crucial for fleet architecture.

  • Full Node: Stores the current blockchain state and recent blocks, validating new transactions. It's sufficient for most RPC services and dApp backends.
  • Archive Node: Contains the entire historical state from genesis. Essential for block explorers, complex analytics, and historical data queries. Requires significantly more storage (often 10TB+ for networks like Ethereum).
  • Validator Node (or Consensus Node): Actively participates in block production and consensus (e.g., staking ETH on Ethereum, running a CometBFT validator). Requires high availability, signing keys, and often a bonded stake.

Choose based on your application's needs: a full node for serving RPC, an archive node for data analysis, or a validator node for securing the network.

SCALING YOUR INFRASTRUCTURE

Conclusion and Next Steps

You've learned the core principles for scaling a node fleet. This section summarizes key takeaways and provides a roadmap for advanced optimization.

Successfully scaling a node fleet requires a balance of automation, monitoring, and architectural planning. The core principles covered include using infrastructure-as-code tools like Terraform or Pulumi for reproducible deployments, implementing a robust monitoring stack with Prometheus and Grafana, and designing for high availability across multiple cloud regions or providers. Automating node provisioning, syncing, and key management is non-negotiable for operating at scale. Remember, the goal is to move from manual, fragile processes to a resilient, self-healing system.

For your next steps, consider these advanced optimizations. First, implement load balancing for RPC endpoints using tools like Nginx or cloud-native load balancers to distribute traffic evenly and improve client performance. Second, explore state sharding techniques if you're running validator nodes for a single chain, splitting the validator set across different machine instances to reduce resource contention. Third, integrate alerting and automated remediation; for example, use Prometheus Alertmanager to trigger scripts that automatically restart failed nodes or provision replacements.

To deepen your expertise, engage with the following resources. Study the Kubernetes operators for blockchain nodes, such as the Chainlink Operator or Cosmos SDK's node deployment tools, to understand container-orchestrated scaling. Review the documentation for node clients you use (e.g., Geth, Erigon, Lighthouse) for their specific scaling and performance tuning guides. Finally, participate in infrastructure-focused forums like the EthStaker community or Discord channels for node operators to learn from peers tackling similar scaling challenges.