How to Compare Proof Systems Under Load
A practical guide to evaluating the performance and scalability characteristics of zero-knowledge proof systems under realistic computational workloads.
When evaluating a zero-knowledge proof system like zk-SNARKs, zk-STARKs, or Bulletproofs, theoretical benchmarks are insufficient. Real-world applications—such as scaling Ethereum with zkRollups, private transactions with Zcash, or confidential smart contracts—demand an understanding of performance under load. This involves measuring not just a single proof generation, but how the system behaves with concurrent requests, large datasets, and sustained operation. Key metrics shift from simple latency to throughput, resource utilization, and cost predictability.
The core components to benchmark are prover time, verifier time, and proof size. However, under load, you must also measure: memory footprint during parallel proof generation, CPU/GPU utilization scalability, I/O bottlenecks when reading large circuit inputs, and proof aggregation efficiency. For example, a Groth16 prover may be fast for a single proof but require a unique trusted setup per circuit, while a STARK prover might use more computational resources but offer post-quantum security and transparent setup. Tools like gnark, circom, and arkworks provide frameworks for creating test circuits.
To run a meaningful comparison, define a standardized workload circuit. This could be a Merkle tree inclusion proof, a signature verification, or a simple private transaction. Fix the circuit constraints (e.g., 10,000 gates) and input size. Then, using each proof system's libraries, measure: 1) Peak throughput: proofs generated per second on a high-core server. 2) Latency distribution: the 50th, 95th, and 99th percentile proof times. 3) Hardware cost: cloud compute expense per 1,000 proofs. Public benchmarks from organizations like the Ethereum Foundation and zkSecurity offer a starting point, but your specific circuit will determine the optimal system.
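As a concrete starting point, the sketch below drives a pool of worker processes and reports throughput plus the 50th/95th/99th percentile latencies described above. It is a minimal Python sketch, assuming a hypothetical `run_prover_once` wrapper around whichever prover you are testing.

```python
import statistics
import time
from concurrent.futures import ProcessPoolExecutor

def run_prover_once(_idx):
    """Placeholder: wrap your actual prover invocation here (SDK call or CLI)
    and return the wall-clock proving time in seconds."""
    start = time.perf_counter()
    # prove(...)  # hypothetical call into the system under test
    return time.perf_counter() - start

def load_profile(num_proofs=200, workers=8):
    """Generates proofs in parallel and summarizes throughput and latency."""
    wall_start = time.perf_counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        latencies = list(pool.map(run_prover_once, range(num_proofs)))
    wall_elapsed = time.perf_counter() - wall_start
    q = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        'throughput_proofs_per_sec': num_proofs / wall_elapsed,
        'p50_s': q[49], 'p95_s': q[94], 'p99_s': q[98],
    }

if __name__ == '__main__':
    print(load_profile())
```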
Consider the trade-offs specific to your use case. A validium (like StarkEx) prioritizing high TPS might favor a STARK prover on specialized hardware, accepting larger proof sizes. A client-side application needing small, fast verification (like a mobile wallet) might choose a SNARK with a small verifier circuit. Furthermore, monitor prover scalability: does doubling the computational resources halve the proof time? Some systems scale linearly with more cores, while others hit memory bandwidth limits. Document the trust assumptions (trusted setup, transparency) and recursion support, as these dramatically affect long-term system architecture.
Finally, establish a continuous benchmarking pipeline. Performance characteristics evolve with library updates (e.g., Halo2 improvements) and new hardware. Automate tests using scripts that pull the latest versions of plonk, Marlin, or other systems, run them against your workload, and log results to a dashboard. This data-driven approach allows you to make informed decisions about integrating a proof system into your protocol, ensuring it meets the scalability demands of production traffic without unexpected costs or bottlenecks.
Prerequisites
Before benchmarking zero-knowledge proof systems, you need to understand the core metrics and set up the right testing environment. This guide outlines the essential knowledge and tools required for a meaningful performance comparison.
To effectively compare proof systems like Groth16, PLONK, Halo2, or STARKs, you must first define the specific computational workload, or circuit. This circuit, representing your application's logic (e.g., a Merkle tree inclusion proof or a signature verification), is the constant variable across all tests. Performance is meaningless without a standardized benchmark target. You'll need proficiency with a domain-specific language (DSL) like Circom, Noir, or Leo to write these circuits, or the ability to work directly with the proving system's native framework (e.g., arkworks for Groth16).
The key performance metrics form the basis of any comparison. You must measure: proving time (the duration to generate a proof), verification time (the duration to check a proof), and proof size (the serialized byte length). Under load, you also need to track memory consumption and CPU utilization during proving, as some systems trade faster proving for higher hardware demands. Tools like perf on Linux or dedicated profiling libraries within the proof system's SDK are essential for gathering this data accurately and consistently across different runs.
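The snippet below is a minimal sketch of how these three core metrics can be captured in one record; `prove_fn` and `verify_fn` are hypothetical callables standing in for whatever API or CLI wrapper your chosen proof system exposes.

```python
import time
from dataclasses import dataclass

@dataclass
class ProofBenchmark:
    """One measurement for a single (proof system, circuit) pair."""
    system: str
    proving_time_s: float
    verification_time_s: float
    proof_size_bytes: int

def measure(system_name, prove_fn, verify_fn):
    """Times proving and verification and records the serialized proof size."""
    t0 = time.perf_counter()
    proof = prove_fn()                      # expected to return proof bytes
    proving_time = time.perf_counter() - t0

    t1 = time.perf_counter()
    assert verify_fn(proof), "proof was rejected by the verifier"
    verification_time = time.perf_counter() - t1

    return ProofBenchmark(system_name, proving_time, verification_time, len(proof))
```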
Your testing environment must be controlled and reproducible. Use a dedicated machine with consistent hardware (preferably a high-core-count CPU and ample RAM) and document all specifications. Performance can vary drastically between a laptop and a server. Containerize your setup using Docker to ensure dependency versions (like specific Rust or C++ toolchains) are identical for each proof system you test. This eliminates "works on my machine" variability and is critical for fair comparison.
Finally, understand the fundamental trade-offs between proof systems. Trusted setups (Groth16, PLONK) require a one-time ceremony but can offer smaller proofs and faster verification. Transparent setups (STARKs, Halo2's IPA variant) avoid this trust assumption but may have larger proof sizes. Recursive proof composition is another advanced feature that significantly impacts performance under load, allowing you to aggregate many proofs into one. Knowing these architectural differences will help you interpret your benchmark results in the context of your application's specific trust and scalability requirements.
Key Concepts for Load Testing
Understanding how different proof systems behave under high transaction volume is critical for building scalable, resilient applications. These concepts define the metrics and methodologies for effective comparison.
Defining the Load Test Methodology
A robust load test methodology is essential for objectively comparing the performance of different zero-knowledge proof systems under realistic conditions.
Load testing a proof system involves simulating sustained, high-volume transaction processing to measure its behavior under stress. The goal is to identify performance bottlenecks, resource consumption patterns, and failure points. Key metrics to track include proof generation time, verification time, CPU and memory usage, throughput (proofs per second), and latency. For a fair comparison, tests must use identical hardware, network conditions, and input data sets across all systems, such as gnark, arkworks, or circom.
The methodology should define a clear workload profile. This specifies the type and complexity of the computational statement being proven. For example, you might test a simple Merkle tree inclusion proof, a signature verification, or a complex zk-SNARK circuit for a private transaction. The workload should scale in difficulty—increasing the number of constraints or the size of the witness—to see how performance degrades. Tools like criterion for Rust or custom benchmarking harnesses are used to execute these tests and collect data.
Beyond raw speed, it's critical to measure resource efficiency. A system that generates proofs quickly but consumes 32GB of RAM may be unsuitable for a resource-constrained environment. Monitoring tools like perf on Linux or language-specific profilers help pinpoint hotspots in the proving algorithm. The test should also include edge cases, such as submitting malformed inputs or simulating network delays, to assess system stability and error handling under load.
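One way to capture the memory side of this, sketched below with the psutil library, is to poll the prover process (and its children) while it runs and keep the peak resident set size; sampling can miss very short spikes, so treat the result as a lower bound. The example prover command is hypothetical.

```python
import time
import psutil  # third-party: pip install psutil

def run_with_peak_rss(cmd, poll_interval=0.05):
    """Runs a prover command and returns its exit code plus peak RSS in MiB."""
    proc = psutil.Popen(cmd)
    peak_rss = 0
    while proc.poll() is None:
        try:
            rss = proc.memory_info().rss
            # Include any worker processes the prover spawns.
            rss += sum(c.memory_info().rss for c in proc.children(recursive=True))
            peak_rss = max(peak_rss, rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(poll_interval)
    return {'returncode': proc.returncode, 'peak_rss_mib': peak_rss / 2**20}

# Example (hypothetical prover binary and arguments):
# stats = run_with_peak_rss(['./prover', '--circuit', 'merkle_10k.r1cs'])
```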
Finally, results must be presented with context. A table comparing gnark (Groth16) and circom (PLONK) should note their different trusted setup requirements, proof sizes, and underlying cryptographic assumptions. The methodology should be repeatable, with all code and configuration published—for example, in a GitHub repository—to allow for peer verification. This rigorous approach moves beyond theoretical claims to provide actionable data for selecting a proof system.
Core Performance Metrics Comparison
Key quantitative and qualitative metrics for evaluating proof systems under high transaction load.
| Metric | zk-SNARKs (Groth16) | zk-STARKs | Plonk / Halo2 |
|---|---|---|---|
| Prover Time (10k tx) | ~45 sec | ~120 sec | ~60 sec |
| Verifier Time | < 10 ms | < 50 ms | < 20 ms |
| Proof Size | ~200 bytes | ~45-200 KB | ~400 bytes |
| Trusted Setup Required | Yes (circuit-specific) | No | Universal (Plonk-KZG); none for Halo2-IPA |
| Post-Quantum Security | No | Yes | No |
| Recursive Proof Support | Limited | Yes | Yes |
| Gas Cost for On-Chain Verify | ~500k gas | ~2.5M gas | ~800k gas |
| Memory Footprint (Prover) | ~4 GB | ~16 GB | ~8 GB |
Benchmarking in Practice
A practical guide to writing and running performance benchmarks for zero-knowledge proof systems using real-world metrics and code.
Benchmarking proof systems requires measuring performance under realistic conditions, not just theoretical peak throughput. Key metrics include prover time, verifier time, and proof size, but also memory consumption, CPU utilization, and throughput under concurrent requests. For a meaningful comparison, you must standardize the computational workload, often called a circuit. A common approach is to benchmark systems like Groth16, Plonk, and Halo2 on the same circuit, such as a SHA-256 hash or a Merkle tree inclusion proof, to isolate the performance of the proving system itself from the complexity of the application logic.
The benchmark setup must be reproducible. Use a dedicated machine or cloud instance with consistent specifications (e.g., AWS c6i.8xlarge). Isolate the process to minimize OS noise and run multiple iterations to account for variance. Tools like Criterion.rs for Rust or pytest-benchmark for Python can automate this. For example, a benchmark script might first compile the circuit into the proving system's specific format (like a .r1cs file for Groth16), then time the trusted setup phase (if applicable), followed by the proof generation and verification steps, capturing resource usage with time or psutil.
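For the Python side, a pytest-benchmark test can handle iteration counts and statistics for you; the sketch below assumes a hypothetical `generate_proof` wrapper around the prover under test.

```python
# test_prover_bench.py  (run with: pytest --benchmark-only)

def generate_proof():
    """Placeholder: invoke the prover under test here (SDK call or subprocess)."""
    pass

def test_proving_latency(benchmark):
    # The `benchmark` fixture from pytest-benchmark calls the function
    # repeatedly and reports mean, stddev, and percentile timings.
    benchmark(generate_proof)
```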
Here is a conceptual Python snippet that shells out to the snarkjs CLI to benchmark Groth16 proof generation. This example assumes you have pre-generated circuit artifacts (a compiled circuit .wasm and a final .zkey).
```python
import os
import subprocess
import time

def benchmark_groth16_prover(circuit_wasm, zkey_path, input_path,
                             proof_path='proof.json', public_path='public.json'):
    """Times Groth16 proof generation via the snarkjs CLI and records proof size."""
    # snarkjs groth16 fullprove <input.json> <circuit.wasm> <circuit_final.zkey> <proof.json> <public.json>
    cmd = ['snarkjs', 'groth16', 'fullprove',
           input_path, circuit_wasm, zkey_path, proof_path, public_path]
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.perf_counter() - start
    if result.returncode != 0:
        return {'error': result.stderr, 'success': False}
    proof_size = os.path.getsize(proof_path)  # serialized proof size in bytes
    return {'time': elapsed, 'proof_size_bytes': proof_size, 'success': True}
```
This function captures critical data: execution time and the serialized proof size.
To compare systems, you must port your circuit to each one. A zk-SNARK like Groth16 requires a trusted setup and offers constant-size proofs but slower proving. A zk-STARK has faster proving and no trusted setup but larger proof sizes. A benchmark suite should run the same logical circuit (e.g., "prove I know a preimage for this hash") across each system. Present results in a comparative table showing trade-offs: Prover Time (sec), Verifier Time (ms), Proof Size (KB), and Setup Time. This highlights the performance frontier—no single system optimizes for all metrics.
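A small helper like the following can render those per-system results as a Markdown table for the write-up; the dictionary keys and example values are placeholders, not output from any real run.

```python
def render_comparison_table(results):
    """Renders benchmark results as a Markdown comparison table.

    `results` maps a system name to a dict of measurements, e.g.
    {'Groth16': {'prover_s': 45.0, 'verifier_ms': 8.0, 'proof_kb': 0.2, 'setup_s': 600.0}}
    """
    lines = [
        '| System | Prover Time (sec) | Verifier Time (ms) | Proof Size (KB) | Setup Time (sec) |',
        '|---|---|---|---|---|',
    ]
    for name, r in results.items():
        lines.append(
            f"| {name} | {r['prover_s']:.1f} | {r['verifier_ms']:.1f} "
            f"| {r['proof_kb']:.2f} | {r['setup_s']:.1f} |"
        )
    return '\n'.join(lines)
```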
Finally, analyze the results in context. A system with a 2-minute prover time may be fine for a rollup batch but unusable for a wallet transaction. Consider hardware acceleration (GPU proving), recursive proof composition, and ongoing research like Nova and ProtoStar that aim to improve these trade-offs. Publish your benchmark code, circuit files, and raw data to ensure verifiability. Repositories like the ZKProof Community's benchmarking efforts provide standardized templates to build upon, ensuring your comparisons contribute to collective understanding rather than existing in isolation.
Proof System Comparison Under Load
A comparison of key performance and security metrics for popular proof systems under high transaction load.
| Metric | zk-SNARKs (Groth16) | zk-STARKs | Plonk |
|---|---|---|---|
| Prover Time (10k tx) | ~45 sec | ~120 sec | ~60 sec |
| Verifier Time | < 10 ms | < 100 ms | < 10 ms |
| Proof Size | ~200 bytes | ~45 KB | ~400 bytes |
| Trusted Setup Required | Yes (circuit-specific) | No | Yes (universal) |
| Post-Quantum Security | No | Yes | No |
| Recursion Support | Limited | Yes | Yes |
| Gas Cost (Ethereum Verify) | ~450k gas | ~2.5M gas | ~500k gas |
| Memory Load (Prover) | High | Very High | Medium |
Analyzing Results and Trade-offs
A systematic approach to evaluating proof system performance under realistic conditions, moving beyond theoretical peak speeds.
When comparing proof systems like zk-SNARKs (e.g., Groth16, Plonk), zk-STARKs, and Bulletproofs, raw proving time is a misleading metric. You must analyze results within a trade-off matrix that includes verification speed, proof size, trusted setup requirements, memory consumption, and pre-processing time. For instance, a Groth16 proof verifies in milliseconds with a tiny proof size but requires a circuit-specific trusted setup and has high memory overhead for large circuits. In contrast, a STARK proof has no trusted setup and faster prover times for complex computations, but generates proofs measured in hundreds of kilobytes, increasing on-chain verification costs.
Load testing reveals critical bottlenecks. Use a benchmarking framework like Criterion.rs (for Rust implementations) or custom scripts to measure performance under scaling factors. Systematically increase the number of constraints in your circuit or the size of the witness. Plot the results: proving time typically scales linearly or quasilinearly with constraint count, but memory usage can spike non-linearly, causing out-of-memory errors. For example, benchmarking a Merkle tree inclusion proof with 10,000 leaves versus 1,000,000 leaves will show if the prover's memory footprint grows linearly (O(n)) or has a higher complexity, which is crucial for decentralized prover networks.
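A sweep like the sketch below makes the scaling behaviour explicit by printing the growth ratio between consecutive circuit sizes; `run_benchmark` is a placeholder for your own harness and is assumed to return prover time and peak RSS for a given constraint count.

```python
def sweep_constraint_counts(run_benchmark, sizes=(2**10, 2**12, 2**14, 2**16, 2**18)):
    """Runs the benchmark at geometrically increasing circuit sizes and
    reports how prover time and peak memory grow between steps."""
    results = [{'constraints': n, **run_benchmark(n)} for n in sizes]
    for prev, cur in zip(results, results[1:]):
        size_ratio = cur['constraints'] / prev['constraints']
        print(f"{prev['constraints']} -> {cur['constraints']} constraints (x{size_ratio:.0f}): "
              f"prover time x{cur['prover_s'] / prev['prover_s']:.2f}, "
              f"peak RSS x{cur['peak_rss_mib'] / prev['peak_rss_mib']:.2f}")
    return results
```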
The trusted setup is a major architectural trade-off. Systems requiring a Powers of Tau ceremony (Groth16, Plonk-KZG) introduce a one-time coordination cost and a trust assumption, but they enable constant-size proofs and fast verification. Transparent systems like STARKs and Bulletproofs eliminate this trust but often have larger proofs or slower verification. Your application's threat model dictates the choice: a permissioned enterprise system might accept a trusted setup for efficiency, while a decentralized protocol likely requires transparency, accepting larger proof sizes as a trade-off.
Quantify the cost of verification, especially for on-chain use. Calculate the gas cost of verifying a proof on Ethereum for different systems. A Groth16 proof for a simple transaction might cost ~200k gas, while a STARK proof for the same logic could exceed 1M gas due to more on-chain computation. Use tools like snarkjs to generate gas estimates or test on a local fork. The trade-off is clear: faster, smaller proofs reduce end-user costs in L2 rollups, making prover efficiency secondary if the proof is verified thousands of times on-chain.
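For a quick sanity check on the data side of that cost, the helper below estimates only the calldata component of posting a proof, using EIP-2028 pricing (16 gas per nonzero byte, 4 per zero byte). Verification compute such as pairing checks is not included and usually dominates, so full costs should still be measured on a local fork.

```python
def calldata_gas(proof_bytes: bytes) -> int:
    """Lower-bound gas estimate for publishing `proof_bytes` as calldata."""
    nonzero = sum(1 for b in proof_bytes if b != 0)
    zero = len(proof_bytes) - nonzero
    return 16 * nonzero + 4 * zero

# Example: ~45 KB of mostly nonzero proof bytes costs roughly
# 45_000 * 16 ≈ 720k gas in calldata alone, before any verification compute.
```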
Finally, analyze hardware requirements and parallelism. GPU acceleration can speed up STARK and Halo2 proving by 10-50x but adds deployment complexity. Benchmark on different hardware (consumer CPU, server CPU, GPU). A system with excellent single-threaded performance might be ideal for a browser client, while a GPU-optimized prover is necessary for a high-throughput rollup sequencer. The optimal proof system is not universally fastest; it is the one whose specific trade-off profile—prover time, proof size, verification cost, and trust model—best aligns with your application's constraints and threat model.
Tools and Resources
Resources for comparing zero-knowledge proof systems under realistic load. Each tool focuses on measurable prover performance, memory behavior, and constraint efficiency rather than theoretical asymptotics.
Load Testing with Synthetic Circuits
Synthetic circuits allow controlled stress testing without application noise. They are essential for isolating how proof systems behave under extreme constraint counts.
Common synthetic patterns:
- Repeated hash chains using Poseidon or Keccak
- Wide arithmetic circuits with minimal lookups
- Memory-heavy circuits simulating Merkle path verification
Best practices:
- Increment constraints logarithmically rather than linearly
- Record both prover wall-clock time and peak memory
- Run on pinned CPU cores to reduce scheduler variance
These tests reveal nonlinear bottlenecks like FFT blowups or memory thrashing.
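A minimal sketch of two of the best practices above, assuming a Linux host and a hypothetical prover binary, is to space constraint counts by powers of two and pin the prover to fixed cores via CPU affinity:

```python
import os
import subprocess

# Constraint counts spaced logarithmically (powers of two), not linearly.
constraint_schedule = [2**k for k in range(10, 21, 2)]  # 1k .. ~1M constraints

def pinned_run(cmd, cores=(0, 1)):
    """Runs a prover command with CPU affinity pinned to `cores` (Linux-only)
    to reduce scheduler-induced variance between repeated runs."""
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.sched_setaffinity(0, cores),
        capture_output=True, text=True,
    )

# Example (hypothetical binary): pinned_run(['./prover', '--constraints', '65536'])
```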
Frequently Asked Questions
Common questions developers and researchers have when evaluating and benchmarking zero-knowledge proof systems under real-world conditions.
What metrics matter most when comparing proof systems under load?
Focus on the metrics that directly impact your application's user experience and operational costs. The key triad is proving time, verification time, and proof size.
- Proving Time: The time to generate a proof. This is often the main bottleneck for applications like zkRollups. It's measured in seconds or minutes and depends heavily on hardware (CPU/GPU).
- Verification Time: The time for a verifier to check a proof's validity. For on-chain applications, this determines the gas cost. It should be sub-second.
- Proof Size: The byte size of the generated proof. This is critical for blockchain applications where data publication (calldata) is expensive. Smaller proofs reduce L1 costs.
Secondary metrics include memory consumption during proving (RAM/VRAM), trusted setup requirements (circuit-specific, universal, or transparent), and recursion support for batching proofs.
Conclusion and Next Steps
This guide has provided a framework for evaluating proof systems under load, focusing on performance, cost, and security trade-offs. The next step is to apply these principles to your specific use case.
Comparing proof systems under load is not about finding a single "best" option, but identifying the optimal fit for your application's constraints. Key decision factors include:
- Throughput requirements (TPS or proofs per second)
- Latency tolerance (from proof generation to finality)
- Cost structure (on-chain verification gas vs. prover operational expense)
- Trust assumptions (validity vs. fraud proofs)
For a high-frequency DEX, a SNARK with fast prover times might be ideal, while a state bridge might prioritize the lower on-chain costs of a STARK.
To move from theory to practice, begin with a benchmarking suite using real-world transaction batches. Tools like gnark for SNARKs or starknet-rs for STARKs provide frameworks for load testing. Measure:
- Prover time scaling as witness size increases
- Memory footprint during peak load
- Verifier gas cost on your target L1/L2
Document how these metrics change when adjusting parameters like the size of the trusted setup (tau ceremony) for Groth16 or the proof recursion depth for PLONK.
The ecosystem evolves rapidly. Stay informed on emerging optimizations like custom gate sets in Halo2 for specific circuits, GPU acceleration for STARK provers, and proof aggregation techniques that batch multiple proofs. Follow research from teams like zkSecurity, which audits implementations, and 0xPARC, which explores new proving paradigms. Engaging with the community on forums like the ZKProof Standardization effort or EthResearch is crucial for long-term decision-making.
Finally, consider the operational lifecycle. A proof system choice dictates your infrastructure: managing a trusted setup for SNARKs, operating high-availability prover nodes for STARKs, or integrating with a proof marketplace like =nil; Foundation. Plan for key rotation, upgrade paths for circuit logic, and monitoring for prover performance degradation. Your evaluation should conclude with a concrete implementation plan, starting with a pilot on a testnet before full deployment.