What are zkrollup proof system benchmarks?

A practical overview of zkrollup proof system benchmarks, covering metrics, trade-offs, and real-world performance comparisons for informed decision-making.

Understanding zkrollup proof system benchmarks: a practical overview

June 10, 2026 By Sasha Bishop

Introduction to zkrollup proof system benchmarks

Benchmarking zero-knowledge rollup proof systems has become a critical process for developers and infrastructure providers seeking to evaluate trade-offs between speed, cost, and security. As the Layer 2 ecosystem matures, understanding the practical metrics that define proof generation and verification performance is essential for deploying scalable applications without compromising on trust guarantees.

This article provides a neutral, fact-led examination of how zkrollup proof systems are benchmarked today, what the numbers actually mean in production environments, and why context matters when comparing disparate architectures such as Groth16, PLONK, and STARK-based provers.

Core benchmarks: proving time, verification cost, and proof size

The most frequently cited benchmarks in the zkrollup ecosystem fall into three categories: proving time (the computational effort required to generate a proof), verification cost (the gas or cycle overhead for on-chain verification), and proof size (the data footprint transmitted to the settlement layer). Each metric directly influences user experience and operational overhead.

Proving time varies dramatically across proof systems. For example, a Groth16 prover might generate a proof for a simple swap in under one second on consumer hardware, whereas a STARK prover handling the same circuit could take several seconds to minutes, depending on the implementation, field choice, and security parameters. However, Groth16 requires a circuit-specific trusted setup ceremony, which introduces a different kind of overhead and security assumption. PLONK-based systems strike a middle ground: their universal setup is performed once and reused across circuits, but proving is typically slower than Groth16 for the same circuit.

Verification cost is often the deciding factor for gas-sensitive deployments. A typical Groth16 verification on Ethereum costs around 200,000–300,000 gas, while PLONK verifiers range from 300,000 to 500,000 gas. STARK verification cost grows only polylogarithmically with circuit size, so it amortizes well over very large batches, but the baseline gas cost is typically higher than for pairing-based verifiers. Proof size also matters for data availability: a Groth16 proof is a few hundred bytes, a PLONK proof is around 1–2 kilobytes, and a STARK proof can range from tens to hundreds of kilobytes depending on the security parameter and circuit depth.
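The interaction between verification gas and proof size is easier to see with a rough cost model. The sketch below is illustrative only: it takes midpoints of the Groth16 and PLONK figures quoted above, assumes the proof is posted as calldata at 16 gas per byte (the EIP-2028 price for nonzero bytes, treating every byte as nonzero), and ignores the cost of posting the transaction data itself. The names ProofProfile and gas_per_tx are hypothetical.

```python
from dataclasses import dataclass

# Assumption: EIP-2028 nonzero-byte calldata price, with every byte treated as nonzero.
CALLDATA_GAS_PER_BYTE = 16


@dataclass(frozen=True)
class ProofProfile:
    name: str
    verify_gas: int   # on-chain verification gas per proof
    proof_bytes: int  # proof size posted to the settlement layer


def gas_per_tx(profile: ProofProfile, batch_size: int) -> float:
    """Amortized L1 gas per transaction when one proof covers `batch_size` transactions."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    total = profile.verify_gas + profile.proof_bytes * CALLDATA_GAS_PER_BYTE
    return total / batch_size


# Illustrative midpoints of the ranges quoted in the text, not measurements.
GROTH16 = ProofProfile("Groth16", verify_gas=250_000, proof_bytes=256)
PLONK = ProofProfile("PLONK", verify_gas=400_000, proof_bytes=1_536)

if __name__ == "__main__":
    for profile in (GROTH16, PLONK):
        for batch in (1, 100, 1_000):
            cost = gas_per_tx(profile, batch)
            print(f"{profile.name:8s} batch={batch:5d} gas/tx={cost:,.1f}")
```

Under these assumptions the fixed verification cost dominates at small batch sizes, and the per-transaction gap between the two systems shrinks quickly as the batch grows.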

When evaluating these numbers, it is important to recognize that benchmarks are often generated under ideal conditions: a single-threaded CPU, low circuit depth, and minimal witness generation overhead. Real-world throughput depends on parallelization, memory bandwidth, and network latency. Production zkrollup operators routinely use hardware accelerators (GPUs, FPGAs) to cut proving time, in some cases by orders of magnitude. Some layer-2 teams report that state-of-the-art GPU-based provers can generate proofs in under 100 milliseconds for common transaction batches, but these results are not always reproducible in smaller deployments.
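One way to avoid reading too much into a single idealized run is to time repeated runs after a warmup and report the spread. The following is a minimal sketch of such a harness; fake_prove is a hypothetical CPU-bound stand-in and would be replaced with a call into whichever prover is under test.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 3, runs: int = 10) -> dict:
    """Time `fn` after warmup runs and report median, min, and max in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):
        fn()  # discard warmup runs (caches, lazy initialization)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "median_s": statistics.median(samples),
        "min_s": min(samples),
        "max_s": max(samples),
        "runs": runs,
    }


def fake_prove() -> int:
    """Hypothetical stand-in for a prover call; not a real proof system."""
    return sum(i * i for i in range(50_000))


if __name__ == "__main__":
    print(benchmark(fake_prove))
```

Reporting the minimum and maximum alongside the median makes it harder to present a best-case run as typical performance.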

Comparing popular proof systems: Groth16 vs. PLONK vs. STARKs

No single proof system dominates across all benchmarks; each design prioritizes different trade-offs. Groth16 offers the smallest proof size and fastest verification, making it ideal for applications where on-chain gas cost is the primary constraint. However, its reliance on a circuit-specific trusted setup is a barrier for protocols that require frequent circuit upgrades or cannot coordinate ceremony participants.

PLONK's universal setup eliminates the need for per-circuit ceremonies, but its proving time is higher and its verification gas is moderately larger. This makes PLONK a strong candidate for zkrollups that support smart contract composability, where circuits change regularly. Several major zkrollup projects have adopted PLONK variants (e.g., HyperPlonk, Plonky2) to balance performance with flexibility.

STARKs (based on transparent, hash-based cryptography) avoid trusted setups entirely and scale well with circuit size, but their proof sizes remain large relative to elliptic-curve-based systems. For zkrollups processing large batches of transactions, STARKs can be more cost-effective overall when the cost of data availability is factored in. However, the computational overhead for proof generation on commodity hardware is higher, often requiring specialized prover clusters for acceptable latency.

Engineers should also consider the "proving overhead ratio"—the time required to generate a proof divided by the time to execute the underlying computation. In well-optimized circuits, this ratio ranges from 10x to 100x, meaning a transaction that executes in 1 millisecond could require 10–100 milliseconds of proving time. For real-time applications like payments or trading, this latency is acceptable, but for high-frequency use cases, it demands efficient batching and parallelization strategies.
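The ratio also gives a quick capacity estimate. The sketch below assumes, for simplicity, that proving work splits perfectly across provers and that batching yields no savings; both function names are hypothetical, and times are kept as integer microseconds so the arithmetic stays exact.

```python
def proving_overhead_ratio(prove_us: int, execute_us: int) -> float:
    """Proving time divided by native execution time (both in microseconds)."""
    if execute_us <= 0:
        raise ValueError("execute_us must be positive")
    return prove_us / execute_us


def provers_needed(tx_per_second: int, execute_us: int, ratio: int) -> int:
    """Minimum parallel provers needed so proving keeps pace with incoming load.

    Assumes proving work splits perfectly across provers, with no batching gains.
    """
    proving_us_per_second = tx_per_second * execute_us * ratio
    return -(-proving_us_per_second // 1_000_000)  # integer ceiling division


if __name__ == "__main__":
    # A 1 ms transaction that takes 100 ms to prove has a ratio of 100.
    print(proving_overhead_ratio(prove_us=100_000, execute_us=1_000))
    # Hypothetical load of 200 tx/s at that ratio.
    print(provers_needed(tx_per_second=200, execute_us=1_000, ratio=100))
```

At a hypothetical load of 200 transactions per second, a 1 millisecond execution time and a 100x ratio imply 20 seconds of proving work arriving every second, so at least 20 provers running in parallel.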

For detailed benchmark tables comparing these systems across multiple circuit sizes, developers should look for independent third-party reports that publish reproducible test environments.

Real-world performance considerations: hardware, batching, and network latency

Benchmark numbers published in academic papers or blog posts rarely reflect the chaotic reality of production zkrollup networks. Several confounding factors can dramatically alter perceived performance:

Hardware heterogeneity: A prover running on an NVIDIA A100 GPU will generate proofs 20–50x faster than one on a consumer-grade CPU. Cloud instances with dedicated accelerators are common among large operators but may be cost-prohibitive for smaller projects.
Batching strategy: Aggregating multiple transactions into a single circuit reduces per-tx proving overhead but increases total proof generation time and memory usage. Optimal batch sizes vary by circuit complexity and must be tuned empirically.
Network latency and data availability: The time to submit a proof to the Layer 1 chain and reach finality includes consensus delays, not just verification time. On Ethereum, inclusion takes at least one 12-second slot and finality typically takes about two epochs (roughly 13 minutes), dwarfing the sub-second verification time.
Proof aggregation: Recursive proofs (proofs verifying other proofs) allow zkrollups to compress many proofs into one, significantly reducing verification cost. However, each recursion layer adds proving time—typically 5–15% overhead per aggregation step.
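The aggregation trade-off in the last item can be sketched numerically. The model below is an assumption, not a measured result: it treats the per-step overhead as compounding on the base proving time, and it introduces a hypothetical fan_in parameter for how many proofs each recursion step combines.

```python
def aggregated_proving_time(base_seconds: float, layers: int,
                            overhead_per_layer: float = 0.10) -> float:
    """Proving time after `layers` recursion steps, assuming compounding overhead."""
    if layers < 0:
        raise ValueError("layers must be non-negative")
    return base_seconds * (1 + overhead_per_layer) ** layers


def proofs_covered(fan_in: int, layers: int) -> int:
    """Number of base proofs folded into one final proof (hypothetical fan-in tree)."""
    return fan_in ** layers


def verify_gas_per_base_proof(verify_gas: int, fan_in: int, layers: int) -> float:
    """On-chain verification gas amortized over every base proof in the tree."""
    return verify_gas / proofs_covered(fan_in, layers)


if __name__ == "__main__":
    for layers in range(4):
        t = aggregated_proving_time(10.0, layers)
        g = verify_gas_per_base_proof(300_000, fan_in=4, layers=layers)
        print(f"layers={layers} proving_time={t:.2f}s gas_per_base_proof={g:,.0f}")
```

With these illustrative inputs, proving time grows slowly with each layer while amortized verification gas falls geometrically, which is why recursion tends to pay off when on-chain cost matters more than latency.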

Metrics from public dashboards show that actual throughput for major zkrollups varies from 10 to 200 transactions per second (TPS), with peaks during periods of low congestion. These numbers are far below theoretical maximums because of the above constraints. Developers should treat published TPS figures as aspirational ceilings rather than guaranteed performance.

Another practical concern is the memory footprint of prover software. Many STARK prover implementations require hundreds of gigabytes of RAM for large circuits, because the execution trace and its polynomial evaluations must be held in memory. Dedicated proving nodes with large memory configurations are necessary, increasing infrastructure costs. In contrast, Groth16 provers for small circuits are lightweight enough to run on modest cloud instances, making them more accessible for early-stage rollups.

Security and trust assumptions: how benchmarks intersect with risk

Beneath the performance numbers lies a deeper consideration: the security model of each proof system directly affects the confidence developers can place in benchmark results. For example, a Groth16 proof that verifies quickly may still be a forgery if the trusted setup was compromised, because anyone who holds the setup's secret randomness can produce valid-looking proofs of false statements. Security benchmarks must therefore account for the quality of the setup phase, including the number of participants and the randomness generation process.

One overlooked benchmark is the "prover provenance" metric: how easily an operator can prove that the hardware and software used for proof generation were not tampered with. Some zkrollups now require remote attestation from a trusted execution environment (TEE), such as Intel SGX, to ensure that the prover executed the correct circuit, adding latency but reducing attack surface. This is especially relevant for financial applications where the value at risk is high.

For a more in-depth analysis of the security trade-offs between proof systems and how they have been exploited in practice, the resource on Zkrollup Proof System Security offers a comprehensive breakdown of real-world vulnerabilities and mitigations, including timing attacks and circuit soundness bugs.

Engineers should also account for soundness error, the probability that a false proof is accepted. Typical deployments target a soundness error of 2^-80 or 2^-128, but achieving this requires sufficiently large field sizes and properly generated random challenges. Lower-than-expected soundness can arise from poorly implemented Fiat-Shamir transforms or reused challenge values, both of which have been found in production systems. Independent benchmarks that include randomized negative testing (e.g., fuzzing the verifier with invalid proofs) are more trustworthy than those that only measure speed.
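A negative test of this kind can be sketched in a few lines. The example below mutates a valid proof one random bit at a time and counts how many mutants the verifier accepts. The toy_prove and toy_verify functions are hypothetical stand-ins built on an HMAC tag; they are not a zero-knowledge proof system and exist only so the harness runs end to end. In practice fuzz_verifier would wrap a real verifier.

```python
import hashlib
import hmac
import random
from typing import Callable

KEY = b"toy-key"  # hypothetical; stands in for real verifier parameters


def toy_prove(statement: bytes) -> bytes:
    """Toy stand-in for a prover: an HMAC tag over the statement (not a ZK proof)."""
    return hmac.new(KEY, statement, hashlib.sha256).digest()


def toy_verify(statement: bytes, proof: bytes) -> bool:
    """Toy stand-in for a verifier: accepts only the exact tag."""
    return hmac.compare_digest(toy_prove(statement), proof)


def flip_random_bit(data: bytes, rng: random.Random) -> bytes:
    """Return a copy of `data` with exactly one randomly chosen bit flipped."""
    index = rng.randrange(len(data) * 8)
    mutated = bytearray(data)
    mutated[index // 8] ^= 1 << (index % 8)
    return bytes(mutated)


def fuzz_verifier(verify: Callable[[bytes, bytes], bool], statement: bytes,
                  valid_proof: bytes, trials: int = 1000, seed: int = 0) -> int:
    """Return how many single-bit mutations of a valid proof were wrongly accepted."""
    rng = random.Random(seed)
    accepted = 0
    for _ in range(trials):
        if verify(statement, flip_random_bit(valid_proof, rng)):
            accepted += 1
    return accepted


if __name__ == "__main__":
    statement = b"transfer 10 tokens from A to B"
    proof = toy_prove(statement)
    print("valid proof accepted:", toy_verify(statement, proof))
    print("mutated proofs accepted:", fuzz_verifier(toy_verify, statement, proof))
```

A real proof system may accept more than one encoding of the same proof, so a nonzero count is not automatically a soundness bug, but it is a result that deserves investigation.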

Future directions: unifying benchmark standards

The industry is moving toward standardized benchmark suites that allow apples-to-apples comparisons across proof systems. Initiatives like the Zero-Knowledge Benchmarking Consortium (ZBC) aim to define common circuit templates (e.g., Merkle tree membership, Ed25519 signature verification, token transfer) and hardware profiles (e.g., cloud instance type, GPU model, memory limit). Early results from the ZBC indicate that for a representative payment circuit, PLONK proving time is approximately 300 ms on a standard GPU, while Groth16 achieves 80 ms, although the latter depends on a circuit-specific trusted setup.

A critical gap remains in unified cost modeling that accounts for Layer 1 data posting fees. A proof system with low verification gas but large proof size can be more expensive overall when calldata or blob space is priced high. Now that Ethereum has activated EIP-4844 (proto-danksharding), the balance may shift in favor of larger proofs that leverage discounted blob space rather than small proofs that consume expensive calldata. Benchmarks must evolve to reflect these changing economic realities.
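A unified cost model along these lines might look like the sketch below. It follows the paragraph's premise that proof data could be posted to blob space and does not model how a verifier contract would access that data. The blob size and blob gas constants come from EIP-4844. The calldata price assumes 16 gas per byte with every byte nonzero, the gas prices are hypothetical inputs in wei, and the model ignores field-element encoding overhead inside blobs.

```python
BLOB_BYTES = 131_072        # EIP-4844: 4096 field elements x 32 bytes per blob
GAS_PER_BLOB = 131_072      # EIP-4844: blob gas charged per blob
CALLDATA_GAS_PER_BYTE = 16  # assumption: every calldata byte is nonzero


def calldata_cost_wei(num_bytes: int, gas_price_wei: int) -> int:
    """Cost of posting `num_bytes` as calldata at the execution gas price."""
    return num_bytes * CALLDATA_GAS_PER_BYTE * gas_price_wei


def blob_cost_wei(num_bytes: int, blob_gas_price_wei: int) -> int:
    """Cost of posting `num_bytes` in blobs; whole blobs are charged."""
    blobs = -(-num_bytes // BLOB_BYTES)  # integer ceiling division
    return blobs * GAS_PER_BLOB * blob_gas_price_wei


def total_cost_wei(verify_gas: int, proof_bytes: int, gas_price_wei: int,
                   blob_gas_price_wei: int, use_blob: bool) -> int:
    """Verification cost plus data posting cost for a single proof."""
    if use_blob:
        data_cost = blob_cost_wei(proof_bytes, blob_gas_price_wei)
    else:
        data_cost = calldata_cost_wei(proof_bytes, gas_price_wei)
    return verify_gas * gas_price_wei + data_cost


if __name__ == "__main__":
    gas_price = 10 * 10**9  # hypothetical 10 gwei execution gas price
    blob_price = 1 * 10**9  # hypothetical 1 gwei blob gas price
    proof_bytes = 100_000   # illustrative large proof
    print("calldata:", calldata_cost_wei(proof_bytes, gas_price))
    print("blob:    ", blob_cost_wei(proof_bytes, blob_price))
```

Because whole blobs are charged, a proof that slightly exceeds one blob pays for two, so the break-even point depends on both price ratios and how well the proof fills the blob.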

Another emerging area is "universal verifier" benchmarks, comparing how different proof systems interact with the Ethereum Virtual Machine (EVM). Some verifiers use precompiled contracts (e.g., for pairing checks) that accelerate verification, while others rely on generic bytecode. Future Ethereum upgrades may introduce new precompiles that level the playing field between systems.

Until standardized benchmarks are widely adopted, developers should request full reproducibility data from any vendor proposing a zkrollup proof system, including exact hardware specs, software versions, and circuit sources. Without this, benchmark numbers are little more than marketing claims.
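As a starting point, a benchmark run can emit a small manifest alongside its results. The sketch below records only what the Python standard library can detect; the field names, the prover version string, and the circuit reference are hypothetical placeholders, and details such as GPU model or driver version would need to be added by hand or with vendor tooling.

```python
import json
import os
import platform
import sys


def build_manifest(circuit_source: str, prover_version: str) -> dict:
    """Collect basic hardware, software, and circuit details for a benchmark run."""
    return {
        "hardware": {
            "machine": platform.machine(),
            "processor": platform.processor(),
            "cpu_count": os.cpu_count(),
        },
        "software": {
            "os": platform.platform(),
            "python": sys.version.split()[0],
            "prover_version": prover_version,
        },
        "circuit_source": circuit_source,
    }


if __name__ == "__main__":
    # Both arguments are hypothetical placeholders, not real artifacts.
    manifest = build_manifest(
        circuit_source="example-circuit@<commit-hash>",
        prover_version="example-prover 0.0.0",
    )
    print(json.dumps(manifest, indent=2))
```

Publishing a file like this next to every reported number gives readers a way to check whether two results were produced under comparable conditions.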
