Because ZK-Rollups, circuits, and SNARKs are complex and not yet mature, a multi-prover approach can hedge the risk of bugs and vulnerabilities in proving systems, architectures, and implementations. A multi-prover setup can be implemented using different types of proofs (e.g., validity proofs and fraud proofs), different proving systems (e.g., SNARK and STARK), and different implementations by different teams.
In the first part of this article, we cover the significance of the multi-prover approach and briefly explain what, besides the proving system, can be diversified. In the second part, we shed light on what SGX is (one kind of validity proof), how it works, and why it’s a reliable complement to existing cryptographic and economic systems. In the third part, we briefly describe how multi-prover works for nodes and smart contracts.
Why multi-proof matters
What is SGX, how it works, what are its pros and cons
What is SGX
High-level overview of SGX components
How SGX works
SGX doubts, constraints, and weak points
Existing blockchain implementations
Applying SGX to rollups
How multi-prover works in a rollup setting
Code-risk problem: circuits are complicated and bulky, especially ZK-EVM circuits, which run to tens of thousands of lines of code. They won’t be bug-free for quite a long time. There might be bugs in the proving systems, in circuit compilers, in ZK-EVM code, etc.
Young cryptographic primitives: ZK-Rollups rely on SNARKs, which are still young, so the proving systems themselves might contain bugs.
Multi-prover: several types of proof are generated for one block. Then, in case of a bug, even if one proof is broken, the others are unlikely to allow the exact same vulnerability to be exploited. If one proof type somehow doesn’t allow a block to be validated, the chain could halt (depending on how many proofs are required to accept a block as valid). Requiring several proof types is a strict safety improvement.
Disclaimer: from now on, when we say “multi-prover” we mean different types of proofs.
We can think of multi-prover as analogous to Ethereum client diversity: it allows the chain to “be resilient against partial failures across the network and then recover and finalize”.
As Vitalik explains in his article: “Ethereum's multi-client philosophy is a type of decentralization, and like decentralization in general, one can focus on either the technical benefits of architectural decentralization or the social benefits of political decentralization. Ultimately, the multi-client philosophy was motivated by both and serves both.”
The same holds for the rollup multi-prover philosophy: it yields both technical and social benefits, building architectural and political decentralization. By technical benefits, we mean different proving systems, architectures, and implementations. By social benefits, we mean power diversity: namely, avoiding concentration of power within the chain team.
Multi-prover might cover
Different proof types such as fraud proofs (used in optimistic rollups) and validity proofs (used in zk-rollups). As they rely on very different assumptions and their designs are far apart, the correlation between bugs in fraud proofs and validity proofs is very low. Another, less popular option for diversifying proofs is SGX. See the section on SGX below to learn how it works.
Different proving systems such as SNARK and STARK back-ends (when the zk validity proof type is used).
Different implementations by different teams: this reduces the risk that one bug in one piece of software leads to a catastrophic breakdown of the entire network, since software written by other teams very likely doesn’t contain the same bug.
If there are multiple proof types, there is some probability that they will disagree on the outcome of proving a block.
When it comes to the proofs, there are two characteristics that we expect them to have: completeness and soundness. By completeness, we mean that “if the statement is true, a verifier can be convinced of this fact by a prover.” By soundness, we mean that “if the statement is false, no cheating prover can convince a verifier that it is true.” Below we explain what can go wrong with completeness and soundness in the multi-prover context.
For the validity proof type:
Broken proof in the completeness context: a proof for a valid block cannot be generated. That means no proof of this proof type can be submitted onchain. Depending on how many proof types are required (n of N), the chain may halt.
Broken proof in the soundness context: multiple proofs for a valid block can be generated with different blockhashes, while of course only a proof for the one and only correct blockhash should be possible. In this case, a valid proof can be submitted onchain, but it has to match the blockhashes of the other proof types. If the same bug can be exploited in multiple proof types and several proofs with the same incorrect blockhash can be submitted for the same block, even the multi-prover setup would be broken.
For the fraud proof type, a completeness break means that no fraud proof can be generated for an invalid block, while a soundness break means that a fraud proof can be generated for a valid block.
If two types of proofs are used and there is a completeness problem with one or more of them, one possible solution is to use multi-sig governance as an emergency measure. This entity checks which blockhash is actually valid, confirms it onchain, and excludes or updates the buggy proof type (or takes other strict measures). In this case, a good multi-sig is necessary. Check the Twitter thread by Ed Felten on what makes a good multi-sig.
If there are more than two types of proofs, majority consensus on proof validity might serve as the dispute resolution mechanism, followed by a mechanism to exclude or update provers that behaved erroneously.
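The majority rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the proof-type labels, and the convention that `None` marks a proof type that could not produce a proof are all hypothetical, and a real rollup would implement this logic in its smart contracts.

```python
from collections import Counter
from typing import Dict, List, Optional


def strict_majority(n_proof_types: int) -> int:
    """Smallest vote count that only one blockhash can reach."""
    return n_proof_types // 2 + 1


def resolve_blockhash(claims: Dict[str, Optional[str]]) -> Optional[str]:
    """Return the blockhash backed by a strict majority of proof types.

    `claims` maps a (hypothetical) proof-type name to the blockhash its
    proof attests to, or None if that proof type produced no proof
    (a completeness failure). Returns None when no majority exists.
    """
    votes = Counter(h for h in claims.values() if h is not None)
    if not votes:
        return None
    blockhash, count = votes.most_common(1)[0]
    return blockhash if count >= strict_majority(len(claims)) else None


def dissenting_provers(claims: Dict[str, Optional[str]], accepted: str) -> List[str]:
    """Proof types to exclude or update: they failed or disagreed."""
    return sorted(name for name, h in claims.items() if h != accepted)


claims = {"zk": "0xabc", "sgx": "0xabc", "fraud": "0xdef"}
accepted = resolve_blockhash(claims)
```

With three proof types, two matching blockhashes are enough to accept the block, and the dissenting proof type is flagged for follow-up. If no blockhash reaches a strict majority, the function returns `None`, which corresponds to the case where the chain halts or governance has to step in.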
When it comes to diversity of proof types, one can combine different validity proofs (e.g., zk proofs and SGX proofs) with fraud proofs. However, fraud proofs require a long delay and hence don’t seem to fit ZK-Rollup needs well. In this section, we explore SGX proofs: what they are, how they work, and what their pros and cons are.
SGX stands for Software Guard Extensions.
It is a type of Trusted Execution Environment (TEE) made by Intel.
A TEE is an area on the main processor of a device that is separated from the system's main operating system (OS). It ensures data is stored, processed, and protected in a secure environment.
SGX enables applications to execute code and protect secrets inside their own trusted execution environment, providing protection from malicious software.
In blockchain, SGX can be used for private smart contracts: they become private because they run inside trusted enclaves. For example, one can run an oracle inside SGX to gain privacy and to prove something without revealing what is being proven.
SGX creates isolated non-addressable memory regions of code and data (enclaves) reserved from the system’s physical RAM and then encrypted. In other words, an enclave is a “secret vault” that contains both code and data where applications can process sensitive data without the risk of it being exposed.
Encryption takes place at the hardware level, which protects against software-based attacks. Even if a hacker has access to the application, the entire operating system, and the BIOS of the system on which the TEE is running, confidential data will remain secret.
TCB (Trusted Computing Base) – everything in a computing system that provides a secure environment for operations. This includes its hardware, firmware, software, operating system, physical locations, built-in security controls, and prescribed security and safety procedures.
Hardware secrets – RPK, Root Provisioning Key (randomly created and retained by Intel), and RSK, Root Seal Key (randomly generated automatically inside the CPU during production).
Attestation – the process of demonstrating that a software executable has been properly instantiated on a platform.
An SGX application consists of two parts: a trusted part and an untrusted part.
The application creates an enclave in the trusted part.
Then it makes a call to a trusted function. A trusted function is a piece of code written by a software developer to run inside the enclave. Only trusted functions are allowed to run in the enclave; all other attempts to access the enclave memory from outside the enclave are denied by the processor.
Once the function is called, the application is running in the trusted space and sees the enclave code and data as clear text.
When the trusted function returns, the enclave data remains in the trusted memory.
The application is back to running in the untrusted space.
Some clarifications and details:
The untrusted code is the application;
One calls an ecall function to enter the enclave;
The ecall function’s input is untrusted;
The trusted code is the enclave;
An ocall is made from within an ecall to call out of the enclave;
The output of an ocall is untrusted;
Enclaves cannot directly access OS-provided services;
There are two keys: a secret key and a public key. The secret key is generated inside the enclave and never leaves it. The public key is available to anyone: users can encrypt a message with the public key so that only the enclave can decrypt it.
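The trust boundaries listed above can be illustrated with a toy Python model. It is purely an assumption-laden sketch: Python offers no hardware isolation, the class and method names (`ToyEnclave`, `ecall_deposit`, `ecall_value`) are hypothetical, and real enclaves are written against the Intel SGX SDK rather than like this. The only point is to show where untrusted data crosses into trusted code and has to be validated.

```python
class ToyEnclave:
    """Toy stand-in for an enclave. No real isolation is provided."""

    def __init__(self, ocall_read_price):
        self.__balance = 0               # "enclave data", kept out of the public interface
        self.__ocall = ocall_read_price  # untrusted service, reachable only via an ocall

    def ecall_deposit(self, amount):
        # The ecall input comes from the untrusted application: validate it.
        if not isinstance(amount, int) or amount <= 0:
            raise ValueError("rejected untrusted ecall input")
        self.__balance += amount
        return self.__balance

    def ecall_value(self):
        # ocall: the enclave cannot reach OS services itself, so it asks
        # the untrusted side and must treat the answer as untrusted too.
        price = self.__ocall()
        if not isinstance(price, int) or not 0 < price < 10**6:
            raise ValueError("rejected untrusted ocall output")
        return self.__balance * price


enclave = ToyEnclave(ocall_read_price=lambda: 3)  # untrusted app supplies the ocall
balance = enclave.ecall_deposit(5)
value = enclave.ecall_value()
```

Both directions are checked: the argument of an ecall and the return value of an ocall are validated inside the trusted code before they touch enclave state.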
Sometimes enclaves need to collaborate with other enclaves on the same platform, for example to exchange data or to communicate.
Two main cases when attestation is required:
Two exchanging enclaves have to prove to each other that they can be trusted.
The client has to prove to the server that the client application is running on a trusted platform that can process the secrets securely.
Both of those two cases require a proof of secured execution environment, and Intel SGX refers to this proving process as attestation.
There are two types of attestation:
Local attestation – one enclave authenticates the other enclave locally using Intel SGX Report mechanism to verify that the counterpart is running on the same TCB platform.
Remote attestation – the goal is for a hardware entity, or a combination of hardware and software, to gain the trust of a remote service provider, such that the service provider can confidently provide the client with the secrets requested. It verifies three things: (i) the application’s identity, (ii) its intactness (that it has not been tampered with), and (iii) that it is running securely within an enclave on an Intel SGX enabled platform. With Intel SGX, the remote attestation software includes the application’s enclave and two Intel-provided enclaves: the Quoting Enclave (receives REPORTs from other enclaves, verifies them, and signs them with the attestation key before returning the result) and the Provisioning Enclave (encrypts the attestation key with the PSK and stores it on the platform for future use). The attestation hardware is the Intel SGX enabled CPU. For the detailed remote attestation architecture, check SGX 101.
The remote attestation feature requires the service to be registered with the Intel Attestation Service (alternatively, one might use a third-party service, say a cloud provider’s, which also offers the opportunity to customize security pipelines).
For SGX in a blockchain setting, remote attestation is the only option, even though it is less robust than local attestation.
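To make the three checks of remote attestation more tangible, here is a heavily simplified Python sketch. It rests on loud assumptions: the "quote" is a plain dictionary, the enclave measurement is just a SHA-256 hash of the code, and an HMAC with a shared toy key stands in for the signature that the Quoting Enclave produces with the attestation key. Real attestation uses asymmetric signatures and Intel's (or a third party's) verification infrastructure, and none of the names below belong to an actual SGX API.

```python
import hashlib
import hmac

# Hypothetical stand-in for the attestation key; a real key never leaves the platform.
ATTESTATION_KEY = b"toy-attestation-key"


def measure(enclave_code: bytes) -> str:
    """Toy enclave measurement: any change to the code changes the hash."""
    return hashlib.sha256(enclave_code).hexdigest()


def make_quote(enclave_code: bytes) -> dict:
    """What the platform side would produce (toy version)."""
    measurement = measure(enclave_code)
    signature = hmac.new(ATTESTATION_KEY, measurement.encode(), hashlib.sha256).hexdigest()
    return {"measurement": measurement, "signature": signature}


def verify_quote(quote: dict, expected_measurement: str) -> bool:
    """What the remote service provider would check (toy version)."""
    expected_sig = hmac.new(
        ATTESTATION_KEY, quote["measurement"].encode(), hashlib.sha256
    ).hexdigest()
    # (iii) the quote was signed by a genuine platform
    signed_by_platform = hmac.compare_digest(quote["signature"], expected_sig)
    # (i) identity and (ii) intactness: the measurement matches the expected code
    code_matches = quote["measurement"] == expected_measurement
    return signed_by_platform and code_matches


code = b"verify-block-v1"
quote = make_quote(code)
ok = verify_quote(quote, measure(code))
```

A quote for modified code fails the measurement check, and a quote with a forged signature fails the platform check, so the service provider releases secrets only when both hold.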
Using SGX means trusting Intel that everything is fine with this fancy technology, since the hardware is delivered with the private key already inside the trusted enclave. Furthermore, Intel is fairly slow to patch new attacks. Check sgx.fail for a list of publicly known SGX attacks that Intel has still not fixed.
As a counterargument, some point out that Intel’s reputation is today an extremely high-value asset. That is, any malicious manipulation of SGX would be economically unreasonable for Intel, at least for its own economic benefit.
Another issue is that, since SGX was deprecated on consumer CPUs in 2021, it might one day be deprecated or significantly changed on other product lines in ways that are incompatible with specific use cases.
2. Brand new technology
Almost no one can fairly answer the question of how reliable SGX is. The fairest answer: we don’t know, because it’s fairly new and we don’t know all the types of attacks that might happen (and will happen).
Most of SGX is a black box, with a large part implemented in microcode (ucode).
TEEs are not self-sufficient, so they need to be complemented by MPC, ZK, FHE, ORAM, etc. (and different combinations of them) to be secure.
Beyond combining SGX with other cryptographic primitives, it might be combined with economic incentives.
SGX vs Economics
Diagram source: Phil’s talk at ETHDenver Privacy Workshop, February 2023
In some aspects, SGX is weaker than economic incentives; in others, it is stronger.
SGX is more powerful:
SGX is weaker:
When something is private (as in the case of SGX), large-scale cheating might take place without being observed.
So, SGX is not a magic pill. However, enclaves significantly decrease the attack surface, and it seems reasonable to use them.
Diagram source: a paper “Confidential Computing in Cloud/Fog-based Internet of Things Scenarios”.
The most reasonable strategy is to combine all the solutions (both cryptographic and economic) in a smart way to get a really robust one.
To run existing applications in isolation without modifying their source code, libOSes (entire operating systems implemented as libraries) such as Gramine, Occlum, and EGo have been developed. They provide an abstraction layer for applications that expect to run in Linux-like environments.
SGX implementation within the blockchain area was pioneered by a joint effort of Flashbots and Nethermind.
Summary: running Geth within SGX is entirely possible, but it is both resource- and time-consuming. It needs a large amount of memory, has a startup time of around 3 hours, and state is easily lost.
Problems and constraints
Storing the state: Gramine’s encrypted file mounts perform slowly, which made it difficult for Geth to keep up with Ethereum mainnet (this may be a bug).
Initial sync: the initial sync process of the chain can take quite a long time, and currently requires up to 800GB of storage.
Information Leakage: occurs when the host system can extract information about what data is being accessed within the SGX enclave. In the case of Geth, this could potentially leak information about the keys being accessed from the database due to IO and memory access patterns (the paper exploring IO leakage).
Given all these problems, Flashbots tried an alternative approach: storing the entire chain in memory.
A system with at least 1TB of memory is needed, which is resource-intensive and expensive. However, not all of the 1TB needs to be accessed by the application at once; much of it can be encrypted by the SGX kernel driver and swapped out.
The state is lost if the Geth application stops.
Implementation goal (specified): private txs + decentralized builders
Block builders, as well as other infrastructure providers, can no longer see the contents of user transactions, yet can still run verifiable block construction algorithms on them;
Looking to the future, builders inside SGX can make blocks that are provably valid and truthfully report their bid size, possibly obviating the need for mev-boost relays.
Problems and constraints
Geth state size: the Geth database requires about 1TB of storage. The startup process takes about 4.5 hours. Implemented solution: keeping the ancient database outside the enclave, halving the state required inside (to ~500GB). Check implementation details.
Merging performance and additional latencies: the SGX builder currently performs at ~150 mgasps (million gas per second), which is about half the performance of a non-SGX builder.
“A likely culprit for this degradation is the way Golang handles syscalls. Instead of using the libc syscall interface, Golang calls directly into the kernel. This produces additional execution overhead when running inside an SGX enclave. Possible improvements could be made by building geth with gccgo.”
More about the Flashbots implementation:
Open issue on the Flashbots forum.
Alex Stokes’s thread about distributed building.
The good news is that for rollups, we are only interested in verifying the correct execution of a block. This means that we can create a stateless program that only verifies the correct execution of a block. All data required to do this can be queried from a normal node. This data is passed into the program as input, and the program verifies its correctness. This is exactly how zk proofs work because zk proofs cannot access the state directly.
More concretely, the inputs are the latest known state root, the list of transactions, and the block parameters. The expected output (the block hash) is also passed in as an input so the blockhash calculated by the program can be checked against the expected value. To verify the state data passed in as an input, the input also contains the Merkle proofs for all states accessed within the block so that it can be verified against the known state root. This way, as long as we can ensure the correct state root and block data are supplied to the stateless function, SGX can attest to the correct execution of the program and, therefore, the correct execution of the block.
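The stateless check described above can be sketched as follows. This is a minimal illustration, not the actual program: it assumes a simple binary Merkle tree over SHA-256 (Ethereum itself uses a Merkle Patricia trie with Keccak-256), and the "execution" step is a placeholder that merely hashes the inputs together, whereas a real verifier would re-execute the transactions. All function and parameter names are hypothetical.

```python
import hashlib
from typing import List, Tuple

# A proof is a list of (sibling_hash, sibling_is_left) pairs from leaf to root.
Proof = List[Tuple[bytes, bool]]


def h(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()


def verify_merkle_proof(leaf: bytes, proof: Proof, root: bytes) -> bool:
    """Recompute the root from a leaf and its sibling path."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling, node) if sibling_is_left else h(node, sibling)
    return node == root


def toy_block_hash(state_root: bytes, block_params: bytes, txs: List[bytes]) -> bytes:
    """Placeholder for real execution: just commits to all inputs."""
    return h(state_root, block_params, *txs)


def verify_block(
    state_root: bytes,
    accessed_state: List[Tuple[bytes, Proof]],
    txs: List[bytes],
    block_params: bytes,
    expected_block_hash: bytes,
) -> bool:
    # 1. Every piece of state passed in must be proven against the known root.
    for leaf, proof in accessed_state:
        if not verify_merkle_proof(leaf, proof, state_root):
            return False
    # 2. "Execute" and compare against the expected output passed in as input.
    return toy_block_hash(state_root, block_params, txs) == expected_block_hash


# Two-leaf state tree used as example input.
alice, bob = b"alice:10", b"bob:5"
root = h(h(alice), h(bob))
alice_proof = [(h(bob), False)]  # bob's hash is the right-hand sibling
expected = toy_block_hash(root, b"number=1", [b"tx1"])
valid = verify_block(root, [(alice, alice_proof)], [b"tx1"], b"number=1", expected)
```

The function takes everything it needs as arguments and touches no database, which is what makes it suitable for running inside an enclave: state supplied by the untrusted node is accepted only if its Merkle proof matches the known state root.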
Anyone interested in generating proofs will want to run a node for the chain for which the proofs are created. From this node, all data necessary for generating the proofs is extracted. Besides the obvious data like the transactions in the block, this also includes the Merkle proofs for all state accesses. For zk proofs, this may also include the execution trace of all the transactions in the block. From this data, the proofs are generated. Generating a zk-based proof may take many minutes on many computers. Other proofs, like SGX, can be generated on a single computer in a matter of seconds.
Once all necessary proofs are generated, they are submitted together to the rollup smart contract. They are not submitted separately because there is no benefit of having only some of the required proofs on-chain. Simply requiring all proofs to be submitted together makes things easier and more efficient. With all proofs available on-chain, the smart contract verifies them for correctness.
For a zk proof, that means running the verifier on the proof. For an SGX-based proof, this is just checking that some ECDSA signature containing the expected data is signed by the expected address. If all proofs pass this check, and the expected number of proofs is given, the only thing left to check is that all proofs actually prove the block has the same blockhash. If that checks out, the block is finally marked as proven, and its blockhash is now known on-chain (assuming all previous blocks were also proven so that the block was built on top of the correct chain tip).
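The contract-side logic described above can be sketched in Python (a real implementation would live in the rollup smart contract). The sketch makes several assumptions that are not from the original design: verifiers are modeled as plain callables, the zk verifier is a trivial placeholder, and an HMAC with a toy key stands in for the ECDSA signature check of the SGX proof. All names are hypothetical.

```python
import hashlib
import hmac
from typing import Callable, Dict, Optional, Tuple

Verifier = Callable[[bytes, bytes], bool]  # (proof, blockhash) -> is the proof valid?

SGX_KEY = b"toy-sgx-signing-key"  # hypothetical; stands in for the enclave's ECDSA key


def sgx_sign(blockhash: bytes) -> bytes:
    return hmac.new(SGX_KEY, blockhash, hashlib.sha256).digest()


def verify_sgx(proof: bytes, blockhash: bytes) -> bool:
    # Stand-in for "signature over the expected data by the expected address".
    return hmac.compare_digest(proof, sgx_sign(blockhash))


def verify_zk(proof: bytes, blockhash: bytes) -> bool:
    # Placeholder for running a real zk verifier on the proof.
    return proof == b"zk:" + blockhash


VERIFIERS: Dict[str, Verifier] = {"zk": verify_zk, "sgx": verify_sgx}


def mark_proven(
    proofs: Dict[str, Tuple[bytes, bytes]], required: int
) -> Optional[bytes]:
    """Return the proven blockhash, or None if the block is not proven.

    `proofs` maps a proof type to (proof, claimed blockhash).
    """
    if len(proofs) < required:
        return None  # not enough proofs were submitted together
    hashes = set()
    for kind, (proof, blockhash) in proofs.items():
        verifier = VERIFIERS.get(kind)
        if verifier is None or not verifier(proof, blockhash):
            return None  # unknown proof type or invalid proof
        hashes.add(blockhash)
    # All proofs must prove the same blockhash.
    return hashes.pop() if len(hashes) == 1 else None


bh = b"\x11" * 32
proven = mark_proven({"zk": (b"zk:" + bh, bh), "sgx": (sgx_sign(bh), bh)}, required=2)
```

Note the last check: two proofs that each verify on their own but commit to different blockhashes do not mark the block as proven. The check that all previous blocks were already proven is omitted here for brevity.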
SGX Gitbook: how to build your first SGX application.
Video from Rollup Day 2022 – Multi-Provers for Rollup Security (Vitalik Buterin).
An ethresearch post “2FA zk-rollups using SGX” by Justin Drake.
Video from ETHDenver Privacy Workshop 2023 (Kevin Yu).
Flashbots run block builders inside SGX report.
Running Geth within SGX: Our Experience, Learnings and Code report.
Alex Stokes on distributed block building.
Video from ETHDenver Privacy Workshop 2023 (Phil Daian).
Presentation “SGX Secure Enclaves in Practice Security and Crypto Review”.
Presentation with example of Proof of Work, Proof of Time, and Proof of Ownership implementation.
SGX explanation by Intel.
Presentation by ZHANG Yuanyuan from Shanghai Jiao Tong University.
Official Intel video about SGX.