{"id":13618049,"url":"https://github.com/mratsim/constantine","last_synced_at":"2025-04-07T19:16:21.501Z","repository":{"id":40442808,"uuid":"142173232","full_name":"mratsim/constantine","owner":"mratsim","description":"Constantine: modular, high-performance, zero-dependency  cryptography stack for proof systems and blockchain protocols.","archived":false,"fork":false,"pushed_at":"2024-05-22T17:29:06.000Z","size":18556,"stargazers_count":265,"open_issues_count":60,"forks_count":35,"subscribers_count":18,"default_branch":"master","last_synced_at":"2024-05-22T17:39:08.587Z","etag":null,"topics":["barreto-naehrig","bigint","bignum","bls","bls-signature","bls12-381","constant-time","cryptography","digital-signature","elliptic-curve-arithmetic","elliptic-curve-cryptography","elliptic-curves","finite-fields","galois-field","hash-to-curve","pairing","pairing-cryptography","public-key-cryptography","side-channels","zkp"],"latest_commit_sha":null,"homepage":"","language":"Nim","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mratsim.png","metadata":{"files":{"readme":"README-PERFORMANCE.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE-APACHEv2","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-07-24T14:51:39.000Z","updated_at":"2024-05-22T17:39:15.664Z","dependencies_parsed_at":"2023-10-16T04:11:57.154Z","dependency_job_id":"a1f5fc16-e91f-489a-90e7-fe8b50912d9a","html_url":"https://github.com/mratsim/constantine","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Fconstantine","tags_url":"https://
repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Fconstantine/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Fconstantine/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mratsim%2Fconstantine/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mratsim","download_url":"https://codeload.github.com/mratsim/constantine/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247713258,"owners_count":20983683,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["barreto-naehrig","bigint","bignum","bls","bls-signature","bls12-381","constant-time","cryptography","digital-signature","elliptic-curve-arithmetic","elliptic-curve-cryptography","elliptic-curves","finite-fields","galois-field","hash-to-curve","pairing","pairing-cryptography","public-key-cryptography","side-channels","zkp"],"created_at":"2024-08-01T20:01:53.338Z","updated_at":"2025-04-07T19:16:21.474Z","avatar_url":"https://github.com/mratsim.png","language":"Nim","funding_links":[],"categories":["Nim","Algorithms","Projects by GPU Technology"],"sub_categories":["Cryptography","Cross-Platform Frameworks"],"readme":"# Performance\n\nHigh performance is a sought-after property.\nNote that security and side-channel resistance take priority over performance.\n\nNew applications of elliptic curve cryptography like zero-knowledge proofs or\nproof-of-stake based blockchain protocols are bottlenecked by cryptography.\n\n## In blockchain\n\nEthereum 2 clients spent 
or used to spend anywhere between 30% and 99% of their processing time verifying the signatures of block validators on R\u0026D testnets.\nAssuming we want nodes to handle a thousand peers, if a cryptographic pairing takes 1ms, that represents 1s of cryptography per block to sign with a target\nblock frequency of 1 every 6 seconds.\n\n## In zero-knowledge proofs\n\nAccording to https://medium.com/loopring-protocol/zksnark-prover-optimizations-3e9a3e5578c0\na 16-core CPU can prove 20 transfers/second or 10 transactions/second.\nThe previous implementation was 15x slower and one of the key optimizations\nwas changing the elliptic curve cryptography backend.\nThis directly affected hardware cost and/or the cloud computing resources required.\n\n## Measuring performance\n\nTo measure the performance of Constantine:\n\n```bash\ngit clone https://github.com/mratsim/constantine\ncd constantine\n\n# Default compiler. We recommend enforcing CC=clang for best performance.\nnimble bench_fp\n\n# Arithmetic\nCC=clang nimble bench_fp  # Using Clang + Assembly (recommended)\nCC=clang nimble bench_fp2\nCC=clang nimble bench_fp12\n\n# Scalar multiplication and pairings\nCC=clang nimble bench_ec_g1_scalar_mul\nCC=clang nimble bench_ec_g2_scalar_mul\nCC=clang nimble bench_pairing_bls12_381\n\n# And per-curve summaries\nCC=clang nimble bench_summary_bn254_nogami\nCC=clang nimble bench_summary_bn254_snarks\nCC=clang nimble bench_summary_bls12_377\nCC=clang nimble bench_summary_bls12_381\n\n# Ethereum BLS signature protocol\nCC=clang nimble bench_eth_bls_signatures\n\n# Ethereum KZG commitments\nCC=clang nimble bench_eth_eip4844_kzg\n\n# Ethereum Virtual Machine (EVM) precompiles\nCC=clang nimble bench_eth_evm_precompiles\n\n# Multi-scalar multiplication\nCC=clang nimble bench_ec_g1_msm_bls12_381\nCC=clang nimble bench_ec_g1_msm_bn254_snarks\n```\n\nThe full list of benchmarks is available in the [`benchmarks`](./benchmarks) folder.\nThe exact commands are listed as part of `nimble tasks`.\n\n\nAs 
mentioned in the [Compiler caveats](#compiler-caveats) section, GCC is up to 2x slower than Clang due to mishandling of carries and register usage.\n\n## Ethereum benchmarks\n\n### Ethereum Virtual Machine (EVM) precompiles\n\n![Bench Ethereum EVM precompiles](./media/bench-eth_evm_precompiles-R7_7840U.png)\n\n### Ethereum KZG commitments (EIP-4844)\n\n![Bench Ethereum KZG commitments](./media/bench-eth_eip4844_kzg-R7_7840U.png)\n\n### Ethereum BLS signatures (over BLS12-381 𝔾₂)\n\n![Bench Ethereum BLS signature](./media/bench-eth_bls_signatures-R7_7840U.png)\n\n## Cryptographic primitives benchmarks\n\n### BLS12-381 detailed benchmarks\n\n![BLS12-381 perf summary](./media/bench-summary_bls12_381-R7_7840U.png)\n\n![BLS12-381 Multi-Scalar multiplication 1](./media/bench-bls12_381_msm-2_to_128-R7_7840U.png)\n![BLS12-381 Multi-Scalar multiplication 2](./media/bench-bls12_381_msm-256_to_16384-R7_7840U.png)\n![BLS12-381 Multi-Scalar multiplication 3](./media/bench-bls12_381_msm-65536_to_262144-R7_7840U.png)\n\n### BN254-Snarks Multi-Scalar-Multiplication benchmarks\n\nOn an i9-9980XE (18 cores, watercooled, overclocked, 4.1GHz all core turbo)\n\n![BN254-Snarks multi-scalar multiplication](./media/bench-bn254_snarks_msm-i9_9980XE.png)\n\n### Parallelism\n\nConstantine's multithreaded primitives are powered by a highly tuned threadpool and stress-tested for:\n- scheduler overhead\n- load balancing with extreme imbalance\n- nested data parallelism\n- contention\n- speculative/conditional parallelism\n\nand provide the following paradigms:\n- Future-based task-parallelism\n- Data parallelism (nestable and awaitable for loops)\n  - including arbitrary parallel reductions\n- Dataflow parallelism / Stream parallelism / Graph Parallelism / Pipeline parallelism\n- Structured Parallelism\n\nThe threadpool parallel-for loops use lazy loop splitting and are fully adaptive to the workload being scheduled, the threads' in-flight load and the hardware speed, unlike most (all?) 
runtimes, see:\n- OpenMP woes depending on hardware and workload: https://github.com/zy97140/omp-benchmark-for-pytorch\n- Raytracing: the ideal runtime adapts to the pixel compute load: ![load distribution](./media/parallel_load_distribution.png)\\\n  Most (all?) production runtimes use scheduling A (split on number of threads like GCC OpenMP) or B (eager splitting, unable to adapt to actual work like LLVM/Intel OpenMP or Intel TBB) while Constantine uses C.\n\nThe threadpool provides an efficient backoff strategy to conserve power based on:\n- eventcounts / futexes, for low overhead backoff\n- log-log iterated backoff, a provably optimal backoff strategy used for wireless communication to minimize communication in parallel for-loops\n\nThe research papers on high-performance multithreading are available in the Weave repo: https://github.com/mratsim/weave/tree/7682784/research.\\\n_Note: The threadpool is not backed by Weave but by an inspired runtime that has been significantly simplified for ease of auditing. 
In particular, it uses shared-memory based work-stealing instead of channel-based work-requesting for load balancing as distributed computing is not a target, ..., yet._\n\n## Compiler caveats\n\nUnfortunately, compilers, and GCC in particular, are not very good at optimizing big integers and/or cryptographic code even when using intrinsics like `addcarry_u64`.\n\nCompilers with proper support of `addcarry_u64` like Clang, MSVC and ICC\nmay generate code up to 20~25% faster than GCC.\n\nThis is explained by the GMP team: https://gmplib.org/manual/Assembly-Carry-Propagation.html\nand can be reproduced with the following C code.\n\nSee https://gcc.godbolt.org/z/2h768y\n```C\n#include \u003cstdint.h\u003e\n#include \u003cx86intrin.h\u003e\n\nvoid add256(uint64_t a[4], uint64_t b[4]){\n  uint8_t carry = 0;\n  for (int i = 0; i \u003c 4; ++i)\n    carry = _addcarry_u64(carry, a[i], b[i], \u0026a[i]);\n}\n```\n\nGCC\n```asm\nadd256:\n        movq    (%rsi), %rax\n        addq    (%rdi), %rax\n        setc    %dl\n        movq    %rax, (%rdi)\n        movq    8(%rdi), %rax\n        addb    $-1, %dl\n        adcq    8(%rsi), %rax\n        setc    %dl\n        movq    %rax, 8(%rdi)\n        movq    16(%rdi), %rax\n        addb    $-1, %dl\n        adcq    16(%rsi), %rax\n        setc    %dl\n        movq    %rax, 16(%rdi)\n        movq    24(%rsi), %rax\n        addb    $-1, %dl\n        adcq    %rax, 24(%rdi)\n        ret\n```\n\nClang\n```asm\nadd256:\n        movq    (%rsi), %rax\n        addq    %rax, (%rdi)\n        movq    8(%rsi), %rax\n        adcq    %rax, 8(%rdi)\n        movq    16(%rsi), %rax\n        adcq    %rax, 16(%rdi)\n        movq    24(%rsi), %rax\n        adcq    %rax, 24(%rdi)\n        retq\n```\n\n### Inline assembly\n\nWhile using intrinsics significantly improves code readability, portability, auditability and maintainability,\nConstantine uses inline assembly on x86-64 to ensure performance portability despite poor optimization (for GCC)\nand also to use 
dedicated large integer instructions MULX, ADCX, ADOX that compilers cannot generate.\n\nThe speed improvement on finite field arithmetic is up to 60% with MULX, ADCX, ADOX on BLS12-381 (6 limbs).\n\nFinally, assembly is required to ensure the constant-time property and to avoid compilers turning careful\nbranchless code into branches, see [Fighting the compiler (wiki)](https://github.com/mratsim/constantine/wiki/Constant-time-arithmetics#fighting-the-compiler).\n\nIn summary, pure C/C++/Nim implies:\n- a smart compiler might unravel the constant-time bit manipulation and reintroduce branches.\n- a significant performance cost with GCC (~50% slower than Clang).\n- missed opportunities on recent CPUs that support MULX/ADCX/ADOX instructions (~60% faster than Clang).\n- 2.4x perf ratio between using plain GCC vs GCC with inline assembly.\n\n## Sizes: code size, stack usage\n\nThanks to 10x smaller key sizes for the same security level as RSA, elliptic curve cryptography\nis widely used on resource-constrained devices.\n\nConstantine is actively optimized for code size and stack usage.\nConstantine does not use heap allocation.\n\nAt the moment Constantine is optimized for 32-bit and 64-bit CPUs.\n\nWhen performance and code size conflict, a careful and informed default is chosen.\nIn the future, a compile-time flag that goes beyond the compiler `-Os` might be provided.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmratsim%2Fconstantine","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmratsim%2Fconstantine","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmratsim%2Fconstantine/lists"}