{"id":17144090,"url":"https://github.com/szabolcsdombi/optimization-demo","last_synced_at":"2025-04-13T10:21:36.757Z","repository":{"id":177547651,"uuid":"660521959","full_name":"szabolcsdombi/optimization-demo","owner":"szabolcsdombi","description":":zap: Optimizing Python code by implementing a C++ extension","archived":false,"fork":false,"pushed_at":"2023-07-12T05:01:12.000Z","size":37,"stargazers_count":47,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-03-27T01:46:59.098Z","etag":null,"topics":["benchmark","cpp","optimization","python"],"latest_commit_sha":null,"homepage":"","language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/szabolcsdombi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-06-30T07:42:01.000Z","updated_at":"2024-02-04T14:43:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"6b568f97-1998-4034-8706-7b6989e014f5","html_url":"https://github.com/szabolcsdombi/optimization-demo","commit_stats":null,"previous_names":["szabolcsdombi/optimization-demo"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szabolcsdombi%2Foptimization-demo","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szabolcsdombi%2Foptimization-demo/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szabolcsdombi%2Foptimization-demo/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/szabolcsdombi%2F
optimization-demo/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/szabolcsdombi","download_url":"https://codeload.github.com/szabolcsdombi/optimization-demo/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248696232,"owners_count":21147093,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmark","cpp","optimization","python"],"created_at":"2024-10-14T20:43:04.241Z","updated_at":"2025-04-13T10:21:36.725Z","avatar_url":"https://github.com/szabolcsdombi.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# optimization-demo\n\nThis article is about optimizing a tiny bit of Python code by replacing it with its C++ counterpart.\n\nBeware, geek stuff follows.\n\nWe are interested in implementing the [Opening Handshake](https://datatracker.ietf.org/doc/html/rfc6455#section-1.3) of [The WebSocket Protocol](https://datatracker.ietf.org/doc/html/rfc6455).\nIt is a fairly simple task to understand, and it involves enough number crunching and intermediate object allocations for the difference to show up in the results.\nTo be clear, this is a demo project with no real-world benefits except for the methodology used.\n\nLet's get started.\n\nFirst, we implement a function that returns the `Sec-WebSocket-Accept` calculated from the `Sec-WebSocket-Key`.\n\n```py\nfrom base64 import b64encode\nfrom hashlib import sha1\n\n\ndef py_accept(key: str) -\u003e str:\n    return b64encode(sha1((key + '258EAFA5-E914-47DA-95CA-C5AB0DC85B11').encode()).digest()).decode()\n```\n\nWe can 
easily verify that the sample value from the spec matches our return value.\n\n```py\n\u003e\u003e\u003e py_accept('dGhlIHNhbXBsZSBub25jZQ==')\n's3pPLMBiTxaQ9kYGzzhZRbK+xOo='\n```\n\nSo far so good. Now let's disassemble it to see what is inside.\n\n```py\n\u003e\u003e\u003e import dis\n\u003e\u003e\u003e dis.dis(py_accept)\n\n  1           0 RESUME                   0\n\n  2           2 LOAD_GLOBAL              1 (NULL + b64encode)\n             14 LOAD_GLOBAL              3 (NULL + sha1)\n             26 LOAD_FAST                0 (key)\n             28 LOAD_CONST               1 ('258EAFA5-E914-47DA-95CA-C5AB0DC85B11')\n             30 BINARY_OP                0 (+)\n             34 LOAD_METHOD              2 (encode)\n             56 PRECALL                  0\n             60 CALL                     0\n             70 PRECALL                  1\n             74 CALL                     1\n             84 LOAD_METHOD              3 (digest)\n            106 PRECALL                  0\n            110 CALL                     0\n            120 PRECALL                  1\n            124 CALL                     1\n            134 LOAD_METHOD              4 (decode)\n            156 PRECALL                  0\n            160 CALL                     0\n            170 RETURN_VALUE\n```\n\nOkay, this looks a bit messy. Here is a simplified version of it.\n\n```py\n             26 LOAD_FAST                0 (key)\n             28 LOAD_CONST               1 ('258EAFA5-E914-47DA-95CA-C5AB0DC85B11')\n             30 BINARY_OP                0 (+)\n             60 CALL                     0 (encode)\n             74 CALL                     1 (sha1)\n            110 CALL                     0 (digest)\n            124 CALL                     1 (b64encode)\n            160 CALL                     0 (decode)\n            170 RETURN_VALUE\n```\n\nExcept for the temporary values generated between the steps, the code itself looks about as fast as it can get. 
All the methods invoked are implemented in C inside the CPython implementation.\n\nNow, we are going to implement all this as a single step in C++. To do that, we can initialize a Python extension and add our new function.\n\n```c++\n#include \u003cPython.h\u003e\n\nvoid sec_websocket_accept(const void * src, void * dst) {\n    ...\n}\n\nPyObject * c_accept(PyObject * self, PyObject * arg) {\n    char result[28];\n    Py_ssize_t len = 0;\n    const char * key = PyUnicode_AsUTF8AndSize(arg, \u0026len);\n    if (!key || len != 24) {\n        PyErr_SetString(PyExc_ValueError, \"invalid key\");\n        return NULL;\n    }\n    sec_websocket_accept(key, result);\n    return PyUnicode_FromStringAndSize(result, 28);\n}\n\nPyMethodDef module_methods[] = {\n    {\"c_accept\", (PyCFunction)c_accept, METH_O, NULL},\n    {},\n};\n\nPyModuleDef module_def = {PyModuleDef_HEAD_INIT, \"mymodule\", NULL, -1, module_methods};\n\nextern \"C\" PyObject * PyInit_mymodule() {\n    return PyModule_Create(\u0026module_def);\n}\n```\n\nThe implementation of `sec_websocket_accept()` is cumbersome enough to be worth omitting from this article.\nHere is a [link](mymodule/mymodule.cpp) to the full code.\n\nIt might not be trivial to see, but neither the Python nor the C++ variant contains micro-optimizations or any hardware-specific ones.\nThese are just naive implementations. 
We can achieve significant results without using any of that.\n\nWe can add our tests and see the results.\n\n```py\nfrom base64 import b64encode\nfrom hashlib import sha1\n\nfrom mymodule import c_accept\n\n\ndef py_accept(key: str) -\u003e str:\n    return b64encode(sha1((key + '258EAFA5-E914-47DA-95CA-C5AB0DC85B11').encode()).digest()).decode()\n\n\ndef test_python_code(benchmark):\n    assert benchmark(py_accept, 'dGhlIHNhbXBsZSBub25jZQ==') == 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='\n\n\ndef test_optimized_code(benchmark):\n    assert benchmark(c_accept, 'dGhlIHNhbXBsZSBub25jZQ==') == 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='\n```\n\n## Results\n\n```\n--------------------------------------------------------------------------------------------------------\nName (time in ns)           Mean            StdDev              Median            OPS (Mops/s)\n--------------------------------------------------------------------------------------------------------\ntest_optimized_code     208.0669 (1.0)      0.1062 (1.0)      208.0665 (1.0)            4.8061 (1.0)\ntest_python_code        893.2082 (4.29)     6.2278 (58.63)    889.6251 (4.28)           1.1196 (0.23)\n--------------------------------------------------------------------------------------------------------\n```\n\n- It seems our Python code did really well. It can execute 1.1M calls per second.\n- It is also clear our C++ variant is 4.29x faster, clocking in at 4.8M calls per second.\n\nAmazing! 
Replacing a tiny bit of code that does not seem optimizable has a significant effect.\n\nIf you wish to run these tests yourself, you will find all the necessary steps in the GitHub Actions run [here](https://github.com/szabolcsdombi/optimization-demo/actions/runs/5423258436/jobs/9860949567).\n\n## Non-Goals of this Article\n\n- This article does not address maintainability or any other burden introduced by replacing simple Python code with cumbersome low-level C code.\n- We are not interested in micro-optimization, using SSE or hardware-implemented hashing.\n- We are not interested in multi-threaded approaches or concurrency.\n- We are not interested in implementing it in Rust or any other language not supported out of the box for Python extensions.\n\n## Fun Fact\n\nWe can implement a magic function that may also work.\n\n```py\ndef magic_accept(key: str) -\u003e str:\n    return 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='\n```\n\nSilly, but it does pass the test.\n\n```\n--------------------------------------------------------------------------------------------------------\nName (time in ns)           Mean            StdDev              Median            OPS (Mops/s)\n--------------------------------------------------------------------------------------------------------\ntest_magic_code          70.7236 (1.0)      0.0584 (1.0)       70.7438 (1.0)           14.1396 (1.0)\ntest_optimized_code     210.7457 (2.98)     0.2183 (3.74)     210.7261 (2.98)           4.7451 (0.34)\ntest_python_code        912.7083 (12.91)    5.5000 (94.24)    912.7722 (12.90)          1.0956 (0.08)\n-------------------------------------------------------------------------------------------------------\n```\n\nThe disassembled version is simple too.\n\n```py\n\u003e\u003e\u003e dis.dis(magic_accept)\n  1           0 RESUME                   0\n\n  2           2 LOAD_CONST               1 ('s3pPLMBiTxaQ9kYGzzhZRbK+xOo=')\n              4 RETURN_VALUE\n```\n\nSo, how does this new method compare to our existing ones 
that do real work?\n\nSurprising as it may sound, our C++ implementation is just 2.98x slower.\n(We are now leaving measurements and interpretation and entering the realm of guesswork.)\nThis could be because of the overhead of calling functions, the interpreter processing bytecode, or the measuring tools we used.\nAt 14M calls per second on a single core, this is inevitable.\n\n## Edit\n\nPreviously the best result was 3M keys per second. Actually, 4.8M keys per second is possible on an average computer.\nThe [GitHub Actions](https://github.com/szabolcsdombi/optimization-demo/actions) runs still produce the original results.\n\nThere is an [extension](https://github.com/szabolcsdombi/optimization-demo-rust) to this article.\n\n## Summary\n\nBy implementing a simple task in C++ instead of Python, even where the underlying function calls are already implemented in C, we can still get a significant boost.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fszabolcsdombi%2Foptimization-demo","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fszabolcsdombi%2Foptimization-demo","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fszabolcsdombi%2Foptimization-demo/lists"}