https://github.com/szabolcsdombi/optimization-demo
:zap: Optimizing Python code by implementing a C++ extension
- Host: GitHub
- URL: https://github.com/szabolcsdombi/optimization-demo
- Owner: szabolcsdombi
- License: mit
- Created: 2023-06-30T07:42:01.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-07-12T05:01:12.000Z (over 2 years ago)
- Last Synced: 2025-03-27T01:46:59.098Z (about 1 year ago)
- Topics: benchmark, cpp, optimization, python
- Language: C++
- Homepage:
- Size: 36.1 KB
- Stars: 47
- Watchers: 2
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# optimization-demo
This article is about optimizing a tiny bit of Python code by replacing it with its C++ counterpart.
Beware, geek stuff follows.
We are interested in implementing the [Opening Handshake](https://datatracker.ietf.org/doc/html/rfc6455#section-1.3) of [The WebSocket Protocol](https://datatracker.ietf.org/doc/html/rfc6455).
The task is fairly simple to understand, yet it involves enough number crunching and intermediate object allocations to show up in the results.
To be clear, this is a demo project with no real world benefits except for the methodology used.
Let's get started.
First, we implement a function that returns the `Sec-WebSocket-Accept` value calculated from the `Sec-WebSocket-Key`.
```py
from base64 import b64encode
from hashlib import sha1


def py_accept(key: str) -> str:
    return b64encode(sha1((key + '258EAFA5-E914-47DA-95CA-C5AB0DC85B11').encode()).digest()).decode()
```
We can easily verify the sample value from the spec matches our return value.
```py
>>> py_accept('dGhlIHNhbXBsZSBub25jZQ==')
's3pPLMBiTxaQ9kYGzzhZRbK+xOo='
```
So far so good. Now let's disassemble it to see what is inside.
```py
>>> import dis
>>> dis.dis(py_accept)
  1           0 RESUME                   0
  2           2 LOAD_GLOBAL              1 (NULL + b64encode)
             14 LOAD_GLOBAL              3 (NULL + sha1)
             26 LOAD_FAST                0 (key)
             28 LOAD_CONST               1 ('258EAFA5-E914-47DA-95CA-C5AB0DC85B11')
             30 BINARY_OP                0 (+)
             34 LOAD_METHOD              2 (encode)
             56 PRECALL                  0
             60 CALL                     0
             70 PRECALL                  1
             74 CALL                     1
             84 LOAD_METHOD              3 (digest)
            106 PRECALL                  0
            110 CALL                     0
            120 PRECALL                  1
            124 CALL                     1
            134 LOAD_METHOD              4 (decode)
            156 PRECALL                  0
            160 CALL                     0
            170 RETURN_VALUE
```
Okay, this looks a bit messy. Here is a transformed version of it.
```py
26 LOAD_FAST 0 (key)
28 LOAD_CONST 1 ('258EAFA5-E914-47DA-95CA-C5AB0DC85B11')
30 BINARY_OP 0 (+)
60 CALL 0 (encode)
74 CALL 1 (sha1)
110 CALL 0 (digest)
124 CALL 1 (b64encode)
160 CALL 0 (decode)
170 RETURN_VALUE
```
Apart from the temporary objects created between the steps, the code looks about as fast as it can get: every function it calls is implemented in C inside CPython.
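To make those temporaries visible, here is a minimal sketch that splits the one-liner into separate steps. The name `py_accept_steps` and the variable names are made up for illustration; the logic is the same as `py_accept`.

```py
from base64 import b64encode
from hashlib import sha1

GUID = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11'


def py_accept_steps(key: str) -> str:
    # Hypothetical step-by-step variant of py_accept; every name below is a temporary object.
    joined = key + GUID            # new str (24 + 36 = 60 characters for a valid key)
    raw = joined.encode()          # new bytes object holding the UTF-8 encoding
    hasher = sha1(raw)             # new hash object
    digest = hasher.digest()       # new 20-byte bytes object
    encoded = b64encode(digest)    # new 28-byte bytes object
    return encoded.decode()        # the str that is finally returned


print(py_accept_steps('dGhlIHNhbXBsZSBub25jZQ=='))
```

Five of these six objects are thrown away as soon as the function returns.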
Now, we are going to implement all this as a single step in C++. To do that, we can initialize a Python extension and add our new function.
```c++
#include <Python.h>

// Reads the 24-character key from src and writes the 28-character accept value to dst.
void sec_websocket_accept(const void * src, void * dst) {
    ...
}

PyObject * c_accept(PyObject * self, PyObject * arg) {
    char result[28];
    Py_ssize_t len = 0;
    const char * key = PyUnicode_AsUTF8AndSize(arg, &len);
    if (!key || len != 24) {
        PyErr_SetString(PyExc_ValueError, "invalid key");
        return NULL;
    }
    sec_websocket_accept(key, result);
    return PyUnicode_FromStringAndSize(result, 28);
}

PyMethodDef module_methods[] = {
    {"c_accept", (PyCFunction)c_accept, METH_O, NULL},
    {},
};

PyModuleDef module_def = {PyModuleDef_HEAD_INIT, "mymodule", NULL, -1, module_methods};

extern "C" PyObject * PyInit_mymodule() {
    return PyModule_Create(&module_def);
}
```
The implementation of `sec_websocket_accept()` is cumbersome enough to be worth omitting from this article.
Here is a [link](mymodule/mymodule.cpp) to the full code.
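The fixed sizes in `c_accept` (`len != 24` and `char result[28]`) follow from the sizes of the intermediate values. The short check below is only an illustration in Python, not part of the extension; it uses the sample key from the spec.

```py
from base64 import b64decode, b64encode
from hashlib import sha1

GUID = b'258EAFA5-E914-47DA-95CA-C5AB0DC85B11'
key = b'dGhlIHNhbXBsZSBub25jZQ=='  # sample key from the spec

nonce = b64decode(key)
digest = sha1(key + GUID).digest()
accept = b64encode(digest)

print(len(nonce), len(key))      # 16-byte nonce -> 24 base64 characters
print(len(key + GUID))           # 24 + 36 = 60 bytes fed to SHA-1
print(len(digest), len(accept))  # 20-byte digest -> 28 base64 characters
```

Because every size is known in advance, the C++ side can work with fixed-size stack buffers and never needs a heap allocation for the intermediate values.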
It might not be obvious, but neither the Python nor the C++ variant contains micro-optimizations or hardware-specific tricks.
These are just naive implementations. We can achieve significant results without any of that.
We can add our tests, and see the results.
```py
from base64 import b64encode
from hashlib import sha1

from mymodule import c_accept


def py_accept(key: str) -> str:
    return b64encode(sha1((key + '258EAFA5-E914-47DA-95CA-C5AB0DC85B11').encode()).digest()).decode()


def test_python_code(benchmark):
    assert benchmark(py_accept, 'dGhlIHNhbXBsZSBub25jZQ==') == 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='


def test_optimized_code(benchmark):
    assert benchmark(c_accept, 'dGhlIHNhbXBsZSBub25jZQ==') == 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='
```
## Results
```
--------------------------------------------------------------------------------------------------------
Name (time in ns)            Mean               StdDev             Median            OPS (Mops/s)
--------------------------------------------------------------------------------------------------------
test_optimized_code      208.0669 (1.0)       0.1062 (1.0)      208.0665 (1.0)       4.8061 (1.0)
test_python_code         893.2082 (4.29)      6.2278 (58.63)    889.6251 (4.28)      1.1196 (0.23)
--------------------------------------------------------------------------------------------------------
```
- Our Python code did really well: it can execute 1.1 million calls per second.
- Still, our C++ variant is 4.29x faster, clocking in at 4.8 million calls per second.
Amazing! Replacing a tiny bit of code that does not look optimizable has a significant effect.
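The ratio and the OPS column can be recomputed from the mean times alone. The snippet below is just that arithmetic, with the two means copied from the table above.

```py
# Mean times in nanoseconds, copied from the results table.
mean_c_ns = 208.0669
mean_py_ns = 893.2082

speedup = mean_py_ns / mean_c_ns
mops_c = 1e3 / mean_c_ns    # 1e9 ns per second divided by 1e6 calls per Mop
mops_py = 1e3 / mean_py_ns

print(f'{speedup:.2f}x, {mops_c:.4f} Mops/s, {mops_py:.4f} Mops/s')
```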
If you wish to run these tests yourself, you will find all the necessary steps in the GitHub Actions run [here](https://github.com/szabolcsdombi/optimization-demo/actions/runs/5423258436/jobs/9860949567).
## No-Goals of this Article
- This article does not address maintainability or any other burden introduced with replacing simple Python code with cumbersome low level C code.
- We are not interested in micro-optimization, using SSE or hardware implemented hashing.
- Not interested in multi-threaded approaches, concurrency.
- Not interested in implementing it in Rust or any other language not supported out of the box for Python Extensions.
## Fun Fact
We can implement a magic function that may also work.
```py
def magic_accept(key: str) -> str:
    return 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='
```
Silly, but indeed it passes the test.
```
--------------------------------------------------------------------------------------------------------
Name (time in ns)            Mean               StdDev             Median            OPS (Mops/s)
--------------------------------------------------------------------------------------------------------
test_magic_code           70.7236 (1.0)       0.0584 (1.0)       70.7438 (1.0)      14.1396 (1.0)
test_optimized_code      210.7457 (2.98)      0.2183 (3.74)     210.7261 (2.98)      4.7451 (0.34)
test_python_code         912.7083 (12.91)     5.5000 (94.24)    912.7722 (12.90)     1.0956 (0.08)
--------------------------------------------------------------------------------------------------------
```
The disassembled version is simple too.
```py
>>> dis.dis(magic_accept)
  1           0 RESUME                   0
  2           2 LOAD_CONST               1 ('s3pPLMBiTxaQ9kYGzzhZRbK+xOo=')
              4 RETURN_VALUE
```
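The magic function passes only because the test checks a single key. A quick sketch with a second key shows the difference; `other_key` is an arbitrary value chosen for this illustration, not one taken from the spec.

```py
from base64 import b64encode
from hashlib import sha1


def py_accept(key: str) -> str:
    return b64encode(sha1((key + '258EAFA5-E914-47DA-95CA-C5AB0DC85B11').encode()).digest()).decode()


def magic_accept(key: str) -> str:
    return 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='


# Hypothetical second key: 16 zero bytes, base64-encoded to 24 characters.
other_key = b64encode(bytes(16)).decode()

print(magic_accept(other_key) == py_accept(other_key))  # False
```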
So, how does this new function compare to our existing ones that do real work?
Surprising as it may sound, our C++ implementation is just 2.98x slower.
(From measurements and interpretation we are now entering the realm of guesswork.)
This could be due to the overhead of calling functions, the interpreter executing bytecode, or the measuring tools we used.
At 14 million calls per second on a single core, such overhead is inevitable.
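One rough way to probe this guess is to time the magic function next to an empty statement using the standard `timeit` module. This is a sketch, not the benchmark setup used above; the iteration count is arbitrary and the numbers will differ from machine to machine.

```py
from timeit import repeat


def magic_accept(key: str) -> str:
    return 's3pPLMBiTxaQ9kYGzzhZRbK+xOo='


key = 'dGhlIHNhbXBsZSBub25jZQ=='
number = 200_000  # arbitrary iteration count for this sketch

# Best of five runs, converted to nanoseconds per iteration.
call_ns = min(repeat('magic_accept(key)', globals=globals(), number=number, repeat=5)) / number * 1e9
noop_ns = min(repeat('pass', number=number, repeat=5)) / number * 1e9

print(f'magic_accept: {call_ns:.1f} ns, empty statement: {noop_ns:.1f} ns')
```

The difference between the two figures approximates the cost of one Python-level function call, which is the floor any callable pays regardless of what it computes.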
## Edit
Previously the best result was 3M keys per second. Actually 4.8M keys per second is possible on an average computer.
The [GitHub Actions](https://github.com/szabolcsdombi/optimization-demo/actions) runs still produce the original results.
There is an [extension](https://github.com/szabolcsdombi/optimization-demo-rust) to this article.
## Summary
By implementing a simple task in C++ instead of Python, even where the underlying function calls are already implemented in C, we can still get a significant boost.