https://github.com/quark-zju/experimental-fastimport
Incomplete work on fast Python object deserialization
- Host: GitHub
- URL: https://github.com/quark-zju/experimental-fastimport
- Owner: quark-zju
- Created: 2018-12-28T20:43:36.000Z (over 6 years ago)
- Default Branch: @
- Last Pushed: 2023-04-23T20:53:45.000Z (about 2 years ago)
- Last Synced: 2025-01-08T08:45:52.450Z (5 months ago)
- Language: Python
- Size: 114 KB
- Stars: 0
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
Experimental fast Python object serialization.
## The idea
The idea is to serialize Python objects recursively as their raw memory representations (think of `memcpy`-ing a `PyObject` and its dependencies). Loading objects can then be just an `mmap`, followed by:
- Adjust pointers. Libraries might be loaded at different addresses due to ASLR, so related pointers (e.g. types, subclasses, references to other objects) need to be adjusted.
- Side effects. For example, the side effects of creating mutexes need to be re-done. Native module initialization might need to be re-done (and can be tricky).

The ideal end result is that `sys.modules` can be serialized and loaded this way to achieve fast module import, followed by adjusting some state like `os.environ`.
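The pointer-adjustment step can be illustrated with a toy model in pure Python. This is a minimal sketch, not the repository's code: `serialize`, `load`, and `walk` are hypothetical names, the 16-byte node layout is made up for illustration, and the "pointers" are plain 64-bit integers in a byte buffer rather than real `PyObject` fields.

```python
import struct

PTR = struct.Struct("<Q")  # one 64-bit little-endian "pointer" slot


def serialize(nodes, old_base):
    """Pack (value, next_index) nodes into two 8-byte slots each.

    The second slot holds an absolute address computed against old_base;
    its buffer offset is recorded in a relocation list.
    """
    buf = bytearray(len(nodes) * 16)
    relocs = []
    for i, (value, next_index) in enumerate(nodes):
        PTR.pack_into(buf, i * 16, value)
        if next_index is not None:
            PTR.pack_into(buf, i * 16 + 8, old_base + next_index * 16)
            relocs.append(i * 16 + 8)
    return bytes(buf), relocs


def load(buf, relocs, old_base, new_base):
    """Copy the buffer and shift every recorded pointer slot by the base delta."""
    out = bytearray(buf)
    delta = new_base - old_base
    for offset in relocs:
        (ptr,) = PTR.unpack_from(out, offset)
        PTR.pack_into(out, offset, ptr + delta)
    return out


def walk(buf, base):
    """Follow the linked nodes starting at offset 0, yielding each value."""
    offset = 0
    while True:
        (value,) = PTR.unpack_from(buf, offset)
        (next_ptr,) = PTR.unpack_from(buf, offset + 8)
        yield value
        if next_ptr == 0:  # 0 stands in for a NULL pointer
            return
        offset = next_ptr - base
```

Because the relocation list is generic, the loader does not need to know what kind of node it is patching; it only needs the offsets and the difference between the old and new base addresses.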
## Implementation
I used `cffi` to read the Python source code and figure out the underlying structures of different types of Python objects, then implemented serialization for each type involved. Serialization produces metadata describing how to do the pointer adjustments, so deserialization does not need to be implemented separately for each type.
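To give a feel for what "underlying structures" means here, the following sketch inspects the object header with `ctypes` (not `cffi`, and not the repository's code). It assumes a default CPython build, where `id()` returns the object's address and every object starts with `ob_refcnt` followed by `ob_type`; debug or free-threaded builds lay the header out differently. `PyObjectHeader`, `header_of`, and `pointer_fields` are hypothetical names.

```python
import ctypes


class PyObjectHeader(ctypes.Structure):
    # Assumed layout of the start of every object on a default CPython build.
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),
        ("ob_type", ctypes.c_void_p),
    ]


def header_of(obj):
    """View the object header in place; id() is the address in CPython."""
    return PyObjectHeader.from_address(id(obj))


def pointer_fields(struct_type):
    """Return (name, offset) for each pointer-typed field.

    These are the slots a serializer would need to list in its
    pointer-adjustment metadata.
    """
    return [
        (name, getattr(struct_type, name).offset)
        for name, ctype in struct_type._fields_
        if ctype is ctypes.c_void_p
    ]
```

For example, `header_of(b"hello").ob_type` should equal `id(bytes)` under the stated assumptions, which is exactly the kind of pointer that has to be rewritten when `libpython` is loaded at a different address.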
## Outcome
The experiment targeted the modules used by `hg`, and got about 80% of hg's integration tests passing on Linux. I didn't continue fixing the remaining issues. Windows support is missing.
## Learnings
- Performance:
  - Pointer adjustment takes a long time. Loading all `hg` modules takes about 50ms.
  - ASLR is annoying for performance.
  - Mmapping the serialized buffer at a fixed offset can avoid the "relative object" pointer adjustments, but is not noticeably faster. Lots of pointers still need to be changed for basic types like `PyBytes_Type`. I haven't run experiments with ASLR disabled for libpython, but I guess that might help performance.
- Correctness:
  - `Dict[object, ...]` is a source of test failures, since `id(obj)` can change. It wasn't fixed.
  - `ctypes` is a pain to handle. In this implementation, pointers into libraries like `libfoo.so` are tracked as `libfoo.so+offset` and adjusted using the new `libfoo.so` location. This is a general-purpose mechanism for all native libraries. It worked relatively well.
  - Practically, if this approach were to be productionized, the serialization would probably need to be pickier about its input and error out on any exotic types.
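The `libfoo.so+offset` tracking mentioned above can be sketched as follows. This is an illustration under assumptions, not the repository's implementation: `to_symbolic`, `to_absolute`, and the `load_map` format are hypothetical, and the load addresses are invented rather than read from the running process.

```python
import bisect


def to_symbolic(address, load_map):
    """Convert an absolute address to (library, offset).

    load_map is a hypothetical list of (base, size, name) tuples sorted by base.
    Returns None when the address is not inside any known library.
    """
    bases = [base for base, _size, _name in load_map]
    i = bisect.bisect_right(bases, address) - 1
    if i < 0:
        return None
    base, size, name = load_map[i]
    if address >= base + size:
        return None
    return name, address - base


def to_absolute(symbolic, new_bases):
    """Resolve (library, offset) against the bases seen in the loading process."""
    name, offset = symbolic
    return new_bases[name] + offset


# Made-up addresses: the library moves between the saving and loading process.
old_map = [(0x7F0000000000, 0x2000, "libfoo.so")]
sym = to_symbolic(0x7F0000000123, old_map)
new_address = to_absolute(sym, {"libfoo.so": 0x7E0000000000})
```

Storing the pair instead of the raw address is what makes the scheme independent of where ASLR places the library in the loading process.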