Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.
Awesome Lists | Featured Topics | Projects
https://github.com/kurtamohler/pyobject-preservation

Preserving and resurrecting PyObjects in CPython
https://github.com/kurtamohler/pyobject-preservation
Last synced: 5 days ago
JSON representation
Preserving and resurrecting PyObjects in CPython
Host: GitHub
URL: https://github.com/kurtamohler/pyobject-preservation
Owner: kurtamohler
License: other
Created: 2022-12-27T23:12:27.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2023-02-27T00:28:23.000Z (almost 2 years ago)
Last Synced: 2024-12-15T09:10:32.625Z (21 days ago)
Language: C++
Size: 88.9 KB
Stars: 0
Watchers: 2
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project

README

        # pyobject-preservation

Preserving and resurrecting PyObjects in

[CPython](https://septatrix.github.io/cpython-dark-docs/extending/index.html#extending-index)

This repo shows an example implementation of a design pattern called "PyObject

preservation", which can be used when writing libraries that offer both

a Python and a C/C++ interface into the same underlying objects using CPython.

## Build

To build this project, you must have

[Miniconda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html)

installed.

Then run the following:

```shell

conda env create -f environment.yaml -n pyobject-preservation

```

```shell

conda activate pyobject-preservation

```

```shell

python setup.py install

```

## Run

Run the tests:

```shell

python test/test_mylib.py

```

## What is PyObject preservation?

Let's say we want to make a Python library that is implemented in C++. We want

a user-facing C++ API that is consistent with the Python API, offering all the

same methods, classes, and features. Users should be able to build a pure-C++

application, a pure-Python application, or choose to write different parts in

Python or C++. We'll need to be able to pass objects back and forth between

Python and C++ contexts. So any class in our Python API will contain

a reference to the underlying C++ class. Calling a method on the Python class

will end up calling the corresponding method in the C++ implementation.

In a library like this, how should we handle the deallocations for such a pair

of Python and C++ objects? Specifically, what should we do if the Python

object's reference count goes to zero but the C++ object needs to stay alive

because there are other references to it in the C++ context?

A good answer is to use "PyObject preservation".

[`PyObject`](https://docs.python.org/3/c-api/structures.html#c.PyObject) is the

C struct that CPython uses to represent all Python objects. We can implement

a deallocation function for our Python class that will cancel the deallocation

if the C++ object needs to stay alive. Then the Python object becomes

a "zombie PyObject", which can either be "resurrected" if we ever need to pass

the object back to the Python context, or it can be deallocated if the C++

object's refcount reaches zero.

## Example library `mylib`

In this repo, we have a Python library `mylib` and a pure-C++ library

`mylib_cpp`. The Python class `mylib.MyClass` contains an owning reference to

an underlying pure-C++ object `mylib_cpp::MyClass`.

We also have another Python class called `mylib.MyClassRef` which only exists

to allow us to exercise PyObject preservation. `mylib.MyClassRef` just retains

a reference to a `mylib_cpp::MyClass`, it doesn't retain a reference to

a `mylib.MyClass`. The method `mylib.MyClassRef.get()` returns

a `mylib.MyClass` object that corresponds with the underlying

`mylib_cpp::MyClass`. If the `mylib.MyClass` is in the zombie state, it will be

resurrected.

Here is a simple demonstration:

```python

import mylib

a = mylib.MyClass()

ref = mylib.MyClassRef(a)

# The PyObject becomes a zombie

del a

# The zombie PyObject is resurrected

b = ref.get()

```

When we call `del a`, the Python reference count for the `mylib.MyClass` object

goes to zero, since `ref` only has a reference to the underlying C++ object. If

`mylib.MyClass` did not have PyObject preservation, the garbage collector would

deallocate the `mylib.MyClass` instance. Then when `ref.get()` is called, a new

`mylib.MyClass` instance would be created for `b` to point to. But since

`mylib.MyClass` does have PyObject preservation, the PyObject is kept alive the

whole time so that `b` points to the same exact instance of `mylib.MyClass`

that `a` used to point to.

## Why not deallocate the PyObject and create a new one when needed?

An alternative to PyObject preservation is to just allow the PyObject to be

deallocated when its reference count goes to zero, even if the C++ object is

still alive. Then, if we ever need to pass the C++ object back up to Python, we

could simply create a new PyObject that has an owning reference to the C++

object. Depending on how you expect the user to use your class, this might be

good enough in some cases, but there are a few issues with it.

When we deallocate the PyObject we will lose information about its pure-Python

properties and subclasses, and weak references won't work properly. These

issues are described below.

### Property preservation

We can add properties on a class's `__dict__`, which will be preserved if the

PyObject is preserved.

```python

import mylib

a = mylib.MyClass()

a.__dict__['my_property'] = 'this is a property'

ref = mylib.MyClassRef(a)

del a

b = ref.get()

# This only works if PyObject was preserved:

b.my_property

```

In the above example, PyObject preservation preserves the properties we add to

a `mylib.MyClass` object, so we can access `b.my_property` after it is

resurrected. If `mylib.MyClass` did not have PyObject preservation, then

`b.my_property` would fail, since the instance of `mylib.MyClass` that held

`my_property` would be deallocated, and there would be no way for the

`ref.get()` call to restore `my_property` on the new `mylib.MyClass` object.

### Subclass preservation

PyObject preservation also preserves subclass information.

```python

import mylib

class MySubclass(mylib.MyClass):

  pass

a = MySublass()

ref = mylib.MyClassRef(a)

del a

b = ref.get()

# Fails if PyObject was not preserved

assert isinstance(b, MySubclass)

```

In the above example, since the PyObject is preserved, `b` will be an instance

of `MySubclass`. If the PyObject was not preserved, the subclass information

would be lost, and `b` would just be a `mylib.MyClass` instance, since

`ref.get()` would not know that it should be restored as a `MySubclass`

instance.

### Weak references

Another reason to use PyObject preservation is to properly support [weak

references](https://docs.python.org/3/library/weakref.html). If we have a weak

reference to a `mylib.MyClass` and all the strong Python references to it have

been deleted, the weakref should still point to the original `mylib.MyClass`

instance as long as the underlying C++ object is alive. If we ever need a new

strong reference, say if `ref.get()` is called, we wouldn't want this to create

a new `mylib.MyClass` instance, because any weakrefs we had would not be

updated to point to this new instance, so they would become invalid.

## Implementation

To implement PyObject preservation in your own project, you can use the implementation

in this repo as an example. Let's look at how it's done.

### Project architecture

First, let's look at how the modules and files in this project are organized.

`mylib` is a pure-Python library, starting in

[`mylib/__init__.py`](mylib/__init__.py). `mylib` calls into `_mylib`, which is

a Python library implemented in C++ with CPython in

[`mylib/csrc/Module.cpp`](mylib/csrc/Module.cpp) and the other files in

[`mylib/csrc`](mylib/csrc). `_mylib` calls into the classes and methods of

`mylib_cpp`, a pure-C++ library defined in [`mylib_cpp/`](mylib_cpp/).

`mylib.MyClass`, defined in [`mylib/_myclass.py`](mylib/_myclass.py), is just

a subclass of `_mylib.MyClassBase`, which is defined in

[`mylib/csrc/MyClassBase.h`](mylib/csrc/MyClassBase.h) and

[`mylib/csrc/MyClassBase.cpp`](mylib/csrc/MyClassBase.cpp). `mylib.MyClass`

does not add any methods or overload any methods of its parent class. In

CPython, the struct `MyClassBase` is the PyObject for `_mylib.MyClassBase`.

The CPython `MyClassBase` contains an `intrusive_ptr` that points to the

underlying pure-C++ `mylib_cpp::MyClass`, which is defined in

[`mylib_cpp/MyClass.h`](mylib_cpp/MyClass.h).

### Quick note: `intrusive_ptr`

`intrusive_ptr` is a kind of smart pointer, copied from

[PyTorch](https://github.com/pytorch/pytorch/blob/master/c10/util/intrusive_ptr.h),

for which the reference count of an object is stored on the object itself,

which is why it's called "intrusive". When the reference count is decremented

to zero, the object is deallocated. `mylib_cpp::MyClass` has to be a subclass

of `intrusive_ptr_target` for this purpose. You could potentially use

a different solution for reference counting C++ objects in your project. Boost

offers an

[`intrusive_ptr`](https://www.boost.org/doc/libs/1_46_0/libs/smart_ptr/intrusive_ptr.html)

as well.

### PyObject preservation metaclass

When a Python object's reference count goes to zero, the object's finalizer,

the

[`__del__`](https://docs.python.org/3/reference/datamodel.html#object.__del__)

method, is usually called automatically. This function is typically used to

clean up any resources that the class instance was using. But we want to

implement a deallocation function for `mylib.MyClass` that can detect whether

the underlying `mylib_cpp::MyClass` has more than one reference to it, and if

so, cancel the deallocation and give the `mylib_cpp::MyClass` object a pointer

to the PyObject of the `mylib.MyClass`.

One might attempt to cancel the deallocation within the finalizer of

`_mylib.MyClassBase`, since we want to have all of the PyObject preservation

logic contained within the base class. However, the finalizer of the subclass

will be called first and then the finalizer of the base class is called (and

then the finalizer of the base class's base class is called, etc). In the case

of `mylib.MyClass`, this would be bad. If `mylib.MyClass.__del__` is called

first, it would unconditionally delete all the subclass information on the

object.  Then when `MyClassBase`'s finalizer is called, it would cancel the

rest of the deallocation, leaving the PyObject in a partially deallocated

state. That doesn't accomplish PyObject preservation, since we've lost the

subclass information.

Instead, we need to override the

[`tp_dealloc`](https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc)

function of any new instance of `MyClassBase` and its subclasses, and we need

to use a custom

[metaclass](https://docs.python.org/3/reference/datamodel.html#metaclasses) to

accomplish this. The default `tp_dealloc` function is what calls the finalizers

in the order described above.

In [mylib/csrc/MyClassBase.cpp](mylib/csrc/MyClassBase.cpp), the metaclass

`_mylib.MyClassMeta` is defined and applied to `_mylib.MyClassBase`. When any

subclass of `_mylib.MyClassBase` (like `mylib.MyClass`) is being created,

`MyClassMeta_init()` is called. This function overrides the

[`tp_dealloc`](https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_dealloc)

function of the new object being created. We set it to our own deallocation

function `pyobj_preservation::dealloc_or_preserve()`. If the object needs to be

preserved, we avoid calling any finalizers, and we don't end up with

a partially deallocated object. We'll talk about what exactly this function

does in the next section.

### PyObject preservation logic

Now we can finally look at the core logic of preserving and resurrecting

PyObjects, which is in

[`mylib/csrc/PyObjectPreservation.h`](mylib/csrc/PyObjectPreservation.h). The

functions in here are template functions so they could be used to add PyObject

preservation for multiple different classes without having to duplicate the

code. The template type `BaseT` refers to the PyObject type for the class,

which is `MyClassBase` in this case. The template type `CppT` refers to the

underlying C++ class, which is `mylib_cpp::MyClass` in this case.

`init_pyobj()` is one of the functions here. This gets called from

`MyClassBase_new` (which is `MyClassBase`'s overload of

[`tp_new`](https://docs.python.org/3/c-api/typeobj.html#c.PyTypeObject.tp_new))

in [`mylib/csrc/MyClassBase.cpp`](mylib/csrc/MyClassBase.cpp) when a new

`MyClassBase` is created. It handles creating an owned instance of the

underlying `mylib_cpp::MyClass`.

`dealloc_or_preserve()` is also defined here, which, as mentioned before, is

called when the refcount of a `mylib.MyClass` instance goes to zero. If the

refcount of the underlying C++ object is 1, that means that the only reference

to it comes from the PyObject, whose refcount is now zero, and both the C++

object and PyObject are deallocated.

If the refcount of the underlying C++ object is greater than one when

`dealloc_or_preserve()` is called, the C++ object needs to be kept alive and

the PyObject needs to be preserved. The ownership between the PyObject and the

C++ object is switched by clearing the PyObject's owning reference to the C++

object and giving the C++ an owning reference to the PyObject. The deallocation

has been successfully cancelled and the PyObject is now a zombie.

How does a pure-C++ `mylib_cpp::MyClass` object hold a reference to the

PyObject? For this, the `mylib_cpp::MyClass` contains a `PyObjectSlot`, which

is defined in [`mylib_cpp/PyObjectSlot.h`](mylib_cpp/PyObjectSlot.h) and

[`mylib_cpp/PyObjectSlot.cpp`](mylib_cpp/PyObjectSlot.cpp). `PyObjectSlot` just

holds a void pointer to the PyObject so that we don't need to include the

Python headers in `mylib_cpp`. `PyObjectSlot` has a boolean switch to keep

track of whether or not it has taken ownership over the PyObject (that is,

whether the PyObject is a zombie or not).

`PyObjectSlot` also has a pointer to something called a `PyInterpreter`, which

is invoked to deallocate a zombie PyObject when the pure-C++ object's reference

count reaches zero. The `PyInterpreter`, defined in

[`mylib_cpp/PyInterpreter.h`](mylib_cpp/PyInterpreter.h) contains a function

pointer that gets set in the CPython context, in

[`mylib/csrc/PyInterpreterDefs.h`](mylib/csrc/PyInterpreterDefs.h). The

`concrete_decref_fn` here simply decrements the refcount of the zombie PyObject

so that it gets cleaned up properly. `init_pyObj()` in

[`mylib/csrc/PyObjectPreservation.h`](mylib/csrc/PyObjectPreservation.h)

initializes the PyObjectSlot of the underlying `mylib_cpp::MyClass` object of

all new `MyClassBase` objects to use the same `PyInterpreter`.

`get_pyobj_from_cdata()`, defined in

[`mylib/csrc/PyObjectPreservation.h`](mylib/csrc/PyObjectPreservation.h), is

the last piece of the puzzle we need. This function is called any time that we

have an existing `mylib_cpp::MyClass` object that we need to pass into the

Python context. If the PyObject that the `mylib_cpp::MyClass`'s PyObjectSlot

points to is not a zombie (that is, the PyObject owns the pure-C++ object),

then the PyObject is simply returned and its refcount is incremented. But if

the PyObject is a zombie, it is resurrected before being returned. To resurrect

the PyObject, we simply have to flip the ownership back again by giving the

`MyClassBase` an owning reference to the `mylib_cpp::MyClass` and telling the

`PyObjectSlot` that it no longer owns the PyObject.

## PyTorch example

Let's look at a real example of PyObject preservation. The idea of PyObject

preservation originated in [PyTorch](https://github.com/pytorch/pytorch), which

has a Python class called `torch.Tensor` and a C++ class called

`torch::Tensor`. For the most part, they both have all the same public methods

and properties. Instances of `torch.Tensor` and `torch::Tensor` can share the

same internal state, which makes it possible to pass a `torch.Tensor` into

a function that is implemented in C++ which will operate on a corresponding

`torch::Tensor`. We can also go the opposite direction--starting with

a `torch::Tensor` that was created in the C++ context, we can propagate it into

the Python context as a `torch.Tensor`.

We won't get into the details of how exactly this is set up in PyTorch, because

these classes have lots of layers of abstraction and other things going on. But

at the bottom of all those layers, both `torch.Tensor` and `torch::Tensor` have

a reference to a `c10::TensorImpl` object which is implemented in C++. When

there are live references to a `torch.Tensor` in Python, its PyObject has an

owning reference to the `c10::TensorImpl`. If the Python references ever go to

zero but there are still references to the `c10::TensorImpl` in C++, then the

Python reference count to the PyObject for `torch.Tensor` will be incremented

once to keep it alive, and the `c10::TensorImpl` will take an owning reference

to the PyObject.

If the tensor ever needs to be returned to the Python context, the PyObject is

resurrected and ownership is flipped so that the `torch.Tensor` has an owning

reference to the `c10::TensorImpl` again. Or if the the reference count to

`c10::TensorImpl` ever goes to zero, its destructor will take care of

decrementing the PyObject's reference count so that it will be deallocated.

However, decrementing the reference count of a PyObject from

`c10::TensorImpl`'s destructor is a bit tricky. We can't just `#include

` and call `Py_DECREF(PyObject*)`. PyTorch supports building C++

applications that don't depend on Python, so `c10::TensorImpl` is defined in

a pure C++ context that cannot directly depend on the CPython library. Put very

simply, PyTorch solves this problem by using a function pointer that either

points to a decref function if the Python context exists or the function

pointer is null if the `c10::TensorImpl` is pure C++.

If you're interested in learning more about PyObject preservation and related

topics in PyTorch, here are some relevant PyTorch Developer Podcast episodes:

  * [PyObject preservation](https://pytorch-dev-podcast.simplecast.com/episodes/pyobject-preservation)

  * [Reference counting](https://pytorch-dev-podcast.simplecast.com/episodes/reference-counting)

  * [Weak references](https://pytorch-dev-podcast.simplecast.com/episodes/weak-references)