- Tags: Python
- Date: January 28, 2015
- Jump to comments
Today, we’re going to demonstrate a fairly evil thing in Python, which I call object replacement.
Say you have some program that’s been running for a while, and a particular
object has made its way throughout your code. It lives inside lists, class
attributes, maybe even inside some closures. You want to completely replace
this object with another one; that is to say, you want to find all references
to object A
and replace them with object B
, enabling A
to be garbage
collected. This has some interesting implications for special object types. If
you have methods that are bound to A
, you want to rebind them to B
. If A
is a class, you want all instances of A
to become instances of B
. And so
on.
But why on Earth would you want to do that? you ask. I’ll focus on a concrete use case in a future post, but for now, I imagine this could be useful in some kind of advanced unit testing situation with mock objects. Still, it’s fairly insane, so let’s leave it primarily as an intellectual exercise.
This article is written for CPython 2.7.[1]
Review
First, a recap on terminology here. You can skip this section if you know Python well.
In Python, names are what most languages call “variables”. They reference objects. So when we do:
…we are creating a list object with four integers, and binding it to the name
a
. In graph form:[2]
In each of the following examples, we are creating new references to the
list object, but we are never duplicating it. Each reference points to the same
memory address (which you can get using id(a)
).
Note that these references are all equal. a
is no more valid a name for the
list than b
, c.data
, or L
(or d.func_closure[0].cell_contents
to the
outside world). As a result, if you delete one of these references—explicitly
with del a
, or implicitly if a name goes out of scope—then the other
references are still around, and object continues to exist. If all of an
object’s references disappear, then Python’s garbage collector should eliminate
it.
Dead ends
My first thought when approaching this problem was to physically write over the
memory where our target object is stored. This can be done using
ctypes.memmove()
from the Python standard library:
What we are doing here is overwriting the fields of the A
instance of the
PyClassObject
C struct
with fields from the B
struct instance. As a result, they now share various
properties, such as their attribute dictionaries
(__dict__
).
So, we can do things like this:
However, there are clear issues. What we’ve done is create a
shallow copy.
Therefore, A
and B
are still distinct objects, so certain changes made to
one will not be replicated to the other:
Also, this won’t work if A
and B
are different sizes, since we will be
either reading from or writing to memory that we don’t necessarily own:
Oh, and there’s a bit of a problem when we deallocate these objects, too…
Fishing for references with Guppy
A more appropriate solution is finding all of the references to the old object, and then updating them to point to the new object, rather than replacing the old object directly.
But how do we track references? Fortunately, there’s a library called
Guppy that allows us to do this. Often used
for diagnosing memory leaks, we can take advantage of its robust object
tracking features here. Install it with pip
(pip install guppy
).
I’ve always found Guppy hard to use (as many debuggers are, though justified by the complexity of the task involved), so we’ll begin with a feature demo before delving into the actual problem.
Feature demonstration
Guppy’s interface is deceptively simple. We begin by calling
guppy.hpy()
,
to expose the Heapy interface, which is the component of Guppy with the
features we want:
Calling
hp.heap()
shows us a table of the objects known to Guppy, grouped together
(mathematically speaking,
partitioned) by
type[3] and sorted by how much space
they take up in memory:
This object (called an
IdentitySet
)
looks bizarre, but it can be treated roughly like a list. If we want to take a
look at strings, we can do heap[0]
:
This isn’t very useful, though. What we really want to do is re-partition this subset using another relationship. There are a number of options, such as:
From this, we can see that the plurality of memory devoted to strings is taken
up by those referenced by code objects (types.CodeType
represents
Python code—accessible from a non-C-defined function through
func.func_code
—and contains things like the names of its local variables and
the actual sequence of opcodes that make it up).
For fun, let’s pick a random string.
Interesting. Since this heap subset contains only one element, we can use
.theone
to get the actual object represented here:
Looks like the docstring for the
types
module. We can confirm
by using
.referrers
to get the set of objects that refer to objects in the given set:
This is types.__dict__
(since the docstring we got is actually stored as
types.__dict__["__doc__"]
), so if we use .referrers
again:
But why did we find an object in the types
module if we never imported it?
Well, let’s see. We can use
hp.iso()
to get the Heapy set consisting of a single given object:
Using a similar procedure as before, we see that types
is imported by the
traceback
module:
…and that is imported by
site
:
Since site
is imported by Python on startup, we’ve figured out why objects
from types
exist, even though we’ve never used them.
We’ve learned something important, too. When objects are stored as ordinary
attributes of a parent object (like types.__doc__
, traceback.types
, and
site.traceback
from above), they are not referenced directly by the parent
object, but by that object’s __dict__
attribute. Therefore, if we want to
replace A
with B
and A
is an attribute of C
, we (probably) don’t need
to know anything special about C
—just how to modify dictionaries.
A good Guppy/Heapy tutorial, while a bit old and incomplete, can be found on Andrey Smirnov’s website.
Examining paths
Let’s set up an example replacement using class instances:
Suppose we want to replace a
with b
. From the demo above, we know that we
can get the Heapy set of a single object using hp.iso()
. We also know we can
use .referrers
to get the set of objects that reference the given object:
a
is only referenced by one object, which makes sense, since we’ve only used
it in one place—as a local variable—meaning hp.iso(a).referrers.theone
must
be locals()
:
However, there is a more useful feature available to us:
.pathsin
.
This also returns references to the given object, but instead of a Heapy set,
it is a list of Path
objects. These are more useful since they tell us not
only what objects are related to the given object, but how they are
related.
This looks very ambiguous. However, we find that we can extract the source of
the reference using .src
:
…and, we can examine the type of relation by looking at .path[1]
(the
actual reason for this isn’t worth getting into, due to Guppy’s lack of
documentation on the subject):
We notice that relation
is a Based_R_INDEXVAL
object. Sounds bizarre, but
this tells us that a
is a particular indexed value of path.src
. What index?
We can get this using relation.r
:
Ah ha! So now we know that a
is equal to the reference source (i.e.,
path.src.theone
) indexed by rel
:
But path.src.theone
is just a dictionary, meaning we know how to modify it
very easily:[4]
Bingo. We’ve successfully replaced a
with b
, using a general method that
should work for any case where a
is in a dictionary-like object.
Handling different reference types
We’ll continue by wrapping this code up in a nice function, which we will expand as we go:
Dictionaries, lists, and tuples
As noted above, this is versatile to handle many dictionary-like situations,
including __dict__
, which means we already know how to replace object
attributes:
Lists can be handled exactly the same as dictionaries, although the keys in
this case (i.e., relation.r
) will always be integers.
Tuples are interesting. We can’t modify them directly because they’re immutable, but we can create a new tuple with the new value, and then replace that tuple just like we replaced our original object:
As a result:
Bound methods
Here’s a fun one. Let’s upgrade our definitions of A
and B
:
After replacing a
with b
, a.func
no longer exists, as we’d expect:
But what if we save a reference to a.func
before the replacement?
Hmm. So f
has kept a reference to a
somehow, but not in a dictionary-like
object. So where is it?
Well, we can reveal it with the attribute f.__self__
:
Unfortunately, this attribute is magical and we can’t write to it directly:
Python clearly doesn’t want us to re-bind bound methods, and a reasonable
person would give up here, but we still have a few tricks up our sleeve. Let’s
examine the internal C structure of bound methods,
PyMethodObject
:
The four gray fields of the struct come from
PyObject_HEAD
,
which exist in all Python objects. The first two fields are from
_PyObject_HEAD_EXTRA
,
and only exist when the debugging macro Py_TRACE_REFS
is defined, in order to
support more advanced reference counting. We can see that the im_self
field,
which mantains the reference to our target object, is either forth or sixth in
the struct depending on Py_TRACE_REFS
. If we can figure out the size of the
field and its offset from the start of the struct, then we can set its value
directly using ctypes.memmove()
:
Here, id(f)
is the memory location of our method, which refers to the start
of the C struct from above. offset
is the number of bytes between this memory
location and the start of the im_self
field. We use
ctypes.byref()
to create a reference to the replacement object, b
, which will be copied over
the existing reference to a
. Finally, field_size
is the number of bytes
we’re copying, equal to the size of the im_self
field.
Well, all but one of these fields are pointers to structure types, meaning they
have the same size,[5] equal to
ctypes.sizeof(ctypes.py_object)
.
This is (probably) 4 or 8 bytes, depending on whether you’re on a 32-bit or a
64-bit system. The other field is a Py_ssize_t
object—possibly the same size
as the pointers, but we can’t be sure—which is equal to
ctypes.sizeof(ctypes.c_ssize_t)
.
We know that field_size
must be ctypes.sizeof(ctypes.py_object)
, since we
are copying a structure pointer. offset
is this value multiplied by the
number of structure pointers before im_self
(4 if Py_TRACE_REFS
is defined
and 2 otherwise), plus ctypes.sizeof(ctypes.c_ssize_t)
for ob_type
. But how
do we determine if Py_TRACE_REFS
is defined? We can’t check the value of a
macro at runtime, but we can check for the existence of
sys.getobjects()
,
which is
only defined when that macro is.
Therefore, we can make our replacement like so:
Excellent—it worked!
There’s another kind of bound method, which is the built-in variety as opposed
to the user-defined variety we saw above. An example is a.__sizeof__()
:
This is stored internally as a
PyCFunctionObject
.
Let’s take a look at its layout:
Fortunately, m_self
here has the same offset as im_self
from before, so we
can just use the same code:
Dictionary keys
Dictionary keys have a different reference relation type than values, but the
replacement works mostly the same way. We pop the value of the old key from the
dictionary, and then insert it in again under the new key. Here’s the code,
which we’ll stick into the main block in replace()
:
And, a demonstration:
Closure cells
We’ll cover just one more case, this time involving a closure. Here’s our test function:
As we can see, an instance of the inner function keeps references to the locals
of the wrapper function, even after using our current
version of replace()
:
Internally, CPython implements this using things called
cells. We notice that
f.func_closure
gives us a tuple of cell
objects, and we can examine an
individual cell’s contents with .cell_contents
:
As expected, we can’t just modify it…
…because that would be too easy. So, how can we replace it? Well, we could
go back to memmove
, but there’s an easier way thanks to the ctypes
module
also exposing Python’s C API. Specifically, the
PyCell_Set
function
(which seems to lack a pure Python equivalent) does exactly what we want. Since
the function expects PyObject*
s as arguments, we’ll need to use
ctypes.py_object
as a wrapper. Here it is:
Perfect – the replacement worked. To tie it together with replace()
, we’ll
note that Guppy represents the cell contents relationship with
Based_R_INTERATTR
, for what I assume to be “internal attribute”. We can use
this to find the cell object within the inner function that references our
target object, and then use the method above to make the change:
Other cases
There are many, many more types of possible replacements. I’ve written a more
extensible version of replace()
with some test cases, which can be viewed
on Gist here.
Certainly, not every case is handled by it, but it seems to cover the majority
that I’ve found through testing. There are a number of reference relations in
Guppy that I couldn’t figure out how to replicate without doing something
insane (R_HASATTR
, R_CELL
, and R_STACK
), so some obscure replacements are
likely unimplemented.
Some other kinds of replacements are known, but impossible. For example,
replacing a class object that uses __slots__
with another class will not work
if the replacement class has a different slot layout and instances of the old
class exist. More generally, replacing a class with a non-class object won’t
work if instances of the class exist. Furthermore, references stored in data
structures managed by C extensions cannot be changed, since there’s no good way
for us to track these.
Footnotes
-
^ This post relies heavily on implementation details of CPython 2.7. While it could be adapted for Python 3 by examining changes to the internal structures of objects that we used above, that would be a lost cause if you wanted to replicate this on Jython or some other implementation. We are so dependent on concepts specific to CPython that you would need to start from scratch, beginning with a language-specific replacement for Guppy.
-
^ The DOT files used to generate graphs in this post are available on Gist.
-
^ They’re actually grouped together by clodo (“class or dict object”), which is similar to type, but groups
__dict__
s separately by their owner’s type. -
^ Python’s documentation tells us not to modify the locals dictionary, but screw that; we’re gonna do it anyway.
-
^ According to the C99 and C11 standards; section 6.2.5.27 in the former and 6.2.5.28 in the latter: “All pointers to structure types shall have the same representation and alignment requirements as each other.”