Recently I found myself, for my own obscure reasons, wanting to know (at runtime, by introspection) which variables a piece of Python code read.
At first glance, this looked fairly straightforward. Python’s built-in eval() function takes three parameters: the code to run (an expression string or a code object), and two dictionaries for the global and local variables. Since you can subclass a dictionary and override its getters, it seemed trivial to monitor what was read, using some variation on a theme of:
class MonitorDict(dict):
    def __init__(self, *prm, **kw):
        dict.__init__(self, *prm, **kw)
        self.accessed = set()
    def __getitem__(self, name):
        self.accessed.add(name)
        return dict.__getitem__(self, name)
(Depending on the detail needed, instead of returning the dictionary contents directly, you might want to wrap them in ‘proxy objects’ first, that had similar ‘instrumentation’ to monitor what was read from those, in turn.)
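To make the locals case concrete, here is a minimal sketch of the idea in use. It is written for Python 3 (hence super() rather than the explicit dict.__init__ call), and the expression and the variable names a, b and unused are invented purely for illustration.

```python
class MonitorDict(dict):
    """A dict that remembers every key looked up through []."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.accessed = set()

    def __getitem__(self, name):
        # Record the lookup before delegating to the real dict.
        self.accessed.add(name)
        return super().__getitem__(name)


# Hypothetical local variables; 'unused' is never read by the expression.
local_vars = MonitorDict(a=1, b=2, unused=3)

# A plain dict for globals, the monitoring dict for locals.
result = eval("a + b", {}, local_vars)

print(result)
print(sorted(local_vars.accessed))
```

Only the names the expression actually looked up end up in accessed, so unused is left out.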
For locals, this works a treat. Unfortunately, for globals, it does not. As the eval documentation says:
The globals must be a dictionary and locals can be any mapping
We can’t pass our magic dictionary in to wrap the global variables, like we can with the locals. Why eval insists on a vanilla dictionary, I’m not sure – it’s quite unPythonic; duck typing is the norm – but I guess the CPython implementation depends on it for some reason. I expect it’s because reading the dictionary directly, instead of checking for subclassed methods, gives some performance advantage.
So I needed some darker, filthier magic.
The answers we seek are contained in the code, so what if we examine it directly? Python source is compiled to bytecode which you can read from within Python itself easily – it’s stored in the co_code member of a code object.
(A “code object”, by the way, is what you get if you call the built-in compile() function on some source code. There’s also one for every function (except built-ins/extension modules), in the function’s func_code member. Remember, in Python, everything is an object, even functions – with constructors and members and all that implies.)
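As a quick illustration of both routes to a code object, here is a small sketch. It assumes Python 3, where a function's code object lives in __code__ rather than func_code; the function name show_x and the filename "<example>" are made up for the demonstration.

```python
# Route 1: compile some source text yourself.
code_from_source = compile("x + 1", "<example>", "eval")


# Route 2: every pure-Python function carries one around.
def show_x():
    return x  # x is never defined; we only inspect the code, we never run it


code_from_function = show_x.__code__

print(type(code_from_source).__name__)
print(code_from_source.co_names)     # names the compiled code refers to
print(code_from_function.co_names)
print(type(code_from_function.co_code).__name__)  # the raw bytecode
```

In Python 3 the raw bytecode in co_code is a bytes object rather than a str, but it is just as cryptic when printed.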
If you print it out, though, it’s pretty cryptic:
>>> x = 'hi'
>>> def test():
...     print x
>>> test.func_code.co_code
't\x00\x00GHd\x00\x00S'
Fortunately, Python has a disassembler in the standard library: dis. Although it’s not as Stygian as its name would suggest, it’s slightly danker than we’d like because it only outputs in human-readable text:
>>> import dis
>>> dis.dis(test)
  2           0 LOAD_GLOBAL              0 (x)
              3 PRINT_ITEM
              4 PRINT_NEWLINE
              5 LOAD_CONST               0 (None)
              8 RETURN_VALUE
While this is normally what people want, we’re trying to examine code with code. At least that LOAD_GLOBAL looks kinda promising, right?
dis is itself written in Python, so it’s easy enough to make a new version that returns something machine-readable instead. I’m not going to walk you through it here – it’s mostly just a bunch of cut-and-paste – but I’ve put a copy online in the new kibble section, a sort of overflow bucket for the odds and ends that have nowhere else sensible to go.
It’s called discode.py, and contains the new disassemble function. It returns a list containing a tuple for each bytecode instruction, consisting of the opcode name, the kind of argument it takes (local, global, constant, None, etc.) and the value of the argument (or None).
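To give a feel for that return value, here is a rough approximation for Python 3.4 or later. It is not the code from discode.py: it leans on the standard library's dis.get_instructions() instead of a modified copy of dis, and the "kind" labels are illustrative guesses at the categories rather than the real ones.

```python
import dis


def disassemble(code):
    """Return a list of (opcode name, argument kind, argument value) tuples."""
    # The dis module publishes lists of opcodes grouped by argument type.
    # The labels on the right are invented for this sketch.
    kinds = [
        (dis.haslocal, "local"),
        (dis.hasname, "name"),
        (dis.hasconst, "constant"),
        (dis.hasfree, "free"),
    ]
    result = []
    for ins in dis.get_instructions(code):
        kind = None
        for opcodes, label in kinds:
            if ins.opcode in opcodes:
                kind = label
                break
        # argval is already None for instructions that take no argument.
        result.append((ins.opname, kind, ins.argval))
    return result


def sample():
    return x  # never called, only disassembled


for entry in disassemble(sample.__code__):
    print(entry)
```

The exact instructions printed vary between Python versions, but a LOAD_GLOBAL entry for x should be among them.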
And, by way of example, there’s also a function that plucks out the names of the globals that a piece of code reads (even inside lambda functions it defines), returning them in a set. It simply compares the opcode name to LOAD_GLOBAL as seen above, and also checks for code objects being loaded, and recurses on them.
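For readers on Python 3.4 or later, the same scan can be sketched without discode.py, because dis.get_instructions() already yields machine-readable instructions. The version below is only an approximation of the approach described above; read_globals, sample, scale and offset are names invented for the example, and the real function may well differ.

```python
import dis
import types


def read_globals(code):
    """Collect the names loaded with LOAD_GLOBAL in a code object."""
    names = set()
    for ins in dis.get_instructions(code):
        if ins.opname == "LOAD_GLOBAL":
            names.add(ins.argval)
    # Nested functions and lambdas are stored as constants of the
    # enclosing code object, so recurse into any code objects found there.
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            names |= read_globals(const)
    return names


def sample():
    helper = lambda n: n * scale  # 'scale' is read inside the lambda
    return helper(offset)         # 'offset' is read by sample itself


print(sorted(read_globals(sample.__code__)))
```

Two caveats for this sketch: built-ins such as len are also fetched with LOAD_GLOBAL, so they show up in the set too, and module-level code produced by compile() generally uses LOAD_NAME instead, which this function does not look at.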
I imagine there’s quite a lot else you could discover by scanning through the output of discode.disassemble() and analysing the results.