Python Core Files

by Robert Dec 15, 2019

If you have a history of debugging core files when something goes wrong in a natively compiled application, then you may have run into the same sort of situation I have when writing software in Python. Specifically, what happens when a long-running program, a REST API server for example, that you've shipped out to your customers goes wrong? Hopefully you remembered to catch exceptions and at least log the error message. But what if you didn't log enough information to figure out what was going on at the time?

That's where a core file can come in handy. You can set up your application so that when it returns a 500, in the case of a REST server for example, it also takes a snapshot of the current state of things. This way, your customer can attach that core file to their bug report, and you can examine the call stack along with the local and global variables at the time of the failure. That should give you everything you need to reproduce and resolve the issue quickly, which leads to happier managers and happier customers.

OK, So How?

Good question. Saving the state of everything in a traceback can be tricky. If we go the route of trying to pickle the traceback object and write it to a file, we quickly run into trouble:

>>> import pickle
>>> import sys
>>> try:
...     print(a)
... except Exception as e:
...     with open('core_file', 'wb') as core_file:
...         pickle.dump(sys.exc_info()[2], core_file)
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
TypeError: can't pickle traceback objects
>>>

So, if we can't pickle traceback objects, now what? There are several libraries out there that can help us add this functionality to our application. The one I've found that worked best and saved all the data I wanted is the pm.py (Python Post Mortem) library. Here's a toy example of an application that uses the library to dump a core file:

import sys
from pypm import dump, freeze_traceback

def dump_core(type, value, traceback):
    filename = 'python_core.tb'
    print(f'Uncaught Exception! Dumping core to {filename}')
    with open(filename, 'wb') as core_file:
        dump(freeze_traceback(tb=traceback), core_file)
    print('Core dump complete')

sys.excepthook = dump_core

def main():
    adam = 3
    beta = 4
    theta = lambda x: x**2
    delta = range(10)
    epsilon = {'a': [1,2,3,4]}
    if len(sys.argv) > 1 and sys.argv[1] == 'boom':
        raise RuntimeError('boom')
    print('Program completed successfully')

if __name__ == '__main__':
    sys.exit(main())

Assigning our function to sys.excepthook hooks into the Python runtime and tells it to call our custom function any time there is an unhandled exception. We could also call the function ourselves at any point if we chose to (say, in the case of a 500 like I mentioned before; a sketch of that follows the output below). Once it is called, it uses the pm.py library to dump the traceback object. If you look into the source code of the library, you can see that it does some trickery to build a "fake" traceback object that stores all of the important information from each frame while remaining picklable. Running this program produces this output:

$ python buggy.py 
Program completed successfully
$ python buggy.py boom
Uncaught Exception! Dumping core to python_core.tb
Core dump complete
$
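
As I said, you don't have to wait for an uncaught exception; you can call a dump yourself, for example right before a REST server answers with a 500. Here's a rough sketch of what that might look like in a Flask app (Flask itself, the route, and the filename are my own assumptions; the dump and freeze_traceback calls are the same ones used above):

from flask import Flask
from pypm import dump, freeze_traceback

app = Flask(__name__)

@app.errorhandler(Exception)
def dump_core_on_500(error):
    # Snapshot the traceback of the in-flight exception before answering with a 500.
    with open('python_core.tb', 'wb') as core_file:
        dump(freeze_traceback(tb=error.__traceback__), core_file)
    return 'Internal Server Error', 500

@app.route('/explode')
def explode():
    # Any unhandled error in a request now leaves a core file behind.
    raise RuntimeError('boom')

if __name__ == '__main__':
    app.run()

The handler still returns a proper 500 to the client, but now there's a core file waiting for you when the bug report arrives.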

Debugging Using the Core

Now that we have a core file, the next thing we'll want to do is open it up and see what the state of things was when we dumped it. This can be done by using the library to reverse the process:

$ python -m pypm python_core.tb 
> /python_core_files/buggy.py(20)main()
-> raise RuntimeError('boom')
(Pdb) bt
  /python_core_files/buggy.py(24)<module>()
-> sys.exit(main())
> /python_core_files/buggy.py(20)main()
-> raise RuntimeError('boom')
(Pdb) locals()
{'adam': 3, 'beta': 4, 'theta': <function <lambda> at 0x7fc0ddd45048>, 'delta': range(0, 10), 'epsilon': {'a': [1, 2, 3, 4]}}
(Pdb) adam
3
(Pdb) theta
<function <lambda> at 0x7fc0ddd45048>
(Pdb) theta(5)
25
(Pdb) delta[6]
6
(Pdb) l
 15         beta = 4
 16         theta = lambda x: x**2
 17         delta = range(10)
 18         epsilon = {'a': [1,2,3,4]}
 19         if len(sys.argv) > 1 and sys.argv[1] == 'boom':
 20  ->         raise RuntimeError('boom')
 21         print('Program completed successfully')
 22  
 23     if __name__ == '__main__':
 24         sys.exit(main())
[EOF]
(Pdb) sys.argv
['/python_core_files/pypm/pypm/__main__.py', 'python_core.tb']
(Pdb)

You can see that for almost everything, the information from the original execution of the program is available inside our debugger. One thing that didn't survive was sys.argv: instead of the value from the original execution, I got the value from the current debugging process. So keep in mind that there are some limitations.
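
If you do need process-level state like that, one workaround (just a sketch; the sidecar filename and the exact fields are my own choices) is to write it next to the core file at dump time, since it isn't stored in any frame of the traceback:

import json
import sys

def dump_process_state(filename='python_core.meta.json'):
    # Capture interpreter-level state that lives outside the traceback's frames.
    state = {
        'argv': sys.argv,
        'executable': sys.executable,
        'version': sys.version,
    }
    with open(filename, 'w') as meta_file:
        json.dump(state, meta_file, indent=2)

Calling this from dump_core() right after the traceback is written means the JSON sitting next to the core file will tell you the command line the customer actually ran.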

Conclusion

Whenever there is a problem with an application a customer is running, it's never good to be unable to answer why it failed and what should be done to fix it. The more time it takes to answer that question, the less happy everyone will be. Things you didn't expect will go wrong, so make sure to leave yourself ways to get all the information you need out of your program. Logging, notifications, core files: all of these need to be considered ahead of time so that when the day comes, you can solve the problem quickly.
