Tag - code

Entries feed

Thursday 11 August 2011

Low-level Python debugging with GDB

Introduction

Your Python program is crashing or unresponsive and you can't figure it out with traditional tools (printing, PDB) ? Then this tutorial might help you !

GDB is a very powerful debugging tool, but it's hardly intuitive to use and moreover it doesn't understand Python data structures. With no hacking it will only be able to print the system stack and addresses of some variables whereas what you need is the Python stack and Python local variables in a human readable form.

To start, install GDB and a debug build of your Python interpreter (python2.x-dbg on Debian-like systems).

This last file contains a few macros for GDB that will enable it to print Python locals and stack.

Test program

Let's pretend we don't know why this simple program doesn't stop and why it's unresponsive :

1
2
3
4
5
6
7
8
9
import time
 
def next(i):
    time.sleep(10)
    i = 1 - i
 
i = 1
while True:
    next(i)

Easy case

If your problem is easily reproducible, then you're in luck. Restart your script with the debug build of the Python interpreter and attach to it with GDB :

1
2
3
python2.7-dbg test.py &
[1] 7877
gdb -p 7877

At this point the Python interpreter has been interrupted and the script is paused so we can inspect it. First, we can see the Python stack of our script :

1
2
3
4
5
(gdb) py-bt
#5 Frame 0x242ae50, for file test.py, line 4, in next (i=1)
    time.sleep(10)
#8 Frame 0x2427e00, for file test.py, line 9, in <module> ()
    next(i)

To avoid all confusion : the most recent call comes first in this trace unlike when the backtrace is printed from Python. In GDB, the most recent call is called active or selected. We can print Python code and local variables in the selected frame :

1
2
3
4
5
6
7
8
9
10
11
12
(gdb) py-list 
   1    import time
   2    
   3    def next(i):
  >4        time.sleep(10)
   5        i = 1 - i
   6    
   7    i = 1
   8    while True:
   9        next(i)
(gdb) py-locals 
i = 1

With py-up and py-down macros we can change the selected Python frame. You must be aware that the process is still frozen and those commands don't actually do anything to the process.

1
2
3
4
5
6
7
8
9
10
(gdb) py-up
#8 Frame 0x2427e00, for file test.py, line 9, in <module> ()
    next(i)
(gdb) py-list 
   4        time.sleep(10)
   5        i = 1 - i
   6    
   7    i = 1
   8    while True:
  >9        next(i)

At this point GDB has the same behavior as PDB, it is good but not so helpful. If you were unable to figure it out with PDB, then you're probably dealing with some low-level problem in some Python internal, external lib or system call. To understand what happens you will need to explore the system stack :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
(gdb) backtrace
#0  0x00007f48671cddf3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000000000058b22c in floatsleep (secs=10) at ../Modules/timemodule.c:943
#2  0x000000000058a05f in time_sleep (self=0x0, args=(10,)) at ../Modules/timemodule.c:206
#3  0x000000000048867e in PyCFunction_Call (func=<built-in function sleep>, arg=(10,), kw=0x0)
    at ../Objects/methodobject.c:81
#4  0x0000000000525efe in call_function (pp_stack=0x7fff3ac2ca40, oparg=1) at ../Python/ceval.c:4013
#5  0x0000000000520f59 in PyEval_EvalFrameEx (f=Frame 0xe59e50, for file test.py, line 4, in next (i=1), throwflag=0)
    at ../Python/ceval.c:2666
#6  0x00000000005263c5 in fast_function (func=<function at remote 0xe074f8>, pp_stack=0x7fff3ac2cdc0, n=1, na=1, nk=0)
    at ../Python/ceval.c:4099
#7  0x00000000005260d5 in call_function (pp_stack=0x7fff3ac2cdc0, oparg=1) at ../Python/ceval.c:4034
#8  0x0000000000520f59 in PyEval_EvalFrameEx (f=Frame 0xe56e00, for file test.py, line 9, in <module> (), throwflag=0)
    at ../Python/ceval.c:2666
#9  0x0000000000523744 in PyEval_EvalCodeEx (co=0xdb90f0, globals=
    {'__builtins__': <module at remote 0x7f4868754470>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0xe0e678>, '__name__': '__main__', 'next': <function at remote 0xe074f8>, '__doc__': None}, locals=
    {'__builtins__': <module at remote 0x7f4868754470>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0xe0e678>, '__name__': '__main__', 'next': <function at remote 0xe074f8>, '__doc__': None}, args=0x0, 
    argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:3253
#10 0x0000000000519b86 in PyEval_EvalCode (co=0xdb90f0, globals=
    {'__builtins__': <module at remote 0x7f4868754470>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0xe0e678>, '__name__': '__main__', 'next': <function at remote 0xe074f8>, '__doc__': None}, locals=
[...]

We can now see that our program is stuck in a call to select(), in the libc (you might not actually see exactly where the last call was made unless you have a debug version of that external library). Now you should probably use GDB commands finish and return to see if the execution thread comes back into the Python interpreter. If not, it's probably a bug with an external library which should be reproducible outside of Python.

Hard case

You might not be able to trigger systematically the bug which may be happening like once a day on one of your production servers. In this case we absolutely need to perform the analysis right on the production server where you found the unresponsive process. As this process is running on an optimized and stripped version of the Python interpreter, the stack trace will give you very few info :

1
2
3
4
5
6
7
8
9
10
11
12
13
(gdb) bt
#0  0x00007f1b7f02cdf3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005048c2 in ?? ()
#2  0x00000000004b90a9 in PyEval_EvalFrameEx ()
#3  0x00000000004b9673 in PyEval_EvalFrameEx ()
#4  0x00000000004bf600 in PyEval_EvalCodeEx ()
#5  0x00000000004c0082 in PyEval_EvalCode ()
#6  0x00000000004df2d2 in ?? ()
#7  0x00000000004dfe64 in PyRun_FileExFlags ()
#8  0x00000000004e096e in PyRun_SimpleFileExFlags ()
#9  0x00000000004f09dd in Py_Main ()
#10 0x00007f1b7ef7cead in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x000000000041f0a1 in _start ()

Only public symbols of libpython are visible, we absolutely don't know where we are in the Python script and we have no idea of the Python stack. Let's install the debug version of Python, it will at least install GDB symbols for the Python interpreter :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
(gdb) bt
#0  0x00007f0b1773bdf3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005048c2 in floatsleep (self=<value optimized out>, args=<value optimized out>)
    at ../Modules/timemodule.c:943
#2  time_sleep (self=<value optimized out>, args=<value optimized out>) at ../Modules/timemodule.c:206
#3  0x00000000004b90a9 in call_function (f=<value optimized out>, throwflag=<value optimized out>)
    at ../Python/ceval.c:4013
#4  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at ../Python/ceval.c:2666
#5  0x00000000004b9673 in fast_function (f=<value optimized out>, throwflag=<value optimized out>)
    at ../Python/ceval.c:4099
#6  call_function (f=<value optimized out>, throwflag=<value optimized out>) at ../Python/ceval.c:4034
#7  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at ../Python/ceval.c:2666
#8  0x00000000004bf600 in PyEval_EvalCodeEx (co=0x7f0b18c0f8b0, globals=<value optimized out>, 
    locals=<value optimized out>, args=<value optimized out>, argcount=<value optimized out>, 
    kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:3253
#9  0x00000000004c0082 in PyEval_EvalCode (co=<value optimized out>, globals=<value optimized out>, 
    locals=<value optimized out>) at ../Python/ceval.c:667
#10 0x00000000004df2d2 in run_mod (mod=<value optimized out>, filename=<value optimized out>, globals=
    {'__builtins__': <module at remote 0x7f0b18c8cad0>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0x7f0b18bc7210>, '__name__': '__main__', 'next': <function at remote 0x7f0b18bc2578>, '__doc__': None}, locals=
    {'__builtins__': <module at remote 0x7f0b18c8cad0>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0x7f0b18bc7210>, '__name__': '__main__', 'next': <function at remote 0x7f0b18bc2578>, '__doc__': None}, flags=
<value optimized out>, arena=<value optimized out>) at ../Python/pythonrun.c:1346
#11 0x00000000004dfe64 in PyRun_FileExFlags (fp=0x2693690, filename=0x7ffffc54e67b "test.py", 
    start=<value optimized out>, globals=
    {'__builtins__': <module at remote 0x7f0b18c8cad0>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0x7f0b18bc7210>, '__name__': '__main__', 'next': <function at remote 0x7f0b18bc2578>, '__doc__': None}, locals=
    {'__builtins__': <module at remote 0x7f0b18c8cad0>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0x7f0b18bc7210>, '__name__': '__main__', 'next': <function at remote 0x7f0b18bc2578>, '__doc__': None}, closeit=1,
flags=0x7ffffc54ce30) at ../Python/pythonrun.c:1332
#12 0x00000000004e096e in PyRun_SimpleFileExFlags (fp=0x2693690, filename=<value optimized out>, closeit=1, flags=
    0x7ffffc54ce30) at ../Python/pythonrun.c:936
#13 0x00000000004f09dd in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at ../Modules/main.c:599
#14 0x00007f0b1768bead in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x000000000041f0a1 in _start ()

It is better, we now know the module and the file, but nothing about local variables or Python stack. Do not try to use py-* macros, they will not work as almost all Python internals are "optimized out", they will probably trigger a segmentation fault by trying to print Python objects with _PyObject_Dump.

The only chance you have to find exactly where the code is failing is by carefully inspecting all the internal Python variables, some of them are still usable and can be used to find out what's going on. For example :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
(gdb) select-frame 2
(gdb) info locals
secs = 10
(gdb) select-frame 3
(gdb) info locals
callargs = (10,)
flags = <value optimized out>
tstate = <value optimized out>
func = <built-in function sleep>
w = <value optimized out>
na = <value optimized out>
nk = <value optimized out>
n = <value optimized out>
pfunc = 0x26aace0
x = <value optimized out>

Frame 2 was a call in timemodule.c and showed us that the argument of the function call was 10 secs.

Frame 3 is in PyEval_EvalFrameEx() (main Python bytecode interpretation routine) it brings us back into the interpreter. Almost all local variables were optimized, func tells us that the function call was for the function sleep. Finally :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
(gdb) select-frame 4
(gdb) info locals 
sp = 0x26aace8
stack_pointer = <value optimized out>
next_instr = 0x7f0b18bc0760 "\001d\002"
opcode = <value optimized out>
oparg = <value optimized out>
why = WHY_NOT
err = 0
x = <value optimized out>
v = <value optimized out>
w = <value optimized out>
u = <value optimized out>
t = <value optimized out>
stream = 0x0
fastlocals = <value optimized out>
freevars = 0x26aace0
retval = <value optimized out>
tstate = 0x25d20a0
co = 0x7f0b18c0fa30
instr_ub = -1
instr_lb = 0
instr_prev = -1
first_instr = 0x7f0b18bc0754 "t"
names = ('time', 'sleep')
consts = (None, 10, 1)
(gdb) p tstate->frame->f_globals 
$5 = 
    {'__builtins__': <module at remote 0x7f0b18c8cad0>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0x7f0b18bc7210>, '__name__': '__main__', 'next': 
<function at remote 0x7f0b18bc2578>, '__doc__': None}
(gdb) p tstate->frame->f_lineno 
$6 = 3

Here we go test.py line 3, i = 1, function call to time.sleep(10) !

Insanely hard case

If this those steps are still insufficient, you might try to set breakpoints on call_function() and let the script run a little bit with continue or step.

The final and ultimate solution is to run PyEval_EvalFrameEx() step by step. Grab the source of CPython and go to the Python directory before launching GDB (it must be the source of that exact same version of the Python interpreter that runs your script) :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
(gdb) finish
Run till exit from #0  0x00007fc13a60adf3 in select () from /lib/x86_64-linux-gnu/libc.so.6
0x00000000004dc17b in floatsleep (self=<value optimized out>, args=<value optimized out>)
    at ../Modules/timemodule.c:914
914     ../Modules/timemodule.c: No such file or directory.
        in ../Modules/timemodule.c
(gdb) next
time_sleep (self=<value optimized out>, args=<value optimized out>) at ../Modules/timemodule.c:206
206     in ../Modules/timemodule.c
(gdb) cd python2.7-2.7.2/Python/
Working directory /home/grapsus/ludia/python2.7-2.7.2/Python.
(gdb) next
208         Py_INCREF(Py_None);
(gdb) next
209         return Py_None;
(gdb) next
210     }

Notice that it doesn't work until you cd to the Python directory of CPython source tree. Same thing if you want to debug step by step some Python module, like gevent, you will need the source code of the very same version that's running the script.

It is very time-consuming and you'll probably need a Python bytecode reference to follow what's going on but you'll eventually find the issue.

Conclusion

Even with a strongly optimized and stripped Python interpreter it is possible to debug or at least analyze a buggy Python script.

References

Thursday 13 January 2011

Hexadecimal dump in C

It can be very handy, in your C programs, to be able to print raw data in a UNIX xxd fashion. Here is a short function to do that :

hexdump.c :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
#include <stdio.h>
#include <ctype.h>
 
#ifndef HEXDUMP_COLS
#define HEXDUMP_COLS 8
#endif
 
void hexdump(void *mem, unsigned int len)
{
        unsigned int i, j;
        
        for(i = 0; i < len + ((len % HEXDUMP_COLS) ? (HEXDUMP_COLS - len % HEXDUMP_COLS) : 0); i++)
        {
                /* print offset */
                if(i % HEXDUMP_COLS == 0)
                {
                        printf("0x%06x: ", i);
                }
 
                /* print hex data */
                if(i < len)
                {
                        printf("%02x ", 0xFF & ((char*)mem)[i]);
                }
                else /* end of block, just aligning for ASCII dump */
                {
                        printf("   ");
                }
                
                /* print ASCII dump */
                if(i % HEXDUMP_COLS == (HEXDUMP_COLS - 1))
                {
                        for(j = i - (HEXDUMP_COLS - 1); j <= i; j++)
                        {
                                if(j >= len) /* end of block, not really printing */
                                {
                                        putchar(' ');
                                }
                                else if(isprint(((char*)mem)[j])) /* printable char */
                                {
                                        putchar(0xFF & ((char*)mem)[j]);        
                                }
                                else /* other char */
                                {
                                        putchar('.');
                                }
                        }
                        putchar('\n');
                }
        }
}
 
#ifdef HEXDUMP_TEST
int main(int argc, char *argv[])
{
        hexdump(argv[0], 20);
 
        return 0;
}
#endif

hexdump.h

1
2
3
4
5
6
#ifndef _HEXDUMP_H
#define _HEXDUMP_H
 
void hexdump(void *mem, unsigned int len);
 
#endif

It can be tested as a standalone program (which dumps 20 bytes of memory strating at *argv) :

cc --ansi -DHEXDUMP_TEST -o hexdump hexdump.c
./hexdump
0x000000: 2e 2f 68 65 78 64 75 6d ./hexdum
0x000008: 70 00 53 53 48 5f 41 47 p.SSH_AG
0x000010: 45 4e 54 5f             ENT_

Sunday 10 January 2010

Modifier un paquet Debian rapidement (méthode sale)

Parfois on veut modifier juste un petit bout de fichier de configuration dans un paquet Debian. L'utilisation du classique apt-get source, apt-get build-dep et dpkg-buildpackage est alors très longue et lourde, sans parler de tous les paquets, nécessaires à la recompilation, à installer.

Voici deux scripts de mon cru qui permettent de décompresser un paquet binaire, faire les changements nécessaires et reconstruire le paquet. C'est une méthode qui est sale dans la mesure où elle ne respecte pas les procédures de construction de paquets chez Debian, mais elle est très rapide lorsqu'il faut déployer rapidement une configuration personnalisée.

deb-dirty-extract.sh :

1
2
3
4
5
6
7
8
9
10
11
12
#!/bin/bash
 
if [ ! -f "$1" ] ; then
  echo "No such file"
fi
 
DIR=${1%%.deb}
mkdir "$DIR" && cd "$DIR"
ar x "../$1"
mkdir data control
cd data && tar -zxf ../data.tar.gz && cd ..
cd control && tar -zxf ../control.tar.gz

deb-dirty-rebuild.sh :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
#!/bin/bash
 
if [ ! -f 'debian-binary' -o ! -d 'control' -o ! -d 'data' ] ; then
  echo "No extracted deb file here"
  exit 1
fi
 
if [ -z "$1" -o ! -f "$1" ] ; then
  PKG="../"$(basename $PWD)"-custom.deb"
else
  PKG="$1"
fi
echo "Writing new package to $PKG..."
 
rm -f control.tar.gz data.tar.gz "$PKG"
 
cd data && tar -zcf ../data.tar.gz ./ && cd ..
cd control && tar -zcf ../control.tar.gz ./ && cd ..
ar rc "$PKG" debian-binary control.tar.gz data.tar.gz

Utilisation typique :

wget http://.../.../paquet.deb
deb-dirty-extract.sh paquet.deb
cd paquet
[...] # vos modifications
deb-dirty-rebuild.sh

Wednesday 6 January 2010

Mes codes pour les qualifications de Prologin 2010

Cette année, je participe à nouveau au concours d'informatique Prologin. Voici mes codes pour les qualifications (terminée le 5 janvier).

Également disponible sur mon dépôt SVN :
http://grapsus.net/svn/prog/trunk/prq10/

Continue reading...

Friday 1 January 2010

Nagios et les locales des plugins

Nagios a plein de fonctionnalités intéressantes, mais du côté de la localisation et des encodages de caractères c'est très mal géré. Par exemple on ne peut pas choisir l'encodage des pages web de l'interface (à moins de faire des bidouilles du côté du serveur HTTP), les caractères accentués sont supposés être en ISO-8859 et sont automatiquement convertis en codes HTML ce qui rend impossible leur affichage correct sur un système fonctionnant en UTF-8. Encore, ces problèmes n'affectent pas les fonctionnalités du logiciel. Par contre les plugins s'amusent à exécuter des commandes sans se soucier de la locale utilisée, si bien que la sortie dépend de la langue du système ! À partir de là, les pages de résumé ne ressemblent à rien, Nagiosgrapher marche une fois sur deux etc. Je n'ai pas trouvé comment remédier à ce problème sur le net; dans les rapports de bugs Debian correspondants, on va jusqu'à dire que c'est normal. J'ai essayé de changer la locale de Nagios dans son script init, mais celui-ci se fork et certains résultats sont en anglais alors que d'autres restent en français de manière complètement aléatoire.

Voici ma solution définitive à ce problème : recompiler Nagios avec la modification suivante dans le fichier base/checks.c :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
350,352d349
<       
<       char *current_locale;
<       char previous_locale[40];
762,768d758 
<                       /* Grapsus : run plugins with no locale */
<                       if(current_locale = getenv("LC_ALL"))
<                       {
<                               strncpy(previous_locale, current_locale, 40);
<                               previous_locale[39] = 0;
<                       }
<                       setenv("LC_ALL", "C", 1);
770,774d759 
<                       if(current_locale)
<                               setenv("LC_ALL", previous_locale, 1);
<                       else
<                               unsetenv("LC_ALL");
<                       /* End grapsus run plugins with no locale */

Avec ce patch, Nagios va se mettre en locale par défaut avant d'exécuter un plugin, de sorte que les résultats seront homogènes et facilement lisibles avec Nagiosgrapher.

Voici comment obtenir le code source sous Debian :

apt-get source nagios3
apt-get build-dep nagios3

Une fois le patch appliqué, on reconstruit le paquet avec :

dpkg-buildpackage -uc -us

- page 1 of 2