Thursday 2 August 2012

Fixing Xen "RTNETLINK answers: File exists" issue with ifup on Debian when using network-bridge and static IP address

I recently spent a while trying to make my Xen bridged setup work on Debian with a static IP address so I thought I would share the solution to my problem.

I had the following /etc/network/interfaces

1
2
3
allow-hotplug eth0
iface eth0 inet static
  address 192.168.1.254

Xen was configured with (network-script 'network-bridge netdev=eth0 bridge=eth0') in /etc/xen/xend-config.sxp which renames eth0 to peth0 and bridges peth0 and vif1.0 as eth0.

$ brctl show eth0
bridge name     bridge id               STP enabled     interfaces
eth0            8000.80c16ee70a98       no              peth0
                                                        vif1.0



The issue I encountered was that after boot eth0 was configured with the right address but was down. ifup eth0 yielded

RTNETLINK answers: File exists
Failed to bring up eth0.

Running ifup -v eth0 showed that the following command was being run:

ip addr add 192.168.1.254/255.255.255.0 broadcast +       dev eth0 label eth0

Since eth0 had already been assigned an address (probably by Xen's network script), this command would fail.

In order to keep using (network-script network-bridge), I just changed /etc/network/interfaces to the following:

1
2
3
iface eth0 inet static
  address 192.168.1.254
  pre-up ip addr del 192.168.1.254/24 dev eth0 2> /dev/null || true

This is probably not the best solution since Xen's doc indicates that "It is normally much better to create the bridge yourself in /etc/network/interfaces." but this was the modification to keep most of my configuration untouched.

Tuesday 27 September 2011

Nagios plugin to check SSL certificates

Here a Nagios plugin I wrote for checking SSL certificates for expiry. You can set it up for emitting a WARNING or a CRITICAL state N days before expiry.

/root/bin/nagios-check-crt.sh :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
#!/bin/bash
 
# Written by Alexis Bezverkhyy <alexis@grapsus.net> in september 2011
# This is free and unencumbered software released into the public domain.
# For more information, please refer to <http://unlicense.org/>
 
function PRINT_USAGE(){
  echo "This Nagios plugin checks SSL certificates for expiration :
  -c HOST:PORT host and port to connect
  -d DAYS  minimum days before expiry, otherwise a WARNING is issued
  -D DAYS  minimum days before expiry, otherwise a CRITICAL is issued
  -h    prints out this help"
  exit 0
}
 
CONNECT='';WDAYS=0;CDAYS=0;
declare -i CDAYS 
declare -i WDAYS
while true ; do
  getopts 'c:d:D:h' OPT 
  if [ "$OPT" = '?' ] ; then break; fi; 
  case "$OPT" in
    "c") CONNECT="$OPTARG";;
    "d") WDAYS="$OPTARG";;
    "D") CDAYS="$OPTARG";;
    "h") PRINT_USAGE;;
  esac
done
 
if [ -z "$CONNECT" -o '(' "$WDAYS" = '0' -a "$CDAYS" = '0' ')' ] ; then
  PRINT_USAGE
fi
 
function get_crt_expiry
{
        # connect to host with OpenSSL client, filter CRT, parse CRT,
        # get expiry time, convert to traditionnal y-m-d h:s
        echo -n '' | openssl s_client -connect "$1" 2>/dev/null \
                | awk 'BEGIN { p = 0 }
                                         /BEGIN CERT/ { p = 1 }
                                         { if (p) print $0 }
                                         /END CERT/ { p = 0 }' \
                | openssl asn1parse 2>/dev/null \
                | grep 'UTCTIME' \
                | awk '{ print $7 }' \
                | tr -d 'Z:' \
                | tail -n 1 \
                | sed -r 's/^(..)(..)(..)(..)(..).*$/\1-\2-\3 \4:\5/'
}
 
EXPIRY=$(get_crt_expiry "$CONNECT")
if [ -z "$EXPIRY" ] ; then
        echo "WARNING - cannot get expiry date for $CONNECT"
        exit 1
fi
EPOCH_EXPIRY=$(date -d "$EXPIRY" +%s)
EPOCH_NOW=$(date +%s)
let "REM_DAYS = (EPOCH_EXPIRY - EPOCH_NOW)/(24*3600)"
 
if [ "$CDAYS" -gt 0 -a "$REM_DAYS" -lt "$CDAYS" ] ; then
  echo "CRITICAL - $CONNECT crt expries on $EXPIRY ($REM_DAYS days left)" 
        exit 2
fi
 
if [ "$WDAYS" -gt 0 -a "$REM_DAYS" -lt "$WDAYS" ] ; then
  echo "WARNING - $CONNECT crt expries on $EXPIRY ($REM_DAYS days left)" 
        exit 1
fi
  
echo "OK - $CONNECT crt expries on $EXPIRY ($REM_DAYS days left)"

Here's the configuration to check a simple HTTPS service.

commands.cfg :

1
2
3
4
define command {
  command_name check_crt
  command_line /root/bin/nagios-check-crt.sh -c $ARG1$ -d $ARG2$ -D $ARG3$
}

myhost.cfg :

1
2
3
4
5
6
define service {
        use             generic-service
        host_name  myhost
        service_description HTTPS-CRT
        check_command check_crt!myhost.com:443!60!30
}

Thursday 11 August 2011

Low-level Python debugging with GDB

Introduction

Your Python program is crashing or unresponsive and you can't figure it out with traditional tools (printing, PDB) ? Then this tutorial might help you !

GDB is a very powerful debugging tool, but it's hardly intuitive to use and moreover it doesn't understand Python data structures. With no hacking it will only be able to print the system stack and addresses of some variables whereas what you need is the Python stack and Python local variables in a human readable form.

To start, install GDB and a debug build of your Python interpreter (python2.x-dbg on Debian-like systems).

This last file contains a few macros for GDB that will enable it to print Python locals and stack.

Test program

Let's pretend we don't know why this simple program doesn't stop and why it's unresponsive :

1
2
3
4
5
6
7
8
9
import time
 
def next(i):
    time.sleep(10)
    i = 1 - i
 
i = 1
while True:
    next(i)

Easy case

If your problem is easily reproducible, then you're in luck. Restart your script with the debug build of the Python interpreter and attach to it with GDB :

1
2
3
python2.7-dbg test.py &
[1] 7877
gdb -p 7877

At this point the Python interpreter has been interrupted and the script is paused so we can inspect it. First, we can see the Python stack of our script :

1
2
3
4
5
(gdb) py-bt
#5 Frame 0x242ae50, for file test.py, line 4, in next (i=1)
    time.sleep(10)
#8 Frame 0x2427e00, for file test.py, line 9, in <module> ()
    next(i)

To avoid all confusion : the most recent call comes first in this trace unlike when the backtrace is printed from Python. In GDB, the most recent call is called active or selected. We can print Python code and local variables in the selected frame :

1
2
3
4
5
6
7
8
9
10
11
12
(gdb) py-list 
   1    import time
   2    
   3    def next(i):
  >4        time.sleep(10)
   5        i = 1 - i
   6    
   7    i = 1
   8    while True:
   9        next(i)
(gdb) py-locals 
i = 1

With py-up and py-down macros we can change the selected Python frame. You must be aware that the process is still frozen and those commands don't actually do anything to the process.

1
2
3
4
5
6
7
8
9
10
(gdb) py-up
#8 Frame 0x2427e00, for file test.py, line 9, in <module> ()
    next(i)
(gdb) py-list 
   4        time.sleep(10)
   5        i = 1 - i
   6    
   7    i = 1
   8    while True:
  >9        next(i)

At this point GDB has the same behavior as PDB, it is good but not so helpful. If you were unable to figure it out with PDB, then you're probably dealing with some low-level problem in some Python internal, external lib or system call. To understand what happens you will need to explore the system stack :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
(gdb) backtrace
#0  0x00007f48671cddf3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x000000000058b22c in floatsleep (secs=10) at ../Modules/timemodule.c:943
#2  0x000000000058a05f in time_sleep (self=0x0, args=(10,)) at ../Modules/timemodule.c:206
#3  0x000000000048867e in PyCFunction_Call (func=<built-in function sleep>, arg=(10,), kw=0x0)
    at ../Objects/methodobject.c:81
#4  0x0000000000525efe in call_function (pp_stack=0x7fff3ac2ca40, oparg=1) at ../Python/ceval.c:4013
#5  0x0000000000520f59 in PyEval_EvalFrameEx (f=Frame 0xe59e50, for file test.py, line 4, in next (i=1), throwflag=0)
    at ../Python/ceval.c:2666
#6  0x00000000005263c5 in fast_function (func=<function at remote 0xe074f8>, pp_stack=0x7fff3ac2cdc0, n=1, na=1, nk=0)
    at ../Python/ceval.c:4099
#7  0x00000000005260d5 in call_function (pp_stack=0x7fff3ac2cdc0, oparg=1) at ../Python/ceval.c:4034
#8  0x0000000000520f59 in PyEval_EvalFrameEx (f=Frame 0xe56e00, for file test.py, line 9, in <module> (), throwflag=0)
    at ../Python/ceval.c:2666
#9  0x0000000000523744 in PyEval_EvalCodeEx (co=0xdb90f0, globals=
    {'__builtins__': <module at remote 0x7f4868754470>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0xe0e678>, '__name__': '__main__', 'next': <function at remote 0xe074f8>, '__doc__': None}, locals=
    {'__builtins__': <module at remote 0x7f4868754470>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0xe0e678>, '__name__': '__main__', 'next': <function at remote 0xe074f8>, '__doc__': None}, args=0x0, 
    argcount=0, kws=0x0, kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:3253
#10 0x0000000000519b86 in PyEval_EvalCode (co=0xdb90f0, globals=
    {'__builtins__': <module at remote 0x7f4868754470>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0xe0e678>, '__name__': '__main__', 'next': <function at remote 0xe074f8>, '__doc__': None}, locals=
[...]

We can now see that our program is stuck in a call to select(), in the libc (you might not actually see exactly where the last call was made unless you have a debug version of that external library). Now you should probably use GDB commands finish and return to see if the execution thread comes back into the Python interpreter. If not, it's probably a bug with an external library which should be reproducible outside of Python.

Hard case

You might not be able to trigger systematically the bug which may be happening like once a day on one of your production servers. In this case we absolutely need to perform the analysis right on the production server where you found the unresponsive process. As this process is running on an optimized and stripped version of the Python interpreter, the stack trace will give you very few info :

1
2
3
4
5
6
7
8
9
10
11
12
13
(gdb) bt
#0  0x00007f1b7f02cdf3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005048c2 in ?? ()
#2  0x00000000004b90a9 in PyEval_EvalFrameEx ()
#3  0x00000000004b9673 in PyEval_EvalFrameEx ()
#4  0x00000000004bf600 in PyEval_EvalCodeEx ()
#5  0x00000000004c0082 in PyEval_EvalCode ()
#6  0x00000000004df2d2 in ?? ()
#7  0x00000000004dfe64 in PyRun_FileExFlags ()
#8  0x00000000004e096e in PyRun_SimpleFileExFlags ()
#9  0x00000000004f09dd in Py_Main ()
#10 0x00007f1b7ef7cead in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x000000000041f0a1 in _start ()

Only public symbols of libpython are visible, we absolutely don't know where we are in the Python script and we have no idea of the Python stack. Let's install the debug version of Python, it will at least install GDB symbols for the Python interpreter :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
(gdb) bt
#0  0x00007f0b1773bdf3 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00000000005048c2 in floatsleep (self=<value optimized out>, args=<value optimized out>)
    at ../Modules/timemodule.c:943
#2  time_sleep (self=<value optimized out>, args=<value optimized out>) at ../Modules/timemodule.c:206
#3  0x00000000004b90a9 in call_function (f=<value optimized out>, throwflag=<value optimized out>)
    at ../Python/ceval.c:4013
#4  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at ../Python/ceval.c:2666
#5  0x00000000004b9673 in fast_function (f=<value optimized out>, throwflag=<value optimized out>)
    at ../Python/ceval.c:4099
#6  call_function (f=<value optimized out>, throwflag=<value optimized out>) at ../Python/ceval.c:4034
#7  PyEval_EvalFrameEx (f=<value optimized out>, throwflag=<value optimized out>) at ../Python/ceval.c:2666
#8  0x00000000004bf600 in PyEval_EvalCodeEx (co=0x7f0b18c0f8b0, globals=<value optimized out>, 
    locals=<value optimized out>, args=<value optimized out>, argcount=<value optimized out>, 
    kws=<value optimized out>, kwcount=0, defs=0x0, defcount=0, closure=0x0) at ../Python/ceval.c:3253
#9  0x00000000004c0082 in PyEval_EvalCode (co=<value optimized out>, globals=<value optimized out>, 
    locals=<value optimized out>) at ../Python/ceval.c:667
#10 0x00000000004df2d2 in run_mod (mod=<value optimized out>, filename=<value optimized out>, globals=
    {'__builtins__': <module at remote 0x7f0b18c8cad0>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0x7f0b18bc7210>, '__name__': '__main__', 'next': <function at remote 0x7f0b18bc2578>, '__doc__': None}, locals=
    {'__builtins__': <module at remote 0x7f0b18c8cad0>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0x7f0b18bc7210>, '__name__': '__main__', 'next': <function at remote 0x7f0b18bc2578>, '__doc__': None}, flags=
<value optimized out>, arena=<value optimized out>) at ../Python/pythonrun.c:1346
#11 0x00000000004dfe64 in PyRun_FileExFlags (fp=0x2693690, filename=0x7ffffc54e67b "test.py", 
    start=<value optimized out>, globals=
    {'__builtins__': <module at remote 0x7f0b18c8cad0>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0x7f0b18bc7210>, '__name__': '__main__', 'next': <function at remote 0x7f0b18bc2578>, '__doc__': None}, locals=
    {'__builtins__': <module at remote 0x7f0b18c8cad0>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0x7f0b18bc7210>, '__name__': '__main__', 'next': <function at remote 0x7f0b18bc2578>, '__doc__': None}, closeit=1,
flags=0x7ffffc54ce30) at ../Python/pythonrun.c:1332
#12 0x00000000004e096e in PyRun_SimpleFileExFlags (fp=0x2693690, filename=<value optimized out>, closeit=1, flags=
    0x7ffffc54ce30) at ../Python/pythonrun.c:936
#13 0x00000000004f09dd in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at ../Modules/main.c:599
#14 0x00007f0b1768bead in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#15 0x000000000041f0a1 in _start ()

It is better, we now know the module and the file, but nothing about local variables or Python stack. Do not try to use py-* macros, they will not work as almost all Python internals are "optimized out", they will probably trigger a segmentation fault by trying to print Python objects with _PyObject_Dump.

The only chance you have to find exactly where the code is failing is by carefully inspecting all the internal Python variables, some of them are still usable and can be used to find out what's going on. For example :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
(gdb) select-frame 2
(gdb) info locals
secs = 10
(gdb) select-frame 3
(gdb) info locals
callargs = (10,)
flags = <value optimized out>
tstate = <value optimized out>
func = <built-in function sleep>
w = <value optimized out>
na = <value optimized out>
nk = <value optimized out>
n = <value optimized out>
pfunc = 0x26aace0
x = <value optimized out>

Frame 2 was a call in timemodule.c and showed us that the argument of the function call was 10 secs.

Frame 3 is in PyEval_EvalFrameEx() (main Python bytecode interpretation routine) it brings us back into the interpreter. Almost all local variables were optimized, func tells us that the function call was for the function sleep. Finally :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
(gdb) select-frame 4
(gdb) info locals 
sp = 0x26aace8
stack_pointer = <value optimized out>
next_instr = 0x7f0b18bc0760 "\001d\002"
opcode = <value optimized out>
oparg = <value optimized out>
why = WHY_NOT
err = 0
x = <value optimized out>
v = <value optimized out>
w = <value optimized out>
u = <value optimized out>
t = <value optimized out>
stream = 0x0
fastlocals = <value optimized out>
freevars = 0x26aace0
retval = <value optimized out>
tstate = 0x25d20a0
co = 0x7f0b18c0fa30
instr_ub = -1
instr_lb = 0
instr_prev = -1
first_instr = 0x7f0b18bc0754 "t"
names = ('time', 'sleep')
consts = (None, 10, 1)
(gdb) p tstate->frame->f_globals 
$5 = 
    {'__builtins__': <module at remote 0x7f0b18c8cad0>, '__file__': 'test.py', '__package__': None, 'i': 1, 'time': <module at remote 0x7f0b18bc7210>, '__name__': '__main__', 'next': 
<function at remote 0x7f0b18bc2578>, '__doc__': None}
(gdb) p tstate->frame->f_lineno 
$6 = 3

Here we go test.py line 3, i = 1, function call to time.sleep(10) !

Insanely hard case

If this those steps are still insufficient, you might try to set breakpoints on call_function() and let the script run a little bit with continue or step.

The final and ultimate solution is to run PyEval_EvalFrameEx() step by step. Grab the source of CPython and go to the Python directory before launching GDB (it must be the source of that exact same version of the Python interpreter that runs your script) :

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
(gdb) finish
Run till exit from #0  0x00007fc13a60adf3 in select () from /lib/x86_64-linux-gnu/libc.so.6
0x00000000004dc17b in floatsleep (self=<value optimized out>, args=<value optimized out>)
    at ../Modules/timemodule.c:914
914     ../Modules/timemodule.c: No such file or directory.
        in ../Modules/timemodule.c
(gdb) next
time_sleep (self=<value optimized out>, args=<value optimized out>) at ../Modules/timemodule.c:206
206     in ../Modules/timemodule.c
(gdb) cd python2.7-2.7.2/Python/
Working directory /home/grapsus/ludia/python2.7-2.7.2/Python.
(gdb) next
208         Py_INCREF(Py_None);
(gdb) next
209         return Py_None;
(gdb) next
210     }

Notice that it doesn't work until you cd to the Python directory of CPython source tree. Same thing if you want to debug step by step some Python module, like gevent, you will need the source code of the very same version that's running the script.

It is very time-consuming and you'll probably need a Python bytecode reference to follow what's going on but you'll eventually find the issue.

Conclusion

Even with a strongly optimized and stripped Python interpreter it is possible to debug or at least analyze a buggy Python script.

References

Monday 25 July 2011

Sleep sort performance test

Introduction

Sleep sort is a very curious and original algorithm. It brings several interesting questions. Such as, how to measure its complexity or in what cases does it perform the best.

Principle

In a nutshell: for each element to sort, start a thread or a process which sleeps a time proportional to the element and then pushes it back to the main thread.

Trivial Bash implementation:

1
2
3
4
5
6
7
8
9
10
11
12
#!/bin/bash
 
function f {
        sleep $1
        echo $1
}
 
while [ -n "$1" ] ; do 
        f $1 &
        shift
done
wait

Test suite

In order to study it and measure its performance we (me and Hartok) decided to implement the Sleep Sort in Python and gevent.

Our program generates a set of random numbers and sorts it with Sleep sort then with the Python built-in Quick sort and records timings. It works with three parameters:

  • set size
  • maximum value in the set
  • the ratio between sleep time and values in the set (for example with 10 the thread for 42 sleeps for 42/10 seconds)

This last parameter is pretty tricky: the higher it is, the faster the Sleep sort works. But if it is too high, the time to start a new thread (or greenlet) will be more important than the sleep time for low values and the result will be incorrect. It seems to be strongly hardware dependent and a little tied to the other parameters.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
#!/usr/bin/python
 
import sys
import random
import time
import copy
import gevent
 
divisor = 900000
 
def sort(numbers):
        """ Sorts a list with Sleep Sort algorithm and measures time
        """
        begin = time.time()
 
        def f(n, result):
                gevent.sleep(float(n)/divisor)
                result.append(n)
        
        result = []
        greenlets = []
        # spawn a sleep greenlet for each value
        for n in numbers:
                greenlets.append(gevent.Greenlet.spawn(f, n, result))
 
        # wait for all greenlets to end
        gevent.joinall(greenlets)
        return (time.time()-begin, result)
 
def systemsort(numbers):
        """ Sorts a list with system sort function and measures time
        """
        begin = time.time()
        result = copy.copy(numbers)
        result.sort()
        return (time.time()-begin, result)
 
def generate(n, m): 
        """ Generates a list of n random numbers, each being less than m
        """
        numbers = []
        for i in range(n):
                numbers.append(random.randint(1, m))
        return numbers
 
if __name__ == '__main__':
        # read params
        try:
                n = int(sys.argv[1])
                m = int(sys.argv[2])
        except BaseException:
                print "usage: %s n m [d]" % sys.argv[0]
                print " - n number of random numbers"
                print " - m maximum random value"
                print " - d time divisor (optionnal), default is %d" % divisor
                sys.exit(1)
 
        try:
                divisor = int(sys.argv[3])
        except:
                pass
 
        # generate random values
        numbers = generate(n, m)
        # sleep sort them
        (sleep_time, sleep_result) = sort(numbers)
        # system sort them
        (system_time, system_result) = systemsort(numbers)
 
        if system_result == sleep_result:
                print "sleep sort succeed"
                print "sleep: %f" % sleep_time
                print "system: %f" % system_time
        else:
                print "sleep sort failed (time divisor too ambitious ?)"
                print "sleep: (%d elements)" % len(sleep_result)
                #print sleep_result.__str__()
                print "system: (%d elements)" % len(system_result)
                #print system_result.__str__()
                errors = 0
                for n in range(n):
                        if system_result[n] != sleep_result[n]:
                                errors = errors + 1
                print "%d errors" % errors

Test results

quicksort.png

sleepsort.png

ratio.png

The overall performance of the Sleep sort is awful, but it has interesting performance when it comes to sets with few different values.

Friday 22 July 2011

Blog migré !

Plusieurs internautes reconnus ont retwitté mon dernier billet. Merci Stéphane Bortzmeyer et Korben ! Mon blog était devenu affreusement lent à consulter, la ligne ADSL ne tenant plus la charge avec son petit mégabit d'upload. Plus de 50 personnes se sont rabattues sur une copie que j'ai envoyée sur pastebin.

J'ai décidé de migrer mon blog immédiatement vers un serveur OVH. Merci à Ekinox pour ce gros domU ! Tout est de nouveau opérationnel et rapide si les DNS se sont bien propagés chez vous. Je migrerai le reste de mes services demain, Mercurial, SVN, etc. L'ancien serveur est toujours accessible sur

home.grapsus.net

- page 1 of 6