Xpra: Ticket #1749: crash in xor

I have two 64bit Linux installations of Xpra. Both crash in xor after r17513 which changes xoring from 8bit to 64bit.

That means I took 2.2.x (r18057) and it crashes some seconds after attaching a client.

Thread 3 "xpra" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f14cb52a700 (LWP 19210)]
0x00007f14ddecb625 in __pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str () from /usr/lib64/python2.7/site-packages/xpra/codecs/xor/cyxor.so

When I reverse patch r17560,r17518 and r17513 it isn't crashing.

When I change the new xor routine from 64bit to 32bit it's also not crashing. (see 32bit.patch)

Is it possible that not all buffers are aligned on 8Byte boundaries?

By reading the new xor implementation I also find:

for 0 <= i < steps:
    obuf[i] = abuf[i] ^ bbuf[i]

as an error. If the buffer is smaller than 8byte then 'steps' will be 0, but the 64bit xor is accessing 8bytes of the buffer (which has for instance only 2 bytes).

A debug build also isn't crashing!

python -c 'import struct;print( 8 * struct.calcsize("P"))'

answers '64' (=> python is 64 bit).



Mon, 22 Jan 2018 07:23:39 GMT - Mike: attachment set


Mon, 22 Jan 2018 07:29:08 GMT - Antoine Martin: owner changed

Is it possible that not all buffers are aligned on 8Byte boundaries?

It is possible, though unlikely. In any case, all modern CPUs can handle unaligned accesses.

If the buffer is smaller than 8byte then 'steps' will be 0

Are you sure that this loops runs at all when steps=0? I thought that this Cython for loop syntax was the equivalent to:

for i in range(steps):

Which definitely does not run if steps=0.

Does r18096 fix things for you?


Mon, 22 Jan 2018 07:40:22 GMT - Mike:

No r18096 doesn't change it. I tested that already before opening the ticket.

Only reversing to the old implementation, switching to 32bit xor or using a debug build avoids the crash.


Mon, 22 Jan 2018 07:47:48 GMT - Antoine Martin:

Only reversing to the old implementation, switching to 32bit xor or using a debug build avoids the crash.

What sort of CPU are you using?

cat /proc/cpuinfo

Mon, 22 Jan 2018 07:50:31 GMT - Mike: attachment set


Mon, 22 Jan 2018 07:50:45 GMT - Mike: attachment set


Mon, 22 Jan 2018 07:51:49 GMT - Mike:

Added machine1, machine2 with output of

cat /proc/cpuinfo

of both machines.


Mon, 22 Jan 2018 07:54:41 GMT - Antoine Martin:

The code definitely looks correct and I don't have time to figure out why this crashes on some CPUs. You may want to run it in gdb to see what the real problem is. In the meantime, does r18097 fix things?


Mon, 22 Jan 2018 08:02:42 GMT - Mike:

r18097 isn't compiling (misses uint64_t used four out buffer).


Mon, 22 Jan 2018 08:04:19 GMT - Antoine Martin:

Oops, sorry. Compile tested this time: r18098.


Mon, 22 Jan 2018 08:08:26 GMT - Mike:

Now it compiles and runs without crash.


Mon, 22 Jan 2018 08:13:53 GMT - Antoine Martin: status changed; resolution, milestone set

Backported to v2.2.x in r18099.

I'd love to know why those CPUs struggle with a simple 64-bit XOR but I don't have the time to look into it. Feel free to post the GDB backtrace and more details though.


Tue, 23 Jan 2018 06:28:07 GMT - Mike:

I had a look at assembly level. The compiler (gcc) generates code which uses the SSE instruction pxor for 128 bit xor. The SSE instruction needs data aligned on 16 byte boundaries. There is no "unaligned" version of this instruction. Getting the data from mem and storing is done via 16 byte transfers, but the "unaligned" version of the load/store instruction is used (therefore there no crash happens).

When casting to uint64_t* the compiler seems to assume at least an 8 byte alignmend, which would make sense. Therefore in this case there is no pre-check on the alignment and the pxor instruction gives an exception when the data is unaligned.

Now the good news: When casting to uint32_t* the compiler also uses 128 bit xor, but now assumes a 4 byte alignment of the buffers. So it generates code which deals with data not aligned on 16 byte boundaries so that the 128 bit pxor still gets data on a 16 byte boundary. At the end there will be no difference in performance between the uint32_t* and the uint64_t* cast and the current solution is absolutely ok.

I guess that the buffers allocated by python/cython will still be 4 byte aligned. Bad: this is still a guessing. Best would be to add assertions to check the 4 byte alignmend.

This is the 128 Bit main xor loop:

   ┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
   │0x7f813205f61a <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+362>        movdqu 0x0(%r13,%rsi,1),%xmm0                                                          │
   │0x7f813205f621 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+369>        add    $0x1,%r10d                                                                      │
  >│0x7f813205f625 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+373>        pxor   (%r14,%rsi,1),%xmm0                                                             │
   │0x7f813205f62b <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+379>        movups %xmm0,(%rcx,%rsi,1)                                                             │
   │0x7f813205f62f <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+383>        add    $0x10,%rsi                                                                      │
   │0x7f813205f633 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+387>        cmp    %r12d,%r10d                                                                     │
   │0x7f813205f636 <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+390>        jb     0x7f813205f61a <__pyx_pw_4xpra_6codecs_3xor_5cyxor_1xor_str+362>                │
   └───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

At the end, not the CPU has problems to perform a simple 64 bit xor, but the cast to uint64_t* is simply invalid.


Tue, 23 Jan 2018 07:12:31 GMT - Antoine Martin:

Excellent investigation. I'm pretty sure I've seen somewhere that newer CPU models can deal with 64-bit unaligned access, this would explain why I have not seen any problems at all on any of my test systems.

I guess that the buffers allocated by python/cython will still be 4 byte aligned. Bad: this is still a guessing.

The buffers that we allocate explicitly are all memaligned on Linux (to 64 by default), aligned to 16-bit on macos and not aligned at all on win32... (for details see: browser/xpra/trunk/src/xpra/buffers/memalign.c) In any case, the input to the xor function isn't always directly under our control and this is meant as an opaque call anyway. FWIW: the python string buffers always seem to be 4-byte aligned, but we accept any kind of buffer as input, including memoryviews and those can be sliced to any value...

So r18124 adds a slow byte-at-a-time path for unaligned addresses - the improved unit test exercises it.

@Mike: I think this covers everything?


Tue, 23 Jan 2018 07:32:39 GMT - Mike:

I've tested it and had no crash anymore.

Many thanks for your quick fix!


Sat, 23 Jan 2021 05:32:46 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/1749