Xpra: Ticket #502: efficient network receive buffer management when receiving large chunks

At the moment, we have a read_buffer which is a string, and we append to it each time we get more data. We read data from the network 8KB at a time, which means that for an 8MB picture (uncompressed RGBA at 1080p), we end up copying that string buffer about 1000 times. Quick maths: (1000*1001)/2*8KB = ~4GB of memory copied for an 8MB picture!

With just lz4 compression, the average frame drops to a few percent of the original size. At 5%, 400KB is 50 packets, which means 50*51/2*8KB = ~10MB, still 25 times more than we should be copying. With h264, the compression is much more efficient, so the average packet size drops to 200KB, which is still large enough that the memory copying is probably costing us.
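The estimate above can be reproduced with a few lines of Python. This is only a sketch: bytes_copied is a name made up for illustration, and it assumes the worst case where every append copies the whole buffer accumulated so far. It uses 1024-based units, so the figures come out slightly above the rounded ones quoted in the description.

```python
KB = 1024
MB = 1024 * KB


def bytes_copied(total_size, chunk_size):
    """Worst-case bytes copied when the buffer is rebuilt on every append.

    Hypothetical helper, not part of xpra.
    """
    chunks = -(-total_size // chunk_size)  # ceiling division
    # the n-th append produces a buffer of n chunks: 1 + 2 + ... + chunks
    return chunks * (chunks + 1) // 2 * chunk_size


for label, size in (("raw 8MB frame", 8 * MB), ("lz4 at 5% (400KB)", 400 * KB)):
    copied = bytes_copied(size, 8 * KB)
    print("%-20s %8.1fMB copied, %.1fx the payload"
          % (label, copied / float(MB), copied / float(size)))
```

The cost grows with the square of the number of chunks, which is why doubling the read size roughly halves the total copied.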

Fri, 24 Jan 2014 15:21:48 GMT - Antoine Martin: attachment set

attachment set to protocol-efficient-receive-buffer.patch

PoC: populates the payload buffer using slicing so we only allocate the memory once

Fri, 24 Jan 2014 16:17:48 GMT - Antoine Martin: attachment set

attachment set to protocol-efficient-receive-buffer-v2.patch

better patch using bytearray instead of ctypes.create_string_buffer, but still too slow...
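For readers without the attachments, the general idea of filling a preallocated buffer by slice assignment looks roughly like the sketch below. It is not the patch itself: read_payload and its read argument are hypothetical names, and the packet size is assumed to be known from the header before the payload is read.

```python
import io


def read_payload(read, payload_size, chunk_size=8192):
    """Fill a buffer allocated once, instead of growing a string.

    read is a hypothetical callable returning up to n bytes,
    in the style of socket.recv.
    """
    payload = bytearray(payload_size)  # single allocation, zero-filled
    pos = 0
    while pos < payload_size:
        data = read(min(chunk_size, payload_size - pos))
        if not data:
            raise EOFError("connection closed after %i bytes" % pos)
        # same-length slice assignment: overwrites in place, no resize
        payload[pos:pos + len(data)] = data
        pos += len(data)
    return payload


# stand-in for the network connection
source = io.BytesIO(b"x" * 100000)
payload = read_payload(source.read, 100000)
print(len(payload))
```

With a real socket, socket.recv_into on a memoryview of the buffer might avoid the intermediate chunk object as well, though that variant is not what the ticket measured.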

Fri, 24 Jan 2014 16:21:13 GMT - Antoine Martin: status changed

status changed from new to assigned

The v2 patch above looks good, but there is one problem left: we don't want to spend any time allocating the buffer. A simple malloc would do. Unfortunately, this is how long it takes to allocate 1GB:

ctypes.create_string_buffer(1024*1024*1024) > 1s
bytearray(1024*1024*1024) > 1s
" "*(1024*1024*1024) ~ 3ms!

So the string wins, but it is immutable... All I want is a bytearray backed by malloc. Why is it so hard?
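A sketch for re-running this comparison is shown below. It makes a few assumptions: the size is cut down to 16MB so it runs anywhere, b" " * n stands in for the string case, and time_alloc is a made-up helper. The numbers will depend heavily on the Python version and platform, so they may not match the 2014 figures above.

```python
import ctypes
import time


def time_alloc(label, alloc, size):
    """Time one allocation of size bytes and return the buffer length."""
    start = time.perf_counter()
    buf = alloc(size)
    elapsed = time.perf_counter() - start
    print("%-30s %8.2fms" % (label, elapsed * 1000))
    return len(buf)


SIZE = 16 * 1024 * 1024  # assumption: 16MB rather than the 1GB used above

CANDIDATES = (
    ("ctypes.create_string_buffer", ctypes.create_string_buffer),
    ("bytearray", bytearray),
    ("bytes repeat", lambda n: b" " * n),
)

lengths = [time_alloc(label, alloc, SIZE) for label, alloc in CANDIDATES]
```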

Fri, 24 Jan 2014 16:35:39 GMT - Antoine Martin: attachment set

attachment set to protocol-efficient-receive-buffer-v3.patch

updated version using a list of strings as temporary buffer
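The approach can be sketched as follows. This is an illustration rather than the v3 patch: ChunkBuffer is a hypothetical class, and it assumes that chunks never run past the end of the current packet.

```python
class ChunkBuffer(object):
    """Hypothetical helper: keep the chunks in a list, join once at the end."""

    def __init__(self, payload_size):
        self.payload_size = payload_size
        self.chunks = []
        self.received = 0

    def add(self, data):
        """Store a chunk; return the full payload once complete, else None."""
        self.chunks.append(data)  # stores a reference, no copy
        self.received += len(data)
        if self.received < self.payload_size:
            return None
        # assumption: no data from the next packet is mixed in
        return b"".join(self.chunks)  # the only copy of the payload


buf = ChunkBuffer(20000)
result = None
for chunk in (b"a" * 8192, b"b" * 8192, b"c" * 3616):
    result = buf.add(chunk)
print(len(result))
```

Each byte is copied once by the final join, instead of once per subsequent append.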

Sun, 26 Jan 2014 15:06:28 GMT - Antoine Martin:

Contrary to what I expected, the performance improvement is marginal at best for the large packet case and the extra if statements in the new code actually make the more common case (handling smaller packets) a little slower! It does reduce client CPU load though, which may still make this worth having.

See Concatenation Test Code for why that is. Quote: With byte-code strings, concatenating with += is as fast as a .join. Since our strings are byte strings from the network layer, using join doesn't really help.

What seems to make more of a difference is the size of the receive buffer: bumping it to 16KB makes a noticeable improvement for the high bandwidth case. Interestingly, the lz4 compression saves a huge amount of bandwidth, without really costing much in terms of number of frames / pixels sent.
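One way to see the effect of the read size is to count how many recv calls a payload takes. The sketch below uses a local socket pair rather than the xpra protocol code, count_reads is a made-up helper, and the exact call counts will vary with the operating system's socket buffers.

```python
import socket
import threading


def count_reads(payload, recv_size):
    """Send payload over a local socket pair.

    Returns (bytes received, number of recv calls). Hypothetical helper.
    """
    reader, writer = socket.socketpair()

    def send():
        writer.sendall(payload)
        writer.close()  # lets the reader see end-of-stream

    sender = threading.Thread(target=send)
    sender.start()
    received = calls = 0
    while True:
        data = reader.recv(recv_size)
        if not data:
            break
        received += len(data)
        calls += 1
    sender.join()
    reader.close()
    return received, calls


payload = b"\0" * (1024 * 1024)
for recv_size in (8 * 1024, 16 * 1024):
    received, calls = count_reads(payload, recv_size)
    print("recv(%i): %i bytes in %i calls" % (recv_size, received, calls))
```

Fewer calls means fewer buffer appends and fewer trips through the Python read loop for the same amount of data.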

Mon, 27 Jan 2014 09:00:16 GMT - Antoine Martin: status changed; resolution set

status changed from assigned to closed
resolution set to fixed

As can be seen in the newly added performance charts (rgb-nocompress), disabling compression of RGB pixels can increase the bandwidth consumption 100-fold, peaking above 100MB/s. The largest improvement comes from bumping the size of the network receive buffer to 64KB, done in r5276.

Finally, the protocol speed charts, comparing the old code with small variations of the new code, with and without 64KB receive buffers, show that there isn't much to gain from the new code.

Not applying and closing.

Mon, 27 Jan 2014 09:17:31 GMT - Antoine Martin: attachment set

attachment set to rgb-net-chunks-compression-old-vs-new-filtered.csv

raw CSV data used to generate the graphs

Sat, 23 Jan 2021 04:57:33 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/502