Bug tracker and wiki

Opened 4 years ago

Last modified 4 months ago

#619 new enhancement

better TCP_NODELAY handling: only use it when it is useful

Reported by: Antoine Martin Owned by: J. Max Mena
Priority: critical Milestone: 2.3
Component: core Version: 0.12.x
Keywords: Cc:

Description (last modified by Antoine Martin)

Follow up from #514: at present we enable TCP_NODELAY globally which is a bit wasteful.

It ensures that packets go out as soon as we queue them, but when the packets contain large-ish binary data this means that the binary data and the actual xpra packet structure are likely to travel in separate TCP-level packets.

It would be better to only enable TCP_NODELAY when aggregating packets is not helping: when we have no more data to send, or when the output buffer is full. As per the Stack Overflow question "Is there a way to flush a POSIX socket?" and this answer:
What I do is enable Nagle, write as many bytes (using non-blocking I/O) as I can to the socket (i.e. until I run out of bytes to send, or the send() call returns EWOULDBLOCK, whichever comes first), and then disable Nagle again. This seems to work well (i.e. I get low latency AND full-size packets where possible)

Good read: The Caveats of TCP_NODELAY

Change History (6)

comment:5 Changed 2 years ago by Antoine Martin

See also: #1211.

comment:8 Changed 17 months ago by Antoine Martin

See also #639, #999, #401, #540, #417

comment:11 Changed 6 months ago by Antoine Martin

Description: modified (diff)
Priority: major → critical

I'm seeing some weird behaviour with win32 clients trying to improve #999 and detecting late acks.

The network layer's source_has_more function may not be the right place to set and unset NODELAY, because lots of small screen updates can take under a millisecond to compress, which can still be slower than the time it takes the network layer to send them...
Maybe we need to pass the screen update flush attribute down to the network layer too.

comment:12 Changed 6 months ago by Antoine Martin

Owner: changed from Antoine Martin to J. Max Mena

Done in r18149 + r18150. Implementation notes:

  • we add an attribute to the packet source functions so the sender can specify if there will be more packets coming (also used if we know the packet is not urgent and can be delayed by the network layer)
  • we no longer set NODELAY globally on startup, we change (if needed) for each packet
  • in the case of screen updates, we re-use the "flush" attribute (set to a non-zero value when we know that there are more screen update packets coming immediately after)
  • some other packets set the flush flag because they need to be sent in a timely manner (ie: ping) or because they affect the user experience (ie: lost-window)
  • TCP_SOCKET_NODELAY is now XPRA_SOCKET_NODELAY as it applies to more than just TCP sockets
  • we expose the "nodelay" attribute, it should always match the TCP sockopt
  • important bandwidth management updates in ticket:999#comment:44

As of r18151, we can use XPRA_SOCKET_NODELAY to overrule the automatic settings:

  • XPRA_SOCKET_NODELAY=1 should be more or less equivalent to what we had before
  • XPRA_SOCKET_NODELAY=0 never sets NODELAY
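The override could be implemented along these lines. This is a sketch under stated assumptions: the helper name and the exact parsing of the environment variable are illustrative, only the variable name XPRA_SOCKET_NODELAY and the 0/1 semantics come from the notes above.

```python
import os

def get_nodelay(suggested):
    """Sketch: let XPRA_SOCKET_NODELAY overrule the automatic per-packet
    choice ('suggested'). Parsing details are assumptions."""
    env = os.environ.get("XPRA_SOCKET_NODELAY")
    if env == "1":
        return True        # always NODELAY: roughly the old global behaviour
    if env == "0":
        return False       # never set NODELAY
    return suggested       # unset: keep the automatic setting
```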

TODO:

  • deal with other socket types - websockify does its own thing with a global flag, can we override it?
  • verify RFB and UDP still work after protocol layer changes (and proxy?)
  • maybe keep NODELAY off when there is congestion and we are batching? (not sure this is needed or helpful)
  • maybe audio packets need to be flushed quicker? (does this change cause audio latency to go up?)

@maxmylyn: this ticket is tailor-made for the automated tests - we want to compare before and after to see if this helps, especially under bandwidth-constrained conditions. (The only slight problem is that there is a bug I'm working on which causes congestion detection to kick in too early, and the counter measures are too aggressive, causing the framerate to drop..)

Last edited 6 months ago by Antoine Martin (previous) (diff)

comment:13 Changed 6 months ago by J. Max Mena

I did some maintenance this morning on the test box - it's been acting up again. For starters, I've split the TC tests into two different files - one runs just the packet loss/delay tests, the other runs the bandwidth limited cases. That way, if something goes wrong with either test suite, we don't lose all the data.

However, there's a separate problem that I'm going to look at now - seemingly at random, the tests fail to stop the server properly, at which point all following tests fail consistently because they're unable to spin up a server on the display that's still in use. A simple xpra stop clears the server, so I'll sprinkle that command into the scripts between test suites... but I'd like to figure out why I need to do that.

I'll hold on to this ticket for a few days so we can gather more data.

comment:14 Changed 4 months ago by Antoine Martin

Milestone: 0.15 → 2.3