xpra icon
Bug tracker and wiki

Opened 2 years ago

Last modified 4 days ago

#999 new defect

catch limited bandwidth issues sooner

Reported by: Antoine Martin Owned by: J. Max Mena
Priority: critical Milestone: 2.2
Component: encodings Version: 0.14.x
Keywords: Cc:

Description

Got some logs which show:

window[2].damage.out_latency.90p : 34
window[2].damage.out_latency.avg : 44
window[2].damage.out_latency.cur : 2
window[2].damage.out_latency.max : 1391
window[2].damage.out_latency.min : 0

Which means that it takes around 44ms to compress and send the packet out to the network layer, often less.
Except that in some cases it can take 1391ms!!

There is another one, which isn't quite as bad:

window[2].damage.out_latency.90p : 324
window[2].damage.out_latency.avg : 119
window[2].damage.out_latency.cur : 25
window[2].damage.out_latency.max : 408
window[2].damage.out_latency.min : 1

At that point the UI became sluggish, about 0.5s behind the actual actions.

Not entirely sure what we should be doing here: by the time the OS is pushing back to us, it is too late already and things will be slow because there isn't enough bandwidth to service us.

Maybe we can watch the "damage out latency" more carefully and immediately increase the batching delay to prevent further degradation?

Attachments (6)

log-queue-to-send.patch (1.3 KB) - added by Antoine Martin 4 weeks ago.
logs immediate queue to send delays
bandwidth-congestion-handling.patch (12.2 KB) - added by Antoine Martin 3 weeks ago.
work in progress patch
bandwidth-congestion-handling-v6.patch (33.9 KB) - added by Antoine Martin 3 weeks ago.
updated patch
bandwidth-congestion-handling-v7.patch (21.7 KB) - added by Antoine Martin 3 weeks ago.
updated patch to apply to r17452
bandwidth-hacks.patch (12.4 KB) - added by Antoine Martin 3 weeks ago.
with these hacks, it becomes possible to run glxspheres on a 1Mbps connection
delay-refresh.patch (3.9 KB) - added by Antoine Martin 3 weeks ago.
delay refresh

Download all attachments as: .zip

Change History (31)

comment:1 Changed 2 years ago by Antoine Martin

Status: newassigned

comment:2 Changed 2 years ago by Antoine Martin

Re: linux traffic control via tc.
Just as I remembered it, the documentation is absolutely awful.

First, your need to install kernel-modules-extra: Is Traffic Control (tc) broken in Fedora 17? (Bug 823316 - unable to simulate drops with tc / netem),

The documentation found at the linux foundation is incomplete: http://www.linuxfoundation.org/collaborate/workgroups/networking/netem: when trying to add the latency, you may get this totally unhelpful message: RTNETLINK answers: No such file or directory (which file would that be? no files involved here at all!)
You need to:

modprobe sch_netem

Added difficulty: for testing, it is much easier to run everything on the same system.
Unfortunately, even when using the system's public network IP, the network subsystem will take a shortcut and route through the loopback device.
So you have to apply the rules there, ie:

tc qdisc add dev lo root netem delay 100ms 50ms 25%

And remember to remove them when you're done as this will interfere with lots of things..

tc qdisc del dev lo root netem

For the record, some alternatives:

On OSX and FreeBSD, there's also (obviously) ipfw:

And in OSX Lion onwards with xcode: Network Link Conditioner

Last edited 2 years ago by Antoine Martin (previous) (diff)

comment:3 Changed 2 years ago by Antoine Martin

Milestone: 0.160.17

I have not found a simple solution to this problem - not one that can be merged this late in the release cycle. Re-scheduling. (hopefully some of the changes can be backported).

But I did find a huge bug in the process: r11376. (backported in r11380).

comment:4 Changed 21 months ago by Antoine Martin

  • r12200 (r12202 for v0.16.x branch) should prevent the line jitter from causing drastic changes in the base batch delay
  • r12158 (r12196 for v0.16.x branch) makes it possible to tune the number of soft-expired sends - as this may make things worse on bandwidth constrained links

comment:5 Changed 20 months ago by Antoine Martin

See also #401, #540 and #1135.

Last edited 20 months ago by Antoine Martin (previous) (diff)

comment:6 Changed 19 months ago by alas

Just as a note (as much so I will remember &/or be able to find more easily as for anyone else's benefit), I've managed to get some other tc functions to work as well: loss, reorder, delete all, and list active (examples above for an eth0 device).

  • To list rules: tc -s qdisc ls dev eth0.
  • To add loss: tc qdisc add dev eth0 root netem loss 1%, or tc qdisc add dev eth0 root netem loss 2% 40% to make it more jittery.
  • To add reorder: tc qdisc add dev eth0 root netem reorder 2%, or tc qdisc add dev eth0 root netem reorder 2% 50% to make it more jittery.
  • To delete all the tc rules: tc qdisc del dev eth0 root.

comment:7 Changed 16 months ago by Antoine Martin

The low-level network code is a bit messy, in large part because of win32 and the way it (doesn't) handle blocking sockets...

  • r13270 ensures we don't penalise win32 clients (workaround is now only applied to win32 shadow servers), backported in r13271
  • r13272: code refactoring / cleanup
  • r13273: more detailed packet accounting

At the moment, we detect the network bottleneck because the network write call takes longer to return, handling WSAEWOULDBLOCK and socket.timeout would be more explicit.
Maybe we shouldn't be using blocking sockets? Or maybe reads can be blocking but it would be useful if writes were not so that we could detect when the network layer cannot handle any more data. (assuming that we can distinguish)
Or maybe we need to use different code altogether for win32 and posix?

Related reading:

Last edited 16 months ago by Antoine Martin (previous) (diff)

comment:8 Changed 16 months ago by Antoine Martin

Milestone: 0.171.0

comment:9 Changed 15 months ago by Antoine Martin

Milestone: 1.03.0

Far too late to make intrusive changes to the network layer.
Some recent fixes and breakage: #1211, #1298, #1134

Good read: https://github.com/TigerVNC/tigervnc/wiki/Latency

Last edited 14 months ago by Antoine Martin (previous) (diff)

comment:10 Changed 10 months ago by Antoine Martin

Milestone: 3.02.1

comment:11 Changed 10 months ago by Antoine Martin

See also #619, #401 and #153

Last edited 8 months ago by Antoine Martin (previous) (diff)

comment:12 Changed 8 months ago by Antoine Martin

The big hurdle for fixing this is that we have a number of queues and threads sitting in between the window damage events and the network sockets.
When things go wrong (network bottleneck, dropped packets, whatever), we need to delay the window pixel capture instead of queuing things up downstream (pixel encoding queue, packet queue, etc).
These buffers were introduced to ensure that we keep the pixel pipeline filled at all times to make the best use of the available bandwidth: highest fps / quality possible.
When tweaking those settings, we want to make sure we don't break the optimal use case.
So maybe we should define a baseline before making any changes, one for the optimal use case (gigabit or local connection without mmap) and one for the "slow" network connection (fixed settings we can reproduce reliably with tc). That's on top of the automated perf tests, which will give us another angle on this.

Things to figure out:

  • XPRA_BATCH_ALWAYS=1 - I don't think this would make much of a difference, we always end up batching anyway - but worth checking
  • XPRA_MAX_SOFT_EXPIRED=0 - how much does this help? It will prevent us from optimistically expiring damage regions before we have received ACK packets. Not sure yet how we would tune this automatically
  • TARGET_LATENCY_TOLERANCE=0 (added in r15616) - how much difference does this make? (probably very little on its own, likely to require XPRA_MAX_SOFT_EXPIRED=0)
  • av-sync: does this even matter?
  • r15617 exposes the "damage.target-latency" for each window, how far is this value from the network latency? This needs to account for the client processing the packet, decoding the picture data and presenting it, sending a new packet.. so an extra ~50ms is to be expected.

Things to do:

  • figure out how to decisively identify network pushback - without breaking blocking / non-blocking socket code... shadow servers, etc - some network implementations will chunk things, or buffer things in the network layer, and others won't.
  • drive the batch delay / soft expired delay more dynamically
Last edited 8 months ago by Antoine Martin (previous) (diff)

comment:13 Changed 8 months ago by Antoine Martin

Testing with r15691 Fedora 26 server and a win7 64-bit client using the default encoding on a 4k screen, connecting over 100Mbps LAN and using glxspheres to generate constant high fps.

Here are the settings we change during the tests (showing the default value here):

  • batch-always: XPRA_BATCH_ALWAYS=0
  • max-expired: XPRA_MAX_SOFT_EXPIRED=5
  • latency-tolerance: TARGET_LATENCY_TOLERANCE=20

Settings "SET1" is: "batch-always=1, max-expired=0, latency-tolerance=0"

Example of tc changes we can make:

  • tc latency: 100 / 50: tc qdisc add dev eth0 root netem delay 100ms 50ms 25%
  • tc loss: 2 / 40: tc qdisc add dev eth0 root netem loss 2% 40%

For collecting statistics:

xpra info | egrep "out_latency.avg|encoder.fps|client-ping-latency.avg"
Sample Settings FPS Client Ping Latency Damage Out Latency
1Defaults 30-35 4 4
2SET1 25-35 4 5
3Defaults with tc latency 100 / 0 25-35 ~105 4
4SET1 with tc latency 100 / 0 25-35 ~105 4
5Defaults with tc latency 100 / 50 15-30 ~200 ~4 to 500! (but usually very low)
6SET1 with tc latency 100 / 50 15-30 ~200 - spiked up to 900! ~4 to 500! (but usually very low)

Initial notes:

  • tried with rgb encoding, auto is faster! (100Mbps LAN)
  • heisenbug for statistics: "xpra info" slows down the server a little bit, so don't capture data too often..
  • some variation between runs, sometimes as much as 20% - not sure why yet
  • av-sync throws things off..
  • fps goes up very slowly in some cases
  • damage out latency goes very high for the xterm window used to start glxspheres with av-sync enabled: this isn't video, it shouldn't be delayed!
  • seen the client ping latency go as high as 900ms - why?
  • damage out latency can start very high with SET1 + tc latency: 2s!
  • fps varies a lot when we add latency, even more so with jitter
  • the tunables from "SET1" affect even the best case scenario (on a LAN with 4ms latency) - shows that they have a purpose
  • latency affects fps: we should be able to achieve the same fps, no matter what the latency is (within reason) - not far off with latency alone (Sample 1 vs 3), worse with jitter (Sample 1 vs 5)

comment:14 Changed 5 months ago by Antoine Martin

Milestone: 2.12.2

re-scheduling

Changed 4 weeks ago by Antoine Martin

Attachment: log-queue-to-send.patch added

logs immediate queue to send delays

comment:15 Changed 3 weeks ago by Antoine Martin

Priority: majorcritical

Simple way to reproduce the problems:

trickle -d 100 -u 100 xpra ...

Launch glxgears and then enlarge the window, the sudden increase in bandwidth consumption causes a spike in send latency:

19:54:44,796 queue to send delay (start):   0ms
19:54:44,797 queue to send delay (end)  :   0ms
19:54:44,800 queue to send delay (start):   0ms
19:54:44,800 queue to send delay (end)  :   0ms
19:54:44,807 queue to send delay (start):   0ms
19:54:44,894 queue to send delay (end)  :  87ms
19:54:45,189 queue to send delay (start):   0ms
19:54:45,189 queue to send delay (end)  :   0ms
19:54:45,193 queue to send delay (start):   0ms
19:54:45,194 queue to send delay (end)  :   1ms
19:54:45,200 queue to send delay (start):   0ms
19:54:45,285 queue to send delay (end)  :  85ms
19:54:45,552 queue to send delay (start):   0ms
19:54:46,001 queue to send delay (end)  : 448ms
19:54:46,001 queue to send delay (start): 353ms
19:54:46,167 queue to send delay (end)  : 519ms
19:54:46,167 queue to send delay (start): 350ms
19:54:46,201 queue to send delay (end)  : 384ms
19:54:46,201 queue to send delay (start):  88ms
19:54:46,242 queue to send delay (end)  : 129ms
19:54:46,449 queue to send delay (start):   0ms
19:54:46,488 queue to send delay (end)  :  39ms
19:54:46,867 queue to send delay (start):   0ms
19:54:46,867 queue to send delay (end)  :   0ms
19:54:47,124 queue to send delay (start):   0ms
19:54:47,124 queue to send delay (end)  :   0ms

The heuristics quickly adapt to hitting the ceiling in this case: it takes under a second to go back to no delay at all (probably thanks to the batch delay adjusting dynamically with the backlog).
We should be able to spot the first packet that takes a long time to push out and raise the batch delay immediately.
When the send speed is restricted, its value ends up being pretty close to the actual line constraints, so we could implement a soft bandwidth limit, detected at runtime.

In other cases, things take longer to settle:

22:23:34,639 queue to send delay (end)  :   0ms for 12KB using   h264: 35205KB/s
22:23:34,659 queue to send delay (start):   0ms
22:23:34,782 queue to send delay (end)  : 123ms for 12KB using   h264: 99KB/s
22:23:34,782 queue to send delay (start): 103ms
22:23:34,783 queue to send delay (end)  : 104ms for 12KB using   h264: 121KB/s
22:23:34,783 queue to send delay (start):  83ms
22:23:34,913 queue to send delay (end)  : 213ms for 12KB using   h264: 60KB/s
22:23:34,914 queue to send delay (start): 193ms
22:23:35,046 queue to send delay (end)  : 325ms for 13KB using   h264: 40KB/s
22:23:35,046 queue to send delay (start): 306ms
22:23:35,176 queue to send delay (end)  : 436ms for 12KB using   h264: 29KB/s
22:23:35,176 queue to send delay (start): 416ms
22:23:35,308 queue to send delay (end)  : 548ms for 13KB using   h264: 23KB/s
22:23:35,308 queue to send delay (start): 527ms
22:23:35,436 queue to send delay (end)  : 656ms for 12KB using   h264: 19KB/s
22:23:35,436 queue to send delay (start): 636ms
22:23:35,551 queue to send delay (end)  : 750ms for 11KB using   h264: 15KB/s
22:23:35,551 queue to send delay (start): 700ms
22:23:35,703 queue to send delay (end)  : 852ms for 15KB using   h264: 17KB/s
22:23:35,703 queue to send delay (start): 789ms
22:23:35,852 queue to send delay (end)  : 938ms for 14KB using   h264: 15KB/s
22:23:35,852 queue to send delay (start): 845ms
22:23:36,015 queue to send delay (end)  : 1007ms for 16KB using   h264: 16KB/s
22:23:36,015 queue to send delay (start): 907ms
22:23:36,174 queue to send delay (end)  : 1066ms for 15KB using   h264: 14KB/s
22:23:36,174 queue to send delay (start): 945ms
22:23:36,300 queue to send delay (end)  : 1071ms for 12KB using   h264: 11KB/s
22:23:36,608 queue to send delay (start):   0ms
22:23:36,632 queue to send delay (end)  :  24ms for 2KB using   h264: 97KB/s
22:23:36,661 queue to send delay (start):   0ms
22:23:36,661 queue to send delay (end)  :   0ms for 2KB using   h264: 9176KB/s

And in the meantime, packets (and that's any packet, not just picture updates...) can take a second to send.. so the user experience suffers.
The difficulty here is that during those first ~120ms, we may have already encoded and queued another 6 packets...
We may get a signal in the form of a socket timeout or of partial send, but this would be caught very deep in the network layer, far from where we would be able to do anything about it.
So I have a patch that tries to deal with this by scheduling very short timers, that we cancel if the send completes in time. This timeout value is still TBD, and may vary from OS to OS... testing will tell.
This at least prevents things from getting completely out of control, but it does create some stuttering as it overcompensates. Synchronizing via the UI thread may not be the right thing to do here (too slow / expensive).
Also, it turns any form of congestion into an increase in batch delay, which may not be the right thing to do if the send queue is busy (could be another window), as this is too drastic.
Combining this approach with a soft bandwidth limit could smooth things out better.

Changed 3 weeks ago by Antoine Martin

work in progress patch

comment:16 Changed 3 weeks ago by Antoine Martin

Some preparatory / related work:

  • r17447: remove "send-speed" batch factor (unreliable and superseded by this ticket)
  • r17448 + r17449: code cleanup / tidy up

The patch above applies to r17449.

Still TODO:

  • seems that a race condition can cause all screen updates to stop!
  • verify that this change doesn't negatively affect full-speed connections (extra calculations and branches, timers, etc)
  • verify on other OS, other types of connections (udp, ssh, etc)
  • expose new data via "xpra info"
  • derive the soft bandwidth limit
  • try harder to avoid using png: when limiting the bandwidth to 1Mbps (see #417), a single medium size png frame can take 5 seconds to send! (~600KB)
  • need to adjust speed and quality auto-tuning with congestion events, and include (soft) bandwidth limit in the heuristics
  • move the encode thread pause code: it applies to damage packets from all window sources since they all share the same thread (generalize the code so source doesn't need to know about the damage sequence)
  • clipboard packets also use the encode thread and generally don't need throttling (and throttling them can cause problems.. priority queues?) - same for cleanup functions, those don't need to wait
  • ideally: get rid of closures from queue_damage_packet
  • some constants should be derived: how long we pause sending during congestion, how long a send is expected to take

Changed 3 weeks ago by Antoine Martin

updated patch

comment:17 Changed 3 weeks ago by Antoine Martin

Much improved patch attached:

  • adds some xpra info (more needed)
  • detects bandwidth limit, moves it to source so it applies to all windows
  • discounts speed and quality when we detect congestion
  • tries harder not to use png

Most work items from comment:16 remain, and also some new ones:

  • split the patch and merge the bandwidth detection first, fix cystats
  • we should support bandwidth-limit=none to turn off all congestion code, even detected bandwidth - without using env vars
  • try harder to stop queuing more packets when we already have late_packets
  • bandwidth budget: evaluate the compressed frame size before proceeding? (on very limited bandwidth setup, each frame may use 25% or more of the bandwidth, allowing it through when we get just under 100% of the budget may still be too much)
  • auto-refresh kicks in because we delay things so much that we never reach the code that re-schedules it... same for get_refresh_subregion_encoding
  • initial speed and quality should be lower when the bandwidth-limit is low
  • cythonize more? use numpy for replacing deque? see Ring buffers in Python/Numpy
  • maybe the encode queue logic could be re-worked to support multi-queues per window, to try to provide fairer queuing

comment:18 Changed 3 weeks ago by Antoine Martin

Available bandwidth detection added in r17452.
This alone makes a huge improvement in user experience on bandwidth constrained links.

Changed 3 weeks ago by Antoine Martin

updated patch to apply to r17452

comment:19 Changed 3 weeks ago by Antoine Martin

More reliable and accurate bandwidth limit auto-detection in r17455.

Main todo items for this release:

  • lots of testing
  • try harder to avoid png: auto-refresh, video refresh, full frame updates can all trigger it...
  • adjust speed and quality with congestion events and bandwidth limit, initial settings should be lower? just when bandwidth limit is set?
  • try harder to delay regions when we hit congestion, don't queue any more!

comment:20 Changed 3 weeks ago by Antoine Martin

Updates:

  • r17456: when we detect bandwidth problems: lower speed and quality, increase batch delay
  • r17458: also slow down the refresh timer

The big remaining problem: a single png frame can kill the bandwidth...
And we can trigger it from quite a few places. Even a jpeg at high quality can use up the whole budget for several seconds.
Then we also end up queuing too many screen updates. Maybe we can inspect the size of the queue and give up on it?
We should also switch to h264 more quickly.

Last edited 3 weeks ago by Antoine Martin (previous) (diff)

Changed 3 weeks ago by Antoine Martin

Attachment: bandwidth-hacks.patch added

with these hacks, it becomes possible to run glxspheres on a 1Mbps connection

comment:21 Changed 3 weeks ago by Antoine Martin

Updates:

  • r17459, r17460 + r17465 fixup: fix cystats tests, add info attributes
  • r17461: start at a lower quality and don't update the values until we have processed a few frames
  • r17462 prevent slow decode speeds from making us use high encoding speed too often (ie: png can be quite slow)
  • r17466: honour fixed speed and fixed quality immediately (backport in r17472)
  • #1694: r17468: re-add webp (r17467 for html5, r17469 win32 build fixup), r17471: use webp more often

Changed 3 weeks ago by Antoine Martin

Attachment: delay-refresh.patch added

delay refresh

comment:22 Changed 2 weeks ago by Antoine Martin

More updates:

  • r17480 change auto-refresh handling to avoid scheduling lossless updates when we're still processing or sending other updates (r17481 for video regions)
  • r17482: take bandwidth limit value into account for calculating auto-refresh delay, speed, quality
  • r17488: fix: keep track of regions to refresh when video region changes
  • r17490: refresh timer fix

I can now run glxspheres at 4 to 8fps on a 1Mbps connection, without getting any stuttering!

If anything, we use less bandwidth than we should (~250Kbps). Being under the limit is good, as it allows the odd large frame to go through without causing too many problems, but this means that the bandwidth-limit detection code never fires, and maybe we could raise the quality or speed (framerate) a bit.

New related ticket for followup: #1700 "faster damage processing"

For very limited bandwidth situations, it would be nice to have a more progressive picture refresh, either flif (#992) or just raising the jpeg / webp quality up, only doing true lossless if we have time and bandwidth.

Last edited 2 weeks ago by Antoine Martin (previous) (diff)

comment:23 Changed 2 weeks ago by Antoine Martin

Owner: changed from Antoine Martin to J. Max Mena
Status: assignednew

With all the latest improvements and fixes (r17491, r17493, r17515, r17516, etc) and running glxspheres for testing, the server ends up turning on b-frames (#800), video scaling and YUV420P subsampling. It then reaches 25fps on a 1Mbps connection!

I think this is as good as it is going to get for this release. We can improve it some more in #1700.


@maxmylyn: bear in mind that a lossless refresh will use up a few seconds' worth of bandwidth, this is unavoidable. We try to delay the refresh as much as possible, especially when we know the bandwidth is limited - this works better if we know in advance that there are limits (#417).
It's a balancing act between quality, bandwidth usage and framerate.

We need results from the automated tests to ensure that the performance has not regressed for the non-bandwidth-constrained case.

When debugging, we need: "-d refresh,compress,regionrefresh", and the usual "xpra info". And a reproducible test case that does not involve user interaction, as I've launched glxspheres literally thousands of times to get here.


Some links for testing bandwidth constraints:

Last edited 2 weeks ago by Antoine Martin (previous) (diff)

comment:24 Changed 12 days ago by Antoine Martin

To try to mitigate the problems with the auto-refresh killing the bandwidth, r17540 switches to "almost-lossless" (using high quality lossy) when bandwidth is scarce.
Note: in some cases, webp lossless will compress better than jpeg lossy... it really depends on the type of content. #1699 may help with those decisions.

comment:25 Changed 4 days ago by J. Max Mena

Okay, I did some maintenance today on the automated test box - unfortunately it's on Fedora 25, and will need to be upgraded sometime in the near future. I might be physically in office on Friday (strong maybe), so I can do it then. I don't want to risk doing an upgrade over the VPN because who knows what can go wrong.

Note: See TracTickets for help on using tickets.