Xpra: Ticket #999: catch limited bandwidth issues sooner

Got some logs which show:

window[2].damage.out_latency.90p : 34
window[2].damage.out_latency.avg : 44
window[2].damage.out_latency.cur : 2
window[2].damage.out_latency.max : 1391
window[2].damage.out_latency.min : 0

This means that it takes around 44ms to compress and send the packet out to the network layer, often less - except that in some cases it can take 1391ms!

There is another one, which isn't quite as bad:

window[2].damage.out_latency.90p : 324
window[2].damage.out_latency.avg : 119
window[2].damage.out_latency.cur : 25
window[2].damage.out_latency.max : 408
window[2].damage.out_latency.min : 1

At that point the UI became sluggish, about 0.5s behind the actual actions.

Not entirely sure what we should be doing here: by the time the OS is pushing back to us, it is too late already and things will be slow because there isn't enough bandwidth to service us.

Maybe we can watch the "damage out latency" more carefully and immediately increase the batching delay to prevent further degradation?
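As a rough illustration of that idea (hypothetical names, not the actual xpra code; "batch" is assumed to be an object carrying the per-window batch delay), the batch delay could be bumped as soon as the observed out-latency drifts well above its recent average:

from collections import deque

class BatchDelayGuard:
    """Illustrative sketch only: raise the batch delay as soon as the
    observed damage out-latency exceeds its recent average by a set margin,
    instead of waiting for the OS to push back on the socket."""
    def __init__(self, tolerance_ms=100, max_delay_ms=1000):
        self.samples = deque(maxlen=50)     # recent out-latency samples (ms)
        self.tolerance_ms = tolerance_ms
        self.max_delay_ms = max_delay_ms

    def record_out_latency(self, latency_ms, batch):
        self.samples.append(latency_ms)
        avg = sum(self.samples) / len(self.samples)
        if latency_ms > avg + self.tolerance_ms:
            # degradation detected: back off immediately
            batch.delay = min(self.max_delay_ms, max(batch.delay * 2, latency_ms))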



Fri, 23 Oct 2015 02:02:08 GMT - Antoine Martin: status changed

afarr forwarded some links:


Sun, 25 Oct 2015 05:29:13 GMT - Antoine Martin:

Re: linux traffic control via tc. Just as I remembered it, the documentation is absolutely awful.

First, you need to install kernel-modules-extra (see Bug 823316 - unable to simulate drops with tc / netem).

The documentation found at the Linux Foundation (http://www.linuxfoundation.org/collaborate/workgroups/networking/netem) is incomplete: when trying to add the latency, you may get the totally unhelpful message "RTNETLINK answers: No such file or directory" (which file would that be? there are no files involved here at all!). You need to:

modprobe sch_netem

Added difficulty: for testing, it is much easier to run everything on the same system. Unfortunately, even when using the system's public network IP, the network subsystem will take a shortcut and route through the loopback device. So you have to apply the rules there, ie:

tc qdisc add dev lo root netem delay 100ms 50ms 25%

And remember to remove the rules when you're done, as they will interfere with lots of things:

tc qdisc del dev lo root netem

For the record, some alternatives:

On OSX and FreeBSD, there's also (obviously) ipfw.

And on OSX Lion onwards, with Xcode: the Network Link Conditioner.


Wed, 09 Dec 2015 14:32:15 GMT - Antoine Martin: milestone changed

I have not found a simple solution to this problem - not one that can be merged this late in the release cycle. Re-scheduling. (hopefully some of the changes can be backported).

But I did find a huge bug in the process: r11376. (backported in r11380).


Thu, 24 Mar 2016 09:15:50 GMT - Antoine Martin:


Thu, 14 Apr 2016 04:08:24 GMT - Antoine Martin:

See also #401, #540 and #1135.


Thu, 12 May 2016 23:21:30 GMT - alas:

Just as a note (as much so that I will remember &/or be able to find this more easily as for anyone else's benefit), I've managed to get some other tc functions to work as well: loss, reorder, delete all, and list active - see the examples below for an eth0 device.
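For reference, the standard netem invocations for those operations look roughly like this (illustrative values only - substitute the device and numbers used in testing):

tc qdisc add dev eth0 root netem loss 2%
tc qdisc change dev eth0 root netem delay 10ms reorder 25% 50%
tc qdisc show dev eth0
tc qdisc del dev eth0 root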


Mon, 08 Aug 2016 06:24:42 GMT - Antoine Martin:

The low-level network code is a bit messy, in large part because of win32 and the way it (doesn't) handle blocking sockets...

At the moment, we detect the network bottleneck because the network write call takes longer to return; handling WSAEWOULDBLOCK and socket.timeout would be more explicit. Maybe we shouldn't be using blocking sockets? Or maybe reads can remain blocking, but it would be useful if writes were not, so that we could detect when the network layer cannot handle any more data (assuming that we can distinguish the two cases). Or maybe we need to use different code altogether for win32 and posix?
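For illustration only, a write path that surfaces back-pressure with plain Python sockets could look like the sketch below (this is not xpra's protocol code; the helper and callback names are made up):

import errno
import socket

def send_with_backpressure(sock, data, on_congestion, timeout=0.1):
    """Try to send all of `data` without blocking indefinitely; report
    congestion to the caller instead (illustrative sketch only)."""
    sock.settimeout(timeout)            # send() now raises socket.timeout
    sent = 0
    while sent < len(data):
        try:
            sent += sock.send(data[sent:])
        except socket.timeout:
            # the kernel buffer stayed full for `timeout` seconds:
            # the network layer cannot absorb any more data right now
            on_congestion("send-timeout", sent)
            break
        except OSError as e:
            if e.errno in (errno.EWOULDBLOCK, errno.EAGAIN):
                # non-blocking variant of the same signal (WSAEWOULDBLOCK on win32)
                on_congestion("would-block", sent)
                break
            raise
    return sent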

Related reading:


Wed, 10 Aug 2016 05:05:29 GMT - Antoine Martin: milestone changed


Mon, 26 Sep 2016 07:16:33 GMT - Antoine Martin: milestone changed

Far too late to make intrusive changes to the network layer. Some recent fixes and breakage: #1211, #1298, #1134

Good read: https://github.com/TigerVNC/tigervnc/wiki/Latency


Sun, 19 Feb 2017 06:36:28 GMT - Antoine Martin: milestone changed


Mon, 20 Feb 2017 11:49:32 GMT - Antoine Martin:

See also #619, #401 and #153


Fri, 14 Apr 2017 11:29:53 GMT - Antoine Martin:

The big hurdle for fixing this is that we have a number of queues and threads sitting in between the window damage events and the network sockets. When things go wrong (network bottleneck, dropped packets, whatever), we need to delay the window pixel capture instead of queuing things up downstream (pixel encoding queue, packet queue, etc). Those buffers were introduced to ensure that we keep the pixel pipeline filled at all times, to make the best use of the available bandwidth: the highest fps / quality possible.

When tweaking those settings, we want to make sure we don't break the optimal use case. So maybe we should define a baseline before making any changes: one for the optimal use case (gigabit or local connection without mmap) and one for a "slow" network connection (fixed settings we can reproduce reliably with tc). That's on top of the automated perf tests, which will give us another angle on this.

Things to figure out:

Things to do:


Sat, 22 Apr 2017 14:29:35 GMT - Antoine Martin:

Testing with r15691 Fedora 26 server and a win7 64-bit client using the default encoding on a 4k screen, connecting over 100Mbps LAN and using glxspheres to generate constant high fps.

Here are the settings we change during the tests (showing the default value here):

Settings "SET1" is: "batch-always=1, max-expired=0, latency-tolerance=0"

Example of tc changes we can make:

For collecting statistics:

xpra info | egrep "out_latency.avg|encoder.fps|client-ping-latency.avg"
Sample | Settings                           | FPS   | Client Ping Latency      | Damage Out Latency
1      | Defaults                           | 30-35 | 4                        | 4
2      | SET1                               | 25-35 | 4                        | 5
3      | Defaults with tc latency 100 / 0   | 25-35 | ~105                     | 4
4      | SET1 with tc latency 100 / 0       | 25-35 | ~105                     | 4
5      | Defaults with tc latency 100 / 50  | 15-30 | ~200                     | ~4 to 500! (but usually very low)
6      | SET1 with tc latency 100 / 50      | 15-30 | ~200 (spiked up to 900!) | ~4 to 500! (but usually very low)

Initial notes:


Mon, 10 Jul 2017 14:14:55 GMT - Antoine Martin: milestone changed

re-scheduling


Thu, 16 Nov 2017 14:31:07 GMT - Antoine Martin: attachment set

logs immediate queue to send delays


Thu, 16 Nov 2017 17:51:51 GMT - Antoine Martin: priority changed

Simple way to reproduce the problems:

trickle -d 100 -u 100 xpra ...

Launch glxgears and then enlarge the window; the sudden increase in bandwidth consumption causes a spike in send latency:

19:54:44,796 queue to send delay (start):   0ms
19:54:44,797 queue to send delay (end)  :   0ms
19:54:44,800 queue to send delay (start):   0ms
19:54:44,800 queue to send delay (end)  :   0ms
19:54:44,807 queue to send delay (start):   0ms
19:54:44,894 queue to send delay (end)  :  87ms
19:54:45,189 queue to send delay (start):   0ms
19:54:45,189 queue to send delay (end)  :   0ms
19:54:45,193 queue to send delay (start):   0ms
19:54:45,194 queue to send delay (end)  :   1ms
19:54:45,200 queue to send delay (start):   0ms
19:54:45,285 queue to send delay (end)  :  85ms
19:54:45,552 queue to send delay (start):   0ms
19:54:46,001 queue to send delay (end)  : 448ms
19:54:46,001 queue to send delay (start): 353ms
19:54:46,167 queue to send delay (end)  : 519ms
19:54:46,167 queue to send delay (start): 350ms
19:54:46,201 queue to send delay (end)  : 384ms
19:54:46,201 queue to send delay (start):  88ms
19:54:46,242 queue to send delay (end)  : 129ms
19:54:46,449 queue to send delay (start):   0ms
19:54:46,488 queue to send delay (end)  :  39ms
19:54:46,867 queue to send delay (start):   0ms
19:54:46,867 queue to send delay (end)  :   0ms
19:54:47,124 queue to send delay (start):   0ms
19:54:47,124 queue to send delay (end)  :   0ms

The heuristics quickly adapt to hitting the ceiling in this case: it takes under a second to go back to no delay at all (probably thanks to the batch delay adjusting dynamically with the backlog). We should be able to spot the first packet that takes a long time to push out and raise the batch delay immediately. When the send speed is restricted, the measured value ends up being pretty close to the actual line constraints, so we could implement a soft bandwidth limit, detected at runtime.
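A rough sketch of such a runtime-detected soft limit (hypothetical names, simply averaging the observed send speed of the most recent packets):

from collections import deque

class SoftBandwidthLimit:
    """Illustrative only: estimate the effective line speed from the last
    few sends, so the batch / quality heuristics can back off before the
    OS pushes back."""
    def __init__(self, window=20):
        self.records = deque(maxlen=window)   # (bytes_sent, elapsed_seconds)

    def record_send(self, size, start, end):
        self.records.append((size, max(end - start, 1e-6)))

    def estimate_bps(self):
        if not self.records:
            return None
        total_bytes = sum(size for size, _ in self.records)
        total_time = sum(elapsed for _, elapsed in self.records)
        return 8 * total_bytes / total_time   # bits per second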

In other cases, things take longer to settle:

22:23:34,639 queue to send delay (end)  :   0ms for 12KB using   h264: 35205KB/s
22:23:34,659 queue to send delay (start):   0ms
22:23:34,782 queue to send delay (end)  : 123ms for 12KB using   h264: 99KB/s
22:23:34,782 queue to send delay (start): 103ms
22:23:34,783 queue to send delay (end)  : 104ms for 12KB using   h264: 121KB/s
22:23:34,783 queue to send delay (start):  83ms
22:23:34,913 queue to send delay (end)  : 213ms for 12KB using   h264: 60KB/s
22:23:34,914 queue to send delay (start): 193ms
22:23:35,046 queue to send delay (end)  : 325ms for 13KB using   h264: 40KB/s
22:23:35,046 queue to send delay (start): 306ms
22:23:35,176 queue to send delay (end)  : 436ms for 12KB using   h264: 29KB/s
22:23:35,176 queue to send delay (start): 416ms
22:23:35,308 queue to send delay (end)  : 548ms for 13KB using   h264: 23KB/s
22:23:35,308 queue to send delay (start): 527ms
22:23:35,436 queue to send delay (end)  : 656ms for 12KB using   h264: 19KB/s
22:23:35,436 queue to send delay (start): 636ms
22:23:35,551 queue to send delay (end)  : 750ms for 11KB using   h264: 15KB/s
22:23:35,551 queue to send delay (start): 700ms
22:23:35,703 queue to send delay (end)  : 852ms for 15KB using   h264: 17KB/s
22:23:35,703 queue to send delay (start): 789ms
22:23:35,852 queue to send delay (end)  : 938ms for 14KB using   h264: 15KB/s
22:23:35,852 queue to send delay (start): 845ms
22:23:36,015 queue to send delay (end)  : 1007ms for 16KB using   h264: 16KB/s
22:23:36,015 queue to send delay (start): 907ms
22:23:36,174 queue to send delay (end)  : 1066ms for 15KB using   h264: 14KB/s
22:23:36,174 queue to send delay (start): 945ms
22:23:36,300 queue to send delay (end)  : 1071ms for 12KB using   h264: 11KB/s
22:23:36,608 queue to send delay (start):   0ms
22:23:36,632 queue to send delay (end)  :  24ms for 2KB using   h264: 97KB/s
22:23:36,661 queue to send delay (start):   0ms
22:23:36,661 queue to send delay (end)  :   0ms for 2KB using   h264: 9176KB/s

And in the meantime, packets (and that's any packet, not just picture updates...) can take a second to send, so the user experience suffers. The difficulty here is that during those first ~120ms, we may have already encoded and queued another 6 packets. We may get a signal in the form of a socket timeout or a partial send, but this would be caught very deep in the network layer, far from where we would be able to do anything about it. So I have a patch that tries to deal with this by scheduling very short timers, which we cancel if the send completes in time. The timeout value is still TBD, and may vary from OS to OS... testing will tell. This at least prevents things from getting completely out of control, but it does create some stuttering as it overcompensates. Synchronizing via the UI thread may not be the right thing to do here (too slow / expensive). Also, it turns any form of congestion into an increase in batch delay, which may not be the right thing to do if the send queue is busy (it could be another window), as that is too drastic. Combining this approach with a soft bandwidth limit could smooth things out better.
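The short-timer idea can be pictured like this (a minimal sketch using threading.Timer; the actual patch presumably hooks into xpra's own scheduling, and the timeout value is only illustrative):

import threading

SEND_TIMEOUT = 0.02   # the "still TBD" value, 20ms here purely for illustration

def send_packet(sock, data, on_slow_send):
    # arm a short timer before the send: if the send has not completed by
    # the time it fires, treat that as a congestion signal (e.g. raise the
    # batch delay); cancelling after completion makes the timer a no-op
    timer = threading.Timer(SEND_TIMEOUT, on_slow_send)
    timer.start()
    try:
        sock.sendall(data)
    finally:
        timer.cancel()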


Fri, 17 Nov 2017 12:55:01 GMT - Antoine Martin: attachment set

work in progress patch


Fri, 17 Nov 2017 13:05:24 GMT - Antoine Martin:

Some preparatory / related work:

The patch above applies to r17449.

Still TODO:


Sat, 18 Nov 2017 18:00:33 GMT - Antoine Martin: attachment set

updated patch


Sat, 18 Nov 2017 18:14:06 GMT - Antoine Martin:

Much improved patch attached:

Most work items from comment:16 remain, and also some new ones:


Sun, 19 Nov 2017 07:19:29 GMT - Antoine Martin:

Available bandwidth detection added in r17452. This alone makes a huge improvement in user experience on bandwidth constrained links.


Sun, 19 Nov 2017 08:08:14 GMT - Antoine Martin: attachment set

updated patch to apply to r17452


Sun, 19 Nov 2017 18:10:36 GMT - Antoine Martin:

More reliable and accurate bandwidth limit auto-detection in r17455.

Main todo items for this release:


Mon, 20 Nov 2017 09:44:05 GMT - Antoine Martin:

Updates:

The big remaining problem: a single png frame can kill the bandwidth... And we can trigger it from quite a few places. Even a jpeg at high quality can use up the whole budget for several seconds. Then we also end up queuing too many screen updates. Maybe we can inspect the size of the queue and give up on it? We should also switch to h264 more quickly.
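One possible shape for the "inspect the queue and give up" idea (a hypothetical helper, not the actual xpra logic):

def should_drop_or_downgrade(encoded_size, bandwidth_bps, queue_length,
                             max_seconds=1.0, max_queue=5):
    """Return True if this frame would monopolize the estimated bandwidth
    for too long, or if too many screen updates are already queued - in
    which case we would rather skip it or re-encode it with a cheaper
    (lossy) encoding such as h264."""
    budget_bytes = (bandwidth_bps / 8.0) * max_seconds
    return encoded_size > budget_bytes or queue_length > max_queue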


Mon, 20 Nov 2017 16:09:25 GMT - Antoine Martin: attachment set

with these hacks, it becomes possible to run glxspheres on a 1Mbps connection


Thu, 23 Nov 2017 06:13:52 GMT - Antoine Martin:

Updates:


Thu, 23 Nov 2017 07:58:28 GMT - Antoine Martin: attachment set

delay refresh


Fri, 24 Nov 2017 15:41:31 GMT - Antoine Martin:

More updates:

I can now run glxspheres at 4 to 8fps on a 1Mbps connection, without getting any stuttering!

If anything, we use less bandwidth than we should (~250Kbps). Being under the limit is good, as it allows the odd large frame to go through without causing too many problems, but this means that the bandwidth-limit detection code never fires, and maybe we could raise the quality or speed (framerate) a bit.

New related ticket for followup: #1700 "faster damage processing"

For very limited bandwidth situations, it would be nice to have a more progressive picture refresh, either flif (#992) or just raising the jpeg / webp quality up, only doing true lossless if we have time and bandwidth.


Sun, 26 Nov 2017 11:20:04 GMT - Antoine Martin: owner, status changed

With all the latest improvements and fixes (r17491, r17493, r17515, r17516, etc) and running glxspheres for testing, the server ends up turning on b-frames (#800), video scaling and YUV420P subsampling. It then reaches 25fps on a 1Mbps connection!

I think this is as good as it is going to get for this release. We can improve it some more in #1700.


@maxmylyn: bear in mind that a lossless refresh will use up a few seconds' worth of bandwidth; this is unavoidable. We try to delay the refresh as much as possible, especially when we know the bandwidth is limited - this works better if we know in advance that there are limits (#417). It's a balancing act between quality, bandwidth usage and framerate.

We need results from the automated tests to ensure that the performance has not regressed for the non-bandwidth-constrained case.

When debugging, we need: "-d refresh,compress,regionrefresh", and the usual "xpra info". And a reproducible test case that does not involve user interaction, as I've launched glxspheres literally thousands of times to get here.


Some links for testing bandwidth constraints:


Tue, 28 Nov 2017 16:30:06 GMT - Antoine Martin:

To try to mitigate the problems with the auto-refresh killing the bandwidth, r17540 switches to "almost-lossless" (using high quality lossy) when bandwidth is scarce. Note: in some cases, webp lossless will compress better than jpeg lossy... it really depends on the type of content. #1699 may help with those decisions.


Wed, 06 Dec 2017 22:57:14 GMT - J. Max Mena:

Okay, I did some maintenance today on the automated test box - unfortunately it's on Fedora 25 and will need to be upgraded sometime in the near future. I might be physically in the office on Friday (a strong maybe), so I can do it then. I don't want to risk doing an upgrade over the VPN, because who knows what could go wrong.


Mon, 11 Dec 2017 23:53:24 GMT - J. Max Mena: owner changed

Now that I'm not dealing with a horrible Fedora 25->26 upgrade, I've had time to finally catch up on my tickets. While testing #1638 by watching some video files and reading Wikipedia at the same time (partly to pass the time, partly to add some extra strain on the session), I noticed that the session aggressively lowers quality, to the point that it became almost impossible to read after scrolling or switching tabs unless I let the session sit for a couple of seconds. I noticed the aggressive quality drops, and thought I was going crazy when my overall performance seemed to have tanked sometime last week - but the session was still responsive, so it wasn't lagging or drawing spinners to indicate a slow server. I went perusing through the timeline to see if anything had changed in the last couple of weeks, and I found the link to this ticket in r17540.

So, I rolled my server back to r17539, rebuilt it, and re-launched a session - and lo and behold, everything was pretty again. While the quality can drop if you aggressively scroll (as expected), generally the server refreshes much quicker, and to the layman's eye the session looks and behaves much nicer.

So I guess I have a half question and a half response to this - would something in r17540 cause the server to aggressively pursue lower quality even when bandwidth wouldn't be scarce? My server is on a full gigabit (and fully wired, albeit kinda lengthy, Ethernet) LAN so there's no reason why it should think that it has limited bandwidth. Or am I completely wrong? (totally likely)


Also, unrelated, I'll be up physically at the test box to yet again run a Fedora upgrade to 26.


Tue, 12 Dec 2017 05:43:41 GMT - Antoine Martin: owner changed

I noticed that the session aggressively lowers quality to the point that it became almost impossible to read after scrolling or switching tabs unless I let the session sit for a couple seconds

Aggressively lowering the quality is one thing, but this sounds like the auto-refresh is kicking in too slowly.

I found the link to this ticket in r17540

Well, this ticket has been set to "critical" and was assigned to you for the 2.2 release - that would be a more obvious way to find it.

I rolled my server back to r17539 and rebuilt it...

Did you try any revision later than r17539? There are ~68 revisions since then.

would something in r17540 cause the server to aggressively pursue lower quality

No. It only switches to non-lossless auto-refresh if both of those conditions are met:

This would manifest itself as an auto-refresh that is lossy but near-lossless; it does not affect how quickly the refresh kicks in.

It does nothing to the "quality" setting, and AFAICR the only changes to the quality setting calculations apply only when there are bandwidth constraints (r17456).

You didn't include any debug information (ie: "xpra info" would be a start, also "-d refresh,regionrefresh" since this is a refresh issue), so we can't say for sure if this is at play here.

The changesets that do change how we handle auto-refresh are: r17491, r17481, r17480, r17458, r17456. In particular, r17481+r17480 change when we cancel and schedule the auto-refresh, to ensure that we only schedule one after we have sent a lossy packet (and not just at the time the screen update is queued up).


Tue, 12 Dec 2017 19:23:18 GMT - J. Max Mena:

Well, this ticket has been set to "critical" and was assigned to you for the 2.2 release, that would be a more obvious way.

Sorry about that - I was finally able to catch up on my tickets late last week. Now that the holidays are approaching, I'll definitely be able to get more done, especially after next week when I'm back full time.

Anyways:

No. It only switches to non-lossless auto-refresh if both of those conditions are met: ...

It does nothing to the "quality" setting, and AFAICR - the only changes to the quality setting calculations only apply when there are bandwidth constraints (r17456).

Okay I had a feeling that I was mistaken but wanted to make sure, thanks for confirming that.

So I did what I should have done yesterday and turned on the OpenGL paint boxes to investigate what's actually going on, rather than blindly guessing. What I'm seeing is that as of r17607 it's painting an entire Firefox window with h264 (that is still the blue one, right?) under certain situations. In some situations that makes sense - like when I switch from tab to tab rather quickly, or click through a series of Wikipedia links quickly. But there are other cases where it doesn't make sense - the two most notable:

After some further investigation, it's much easier to repro what I'm seeing by opening an Xterm, running for i in {1..500}; do dmesg; done and then letting the Xterm sit for a second or two. With the OpenGL paint boxes (and the refresh logs running) it becomes obvious to me that it's the autorefresh that's painting as lossy.

So before I walk through the revisions you listed, try a few other things, and then get some logs / Xpra info, do you want to create a new ticket or is that related enough to this one to continue here? (In case you're still awake, but I think that's highly unlikely)


Tue, 12 Dec 2017 22:29:46 GMT - J. Max Mena:

I've tried all the revisions you listed and more testing is inconclusive. I can't help but wonder if maybe it's the new Firefox update that's triggering what I'm seeing much more.


Wed, 13 Dec 2017 17:26:27 GMT - Antoine Martin:

I've tried all the revisions you listed and more testing is inconclusive. I can't help but wonder if maybe it's the new Firefox update that's triggering what I'm seeing much more.

The first step, as always and as noted in comment:27, is xpra info, and also -d refresh,regionrefresh. At least then we'll know what the refresh delay is, quality and so on.

It sounds to me like the heavy scrolling is triggering a video region covering most of the window; that's expected - that's how we trigger scrolling detection. What should be happening is that the auto-refresh should be kicking in quickly, and maybe it isn't. If there are minor screen updates within the video region, this can delay things further. Maybe we can detect those and refresh before timing out the video region.


Wed, 13 Dec 2017 18:18:57 GMT - J. Max Mena: owner changed

One quick note before I pass this back:

The Automated Test Box upgrade was successful. Unfortunately, it hasn't been properly running the tests for a couple of weeks, due to Fedora 25 going EOL and the box losing access to the 1.X packages in the Beta repo - I had a script that would curl the repo list, parse it for the latest 1.X, remove Xpra, install that latest 1.X, and later upgrade to the latest 2.X before the next round of testing. Anyways, now that it's up to Fedora 26, it should continue with the tests as you originally asked in comment:23.

Also - you mention bandwidth-constrained tests: I have some test runs with tc enabled, but they add jitter and packet loss - I'll have to add a test case or two (maybe three?) that adds some hard bandwidth limits as well, to see how that impacts us.

Anyways,


I got an xpra info for you from when it gets blurry, before the auto-refresh kicks in several seconds later. I also managed to get some -d refresh,regionrefresh logs of my Xterm test case, where I got it to stick blurry for quite some time before a timed autorefresh finally kicked in. I'll attach them shortly.


Wed, 13 Dec 2017 18:19:27 GMT - J. Max Mena: attachment set

Xpra info as requested


Wed, 13 Dec 2017 18:19:35 GMT - J. Max Mena: attachment set


Wed, 13 Dec 2017 19:01:34 GMT - Antoine Martin: owner changed

So, looking at these numbers, some windows have a high minimum refresh delay:

window.1.encodings.auto-refresh.min-delay=730
window.2.encodings.auto-refresh.min-delay=150
window.3.encodings.auto-refresh.min-delay=730

And that seems to be caused by:

update_refresh_attributes() sizef=0.40, qf=1.20, sf=0.96, cf=1.00, \
    batch delay=92, bandwidth-limit=56659, min-delay=17649, max-delay=1000, delay=17649

So, somehow the heuristics end up thinking that you only have ~56Kbps of bandwidth available. And somehow I doubt you're using a 56K modem. What's odd is that the xpra info you've attached does not match this:

client.bandwidth-limit.actual=2074963
client.bandwidth-limit.setting=800000000
window.1.bandwidth-limit=213268
window.2.bandwidth-limit=213268
window.3.bandwidth-limit=213268
window.33.bandwidth-limit=1435158

Was the "xpra info" captured at the same time as the log sample? Congestion events that trigger bandwidth-limit calculations can be seen with "-d stats". This is what this ticket is about, if we get that wrong... we'll be driving the heuristics using bad data.

r17622 makes it possible to turn off "bandwidth-limit" detection with XPRA_BANDWIDTH_DETECTION=0 xpra start ..., so at least you can use it to figure out if this is what is causing the slow refresh.


Wed, 13 Dec 2017 19:12:52 GMT - J. Max Mena:

Was the "xpra info" captured at the same time as the log sample?

No, they were a few minutes apart - I can try again and see if I can capture them at the same time.

So, somehow the heuristics end up thinking that you only have ~56Kbps of bandwidth available. And somehow I doubt you're using a 56K modem.

That would explain what I'm seeing quite a bit.

r17622 makes it possible to turn off "bandwidth-limit" detection with XPRA_BANDWIDTH_DETECTION=0 xpra start .., so at least you can use it to figure out if this is what is causing the slow refresh.

I'll play around with this, but I won't be back for a couple hours since I have to go take a final (woo "Project Management" class maybe I'll be eligible for a management future).

In the meantime I have another update for you:

I added a few bandwidth limited (via TC) test cases to the automated tests - with 50/25/8 mbit bandwidth limits added to the loopback device.
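(For reference, a hard rate cap on the loopback device can be simulated with something like the following - illustrative values only, not necessarily what the test harness uses:)

tc qdisc add dev lo root tbf rate 8mbit burst 32kbit latency 400ms
tc qdisc del dev lo root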


Thu, 14 Dec 2017 17:30:13 GMT - J. Max Mena: owner changed

Running with XPRA_BANDWIDTH_DETECTION=0 makes the problem completely go away - it's like night and day.

I re-ran the Xterm test I mentioned in comment:28 with -d stats running - I'll attach that.


Thu, 14 Dec 2017 17:42:40 GMT - Antoine Martin: owner changed

I still see no xpra info attached, and the log seems to be for a completely different window setup, just two windows instead of 4 this time? xterms? What are they?

Just to be clear, I don't have divining powers.


Thu, 14 Dec 2017 18:31:46 GMT - J. Max Mena: owner changed

Some more explanation:

I simplified my setup to just two Xterms - in one I had xpra info :13 ready to go, and in the other I ran for i in {1..500}; do dmesg; done because that reliably triggers the blurry state. What I did was run the dmesg Xterm and, while it was still sitting at blurry, run the xpra info. Much easier than firing up Firefox and playing around with it until it starts getting blurry.

Apologies for not including that in the first place. I'll attach the new log and xpra info.

But I also found that when running with XPRA_BANDWIDTH_DETECTION=0, the sound sync starts to go way off, to the point where the sound gets at least 500ms behind the video after the session has been running for a while - it starts out okay, but after a few minutes the sound falls further and further behind. When running with XPRA_BANDWIDTH_DETECTION=1, sound sync is almost perfect.

EDIT: After the fact, I've found that starting a YouTube video in Firefox and pausing it for several minutes (5 is the minimum to make it most noticeable) before hitting play again is where the sound falling behind is most noticeable. The same thing applies to local video files, so you don't need to fire up Firefox / Google Chrome.


Thu, 14 Dec 2017 18:32:14 GMT - J. Max Mena: attachment set

-d stats while running a quick test in an Xterm (for i in {1..500}; do dmesg; done)


Thu, 14 Dec 2017 18:32:35 GMT - J. Max Mena: attachment set

Xpra info ran right after the xterm test (999dstats.log)


Thu, 14 Dec 2017 19:07:37 GMT - Antoine Martin:

So here's what I think is happening in this particular example: lots of very small screen updates (one character at a time: 12x13 pixels, less than 1KB) are followed by large full screen updates (499x316 or 480x312 pixels, 8KB minimum, climbing to 200KB). The large packets eventually trigger the network throttling and we detect that event (record_congestion_event in the logs). When we then calculate the bandwidth available, we ignore the last second or so to ensure that we don't count the packets that caused us to go over the limit, and so we only count the very small packets and end up calculating a very low bandwidth limit value. r17638 should mitigate that.
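A simplified sketch of that calculation and why it can backfire (hypothetical names, not the actual r17638 code):

import time

def estimate_bandwidth(send_records, ignore_last=1.0):
    """send_records: list of (timestamp, bytes_sent) tuples.
    The most recent `ignore_last` seconds are skipped so the packets that
    triggered the congestion event don't inflate the estimate - but if only
    tiny packets remain in the window, the estimate comes out far too low."""
    now = time.monotonic()
    counted = [(t, size) for t, size in send_records if t < now - ignore_last]
    if len(counted) < 2:
        return None                        # not enough data to estimate
    elapsed = counted[-1][0] - counted[0][0]
    if elapsed <= 0:
        return None
    total_bytes = sum(size for _, size in counted)
    return 8 * total_bytes / elapsed       # bits per second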

What I still don't understand is why your network setup starts pushing back after less than 150KB sent. I'm not seeing that with either MS Windows or Fedora connecting through a very cheap 100Mbps switch.

This example was also very different from the original browser test case from comment:26. Not sure at all how that one will fare. If problems persist, please make sure to attach -d stats,refresh,regionrefresh AND xpra info captured at the time when refresh problems occur.


Thu, 14 Dec 2017 22:55:01 GMT - J. Max Mena:

r17638 did it - it's much much better now.

What I still don't understand is why your network setup starts pushing back after less than 150KB sent. I'm not seeing that with either MS Windows or Fedora connecting through a very cheap 100Mbps switch.

The house I live in wasn't wired for Ethernet, so we had to hand-wire it in a... very creative manner. I have to go through two gigabit switches (and around the entire house, in fact) to get to the crappy AT&T Gateway in the other room, so there's that. That being said, I can consistently push way higher speeds than 56k - when doing a file transfer over the network it can easily push 150+ mbits.

But now that this has been sorted, sound sync has started to regress - I'm seeing it consistently behind by 300-500ms after a session has been running for a few minutes. Should I re-open #1638, or do you want to sort it out here?


Fri, 15 Dec 2017 05:03:38 GMT - Antoine Martin:

r17638 did it - it's much much better now.

Applied to v2.2 in r17645. What is the new detected bandwidth-limit value? Is the refresh performance as good as it was in 2.1?

sound sync has started to regress

Sound-sync is not directly related to this. Do you have an easily reproducible test case? Can you use it to bisect? Are you confident that this regression is caused by this ticket?

The new bandwidth handling code can slow down screen updates, and this should not be a problem for sound sync, which is usually behind the video anyway. (I assume you mean the audio is falling behind the video here, right?) The new code should prevent us from swamping the connection, which should allow the audio packets to flow more reliably and prevent the kinds of bursts that cause the queue levels to rise. We now also trigger video more quickly, which should help sound sync, not hinder it.

The only downside that I can think of would be if the video framerate slows down so much that we end up not using a video encoding, as only video is synced. You can verify that with the paint boxes.

I'm seeing it consistently behind by 300-500ms after a session has been running for a few minutes.

Was the buffer level high? What is the av-sync value in "xpra info"?

Should I re-open #1638 or do you want to sort it here?

Unless it is unequivocally proven that the bandwidth limit code updates from this ticket are causing this regression, use #1638. First, try turning off the cutter to see if that is the real problem.


Sat, 23 Dec 2017 13:03:04 GMT - Antoine Martin: owner changed


Mon, 08 Jan 2018 20:52:10 GMT - J. Max Mena: owner changed

Okay, back from vacation and now settling in to full time - this is the first ticket. I've run it with the cutter disabled, and that appears to be the culprit - I don't see a difference with and without the bandwidth detection enabled. So, I'll be closing this one (as it is not the cause) and re-opening #1638 in a moment.

But before I actually do close this one I have one last question - is there a way to override the auto-detection to set it to a value higher or lower than what it would detect? It would be useful for testing.


Tue, 09 Jan 2018 04:48:54 GMT - Antoine Martin: owner changed

is there a way to override the auto-detection to set it to a value higher or lower than what it would detect? It would be useful for testing.

As per comment:32:

r17622 makes it possible to turn off "bandwidth-limit" detection with XPRA_BANDWIDTH_DETECTION=0. ie:

XPRA_BANDWIDTH_DETECTION=0 xpra start --bandwidth-limit=10mbps

Tue, 09 Jan 2018 16:58:41 GMT - J. Max Mena: status changed; resolution set

Okay, that's what I figured - I just wanted to make sure that was the proper way.

Closing. (finally)


Wed, 24 Jan 2018 12:29:46 GMT - Antoine Martin:

Improved congestion detection in:


Thu, 15 Feb 2018 22:57:07 GMT - J. Max Mena:

I might have to re-open this one - I'm seeing a lot more low-bandwidth popups than I should be, especially when I switch from a video to a non-video tab in Chrome.

I can somewhat reliably trigger the low-bandwidth warning:

EDIT: I realize that a quality drop when switching desktops is to be expected, since we lower the video framerate in those situations - however, I do see the bandwidth warning quite a bit.


Fri, 16 Feb 2018 03:38:38 GMT - Antoine Martin: status changed; resolution deleted

Logs? xpra info? bandwidth configuration?


Fri, 16 Feb 2018 19:18:57 GMT - J. Max Mena:

Managed to get a rapid drop in quality, but didn't get the actual warning to pop up - I'll keep trying, but here are my initial logs.

Client and server are both r18446 Fedora 26 machines

The server and client were running with -d bandwidth,stats - I'll start a new session and try to get an Xpra Info


Fri, 16 Feb 2018 19:19:30 GMT - J. Max Mena: attachment set


Fri, 16 Feb 2018 21:28:17 GMT - J. Max Mena: attachment set

Got this immediately after lunch after I let the session sit for an hour or so


Fri, 16 Feb 2018 21:28:49 GMT - J. Max Mena: owner, status changed


Sat, 17 Feb 2018 13:30:46 GMT - Antoine Martin: owner changed

Please specify the bandwidth configuration. I assume this is on a LAN? 100Mbps? With latency under 5ms?

wrt "rapid drop in quality": Looking at the log file, minor congestion events (those can happen) seem to trigger a chain reaction: the latency tolerance goes down and we calculate a higher packets backlog (because we try to prevent flooding the pipe as soon as we think something might be going wrong), which triggers yet more soft congestion events.. So r18454 should handle this better, only triggering congestion events from the "soft expired" timer if those are new events.

Related changes:


Got this immediately after lunch after I let the session sit for an hour or so

I assume that you mean you got a bandwidth warning? Without logs or reproduction steps, there's nothing I can do about this one.


Tue, 20 Feb 2018 18:29:41 GMT - J. Max Mena: owner changed

For reference, my network is a gigabit network - I have three gigabit Netgear switches between my machine and the main Gateway (I have to chain between the office and my room to get to the Gateway, because the house isn't wired for Ethernet). It's a bit annoying, but the network can reliably push large amounts of traffic - to the point where, if I do a file transfer over rsync or FTP, I max out hard drive speeds.

Upped server and client to r18503:

I no longer get any warnings, but the bandwidth heuristics reliably drop down to a very low amount of bandwidth, which kills performance for me. I can reliably make this happen by starting a session with Chrome and an Xterm, pulling up a YouTube video, and just sitting and watching it for a few minutes - the framerate notably drops after a time and eventually becomes unwatchable. I'll retest this with VLC playing a local file to make it easier to repro. In the meantime (installing VLC as I write this), I'll attach logs from my last session - they start out with a bandwidth limit of 17 megabytes/s, which feels reasonable, but after a minute or so it drops this limit to 3 megabytes/s. I had to trim the log, as I let it run for several minutes while I made coffee and the file got really large - I have the untrimmed version if you want it. I trimmed it right at the point where it drops the limit to ~3 megabytes/s.

In another session I saw it drop down to under 1 megabyte/s while using FileZilla to SFTP a few video files over in a background virtual desktop (and it stayed stuck even after switching back):

2018-02-20 10:20:45,758 may_send_delayed() wid=  7 : bandwidth limit=311369, used=5259548 : 1689%
2018-02-20 10:20:45,809 may_send_delayed() wid=  7 : bandwidth limit=311369, used=5231673 : 1680%
2018-02-20 10:20:45,859 may_send_delayed() wid=  7 : bandwidth limit=311369, used=5231673 : 1680%
2018-02-20 10:20:45,890 packet decoding sequence 3632 for window 7: 1200x244 took 4.9ms
2018-02-20 10:20:45,890 record_latency: took    6.5 ms round trip,    6.4 for echo,    4.0 for decoding of   292800 pixels,    28568 bytes sent over the network in    1.6 ms,    1.5 ms for echo
2018-02-20 10:20:45,890 may_send_delayed() wid=  7 : bandwidth limit=311369, used=4823210 : 1549%
2018-02-20 10:20:45,918 packet decoding sequence 3633 for window 7: 1200x422 took 8.7ms
2018-02-20 10:20:45,919 record_latency: took   10.5 ms round trip,   10.4 for echo,    8.0 for decoding of   506400 pixels,    59366 bytes sent over the network in    1.8 ms,    1.6 ms for echo
2018-02-20 10:20:45,919 may_send_delayed() wid=  7 : bandwidth limit=311369, used=5187959 : 1666%
2018-02-20 10:20:45,970 may_send_delayed() wid=  7 : bandwidth limit=311369, used=5044613 : 1620%
2018-02-20 10:20:46,020 may_send_delayed() wid=  7 : bandwidth limit=311369, used=5041816 : 1619%

(I'm assuming the bandwidth limit= is what the heuristics are detecting)


Tue, 20 Feb 2018 18:30:16 GMT - J. Max Mena: attachment set


Tue, 20 Feb 2018 19:02:09 GMT - J. Max Mena:

UPDATE:

I get the exact same behavior with local video files. (I admit it took way too long to figure out I was missing the x264 and x264-libs packages.)

It starts out fine, but after a minute or two the bandwidth drops to the point where the movie quality notably degrades. It becomes very apparent if you have subtitles enabled.


Wed, 14 Mar 2018 12:05:23 GMT - Antoine Martin: owner changed

There are only 2 congestion events in the whole log sample:

Is this a virtual machine or something? The other packets all arrive much quicker and this one is small. We could (gradually) ignore the first few packets, just like we already ignore the first video frame, but I'm not sure this is warranted unless we see more of those during testing.

This is usually pretty reliable in bandwidth-constrained setups, but maybe not in this case. The log sample lasts 5 minutes: did the quality drop at the start? If not, when? Did it never recover?

Can you reproduce with "-d bandwidth,stats"? r18714 will only take "slow send" into account when we know that the bandwidth is already limited (default: less than 20Mbps) - if that helps, we'll need to make sure that we still detect congestion properly in bandwidth-constrained setups. Ideally, we should relax this rule and make sure that the quality goes back up instead. I will need logs for that.

Also related:


Thu, 05 Apr 2018 23:25:27 GMT - J. Max Mena: owner changed

Upped my server and client to r18990:

In this log snapshot, I connect, and almost immediately the level drops and I never see the bandwidth go above ~3 megabits. However, towards the end the level suddenly increases and everything gets much clearer - which is weird, but I'm glad I had the logs running to catch it. Hopefully it helps.

EDIT: In the log sample I'm attaching, I started a server with Firefox and an Xterm as start children - I connected, opened a couple of tabs, loaded up an HTML5 video in one, and switched between it and a couple of text sites a few times while the logs were running.

Before I give this back, I'd like to request that the XPRA_BANDWIDTH_DETECTION and XPRA_MIN_BANDWIDTH flags be turned into control commands and/or, at the very least, arguments that can be fed into an xpra start. I'm sure there are lots of people who would like to either turn the heuristics off or, at the very least, leave them on auto but set a higher minimum on the fly. I know that there are a few people here who would definitely be interested.


Thu, 05 Apr 2018 23:26:23 GMT - J. Max Mena: attachment set

started a session with an Xterm and Firefox, connected, and opened a few tabs and watched an HTML5 video


Fri, 06 Apr 2018 16:28:48 GMT - Antoine Martin: owner changed

... the limits almost immediately drop my bandwidth down to the minimum ...

Bugs squashed:

Please try again.

I'd like to request ...

Let's first iron out the bugs before deciding what should and should not be configurable, and where. Environment variables are suitable places for the more obscure configuration options; ideally, bandwidth detection can be made reliable enough that those options can remain hidden. FYI, they can already be fed to xpra start using:


Thu, 19 Apr 2018 18:11:43 GMT - J. Max Mena: owner changed

Okay I've finally been able to carve out enough time to address some of the more important tickets:

My system is currently sitting at r19020 for client and server, for reference.

As it is now, the bandwidth detection is working much better - I get a little push back upon initial connection, but it smooths out quite a bit after a few seconds, which is good.

I'm going to ask for a control command or start flag to disable it again, because I know that the first thing I'm going to be asked is for exactly that.


Sun, 22 Apr 2018 08:16:05 GMT - Antoine Martin: owner changed

I get a little push back upon initial connection, but it smooths out quite a bit after a few seconds, which is good.

What is "push back"? Do you have the log?


Thu, 03 May 2018 20:57:39 GMT - J. Max Mena: owner changed

The pushback I'm seeing is the heuristics withholding a lot of the initial bandwidth for the first couple of seconds. Upon initial connection, everything is blurry and slow to respond (as is typical during network hiccups), and bandwidth monitors show it consistently using <1-2 mbps - until a few seconds later.

I'll attach a log, but I'm not of the opinion that this behavior is entirely undesirable. It might not be a bad idea to put some small limits on the initial connection if the network can't handle that much traffic that instantaneously. I think we might be better served by slightly upping the min-bandwidth default. But I'll leave that decision to you.

For reference: I've upped my systems to r19184. However, I'm still running Fedora 26, I should carve out a day very soon to upgrade.


Thu, 03 May 2018 20:58:20 GMT - J. Max Mena: attachment set

Initial connection logs with -d bandwidth,stats showing the initial pushback described in the comments


Sun, 06 May 2018 17:31:06 GMT - Antoine Martin: owner changed

The pushback I'm seeing is the heuristics withhold a lot of the initial bandwidth for the first couple of seconds.

There are different heuristics that cause this.

(..) I think we might be better served by slightly upping the min-bandwidth default

For the initial quality and speed (default is 40), r19222 allows you to change those if you wish, ie: XPRA_INITIAL_QUALITY=80 XPRA_INITIAL_SPEED=80 xpra start ...

In your log there were 3 congestion events in the first second, and none after:

2018-05-03 13:52:36,151 Python/Gtk2 Linux Fedora 26 TwentySix x11 client version 2.3-r19184 64-bit
2018-05-03 13:52:36,913 record_congestion_event(late-ack for sequence      4: late by 143ms, target latency=105 ((1, 4, 0, 100)), 4940, 0)
2018-05-03 13:52:36,976 record_congestion_event(late-ack for sequence      2: late by  52ms, target latency=103 ((1, 2, 0, 100)), 5133, 0)
2018-05-03 13:52:37,012 record_congestion_event(late-ack for sequence      5: late by  43ms, target latency=105 ((1, 4, 0, 100)), 2940, 0)

So r19224 increases the ack tolerance for the first few seconds after the client connects, or after a window reset. This should prevent those relatively small screen updates from skewing the congestion detection.
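The idea can be sketched as an ack tolerance that starts high and decays back to normal after the connection (or a window reset) - illustrative names and values only, not the actual r19224 code:

import time

GRACE_PERIOD = 10.0        # seconds of extra tolerance after connect / reset
BASE_TOLERANCE_MS = 100    # nominal late-ack tolerance

def ack_tolerance_ms(start_time):
    """Allow late acks a larger margin right after the client connects,
    so the first few screen updates cannot skew the congestion detection."""
    elapsed = time.monotonic() - start_time
    if elapsed >= GRACE_PERIOD:
        return BASE_TOLERANCE_MS
    # decay linearly from 5x the base tolerance down to the base value
    factor = 1 + 4 * (1 - elapsed / GRACE_PERIOD)
    return BASE_TOLERANCE_MS * factor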

Related: r19223 also raises the minimum bandwidth we detect from just 1Mbps to 5Mbps - most users should have at least that nowadays, right? This one is unlikely to help with the initial bandwidth though: it kicks in when we detect bandwidth congestion, and when things are just starting we don't have enough data yet to really trigger it.


Thu, 10 May 2018 20:31:28 GMT - J. Max Mena: owner changed

Upped my server and client to r19268:

I've been playing around with this for the last couple of days (mostly using Xpra to write some documentation for something else entirely), and the changes you mentioned in comment:58 have wrapped up the last issue. As far as I'm concerned, this ticket is pretty much done. I'm gonna pass it back to you in case there's anything left you want to do.


Fri, 01 Jun 2018 11:43:11 GMT - Antoine Martin: status changed; resolution set

Found another wrapper tool which may be useful for simulating various network conditions: comcast


Sun, 08 Jul 2018 09:16:43 GMT - Antoine Martin:

Follow up: #1855, #1912, #2156, #2421, #2812


Sat, 23 Jan 2021 05:12:04 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/999