Xpra: Ticket #1781: Refresh bug in certain applications

I finally managed to reliably reproduce this - it's easiest in Firefox and Chrome on https://sfbay.craigslist.org/search/cta (a site for classifieds - this sub-site happens to be for cars, I've gotten it to trigger on the Motorcycles one as well).

My client and server are both Fedora 26 machines running trunk r18661 built from source.

I started my server with: XPRA_BANDWIDTH_DETECTION=0 xpra start :13 --bind-tcp=0.0.0.0: --start-new-commands=yes --start-child=xterm --start-child=google-chrome --start-via-proxy=no --systemd-run=no --no-daemon (running without the bandwidth limits makes it easier to detect) and attached with xpra attach tcp:ip.

Once attached I nagivated to the link I provided above. Just mousing over a few of the ads (and cycling through the photos) is enough to induce a state where the client is constantly repainting. Continued use of this site makes the paints more and more aggressive to the point where it becomes almost impossible to type into the Xterm.

I ran two different sessions - one to get the xpra info of both a fresh and a bugged state. And then a second one with -d refresh, compress enabled. On the second run I got into the bugged state almost instantly - I let it sit for at least 10-20 seconds before trying to stop the session with the Xterm. Hopefully that's enough.

Fri, 02 Mar 2018 17:33:04 GMT - J. Max Mena: attachment set

attachment set to 1781freshinfo.txt

Fresh xpra info

Fri, 02 Mar 2018 17:33:19 GMT - J. Max Mena: attachment set

attachment set to 1781buggedinfo.txt

After stuck in a constant refresh state

Fri, 02 Mar 2018 17:41:19 GMT - J. Max Mena: attachment set

attachment set to 1781refreshcompresstrimmed.log

uncompressed and trimmed log - the first few seconds, and the last few seconds (at the point where it was sitting and not being interacted with)

Fri, 02 Mar 2018 17:41:44 GMT - J. Max Mena: attachment set

attachment set to 1781log.tar

tarball of the entire log - tarred with -caf

Mon, 05 Mar 2018 16:14:32 GMT - Antoine Martin: owner changed

owner changed from Antoine Martin to J. Max Mena

Your log file is empty:

$ tar -xvf 1781log.tar
1781refreshcompress.log
$ ls -la 1781*
-rw-rw-r--. 1 antoine antoine 20480 Mar  3 12:40 1781log.tar
-rw-rw-r--. 1 antoine antoine 10240 Mar  3 00:38 1781refreshcompress.log
$ wc -l 1781refreshcompress.log
0 1781refreshcompress.log

In any case, never put a single file in a tar file. That never does anything useful except increasing the file size, tar is not a compressed format, it is archival format. Use xz instead.

As for the log sample attached, it is trimmed just around the time where things got interesting so I don't really know how things got into this state.

Tue, 06 Mar 2018 17:16:58 GMT - J. Max Mena: attachment set

attachment set to 1781log.tar.xz

(properly) compressed full log

Tue, 06 Mar 2018 17:21:02 GMT - J. Max Mena: owner changed

owner changed from J. Max Mena to Antoine Martin

Well I've learned something today:

The a flag on tar will auto use the correct compression based on the name, rather than magically auto-selecting the proper one based on what I assumed it was before

I've properly tarred (and verified that it's correct) the full log from start to finish.

Tue, 06 Mar 2018 17:27:57 GMT - Antoine Martin:

As per comment:1:

never put a single file in a tar file. That never does anything useful except increasing the file size, tar is not a compressed format, it is archival format. Use xz instead.

200	1781log.tar.xz
196	1781refreshcompress.log.xz

Sat, 10 Mar 2018 05:51:26 GMT - Antoine Martin: status changed

status changed from new to assigned

I believe this is caused by r17540, triggering a refresh loop. This bug is present in the 2.2.x branch, but I believe that the bandwidth detection changes in 2.3 (#999) make it easier to trigger.

FYI: you can generally avoid refresh related problems by turning off auto-refresh.

r18683 disables lossy refresh in the 2.2.x branch. I will find a better solution for 2.3.

Sat, 10 Mar 2018 09:27:24 GMT - Antoine Martin: owner, status changed

owner changed from Antoine Martin to J. Max Mena
status changed from assigned to new

My guess in comment:4 was based on log events like this one:

compress:   0.1ms for    2x1    pixels at 1126,965  for wid=2     using      webp with ratio 525.0%  (    0KB to     0KB), sequence  3608, client_options={'flush': 3, 'quality': 100, 'rgb_format': 'BGRX'}
compress:   0.3ms for    3x16   pixels at 1126,22   for wid=2     using      webp with ratio  21.9%  (    0KB to     0KB), sequence  1165, client_options={'flush': 3, 'quality': 100, 'rgb_format': 'BGRX'}

We shouldn't be using "webp" for such tiny screen updates.

But it turns out that this may not be fixed by r18683, r18685 is likely to be a more correct solution. (not sure this should be backported though).

It's not clear to me which part of the screen was refreshing unnecessarily. Can you specify the coordinates or at least take a screenshot with paint boxes turned on. And assuming that you can reproduce this reliably still, please try turning off features to see if they are involved in this, ie: video region detection, refresh delay, etc..

Tue, 13 Mar 2018 20:34:17 GMT - J. Max Mena: owner changed

owner changed from J. Max Mena to Antoine Martin

Upped client and server to r18711

I no longer see the excessive flickering that made it obvious what was going on. However, my session still bogs down to the point of being unusable in the same manner, and the logs still show a flood of Webp refreshes -d refresh (see below). Turning auto-refresh off xpra control :13 auto-refresh 0 makes the issue go away; alternatively detaching and re-attaching help, but the issue will crop up again eventually.

2018-03-13 13:17:53,688 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:53,688 auto refresh: rgb24 screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(474, 78, 637, 28), refresh regions=[])
2018-03-13 13:17:53,735 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:53,901 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,067 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,290 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,368 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,492 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 877), refresh regions=[])
2018-03-13 13:17:54,585 auto refresh:  h264 screen update (actual quality=  0, lossy= True), added pixels to refresh regions, scheduling refresh in 150ms (pct=0, batch=28) (region=rectangle(0, 73, 1128, 804), refresh regions=[])
2018-03-13 13:17:54,586 auto refresh: rgb24 screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 0, 1128, 73), refresh regions=[])

I also managed to get a traceback once or twice(something about video subregion), but can't seem to get it again for this comment.

Wed, 14 Mar 2018 00:20:18 GMT - Antoine Martin: owner changed

owner changed from Antoine Martin to J. Max Mena

Hmmm, looking at this again, maybe r18712 will fix this. This code is far from being new though, so I would have expected it to be reported sooner (every version going back to pre 1.0 releases would be affected, unless somehow a changeset like r17480 made it easier to trigger).

If you still hit the issue, please provide "-d damage,refresh,regionrefresh,subregion" with more than a few lines and maybe also with "-d encoding" added. And if it does fire similar large-ish webp refreshes, does the region being refreshed match xpra info | grep video_subregion?

The traceback could be critical, please always save and report ALL tracebacks.

Fri, 16 Mar 2018 17:13:19 GMT - J. Max Mena: owner changed

owner changed from J. Max Mena to Antoine Martin

Upped client and server to r18739:

I can still trigger this exact same state fairly reliably - by opening Chrome and navigating to [sfbay.craigslist.org] and navigating a few of the ads (cars and motorcycles in particular). Oddly enough it was far more difficult to trigger while the logs were running, at least in my case. It took a few tries yesterday afternoon, but I managed to get it fairly quickly this morning before the log file got way too large. Running with the OpenGL paint boxes makes it much easier to tell when you've gotten into the correct state - it's most apparent when you see a ton of small paint boxes along the edges of the windows. I'll attach the log you asked for.

NOTE: I switched from a browser and terminal start children to two start terminal children, and I used one for starting the browser and the other to stop the session quickly.

As you mentioned as well, I double checked to see if they matched from xpra info | grep video_subregion and....they kind of do.

From my Xpra info (different session than the logs I'll be attaching):

window.3.video_subregion.rectangle=(1, 1, 1127, 965)
window.3.video_subregion.score=99
window.3.video_subregion.scores.(1, 1, 1127, 965)=99
window.3.video_subregion.set-at=7354
window.3.video_subregion.supported=True
window.3.video_subregion.time=65861
window.3.video_subregion.video-mode=False
window.3.video_subregion.width=1127
window.3.video_subregion.x=1
window.3.video_subregion.y=1

and from the logs (same session as the one I got the info from):

2018-03-16 09:45:43,432 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(1, 106, 4, 856), refresh regions=[])
2018-03-16 09:45:43,470 remove_refresh_region(rectangle(161, 301, 6, 13)) updated refresh regions=[]
2018-03-16 09:45:43,470 auto refresh: rgb24 screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(161, 301, 6, 13), refresh regions=[])
2018-03-16 09:45:43,479 refresh_subregion([R(0, 3, 1, 960)])
2018-03-16 09:45:43,503 remove_refresh_region(rectangle(1, 1, 1126, 105)) updated refresh regions=[]
2018-03-16 09:45:43,504 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(1, 1, 1126, 105), refresh regions=[])
2018-03-16 09:45:43,505 remove_refresh_region(rectangle(1126, 1, 2, 383)) updated refresh regions=[]
2018-03-16 09:45:43,505 auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(1126, 1, 2, 383), refresh regions=[])
2018-03-16 09:45:43,600 refresh_subregion([R(0, 0, 1128, 3), R(1124, 3, 4, 959), R(0, 962, 1128, 4)])
2018-03-16 09:45:43,602 refresh_subregion([R(3, 0, 1122, 1)])
2018-03-16 09:45:43,602 remove_refresh_region(rectangle(0, 0, 1129, 967)) updated refresh regions=[]
2018-03-16 09:45:43,602 refresh_subregion([R(1, 1, 1126, 105), R(1, 106, 4, 856), R(1124, 106, 3, 856), R(1, 962, 1126, 3)])

Note the (1, 1, 1126, 105) from the logs and the (1, 1, 1127, 965) from the info. The start coordinates are the same, and the ending X-coordinates almost match, but the subregion from the info is much taller than what's getting auto-refreshed. That feels like it's too big of a coincidence, but I have a feeling you know a lot more about what's going on here.

Fri, 16 Mar 2018 17:14:19 GMT - J. Max Mena: attachment set

attachment set to 1781damagerefreshregionsubregion.log

Fri, 16 Mar 2018 17:51:26 GMT - Antoine Martin: owner changed

owner changed from Antoine Martin to J. Max Mena

Note the (1, 1, 1126, 105) from the logs and the (1, 1, 1127, 965) from the info.

That's very likely to be because the browser paints the video area using horizontal bands. This data from the log file uses a different geometry but the idea is the same (here with 58 pixel high bands):

damage(5, 48, 1119, 58, {})    wid=3, scheduling batching expiry for sequence 1809 in 34.0 ms
damage(5, 106, 1119, 58, {})   wid=3, using existing delayed {} regions created 0.0ms ago
damage(5, 164, 1119, 58, {})   wid=3, using existing delayed {} regions created 0.0ms ago
...
damage(5, 860, 1119, 58, {})   wid=3, using existing delayed {} regions created 0.0ms ago
damage(5, 918, 1119, 44, {})   wid=3, using existing delayed {} regions created 0.0ms ago

We then run some clever code to identify the "real" video region from that. (usually by aggregating the horizontal bands together) We also round down the dimensions (1127 becomes 1126) because video encoders often use colourspace subsampling, which requires even dimensions to operate.

This log file looks like it would be useful, but I'm not sure where to look (~25000 lines in there): did the problem appear at the start? If not, when? Do you have the corresponding "paint boxes" screenshot so I can try to reconcile? Or at least the "xpra info"?

Fri, 16 Mar 2018 17:53:01 GMT - J. Max Mena:

I managed to get it into the stuck state in the last few seconds of the log - when the flood of webp paints start is usually when it's in the stuck state.

I'll try again and post a new log with screenshots and an xpra info.

Fri, 16 Mar 2018 18:19:14 GMT - J. Max Mena: owner changed

owner changed from J. Max Mena to Antoine Martin

Okay got myself stuck into the state really quick with the logs running, so I'll attach them and a screenshot and an Xpra info.

Fri, 16 Mar 2018 18:19:45 GMT - J. Max Mena: attachment set

attachment set to 1781loground2.log

Fri, 16 Mar 2018 18:19:58 GMT - J. Max Mena: attachment set

attachment set to 1781loginfo.txt

Fri, 16 Mar 2018 18:20:44 GMT - J. Max Mena: attachment set

attachment set to 1781 paint bug.png

Stuck paint state - note the boxes along the edge

Sat, 17 Mar 2018 17:27:51 GMT - Antoine Martin: owner changed

owner changed from Antoine Martin to J. Max Mena

From the log, we see refreshes following the last add_video_refresh on line 7802:

refresh_subregion([R(1124, 3, 4, 184), R(1124, 207, 4, 25), R(1124, 251, 4, 694), R(1124, 958, 4, 3), R(1124, 961, 4, 1), R(0, 962, 1128, 1)])

Which would match the right and bottom edges of the window where the black paint boxes are on the screenshot: window.3.size=(1129, 967). Just noticed that webp was missing from the default paint colours definition, which is why it showed up as black rather than pink, this is fixed in r18745.

Then each one of those regions gets refreshed, ie:

auto refresh:  webp screen update (actual quality=100, lossy=False), nothing to do (region=rectangle(0, 962, 1128, 1), refresh regions=[])

Yet, after about 130ms, we're back to refreshing the same list again. Without seeing any damage(..) events in between.

That's because the video area shrank and we then request the areas that are no longer managed as video to get refreshed... and the method fires repeatedly instead of just once. The minimal fix for this is r18747. This bug affected all versions that have video region support. (some recent optimizations may have made it easier to hit)

r18748 is a more complete code update in this area and r18749 is another related minor improvement.

Wed, 28 Mar 2018 15:57:46 GMT - J. Max Mena: status changed; resolution set

status changed from new to closed
resolution set to fixed

Meant to close this one but somehow I never got around to it. I'll just blame that on being retasked.

Anyways, post r18749, I no longer see the flickering or excessive webp refreshes. I do see some fairly consistent small updates after a short time in some cases, but I can't rule out odd Firefox or Chrome painting behavior as being a culprit, and it doesn't seem to have any notable impact on my session like I was seeing before.

Closing

Sat, 23 Jan 2021 05:33:37 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/1781