Xpra: Ticket #258: ghost windows

I can reproduce it in kmail by moving the mouse cursor in a circle over the list of emails long enough. Previously this also happened with "gtkperf -a".

Related info:


What may be happening (since this problem has been reported a while back and is still not fixed - this may well be wrong):

What to do next:

Wed, 13 Feb 2013 14:23:33 GMT - onlyjob: cc set

Probably this is a regression since it is not reproducible in 0.7.8

Wed, 13 Feb 2013 15:45:59 GMT - onlyjob:

I was looking for application where we could reproduce this issue easier and found filelight where I see a problem that is not exactly the same but might be related.

In filelight when I move mouse over the folders there is a semi-transparent rectangle hover-tooltip with folder name, size and number of files.

If filelight is running within Xpra session there are number of problems with popup tooltips:

When those popups are displayed the following appears in server log:

2013-02-14 02:40:57,084 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:57,210 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:57,683 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:58,093 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628
2013-02-14 02:40:58,121 not found transient_for=<gtk.gdk.Window object at 0x7ffca2bb2d70 (GdkWindow at 0x7ffca2ba46c0)>, xid=77594628

Just before those tooltips disappear completely the following appears in log (and the above log records are stop appearing):

2013-02-14 02:41:03,811 the window <OverrideRedirectWindowModel object at 0x1acec30 (wimpiggy+window+OverrideRedirectWindowModel at 0x3bc4b40)> is not composited!?
2013-02-14 02:41:03,966 the window <OverrideRedirectWindowModel object at 0x1acec30 (wimpiggy+window+OverrideRedirectWindowModel at 0x3bc4b40)> is not composited!?
2013-02-14 02:41:06,098 the window <OverrideRedirectWindowModel object at 0x1acec30 (wimpiggy+window+OverrideRedirectWindowModel at 0x3bc4b40)> is not composited!?

Wed, 13 Feb 2013 22:37:48 GMT - onlyjob:

I tried to reproduce by connecting to 0.8.3 from client 0.7.8: instead of expected "sticky" indestructible popup window on client side the server-side Xpra crashed:

found large packet (7874 bytes): new-window, argument types:[<type 'int'>, <type 'int'>, <type 'int'>, <type 'int'>, <type 'int'>, <type 'dict'>, <type
'dict'>], sizes: [1, 1, 2, 4, 3, 22146, 2], packet head=['new-window', 1, 0, 63, 1280, 737, {'size-constraints': {'minimum-size': (307, 293)}, 'window-type': ['_NET_WM_WINDOW_T
YPE_NORMAL'], 'modal': False, 'title': 'XXXXXX@gmail.com/@FSF \xe2\x80\x93 KMail', 'class-instance': ['kmail', 'Kmail'], 'client-machine': 'debstor', 'pid': 12610, 'icon': ....

Wed, 13 Feb 2013 22:46:23 GMT - onlyjob:

Unreproducible with Xpra-0.8.4 on client side and 0.7.8 on server-side.

Thu, 14 Feb 2013 02:10:50 GMT - onlyjob: priority changed

Further testing revealed that it is possible to crash Xpra-server 0.8.4 from client 0.8.4 by circular mouse motion over list of emails in kmail.

r2281 is related to the problem. After hours of trying miscellaneous patches to partially undo r2281 I ended up with "ghost-popup.patch" (attached):

--- a/wimpiggy/composite.py
+++ b/wimpiggy/composite.py
@@ -69,8 +69,9 @@
             trap.swallow_synced(xcomposite_unredirect_window, self._window)
         if self._damage_handle:
             trap.swallow_synced(xdamage_stop, self._window, self._damage_handle)
             self._damage_handle = None
+            self._contents_handle = None        # needed to avoid crash
         self._window = None
     def acknowledge_changes(self):
         if self._damage_handle is not None and self._window is not None:

I have no idea how it work, but it eliminate server-side crashes with client-0.8.4 and reduces probability of indestructible ghost-popups in kmail. Unfortunately even with this patch vigorous testing still reproduces ghost-popups even though it is harder. Also patch does not help to avoid server-side crash (0.8.4) with client 0.7.8.

I'm raising this ticket's priority as server-side crashes is a serious issue.

Thu, 14 Feb 2013 02:11:47 GMT - onlyjob: attachment set

Thu, 14 Feb 2013 02:14:11 GMT - onlyjob:

Apparently patch have no effect on filelight behaviour.

Thu, 14 Feb 2013 02:54:19 GMT - Antoine Martin:

Hang on, I thought the server-side crashes from comment:3 were with a 0.7.8 server? Or did you apply the patch in comment:5 on top of 0.7.x? I really don't see how it would make any difference going from 0.8.3 to 0.8.4 - sounds like you just got (un)lucky there. It looks like a separate issue to me, one for a new ticket. How does this crash manifest itself? Anything in the logs? The sample in comment:3 is just a warning message (it shouldn't be there - but still), not a fatal error. Does it reject the client connection or actually crash the server? Do you have more complete log messages, or even better, a gdb backtrace?

As for the composite change, it's an odd one: invalidate_pixmap (which does clear the self._contents_handle) is already called from destroy and this is all done from the UI thread, so I don't see how it could be anything but None already... That is, unless:

The only way this would help is if we use window.get_property("client-contents") afterwards, and the only place we do this is from the UI thread (in window_source.process_damage_region), so by that point, the window.do_unmanaged function should have finished (since it also runs from the UI thread) and the reference to the composite window should be gone...

If anything it should be a call to self._cleanup_listening(), and indented to run in all cases (or maybe even keep both invalidate_pixmap calls since they're cheap):

--- src/wimpiggy/composite.py	(revision 2715)
+++ src/wimpiggy/composite.py	(working copy)
@@ -64,12 +64,12 @@
             log.warn("composite window %s already destroyed!", self)
         remove_event_receiver(self._window, self)
-        self.invalidate_pixmap()
         if not self._already_composited:
             trap.swallow_synced(xcomposite_unredirect_window, self._window)
         if self._damage_handle:
             trap.swallow_synced(xdamage_stop, self._window, self._damage_handle)
             self._damage_handle = None
+        self.invalidate_pixmap()
         self._window = None
     def acknowledge_changes(self):

I would also consider adding this to do_get_property_contents_handle to prevent trying to get the pixels once we've destroyed the reference to the window (although, like I said above, this should never happen):

    def do_get_property_contents_handle(self, name):
        if self._window is None:
            self._contents_handle = None
            return None

Thu, 14 Feb 2013 03:18:19 GMT - onlyjob:

No, Server was always 0.8.4 (except for comment where I didn't upgrade it from 0.8.3 yet). Why would I test old server?

I don't know python and I don't understand the code so I have no idea how it work. However test results are reliable.

As for crash, the comment 3 already have fragment from server's log. How does it manifest? I move mouse over list of emails in kmail for a little while and server dies (doesn't respond to "attach" until "xpra upgrade"). It crashes all the time if client is 0.7.8. When client is 0.8.4 it doesn't crash when my patch is applied.

Unfortunately I can't provide GDB backtrace for this (yet). I'll try to think how can it can be done. My email (and therefore xpra server) is on the machine where I can't install *-dbg packages. I don't know how to reproduce in other applications but kmail. I need to build a dedicated test environment for this...

It looks like ghost-window make server very fragile so the crash itself perhaps is a standalone issue but somehow they are related...

Thu, 14 Feb 2013 03:41:14 GMT - onlyjob:

Last record in server log after crash is this:

The program 'xpra' received an X Window System error.
This probably reflects a bug in the program.
The error was 'BadWindow (invalid Window parameter)'.
  (Details: serial 301725 error_code 3 request_code 12 minor_code 0)
  (Note to programmers: normally, X errors are reported asynchronously;
   that is, you will receive the error a while after causing it.
   To debug your program, run it with the --sync command line
   option to change this behavior. You can then get a meaningful
   backtrace from your debugger if you break on the gdk_x_error() function.)

Thu, 14 Feb 2013 16:07:47 GMT - Antoine Martin:

My initial confusion was over what caused the crash, the log sample you had provided only showed the hello packet so I assumed this was a crash on connection - now we got that cleared!

Forgive me if I am being a bit slow, but are you saying that if the client is 0.7.8 then the patch does not prevent the crash? That would be quite odd, and the sign of a race.

A gdb backtrace would really help, even one without the debug symbols might give us a clue (better than nothing I guess).

Fri, 15 Feb 2013 08:04:25 GMT - Antoine Martin:

r2735 does something similar to your patch, but without being racy (I think), it also fixes an important bug (which should be backported to v0.7.x too)

This should improve things, though from what you're saying this may not solve the ghost window problem and/or crashes. Please let me know if this is a step in the right direction at least.. (and any kind of gdb bt would be awesome)

Fri, 15 Feb 2013 08:06:17 GMT - onlyjob: attachment set

gdk_x_error fired upon potentially problematic popup in kmail

Fri, 15 Feb 2013 08:07:45 GMT - onlyjob:

Please forgive me for providing incomplete information.

It is correct that patch did not prevent crash with client-0.7.8 connecting to server-0.8.4. However I think it could be that patch had little or no effect to stability of 0.8.4... because I don't have automated test the crash is hard to reproduce (it takes time) so it is possible that I just didn't try hard enough... Sometimes it takes longer to reproduce so I'm not convinced that patch actually fixes the problem.

Today I tried to get backtrace in "controlled environment" with *-gdb packages. The problem is that there is no backtrace on crash (application exit before gdb can get anything?) so I tried to set breakpoint to gdk_x_error but it fires too often. For what's it worth I'm attaching backtrace I've taken from interrupt on gdk_x_error that fired on popup, but it is not a crash backtrace.

I tried number of builds going down to r2276 and in all of them I can reproduce indestructible popups. Either regression was introduced earlier or maybe we just can't track it to one particular commit?

Fri, 15 Feb 2013 08:46:13 GMT - onlyjob:

By the way after crash Xpra cannot start on the same screen until I manually terminate process /usr/bin/Xorg-for-Xpra-:13... Can you detect this?

I tried xpra-0.8.4+r2735 but it crashed pretty quick so it probably didn't help. Again I couldn't get backtrace from crash (please advise); Attaching kmail popup's backtrace from breakpoint on gdk_x_error (I skipped number of breaks to let application start).

Fri, 15 Feb 2013 08:47:25 GMT - onlyjob: attachment set

break on gdk_x_error fired from potentially problematic popup in kmail

Fri, 15 Feb 2013 08:48:41 GMT - onlyjob: attachment set

break on gdk_x_error fired from potentially problematic popup in kmail

Fri, 15 Feb 2013 08:50:24 GMT - onlyjob:

Attachments xpra-0.8.4+r2275_backtrace.txt.xz​ and xpra-0.8.4+r2735_backtrace.txt.xz​ are same, sorry for typo in rev. number.

Fri, 15 Feb 2013 09:25:08 GMT - Antoine Martin:

I've looked at both stacktraces and although there are X errors there, these should not be fatal (xcomposite_unredirect_window can fail if the window is already destroyed I guess, we catch them using a trap.swallow call)

This probably isn't what is causing the server crash. So.. we need to find the real cause, and not having gdb to help complicates things - I don't understand why it wouldn't catch the crash. Can you rule out encoding issues by testing with png/rgb24 only? There was another report of an x264 crash today (#261) which makes me quite suspicious of the libav upgrade and the patching it requires to build in some cases.

Fri, 15 Feb 2013 12:06:01 GMT - onlyjob:

I'm not experienced with GDB so perhaps I don't know how to catch a crash... Your ideas are welcome... I'm following the procedure we discussed previously when we were troubleshooting another crash: gdb attach, set breakpoint etc.

I ruled out the encodings. I can reproduce at least with 3 of them: x264, png, rgb24; as well as with local mmap connection.

I've seen message in mail list and your answer. As you know I'm not testing your "official" packages. I'm very confident regarding our (Debian) packaging, libav and encodings. We just have a genuine but little understood problem...

Fri, 15 Feb 2013 12:17:07 GMT - Antoine Martin: status changed; owner set

The gdb stuff is now documented here, you should use gdb python followed by attach $PID (with breakpoints as needed). Though the backtraces you provided looked good enough already.

You also said you could not start xpra again ("xpra upgrade" I assume) against the display after it has crashed? That would imply that the X11 server is left in an unusable state - which is a relatively hard thing to do. Do you have logs (-d all) for this case? Can you start an application against that display still? (ie: "DISPLAY=:NN xlsclients")

Maybe we're doing something totally illegal/buggy, causing the X11 server to crash. It would be worth trying other distros/versions to see if the problem is also present there (may help in narrowing things down).

Fri, 15 Feb 2013 12:55:35 GMT - onlyjob:

Yep, those GDB instructions is pretty much how I've taken those dumps.

Definitely xpra upgrade couldn't continue, but X was alive and well -- I could run apps or even xpra shadow and see that apps are still there working. Yes I could start new applications on that DISPLAY.

I'm attaching new xpra.log taken with "-d all". It is a complete log from start to crash where I reproduce the crash in less than a minute on 0.8.4+r2735.

Fri, 15 Feb 2013 12:56:12 GMT - onlyjob: attachment set

Fri, 15 Feb 2013 12:59:00 GMT - onlyjob:

The following was printed to terminal when Xpra crashed:

X Error: BadWindow (invalid Window parameter) 3
  Major opcode: 10 (X_UnmapWindow)
  Resource id:  0x800009
X Error: RenderBadPicture (invalid Picture parameter) 173
  Extension:    149 (RENDER)
  Minor opcode: 7 (RenderFreePicture)
  Resource id:  0x44
X Error: BadWindow (invalid Window parameter) 3
  Major opcode: 18 (X_ChangeProperty)
  Resource id:  0x800009
X Error: BadWindow (invalid Window parameter) 3
  Major opcode: 4 (X_DestroyWindow)
  Resource id:  0x800009

Fri, 15 Feb 2013 13:35:05 GMT - Antoine Martin:

Good idea to use "xpra shadow" or "xpra screenshot" to verify.

0x800009 is 8388617 in decimal. Now if we grep for this we find:

dock_tray(8388617) window=<gtk.gdk.Window object at 0x25c6460 (GdkWindow at 0x252f480)>, \
    geometry=(0, 0, 2560, 1280, 24), visual.depth=24
dock_tray(8388617) setting tray properties
dock_tray(8388617) resizing and reparenting
dock_tray(8388617) new tray container window 4194356
do_wimpiggy_child_map_event(<AdHocStruct object, contents: \
   {'delivered_to': <gtk.gdk.Window object at 0x241c6e0 (GdkWindow at 0x2008360)>, \
    'send_event': 0, 'override_redirect': 1, \
    'window': <gtk.gdk.Window object at 0x25c6500 (GdkWindow at 0x252f5a0)>, \
    'serial': 983L, 'type': 19, \
    'display': <gtk.gdk.Display object at 0x2189eb0 (GdkDisplayX11 at 0x2278230)>}>)
2013-02-15 23:44:02,455 Discovered new override-redirect window: 4194356 (tray=8388617)

So this looks like something is complaining about the tray window disappearing (which happens when xpra crashes or is upgraded - this needs to be moved to a separate process, but that's for another ticket). I don't think the system tray has anything to do with this bug, but it's worth trying the server with "-no-system-tray" just to be sure.

Now, the log also shows that before the actual crash, all we have are "will process ui packet pointer-position". Now it could just be that this is all that was happening at the time, or that the UI thread is already dead at this point. Either way, we're none the wiser. Time for the big guns, if you set (as per wiki/Debugging):

and maybe also:

Before starting xpra, the server will produce a very verbose logfile, but we are only interested in what happens just before the crash (say, the last second or so). Hopefully there will be some insight in there, because at the moment I just don't know.

Sat, 16 Feb 2013 03:50:40 GMT - onlyjob: attachment set

XPRA_X11_LOG=1 XPRA_X11_DEBUG=1 xpra start :13 -d all --no-system-tray

Sat, 16 Feb 2013 03:52:40 GMT - onlyjob:

Thank you for debbugging hints and fantastic interpretation of existing data. I attached tail of crash log with last ~4 seconds. I hope it makes sense to you.

Sat, 16 Feb 2013 04:33:25 GMT - Antoine Martin:

Hmmm, no definitive answer in the logs - though there are some clues: cursors and damage are the events that stand out. For cursors, see the patch below, for damage I will have to review by hand...

Does this help at all? (I don't see why the call would trigger window errors since windows are involved directly with cursors... but you never know):

--- src/xpra/server.py	(revision 2737)
+++ src/xpra/server.py	(working copy)
@@ -209,7 +209,7 @@
         self.send_cursor_pending = False
         self.cursor_data = None
         def get_default_cursor():
-            self.default_cursor_data = get_cursor_image()
+            self.default_cursor_data = trap.call_synced(get_cursor_image)
             log("get_default_cursor=%s", self.default_cursor_data)
@@ -387,7 +387,7 @@
     def send_cursor(self):
         self.send_cursor_pending = False
-        self.cursor_data = get_cursor_image()
+        self.cursor_data = trap.call_synced(get_cursor_image)
         if self.cursor_data:
             pixels = self.cursor_data[7]
             if self.default_cursor_data and pixels==self.default_cursor_data[7]:

(afterthought: an easier way to test if this is the cause of the problem is to disable cursors with "--no-cursors")

If it does not, and maybe even if it does and if you have time, can you try with the modified error.py as per Debugging X11 errors? (setting XPRA_LOG_ALL = True and XPRA_TRACE_ALL = True)

Sat, 16 Feb 2013 09:50:03 GMT - onlyjob:

Reproduced crash with "--no-cursors". Will get more detailed log soon.

Sat, 16 Feb 2013 10:34:29 GMT - onlyjob:

I can't reproduce neither ghost-window nor crash with modified "error.py"... Looks like synchronous mode is preventing the problem.

Please advise if there is anything I could do to troubleshoot this...

Also I've noticed that I was able to recover from crash (prior to applying debugging error.py) using xpra upgrade... Could it be effect of r2735 that I have applied?

Sat, 16 Feb 2013 10:51:41 GMT - Antoine Martin:

Well, that's excellent news at last. It means the problem was caused by #224.

That makes it very easy to spot where I've mistakenly used unsynced calls as there are only a few left outside areas which don't need them (in window.py those are followed by synced calls before returning and therefore safe):

I think this one is OK as I can't see how it would fail - but then again the X11 API might well be different from the GTK API.

Please replace trap.swallow_unsynced by trap.swallow_synced and see if the problem is solved. Fingers crossed, and thank you very much for your time!

Sat, 16 Feb 2013 12:25:58 GMT - Antoine Martin:

The clipboard one was ok actually (though I've cleaned it up in r2738)

Even before getting any feedback, I am convinced that we should use a synced call for raise_and_move (well, at least the raise portion which requires the window to still exist), so r2739 now uses synced X11 calls.

Does this help?

Sat, 16 Feb 2013 23:43:53 GMT - onlyjob:

Yes it is OK now. :) :) r2739 is totally fixed it: I can't reproduce neither indestructible "ghost" popups nor crashes. Fantastic, I'm so relieved that this testing marathon is over. :) Finally the issue has gone, hopefully forever. :)

I also tested with xpra-0.7.8 as client and couldn't crash latest Xpra-server any more. Great job.

Issues related to filelight hasn't changed so they are unrelated: I've made a new ticket #262.

As far as I'm concerned this ticket can be closed. Does it qualify for 0.8.5 release? ;)

Sun, 17 Feb 2013 01:58:04 GMT - Antoine Martin: status changed; resolution set

Definitely qualifies for 0.8.5, will try to push this today.

Thanks again for your time.

Sat, 23 Jan 2021 04:49:49 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/258