Xpra: Ticket #257: can't reattach after losing connection

When using windows client 0.8.2 after losing connection to the network I can't connect to the running xpra session again

2013-02-12 00:47:50,665 New connection received: SocketConnection(('1.2.3.100', 1200) - ('1.2.3.4', 61569))
2013-02-12 00:47:51,076 Handshake complete; enabling connection
2013-02-12 00:47:51,078 Python/Gtk2 win32 client version 0.8.2 connected from 'somecomputer'
2013-02-12 00:47:51,078 using x264 as primary encoding
2013-02-12 00:47:51,078 max client resolution is 1366x768 (from [[1366, 768]]), current server resolution is 1440x900
2013-02-12 00:47:51,079 best resolution matching 1366x768 is unchanged: 1440x900
2013-02-12 00:47:51,080 setting key repeat rate from client: 500ms delay / 33ms interval
2013-02-12 00:47:51,082 keyboard mapping already configured (skipped)
2013-02-12 00:47:51,236 using sound codec mp3
2013-02-12 00:47:51,315 starting sound using pulseaudio device null.monitor
2013-02-12 00:47:51,595 codec: MPEG-1 Layer 3 (MP3)
2013-02-12 00:49:03,835 delayed_region_timeout: sending now - something is wrong!
2013-02-12 00:49:18,901 delayed_region_timeout: sending now - something is wrong!
2013-02-12 00:49:33,960 delayed_region_timeout: sending now - something is wrong!
2013-02-12 00:49:49,005 delayed_region_timeout: sending now - something is wrong!
2013-02-12 00:49:49,072 get_backlog found some damage acks that have been pending for too long, expiring them: [600, 601, 602, 603]
2013-02-12 00:50:04,058 delayed_region_timeout: sending now - something is wrong!
2013-02-12 00:50:04,123 get_backlog found some damage acks that have been pending for too long, expiring them: [604]
2013-02-12 00:50:11,085 New connection received: SocketConnection(('1.2.3.100', 1200) - ('1.2.3.4', 61598))
2013-02-12 00:50:11,458 Handshake complete; enabling connection
2013-02-12 00:50:11,459 Disconnecting existing client Protocol(SocketConnection(('1.2.3.100', 1200) - ('1.2.3.4', 61569))), reason is: new valid connection received, this session does not allow sharing
2013-02-12 00:50:31,370 unknown or invalid packet type: ping
2013-02-12 00:50:31,371 connection closed after 1 packets received (3382 bytes) and 0 packets sent (0 bytes)


Tue, 12 Feb 2013 09:15:14 GMT - Antoine Martin: status changed; owner set

Need more details please, as per wiki/ReportingBugs: full command lines, etc Which connection is the "good" one, etc.


Tue, 12 Feb 2013 09:23:34 GMT - Smo:

Server OS is Fedora 18 x86_64 Client OS is Windows 7 Ultimate 64 bit

Instructions to reproduce

  1. Start xpra on server with command dbus-launch xpra --bind-tcp=0.0.0.0:1200 --no-daemon --start-child=chromium-browser start :11
  1. Connect with xpra client with command C:\Program Files (x86)\Xpra>xpra_cmd attach tcp:1.2.3.100:1200
  1. Sever your connection on purpose and wait for xpra window to time out
  2. Reestablish your connection and connect back to xpra

Sat, 16 Feb 2013 05:14:04 GMT - Antoine Martin:

I've been unable to reproduce on a Fedora 18 host, here's how I tested:

What I see is that the first client eventually times out (after about 60s, and it says so) and that the second client connects immediately without problems.

Can you confirm that you get the same results?

Now, the main difference between those two tests is the client (Fedora vs Windows), but since the connection is meant to have been severed (by wifi drop out in your test case or iptables rule in my example), I don't really see why this should make any difference at all.... Ideas welcome.

Are you using the UI on MS Windows? Can you try it from the command line instead? Can you test with a plain xterm instead of Chrome?


Sat, 16 Feb 2013 05:16:10 GMT - Antoine Martin: version, summary changed

Also, please try a v0.7.8 client/server so we can figure out if this is a recent regression.

(fixed ticket metadata)


Sat, 16 Feb 2013 12:42:42 GMT - Antoine Martin:

Looking at the log messages again and something looks suspicious:

unknown or invalid packet type: ping

This happens after the old dead connection has been dropped, so it must be dropping the new connection at this point? But why would it? r2740 will log the source of the invalid packet so we will know for sure (so please update to r2740 or later) Also, please try both with normal logging and with "-d all" so we can get more details from the log messages.


Tue, 19 Feb 2013 23:50:03 GMT - alas:


Wed, 20 Feb 2013 02:42:31 GMT - Antoine Martin:

comment:6 which has been deleted shows that you are running the server with --no-deamon and via putty, this is the source of the problem - just do not do that!

Xpra is a server process, running it with --no-deamon leaves it attached to stdin/stdout and those fds end up being network connections that you then kill - the process is then unable to close properly and the kernel will leave the sockets in dead state for a while and refuse to allow a new process to use them. The exact timeout after you kill -9 the process can be tuned via sysctl IIRC - but you should just never do that.

Alternatives: use screen, use nohup, don't use --no-deamon, etc..

Please confirm so I can close this ticket as invalid. (always posting the exact command lines and setup environment does save a lot of time)


Thu, 28 Feb 2013 09:01:49 GMT - Antoine Martin: status changed


Thu, 28 Feb 2013 09:03:16 GMT - Antoine Martin: owner changed

Please see comment:5, it would be good to figure out where this message came from.

Apart from that, I believe this bug can be closed as invalid.


Thu, 28 Feb 2013 18:47:56 GMT - Smo:

As requested I tried some older clients I haven't tried older server but here are my findings.

Server Fedora 18 xpra-0.8.5-1.fc18.i386
Client Windows 7 ultimate 32 bit

Client command

Xpra_cmd.exe attach tcp:1.2.3.71:10000

Server command

xpra --no-daemon --bind-tcp=0.0.0.0:10000 --start-child=xterm start :12

Windows client 0.7.8 No problem 0.7.8 disconnects cleanly when it loses its connection

Windows client 0.8.7 2 tests

fails to disconnect cleanly xterm window remains open and non responsive after killing connection on the server end Need to kill task in the task manager to continue. After reconnecting here is the client log

xpra client version 0.8.7
2013-02-28 10:36:43,048 server ping timeout - waited 60 seconds without a respon
se
2013-02-28 10:36:43,053 connection closed after 0 packets received (0 bytes) and
 5 packets sent (3493 bytes)
2013-02-28 10:36:43,058 Connection lost

And here is the log on the server end

New connection received: SocketConnection(('1.2.3.71', 10000) - ('1.2.3.57', 49187))
Handshake complete; enabling connection
Disconnecting existing client Protocol(SocketConnection(('1.2.3.71', 10000) - ('1.2.3.57', 49186))), reason is: new valid connection received, this session does not allow sharing
unknown or invalid packet type: ping
connection closed after 1 packets received (3373 bytes) and 0 packets sent (0 bytes)

As you can see that mysterious message is back but only with the newer version of the client.


Fri, 01 Mar 2013 05:24:56 GMT - Antoine Martin: owner, status changed

How do you disconnect the existing client, by making a new connection to the server? I am confused because this ticket is originally about disconnection because of network dropouts, but the log you supply indicates a disconnection due to a new client connection - those are very different things. The former will rely on a timeout, which may take some time, the latter should be instantaneous.

Do you get the same behaviour if you Control-C the client?


Fri, 01 Mar 2013 05:35:29 GMT - Antoine Martin:

Does this patch fix the issue?

--- src/xpra/server_source.py	(revision 2850)
+++ src/xpra/server_source.py	(working copy)
@@ -398,8 +398,10 @@
         self.window_sources = {}
         self.window_metdata_cache = {}
         self.close_mmap()
-        self.protocol = None
         self.stop_sending_sound()
+        if self.protocol:
+            self.protocol.close()
+            self.protocol = None
     def user_event(self):
         self.last_user_event = time.time()

This should make sure we disconnect from the server end, though I would have thought that the client would realize the connection is dead within 1 minute or so as it doesn't get the ping reply.

Also, I don't see why it would help as the server code already does this:

protocol.flush_then_close(["disconnect", reason])

But maybe the write flushing spins endlessly - in which case we may need yet more code to deal with that.. (make it timeout eventually)


Sat, 02 Mar 2013 01:29:05 GMT - Smo:

The server log is from the new connection from the client log right above it.

I was removing my network cable to test this and then plugging it back in after the time-out to reconnect.

We only stumbled upon it since there is one guy that routinely just shuts his laptop lid and never disconnects his client causing the same issue.

Afterwards the server won't accept any new connections at all.

Oddly enough I built a new server from source today and using the windows client 0.8.4 and 0.8.7 that you built I'm unable to reproduce the problem.

As soon as I connect one of those clients to an older server 0.8.5 32bit that you have built as well I can reproduce it right away.

I guess I will do some further testing and let you know what I find out.


Tue, 05 Mar 2013 21:25:31 GMT - Antoine Martin: priority changed

It's a shame it's not very easy to reproduce, because otherwise we could bisect this..

(raising priority as this can cause serious problems for people)


Wed, 06 Mar 2013 17:32:36 GMT - Antoine Martin: status changed; resolution set

Sorry it has taken so long to fix this, I was finding it hard to reproduce reliably, but in fact this was trivial using a VirtualBox VM and using "save machine state".

Fixed in r2878, will backport to v0.8.x and release the bug fix.


Thu, 07 Mar 2013 06:55:07 GMT - Antoine Martin:

Follow up:


Thu, 07 Mar 2013 10:26:12 GMT - Antoine Martin:

as of r2894, the server will kill client connections when it does not get a response to its pings (default: ping every 10 seconds, kill after 1 minute)


Sat, 23 Jan 2021 04:49:47 GMT - migration script:

this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/257