xpra icon
Bug tracker and wiki

Opened 10 months ago

Closed 7 months ago

Last modified 6 months ago

#1447 closed defect (fixed)

Server does not accept initial connection on remote start.

Reported by: psycho_zs Owned by: psycho_zs
Priority: major Milestone: 2.1
Component: server Version: trunk
Keywords: Cc:

Description

Starting xpra with xpra start ssh:....

Server on remote machine is started, but this is being spammed continiously into the log:

2017-02-21 21:37:05,892 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:05,991 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:05,992 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:05,993 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,092 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,093 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,094 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,193 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,194 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,194 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,294 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,295 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,296 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,396 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,396 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,397 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,496 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,497 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,498 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,597 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,598 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,599 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,699 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,699 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,700 New unix-domain connection received on /var/run/xpra/hostname-100
2017-02-21 21:37:06,800 New unix-domain connection received on /var/run/user/1000/xpra/hostname-100
2017-02-21 21:37:06,800 New unix-domain connection received on /home/username/.xpra/hostname-100
2017-02-21 21:37:06,802 New unix-domain connection received on /var/run/xpra/hostname-100

Client soon exits, log spamming continues until I connect with xpra attach.
Then server's log is truncated and replaced with this message:

(EE) 
Fatal server error:
(EE) Server is already active for display 100
	If this server is no longer running, remove /tmp/.X100-lock
	and start again.
(EE) 
(EE) 
Please consult the The X.Org Foundation support 
	 at http://wiki.x.org
 for help. 
(EE) 
2017-02-21 21:37:21,079 
2017-02-21 21:37:21,079 Xvfb command has terminated! xpra cannot continue
2017-02-21 21:37:21,079  if the display is already running, try a different one,
2017-02-21 21:37:21,079  or use the --use-display flag
2017-02-21 21:37:21,079 
2017-02-21 21:37:21,080 killing xvfb with pid 14651
2017-02-21 21:37:21,080 failed to kill xvfb process with pid 14651:
2017-02-21 21:37:21,080  [Errno 3] No such process

Attachments (4)

server.log (28.3 KB) - added by psycho_zs 10 months ago.
client.log (1.2 KB) - added by psycho_zs 10 months ago.
server_3sockets.log (10.9 KB) - added by psycho_zs 9 months ago.
server-r15494.log (4.1 KB) - added by psycho_zs 8 months ago.

Download all attachments as: .zip

Change History (33)

comment:1 Changed 10 months ago by Antoine Martin

Milestone: 2.0
Owner: changed from Antoine Martin to psycho_zs

I did find a bug, fixed in r15133, but this would have caused the client side to also emit a large number of warning messages to stdout - did you not get those?

Please close if that fixes things.

comment:2 Changed 10 months ago by psycho_zs

Resolution: fixed
Status: newclosed

r15180 works well on remote start, no log spam on both ends.

Changed 10 months ago by psycho_zs

Attachment: server.log added

Changed 10 months ago by psycho_zs

Attachment: client.log added

comment:3 Changed 10 months ago by psycho_zs

Resolution: fixed
Status: closedreopened

If server has leftovers from previous run, client can not initiate remote start and same log spam happens on server.
Attached server log and client output.

comment:4 Changed 9 months ago by Antoine Martin

I'm not sure what you mean by "has leftovers from previous run", or how I would reproduce this.

Assuming that you hit the case where a server was already running on the display that you tried to start remotely, one could argue that the code is working as intended by not matching this server, but r15190 will now allow us to match it and connect to it.
(could be backported I guess - not sure that's necessarily right)

comment:5 Changed 9 months ago by psycho_zs

Sorry, I'm speaking about sockets:

/var/run/user/1000/xpra/server-100 is not responding, waiting for it to timeout before clearing it.....
2017-02-28 21:09:05,310 created unix domain socket: /var/run/user/1000/xpra/server-100
/home/user/.xpra/server-100 is not responding, waiting for it to timeout before clearing it.....
2017-02-28 21:09:09,315 created unix domain socket: /home/user/.xpra/server-100
/var/run/xpra/server-100 is not responding, waiting for it to timeout before clearing it.....
2017-02-28 21:09:13,321 created unix domain socket: /var/run/xpra/server-100

Maybe client doesn't wait for this process to finish?

Last edited 9 months ago by psycho_zs (previous) (diff)

comment:6 Changed 9 months ago by Antoine Martin

Milestone: 2.02.1
Owner: changed from psycho_zs to Antoine Martin
Status: reopenednew

Those sockets are usually left behind when a server crashes.
Those messages were not in the log samples your had provided.

Fixing this will be harder, we must wait in case the server is still alive, and the client also needs to give up after waiting too long...
What we should do in this case is wait for all the sockets in parallel to save time.
This will be a bigger change, too late for this version.

comment:7 Changed 9 months ago by psycho_zs

I copied that snippet directly from the attached server.log )

Viewing server and client log live it seemed that client gives up just at the same time server finises dealing with sockets. So a dirty fix would be to give the client a couple more seconds of timeout.

I will try catching the crashes.

comment:8 Changed 9 months ago by Antoine Martin

I copied that snippet directly from the attached server.log )

Sorry, yes - I missed it!

comment:9 Changed 9 months ago by Antoine Martin

r15278 raised the socket timeout to 20s (r15300 for v1.0.x branch).
This will help until we can do parallel socket probing.

comment:10 Changed 9 months ago by Antoine Martin

Owner: changed from Antoine Martin to psycho_zs

It's not really done in parallel as that would be too hard to implement, but r15381 should make things more reliable.
I've struggled to test this properly because if I "kill -9" we correctly detect that the socket is dead, and if I just hang the server then the socket still looks alive..

@psycho_zs: does that work for you? how can I test hung sockets?

comment:11 Changed 9 months ago by psycho_zs

I was left with old sockets some time ago when xpra server crashed sometimes. Now it does not, so I do not know how to test this situation now.
I will test remote start in the next couple of days.

comment:12 Changed 9 months ago by psycho_zs

In 2.0: killing all xpra-related processes with kill -9 leaves socket $HOME/.xpra/$SERVERNAME-$PORT. Trying to remote-start after that produces described behavior despite increased timeouts. It also seems that "New unix-domain connection received on..." message spam is continued to removed log with old file descriptor (seen if tailing log throughout the whole connection attempt).

I have troubles building current trunk, it does not seem to produce some files listed in xpra.install, like etc/xpra/xorg.conf, usr/lib/tmpfiles.d/xpra.conf. Would you add a fresh beta package?

comment:13 Changed 9 months ago by Antoine Martin

  • found a way to do proper parallel socket probing using threads, implemented in r15421
  • the packaging issue with tmpfiles.d / sysusers.d files should be fixed in r15422 (see #1450 for details)

Older branches will probably not be patched - as I can't think of a way to make things better without changing too much code.

comment:14 Changed 9 months ago by psycho_zs

r15425 seems to be working even with leftover sockets. Although it failed once on the first test, right after upgrade. (I killed xpra processes before upgrading).

comment:15 Changed 9 months ago by Antoine Martin

Resolution: fixed
Status: newclosed

Closing, feel free to re-open if you can give steps to reproduce problems with trunk.

comment:16 Changed 9 months ago by psycho_zs

Resolution: fixed
Status: closedreopened

I reproduced the situation in which client again does not wait long enough to connect. Xpra server did not survive dbus restart after recent system upgrade, leaving behind xpra _proxy_start :100 process and and X. So I killed everything with killall -u username and did remote start via xpra start ssh:.... Server waited for 3 sockets, client disconnected just before server finished waiting on them:

Warning: some of the sockets are in an unknown state:
 /var/run/user/1000/xpra/servername-100
 /home/username/.xpra/servername-100
 /var/run/xpra/servername-100

Full server log in attachment...

Changed 9 months ago by psycho_zs

Attachment: server_3sockets.log added

comment:17 Changed 9 months ago by Antoine Martin

Sorry but I cannot reproduce this.
Please try to provide specific steps.

comment:18 Changed 8 months ago by psycho_zs

With new initialization mechanism are there any potential race conditions left?

I was able to reproduce it the second time:
On server with running xpra restart dbus with related daemons

systemctl restart dbus ; systemctl daemon-reexec ; systemctl restart systemd-logind systemd-journald NetworkManager

kill all user processes:

killall -u $USER

Try to do xpra start ssh:...

Client gives up and disconnects somewhere around the time server prints "created unix domain socket" messages. Attach works a moment later.

comment:19 Changed 8 months ago by Antoine Martin

Timeouts are inherently racy.
It should be enough time for the server to start, even when waiting and then clearing the sockets, but there's no way we can be sure of that. And if the server is quite slow, it won't be enough. (add to that the ssh login delay)
Raising the client timeout indefinitely is not a good solution either. It is already at 15 seconds, which is quite a long time.
Are you sure your 2.1 server takes that long to start? (are you even testing with 2.1?)

No need to run "xpra shadow ssh:..." which just complicates things.
Just kill it and watch it start with "xpra shadow :0 --no-daemon" - see how long that takes.
The output should include timestamps, please post that information. (or the server log file if you wish)
Without hard data, it is difficult to know where to go.

Example on my relatively beefy system:

2017-04-03 22:53:23,078 created unix domain socket: /run/user/1000/xpra/desktop-10
2017-04-03 22:53:23,078 created unix domain socket: /home/antoine/.xpra/desktop-10
2017-04-03 22:53:23,079 created unix domain socket: /var/run/xpra/desktop-10
(..)
2017-04-03 22:53:24,038 xpra is ready.

So that's 1 second from the socket creation until fully ready.
(and the socket timeout + clearing should be around 8 seconds maximum)

Last edited 8 months ago by Antoine Martin (previous) (diff)

comment:20 Changed 8 months ago by psycho_zs

See server_3sockets.log​, it is the server log from the similar situation as described in my previous comment. 2.1-20170326r15425-1

comment:21 Changed 8 months ago by Antoine Martin

See server_3sockets.log​, it is the server log from the similar situation as described in my previous comment.

That log is against a 2.1-r15340 server, therefore missing any of the fixes from comment:10 onwards. We want the timing information of the current code.

comment:22 Changed 8 months ago by psycho_zs

I've just rechecked, the package version is 2.1-20170326r15425-1, but the log reports 2.1-r15340. OK, I'll try a fresher one.

comment:23 Changed 8 months ago by psycho_zs

r15494 package matches. Attaching log...
Remote start after dbus restart and killed user processes worked. (But there were no substantial system upgrade preceding it).

Changed 8 months ago by psycho_zs

Attachment: server-r15494.log added

comment:24 Changed 8 months ago by Antoine Martin

OK, so in this log we can see:

2017-04-03 20:44:01,438 Warning: some of the sockets are in an unknown state:
19	2017-04-03 20:44:01,438  /var/run/user/1000/xpra/servername-100
20	2017-04-03 20:44:01,438  /home/username/.xpra/servername-100
21	2017-04-03 20:44:01,439  /var/run/xpra/servername-100
22	2017-04-03 20:44:01,439  please wait as we allow the socket probing to timeout
23	2017-04-03 20:44:09,464 created unix domain socket: /var/run/user/1000/xpra/servername-100
24	2017-04-03 20:44:09,465 created unix domain socket: /home/username/.xpra/servername-100
25	2017-04-03 20:44:09,465 created unix domain socket: /var/run/xpra/servername-100
(..)
41	2017-04-03 20:44:12,034 xpra is ready.

So that's:

  • 8 seconds for timing out the old sockets
  • 3 seconds for creating the new ones and starting up

For a total of 11 seconds, which is well under the client timeout of 15 seconds..
r15508 reduces from 8 to just 6 seconds, which should still be enough to distinguish between a server that is starting up (3 seconds in your case) and one that is just dead.

comment:25 Changed 7 months ago by Antoine Martin

@psycho_zs: can I close this ticket or can you reproduce with r15508 or later?

comment:26 Changed 7 months ago by psycho_zs

Thanks. I'll try to check in the next couple of days. Busy month.

comment:27 Changed 7 months ago by psycho_zs

Xpra from latest svn can not start server

xpra start :100 -d all
using systemd-run to wrap 'start' server command
'systemd-run' '--description' 'xpra-start' '--scope' '--user' '/usr/bin/xpra' 'start' ':100' '-d' 'all' '--systemd-run=no'
Failed to create bus connection: No such file or directory

Latest deb build 2.1-20170409r15542-1 seem to be working, at least it connected successfully with leftover sockets on server a couple of times.
First timestamp visible in client output: 10:19:10
First timestamp visible in server log: 10:19:13
"Xpra is ready" message on server: 10:19:21
Handshake complete: 10:19:22

Last edited 7 months ago by Antoine Martin (previous) (diff)

comment:28 Changed 7 months ago by Antoine Martin

Resolution: fixed
Status: reopenedclosed

The timeout issue is fixed so I am closing this bug.

If you are having systemd startup problems, please file a separate ticket.
The exact systemd-run command we use is printed out, if that fails then it is quite unlikely to be an xpra bug.

comment:29 Changed 6 months ago by Antoine Martin

Saw something similar in ticket:1537#comment:3.

Note: See TracTickets for help on using tickets.