I'm making an April Fools' Day button for spacemacs.org using the Xpra HTML5 client and a Docker Swarm back-end with the default load balancer. Here it is: http://xpratest.tk (temporary location). I had to modify the Xpra code to make it work properly(ish). I think it would be great to make this doable with unmodified Xpra.
So, the modifications:
First of all, the client needs the ability to reconnect without refreshing the page. This includes killing web workers and timers. The Client constructor also calls Utilities.getAudioContext, which creates audio contexts (they are limited per page).
The error WebSocket connection to 'ws://****/' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED should also be handled properly (it occurs when there are no live backends).
On the server side I needed a way to drop the new WS client instead of the old one. This way a client keeps trying to connect until it hits a live container without a client. This has to happen before the new client is able to interfere with the previous one (for example, by forcing a disconnect).
WARNING: eye bleed inducing code smell ahead!
server: https://github.com/JAremko/browsermax client: https://github.com/JAremko/develop.spacemacs.org
For now, random hello timeouts seem to be the biggest problem. I had to increase the timeout substantially. (This may be a Docker-related problem: I built it from GitHub trunk because I needed some extra sandbox features.)
Also, protecting the Xpra server with a CAPTCHA would be great :P It would help in cases like this, when you just want to embed an Xpra window on a site and let users access it with some basic abuse prevention, but without custom reverse proxy shenanigans.
A more generic solution to your "Drop-latest-WS-connection.patch" has been added in r15393: if "steal" is false (not the default), we reject the connection if a client is already connected. (with some other cleanups thrown in)
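The policy described above can be modelled in a few lines. This is only an illustrative JavaScript sketch, not xpra's server code: the Session class, its accept and disconnect methods and the returned strings are hypothetical names. Only the "steal" flag and its meaning come from the comment above.

```javascript
// Hypothetical model of the "steal" policy; not taken from the xpra source.
class Session {
  constructor(steal) {
    this.steal = steal;   // true: a new client replaces the current one
    this.client = null;   // currently connected client, if any
  }

  // Returns "accepted", "replaced" or "rejected" for a connecting client.
  accept(newClient) {
    if (this.client === null) {
      this.client = newClient;
      return "accepted";
    }
    if (!this.steal) {
      // Reject before the newcomer can affect the existing client.
      return "rejected";
    }
    this.client = newClient;
    return "replaced";
  }

  // Frees the session when the given client is the connected one.
  disconnect(client) {
    if (this.client === client) {
      this.client = null;
    }
  }
}
```

With steal set to false, a rejected client behind a load balancer can simply retry until it lands on a container whose session is free.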
The AudioContext limitation: wouldn't it be cleaner to just cache the return value in Utilities.getAudioContext? (this should be safe as it is never called from a worker?)
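A minimal sketch of the caching idea, assuming nothing about the real body of Utilities.getAudioContext: makeCachedGetter and its factory argument are hypothetical, and the stand-in factory below only counts how often it runs.

```javascript
// Hypothetical helper: wraps a factory so that it runs at most once.
function makeCachedGetter(factory) {
  let created = false;
  let value;
  return function getCached() {
    if (!created) {
      value = factory();
      created = true;
    }
    return value;
  };
}

// Stand-in factory for illustration; in a browser this could be
// something like: function () { return new AudioContext(); }
let contextsCreated = 0;
const getContext = makeCachedGetter(function () {
  contextsCreated += 1;
  return { id: contextsCreated };
});
```

Every call to getContext after the first returns the same object, so re-creating the client would not use up another audio context.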
As for the hello timeouts: do you really need 120 seconds?! Where is it getting stuck during all that time?
The Error in connection establishment: net::ERR_CONNECTION_REFUSED - isn't this handled already? I get sent back to the connect page.
I think that captcha stuff is out of scope as this would make it tied to an API.
Hi Antoine, thanks for responding!
Replying to Antoine Martin:
A more generic solution to your "Drop-latest-WS-connection.patch" has been added in r15393: if "steal" is false (not the default), we reject the connection if a client is already connected. (with some other cleanups thrown in)
Thx. I'll look into it.
The AudioContext limitation: wouldn't it be cleaner to just cache the return value in Utilities.getAudioContext? (this should be safe as it is never called from a worker?)
You're right. This is how it should be handled in the "official implementation". I just do not need sound at all :)
As for the hello timeouts: do you really need 120 seconds?! Where is it getting stuck during all that time?
I think it is something related to my Docker setup, or to the web browser's "HTTP simultaneous connections per host" limit. So I need better tests if I want to get serious with this :P So far it looks like this huge timeout doesn't hurt.
The Error in connection establishment: net::ERR_CONNECTION_REFUSED - isn't this handled already? I get sent back to the connect page.
new WebSocket(...) in the web worker should probably be wrapped in something that allows retrying.
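A rough sketch of what such a retry wrapper could look like. All names here (connectWithRetry, openSocket, onConnected, onGiveUp, schedule) are hypothetical and not part of the Xpra HTML5 client. The socket creation is passed in as a function so the retry logic stays independent of the WebSocket API.

```javascript
// Hypothetical retry wrapper; openSocket, onConnected and onGiveUp are
// illustrative names, not part of the xpra HTML5 client.
function connectWithRetry(openSocket, options) {
  // The scheduler is injectable; it defaults to setTimeout.
  const schedule = options.schedule || setTimeout;
  let attempts = 0;

  function attempt() {
    attempts += 1;
    openSocket(
      function onOpen(socket) {
        options.onConnected(socket, attempts);
      },
      function onError(error) {
        // maxRetries counts retries after the first attempt.
        if (attempts > options.maxRetries) {
          options.onGiveUp(error, attempts);
          return;
        }
        // Try again later instead of giving up on the first refusal.
        schedule(attempt, options.delayMs);
      }
    );
  }

  attempt();
}
```

In a browser, openSocket could create a new WebSocket and wire the two callbacks to its onopen and onerror handlers.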
I think that captcha stuff is out of scope as this would make it tied to an API.
How about some kind of universal interface for captcha providers? Example: xpra start --captcha-provider=/usr/local/bin/captcha ...
Here /usr/local/bin/captcha is a user-made proxy to a captcha API such as reCAPTCHA. It returns HTML code that will be shown by the Xpra HTML5 client, plus a way to verify it (maybe when called with the user's response as an argument?). If the server's captcha uses the same TCP connection (WS channel), it will simplify load balancing. But it will need a timeout on captcha solving to prevent DoS.
Also, when restarting the Client it doesn't make sense to retest things like web worker support. And the Xpra Client should clean up the Xpra container element.
If Xpra had a Client onconnect event, it would help with switching the host page interface (changing from a starting to a started state, for example).
And events like "on first GUI element (window?) appeared" and "on last GUI element disappeared" would make this ugly crutch unnecessary.
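To make the request concrete, here is a small sketch of how "first window appeared" and "last window disappeared" notifications could be derived from window add and remove calls. WindowTracker and its callback names are hypothetical. The Xpra client does not expose this class.

```javascript
// Hypothetical tracker; the callback names are illustrative only.
class WindowTracker {
  constructor(onFirstWindow, onLastWindowGone) {
    this.windows = new Set();
    this.onFirstWindow = onFirstWindow;
    this.onLastWindowGone = onLastWindowGone;
  }

  add(windowId) {
    const wasEmpty = this.windows.size === 0;
    this.windows.add(windowId);
    if (wasEmpty) {
      this.onFirstWindow(windowId);
    }
  }

  remove(windowId) {
    // Set.delete returns false for unknown ids, so those are ignored.
    if (this.windows.delete(windowId) && this.windows.size === 0) {
      this.onLastWindowGone(windowId);
    }
  }
}
```

A host page could use the two callbacks to switch between its "starting" and "started" states without polling the container element.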
ERR_CONNECTION_REFUSED
Note: this doesn't fire for when the server is terminated normally ("disconnect" packet handler still goes back to the connect page)
Tested by killing the server with "kill -9" and then re-starting one quickly: the html5 client connects to the new server. (forcibly killing the TCP connection should have the same effect)
I don't think I will ever have time or interest in the captcha API feature, so please create a separate ticket for that if you wish. (bearing in mind that unless you start working on it, not much is likely to happen...)
Replying to Antoine Martin:
This is amazing, thank You very much!
Replying to Antoine Martin:
- "reconnect" option added in r15402
Can this.reconnect_count=0 (or -1) mean infinitely?
Replying to Antoine Martin:
I'm getting error Uncaught ReferenceError: me is not defined at Client.js:30
    // assign callback for window resize event
    if (window.jQuery) {
        jQuery(window).resize(jQuery.debounce(250, function (e) {
            me._screen_resized(e, me);
        }));
    }
http://xpra.org/trac/browser/xpra/trunk/src/html5/js/Client.js?rev=15402#L30
The Chrome tab dies if the server is unreachable, because multiple Protocol.js worker threads are alive simultaneously.
http://i.imgur.com/lG8jtOx.png
Also, they keep trying to connect to the server even when I'm clearly connected and can interact with the GUI.
To reproduce at http://xpratest.tk/ : connect, then disconnect by closing the window, and try to connect again from the same tab.
Can it be because I have such a huge hello timeout?
I set the back-end count to 1 so it is easier to debug. Also, the firewall allows only 1 new connection per 20 seconds (from the same IP).
The steal=false option seems to work, but the logs entry may be incomplete. (I don't see the text part)
Should reconnect occur when server or client timeout happens?
Uncaught ReferenceError: me is not defined at Client.js:30 is fixed in r15413
chrome tab dies if server is unreachable due to multiple Protocol.js workers thread alive simultaneously Also they're keep on trying to connect to the server even when I'm clearly connected and can interact with GUI
How do I reproduce this with the default xpra html5 client? There should only be a single worker, which we re-use. The re-connection should only happen when the current connection failed or dropped.
The steal=false option seems to work, but the logs entry may be incomplete. (I don't see the text part)
I don't understand what you mean. What logs? Can you show an "incomplete" sample of the log you are talking about?
Should reconnect occur when server or client timeout happens?
As of r15414, we re-connect on ping echo timeouts - which now also trigger more quickly. (15 seconds, configurable)
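As an illustration only, a ping echo timeout can be detected with a small watchdog like the one below. EchoWatchdog and its methods are hypothetical names and this is not the code from r15414. The 15 second value in the usage note is the default mentioned above.

```javascript
// Hypothetical watchdog; times are plain millisecond numbers.
class EchoWatchdog {
  constructor(timeoutMs, onTimeout) {
    this.timeoutMs = timeoutMs;
    this.onTimeout = onTimeout;
    this.pendingSince = null; // when the oldest unanswered ping was sent
  }

  pingSent(nowMs) {
    if (this.pendingSince === null) {
      this.pendingSince = nowMs;
    }
  }

  echoReceived() {
    this.pendingSince = null;
  }

  // Call periodically; fires onTimeout once per unanswered ping.
  check(nowMs) {
    if (this.pendingSince !== null &&
        nowMs - this.pendingSince >= this.timeoutMs) {
      this.pendingSince = null;
      this.onTimeout();
      return true;
    }
    return false;
  }
}
```

A client could construct this with 15000 and a callback that starts the reconnect, then call check from a timer.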
FYI: r15415 may be of interest to you too, it shows the connection setup progress
Replying to Antoine Martin:
Uncaught ReferenceError: me is not defined at Client.js:30 is fixed in r15413
- to retry infinitely, just use Number.MAX_SAFE_INTEGER
chrome tab dies if server is unreachable due to multiple Protocol.js workers thread alive simultaneously Also they're keep on trying to connect to the server even when I'm clearly connected and can interact with GUI
How do I reproduce this with the default xpra html5 client? There should only be a single worker, which we re-use. The re-connection should only happen when the current connection failed or dropped.
I had the same problem with my old implementation when the WebSocket constructor failed many times. I solved it with this. Maybe you can make sure that the old Protocol worker is removed before creating a new one?
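Something along these lines is what I mean, as a sketch rather than a patch. SingleWorkerSlot and createWorker are hypothetical names. The only real API assumed is that a Web Worker has a terminate() method.

```javascript
// Hypothetical holder; createWorker is any function returning an object
// with a terminate() method, as Web Workers have.
class SingleWorkerSlot {
  constructor(createWorker) {
    this.createWorker = createWorker;
    this.worker = null;
  }

  // Terminates the previous worker, if any, before creating a new one.
  replace() {
    this.release();
    this.worker = this.createWorker();
    return this.worker;
  }

  release() {
    if (this.worker !== null) {
      this.worker.terminate();
      this.worker = null;
    }
  }
}
```

If every reconnect attempt goes through replace(), at most one protocol worker can be alive at any time, no matter how many attempts fail.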
Maybe the log will help: https://gist.github.com/JAremko/abfb8130d87e85d7df397ea6b112ca80
I think that to detect it with the default client you need to disable the redirect on disconnect and connect to a wrong WS address (but currently my client is pretty "default"): https://github.com/JAremko/develop.spacemacs.org/blob/gh-pages/index.html#L383
The steal=false option seems to work, but the logs entry may be incomplete. (I don't see the text part)
I don't understand what you mean. What logs? Can you show an "incomplete" sample of the log you are talking about?
Oh, the browser log shows only "session busy"; I was looking for "this session is already active". OK.
I noticed that after a reconnect the Client doesn't honor this rule: https://github.com/JAremko/develop.spacemacs.org/blob/gh-pages/index.html#L448
Hm...
2017-03-26 06:26:47,482 created unix domain socket: /home/emacs/.emacs.d/.cache/bbd19a81f821-14
Unable to create /home/emacs/.dbus
Unable to create /home/emacs/.dbus/session-bus
2017-03-26 06:26:51,360 serving html content from: /usr/share/xpra/www
2017-03-26 06:26:51,467 started command 'emacs -geometry 100x48 --chdir "/home/emacs/.emacs.d/.cache/workspace"' with pid 50
2017-03-26 06:26:51,467 xpra X11 version 2.1 64-bit
2017-03-26 06:26:51,467 uid=1000 (spacemacser), gid=1000 (xpra)
2017-03-26 06:26:51,467 running with pid 12 on Linux
2017-03-26 06:26:51,468 connected to X11 display :14 with 24 bit colors
2017-03-26 06:26:51,469 15.6GB of system memory
2017-03-26 06:26:51,485 xpra is ready.
(process:51): GLib-GIO-CRITICAL **: g_settings_schema_source_lookup: assertion 'source != NULL' failed
2017-03-26 06:29:53,152 Handshake complete; enabling connection
2017-03-26 06:29:53,159 HTML5 Linux client version 2.1
2017-03-26 06:29:53,159 automatic picture encoding enabled
2017-03-26 06:29:53,160 also available:
2017-03-26 06:29:53,160 jpeg, png, rgb32
2017-03-26 06:29:53,160 client root window size is 1920x1014 with 1 display:
2017-03-26 06:29:53,160 HTML (508x268 mm - DPI: 96x96)
2017-03-26 06:29:53,160 Canvas
2017-03-26 06:29:53,161 setting keyboard layout to 'us'
2017-03-26 06:29:53,207 client 1: got hello: server version 2.1 accepted our connection
2017-03-26 06:29:53,225 client 1: startup complete
2017-03-26 06:29:58,465 Handshake complete; enabling connection
2017-03-26 06:29:58,465 Disconnecting client 10.255.0.2:52434:
2017-03-26 06:29:58,465 new client (this session does not allow sharing)
2017-03-26 06:29:58,466 xpra client 1 disconnected.
2017-03-26 06:29:58,467 HTML5 Linux client version 2.1
2017-03-26 06:29:58,467 automatic picture encoding enabled
2017-03-26 06:29:58,467 also available:
2017-03-26 06:29:58,467 jpeg, png, rgb32
2017-03-26 06:29:58,467 Last client has disconnected, terminating
2017-03-26 06:29:58,467 xpra is terminating.
2017-03-26 06:29:58,471 client root window size is 1920x1014 with 1 display:
2017-03-26 06:29:58,471 HTML (508x268 mm - DPI: 96x96)
2017-03-26 06:29:58,471 Canvas
2017-03-26 06:29:58,472 keyboard mapping already configured (skipped)
2017-03-26 06:29:58,506 client 2: got hello: server version 2.1 accepted our connection
2017-03-26 06:29:58,508 client 2: startup complete
This server has become a zombie.
So the zombie protocol workers and the zombie servers are the two biggest problems now.
It looks like all you need to trigger this "zombie worker" bug in Chrome is to disable the redirect on disconnect, attempt to connect to a bad (closed, unresponsive) port with many retries, and wait a minute.
http://i.imgur.com/gx5qkyt.png
Each error block spawns extra workers.
Firefox doesn't seem to be affected.
I made you a test page: http://xpratest.tk/zombie-test
(edit: link is now 404 - sigh)
It is the default client (r15425). I only changed the server and port, and enabled debug mode.
Lots of fixes in r15430 + r15431. Does that work for you?
(PS: please don't add links to the wiki if those are likely to go 404 in the future)
Replying to Antoine Martin:
Lots of fixes in r15430 + r15431. Does that work for you?
(PS: please don't add links to the wiki if those are likely to go 404 in the future)
Looks good to me. No zombie workers and it works.
Have you fixed zombie servers as well?
Have you fixed zombie servers as well?
What are those?
Can I close this ticket?
Replying to Antoine Martin:
Have you fixed zombie servers as well?
What are those?
2017-03-26 06:29:58,465 Disconnecting client 10.255.0.2:52434:
2017-03-26 06:29:58,465 new client (this session does not allow sharing)
2017-03-26 06:29:58,466 xpra client 1 disconnected.
2017-03-26 06:29:58,467 HTML5 Linux client version 2.1
2017-03-26 06:29:58,467 automatic picture encoding enabled
2017-03-26 06:29:58,467 also available:
2017-03-26 06:29:58,467 jpeg, png, rgb32
2017-03-26 06:29:58,467 Last client has disconnected, terminating
2017-03-26 06:29:58,467 xpra is terminating.
2017-03-26 06:29:58,471 client root window size is 1920x1014 with 1 display:
2017-03-26 06:29:58,471 HTML (508x268 mm - DPI: 96x96)
2017-03-26 06:29:58,471 Canvas
2017-03-26 06:29:58,472 keyboard mapping already configured (skipped)
2017-03-26 06:29:58,506 client 2: got hello: server version 2.1 accepted our connection
2017-03-26 06:29:58,508 client 2: startup complete
It seems that the first client exited before the new one completed its handshake, so the Xpra server simply hangs.
It is hard to reproduce...
Please create a new ticket for the server issue - this doesn't look related to the html5 client at all.
ok. Thanks for all the hard work! Really appreciate it.
(edit milestone and title)
See also #1491
re-connect bug: #1586
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/1473