#620 closed defect (fixed)
performance optimizations from profiling
Reported by: | Antoine Martin | Owned by: | alas |
---|---|---|---|
Priority: | critical | Milestone: | 0.14 |
Component: | android | Version: | 0.12.x |
Keywords: | Cc: |
Description (last modified by )
After playing around with wiki/Profiling, I've identified some areas where tuning could lead to some significant improvements:
Protocol.encode
callingCompressed.__len___
fixed in r7021error.XErrorToName
called when the result is generally not used (not very expensive, but used very often!) fixed in r7022region.contains
improved for python>=2.5 in r7025 (+fix in r7027)
Not done yet:
WindowVideoSource.get_best_encoding()
costs 10ms! (including 5ms forget_encoding_options()
!)get_current_speed()
andget_current_quality()
cost about 1ms each, but we call them a lotWindowMode.get_dimensions()
: 2msregion.add_rectangle
: 2.5ms!send_cursor
: we should compress before sending, not in the event loop!set_client_properties
: 30ms! (huge!)
Attachments (9)
Change History (37)
Changed 7 years ago by
Attachment: | damage-r7031.png added |
---|
comment:1 Changed 7 years ago by
Description: | modified (diff) |
---|---|
Owner: | changed from Antoine Martin to Antoine Martin |
Status: | new → assigned |
The graph attached was generated using:
./tests/scripts/pycallgraph -d 10 \ -e libs,std,misc,one_offs,net,logging,x11,connection,keyboard,clipboard,sound,mouse \ -- start :10 ...
Progress:
get_current_speed()
andget_current_quality()
removed in r7030 (+fix in r7042)- for
get_best_encoding
, we should pre-calculate the options, including encoding specific things like themax_webp
value - not sure this can be done for this release. Maybe we can at least catch the special case ofsend_full_window_update
as seen in the graph above? send_cursor
is now doing the compression work later and per-client, which also allows us to use lz4 or lzo if supported
comment:2 Changed 7 years ago by
Some improvements on get_encoding_options
(try to pre-calculate some) in r7178, often much faster for the non-video case - the video case remains harder..
Changed 7 years ago by
Attachment: | encoding-scoring-v2.patch added |
---|
choose the encoding using a scoring mechanism
comment:3 Changed 7 years ago by
Improvements in r7203 and r7204.
The patch above is much nicer than the current code, but unfortunately it is not faster (yet? why not!?) and it is missing some features.
New important bottleneck identified (and expected): the rectangle merging/splitting code. It works fine with a few rectangles of limited size but really slows us down with more demanding tests. Needs to be cythonized.
Changed 7 years ago by
Attachment: | damage-r7215.png added |
---|
fast encoding selection is no longer the bottleneck
comment:4 Changed 7 years ago by
Encoding selection is now much improved in r7215, to the point that this is no longer the bottleneck (though get_best_encoding_default
method could still be improved as it is now a bit ugly and inefficient), as can be seen here (synthetic benchmark - which is hiding the real problem with rectangles..):
I liked the scoring method from the patch above as it allows for smoother encoding transitions than the arbitrary limits set in the current code, maybe we can merge it without slowing things down too much? Or maybe only use it beyond a certain size, to make small regions a fast path (as there is likely to be many more of those and choosing the right encoding is less important and easier: rgb often does a fine job for those)
comment:5 Changed 7 years ago by
comment:6 Changed 7 years ago by
Whereas the "rectangle slow" screenshot was cropped from a larger capture, this one is captured using a dedicated profiling command:
./tests/scripts/pycallgraph -d 10 \ -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \ -- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio
And the result is much better:
Is it a heisenbug where the profiling code slows down the rectangle loops so much that it becomes a problem? (the first capture has hundreds of expressions, the second one only the 6 shown above)
comment:7 Changed 7 years ago by
Owner: | changed from Antoine Martin to alas |
---|---|
Status: | assigned → new |
I think it is good enough now, please run the profiling as per comment:1 and comment:6 with the scenarios you are interested in, and post the resulting file pycallgraph-xpra.png
.
And look for any unusually high time per call (usually highlighted by a different colour).
It would be worth testing the video region code too with just: -I *subregion*
, assuming #410 isn't in too bad a shape.
comment:8 Changed 7 years ago by
I've updated my pycallgraph to v1.0.1 like it is listed in your graphs.
Following the commands in comment:1 and comment:6 I've ran a few tests this way and I've only once got pycallgraph-xpra.png
in my local directory.
I don't see any errors but I also don't get the resulting graph. What am I doing wrong ?
Changed 7 years ago by
Attachment: | graph-fail.png added |
---|
comment:9 Changed 7 years ago by
Owner: | changed from alas to Smo |
---|
Please post the exact commands that you used for both the server and the client, the command output, etc.
comment:10 Changed 7 years ago by
Priority: | major → critical |
---|
Should be closed for 0.14, raising to critical.
comment:12 Changed 7 years ago by
I'm running the server with
./tests/scripts/pycallgraph -d 10 \ -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \ -- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1400
Then i'm connecting with a windows client
C:\Program Files (x86)\Xpra>Xpra_cmd.exe attach tcp:1.2.3.4:1400
I do a few things in the xterm then disconnect afterwards I stop the server with ctrl+c and there is often no .png in the current directory so i'm not sure what i'm doing wrong
comment:13 Changed 7 years ago by
Owner: | changed from Smo to Antoine Martin |
---|
comment:14 Changed 7 years ago by
Owner: | changed from Antoine Martin to Smo |
---|
Don't control-c the server, try "xpra stop" or when profiling the client, just exit from the tray.
comment:15 Changed 7 years ago by
Tried again using xpra stop instead. I got pycallgraph-xpra.png
this time but it was exactly the same as the previous one attached to this ticket.
I used the same commands at comment:12
comment:16 Changed 7 years ago by
Please re-assign tickets to me if I am supposed to act on them.
Did you wait for at least 10 seconds, which is the initial delay set?
You should then see this message in the server output:
starting trace
If that still does not work for you, please post full versions, log output, etc..
comment:17 Changed 6 years ago by
Owner: | changed from Smo to Antoine Martin |
---|---|
Status: | new → assigned |
Maybe we do need to optimize intersect_rect
: see ticket:670#comment:6.
Changed 6 years ago by
Attachment: | region-cython.patch added |
---|
converts the region code to use a cython class, doubling the speed
comment:18 Changed 6 years ago by
The reason why the patch above only doubles the speed is because most of the expensive operations still work on python lists of objects (untyped).
That's not an easy problem to solve. One way would be to use numpy arrays for the list of rectangles, but we would lose the nice encapsulation.
comment:19 Changed 6 years ago by
Owner: | changed from Antoine Martin to alas |
---|---|
Status: | assigned → new |
r9146 + r9147 re-implements the rectangle code using cython, similar to the patch above but with more code unrolling.
Also adds a test to measure the impact of this change and the results are good (elapsed time - lower is better):
Test | 0.14 | 0.15 | Improvement |
---|---|---|---|
R1 create | 0.26 | 0.07 | 3.7x |
R1 add | 0.93 | 0.19 | 4.8x |
R1 remove | 0.44 | 0.14 | 3.1x |
R1 contains | 0.08 | 0.08 | 1x |
R2 create | 0.10 | 0.03 | 3.3x |
R2 add | 0.48 | 0.10 | 4.8x |
R2 remove | 0.17 | 0.05 | 3.4x |
R2 contains | 0.08 | 0.08 | 1x |
merge all | 2.09 | 0.02 | 104x |
@afarr: I am re-assigning to you because you guys need to be able to run the profiler.
I have just ran it again, and now that the code is in cython, we don't get as much output - but we still get something..
comment:20 Changed 6 years ago by
comment:21 Changed 6 years ago by
Well... I tried to install on fedora 21, seemed to get everything installed, but still apparently lacking rencode versioning.
[tlaloc@jimador src]$ xpra --version 2015-06-02 17:34:20,115 rencode at '/usr/lib64/python2.7/site-packages/rencode/__init__.pyc' lacks versioning information xpra v0.16.0
I suppose I'm missing some other detail (probably fairly obvious), because when I try to run with the same command as Smo from comment:12 I get the following traceback:
[tlaloc@jimador src]$ ./tests/scripts/pycallgraph -d 10 \ > -I "*rectangle*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" \ > -- start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1400 Traceback (most recent call last): File "./tests/scripts/pycallgraph", line 7, in <module> from pycallgraph import PyCallGraph, Config ImportError: No module named pycallgraph
Checking from the src directory, I do indeed see what appears to by a pycallgraph in tests/scripts ... so - does this look like an issue related to not having rencode versioning?
- Trying to install python-rencode I get this:
[tlaloc@jimador src]$ sudo yum install python-rencode [sudo] password for tlaloc: Loaded plugins: langpacks https://winswitch.org/beta/Fedora/21/x86_64/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found Trying other mirror. Package python-rencode-1.0.3-1.fc21.x86_64 already installed and latest version Nothing to do
Trying to install python3-rencode (which I see in the dists directory for Fedora 21) I get apparent success:
Installed: python3-rencode.x86_64 0:1.0.3-1.fc21
... maybe not the right version?
comment:22 Changed 6 years ago by
Ok, Smo held my hand through the install of the pycallgraph (didn't realize it needed a separate install).
Was able to generate a graph of a session with xterm start child launching a firefox window... opening a couple of tabs and watching video - with a 0.15.0 client connected to a 0.16.0 server. Not that I'm sure exactly what the chart indicates.
comment:24 Changed 6 years ago by
Well... I tried to install on fedora 21, seemed to get everything installed, but still apparently lacking rencode versioning.
python-rencode
is in the fedora repo, if you have this repo installed, simply doing a yum update
will get you the up to date rencode version, which will get rid of this warning.
comment:25 Changed 6 years ago by
I think I finally found a rencode library that at least causes no complaints.
Ran another test, got the same results as above, so presumably the rencode wasn't at issue.
Looking through the wiki/profiling page clarifies a number of the parameters I was copying from your/Smo's example... but I'm not entirely clear on what bits list I would use to tweak the profiling.
As an experiment I tried a small tweak:
./tests/scripts/pycallgraph -d 10 -I "*rectangle*,*encoding*,*.damage,*.do_send_delayed_regions,*.make_data_packet_cb,*._refresh_region,*.add_refresh_region" -- \ start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1201 --start-child=xterm
Which generated a considerably less interesting graph (just the _main_
grey box, really...) and also managed to produce some tracebacks/access violations client-side (prior to expected ctrl-c issues with win32 client): moved to #929.
So... is there another wiki page that outlines the structure and which "pycall"s should be expected along certain ... err... "usepaths"?
Changed 6 years ago by
Attachment: | pycallgraph-xpra--i-subregion-15-4.png added |
---|
subregion pycallgraph test
comment:26 Changed 6 years ago by
Looking back through this ticket I noticed the -I *subregion*
suggested setting.
Running with the following command line:
./tests/scripts/pycallgraph -d 10 -I "*subregion*" -- \ start :10 --no-notifications --no-daemon --start-child=xterm --no-mdns --no-pulseaudio --bind-tcp=0.0.0.0:1201 --start-child=xterm
0.15.4 r10055 win32 client against 0.15.4 r10082 fedora 21 server (smo's build) I was able to generate the following graph, after using a start-child xterm to start firefox and watch some video.
So is this sufficient "mastery" (to use the term very very loosely) to close this ticket and feel (vaguely) comfortable that I will be able to run this graphing should it be called for in the future? Or is there something glaringly obvious that I'm still missing?
(Hmm... and should I run the test again to produce non-reduced to fit the ticket results?)
comment:27 Changed 6 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Looking back through this ticket I noticed the -I *subregion* suggested setting.
What is included will depend on what we're looking for. subregion is a good one to look at.
I was able to generate the following graph
This will do for now.
comment:28 Changed 6 weeks ago by
this ticket has been moved to: https://github.com/Xpra-org/xpra/issues/620
example profiling output taken at r7031