Profiling
First, please see Debugging for information on logging and the environment variables that influence the behaviour of the system.
cProfile
cProfile outputs text only, which may be more appropriate in some cases. Simply replace xpra with tests/scripts/cProfile on the command line:
tests/scripts/cProfile start :10
On exit, it will dump the statistics to the terminal, e.g.:
Mon Jan 7 16:24:19 2013    profile.stats

         28226 function calls (27385 primitive calls) in 4.409 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.958    3.958    4.050    4.050 {gtk._gtk.main}
        1    0.090    0.090    0.113    0.113 /usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py:23(<module>)
        2    0.050    0.025    0.050    0.025 {wimpiggy.lowlevel.bindings.get_screen_sizes}
        1    0.049    0.049    0.057    0.057 /usr/lib64/python2.7/site-packages/gst-0.10/gst/__init__.py:24(<module>)
        2    0.026    0.013    0.026    0.013 {method 'read' of 'file' objects}
        2    0.014    0.007    0.014    0.007 {method 'search' of '_sre.SRE_Pattern' objects}
        1    0.012    0.012    0.012    0.012 {method 'close' of 'gtk.gdk.Display' objects}
(...)
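The same data can also be inspected offline with Python's standard pstats module if you keep the dump file (the name profile.stats here is taken from the output above; where the wrapper actually writes it may differ):

import pstats

# Load a saved cProfile dump and show the most expensive calls.
stats = pstats.Stats("profile.stats")
stats.sort_stats("tottime")    # same ordering as above: internal time
stats.print_stats(10)          # only the 10 hottest functions
stats.print_callers(5)         # and who calls them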
pycallgraph
pycallgraph can generate pretty diagrams.
We now ship a wrapper which takes care of simplifying the calls to pycallgraph:
./tests/scripts/xpra/pycallgraph
This command includes a number of working examples and lists the groups and threads available for profiling.
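If the wrapper does not fit your use case, pycallgraph can also be driven directly from Python. A minimal sketch, assuming pycallgraph 1.x with its GraphvizOutput backend (run_test is just a placeholder for the code you want to profile):

from pycallgraph import PyCallGraph
from pycallgraph.output import GraphvizOutput

def run_test():
    # placeholder: call the code you want to profile here
    sum(i * i for i in range(100000))

# trace run_test() and render the call graph to a PNG file
output = GraphvizOutput(output_file="callgraph.png")
with PyCallGraph(output=output):
    run_test()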
Using pycallgraph to identify bottlenecks
General guidelines:
- start with a small delay (e.g. -d 2) to skip the initial setup, potentially longer if you start a server and then also need to connect the client before profiling
- when comparing versions/changes/settings: use the runtime parameter to ensure that each run lasts for exactly the same amount of time - this makes it easier to compare the labels using absolute values; also script the whole thing so the client connects at exactly the same time if profiling the server (see the sketch after this list)
- focus on one thread at a time - there is not much overlap/impact between different threads
- choose an application to test which will stress the desired portion of the code (e.g. if testing the network or encoding thread, an application that generates a high frame rate is more appropriate than whatever may ultimately be the target application for the system)
- choose an application whose behaviour is predictable: preferably one that does not use the network or rely on any external resources, so the whole thing can be scripted and run repeatedly/reliably for comparison with previous runs
- start with most groups enabled
- most groups can usually be disabled early - unless they are being profiled specifically
- some Debugging environment variables may be useful
- save useful graphs together with all the information relating to the version used, modifications, environment, etc., so they can be reproduced later
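A scripted run of the kind described above could look roughly like this. This is only a sketch: the wrapper arguments, delays and runtime are placeholders to be adjusted, only the xpra start / attach / stop subcommands are standard:

import subprocess, time

DISPLAY = ":10"
RUNTIME = 30   # seconds of profiled activity, kept identical across runs

# start the profiled server and give it a fixed head start
server = subprocess.Popen(["./tests/scripts/xpra/pycallgraph", "start", DISPLAY])
time.sleep(5)

# connect the client at exactly the same point in every run
client = subprocess.Popen(["xpra", "attach", DISPLAY])
time.sleep(RUNTIME)

# tear down in a fixed order so consecutive runs stay comparable
client.terminate()
subprocess.call(["xpra", "stop", DISPLAY])
server.wait()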
Fixing Bottlenecks
First, write a test to measure/isolate the problematic part of the code.
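Such a test can be as simple as a timeit micro-benchmark around the suspect code. A sketch, where function_under_test is a stand-in for the real code being measured:

import random, timeit

def function_under_test(records):
    # stand-in for the real code being measured, e.g. calculate_time_weighted_average
    values = [v for _, v in records]
    return sum(values) / len(values)

random.seed(0)   # reproducible input so runs can be compared
records = [(float(i), random.random()) for i in range(100000)]

elapsed = timeit.timeit(lambda: function_under_test(records), number=10)
print("average per run: %.1fms" % (elapsed * 100.0))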
There are no magic solutions to apply; some common approaches are:
- threading (we already have too many threads though..)
- re-writing in C or Cython
- better algorithms
Examples
maths.py re-written in Cython
Merged in r2462
- Before:
$ XPRA_CYTHON_MATH=0 ./tests/xpra/test_maths.py
calculate_time_weighted_average(100000 records)=(0.5129547191923681, 0.529864013016996)
elapsed time: 72.6ms
test_calculate_timesize_weighted_average(100000 records)=(70109.42497899215, 22337.16293859974)
elapsed time: 204.9ms
- After:
$ XPRA_CYTHON_MATH=1 ./tests/xpra/test_maths.py
calculate_time_weighted_average(100000 records)=(0.5226684212684631, 0.5245391130447388)
elapsed time: 4.0ms
test_calculate_timesize_weighted_average(100000 records)=(180154.15625, 35980.69140625)
elapsed time: 19.7ms
This is roughly 10 to 20 times faster (72.6ms vs 4.0ms is about 18 times, 204.9ms vs 19.7ms about 10 times).
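The XPRA_CYTHON_MATH switch used above presumably follows the usual optional-extension pattern: pick the compiled Cython module at import time and keep a pure-Python fallback. A sketch only, not the actual xpra code (cymaths is a hypothetical module name and the fallback is a placeholder):

import os

def _py_calculate_time_weighted_average(records):
    # placeholder pure-Python fallback
    values = [v for _, v in records]
    avg = sum(values) / len(values)
    return avg, avg

calculate_time_weighted_average = _py_calculate_time_weighted_average
if os.environ.get("XPRA_CYTHON_MATH", "1") == "1":
    try:
        # hypothetical name for the compiled Cython module
        from cymaths import calculate_time_weighted_average
    except ImportError:
        pass   # keep the pure-Python fallback defined above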