Profiling

First, please see Debugging for information on logging and the environment variables that influence the behaviour of the system.

cProfile

cProfile outputs text only, which may be more appropriate than call graph diagrams in some cases. Simply replace xpra with tests/scripts/cProfile on the command line:

tests/scripts/cProfile start :10

On exit, it dumps the statistics to the terminal, ie:

Mon Jan  7 16:24:19 2013    profile.stats

         28226 function calls (27385 primitive calls) in 4.409 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    3.958    3.958    4.050    4.050 {gtk._gtk.main}
        1    0.090    0.090    0.113    0.113 /usr/lib64/python2.7/site-packages/gtk-2.0/gtk/__init__.py:23(<module>)
        2    0.050    0.025    0.050    0.025 {wimpiggy.lowlevel.bindings.get_screen_sizes}
        1    0.049    0.049    0.057    0.057 /usr/lib64/python2.7/site-packages/gst-0.10/gst/__init__.py:24(<module>)
        2    0.026    0.013    0.026    0.013 {method 'read' of 'file' objects}
        2    0.014    0.007    0.014    0.007 {method 'search' of '_sre.SRE_Pattern' objects}
        1    0.012    0.012    0.012    0.012 {method 'close' of 'gtk.gdk.Display' objects}
(...)
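
The saved statistics can also be loaded and re-sorted after the fact using the standard pstats module; a minimal sketch, assuming the dump was written to profile.stats as shown in the header above:

    import pstats

    # load the saved statistics and print the 10 entries
    # with the highest cumulative time
    stats = pstats.Stats("profile.stats")
    stats.strip_dirs().sort_stats("cumulative").print_stats(10)

    # the callers of a specific function can also be inspected, ie:
    stats.print_callers("get_screen_sizes")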

pycallgraph

pycallgraph can generate pretty diagrams. We now ship a wrapper which simplifies the calls to pycallgraph:

./tests/scripts/xpra/pycallgraph

This command includes a number of working examples and lists the groups and threads available for profiling.
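
Outside of the wrapper, the pycallgraph library can also be used directly; a minimal sketch, assuming pycallgraph 1.x and Graphviz are installed (busy_loop is just a placeholder workload, not xpra code):

    from pycallgraph import PyCallGraph
    from pycallgraph.output import GraphvizOutput

    def busy_loop():
        # placeholder workload standing in for the code being profiled
        total = 0
        for i in range(100000):
            total += i * i
        return total

    # a call graph image is written when the context manager exits
    with PyCallGraph(output=GraphvizOutput(output_file="callgraph.png")):
        busy_loop()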

Using pycallgraph to identify bottlenecks

General guidelines:

  • start with a small delay (ie: -d 2) to skip the initial setup; use a longer delay if you start a server and also need to connect the client before profiling
  • when comparing versions/changes/settings: use the runtime parameter so that each run lasts exactly the same amount of time - this makes it easier to compare the labels using absolute values; if profiling the server, script the whole thing so that the client connects at exactly the same time in every run (a sketch follows this list)
  • focus on one thread at a time - there is not much overlap/impact between different threads
  • choose an application to test which will stress the desired portion of the code (ie: if testing the network or encoding thread, an application that generates a high frame rate is more appropriate than whatever may ultimately be the target application for the system)
  • choose an application whose behaviour is predictable: preferably one which does not use the network or rely on any external resources, so that the whole thing can be scripted and run repeatedly/reliably for comparison with previous runs
  • start with most groups enabled
  • most groups can usually be disabled early - unless they are being profiled specifically
  • some Debugging environment variables may be useful
  • save useful graphs together with all the information relating to the version used, modifications, environment, etc, so that the results can be reproduced later
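
To illustrate the scripting points above, a hypothetical sketch of a repeatable profiling run - the display number, delays and commands are placeholders only (the cProfile wrapper is shown, but the same structure applies to the pycallgraph wrapper):

    import subprocess
    import time

    DISPLAY = ":10"
    RUNTIME = 60     # fixed runtime, so runs can be compared using absolute values

    # start the server under the profiler
    server = subprocess.Popen(["tests/scripts/cProfile", "start", DISPLAY])
    time.sleep(5)    # skip the initial setup phase
    # connect the client at exactly the same point in every run
    client = subprocess.Popen(["xpra", "attach", DISPLAY])
    time.sleep(RUNTIME)
    client.terminate()
    subprocess.call(["xpra", "stop", DISPLAY])
    server.wait()    # the statistics are dumped when the server exits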

Fixing Bottlenecks

First, write a test to measure/isolate the problematic part of the code (see the sketch after the list below).
There are no magic solutions to apply; here are some approaches which have been used in the past:

  • removing code!
  • threading (we already have too many threads though..)
  • re-writing in C or Cython
  • better algorithms
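
For the first step, a standalone test which isolates the suspect code is usually enough; a minimal sketch using timeit (calculate_average is a made-up stand-in - the real tests, like test_maths.py below, follow the same pattern):

    import timeit

    def calculate_average(records):
        # made-up stand-in for the code being measured
        return sum(value for _, value in records) / len(records)

    records = [(i, i * 0.5) for i in range(100000)]
    # average over 10 runs to smooth out the noise
    elapsed = timeit.timeit(lambda: calculate_average(records), number=10) / 10
    print("elapsed time: %.1fms" % (elapsed * 1000.0))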

Examples

maths.py re-written in Cython

Merged in r2462


  • Before:
    $ XPRA_CYTHON_MATH=0 ./tests/xpra/test_maths.py 
    calculate_time_weighted_average(100000 records)=(0.5129547191923681, 0.529864013016996)
    elapsed time: 72.6ms
    test_calculate_timesize_weighted_average(100000 records)=(70109.42497899215, 22337.16293859974)
    elapsed time: 204.9ms
    
  • After:
    $ XPRA_CYTHON_MATH=1 ./tests/xpra/test_maths.py 
    calculate_time_weighted_average(100000 records)=(0.5226684212684631, 0.5245391130447388)
    elapsed time: 4.0ms
    test_calculate_timesize_weighted_average(100000 records)=(180154.15625, 35980.69140625)
    elapsed time: 19.7ms
    

This is 10 to 20 times faster.
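
For illustration only - this is not the actual maths.py code - the kind of Cython rewrite that produces such speedups mostly consists of typing the loop variables and buffers:

    # hypothetical sketch, not the real maths.py implementation:
    # typed loop variables and memoryviews let Cython generate plain C loops
    def time_weighted_average(double[:] values, double[:] weights):
        cdef Py_ssize_t i, n = values.shape[0]
        cdef double total = 0.0, weight_sum = 0.0
        for i in range(n):
            total += values[i] * weights[i]
            weight_sum += weights[i]
        return total / weight_sum if weight_sum > 0.0 else 0.0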

high network cost in UI thread

Work in progress, see #231