[reSIProcate] epoll performance results for resip
Hi,
Attached are some performance test results for the latest resip stack. The data was gathered using resip/stack/test/testStackFlavors.py, which runs the testStack program in the same directory. This is a very simple test that runs two stacks: the sender stack generates UAC REGISTER transactions and the receiver stack is the UAS. The testStack program can be configured (via command line options) with various modes: UDP or TCP, number of ports, epoll or select, etc. The script runs through various combinations of options.
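For anyone who hasn't looked at the script, here is a minimal sketch of what "runs through various combinations of options" amounts to. The option names and values below are hypothetical, chosen only for illustration; they are not the actual testStack command line options, and this is not the code in testStackFlavors.py.

```python
from itertools import product

# Hypothetical option names and values, for illustration only.
# These are NOT the real testStack flags.
OPTIONS = {
    "protocol": ["udp", "tcp"],
    "num_ports": [1, 100, 10000],
    "poll": ["select", "epoll"],
}

def combinations(options):
    """Return one dict per combination of option values (cartesian product)."""
    keys = list(options)
    value_lists = [options[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]

if __name__ == "__main__":
    for combo in combinations(OPTIONS):
        print(combo)
```

Each resulting dict would correspond to one run of the test program and one row of results.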
The key metric is transactions-per-second (tps). More precisely, the reported metric is really transaction pairs, since the test does both the UAC and UAS side. The intent of the test is to measure the relative performance of different optimizations of the stack. The absolute performance isn't very meaningful, though it is likely an upper bound on what any real application might achieve.
The comments in the source code testStack.cxx provide a brief explanation of the different thread modes. I've observed a lot of variation (>20%) in the tps numbers from run to run, so don't assign much meaning to small tps differences.
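To make the "don't read much into small differences" point concrete, here is a rough sketch of how one might compare two configurations given several runs of each. The 20% threshold comes from the variation mentioned above; the function names and the sample numbers are made up for illustration and are not measured results.

```python
from statistics import mean

def relative_spread(samples):
    """Return (max - min) / mean for a list of tps readings."""
    return (max(samples) - min(samples)) / mean(samples)

def is_meaningful(runs_a, runs_b, noise=0.20):
    """Return True if the mean tps of the two configurations differs by
    more than `noise`, measured relative to the smaller mean."""
    a, b = mean(runs_a), mean(runs_b)
    return abs(a - b) / min(a, b) > noise

if __name__ == "__main__":
    # Illustrative numbers only, not taken from the attached CSV.
    select_runs = [100.0, 110.0]
    epoll_runs = [105.0, 115.0]
    print(relative_spread(select_runs))
    print(is_meaningful(select_runs, epoll_runs))
```

Averaging several runs per configuration before comparing is the simplest guard against this noise; a difference smaller than the run-to-run spread should be treated as a tie.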
This test data was generated on a Dell PowerEdge R610 with 2 Xeon CPUs @ 1.2G, 64-bit, 8 cores total, running Ubuntu Linux 2.6.31.
The attachment is an ASCII CSV file; open it with Excel or your favorite text editor.
Some observations:
* For a single port, event (epoll-based) mode is comparable to the pre-existing behavior. In this particular test it appears faster, but I've also seen it perform slightly worse on other machines.
* The performance penalty going from 1 port to 10k ports is between 25% and 50%. My belief is that this penalty is due to the O(logN) searches within TransportSelector, not epoll itself, but I haven't really investigated. The first version had 10x penalties, so I'm pretty happy with the current results.
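As a back-of-the-envelope illustration of the two numbers in the last observation, the sketch below computes the relative penalty between two tps readings and the rough number of comparisons per lookup in a balanced search tree. The balanced-tree model is an assumption on my part about what an O(logN) search looks like, not a statement about how TransportSelector is implemented, and the tps values are illustrative only.

```python
import math

def penalty(tps_one_port, tps_many_ports):
    """Fractional tps loss going from one port to many ports."""
    return 1.0 - tps_many_ports / tps_one_port

def tree_lookup_steps(n):
    """Approximate comparisons per lookup in a balanced binary search
    tree holding n entries (assumed model for an O(log N) search)."""
    return max(1, math.ceil(math.log2(n)))

if __name__ == "__main__":
    # Illustrative tps values only, not taken from the attached CSV.
    print(penalty(1000.0, 750.0))
    print(penalty(1000.0, 500.0))
    print(tree_lookup_steps(1))
    print(tree_lookup_steps(10000))
```

Under that model, each lookup goes from a single comparison with 1 port to about 14 with 10k ports, which is at least consistent with a penalty in this range if lookups happen several times per transaction. Profiling would be needed to confirm it.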
One last comment. I have a repro instance using the epoll mode running as a TCP-UDP gateway, and it shows between 2k and 3k tps throughput for non-INVITE transactions when handling 50k concurrent TCP connections with simulated traffic. It is CPU limited. I've just started profiling this. If anyone has previously profiled repro and/or has good ideas for performance optimization, please let me know your thoughts.
Finally, I'd like to see similar data for Windows or other platforms. I don't build or run under Windows myself. Please feel free to modify the test script if needed in order to get it to run under Windows.
Regards,
Kennard
Attachment:
misc01-testStack.csv
Description: Binary data