I've spent the last number of days testing out and fixing up the Tekelec performance branch resip-b-TKLC-perf_work on Windows and Linux. I have just completed the merge of this branch with resip SVN trunk/mainline.
Thanks to Byron Campen for the initial contribution of the branch, and for testing on OS/X.
1. Many performance related changes, including a new multi-threaded stack mode. Using testStack (Registration Transactions over TCP), I measured a 15-30% performance increase under Windows (multi-core) and a 20-30% performance increase under linux (single-core). I expect much higher numbers for linux on a multicore system (especially when using the new multithreaded stack feature), but I didn't have one available for testing.
2. New Congestion Management framework.
3. Reduced memory footprint.
The following notes highlight the changes in this merge. Lot's of neat new stuff. Happy reading!
Merge Notes:
Various optimizations of Data
=============================
Made Data smaller, without sacrificing functionality. Data is 20 (56 vs 36)
bytes smaller on 64-bit libs, and 4 (36 vs 32) bytes smaller on 32-bit libs.
This was accomplished by making mSize, mCapacity, and mShareEnum 4-bytes on
64-bit platforms (mShareEnum could be one byte, but it turns out this imposes a
detectable performance penalty), and by having mShareEnum do double-duty as a
null-terminator for mPreBuffer (Borrow==0), instead of requiring an extra byte
at the end of mPreBuffer.
Several very simple functions have been inlined.
Functionality enhancements to a couple of functions:
- Data::md5() has been changed to Data::md5(Data::EncodingType type=HEX); this
allows the output of md5() to be encoded as hex or Base64, or not encoded at all
(binary).
- Data::replace(const Data& match, const Data& target) has been updated to
Data::replace(const Data& match, const Data& target, int max=INT_MAX); this
allows the maximum number of replacements to be specified.
Lastly, a few specialized hashing and comparison functions have been added:
- Data::caseInsensitiveTokenHash(); this is a case-insensitive hash that assumes
that the Data is an RFC 3261 token (eg; branch params). This character set has
the property that no character is equal to any other character when bit 6 is
masked out, except for the alphabetical characters. (For alphabetical
characters, bit 6 specifies whether the character is upper/lower case) This
means that we can mask out bit 6, and then use a case-sensitive hash algorithm
on the resulting characters, without hurting the collision properties of the
hash, and get a case-insensitive hash as a result. This hash function is based
on the Hsieh hash.
- bool Data::caseInsensitiveTokenCompare(const Data& rhs); this is an equality
comparison that assumes that both Datas are RFC 3261 tokens (eg; branch
parameters). This function takes advantage of the same properties of the RFC
3261 token character set as caseInsensitiveTokenHash(), by using a bitmask
instead of a true lowercase() operation. This ends up being faster than
strncasecmp().
- Data& schemeLowercase(); this is a variant of lowercase() that assumes the
Data is an RFC 3261 scheme. This character set has the property that setting bit
6 is a no-op, except for alphabetical characters (0-9, '+', '-', and '.' all
already have bit 6 set). Setting bit 6 on a alphabetical character is equivalent
to lower-casing the character. Note: There is no corresponding schemeUppercase()
function, because clearing bit 6 will convert 0-9, '+', '-', and '.' into
unprintable characters (well, '-' is turned into a CR, but you get the point).
Performance improvements to ParseBuffer
=======================================
- Most functions that returned a Pointer now return a much more lightweight
CurrentPosition object.
- Allow some of the simpler functions to be inlined
- Integer parsing code is more efficient, and overflow detection is better
Performance enhancements to DnsUtil
===================================
- DnsUtil::inet_ntop(): For some reason, the stock system inet_ntop
is dreadfully inefficient on OS X. A dirt-simple hand-rolled
implementation was 5-6 times as fast. This is shocking. The Linux
implementation is plenty efficient, though, so we're using
preprocessor to activate the hand-rolled code.
- DnsUtil::isIpV4Address(): The implementation uses
sscanf(), which is pretty expensive. Hand-rolled some code that
is much faster.
Reduced the memory footprint associated with storing URIs
=========================================================
- Removed the AOR cacheing stuff from Uri; it was horrifically inefficient. Checking
for staleness of the cache was nearly as expensive as regenerating the AOR from
scratch. Not to mention that the AOR cacheing stuff took up a whopping 148 bytes
of space on 64-bit platforms (4 Datas, and an int).
- Reworked the host canonicalization cache to take up less space, and be faster.
Previously, the canonicalized host was put in a separate Data. We now canonicalize
in-place, and use a bool to denote whether canonicalization has been performed yet.
This saves us 32 bytes.
- Changed Data Uri::mEmbeddedHeadersText to an auto_ptr<>, since in most cases Uris don't
use it. Also use auto_ptr for mEmbeddedHeaders (was already a pointer, for consistency).
Change how branch parameters are encoded.
=========================================
Old format: z9hG4bK-d8754z-<branch>-<transportseq>-<clientData>-<sigcompCompartment>-d8754z-
New Format: z9hG4bK-524287-<transportseq>-<clientData>-<sigcompComprtment>-<branch>
This format encodes faster, parses faster (with _much_ simpler code), and takes up
less space on the wire. We may decide to tweak the new resip cookie; I chose
something that we can use memcmp instead of strncasecmp with, but the token character
set has a bunch of characters that aren't alphanumeric we could use.
Also, some other small optimizations; avoid copies associated with calling
Data::base64encode()/base64decode() on empty Datas, and reorder the SIP cookie
comparisons to be more efficient.
State shedding modifications to TransactionState
================================================
In a number of cases, we were preserving state (in the form of SipMessages
and DnsResults) in cases where we did not really need them any more. For
example, once we have transmitted a response, there is no need
to preserve the full SipMessage for this response (the raw retransmit buffer
is sufficient). Also, INVITE requests do not need to be maintained once
a final response comes in (since there is no possibility that we'll need to
send a simulated 408 or 503 to the TU, nor will we need to construct a CANCEL
request using the INVITE, nor will we need to retransmit). Similarly, once we
have received a final response for a NIT transaction, we no longer need to
maintain the original request or the retransmit buffer. Lastly, if we are
using a reliable transport, we do not need to maintain retransmit buffers
(although we may need to maintain full original requests for simulated
responses and such).
This change has basically no impact on reliable NIT performance, but a huge
impact on non-reliable and INVITE performance. Prior to this change, either
NIT UDP or INVITE TCP testStack would exhaust main memory on my laptop (with
4GB of main memory), bringing progress to a complete halt on runs longer than
15 seconds or so. I did not bother trying INVITE UDP, but that works now too.
Reduction in buffer reallocations while encoding a SipMessage
=============================================================
TransportSelector now keeps a moving average of the outgoing message size,
which is used to preallocate the buffers into which SipMessages are encoded.
This ends up making a small difference in testStack when linked against google
malloc, but a larger difference when linked against OS X's (horrible) standard
malloc.
Multiple Threads in the Stack
=============================
Allow transaction processing, transport processing, and DNS processing to be
broken off into separate threads.
- SipStack::run() causes the creation and run of three threads; a
TransactionControllerThread, and TransportSelectorThread, and a DnsThread. You
continue to use stuff like StackThread and EventStackThread to give cycles to
the rest of the stack (mainly processing app timers and statistics logging); the
SipStack is smart enough to unhook these three things from the normal event loop
API when they have their own threads. In other words, to use the new
multi-threaded mode, all you have to do is throw in a call to SipStack::run()
before you fire up your normal SipStack processing, and a
SipStack::shutdownAndJoinThreads() when you're done.
- In the Connection read/write code, process reads/writes until EAGAIN, or we
run out of stuff to send. Gives a healthy performance boost on connection-based
transports.
- In TransactionController, put transaction timers in their own fifo. This
prevents timers from firing late when the state machine fifo gets congested.
Also, process at most 16 TransactionMessages from the state machine fifo at a
time, to prevent starving other parts of the system.
- Unhook the TransactionController's processing loop from that of the
TransportSelector. This simplifies this API considerably, but required the
addition of a new feature to Fifo. Fifo can now take an (optional)
AsyncProcessHandler* that will be notified when the fifo goes from empty to
non-empty. Actually pretty useful.
- Allow setPollGrp() to be called multiple times on the various classes that
have this function. This allows the FdPollGrp to be re-set when the SipStack
enters multithreaded mode.
- Added a "multithreadedstack" --thread-type option to testStack. Exercise this
option in testStackStd.sh
- Added the ability to run any of the existing Transport objects in their own
thread, by a combination of a new transport flag
(RESIP_TRANSPORT_FLAG_OWNTHREAD), and a new TransportThread class. Added support
for this mode to testStack using the --tf option. Also exercised this feature in
testStackStd.sh.
- Installed SelectInterruptors at the TransportSelector, each Transport object,
and the DnsStub (this last one required moving SelectInterruptor to rutil). This
is critical to making multithreaded mode work in a performant manner, and
imposes almost no performance penalty due to the way they are invoked.
- SipStack now creates its own SelectInterruptor if one is not supplied
externally. This is because it is critical to be able to wake the
TransactionController up when new work comes down from the TU, or from the
transports.
New congestion-management framework
===================================
Notable features include:
* Allow testStack, tfm/repro/sanityTests, and repro to be run with a congestion
manager with the --use-congestion-manager flag.
* Efficient wait-time estimation in AbstractFifo; keeps track of how rapidly
messages are consumed, allowing good estimates of how long a new message will
take to be serviced. More efficient than the time-depth logic in
TimeLimitFifo, and a better predictor too.
* The ability to shed load at the transport level when the TransactionController
is congested, in a very efficient manner, using new functionality in Helper
and SipMessage (Helper::makeRawResponse() and
SipMessage::encodeSingleHeader())
* The ability to shed load coming from the TU when the TransactionController is
congested. This is crucial when congestion is being caused by a TU trying to
do too much.
* Changed the way load-shedding is handled for TransactionUsers to use the new
API
* A flexible congestion-management API, allowing load-shedding decisions to be
made in an arbitrary fashion.
* A generalized CongestionManager implementation that is powerful enough to be
useful.
* The TransactionController will now defer retransmissions of requests if
sufficiently congested (ie; the response is probably stuck in mStateMacFifo)
* The TransactionController now determines its hostname with a single call to
DnsUtil::getLocalHostName() on construction, for use in 503s. Previously, it
would make this call every time a 503 was sent; this call blocks sometimes!
* Don't call DnsResult::blacklistLast() on a Retry-After: 0
* Several fixes the the processing loop in testStack that were causing
starvation of one type of work or another when congestion occurred.
Other Misc Enhancements
=======================
-Small efficiency improvement in Random::getCryptoRandom(int)
Random::getCryptoRandom(unsigned int len) was implemented by calling
Random::getCryptoRandom() repeatedly, and collecting the return values
in a buffer. In the openssl case, we now use a single call to RAND_bytes().
-Use a priority_queue instead of a multiset for storing timers.
-Slight refactoring of Timer so that transaction timers and payload timers (ie;
timers that carry a Message*) are separate classes. Transaction timers no longer
have an unused Message* member, and payload timers no longer have the unused
transaction-id, Timer::Type, and duration. This saves a _lot_ of memory for apps
that use lots of app timers with long lifetimes.
-Less wasteful population of Call-IDs:
When generating Call-IDs, Helper was computing an md5 hash of the hostname and
some salt, hex-encoding it, and then Base64 encoding the hex data. We now Base64
encode the md5 hash directly. This is less computationally expensive, requires
less memory because the resulting string is half the size, and requires fewer
bytes on the wire.
-Make TransactionMap case-insensitive; Data::caseInsensitiveTokenHash() is fast
enough that performance actually increases a little.
-std::bitset-based parsing in a number of places.
-Don't check whether the encoding tables are initted for every single
character; check once before the encode operation begins. Also, checking
the value of a static bool to determine whether an init has been carried
out is pointless; that bool might not be initted yet, and it could have
any value. The static init code now copes with both accesses to the encoding
tables during static initialization, and from multiple threads during runtime.
-Don't bother generating a transaction identifier unless the parse fails
to extract one.
-Some refactoring of the FdPollGrp stuff. Now is compatible with cares, using
a bit of a hack. Also compatible with being driven with the old buildFdSet()/
select()/process(FdSet&) call sequence, although this is now deprecated.
Fixing these compatibility problems allowed us to switch over to using FdPollGrp
in all cases, instead of having dual mode everywhere.
-Buffer classes for Fifo to reduce lock contention. Using them in a few places, will
use them in more once we phase out TimeLimitFifo with the new congestion management
code.
-Use the --ignore-case option for generation of ParameterHash.cxx, instead of the
nasty sed rewriting we are using now. Should also be slightly faster, since gperf
handles case-insensitive hashing more efficiently than our hack was.
-Adding a local memory pool to SipMessage, to cut down (dramatically) on
heap allocation overhead. Some minor refactoring to free up wasted space
in SipMessage as well (makes more room for the pool). Changing the way
the start-line is stored to no longer use a full-blown ParserContainer+
HeaderFieldValueList. Lots of opportunistic doxygen merging.
Up to 20K NIT transactions per second on my machine, roughly a doubling
in performance.
Bug Fixes
=========
-Use getaddrinfo() instead of the non-threadsafe gethostbyname().
-Remove unused (and non-threadsafe) Timer::mTimerCount/Timer::mId.
Previously, all timers were assigned a "unique" (not really, more on that in a
moment) integer identifier. There is no place in the resip codebase that
actually uses this identifier in any way. For transaction timers, this
identifier is in principle unnecessary, since there is more than sufficient
identifier information present already (the transaction id and timer type). When
passing a Message* into the timer queue, a unique identifier already exists; the
Message* itself (if potential use of this Message* bugs you, you can always turn
it into a handle by applying some sort of transformation to it). This identifier
is unnecessary in every case I can think of. In addition, the values are
assigned simply by incrementing a global variable (Timer::mTimerCount), with no
threadsafety measures whatsoever, so it is not even guaranteed to be unique.
Because of all this, it has been removed. As a bonus, this saves some memory; 8
bytes per timer on 64-bit platforms, which adds up to around 3MB when testStack
steady state has close to 400000 timers in the timer queues at any given point.
This could be an even larger amount for TUs that schedule lots of long-lifetime
timers (Timer C, for instance).
-Get rid of a wasteful double-encode, in Message.cxx
-minor windows build fixes to avoid file in use errors when building dum test projects
-fix testDigestAuthentiation for recent lowercase nonce change
-fixed a nasty bug in NameAddr - where unknown parameters uri parameters on a
NameAddr/Uri with no angle brackets are treated as NameAddr parameters. When this is
done, the memory for these parameters was only a temporary Data object.
-fix bug in Data. If Data is wrapping memory allocated externally (ie. Share mode = BORROW)
and you start appending to it. It is possible that the append method will write a NULL
character off the end of the buffer. Changed the resize condition to make the buffer
larger 1 character sooner, to accommodate for this.
Best Regards,
SIP Spectrum, Inc.