
[reSIProcate] Major Merge to Resiprocate SVN Trunk


Hi All,

I've spent the last number of days testing out and fixing up the Tekelec performance branch resip-b-TKLC-perf_work on Windows and Linux. I have just completed the merge of this branch with resip SVN trunk/mainline.

Thanks to Byron Campen for the initial contribution of the branch, and for testing on OS/X.

Summary of Major Changes (read below for more details):
1.  Many performance-related changes, including a new multi-threaded stack mode.  Using testStack (Registration Transactions over TCP), I measured a 15-30% performance increase under Windows (multi-core) and a 20-30% performance increase under Linux (single-core).  I expect much higher numbers for Linux on a multi-core system (especially when using the new multi-threaded stack feature), but I didn't have one available for testing.
2.  New Congestion Management framework.
3.  Reduced memory footprint.

The following notes highlight the changes in this merge.  Lots of neat new stuff.  Happy reading!

Merge Notes:

Various optimizations of Data
=============================

Made Data smaller, without sacrificing functionality. Data is 20 (56 vs 36) 
bytes smaller on 64-bit libs, and 4 (36 vs 32) bytes smaller on 32-bit libs. 
This was accomplished by making mSize, mCapacity, and mShareEnum 4-bytes on 
64-bit platforms (mShareEnum could be one byte, but it turns out this imposes a 
detectable performance penalty), and by having mShareEnum do double-duty as a 
null-terminator for mPreBuffer (Borrow==0), instead of requiring an extra byte 
at the end of mPreBuffer.

Several very simple functions have been inlined.

Functionality enhancements to a couple of functions:

- Data::md5() has been changed to Data::md5(Data::EncodingType type=HEX); this 
allows the output of md5() to be encoded as hex or Base64, or not encoded at all 
(binary).

- Data::replace(const Data& match, const Data& target) has been updated to 
Data::replace(const Data& match, const Data& target, int max=INT_MAX); this 
allows the maximum number of replacements to be specified.
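
To illustrate what a capped replace looks like, here is a minimal sketch written against std::string rather than Data. The name replaceUpTo and the choice to return the number of replacements are assumptions made for the example; this is not the resip implementation.

```cpp
#include <climits>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a capped replace, using std::string.
// Replaces up to `max` occurrences of `match` with `target`, scanning left to
// right, and returns the number of replacements made.
int replaceUpTo(std::string& s, const std::string& match,
                const std::string& target, int max = INT_MAX)
{
   if (match.empty())
   {
      return 0; // an empty match would never advance
   }
   int count = 0;
   std::size_t pos = 0;
   while (count < max && (pos = s.find(match, pos)) != std::string::npos)
   {
      s.replace(pos, match.size(), target);
      pos += target.size(); // skip the inserted text so it is not rescanned
      ++count;
   }
   return count;
}
```

With max left at its default, this behaves like an unbounded replace, which is why adding the parameter does not disturb existing callers.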

Lastly, a few specialized hashing and comparison functions have been added:

- Data::caseInsensitiveTokenHash(); this is a case-insensitive hash that assumes 
that the Data is an RFC 3261 token (eg; branch params). This character set has 
the property that no character is equal to any other character when bit 6 is 
masked out, except for the alphabetical characters. (For alphabetical 
characters, bit 6 specifies whether the character is upper/lower case) This 
means that we can mask out bit 6, and then use a case-sensitive hash algorithm 
on the resulting characters, without hurting the collision properties of the 
hash, and get a case-insensitive hash as a result. This hash function is based 
on the Hsieh hash.

- bool Data::caseInsensitiveTokenCompare(const Data& rhs); this is an equality 
comparison that assumes that both Datas are RFC 3261 tokens (eg; branch 
parameters). This function takes advantage of the same properties of the RFC 
3261 token character set as caseInsensitiveTokenHash(), by using a bitmask 
instead of a true lowercase() operation. This ends up being faster than 
strncasecmp().

- Data& schemeLowercase(); this is a variant of lowercase() that assumes the 
Data is an RFC 3261 scheme. This character set has the property that setting bit 
6 is a no-op, except for alphabetical characters (0-9, '+', '-', and '.' all 
already have bit 6 set). Setting bit 6 on an alphabetical character is equivalent 
to lower-casing the character. Note: There is no corresponding schemeUppercase() 
function, because clearing bit 6 will convert 0-9, '+', '-', and '.' into 
unprintable characters (well, '-' is turned into a CR, but you get the point).
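
The bit-6 trick described above can be sketched as follows. This uses std::string instead of Data, and the function names are made up for the example; it forces bit 6 (0x20) on rather than masking it out, which gives the same equality result for the token character set.

```cpp
#include <cstddef>
#include <string>

// Case-insensitive equality for two RFC 3261 tokens. Forces bit 6 (0x20) on in
// both operands before comparing. Only valid for the token character set; on
// arbitrary bytes it would report false matches (e.g. '@' vs '`').
bool tokenEqualNoCase(const std::string& a, const std::string& b)
{
   if (a.size() != b.size())
   {
      return false;
   }
   for (std::size_t i = 0; i < a.size(); ++i)
   {
      if ((a[i] | 0x20) != (b[i] | 0x20))
      {
         return false;
      }
   }
   return true;
}

// Lowercases an RFC 3261 scheme in place by setting bit 6 on every byte.
// Digits, '+', '-', and '.' already have the bit set, so they are unchanged.
void schemeLowercaseInPlace(std::string& s)
{
   for (std::size_t i = 0; i < s.size(); ++i)
   {
      s[i] = static_cast<char>(s[i] | 0x20);
   }
}
```

Neither loop branches on the character class, which is where the speedup over a general-purpose tolower() style comparison would come from.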



Performance improvements to ParseBuffer
=======================================

- Most functions that returned a Pointer now return a much more lightweight
   CurrentPosition object.
- Allow some of the simpler functions to be inlined
- Integer parsing code is more efficient, and overflow detection is better
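
As an illustration of overflow-checked integer parsing (a sketch only; parseUInt32 is a made-up name and this is not the ParseBuffer code), the check can be done before each multiply/add so the accumulator never wraps:

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>

// Parses the leading decimal digits of `s` into `out`. Returns false if there
// are no leading digits or the value does not fit in 32 bits. The range check
// runs before each multiply/add, so `value` never wraps around.
bool parseUInt32(const std::string& s, std::uint32_t& out)
{
   const std::uint32_t maxVal = std::numeric_limits<std::uint32_t>::max();
   std::uint32_t value = 0;
   std::size_t i = 0;
   for (; i < s.size() && s[i] >= '0' && s[i] <= '9'; ++i)
   {
      const std::uint32_t digit = static_cast<std::uint32_t>(s[i] - '0');
      if (value > (maxVal - digit) / 10)
      {
         return false; // value * 10 + digit would overflow
      }
      value = value * 10 + digit;
   }
   if (i == 0)
   {
      return false; // no digits at all
   }
   out = value;
   return true;
}
```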


Performance enhancements to DnsUtil
===================================

- DnsUtil::inet_ntop(): For some reason, the stock system inet_ntop 
  is dreadfully inefficient on OS X. A dirt-simple hand-rolled 
  implementation was 5-6 times as fast. This is shocking. The Linux 
  implementation is plenty efficient, though, so we're using the
  preprocessor to activate the hand-rolled code.

- DnsUtil::isIpV4Address(): The previous implementation used 
  sscanf(), which is pretty expensive. Hand-rolled some code that 
  is much faster.
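
For a sense of what a hand-rolled dotted-quad check can look like, here is a minimal single-pass sketch. The name looksLikeIpV4 and the exact acceptance rules (one to three digits per octet, values 0-255, no extra characters) are assumptions for the example and may differ from what DnsUtil actually accepts.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is exactly four decimal octets (0-255, one to three
// digits each) separated by single dots, with nothing before or after.
bool looksLikeIpV4(const std::string& s)
{
   int octets = 0;
   std::size_t i = 0;
   while (i < s.size())
   {
      int value = 0;
      int digits = 0;
      while (i < s.size() && s[i] >= '0' && s[i] <= '9')
      {
         value = value * 10 + (s[i] - '0');
         if (++digits > 3)
         {
            return false; // too many digits in one octet
         }
         ++i;
      }
      if (digits == 0 || value > 255)
      {
         return false;
      }
      ++octets;
      if (i == s.size())
      {
         break; // consumed the whole string
      }
      if (s[i] != '.' || octets == 4)
      {
         return false; // junk character, or a fifth field
      }
      ++i; // consume the dot
      if (i == s.size())
      {
         return false; // trailing dot
      }
   }
   return octets == 4;
}
```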


Reduced the memory footprint associated with storing URIs
=========================================================
- Removed the AOR caching stuff from Uri; it was horrifically inefficient. Checking
  for staleness of the cache was nearly as expensive as regenerating the AOR from 
  scratch. Not to mention that the AOR caching stuff took up a whopping 148 bytes
  of space on 64-bit platforms (4 Datas, and an int).

- Reworked the host canonicalization cache to take up less space, and be faster.
  Previously, the canonicalized host was put in a separate Data. We now canonicalize
  in-place, and use a bool to denote whether canonicalization has been performed yet.
  This saves us 32 bytes.

- Changed Data Uri::mEmbeddedHeadersText to an auto_ptr<>, since in most cases Uris don't
  use it. Also use auto_ptr for mEmbeddedHeaders, for consistency (it was already a pointer).


Change how branch parameters are encoded.
=========================================

Old format: z9hG4bK-d8754z-<branch>-<transportseq>-<clientData>-<sigcompCompartment>-d8754z-

New format: z9hG4bK-524287-<transportseq>-<clientData>-<sigcompCompartment>-<branch>

This format encodes faster, parses faster (with _much_ simpler code), and takes up
less space on the wire. We may decide to tweak the new resip cookie; I chose 
something that lets us use memcmp instead of strncasecmp, but the token character
set also has a bunch of non-alphanumeric characters we could use.

Also, some other small optimizations; avoid copies associated with calling
Data::base64encode()/base64decode() on empty Datas, and reorder the SIP cookie
comparisons to be more efficient.
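
To show why the new layout is simple to parse, here is a rough sketch (not the resip parser). It assumes the first three fields never contain a '-', so that everything after the third separator is the branch; the struct and function names are invented for the example. For brevity it also compares the RFC 3261 magic cookie with memcmp, which a real parser may want to treat more leniently.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical holder for the fields of the new branch layout.
struct BranchParts
{
   std::string transportSeq;
   std::string clientData;
   std::string sigcompCompartment;
   std::string branch;
};

// Splits a branch value in the new layout. Returns false if the value does not
// start with the magic cookie plus resip cookie, or has too few fields.
bool parseResipBranch(const std::string& value, BranchParts& out)
{
   static const char prefix[] = "z9hG4bK-524287-";
   const std::size_t prefixLen = sizeof(prefix) - 1;
   if (value.size() < prefixLen ||
       std::memcmp(value.data(), prefix, prefixLen) != 0)
   {
      return false; // not one of ours
   }
   std::string* fields[] = { &out.transportSeq, &out.clientData,
                             &out.sigcompCompartment };
   std::size_t pos = prefixLen;
   for (std::string* field : fields)
   {
      const std::size_t dash = value.find('-', pos);
      if (dash == std::string::npos)
      {
         return false;
      }
      field->assign(value, pos, dash - pos);
      pos = dash + 1;
   }
   // Whatever remains is the branch, including any dashes it contains.
   out.branch.assign(value, pos, std::string::npos);
   return true;
}
```

Putting the free-form branch last means the parser never has to search for a trailing delimiter, which is what the old format's closing cookie was for.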


State shedding modifications to TransactionState
================================================
In a number of cases, we were preserving state (in the form of SipMessages
and DnsResults) in cases where we did not really need them any more. For
example, once we have transmitted a response, there is no need
to preserve the full SipMessage for this response (the raw retransmit buffer
is sufficient). Also, INVITE requests do not need to be maintained once
a final response comes in (since there is no possibility that we'll need to
send a simulated 408 or 503 to the TU, nor will we need to construct a CANCEL
request using the INVITE, nor will we need to retransmit). Similarly, once we
have received a final response for a NIT transaction, we no longer need to
maintain the original request or the retransmit buffer. Lastly, if we are
using a reliable transport, we do not need to maintain retransmit buffers
(although we may need to maintain full original requests for simulated
responses and such).

This change has basically no impact on reliable NIT performance, but a huge
impact on non-reliable and INVITE performance. Prior to this change, either
NIT UDP or INVITE TCP testStack would exhaust main memory on my laptop (with
4GB of main memory), bringing progress to a complete halt on runs longer than
15 seconds or so. I did not bother trying INVITE UDP, but that works now too.


Reduction in buffer reallocations while encoding a SipMessage
=============================================================
TransportSelector now keeps a moving average of the outgoing message size,
which is used to preallocate the buffers into which SipMessages are encoded.

This ends up making a small difference in testStack when linked against google
malloc, but a larger difference when linked against OS X's (horrible) standard
malloc.
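
A minimal sketch of the idea, assuming an exponential moving average with weight 1/8 and 25% headroom (the notes above do not say which kind of average or how much slack the real code uses; SizeEstimator is a made-up name):

```cpp
#include <cstddef>

// Tracks a moving average of encoded message sizes and suggests how much to
// reserve for the next encode buffer.
class SizeEstimator
{
   public:
      explicit SizeEstimator(std::size_t initial) : mAverage(initial) {}

      // Fold a new sample into the average with weight 1/8.
      void record(std::size_t encodedSize)
      {
         mAverage = (mAverage * 7 + encodedSize) / 8;
      }

      // Reserve a bit more than the average so typical messages fit without
      // a reallocation.
      std::size_t suggestedReserve() const
      {
         return mAverage + mAverage / 4;
      }

   private:
      std::size_t mAverage;
};
```

A caller would reserve suggestedReserve() bytes before encoding and call record() with the final encoded length afterwards.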


Multiple Threads in the Stack
=============================
Allow transaction processing, transport processing, and DNS processing to be 
broken off into separate threads.

- SipStack::run() creates and starts three threads: a 
TransactionControllerThread, a TransportSelectorThread, and a DnsThread. You 
continue to use stuff like StackThread and EventStackThread to give cycles to 
the rest of the stack (mainly processing app timers and statistics logging); the 
SipStack is smart enough to unhook these three things from the normal event loop 
API when they have their own threads. In other words, to use the new 
multi-threaded mode, all you have to do is throw in a call to SipStack::run() 
before you fire up your normal SipStack processing, and a 
SipStack::shutdownAndJoinThreads() when you're done.

- In the Connection read/write code, process reads/writes until EAGAIN, or we 
run out of stuff to send. Gives a healthy performance boost on connection-based 
transports.

- In TransactionController, put transaction timers in their own fifo. This 
prevents timers from firing late when the state machine fifo gets congested. 
Also, process at most 16 TransactionMessages from the state machine fifo at a 
time, to prevent starving other parts of the system.

- Unhook the TransactionController's processing loop from that of the 
TransportSelector. This simplifies this API considerably, but required the 
addition of a new feature to Fifo. Fifo can now take an (optional) 
AsyncProcessHandler* that will be notified when the fifo goes from empty to 
non-empty. Actually pretty useful.

- Allow setPollGrp() to be called multiple times on the various classes that 
have this function. This allows the FdPollGrp to be re-set when the SipStack 
enters multithreaded mode.

- Added a "multithreadedstack" --thread-type option to testStack. Exercise this 
option in testStackStd.sh

- Added the ability to run any of the existing Transport objects in their own 
thread, by a combination of a new transport flag 
(RESIP_TRANSPORT_FLAG_OWNTHREAD), and a new TransportThread class. Added support 
for this mode to testStack using the --tf option. Also exercised this feature in 
testStackStd.sh.

- Installed SelectInterruptors at the TransportSelector, each Transport object, 
and the DnsStub (this last one required moving SelectInterruptor to rutil). This 
is critical to making multithreaded mode work in a performant manner, and 
imposes almost no performance penalty due to the way they are invoked.

- SipStack now creates its own SelectInterruptor if one is not supplied 
externally. This is because it is critical to be able to wake the 
TransactionController up when new work comes down from the TU, or from the 
transports.
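
The empty-to-non-empty notification and the bounded batch size mentioned above can be sketched like this. It is a simplified stand-in, not the resip Fifo: WakeupHandler, its onNonEmpty() method, NotifyingFifo, and CountingHandler are all names invented for the example.

```cpp
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical stand-in for the handler interface; called when work arrives
// in a fifo that was previously empty.
class WakeupHandler
{
   public:
      virtual ~WakeupHandler() {}
      virtual void onNonEmpty() = 0;
};

// Trivial handler that just counts wakeups (a real one would interrupt a
// blocked select/poll call).
struct CountingHandler : public WakeupHandler
{
   int count = 0;
   void onNonEmpty() override { ++count; }
};

template <typename T>
class NotifyingFifo
{
   public:
      explicit NotifyingFifo(WakeupHandler* handler = nullptr)
         : mHandler(handler) {}

      void add(const T& item)
      {
         bool wasEmpty;
         {
            std::lock_guard<std::mutex> lock(mMutex);
            wasEmpty = mItems.empty();
            mItems.push_back(item);
         }
         // Notify outside the lock, and only on the empty -> non-empty edge.
         if (wasEmpty && mHandler)
         {
            mHandler->onNonEmpty();
         }
      }

      // Moves at most `max` items into `out`; returns how many were moved.
      std::size_t drain(std::deque<T>& out, std::size_t max)
      {
         std::lock_guard<std::mutex> lock(mMutex);
         std::size_t n = 0;
         for (; n < max && !mItems.empty(); ++n)
         {
            out.push_back(mItems.front());
            mItems.pop_front();
         }
         return n;
      }

   private:
      std::mutex mMutex;
      std::deque<T> mItems;
      WakeupHandler* mHandler;
};
```

Notifying only on the edge keeps the cost of a wakeup off the hot path when the consumer is already busy, and capping each drain (16 in the TransactionController case) gives other work a turn.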


New congestion-management framework
===================================
Notable features include:
* Allow testStack, tfm/repro/sanityTests, and repro to be run with a congestion 
   manager with the --use-congestion-manager flag.

* Efficient wait-time estimation in AbstractFifo; keeps track of how rapidly
   messages are consumed, allowing good estimates of how long a new message will
   take to be serviced. More efficient than the time-depth logic in 
   TimeLimitFifo, and a better predictor too.

* The ability to shed load at the transport level when the TransactionController
   is congested, in a very efficient manner, using new functionality in Helper
   and SipMessage (Helper::makeRawResponse() and 
   SipMessage::encodeSingleHeader())

* The ability to shed load coming from the TU when the TransactionController is 
   congested. This is crucial when congestion is being caused by a TU trying to 
   do too much.

* Changed the way load-shedding is handled for TransactionUsers to use the new
   API

* A flexible congestion-management API, allowing load-shedding decisions to be
   made in an arbitrary fashion.

* A generalized CongestionManager implementation that is powerful enough to be
   useful.

* The TransactionController will now defer retransmissions of requests if 
   sufficiently congested (ie; the response is probably stuck in mStateMacFifo)

* The TransactionController now determines its hostname with a single call to 
   DnsUtil::getLocalHostName() on construction, for use in 503s. Previously, it 
   would make this call every time a 503 was sent; this call blocks sometimes!

* Don't call DnsResult::blacklistLast() on a Retry-After: 0

* Several fixes to the processing loop in testStack that were causing 
   starvation of one type of work or another when congestion occurred.
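
To make the wait-time estimation idea concrete, here is a small sketch. It assumes the estimate is simply queue depth times a smoothed per-message service time; the class name, the smoothing weight, and the microsecond units are assumptions for the example, not details of AbstractFifo.

```cpp
#include <cstddef>
#include <cstdint>

// Estimates how long a newly queued message will wait, from the current queue
// depth and a smoothed per-message service time.
class WaitEstimator
{
   public:
      WaitEstimator() : mAvgServiceMicros(0), mDepth(0) {}

      // Call when a message is queued.
      void onAdd() { ++mDepth; }

      // Call when the consumer has taken `count` messages off the queue, having
      // spent `elapsedMicros` in total servicing them.
      void onConsumed(std::size_t count, std::uint64_t elapsedMicros)
      {
         if (count == 0)
         {
            return;
         }
         mDepth = (count > mDepth) ? 0 : mDepth - count;
         const std::uint64_t perMessage = elapsedMicros / count;
         // First sample seeds the average; later samples get weight 1/4.
         mAvgServiceMicros = (mAvgServiceMicros == 0)
            ? perMessage
            : (mAvgServiceMicros * 3 + perMessage) / 4;
      }

      // Expected wait for a message added right now.
      std::uint64_t expectedWaitMicros() const
      {
         return mAvgServiceMicros * mDepth;
      }

   private:
      std::uint64_t mAvgServiceMicros;
      std::size_t mDepth;
};
```

A congestion manager could compare expectedWaitMicros() against a tolerance and reject new work when it is exceeded, without having to timestamp every queued message.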


Other Misc Enhancements
=======================
-Small efficiency improvement in Random::getCryptoRandom(int)
 Random::getCryptoRandom(unsigned int len) was implemented by calling 
 Random::getCryptoRandom() repeatedly, and collecting the return values 
 in a buffer. In the openssl case, we now use a single call to RAND_bytes().
-Use a priority_queue instead of a multiset for storing timers.
-Slight refactoring of Timer so that transaction timers and payload timers (ie; 
 timers that carry a Message*) are separate classes. Transaction timers no longer 
 have an unused Message* member, and payload timers no longer have the unused 
 transaction-id, Timer::Type, and duration. This saves a _lot_ of memory for apps 
 that use lots of app timers with long lifetimes.
-Less wasteful population of Call-IDs: 
 When generating Call-IDs, Helper was computing an md5 hash of the hostname and 
 some salt, hex-encoding it, and then Base64 encoding the hex data. We now Base64 
 encode the md5 hash directly. This is less computationally expensive, requires 
 less memory because the resulting string is half the size, and requires fewer 
 bytes on the wire.
-Make TransactionMap case-insensitive; Data::caseInsensitiveTokenHash() is fast
 enough that performance actually increases a little.
-std::bitset-based parsing in a number of places.
-Don't check whether the encoding tables are initted for every single
 character; check once before the encode operation begins. Also, checking
 the value of a static bool to determine whether an init has been carried
 out is pointless; that bool might not be initted yet, and it could have
 any value. The static init code now copes with both accesses to the encoding
 tables during static initialization, and from multiple threads during runtime.
-Don't bother generating a transaction identifier unless the parse fails
 to extract one.
-Some refactoring of the FdPollGrp stuff. Now is compatible with cares, using
 a bit of a hack. Also compatible with being driven with the old buildFdSet()/
 select()/process(FdSet&) call sequence, although this is now deprecated.
 Fixing these compatibility problems allowed us to switch over to using FdPollGrp
 in all cases, instead of having dual mode everywhere.
-Buffer classes for Fifo to reduce lock contention. Using them in a few places, will
 use them in more once we phase out TimeLimitFifo with the new congestion management
 code.
-Use the --ignore-case option for generation of ParameterHash.cxx, instead of the
 nasty sed rewriting we are using now. Should also be slightly faster, since gperf
 handles case-insensitive hashing more efficiently than our hack was.
-Adding a local memory pool to SipMessage, to cut down (dramatically) on
 heap allocation overhead. Some minor refactoring to free up wasted space
 in SipMessage as well (makes more room for the pool). Changing the way
 the start-line is stored to no longer use a full-blown ParserContainer+
 HeaderFieldValueList. Lots of opportunistic doxygen merging.
 Up to 20K NIT transactions per second on my machine, roughly a doubling
 in performance.
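
For the priority_queue-based timer storage mentioned in the list above, a minimal sketch looks like this (TimerEntry, TimerHeap, and popExpired are invented names; the real Timer classes carry more than an id):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

struct TimerEntry
{
   std::uint64_t whenMs; // absolute expiry time
   int id;               // hypothetical payload
   bool operator>(const TimerEntry& rhs) const { return whenMs > rhs.whenMs; }
};

// Min-heap on expiry time: top() is always the next timer due.
typedef std::priority_queue<TimerEntry, std::vector<TimerEntry>,
                            std::greater<TimerEntry> > TimerHeap;

// Pops every timer due at or before `nowMs`, appending its id to `fired`.
// Returns how many timers fired.
std::size_t popExpired(TimerHeap& heap, std::uint64_t nowMs,
                       std::vector<int>& fired)
{
   std::size_t n = 0;
   while (!heap.empty() && heap.top().whenMs <= nowMs)
   {
      fired.push_back(heap.top().id);
      heap.pop();
      ++n;
   }
   return n;
}
```

Compared with a multiset, the heap keeps its entries in one contiguous vector instead of one tree node per timer; the trade-off is that std::priority_queue offers no way to erase an arbitrary entry.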


Bug Fixes
=========
-Use getaddrinfo() instead of the non-threadsafe gethostbyname().
-Remove unused (and non-threadsafe) Timer::mTimerCount/Timer::mId.
 Previously, all timers were assigned a "unique" (not really, more on that in a 
 moment) integer identifier. There is no place in the resip codebase that 
 actually uses this identifier in any way. For transaction timers, this 
 identifier is in principle unnecessary, since there is more than sufficient 
 identifier information present already (the transaction id and timer type). When 
 passing a Message* into the timer queue, a unique identifier already exists; the 
 Message* itself (if potential use of this Message* bugs you, you can always turn 
 it into a handle by applying some sort of transformation to it). This identifier 
 is unnecessary in every case I can think of. In addition, the values are 
 assigned simply by incrementing a global variable (Timer::mTimerCount), with no 
 threadsafety measures whatsoever, so it is not even guaranteed to be unique. 
 Because of all this, it has been removed. As a bonus, this saves some memory; 8 
 bytes per timer on 64-bit platforms, which adds up to around 3MB when testStack 
 steady state has close to 400000 timers in the timer queues at any given point. 
 This could be an even larger amount for TUs that schedule lots of long-lifetime 
 timers (Timer C, for instance).
-Get rid of a wasteful double-encode, in Message.cxx
-minor Windows build fixes to avoid file-in-use errors when building dum test projects
-fix testDigestAuthentication for recent lowercase nonce change
-fixed a nasty bug in NameAddr, where unknown uri parameters on a 
 NameAddr/Uri with no angle brackets are treated as NameAddr parameters.  When this was
 done, the memory backing these parameters was only a temporary Data object.
-fix bug in Data: if a Data is wrapping memory allocated externally (ie. Share mode = BORROW)
 and you start appending to it, it is possible for the append method to write a null
 character off the end of the buffer.  Changed the resize condition to make the buffer
 larger 1 character sooner, to accommodate this.
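
The off-by-one in that last fix can be pictured with the following sketch. It assumes capacity counts every usable byte of the borrowed buffer and that append writes a terminator at index size + add; the function names and the exact conditions are illustrative guesses, not the actual Data code.

```cpp
#include <cstddef>

// Old-style check (assumed): grows only when the new contents exceed the
// capacity, so contents that exactly fill a borrowed buffer leave no room for
// the terminating null.
bool needsGrowOld(std::size_t size, std::size_t add, std::size_t capacity)
{
   return size + add > capacity;
}

// Fixed check (assumed): grows one character sooner, so the terminator written
// at index size + add always lands inside the buffer.
bool needsGrowFixed(std::size_t size, std::size_t add, std::size_t capacity)
{
   return size + add >= capacity;
}
```

For an 8-byte borrowed buffer holding 6 characters, appending 2 more passes the old check without growing, and the terminator would then be written at index 8, one past the end.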


Best Regards,
Scott Godin
SIP Spectrum, Inc.