[reSIProcate] Major Merge to Resiprocate SVN Trunk

Wed Feb 1 11:47:31 CST 2012

Hi All,

I've spent the last number of days testing out and fixing up the Tekelec
performance branch resip-b-TKLC-perf_work on Windows and Linux. I have just
completed the merge of this branch with resip SVN trunk/mainline.

Thanks to Byron Campen for the initial contribution of the branch, and for
testing on OS/X.

Summary of Major Changes (read below for more details):
1.  Many performance related changes, including a new multi-threaded stack
mode.  Using testStack (Registration Transactions over TCP), I measured a
15-30% performance increase under Windows (multi-core) and a 20-30%
performance increase under linux (single-core).  I expect much higher
numbers for linux on a multicore system (especially when using the new
multithreaded stack feature), but I didn't have one available for
testing.
2.  New Congestion Management framework.
3.  Reduced memory footprint.

The following notes highlight the changes in this merge.  Lot's of neat new
stuff.  Happy reading!

Merge Notes:

Various optimizations of Data
=============================

Made Data smaller, without sacrificing functionality. Data is 20 (56 vs 36)
bytes smaller on 64-bit libs, and 4 (36 vs 32) bytes smaller on 32-bit
libs.
This was accomplished by making mSize, mCapacity, and mShareEnum 4-bytes on
64-bit platforms (mShareEnum could be one byte, but it turns out this
imposes a
detectable performance penalty), and by having mShareEnum do double-duty as
a
null-terminator for mPreBuffer (Borrow==0), instead of requiring an extra
byte
at the end of mPreBuffer.

Several very simple functions have been inlined.

Functionality enhancements to a couple of functions:

- Data::md5() has been changed to Data::md5(Data::EncodingType type=HEX);
this
allows the output of md5() to be encoded as hex or Base64, or not encoded
at all
(binary).

- Data::replace(const Data& match, const Data& target) has been updated to
Data::replace(const Data& match, const Data& target, int max=INT_MAX); this
allows the maximum number of replacements to be specified.

Lastly, a few specialized hashing and comparison functions have been added:

- Data::caseInsensitiveTokenHash(); this is a case-insensitive hash that
assumes
that the Data is an RFC 3261 token (eg; branch params). This character set
has
the property that no character is equal to any other character when bit 6
is
masked out, except for the alphabetical characters. (For alphabetical
characters, bit 6 specifies whether the character is upper/lower case) This
means that we can mask out bit 6, and then use a case-sensitive hash
algorithm
on the resulting characters, without hurting the collision properties of
the
hash, and get a case-insensitive hash as a result. This hash function is
based
on the Hsieh hash.

- bool Data::caseInsensitiveTokenCompare(const Data& rhs); this is an
equality
comparison that assumes that both Datas are RFC 3261 tokens (eg; branch
parameters). This function takes advantage of the same properties of the
RFC
3261 token character set as caseInsensitiveTokenHash(), by using a bitmask
instead of a true lowercase() operation. This ends up being faster than
strncasecmp().

- Data& schemeLowercase(); this is a variant of lowercase() that assumes
the
Data is an RFC 3261 scheme. This character set has the property that
setting bit
6 is a no-op, except for alphabetical characters (0-9, '+', '-', and '.'
all
already have bit 6 set). Setting bit 6 on a alphabetical character is
equivalent
to lower-casing the character. Note: There is no corresponding
schemeUppercase()
function, because clearing bit 6 will convert 0-9, '+', '-', and '.' into
unprintable characters (well, '-' is turned into a CR, but you get the
point).

Performance improvements to ParseBuffer
=======================================

- Most functions that returned a Pointer now return a much more lightweight
   CurrentPosition object.
- Allow some of the simpler functions to be inlined
- Integer parsing code is more efficient, and overflow detection is better

Performance enhancements to DnsUtil
===================================

- DnsUtil::inet_ntop(): For some reason, the stock system inet_ntop
  is dreadfully inefficient on OS X. A dirt-simple hand-rolled
  implementation was 5-6 times as fast. This is shocking. The Linux
  implementation is plenty efficient, though, so we're using
  preprocessor to activate the hand-rolled code.

- DnsUtil::isIpV4Address(): The implementation uses
  sscanf(), which is pretty expensive. Hand-rolled some code that
  is much faster.

Reduced the memory footprint associated with storing URIs
=========================================================
- Removed the AOR cacheing stuff from Uri; it was horrifically inefficient.
Checking
  for staleness of the cache was nearly as expensive as regenerating the
AOR from
  scratch. Not to mention that the AOR cacheing stuff took up a whopping
148 bytes
  of space on 64-bit platforms (4 Datas, and an int).

- Reworked the host canonicalization cache to take up less space, and be
faster.
  Previously, the canonicalized host was put in a separate Data. We now
canonicalize
  in-place, and use a bool to denote whether canonicalization has been
performed yet.
  This saves us 32 bytes.

- Changed Data Uri::mEmbeddedHeadersText to an auto_ptr<>, since in most
cases Uris don't
  use it. Also use auto_ptr for mEmbeddedHeaders (was already a pointer,
for consistency).

Change how branch parameters are encoded.
=========================================

Old format:
z9hG4bK-d8754z-<branch>-<transportseq>-<clientData>-<sigcompCompartment>-d8754z-

New Format:
z9hG4bK-524287-<transportseq>-<clientData>-<sigcompComprtment>-<branch>

This format encodes faster, parses faster (with _much_ simpler code), and
takes up
less space on the wire. We may decide to tweak the new resip cookie; I
chose
something that we can use memcmp instead of strncasecmp with, but the token
character
set has a bunch of characters that aren't alphanumeric we could use.

Also, some other small optimizations; avoid copies associated with calling
Data::base64encode()/base64decode() on empty Datas, and reorder the SIP
cookie
comparisons to be more efficient.

State shedding modifications to TransactionState
================================================
In a number of cases, we were preserving state (in the form of SipMessages
and DnsResults) in cases where we did not really need them any more. For
example, once we have transmitted a response, there is no need
to preserve the full SipMessage for this response (the raw retransmit buffer
is sufficient). Also, INVITE requests do not need to be maintained once
a final response comes in (since there is no possibility that we'll need to
send a simulated 408 or 503 to the TU, nor will we need to construct a
CANCEL
request using the INVITE, nor will we need to retransmit). Similarly, once
we
have received a final response for a NIT transaction, we no longer need to
maintain the original request or the retransmit buffer. Lastly, if we are
using a reliable transport, we do not need to maintain retransmit buffers
(although we may need to maintain full original requests for simulated
responses and such).

This change has basically no impact on reliable NIT performance, but a huge
impact on non-reliable and INVITE performance. Prior to this change, either
NIT UDP or INVITE TCP testStack would exhaust main memory on my laptop (with
4GB of main memory), bringing progress to a complete halt on runs longer
than
15 seconds or so. I did not bother trying INVITE UDP, but that works now
too.

Reduction in buffer reallocations while encoding a SipMessage
=============================================================
TransportSelector now keeps a moving average of the outgoing message size,
which is used to preallocate the buffers into which SipMessages are encoded.

This ends up making a small difference in testStack when linked against
google
malloc, but a larger difference when linked against OS X's (horrible)
standard
malloc.

Multiple Threads in the Stack
=============================
Allow transaction processing, transport processing, and DNS processing to
be
broken off into separate threads.

- SipStack::run() causes the creation and run of three threads; a
TransactionControllerThread, and TransportSelectorThread, and a DnsThread.
You
continue to use stuff like StackThread and EventStackThread to give cycles
to
the rest of the stack (mainly processing app timers and statistics
logging); the
SipStack is smart enough to unhook these three things from the normal event
loop
API when they have their own threads. In other words, to use the new
multi-threaded mode, all you have to do is throw in a call to
SipStack::run()
before you fire up your normal SipStack processing, and a
SipStack::shutdownAndJoinThreads() when you're done.

- In the Connection read/write code, process reads/writes until EAGAIN, or
we
run out of stuff to send. Gives a healthy performance boost on
connection-based
transports.

- In TransactionController, put transaction timers in their own fifo. This
prevents timers from firing late when the state machine fifo gets
congested.
Also, process at most 16 TransactionMessages from the state machine fifo at
a
time, to prevent starving other parts of the system.

- Unhook the TransactionController's processing loop from that of the
TransportSelector. This simplifies this API considerably, but required the
addition of a new feature to Fifo. Fifo can now take an (optional)
AsyncProcessHandler* that will be notified when the fifo goes from empty to
non-empty. Actually pretty useful.

- Allow setPollGrp() to be called multiple times on the various classes
that
have this function. This allows the FdPollGrp to be re-set when the
SipStack
enters multithreaded mode.

- Added a "multithreadedstack" --thread-type option to testStack. Exercise
this
option in testStackStd.sh

- Added the ability to run any of the existing Transport objects in their
own
thread, by a combination of a new transport flag
(RESIP_TRANSPORT_FLAG_OWNTHREAD), and a new TransportThread class. Added
support
for this mode to testStack using the --tf option. Also exercised this
feature in
testStackStd.sh.

- Installed SelectInterruptors at the TransportSelector, each Transport
object,
and the DnsStub (this last one required moving SelectInterruptor to rutil).
This
is critical to making multithreaded mode work in a performant manner, and
imposes almost no performance penalty due to the way they are invoked.

- SipStack now creates its own SelectInterruptor if one is not supplied
externally. This is because it is critical to be able to wake the
TransactionController up when new work comes down from the TU, or from the
transports.

New congestion-management framework
===================================
Notable features include:
* Allow testStack, tfm/repro/sanityTests, and repro to be run with a
congestion
   manager with the --use-congestion-manager flag.

* Efficient wait-time estimation in AbstractFifo; keeps track of how rapidly
   messages are consumed, allowing good estimates of how long a new message
will
   take to be serviced. More efficient than the time-depth logic in
   TimeLimitFifo, and a better predictor too.

* The ability to shed load at the transport level when the
TransactionController
   is congested, in a very efficient manner, using new functionality in
Helper
   and SipMessage (Helper::makeRawResponse() and
   SipMessage::encodeSingleHeader())

* The ability to shed load coming from the TU when the
TransactionController is
   congested. This is crucial when congestion is being caused by a TU
trying to
   do too much.

* Changed the way load-shedding is handled for TransactionUsers to use the
new
   API

* A flexible congestion-management API, allowing load-shedding decisions to
be
   made in an arbitrary fashion.

* A generalized CongestionManager implementation that is powerful enough to
be
   useful.

* The TransactionController will now defer retransmissions of requests if
   sufficiently congested (ie; the response is probably stuck in
mStateMacFifo)

* The TransactionController now determines its hostname with a single call
to
   DnsUtil::getLocalHostName() on construction, for use in 503s.
Previously, it
   would make this call every time a 503 was sent; this call blocks
sometimes!

* Don't call DnsResult::blacklistLast() on a Retry-After: 0

* Several fixes the the processing loop in testStack that were causing
   starvation of one type of work or another when congestion occurred.

Other Misc Enhancements
=======================
-Small efficiency improvement in Random::getCryptoRandom(int)
 Random::getCryptoRandom(unsigned int len) was implemented by calling
 Random::getCryptoRandom() repeatedly, and collecting the return values
 in a buffer. In the openssl case, we now use a single call to RAND_bytes().
-Use a priority_queue instead of a multiset for storing timers.
-Slight refactoring of Timer so that transaction timers and payload timers
(ie;
 timers that carry a Message*) are separate classes. Transaction timers no
longer
 have an unused Message* member, and payload timers no longer have the
unused
 transaction-id, Timer::Type, and duration. This saves a _lot_ of memory
for apps
 that use lots of app timers with long lifetimes.
-Less wasteful population of Call-IDs:
 When generating Call-IDs, Helper was computing an md5 hash of the hostname
and
 some salt, hex-encoding it, and then Base64 encoding the hex data. We now
Base64
 encode the md5 hash directly. This is less computationally expensive,
requires
 less memory because the resulting string is half the size, and requires
fewer
 bytes on the wire.
-Make TransactionMap case-insensitive; Data::caseInsensitiveTokenHash() is
fast
 enough that performance actually increases a little.
-std::bitset-based parsing in a number of places.
-Don't check whether the encoding tables are initted for every single
 character; check once before the encode operation begins. Also, checking
 the value of a static bool to determine whether an init has been carried
 out is pointless; that bool might not be initted yet, and it could have
 any value. The static init code now copes with both accesses to the
encoding
 tables during static initialization, and from multiple threads during
runtime.
-Don't bother generating a transaction identifier unless the parse fails
 to extract one.
-Some refactoring of the FdPollGrp stuff. Now is compatible with cares,
using
 a bit of a hack. Also compatible with being driven with the old
buildFdSet()/
 select()/process(FdSet&) call sequence, although this is now deprecated.
 Fixing these compatibility problems allowed us to switch over to using
FdPollGrp
 in all cases, instead of having dual mode everywhere.
-Buffer classes for Fifo to reduce lock contention. Using them in a few
places, will
 use them in more once we phase out TimeLimitFifo with the new congestion
management
 code.
-Use the --ignore-case option for generation of ParameterHash.cxx, instead
of the
 nasty sed rewriting we are using now. Should also be slightly faster,
since gperf
 handles case-insensitive hashing more efficiently than our hack was.
-Adding a local memory pool to SipMessage, to cut down (dramatically) on
 heap allocation overhead. Some minor refactoring to free up wasted space
 in SipMessage as well (makes more room for the pool). Changing the way
 the start-line is stored to no longer use a full-blown ParserContainer+
 HeaderFieldValueList. Lots of opportunistic doxygen merging.
 Up to 20K NIT transactions per second on my machine, roughly a doubling
 in performance.

Bug Fixes
=========
-Use getaddrinfo() instead of the non-threadsafe gethostbyname().
-Remove unused (and non-threadsafe) Timer::mTimerCount/Timer::mId.
 Previously, all timers were assigned a "unique" (not really, more on that
in a
 moment) integer identifier. There is no place in the resip codebase that
 actually uses this identifier in any way. For transaction timers, this
 identifier is in principle unnecessary, since there is more than
sufficient
 identifier information present already (the transaction id and timer
type). When
 passing a Message* into the timer queue, a unique identifier already
exists; the
 Message* itself (if potential use of this Message* bugs you, you can
always turn
 it into a handle by applying some sort of transformation to it). This
identifier
 is unnecessary in every case I can think of. In addition, the values are
 assigned simply by incrementing a global variable (Timer::mTimerCount),
with no
 threadsafety measures whatsoever, so it is not even guaranteed to be
unique.
 Because of all this, it has been removed. As a bonus, this saves some
memory; 8
 bytes per timer on 64-bit platforms, which adds up to around 3MB when
testStack
 steady state has close to 400000 timers in the timer queues at any given
point.
 This could be an even larger amount for TUs that schedule lots of
long-lifetime
 timers (Timer C, for instance).
-Get rid of a wasteful double-encode, in Message.cxx
-minor windows build fixes to avoid file in use errors when building dum
test projects
-fix testDigestAuthentiation for recent lowercase nonce change
-fixed a nasty bug in NameAddr - where unknown parameters uri parameters on
a
 NameAddr/Uri with no angle brackets are treated as NameAddr parameters.
 When this is
 done, the memory for these parameters was only a temporary Data object.
-fix bug in Data.  If Data is wrapping memory allocated externally (ie.
Share mode = BORROW)
 and you start appending to it.  It is possible that the append method will
write a NULL
 character off the end of the buffer.  Changed the resize condition to make
the buffer
 larger 1 character sooner, to accommodate for this.

Best Regards,
Scott Godin
SIP Spectrum, Inc.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://list.resiprocate.org/pipermail/resiprocate-devel/attachments/20120201/c18ac956/attachment.htm>