
[reSIProcate] Major Merge to Resiprocate SVN Trunk

From: Scott Godin <sgodin@xxxxxxxxxxxxxxx>
Date: Wed, 1 Feb 2012 12:47:31 -0500

Hi All,

I've spent the last number of days testing out and fixing up the Tekelec performance branch resip-b-TKLC-perf_work on Windows and Linux. I have just completed the merge of this branch with resip SVN trunk/mainline.

Thanks to Byron Campen for the initial contribution of the branch, and for testing on OS X.

Summary of Major Changes (read below for more details):

1. Many performance-related changes, including a new multi-threaded stack mode. Using testStack (Registration Transactions over TCP), I measured a 15-30% performance increase under Windows (multi-core) and a 20-30% performance increase under Linux (single-core). I expect much higher numbers for Linux on a multi-core system (especially when using the new multi-threaded stack feature), but I didn't have one available for testing.

2. New Congestion Management framework.

3. Reduced memory footprint.

The following notes highlight the changes in this merge. Lots of neat new stuff. Happy reading!

Merge Notes:

Various optimizations of Data

=============================

Made Data smaller, without sacrificing functionality. Data is 20 (56 vs 36)

bytes smaller on 64-bit libs, and 4 (36 vs 32) bytes smaller on 32-bit libs.

This was accomplished by making mSize, mCapacity, and mShareEnum 4-bytes on

64-bit platforms (mShareEnum could be one byte, but it turns out this imposes a

detectable performance penalty), and by having mShareEnum do double-duty as a

null-terminator for mPreBuffer (Borrow==0), instead of requiring an extra byte

at the end of mPreBuffer.

Several very simple functions have been inlined.

Functionality enhancements to a couple of functions:

- Data::md5() has been changed to Data::md5(Data::EncodingType type=HEX); this

allows the output of md5() to be encoded as hex or Base64, or not encoded at all

(binary).

- Data::replace(const Data& match, const Data& target) has been updated to

Data::replace(const Data& match, const Data& target, int max=INT_MAX); this

allows the maximum number of replacements to be specified.
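
A minimal sketch of the bounded-replace behaviour described above, written against std::string rather than resip's Data so that it stands alone. The name replaceUpTo and its return value (the number of replacements made) are illustrative assumptions, not the actual Data::replace() implementation.

```cpp
#include <cassert>
#include <climits>
#include <string>

// Hypothetical stand-in for Data::replace(match, target, max), using
// std::string. Replaces at most `max` occurrences, scanning left to right,
// and returns the number of replacements made.
inline int replaceUpTo(std::string& s, const std::string& match,
                       const std::string& target, int max = INT_MAX)
{
   assert(!match.empty());   // an empty match would never advance
   int count = 0;
   std::string::size_type pos = 0;
   while (count < max && (pos = s.find(match, pos)) != std::string::npos)
   {
      s.replace(pos, match.size(), target);
      pos += target.size();  // continue after the text just inserted
      ++count;
   }
   return count;
}
```

Leaving max at its default keeps the old replace-everything behaviour, which is presumably why the new parameter could be added without touching existing callers.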

Lastly, a few specialized hashing and comparison functions have been added:

- Data::caseInsensitiveTokenHash(); this is a case-insensitive hash that assumes

that the Data is an RFC 3261 token (e.g., branch params). This character set has

the property that no character is equal to any other character when bit 6 is

masked out, except for the alphabetical characters. (For alphabetical

characters, bit 6 specifies whether the character is upper or lower case.) This

means that we can mask out bit 6, and then use a case-sensitive hash algorithm

on the resulting characters, without hurting the collision properties of the

hash, and get a case-insensitive hash as a result. This hash function is based

on the Hsieh hash.

- bool Data::caseInsensitiveTokenCompare(const Data& rhs); this is an equality

comparison that assumes that both Datas are RFC 3261 tokens (e.g., branch

parameters). This function takes advantage of the same properties of the RFC

3261 token character set as caseInsensitiveTokenHash(), by using a bitmask

instead of a true lowercase() operation. This ends up being faster than

strncasecmp().

- Data& schemeLowercase(); this is a variant of lowercase() that assumes the

Data is an RFC 3261 scheme. This character set has the property that setting bit

6 is a no-op, except for alphabetical characters (0-9, '+', '-', and '.' all

already have bit 6 set). Setting bit 6 on an alphabetical character is equivalent

to lower-casing the character. Note: There is no corresponding schemeUppercase()

function, because clearing bit 6 will convert 0-9, '+', '-', and '.' into

unprintable characters (well, '-' is turned into a CR, but you get the point).
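
The bit-6 trick behind these three functions can be sketched as follows. This is an illustration on std::string, not the resip code; the function names are hypothetical, and "bit 6" is taken to mean the 0x20 bit (counting from 1), which is consistent with '-' (0x2D) becoming CR (0x0D) when it is cleared.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The ASCII case bit: "bit 6" when counting from 1, i.e. 0x20.
const unsigned char kCaseBit = 0x20;

// Equality that ignores the case bit. Only valid when both inputs are
// restricted to the RFC 3261 token character set, as explained above.
inline bool tokenEqualIgnoreCase(const std::string& a, const std::string& b)
{
   if (a.size() != b.size()) return false;
   for (std::size_t i = 0; i < a.size(); ++i)
   {
      unsigned char diff = static_cast<unsigned char>(a[i] ^ b[i]);
      // Any differing bit other than the case bit means a real mismatch.
      if (diff & static_cast<unsigned char>(~kCaseBit)) return false;
   }
   return true;
}

// Lowercase by setting the case bit. Only valid for RFC 3261 scheme
// characters (letters, digits, '+', '-', '.'), which already have the bit
// set unless they are uppercase letters.
inline void schemeLowercaseInPlace(std::string& s)
{
   for (std::size_t i = 0; i < s.size(); ++i)
   {
      s[i] = static_cast<char>(static_cast<unsigned char>(s[i]) | kCaseBit);
   }
}
```

Neither function is safe on arbitrary text: outside the restricted character sets, distinct characters such as '@' and '`' differ only in the case bit.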

Performance improvements to ParseBuffer

=======================================

- Most functions that returned a Pointer now return a much more lightweight

CurrentPosition object.

- Allow some of the simpler functions to be inlined

- Integer parsing code is more efficient, and overflow detection is better
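
As an illustration of the kind of overflow detection meant here, the following sketch parses an unsigned 32-bit decimal and refuses values that do not fit. The function name and signature are assumptions made for this example; they are not ParseBuffer's API.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Parses decimal digits from [pos, end) into `out`. Returns false if no
// digit is present or if the value would not fit in 32 bits. On success,
// `pos` is left at the first non-digit character; on failure it is unchanged.
inline bool parseUInt32(const char*& pos, const char* end, std::uint32_t& out)
{
   const std::uint32_t maxVal = std::numeric_limits<std::uint32_t>::max();
   const char* p = pos;
   std::uint32_t value = 0;
   while (p != end && *p >= '0' && *p <= '9')
   {
      std::uint32_t digit = static_cast<std::uint32_t>(*p - '0');
      // Check before multiplying, so the arithmetic itself never wraps.
      if (value > (maxVal - digit) / 10) return false;
      value = value * 10 + digit;
      ++p;
   }
   if (p == pos) return false;   // no digits at all
   pos = p;
   out = value;
   return true;
}
```

Testing the bound before the multiply avoids relying on wrap-around to notice the overflow after the fact.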

Performance enhancements to DnsUtil

===================================

- DnsUtil::inet_ntop(): For some reason, the stock system inet_ntop

is dreadfully inefficient on OS X. A dirt-simple hand-rolled

implementation was 5-6 times as fast. This is shocking. The Linux

implementation is plenty efficient, though, so we're using

the preprocessor to activate the hand-rolled code.

- DnsUtil::isIpV4Address(): The previous implementation used

sscanf(), which is pretty expensive. Hand-rolled some code that

is much faster.
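
A hand-rolled dotted-quad check along these lines might look like the sketch below. The exact rules are an assumption for the example (exactly four octets, one to three digits each, value at most 255, leading zeros accepted); the real DnsUtil::isIpV4Address() may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is exactly four dot-separated decimal octets, each
// one to three digits long and no larger than 255. No sscanf() involved.
inline bool isIpV4AddressSketch(const std::string& s)
{
   int octets = 0;
   std::size_t i = 0;
   while (true)
   {
      int value = 0;
      int digits = 0;
      // Read up to four digits; a fourth digit makes the octet invalid.
      while (i < s.size() && s[i] >= '0' && s[i] <= '9' && digits < 4)
      {
         value = value * 10 + (s[i] - '0');
         ++digits;
         ++i;
      }
      if (digits == 0 || digits > 3 || value > 255) return false;
      ++octets;
      if (i == s.size()) return octets == 4;
      if (s[i] != '.' || octets == 4) return false;
      ++i;   // consume the '.'
   }
}
```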

Reduced the memory footprint associated with storing URIs

=========================================================

- Removed the AOR caching stuff from Uri; it was horrifically inefficient. Checking

for staleness of the cache was nearly as expensive as regenerating the AOR from

scratch. Not to mention that the AOR caching stuff took up a whopping 148 bytes

of space on 64-bit platforms (4 Datas, and an int).

- Reworked the host canonicalization cache to take up less space, and be faster.

Previously, the canonicalized host was put in a separate Data. We now canonicalize

in-place, and use a bool to denote whether canonicalization has been performed yet.

This saves us 32 bytes.

- Changed Data Uri::mEmbeddedHeadersText to an auto_ptr<>, since in most cases Uris don't

use it. Also use auto_ptr for mEmbeddedHeaders (it was already a pointer; this is for consistency).
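
The in-place canonicalization idea from the second item can be sketched as below. The class and member names are hypothetical, and canonicalization is reduced to lower-casing purely for illustration; whatever Uri really does to the host is not reproduced here.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Keeps one string plus a flag, instead of a second string holding the
// canonical form. The host is canonicalized in place on first request.
class HostSketch
{
   public:
      explicit HostSketch(const std::string& host)
         : mHost(host), mCanonical(false) {}

      // Replacing the host invalidates the flag; nothing else to clean up.
      void set(const std::string& host)
      {
         mHost = host;
         mCanonical = false;
      }

      const std::string& canonical()
      {
         if (!mCanonical)
         {
            for (std::string::size_type i = 0; i < mHost.size(); ++i)
            {
               mHost[i] = static_cast<char>(
                  std::tolower(static_cast<unsigned char>(mHost[i])));
            }
            mCanonical = true;
         }
         return mHost;
      }

   private:
      std::string mHost;
      bool mCanonical;
};
```

The trade-off is that the original spelling of the host is lost once the canonical form has been requested.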

Change how branch parameters are encoded.

=========================================

Old format: z9hG4bK-d8754z-<branch>-<transportseq>-<clientData>-<sigcompCompartment>-d8754z-

New Format: z9hG4bK-524287-<transportseq>-<clientData>-<sigcompCompartment>-<branch>

This format encodes faster, parses faster (with _much_ simpler code), and takes up

less space on the wire. We may decide to tweak the new resip cookie; I chose

something that can be compared with memcmp instead of strncasecmp, but the token

character set has a number of non-alphanumeric characters we could use instead.

Also, some other small optimizations; avoid copies associated with calling

Data::base64encode()/base64decode() on empty Datas, and reorder the SIP cookie

comparisons to be more efficient.
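
To show why a digits-only cookie permits memcmp, here is a small sketch. It assumes the caller has already consumed the RFC 3261 magic cookie "z9hG4bK", and that the resip cookie is the literal text "-524287-" taken from the format line above; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Cookie text assumed from the "New Format" line; treating both hyphens
// as part of the cookie is an assumption made for this sketch.
static const char kResipCookie[] = "-524287-";
static const std::size_t kResipCookieLen = sizeof(kResipCookie) - 1;

// `afterMagic` is the branch value with the leading "z9hG4bK" removed.
// Digits and hyphens have no upper or lower case, so a byte-wise compare
// is sufficient and no case folding is needed.
inline bool hasResipCookie(const std::string& afterMagic)
{
   return afterMagic.size() >= kResipCookieLen &&
          std::memcmp(afterMagic.data(), kResipCookie, kResipCookieLen) == 0;
}
```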

State shedding modifications to TransactionState

================================================

In a number of cases, we were preserving state (in the form of SipMessages

and DnsResults) in cases where we did not really need them any more. For

example, once we have transmitted a response, there is no need

to preserve the full SipMessage for this response (the raw retransmit buffer

is sufficient). Also, INVITE requests do not need to be maintained once

a final response comes in (since there is no possibility that we'll need to

send a simulated 408 or 503 to the TU, nor will we need to construct a CANCEL

request using the INVITE, nor will we need to retransmit). Similarly, once we

have received a final response for a NIT transaction, we no longer need to

maintain the original request or the retransmit buffer. Lastly, if we are

using a reliable transport, we do not need to maintain retransmit buffers

(although we may need to maintain full original requests for simulated

responses and such).

This change has basically no impact on reliable NIT performance, but a huge

impact on non-reliable and INVITE performance. Prior to this change, either

NIT UDP or INVITE TCP testStack would exhaust main memory on my laptop (with

4GB of main memory), bringing progress to a complete halt on runs longer than

15 seconds or so. I did not bother trying INVITE UDP, but that works now too.

Reduction in buffer reallocations while encoding a SipMessage

=============================================================

TransportSelector now keeps a moving average of the outgoing message size,

which is used to preallocate the buffers into which SipMessages are encoded.

This ends up making a small difference in testStack when linked against google

malloc, but a larger difference when linked against OS X's (horrible) standard

malloc.
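
The preallocation idea can be sketched as follows. The notes above say only "moving average"; the exponential form, the 1/8 weight, and all names here are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Tracks a running estimate of encoded message size.
class SizeEstimatorSketch
{
   public:
      explicit SizeEstimatorSketch(std::size_t initial) : mAverage(initial) {}

      // Fold one observed size into the average with weight 1/8. The two
      // branches keep the unsigned arithmetic from wrapping.
      void record(std::size_t encodedSize)
      {
         if (encodedSize >= mAverage)
         {
            mAverage += (encodedSize - mAverage) / 8;
         }
         else
         {
            mAverage -= (mAverage - encodedSize) / 8;
         }
      }

      std::size_t estimate() const { return mAverage; }

   private:
      std::size_t mAverage;
};

// Usage: reserve the estimate before encoding, then record the real size.
inline std::string encodeWithEstimate(SizeEstimatorSketch& est,
                                      const std::string& payload)
{
   std::string buffer;
   buffer.reserve(est.estimate());
   buffer += payload;   // stands in for encoding a SipMessage
   est.record(buffer.size());
   return buffer;
}
```

When the estimate is close, the single reserve() replaces the series of grow-and-copy steps that appending to an empty buffer would otherwise trigger.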

Multiple Threads in the Stack

=============================

Allow transaction processing, transport processing, and DNS processing to be

broken off into separate threads.

- SipStack::run() causes the creation and run of three threads; a

TransactionControllerThread, a TransportSelectorThread, and a DnsThread. You

continue to use stuff like StackThread and EventStackThread to give cycles to

the rest of the stack (mainly processing app timers and statistics logging); the

SipStack is smart enough to unhook these three things from the normal event loop

API when they have their own threads. In other words, to use the new

multi-threaded mode, all you have to do is throw in a call to SipStack::run()

before you fire up your normal SipStack processing, and a

SipStack::shutdownAndJoinThreads() when you're done.

- In the Connection read/write code, process reads/writes until EAGAIN, or we

run out of stuff to send. Gives a healthy performance boost on connection-based

transports.

- In TransactionController, put transaction timers in their own fifo. This

prevents timers from firing late when the state machine fifo gets congested.

Also, process at most 16 TransactionMessages from the state machine fifo at a

time, to prevent starving other parts of the system.

- Unhook the TransactionController's processing loop from that of the

TransportSelector. This simplifies this API considerably, but required the

addition of a new feature to Fifo. Fifo can now take an (optional)

AsyncProcessHandler* that will be notified when the fifo goes from empty to

non-empty. Actually pretty useful.

- Allow setPollGrp() to be called multiple times on the various classes that

have this function. This allows the FdPollGrp to be re-set when the SipStack

enters multithreaded mode.

- Added a "multithreadedstack" --thread-type option to testStack. Exercised this

option in testStackStd.sh.

- Added the ability to run any of the existing Transport objects in their own

thread, by a combination of a new transport flag

(RESIP_TRANSPORT_FLAG_OWNTHREAD), and a new TransportThread class. Added support

for this mode to testStack using the --tf option. Also exercised this feature in

testStackStd.sh.

- Installed SelectInterruptors at the TransportSelector, each Transport object,

and the DnsStub (this last one required moving SelectInterruptor to rutil). This

is critical to making multithreaded mode work in a performant manner, and

imposes almost no performance penalty due to the way they are invoked.

- SipStack now creates its own SelectInterruptor if one is not supplied

externally. This is because it is critical to be able to wake the

TransactionController up when new work comes down from the TU, or from the

transports.
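
Two of the ideas above, notifying a handler when a fifo goes from empty to non-empty and processing at most 16 messages per pass, can be sketched together. This is not resip's Fifo or AsyncProcessHandler; every class and method name below is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical stand-in for a wake-up handler interface.
class WakeHandlerSketch
{
   public:
      virtual ~WakeHandlerSketch() {}
      virtual void onNonEmpty() = 0;
};

// Example handler that just counts notifications.
class CountingHandlerSketch : public WakeHandlerSketch
{
   public:
      CountingHandlerSketch() : count(0) {}
      virtual void onNonEmpty() { ++count; }
      int count;
};

template <class T>
class NotifyingFifoSketch
{
   public:
      explicit NotifyingFifoSketch(WakeHandlerSketch* handler = nullptr)
         : mHandler(handler) {}

      void add(const T& item)
      {
         bool wasEmpty;
         {
            std::lock_guard<std::mutex> lock(mMutex);
            wasEmpty = mItems.empty();
            mItems.push_back(item);
         }
         // Notify outside the lock, and only on the empty -> non-empty edge.
         if (wasEmpty && mHandler) mHandler->onNonEmpty();
      }

      // Hands at most `limit` items to `fn`; returns how many were handled.
      template <class Fn>
      std::size_t processSome(Fn fn, std::size_t limit = 16)
      {
         std::size_t done = 0;
         while (done < limit)
         {
            T item;
            {
               std::lock_guard<std::mutex> lock(mMutex);
               if (mItems.empty()) break;
               item = mItems.front();
               mItems.pop_front();
            }
            fn(item);   // run the work without holding the lock
            ++done;
         }
         return done;
      }

   private:
      std::mutex mMutex;
      std::deque<T> mItems;
      WakeHandlerSketch* mHandler;
};
```

Notifying only on the edge means a busy producer costs the consumer one wake-up rather than one per message, and the per-pass limit gives other work a turn even when the fifo never drains.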

New congestion-management framework

===================================

Notable features include:

* Allow testStack, tfm/repro/sanityTests, and repro to be run with a congestion

manager with the --use-congestion-manager flag.

* Efficient wait-time estimation in AbstractFifo; keeps track of how rapidly

messages are consumed, allowing good estimates of how long a new message will

take to be serviced. More efficient than the time-depth logic in

TimeLimitFifo, and a better predictor too.

* The ability to shed load at the transport level when the TransactionController

is congested, in a very efficient manner, using new functionality in Helper

and SipMessage (Helper::makeRawResponse() and

SipMessage::encodeSingleHeader())

* The ability to shed load coming from the TU when the TransactionController is

congested. This is crucial when congestion is being caused by a TU trying to

do too much.

* Changed the way load-shedding is handled for TransactionUsers to use the new

API

* A flexible congestion-management API, allowing load-shedding decisions to be

made in an arbitrary fashion.

* A generalized CongestionManager implementation that is powerful enough to be

useful.

* The TransactionController will now defer retransmissions of requests if

sufficiently congested (i.e., the response is probably stuck in mStateMacFifo).

* The TransactionController now determines its hostname with a single call to

DnsUtil::getLocalHostName() on construction, for use in 503s. Previously, it

would make this call every time a 503 was sent; this call blocks sometimes!

* Don't call DnsResult::blacklistLast() on a Retry-After: 0

* Several fixes to the processing loop in testStack that were causing

starvation of one type of work or another when congestion occurred.
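
The wait-time estimation idea can be illustrated with a small sketch: keep a running average of how long each message takes to be consumed, and multiply by the current queue depth. The averaging weights, the rejection rule, and all names are assumptions for this example; AbstractFifo and CongestionManager are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

class WaitEstimatorSketch
{
   public:
      WaitEstimatorSketch() : mAvgServiceMicros(0) {}

      // Call whenever a message is consumed, with the time it took.
      void onConsumed(std::uint64_t serviceMicros)
      {
         if (mAvgServiceMicros == 0)
         {
            mAvgServiceMicros = serviceMicros;   // first sample seeds the average
         }
         else
         {
            // Weighted average: 3/4 history, 1/4 newest sample.
            mAvgServiceMicros = (mAvgServiceMicros * 3 + serviceMicros) / 4;
         }
      }

      // A message added now waits behind `queueDepth` others.
      std::uint64_t expectedWaitMicros(std::size_t queueDepth) const
      {
         return mAvgServiceMicros * queueDepth;
      }

      // One possible load-shedding rule: reject if the wait is too long.
      bool shouldReject(std::size_t queueDepth,
                        std::uint64_t toleranceMicros) const
      {
         return expectedWaitMicros(queueDepth) > toleranceMicros;
      }

   private:
      std::uint64_t mAvgServiceMicros;
};
```

Because the estimate is a multiply at enqueue time, it needs no per-message timestamps in the queue.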

Other Misc Enhancements

=======================

-Small efficiency improvement in Random::getCryptoRandom(int)

Random::getCryptoRandom(unsigned int len) was implemented by calling

Random::getCryptoRandom() repeatedly, and collecting the return values

in a buffer. In the openssl case, we now use a single call to RAND_bytes().

-Use a priority_queue instead of a multiset for storing timers.

-Slight refactoring of Timer so that transaction timers and payload timers (i.e.,

timers that carry a Message*) are separate classes. Transaction timers no longer

have an unused Message* member, and payload timers no longer have the unused

transaction-id, Timer::Type, and duration. This saves a _lot_ of memory for apps

that use lots of app timers with long lifetimes.

-Less wasteful population of Call-IDs:

When generating Call-IDs, Helper was computing an md5 hash of the hostname and

some salt, hex-encoding it, and then Base64 encoding the hex data. We now Base64

encode the md5 hash directly. This is less computationally expensive, requires

less memory because the resulting string is half the size, and requires fewer

bytes on the wire.

-Make TransactionMap case-insensitive; Data::caseInsensitiveTokenHash() is fast

enough that performance actually increases a little.

-std::bitset-based parsing in a number of places.

-Don't check whether the encoding tables are initted for every single

character; check once before the encode operation begins. Also, checking

the value of a static bool to determine whether an init has been carried

out is pointless; that bool might not be initted yet, and it could have

any value. The static init code now copes with both accesses to the encoding

tables during static initialization, and from multiple threads during runtime.

-Don't bother generating a transaction identifier unless the parse fails

to extract one.

-Some refactoring of the FdPollGrp stuff. It is now compatible with cares, using

a bit of a hack. Also compatible with being driven with the old buildFdSet()/

select()/process(FdSet&) call sequence, although this is now deprecated.

Fixing these compatibility problems allowed us to switch over to using FdPollGrp

in all cases, instead of having dual mode everywhere.

-Buffer classes for Fifo to reduce lock contention. Using them in a few places, will

use them in more once we phase out TimeLimitFifo with the new congestion management

code.

-Use the --ignore-case option for generation of ParameterHash.cxx, instead of the

nasty sed rewriting we are using now. Should also be slightly faster, since gperf

handles case-insensitive hashing more efficiently than our hack was.

-Adding a local memory pool to SipMessage, to cut down (dramatically) on

heap allocation overhead. Some minor refactoring to free up wasted space

in SipMessage as well (makes more room for the pool). Changing the way

the start-line is stored to no longer use a full-blown ParserContainer+

HeaderFieldValueList. Lots of opportunistic doxygen merging.

Up to 20K NIT transactions per second on my machine, roughly a doubling

in performance.
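
For the timer-storage change listed earlier in this section (priority_queue instead of multiset), a minimal sketch of a timer queue built on std::priority_queue follows. The struct and class names are hypothetical, and the timer carries only an integer id for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>
#include <vector>

struct TimerSketch
{
   std::uint64_t whenMs;   // absolute expiry time
   int id;                 // illustrative payload
};

// Comparator that puts the earliest expiry on top (a min-heap).
struct LaterFirst
{
   bool operator()(const TimerSketch& a, const TimerSketch& b) const
   {
      return a.whenMs > b.whenMs;
   }
};

class TimerQueueSketch
{
   public:
      void add(std::uint64_t whenMs, int id)
      {
         mTimers.push(TimerSketch{whenMs, id});
      }

      // Pops every timer due at or before `nowMs`, in firing order.
      std::vector<int> fireDue(std::uint64_t nowMs)
      {
         std::vector<int> fired;
         while (!mTimers.empty() && mTimers.top().whenMs <= nowMs)
         {
            fired.push_back(mTimers.top().id);
            mTimers.pop();
         }
         return fired;
      }

      std::size_t size() const { return mTimers.size(); }

   private:
      std::priority_queue<TimerSketch, std::vector<TimerSketch>,
                          LaterFirst> mTimers;
};
```

One trade-off worth noting: std::priority_queue only exposes its top element, so this sketch has no way to cancel a timer from the middle of the queue. In exchange, the heap lives in one contiguous vector instead of one tree node per timer.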

Bug Fixes

=========

-Use getaddrinfo() instead of the non-threadsafe gethostbyname().

-Remove unused (and non-threadsafe) Timer::mTimerCount/Timer::mId.

Previously, all timers were assigned a "unique" (not really, more on that in a

moment) integer identifier. There is no place in the resip codebase that

actually uses this identifier in any way. For transaction timers, this

identifier is in principle unnecessary, since there is more than sufficient

identifier information present already (the transaction id and timer type). When

passing a Message* into the timer queue, a unique identifier already exists; the

Message* itself (if potential use of this Message* bugs you, you can always turn

it into a handle by applying some sort of transformation to it). This identifier

is unnecessary in every case I can think of. In addition, the values are

assigned simply by incrementing a global variable (Timer::mTimerCount), with no

threadsafety measures whatsoever, so it is not even guaranteed to be unique.

Because of all this, it has been removed. As a bonus, this saves some memory; 8

bytes per timer on 64-bit platforms, which adds up to around 3MB when testStack

steady state has close to 400000 timers in the timer queues at any given point.

This could be an even larger amount for TUs that schedule lots of long-lifetime

timers (Timer C, for instance).

-Get rid of a wasteful double-encode, in Message.cxx

-Minor Windows build fixes to avoid file-in-use errors when building dum test projects

-Fix testDigestAuthentication for recent lowercase nonce change

-Fixed a nasty bug in NameAddr, where unknown URI parameters on a

NameAddr/Uri with no angle brackets are treated as NameAddr parameters. When this was

done, the memory for these parameters was only a temporary Data object.

-Fixed a bug in Data. If a Data is wrapping memory allocated externally (i.e., share mode = BORROW)

and you start appending to it, it is possible that the append method will write a NUL

character off the end of the buffer. Changed the resize condition to make the buffer

larger one character sooner, to accommodate this.
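
The off-by-one in the last item can be illustrated with a small buffer sketch. It owns its memory through std::vector, so it does not model Borrow mode itself; it only shows a resize condition that leaves room for the terminator. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

class BufferSketch
{
   public:
      explicit BufferSketch(std::size_t capacity)
         : mBuf(capacity), mSize(0) {}

      void append(const char* src, std::size_t len)
      {
         // The +1 reserves a slot for the terminator. Testing only
         // "mSize + len > capacity" would let the '\0' written below
         // land one byte past the end when the data fits exactly.
         if (mSize + len + 1 > mBuf.size())
         {
            mBuf.resize((mSize + len + 1) * 2);
         }
         if (len > 0)
         {
            std::memcpy(mBuf.data() + mSize, src, len);
         }
         mSize += len;
         mBuf[mSize] = '\0';
      }

      std::size_t size() const { return mSize; }
      std::size_t capacity() const { return mBuf.size(); }
      const char* c_str() const { return mBuf.data(); }

   private:
      std::vector<char> mBuf;
      std::size_t mSize;
};
```

With a capacity of 8, appending "abcd" twice fills all eight bytes, so the buffer has to grow before the second append for the terminator to have somewhere to go.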

Best Regards,

Scott Godin

SIP Spectrum, Inc.

Follow-Ups:
- Re: [reSIProcate] Major Merge to Resiprocate SVN Trunk
  - From: Francis Joanis
- Re: [reSIProcate] Major Merge to Resiprocate SVN Trunk
  - From: Aron Rosenberg
