Re: [reSIProcate] Major Merge to Resiprocate SVN Trunk
This is great news! Thanks a lot guys!
Francis
On Wed, Feb 1, 2012 at 12:47 PM, Scott Godin <sgodin@xxxxxxxxxxxxxxx> wrote:
> Hi All,
>
> I've spent the last several days testing out and fixing up the Tekelec
> performance branch resip-b-TKLC-perf_work on Windows and Linux. I have just
> completed the merge of this branch with the resip SVN trunk/mainline.
>
> Thanks to Byron Campen for the initial contribution of the branch, and for
> testing on OS/X.
>
> Summary of Major Changes (read below for more details):
> 1. Many performance-related changes, including a new multi-threaded stack
> mode. Using testStack (Registration Transactions over TCP), I measured a
> 15-30% performance increase under Windows (multi-core) and a 20-30%
> performance increase under Linux (single-core). I expect much higher
> numbers for Linux on a multi-core system (especially when using the new
> multi-threaded stack feature), but I didn't have one available for
> testing.
> 2. New Congestion Management framework.
> 3. Reduced memory footprint.
>
> The following notes highlight the changes in this merge. Lots of neat new
> stuff. Happy reading!
>
> Merge Notes:
>
> Various optimizations of Data
> =============================
>
> Made Data smaller, without sacrificing functionality. Data is 20 (56 vs 36)
> bytes smaller on 64-bit libs, and 4 (36 vs 32) bytes smaller on 32-bit libs.
> This was accomplished by making mSize, mCapacity, and mShareEnum 4 bytes on
> 64-bit platforms (mShareEnum could be one byte, but it turns out this imposes
> a detectable performance penalty), and by having mShareEnum do double-duty
> as a null-terminator for mPreBuffer (Borrow==0), instead of requiring an
> extra byte at the end of mPreBuffer.
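>
> As a rough illustration of that layout (field names are taken from the
> description above, but the declaration below is a hypothetical sketch, not
> the actual resip::Data class; the pre-buffer size is also just illustrative):
>
>     #include <cstdint>
>
>     class DataLayoutSketch
>     {
>        public:
>           enum ShareEnum { Borrow = 0 /* other enumerators omitted */ };
>        private:
>           char*     mBuf;            // points at mPreBuffer for small strings
>           uint32_t  mSize;           // 32 bits even on 64-bit builds
>           uint32_t  mCapacity;       // 32 bits even on 64-bit builds
>           char      mPreBuffer[16];  // size illustrative only
>           // Sits directly after mPreBuffer; when the share mode is Borrow
>           // (== 0), its first byte doubles as mPreBuffer's null terminator,
>           // so no extra terminator byte is needed.
>           uint32_t  mShareEnum;
>     };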
>
> Several very simple functions have been inlined.
>
> Functionality enhancements to a couple of functions:
>
> - Data::md5() has been changed to Data::md5(Data::EncodingType type=HEX);
> this allows the output of md5() to be encoded as hex or Base64, or not
> encoded at all (binary).
>
> - Data::replace(const Data& match, const Data& target) has been updated to
> Data::replace(const Data& match, const Data& target, int max=INT_MAX); this
> allows the maximum number of replacements to be specified.
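>
> A quick usage sketch of both updated signatures (the BASE64 enumerator name
> is an assumption based on the description above; only HEX is stated
> explicitly):
>
>     #include "rutil/Data.hxx"
>
>     using resip::Data;
>
>     void example()
>     {
>        Data input("user@example.com");
>
>        Data hexDigest = input.md5();              // hex-encoded, as before
>        Data b64Digest = input.md5(Data::BASE64);  // Base64-encoded digest
>
>        // Replace at most one occurrence instead of every occurrence.
>        Data text("a,b,c");
>        text.replace(",", ";", 1);                 // -> "a;b,c"
>     }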
>
> Lastly, a few specialized hashing and comparison functions have been added:
>
> - Data::caseInsensitiveTokenHash(); this is a case-insensitive hash that
> assumes that the Data is an RFC 3261 token (e.g., branch parameters). This
> character set has the property that no character is equal to any other
> character when bit 6 is masked out, except for the alphabetical characters
> (for alphabetical characters, bit 6 specifies whether the character is
> upper or lower case). This means that we can mask out bit 6, and then use a
> case-sensitive hash algorithm on the resulting characters, without hurting
> the collision properties of the hash, and get a case-insensitive hash as a
> result. This hash function is based on the Hsieh hash.
>
> - bool Data::caseInsensitiveTokenCompare(const Data& rhs); this is an
> equality comparison that assumes that both Datas are RFC 3261 tokens (e.g.,
> branch parameters). This function takes advantage of the same properties of
> the RFC 3261 token character set as caseInsensitiveTokenHash(), by using a
> bitmask instead of a true lowercase() operation. This ends up being faster
> than strncasecmp().
>
> - Data& schemeLowercase(); this is a variant of lowercase() that assumes the
> Data is an RFC 3261 scheme. This character set has the property that setting
> bit 6 is a no-op, except for alphabetical characters (0-9, '+', '-', and '.'
> all already have bit 6 set). Setting bit 6 on an alphabetical character is
> equivalent to lower-casing the character. Note: there is no corresponding
> schemeUppercase() function, because clearing bit 6 would convert 0-9, '+',
> '-', and '.' into unprintable characters (well, '-' is turned into a CR, but
> you get the point).
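>
> The bit trick these functions rely on looks roughly like this (illustrative
> code only, not the resip implementation; "bit 6" in the description above
> corresponds to the 0x20 bit):
>
>     #include <cstddef>
>
>     // Case-insensitive equality for RFC 3261 tokens: masking out bit 6
>     // (0x20) only merges upper-/lower-case pairs of token characters, so a
>     // case-sensitive comparison on the masked bytes is effectively a
>     // case-insensitive token comparison.
>     bool tokenEqualsSketch(const char* lhs, const char* rhs, std::size_t len)
>     {
>        for (std::size_t i = 0; i < len; ++i)
>        {
>           if ((lhs[i] & ~0x20) != (rhs[i] & ~0x20))
>           {
>              return false;
>           }
>        }
>        return true;
>     }
>
>     // Scheme lower-casing: setting bit 6 (0x20) lower-cases alphabetic
>     // characters and leaves 0-9, '+', '-', and '.' unchanged, since those
>     // already have the bit set.
>     void schemeLowercaseSketch(char* scheme, std::size_t len)
>     {
>        for (std::size_t i = 0; i < len; ++i)
>        {
>           scheme[i] |= 0x20;
>        }
>     }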
>
>
>
> Performance improvements to ParseBuffer
> =======================================
>
> - Most functions that returned a Pointer now return a much more lightweight
> CurrentPosition object.
> - Allow some of the simpler functions to be inlined
> - Integer parsing code is more efficient, and overflow detection is better
>
>
> Performance enhancements to DnsUtil
> ===================================
>
> - DnsUtil::inet_ntop(): For some reason, the stock system inet_ntop
> is dreadfully inefficient on OS X. A dirt-simple hand-rolled
> implementation was 5-6 times as fast. This is shocking. The Linux
> implementation is plenty efficient, though, so we use preprocessor
> conditionals to activate the hand-rolled code.
>
> - DnsUtil::isIpV4Address(): The old implementation used sscanf(), which is
> pretty expensive. It has been replaced with hand-rolled code that is much
> faster.
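>
> A minimal hand-rolled check in that spirit might look like the following
> (illustrative only; not the actual DnsUtil code):
>
>     // Returns true if s is a dotted-quad IPv4 address (e.g., "10.0.0.1").
>     // One pass over the string, no sscanf().
>     bool looksLikeIpV4(const char* s)
>     {
>        for (int octet = 0; octet < 4; ++octet)
>        {
>           if (octet > 0)
>           {
>              if (*s != '.')
>              {
>                 return false;
>              }
>              ++s;
>           }
>           int value = 0;
>           int digits = 0;
>           while (*s >= '0' && *s <= '9' && digits < 3)
>           {
>              value = value * 10 + (*s - '0');
>              ++digits;
>              ++s;
>           }
>           if (digits == 0 || value > 255)
>           {
>              return false;
>           }
>        }
>        return *s == '\0';   // nothing may follow the fourth octet
>     }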
>
>
> Reduced the memory footprint associated with storing URIs
> =========================================================
> - Removed the AOR caching stuff from Uri; it was horrifically inefficient.
> Checking for staleness of the cache was nearly as expensive as regenerating
> the AOR from scratch. Not to mention that the AOR caching stuff took up a
> whopping 148 bytes of space on 64-bit platforms (4 Datas and an int).
>
> - Reworked the host canonicalization cache to take up less space and be
> faster. Previously, the canonicalized host was put in a separate Data. We
> now canonicalize in place, and use a bool to denote whether canonicalization
> has been performed yet (see the sketch after this list). This saves us 32
> bytes.
>
> - Changed Data Uri::mEmbeddedHeadersText to an auto_ptr<>, since in most
> cases Uris don't use it. Also switched mEmbeddedHeaders (which was already a
> pointer) to an auto_ptr, for consistency.
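>
> A hypothetical sketch of the in-place canonicalization pattern described in
> the second bullet above (member names invented; the actual Uri code differs):
>
>     #include "rutil/Data.hxx"
>
>     class HostHolderSketch
>     {
>        public:
>           HostHolderSketch() : mCanonicalized(false) {}
>
>           const resip::Data& canonicalizedHost()
>           {
>              if (!mCanonicalized)
>              {
>                 mHost.lowercase();     // canonicalize the stored Data in place
>                 mCanonicalized = true; // a single bool replaces the old cache
>              }
>              return mHost;
>           }
>
>        private:
>           resip::Data mHost;
>           bool mCanonicalized;
>     };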
>
>
> Change how branch parameters are encoded.
> =========================================
>
> Old format:
> z9hG4bK-d8754z-<branch>-<transportseq>-<clientData>-<sigcompCompartment>-d8754z-
>
> New Format:
> z9hG4bK-524287-<transportseq>-<clientData>-<sigcompCompartment>-<branch>
>
> This format encodes faster, parses faster (with _much_ simpler code), and
> takes up less space on the wire. We may decide to tweak the new resip
> cookie; I chose something that we can compare with memcmp instead of
> strncasecmp, but the token character set has a number of non-alphanumeric
> characters we could also use.
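>
> For illustration, a hypothetical prefix check against the new cookie, taking
> advantage of the fixed-case format (not the actual TransactionState code):
>
>     #include <cstddef>
>     #include <cstring>
>
>     // Does this branch parameter start with the new resip cookie?
>     bool hasResipCookie(const char* branch, std::size_t len)
>     {
>        static const char cookie[] = "z9hG4bK-524287-";
>        const std::size_t cookieLen = sizeof(cookie) - 1;
>        return len >= cookieLen && std::memcmp(branch, cookie, cookieLen) == 0;
>     }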
>
> Also made some other small optimizations: avoid copies associated with
> calling Data::base64encode()/base64decode() on empty Datas, and reorder the
> SIP cookie comparisons to be more efficient.
>
>
> State shedding modifications to TransactionState
> ================================================
> We were preserving state (in the form of SipMessages and DnsResults) in a
> number of cases where we did not really need it any more. For example, once
> we have transmitted a response, there is no need to preserve the full
> SipMessage for this response (the raw retransmit buffer is sufficient).
> Also, INVITE requests do not need to be maintained once a final response
> comes in (since there is no possibility that we'll need to send a simulated
> 408 or 503 to the TU, nor will we need to construct a CANCEL request using
> the INVITE, nor will we need to retransmit). Similarly, once we have
> received a final response for a NIT transaction, we no longer need to
> maintain the original request or the retransmit buffer. Lastly, if we are
> using a reliable transport, we do not need to maintain retransmit buffers
> (although we may need to maintain full original requests for simulated
> responses and such).
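>
> Purely as an illustration of the pattern (member names invented; this is not
> the TransactionState code), the "keep only the raw buffer" idea amounts to
> something like this:
>
>     #include <memory>
>
>     #include "resip/stack/SipMessage.hxx"
>     #include "rutil/Data.hxx"
>
>     struct ResponseStateSketch
>     {
>        std::auto_ptr<resip::SipMessage> mLastResponse;  // full parsed message
>        resip::Data mRetransmitBuffer;                   // raw wire bytes
>
>        void onResponseSent(const resip::Data& encoded)
>        {
>           mRetransmitBuffer = encoded;  // enough for retransmissions
>           mLastResponse.reset();        // shed the heavyweight SipMessage
>        }
>     };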
>
> This change has basically no impact on reliable NIT performance, but a huge
> impact on non-reliable and INVITE performance. Prior to this change, either
> NIT UDP or INVITE TCP testStack would exhaust main memory on my laptop (with
> 4GB of main memory), bringing progress to a complete halt on runs longer
> than 15 seconds or so. I did not bother trying INVITE UDP, but that works
> now too.
>
>
> Reduction in buffer reallocations while encoding a SipMessage
> =============================================================
> TransportSelector now keeps a moving average of the outgoing message size,
> which is used to preallocate the buffers into which SipMessages are encoded.
>
> This ends up making a small difference in testStack when linked against
> google malloc, but a larger difference when linked against OS X's (horrible)
> standard malloc.
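>
> The moving-average preallocation amounts to something like the sketch below
> (illustrative only; the smoothing constants and names are not taken from the
> TransportSelector code):
>
>     #include <cstddef>
>
>     class EncodeSizeEstimatorSketch
>     {
>        public:
>           EncodeSizeEstimatorSketch() : mAverageSize(1024) {}
>
>           void onMessageEncoded(std::size_t encodedBytes)
>           {
>              // exponentially weighted moving average: 7/8 old + 1/8 sample
>              mAverageSize = (mAverageSize * 7 + encodedBytes) / 8;
>           }
>
>           std::size_t suggestedReserve() const
>           {
>              return mAverageSize + mAverageSize / 4;   // 25% headroom
>           }
>
>        private:
>           std::size_t mAverageSize;
>     };
>
> The encode buffer is then reserved to suggestedReserve() bytes before the
> SipMessage is encoded into it, so most messages fit without a reallocation.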
>
>
> Multiple Threads in the Stack
> =============================
> Allow transaction processing, transport processing, and DNS processing to
> be
> broken off into separate threads.
>
> - SipStack::run() creates and runs three threads: a
> TransactionControllerThread, a TransportSelectorThread, and a DnsThread. You
> continue to use stuff like StackThread and EventStackThread to give cycles
> to the rest of the stack (mainly processing app timers and statistics
> logging); the SipStack is smart enough to unhook these three things from the
> normal event loop API when they have their own threads. In other words, to
> use the new multi-threaded mode, all you have to do is throw in a call to
> SipStack::run() before you fire up your normal SipStack processing, and a
> SipStack::shutdownAndJoinThreads() when you're done (see the sketch at the
> end of this section).
>
> - In the Connection read/write code, process reads/writes until EAGAIN, or
> we run out of stuff to send. This gives a healthy performance boost on
> connection-based transports.
>
> - In TransactionController, put transaction timers in their own fifo. This
> prevents timers from firing late when the state machine fifo gets congested.
> Also, process at most 16 TransactionMessages from the state machine fifo at
> a time, to prevent starving other parts of the system.
>
> - Unhook the TransactionController's processing loop from that of the
> TransportSelector. This simplifies the API considerably, but required the
> addition of a new feature to Fifo: Fifo can now take an (optional)
> AsyncProcessHandler* that will be notified when the fifo goes from empty to
> non-empty. Actually pretty useful.
>
> - Allow setPollGrp() to be called multiple times on the various classes that
> have this function. This allows the FdPollGrp to be re-set when the SipStack
> enters multithreaded mode.
>
> - Added a "multithreadedstack" --thread-type option to testStack. Exercise
> this
> option in testStackStd.sh
>
> - Added the ability to run any of the existing Transport objects in their
> own thread, by a combination of a new transport flag
> (RESIP_TRANSPORT_FLAG_OWNTHREAD), and a new TransportThread class. Added
> support for this mode to testStack using the --tf option. Also exercised
> this feature in testStackStd.sh.
>
> - Installed SelectInterruptors at the TransportSelector, each Transport
> object, and the DnsStub (this last one required moving SelectInterruptor to
> rutil). This is critical to making multithreaded mode work in a performant
> manner, and imposes almost no performance penalty due to the way they are
> invoked.
>
> - SipStack now creates its own SelectInterruptor if one is not supplied
> externally. This is because it is critical to be able to wake the
> TransactionController up when new work comes down from the TU, or from the
> transports.
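>
> Putting the first bullet into code, a minimal sketch of the new
> multi-threaded mode (transport setup, TU wiring, and error handling omitted;
> the exact ordering of your existing startup code may differ):
>
>     #include "resip/stack/SipStack.hxx"
>     #include "resip/stack/StackThread.hxx"
>
>     int main()
>     {
>        resip::SipStack stack;
>
>        stack.run();   // spins up the TransactionController, TransportSelector
>                       // and DNS threads inside the stack
>
>        resip::StackThread appThread(stack);   // still drives app timers,
>        appThread.run();                       // statistics logging, etc.
>
>        // ... add transports, register TransactionUsers, run the app ...
>
>        appThread.shutdown();
>        appThread.join();
>        stack.shutdownAndJoinThreads();        // tear down the internal threads
>        return 0;
>     }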
>
>
> New congestion-management framework
> ===================================
> Notable features include:
> * Allow testStack, tfm/repro/sanityTests, and repro to be run with a
> congestion manager, via the --use-congestion-manager flag.
>
> * Efficient wait-time estimation in AbstractFifo; keeps track of how rapidly
> messages are consumed, allowing good estimates of how long a new message
> will take to be serviced (see the sketch at the end of this section). More
> efficient than the time-depth logic in TimeLimitFifo, and a better predictor
> too.
>
> * The ability to shed load at the transport level when the
> TransactionController is congested, in a very efficient manner, using new
> functionality in Helper and SipMessage (Helper::makeRawResponse() and
> SipMessage::encodeSingleHeader())
>
> * The ability to shed load coming from the TU when the TransactionController
> is congested. This is crucial when congestion is being caused by a TU trying
> to do too much.
>
> * Changed the way load-shedding is handled for TransactionUsers to use the
> new API.
>
> * A flexible congestion-management API, allowing load-shedding decisions to
> be made in an arbitrary fashion.
>
> * A generalized CongestionManager implementation that is powerful enough to
> be useful.
>
> * The TransactionController will now defer retransmissions of requests if
> sufficiently congested (i.e., the response is probably stuck in
> mStateMacFifo).
>
> * The TransactionController now determines its hostname with a single call
> to DnsUtil::getLocalHostName() on construction, for use in 503s. Previously,
> it would make this call every time a 503 was sent; this call blocks
> sometimes!
>
> * Don't call DnsResult::blacklistLast() on a Retry-After: 0
>
> * Several fixes to the processing loop in testStack that were causing
> starvation of one type of work or another when congestion occurred.
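>
> The wait-time estimate mentioned above boils down to "queue depth divided by
> the measured consumption rate". A hypothetical sketch of the idea (not the
> AbstractFifo implementation):
>
>     #include <cstddef>
>
>     class FifoWaitEstimatorSketch
>     {
>        public:
>           FifoWaitEstimatorSketch() : mServiceRatePerSec(0.0) {}
>
>           // Called periodically with how many messages were consumed over
>           // the last measurement interval.
>           void onIntervalComplete(std::size_t consumed, double intervalSeconds)
>           {
>              if (intervalSeconds > 0.0)
>              {
>                 mServiceRatePerSec = consumed / intervalSeconds;
>              }
>           }
>
>           // Estimated seconds a message entering the queue now will wait.
>           double expectedWaitSeconds(std::size_t queueDepth) const
>           {
>              return mServiceRatePerSec > 0.0 ? queueDepth / mServiceRatePerSec
>                                              : 0.0;
>           }
>
>        private:
>           double mServiceRatePerSec;
>     };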
>
>
> Other Misc Enhancements
> =======================
> -Small efficiency improvement in Random::getCryptoRandom(int):
> Random::getCryptoRandom(unsigned int len) was implemented by calling
> Random::getCryptoRandom() repeatedly and collecting the return values
> in a buffer. In the OpenSSL case, we now use a single call to RAND_bytes()
> (see the sketch at the end of this section).
> -Use a priority_queue instead of a multiset for storing timers.
> -Slight refactoring of Timer so that transaction timers and payload timers
> (i.e., timers that carry a Message*) are separate classes. Transaction timers
> no longer have an unused Message* member, and payload timers no longer have
> the unused transaction-id, Timer::Type, and duration. This saves a _lot_ of
> memory for apps that use lots of app timers with long lifetimes.
> -Less wasteful population of Call-IDs:
> When generating Call-IDs, Helper was computing an md5 hash of the hostname
> and some salt, hex-encoding it, and then Base64 encoding the hex data. We
> now Base64 encode the md5 hash directly. This is less computationally
> expensive, requires less memory because the resulting string is half the
> size, and requires fewer bytes on the wire.
> -Make TransactionMap case-insensitive; Data::caseInsensitiveTokenHash() is
> fast
> enough that performance actually increases a little.
> -std::bitset-based parsing in a number of places.
> -Don't check whether the encoding tables are initialized for every single
> character; check once before the encode operation begins. Also, checking
> the value of a static bool to determine whether an init has been carried
> out is pointless; that bool might not be initialized yet, and it could have
> any value. The static init code now copes both with accesses to the encoding
> tables during static initialization, and with accesses from multiple threads
> at runtime.
> -Don't bother generating a transaction identifier unless the parse fails
> to extract one.
> -Some refactoring of the FdPollGrp stuff. It is now compatible with c-ares,
> using a bit of a hack. It is also compatible with being driven by the old
> buildFdSet()/select()/process(FdSet&) call sequence, although this is now
> deprecated. Fixing these compatibility problems allowed us to switch over to
> using FdPollGrp in all cases, instead of having dual mode everywhere.
> -Buffer classes for Fifo to reduce lock contention. We are using them in a
> few places, and will use them in more once we phase out TimeLimitFifo in
> favor of the new congestion management code.
> -Use the --ignore-case option for generation of ParameterHash.cxx, instead
> of the nasty sed rewriting we were using before. This should also be
> slightly faster, since gperf handles case-insensitive hashing more
> efficiently than our hack did.
> -Added a local memory pool to SipMessage, to cut down (dramatically) on
> heap allocation overhead. Some minor refactoring to free up wasted space
> in SipMessage as well (makes more room for the pool). Changed the way
> the start-line is stored so that it no longer uses a full-blown
> ParserContainer + HeaderFieldValueList. Lots of opportunistic doxygen
> merging. Up to 20K NIT transactions per second on my machine, roughly a
> doubling in performance.
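>
> As a sketch of the getCryptoRandom() change mentioned near the top of this
> section (illustrative only; not the actual Random code), the OpenSSL path
> now amounts to a single RAND_bytes() call:
>
>     #include <vector>
>
>     #include <openssl/rand.h>
>
>     #include "rutil/Data.hxx"
>
>     resip::Data cryptoRandomSketch(unsigned int len)
>     {
>        if (len == 0)
>        {
>           return resip::Data::Empty;
>        }
>        std::vector<unsigned char> buffer(len);
>        // Fill the whole buffer at once instead of looping over 4-byte chunks.
>        if (RAND_bytes(&buffer[0], static_cast<int>(len)) != 1)
>        {
>           return resip::Data::Empty;   // RNG failure / not seeded
>        }
>        return resip::Data(reinterpret_cast<const char*>(&buffer[0]), len);
>     }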
>
>
> Bug Fixes
> =========
> -Use getaddrinfo() instead of the non-threadsafe gethostbyname() (see the
> sketch at the end of this section).
> -Remove unused (and non-threadsafe) Timer::mTimerCount/Timer::mId.
> Previously, all timers were assigned a "unique" (not really, more on that in
> a moment) integer identifier. There is no place in the resip codebase that
> actually uses this identifier in any way. For transaction timers, this
> identifier is in principle unnecessary, since there is more than sufficient
> identifier information present already (the transaction id and timer type).
> When passing a Message* into the timer queue, a unique identifier already
> exists: the Message* itself (if potential use of this Message* bugs you, you
> can always turn it into a handle by applying some sort of transformation to
> it). This identifier is unnecessary in every case I can think of. In
> addition, the values were assigned simply by incrementing a global variable
> (Timer::mTimerCount), with no threadsafety measures whatsoever, so it was
> not even guaranteed to be unique. Because of all this, it has been removed.
> As a bonus, this saves some memory; 8 bytes per timer on 64-bit platforms,
> which adds up to around 3MB when testStack steady state has close to 400000
> timers in the timer queues at any given point. This could be an even larger
> amount for TUs that schedule lots of long-lifetime timers (Timer C, for
> instance).
> -Get rid of a wasteful double-encode in Message.cxx
> -minor Windows build fixes to avoid file-in-use errors when building dum
> test projects
> -fix testDigestAuthentication for the recent lowercase nonce change
> -fixed a nasty bug in NameAddr, where unknown URI parameters on a
> NameAddr/Uri with no angle brackets are treated as NameAddr parameters; when
> this happens, the memory backing these parameters was only a temporary Data
> object.
> -fix bug in Data: if a Data is wrapping externally allocated memory (i.e.
> share mode = Borrow) and you start appending to it, it is possible that the
> append method will write a NULL character off the end of the buffer. Changed
> the resize condition to make the buffer larger one character sooner, to
> accommodate this.
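>
> For reference, the thread-safe lookup pattern behind the first bug-fix
> bullet looks roughly like this (POSIX shown; illustrative only, not the
> DnsUtil code):
>
>     #include <cstring>
>
>     #include <netdb.h>
>     #include <sys/socket.h>
>     #include <sys/types.h>
>
>     // getaddrinfo() writes its results into caller-owned storage, unlike
>     // gethostbyname(), which returns a pointer to shared static data.
>     bool resolveFirstAddress(const char* host, sockaddr_storage& result)
>     {
>        addrinfo hints;
>        std::memset(&hints, 0, sizeof(hints));
>        hints.ai_family = AF_UNSPEC;       // IPv4 or IPv6
>        hints.ai_socktype = SOCK_STREAM;
>
>        addrinfo* res = 0;
>        if (getaddrinfo(host, 0, &hints, &res) != 0 || !res)
>        {
>           return false;
>        }
>        std::memcpy(&result, res->ai_addr, res->ai_addrlen);
>        freeaddrinfo(res);
>        return true;
>     }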
>
>
> Best Regards,
> Scott Godin
> SIP Spectrum, Inc.
>
> _______________________________________________
> resiprocate-devel mailing list
> resiprocate-devel@xxxxxxxxxxxxxxx
> https://list.resiprocate.org/mailman/listinfo/resiprocate-devel