[reSIProcate] The Million User Dilemma

Sat Jan 29 22:52:24 CST 2011

Hi guys,

I've been using resip + dum for a while, but since I was focusing more on
building UAs (and not proxies, ...) with it, I've had no performance issues
so far. In fact I found that it actually performed better than some other
SIP application layers, especially when handling multiple SIP events at the
same time.

It performed better because the DUM (application level) generally uses a
single thread for all SIP events, rather than using one thread per event or
event type (imagine what 1 thread per session would do... :( ).

I see the current resip threading code as being like the reactor design
pattern, where only a single thread is used to "select" then synchronously
process events. From my experience, one main advantage of this approach is
that the stack's general behaviour is "predictable" with regard to its
performance and the flow of events (i.e. processing a single call VS
processing 100+ incoming calls).
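
As a rough illustration of the reactor idea described above, here is a
minimal sketch. It assumes POSIX poll(); the Reactor class and its method
names are made up for this example and are not resip's actual
FdSet/FdPollGrp API. One thread waits for readiness, then runs each handler
synchronously on that same thread:

```cpp
#include <poll.h>
#include <unistd.h>

#include <cassert>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical minimal reactor (illustration only, not resip code).
class Reactor {
public:
    using Handler = std::function<void(int fd)>;

    // Register a callback to run when fd becomes readable.
    void addReadable(int fd, Handler h) { handlers_[fd] = std::move(h); }

    // One loop iteration: block in poll() until something is readable or
    // the timeout expires, then dispatch handlers one after another on
    // the calling thread. Returns the number of handlers invoked.
    int runOnce(int timeoutMs) {
        std::vector<pollfd> fds;
        for (const auto& kv : handlers_) {
            pollfd p{};
            p.fd = kv.first;
            p.events = POLLIN;
            fds.push_back(p);
        }
        const int ready = ::poll(fds.data(), fds.size(), timeoutMs);
        if (ready <= 0) return 0;  // timeout or error: nothing dispatched

        int dispatched = 0;
        for (const auto& p : fds) {
            if (!(p.revents & POLLIN)) continue;
            auto it = handlers_.find(p.fd);
            if (it == handlers_.end()) continue;
            Handler h = it->second;  // copy, so a handler may re-register
            h(p.fd);
            ++dispatched;
        }
        return dispatched;
    }

private:
    std::map<int, Handler> handlers_;
};
```

Because every handler runs to completion before the next one starts, event
ordering stays easy to reason about, but a slow handler delays all the
others and only one core is ever busy.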

However, one downside of the reactor is that it doesn't scale well on
multicore CPUs since it only has a single thread. To really leverage
multicore, programs need to become more and more concurrent (that is truly
concurrent - i.e. without mutexes and locking) in order to get faster. This
is probably nothing new for most of us, but it is something I've come to
appreciate in practice since being exposed to concurrent languages like
Erlang.

I think that investing time into making resip (at least the stack and DUM
parts) multicore aware would be a great way to future-proof it.

To add to what was already said in this thread:

- It does make a lot of sense to leverage libevent or asio or ... to ensure
best performance on all platforms. This is a long term goal but maybe we
could start some prep work now (like decoupling stuff and laying down
foundations). The alternative could be to try to implement the best select()
substitute for each supported platform, but we might then end up rewriting
libevent ourselves.
- Regarding the reactor design pattern, there is also the proactor one which
uses (unless I'm mistaken) OS-level asynchronous IO (resip currently uses
synchronous IO). The idea is that the resip transport thread would be able
to service multiple IO operations at the same time through the kernel. I
think this is similar to what Kennard mentioned as a post-notified system
and this would not be an easy change.
- Adding more threads where it makes sense (like one per transport or ...)
might not be good enough if those threads still use locking to communicate
with each other. I've done a bit of googling about lock-free data
structures and it is quite interesting. I might try it one day to see how
much faster it could get just between the stack and the DUM.
- It also makes sense to look into code profiling and ensuring that the
code is not "wasting cycles".
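
To give an idea of what a lock-free hand-off between two threads could look
like, here is a minimal sketch of a single-producer/single-consumer ring
buffer built on std::atomic (C++11). The SpscQueue name and interface are
hypothetical and nothing like this exists in resip today; it also assumes
exactly one producer thread (say, the stack) and exactly one consumer
thread (say, the DUM), and that T is default-constructible:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical bounded SPSC queue (illustration only, not resip code).
// Safe only with exactly one pushing thread and one popping thread.
template <typename T>
class SpscQueue {
public:
    // One slot is kept empty to tell "full" apart from "empty".
    explicit SpscQueue(std::size_t capacity)
        : buf_(capacity + 1), head_(0), tail_(0) {}

    // Producer thread only. Returns false if the queue is full.
    bool push(T value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        const std::size_t next = (tail + 1) % buf_.size();
        if (next == head_.load(std::memory_order_acquire)) return false;
        buf_[tail] = std::move(value);
        // Publish the slot: the consumer's acquire load of tail_ sees it.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer thread only. Returns false if the queue is empty.
    bool pop(T& out) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        if (head == tail_.load(std::memory_order_acquire)) return false;
        out = std::move(buf_[head]);
        // Free the slot: the producer's acquire load of head_ sees it.
        head_.store((head + 1) % buf_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> buf_;
    std::atomic<std::size_t> head_;  // next slot to read
    std::atomic<std::size_t> tail_;  // next slot to write
};
```

Neither side ever blocks or takes a mutex; a full queue simply makes push()
return false, so the producer has to decide whether to retry, drop, or
apply back-pressure. Whether this would actually beat a mutex-protected
fifo between the stack and the DUM is something only profiling could tell.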

Anyway, I think this is a great idea and I would be happy to help :)

Regards,
Francis

On Sat, Jan 29, 2011 at 1:57 PM, Kennard White
<kennard_white at logitech.com> wrote:

> Hi Byron,
>
> Scott and I earlier discussed possible directions to take, and he found
> this:
> http://google-opensource.blogspot.com/2010/01/libevent-20x-like-libevent-14x-only.html
> which is a good overview of pre-notification (select/poll/epoll, which tell
> the app when IO is possible) approaches vs buffered or post-notification
> approaches (where you queue up IO into the kernel, and it tells the app
> when IO is complete). Asio is a post-notification system, and as far as I
> can tell doesn't offer a pre-notification API. In contrast, libevent offers
> both pre-notification and post-notification. The gotcha here is that in
> Windows the only way to effectively handle many connections is with
> post-notification (so I understand). Windows has pre-notification APIs, but
> they don't scale (I'm told).
>
> With respect to the current resip codebase, any pre-notification library
> could be plugged into resip in place of my "native" epoll, and (hopefully)
> everything is properly virtualized to allow this. We need to do something
> here, because there is significant branching within the current codebase to
> handle epoll vs the "older" buildFd/process paradigm. Supporting both modes
> won't be fun.
>
> The alternative is to make a "big" leap to a post-notify system. The
> problem is that this is much more intrusive into the application, because
> the underlying framework has its own buffer management system (think mbufs),
> and every framework manages buffers differently. This is in contrast to
> pre-notification, where the app tells the kernel "put the data here", and
> the app provides its own buffer management (which resip does, esp for TCP).
> While I haven't started a prototype so I don't know for sure, my guess is
> that a given transport class within resip would be "hardcoded" to work with
> a particular framework -- it cannot be hidden. Given that, asio is a natural
> choice, since (hopefully) it works everywhere we want.
>
> Anyways, I'm undecided among 3 options:
>
>    - Write a FdPollGrp impl class that uses Windows' select call, so that
>    there is a working FdPollGrp class on every platform, and we can obsolete
>    buildFd/process. Unfortunately, I don't develop for Windows, and Windows'
>    select() is somewhere in between Linux's select() and poll(), so really
>    someone else needs to do this.
>    - Write a libevent2 adapter for FdPollGrp that uses the pre-notification
>    mode and then obsolete buildFd/process. Then libevent2 becomes a required
>    dependency for Windows (and any platform without a working epoll()).
>    - Switch everything to asio. This is a big project, requires turning the
>    transport code inside-out, and would break compatibility with any "private"
>    transports.
>
> Regarding asio::strand, as far as I can tell it is "just" a per-handler
> mutex. I don't see how multi-threading of the transports helps anything. All
> the heavy work is in the transaction layer, and there is a queue interface
> between the transports and transaction layer. I think putting transports in
> separate threads has been tried before ("ExternalTransports") and my
> understanding is that it didn't pay off. One can see the same thing by running
> testStack in the different threading modes I added, and the multithreaded
> ones all perform worse.
>
> Kennard
>
> On Fri, Jan 28, 2011 at 7:07 PM, Byron Campen <bcampen at estacado.net> wrote:
>
>>        tfdum is actually doing the boost::bind trick here, but no asio
>> Strand (the bindings are to the various blahCommand() functions). I just
>> wish the compiler spew wasn't so bad when you got a parameter not quite
>> right, but that's gcc templates for you. An app-writer can easily use
>> boost::bind in their app, which does not require any boost dependency in
>> resip or DUM, so that is at least nice. I'm not familiar enough with asio
>> Strand to say how much work it would be to make resip's threading use it;
>> I'm guessing this is a wrapper for pthreads/whatever Windows uses/the fancy
>> Intel threading stuff?
>>
>>        As for using asio to just drive the event loops, Scott, roughly how
>> much work would need to be done here? And how many platforms would this
>> benefit? I know the epoll stuff works on OS X; how would Windows benefit
>> from using asio? I'm thrilled with epoll, but that's just me.
>>
>> Best regards,
>> Byron Campen
>>
>>
>> > Hi Kennard,
>> >
>> > I think you're on the right track with using epoll, but I'd like to go
>> > one step further and improve cross platform compatibility in the
>> > process.  Scott Godin has been keeping header only asio up to date in
>> > the resiprocate tree, and it provides support for every platform's most
>> > sophisticated version of select/epoll/kqueue etc.  Reimplementing things
>> > like FdSet and the wait and process functions with the async_wait that
>> > asio provides could provide a humongous performance improvement as well,
>> > as Asio can be multithreaded easily.
>> >
>> > Also consider things like DumCommand.  It can be easily replaced with
>> > Asio's Strand + boost::bind/boost::function or C++0x lambdas which are
>> > much more flexible and require significantly less code.
>> >
>> > Dan
>> >
>> > On 01/26/2011 01:45 PM, Kennard White wrote:
>> >> Hi Dan,
>> >>
>> >> I found your post very interesting, since we have very similar goals.
>> >> The changes I've made recently to resip to add epoll support are meant
>> >> to address the first limitation: simply being able to have many
>> >> connections open.
>> >>
>> >> I've spent some amount of time profiling resip, and unfortunately I
>> >> haven't found one single hot-spot. Probably SipMessage allocation and
>> >> destruction is most expensive, but I haven't looked into it in any
>> >> detail. For reference, I'm getting about 2ktps on good hardware in
>> >> "real" usage scenarios. Probably first thing to do is look for
>> >> unnecessary message copies.
>> >>
>> >> For the SIP aspect of NAT traversal, we are switching to TCP/TLS (away
>> >> from UDP) using RFC 5626 outbound support.
>> >>
>> >> Would like to hear your plans.
>> >>
>> >> Regards,
>> >> Kennard
>> >>
>> >> On Wed, Jan 26, 2011 at 10:15 AM, Dan Weber <dan at marketsoup.com> wrote:
>> >>
>> >>    Hi guys,
>> >>
>> >>    I must say I have a quite ambitious goal.  I want to make it so that
>> >>    I can build a network of repros that can support millions upon
>> >>    millions of users.  Likewise, I like to consider myself as a
>> >>    standards based guy, and I want to take as much of everyone's input
>> >>    as possible in the design path to doing this.  In return, everything
>> >>    will be made available for free under the same Vovida license and/or
>> >>    BSD licensing that is already available.
>> >>
>> >>
>> >>    Several key areas of concern are the following:
>> >>
>> >>    Reliability:
>> >>    How do we make it so that we can have many repro nodes work together
>> >>    across a large geographic topology, and allow calls to continue
>> >>    processing in the event of an attack or a failure?
>> >>
>> >>    Scalability:
>> >>    If you've ever run the testStack application on a modern computer,
>> >>    you'll notice that no matter how many cores you have, or even what
>> >>    the clock rate of your processor is, there seems to be a magic
>> >>    threshold around 6500 TPS for non-INVITE scenarios.  Likewise, for
>> >>    calls, I can get about 1/3rd of that.  Also, those are tests done
>> >>    with TCP; when you add in UDP, you can watch it suck up memory like
>> >>    it's its job.  Based on what Byron has shown me, on inferior
>> >>    hardware, the stack that Estacado/Tekelec has built and modified
>> >>    from the main resiprocate tree can perform over 12000 TPS for
>> >>    non-INVITE transactions in a single thread.  This means there are
>> >>    great areas for improvement even beyond just adding concurrency.
>> >>
>> >>    Security:
>> >>    Resiprocate supports TLS fairly well.  I would like to be able to
>> >>    take advantage of that with any reliability mechanism put forth to
>> >>    help meet HIPAA-style requirements that all data stored to disk be
>> >>    encrypted, and all data in transit be encrypted.  Thankfully, part
>> >>    of this problem can be more easily resolved by keeping more state in
>> >>    memory.
>> >>
>> >>    NAT Traversal:
>> >>    Jeremy Geras and Scott Godin among others have worked very hard to
>> >>    provide NAT traversal mechanisms for calls and registrations and so
>> >>    forth through reTurn, reflow, and recon.  Jeremy's branch of recon
>> >>    utilizes an outdated stack, but supports ICE to a large degree.  It
>> >>    is missing support for ICE with TURN and has some other quirks that
>> >>    I've managed to work out.
>> >>
>> >>    In my research around these key areas, I have come up with several
>> >>    ideas of my own to deal with these issues.  However, I would like to
>> >>    open this up to the community to discuss these areas in an open
>> >>    forum where everyone can participate and have their input taken
>> >>    seriously.
>> >>
>> >>    Thanks guys,
>> >>    Dan
>> >>
>> >>    _______________________________________________
>> >>    resiprocate-devel mailing list
>> >>    resiprocate-devel at resiprocate.org
>> >>    <mailto:resiprocate-devel at resiprocate.org>
>> >>    https://list.resiprocate.org/mailman/listinfo/resiprocate-devel
>> >>
>> >>