
Re: [reSIProcate] The Million User Dilemma


This is a very interesting topic, but in my experience putting the transports into different threads did not improve performance at all, while using Google's tcmalloc memory allocator did improve the performance of the stack. reSIProcate is a very good general-purpose SIP stack, but as Francis pointed out, it does not scale well on multi-core CPUs. As for TLS, by adding SSL_MODE_RELEASE_BUFFERS support we can save about 34K of memory per TLS connection (http://www.openssl.org/docs/ssl/SSL_CTX_set_mode.html).
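For reference, a minimal sketch of enabling this mode on an OpenSSL context (assuming ctx is an already-initialized SSL_CTX; requires OpenSSL 1.0.0 or later):

#include <openssl/ssl.h>

// Enable lazy buffer release on an existing SSL_CTX. With this mode
// set, OpenSSL frees a connection's read/write buffers whenever they
// are empty, instead of holding them for the life of the connection,
// which is where the ~34K per idle TLS connection goes.
void enableReleaseBuffers(SSL_CTX* ctx)
{
   // SSL_CTX_set_mode OR-s the given flags into the context's mode
   // and returns the resulting mode bitmask.
   SSL_CTX_set_mode(ctx, SSL_MODE_RELEASE_BUFFERS);
}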

I have built an Erlang/OTP based SIP proxy with TCP/TLS/UDP transport support; it scales almost linearly on multi-core CPUs. It can support more than 100K TLS connections and more than 1000 CPS. If you are interested, please contact me offline.

Best regards,

/Kaiduan

From: Francis Joanis <francis.joanis@xxxxxxxxx>
To: Kennard White <kennard_white@xxxxxxxxxxxx>
Cc: resiprocate-devel@xxxxxxxxxxxxxxx
Sent: Sat, January 29, 2011 11:52:24 PM
Subject: Re: [reSIProcate] The Million User Dilemma

Hi guys,

I've been using resip + DUM for a while, but since I was focused more on building UAs (rather than proxies, ...) with it, I've had no performance issues so far. In fact, I found that it actually performed better than some other SIP application layers, especially when handling multiple SIP events at the same time.

The reason it performed better is that DUM (the application level) generally uses a single thread for all SIP events, rather than one thread per event or event type (imagine what one thread per session would do... :( ).

I see the current resip threading code as an instance of the reactor design pattern, where a single thread is used to "select" and then synchronously process events. From my experience, one main advantage of this approach is that the stack's general behaviour is predictable with regard to its performance and the flow of events (i.e. processing a single call vs. processing 100+ incoming calls).

However, one downside of the reactor is that it doesn't scale well on multi-core CPUs, since it only has a single thread. To really leverage multiple cores, programs need to become more and more concurrent (truly concurrent, i.e. without mutexes and locking) in order to get faster. This is probably nothing new for most of us, but it is something I've come to appreciate more and more in practice since being exposed to concurrent languages like Erlang.

I think that investing time into making resip (at least the stack and DUM parts) multicore aware would be a great way to future-proof it.

To add to what was already said in this thread:

- It does make a lot of sense to leverage libevent or asio or ... to ensure the best performance on all platforms. This is a long-term goal, but maybe we could start some prep work now (like decoupling things and laying down foundations). The alternative would be to implement the best select() substitute for each supported platform ourselves, but we might then end up rewriting libevent.
- Regarding the reactor design pattern, there is also the proactor pattern, which uses (unless I'm mistaken) OS-level asynchronous IO (resip currently uses synchronous IO). The idea is that the resip transport thread would be able to service multiple IO operations at the same time through the kernel. I think this is similar to what Kennard mentioned as a post-notified system, and it would not be an easy change.
- Adding more threads where it makes sense (like one per transport, or ...) might not be good enough if those threads still use locking to communicate with each other. I've done a bit of googling about lock-free data structures and it is quite interesting; I might try it one day to see how much faster things could get just between the stack and the DUM (see the sketch after this list).
- It also makes sense to look into code profiling, to ensure that the code is not "wasting cycles".
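To make the lock-free idea concrete, here is a minimal sketch (not resip code; names and sizes are illustrative) of a single-producer/single-consumer ring buffer of the kind that could sit between the stack thread and the DUM thread. It uses C++0x atomics, so treat it purely as an illustration of the technique:

#include <atomic>
#include <cstddef>

// Single-producer/single-consumer lock-free queue: exactly one thread
// calls push() (e.g. the stack thread) and exactly one calls pop()
// (e.g. the DUM thread). No mutexes; correctness relies on the
// acquire/release ordering of the two indices.
template <typename T, size_t Size>
class SpscRing
{
public:
   SpscRing() : mHead(0), mTail(0) {}

   bool push(const T& item)      // producer thread only
   {
      size_t head = mHead.load(std::memory_order_relaxed);
      size_t next = (head + 1) % Size;
      if (next == mTail.load(std::memory_order_acquire))
         return false;           // full
      mBuffer[head] = item;
      mHead.store(next, std::memory_order_release);
      return true;
   }

   bool pop(T& item)             // consumer thread only
   {
      size_t tail = mTail.load(std::memory_order_relaxed);
      if (tail == mHead.load(std::memory_order_acquire))
         return false;           // empty
      item = mBuffer[tail];
      mTail.store((tail + 1) % Size, std::memory_order_release);
      return true;
   }

private:
   T mBuffer[Size];
   std::atomic<size_t> mHead;    // written only by the producer
   std::atomic<size_t> mTail;    // written only by the consumer
};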

Anyway, I think this is a great idea and I would be happy to help :)

Regards,
Francis

On Sat, Jan 29, 2011 at 1:57 PM, Kennard White <kennard_white@xxxxxxxxxxxx> wrote:
Hi Byron,

Scott and I earlier discussed possible directions to take, and he found this: http://google-opensource.blogspot.com/2010/01/libevent-20x-like-libevent-14x-only.html which is a good overview of pre-notification approaches (select/poll/epoll, which tell the app when IO is possible) vs. buffered or post-notification approaches (where you queue IO up into the kernel, and it tells the app when the IO is complete). Asio is a post-notification system and, as far as I can tell, doesn't offer a pre-notification API. In contrast, libevent offers both pre-notification and post-notification. The gotcha here is that on Windows the only way to effectively handle many connections is with post-notification (so I understand). Windows has pre-notification APIs, but they don't scale (I'm told).
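To make the pre-notification model concrete: the kernel only says "IO is now possible", and the app still performs the read itself into buffers it owns. A bare-bones Linux epoll loop looks roughly like this (illustrative only; error handling omitted, and the fd is assumed to be an already-connected socket):

#include <sys/epoll.h>
#include <unistd.h>

// Pre-notification: epoll_wait() reports that a descriptor is
// readable; the application then does the read() itself, into a
// buffer it allocates and manages (as resip does, esp. for TCP).
void eventLoop(int connectedFd)
{
   int epfd = epoll_create1(0);

   struct epoll_event ev = {};
   ev.events = EPOLLIN;
   ev.data.fd = connectedFd;
   epoll_ctl(epfd, EPOLL_CTL_ADD, connectedFd, &ev);

   struct epoll_event ready[64];
   for (;;)
   {
      int n = epoll_wait(epfd, ready, 64, -1);  // block until IO possible
      for (int i = 0; i < n; ++i)
      {
         char buf[4096];                        // app-owned buffer
         ssize_t len = read(ready[i].data.fd, buf, sizeof(buf));
         if (len <= 0)
            break;                              // closed or error
         // ... hand buf/len to the transaction layer's queue ...
      }
   }
}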

With respect to the current resip codebase, any pre-notification library could be plugged into resip in place of my "native" epoll, and (hopefully) everything is properly virtualized to allow this. We need to do something here, because there is significant branching within the current codebase to handle epoll vs. the "older" buildFd/process paradigm. Supporting both modes won't be fun.

The alternative is to make a "big" leap to a post-notification system. The problem is that this is much more intrusive for the application, because the underlying framework has its own buffer management system (think mbufs), and every framework manages buffers differently. This is in contrast to pre-notification, where the app tells the kernel "put the data here" and provides its own buffer management (which resip does, especially for TCP). I haven't started a prototype so I don't know for sure, but my guess is that a given transport class within resip would be "hardcoded" to work with a particular framework -- it cannot be hidden. Given that, asio is a natural choice, since (hopefully) it works everywhere we want.

Anyway, I'm undecided among three options:
  • Write an FdPollGrp impl class that uses Windows' select call, so that there is a working FdPollGrp class on every platform and buildFd/process can be obsoleted. Unfortunately, I don't develop for Windows, and Windows' select() is somewhere in between Linux's select() and poll(), so really someone else needs to do this.
  • Write a libevent2 adapter for FdPollGrp that uses the pre-notification mode, and then obsolete buildFd/process. libevent2 then becomes a required dependency for Windows (and any platform without a working epoll()).
  • Switch everything to asio. This is a big project, requires turning the transport code inside-out, and would break compatibility with any "private" transports.
Regarding asio::strand, as far as I can tell it is "just" a per-handler mutex. I don't see how multi-threading the transports helps anything. All the heavy work is in the transaction layer, and there is a queue interface between the transports and the transaction layer. I think putting transports in separate threads has been tried before ("ExternalTransports"), and my understanding is that it didn't pay off. One can see the same thing by running testStack in the different threading modes I added; the multithreaded ones all perform worse.
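For anyone unfamiliar with the construct: a strand guarantees that handlers dispatched through it never run concurrently, even when several threads are driving the io_service, which is why it behaves like a per-handler mutex. A minimal Boost.Asio sketch (illustrative names, not resip code):

#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <boost/thread.hpp>

// Handlers posted through the strand are serialized, so this function
// can touch shared state without explicit locking.
void handleEvent(int id)
{
   // ... process one event against shared transaction state ...
}

// Drives the io_service from a worker thread.
void runService(boost::asio::io_service* io)
{
   io->run();
}

int main()
{
   boost::asio::io_service io;
   boost::asio::io_service::strand strand(io);

   // Queue up work; the strand serializes it even with two workers.
   for (int i = 0; i < 10; ++i)
      strand.post(boost::bind(&handleEvent, i));

   boost::thread worker(boost::bind(&runService, &io)); // second thread
   io.run();                                            // this thread too
   worker.join();
   return 0;
}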

Kennard

On Fri, Jan 28, 2011 at 7:07 PM, Byron Campen <bcampen@xxxxxxxxxxxx> wrote:
       tfdum is actually doing the boost::bind trick here, but with no asio Strand (the bindings are to the various blahCommand() functions). I just wish the compiler spew weren't so bad when you get a parameter not quite right, but that's gcc templates for you. An app writer can easily use boost::bind in their app without requiring any boost dependency in resip or DUM, so that is at least nice. I'm not familiar enough with asio Strand to say how much work it would be to make resip's threading use it; I'm guessing it is a wrapper for pthreads/whatever Windows uses/the fancy Intel threading stuff?
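(For anyone who hasn't seen the boost::bind trick: the idea is to package a method call into a generic functor and let the owning thread execute it later. The names below -- Session, endCommand(), and the plain std::deque standing in for a thread-safe fifo -- are hypothetical, not the real tfdum classes.)

#include <boost/bind.hpp>
#include <boost/function.hpp>
#include <deque>

// A queued "command" is just a nullary functor.
typedef boost::function<void()> Command;

// Hypothetical session object owned by the DUM thread.
class Session
{
public:
   void endCommand() { /* tear the session down on the DUM thread */ }
};

int main()
{
   std::deque<Command> fifo;   // stands in for a real thread-safe fifo
   Session session;

   // Any thread can package up the call...
   fifo.push_back(boost::bind(&Session::endCommand, &session));

   // ...and the owning thread later executes it in its own context.
   while (!fifo.empty())
   {
      fifo.front()();
      fifo.pop_front();
   }
   return 0;
}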

       As for using asio to just drive the event loops, Scott, roughly how much work would need to be done here? And how many platforms would this benefit? I know the epoll stuff works on OS X; how would Windows benefit from using asio? I'm thrilled with epoll, but that's just me.

Best regards,
Byron Campen


> Hi Kennard,
>
> I think you're on the right track with using epoll, but I'd like to go
> one step further and improve cross-platform compatibility in the
> process.  Scott Godin has been keeping the header-only asio up to date
> in the resiprocate tree, and it provides support for every platform's
> most sophisticated version of select/epoll/kqueue, etc.  Reimplementing
> things like FdSet and the wait and process functions with the
> async_wait that asio provides could yield a humongous performance
> improvement, as Asio can easily be multithreaded.
>
> Also consider things like DumCommand.  It can easily be replaced with
> Asio's Strand + boost::bind/boost::function or C++0x lambdas, which
> are much more flexible and require significantly less code.
>
> Dan
>
> On 01/26/2011 01:45 PM, Kennard White wrote:
>> Hi Dan,
>>
>> I found your post very interesting, since we have very similar goals.
>> The changes I've made recently to resip to add epoll support are
>> meant to address the first limitation: simply being able to have many
>> connections open.
>>
>> I've spent some amount of time profiling resip, and unfortunately I
>> haven't found one single hot spot. SipMessage allocation and
>> destruction is probably the most expensive part, but I haven't looked
>> into it in any detail. For reference, I'm getting about 2k TPS on
>> good hardware in "real" usage scenarios. Probably the first thing to
>> do is to look for unnecessary message copies.
>>
>> For the SIP aspect of NAT traversal, we are switching to TCP/TLS
>> (away from UDP) using RFC 5626 outbound support.
>>
>> Would like to hear your plans.
>>
>> Regards,
>> Kennard
>>
>> On Wed, Jan 26, 2011 at 10:15 AM, Dan Weber <dan@xxxxxxxxxxxxxx
>> <mailto:dan@xxxxxxxxxxxxxx>> wrote:
>>
>>    Hi guys,
>>
>>    I must say I have a quite ambitious goal.  I want to make it so that I
>>    can build a network of repros that can support millions upon millions
>>    of users.  Likewise, I like to consider myself a standards-based guy,
>>    and I want to take as much of everyone's input as possible into the
>>    design process.  In return, everything will be made available for
>>    free under the same Vovida license and/or BSD licensing that is
>>    already available.
>>
>>
>>    Several key areas of concern are the following:
>>
>>    Reliability:
>>    How do we make it so that many repro nodes can work together across
>>    a large geographic topology, and allow calls to continue processing
>>    in the event of an attack or a failure?
>>
>>    Scalability:
>>    If you've ever run the testStack application on a modern computer,
>>    you'll notice that no matter how many cores you have, or even what
>>    the clock rate of your processor is, there seems to be a magic
>>    threshold around 6500 TPS for non-INVITE scenarios.  Likewise, for
>>    calls, I can get about 1/3 of that.  Also, those tests were done
>>    with TCP; when you add in UDP, you can watch it suck up memory like
>>    it's its job.  Based on what Byron has shown me, the stack that
>>    Estacado/Tekelec built and modified from the main resiprocate tree
>>    can perform over 12000 TPS for non-INVITE transactions in a single
>>    thread, on inferior hardware.  This means there is even greater
>>    room for improvement beyond just adding concurrency.
>>
>>    Security:
>>    Resiprocate supports TLS fairly well.  I would like to be able to
>>    take advantage of that with any reliability mechanism put forth, to
>>    help meet HIPAA-style requirements that mandate that all data
>>    stored to disk and all data in transit be encrypted.  Thankfully,
>>    part of this problem can be resolved more easily by keeping more
>>    state in memory.
>>
>>    NAT Traversal:
>>    Jeremy Geras and Scott Godin among others have worked very hard to
>>    provide NAT traversal mechanisms for calls and registrations and so
>>    forth through reTurn, reflow, and recon.  Jeremy's branch of recon
>>    utilizes an outdated stack, but supports ICE to a large degree.  It is
>>    missing support for ICE with TURN and has some other quirks that I've
>>    managed to work out.
>>
>>    In my research around these key areas, I have come up with several
>>    ideas of my own to deal with these issues; however, I would like to
>>    open this up to the community, to discuss these areas in an open
>>    forum where everyone can participate and have their input taken
>>    seriously.
>>
>>    Thanks guys,
>>    Dan
>>
>
>

_______________________________________________
resiprocate-devel mailing list
resiprocate-devel@xxxxxxxxxxxxxxx
https://list.resiprocate.org/mailman/listinfo/resiprocate-devel