[reSIProcate] Random.cxx and MultiCore systems

Aron Rosenberg arosenberg at sightspeed.com
Wed Mar 19 17:56:40 CDT 2008


The only thing that I could think of is to use the reentrant random_r and
srandom_r functions instead of random and srandom. The glibc _r variants
force the application to keep the "seed" state itself, which might make it
immune to the caching problem.

 

The issue with this approach is that the entire Random class is static,
although you could just add a class-wide static variable to hold the new
userland data.
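
A minimal sketch of that idea, assuming glibc (random_r and initstate_r are glibc extensions, not ISO C++). The class name ReentrantRandom and its members are hypothetical and are not part of rutil/Random.cxx; the only point shown is that the generator state lives in class-wide static storage owned by the application rather than inside glibc. Whether this actually avoids the duplicate values described in this thread is untested.

```cpp
#include <cstdint>
#include <cstdlib>   // random_r, initstate_r, struct random_data (glibc extensions)
#include <cstring>

// Hypothetical helper, NOT the reSIProcate Random class.
// The generator state is held in static members instead of glibc's
// hidden global state. Not thread-safe: callers on several threads
// would have to serialize access themselves.
class ReentrantRandom
{
   public:
      static void seed(unsigned int s)
      {
         // glibc expects a zeroed random_data before initstate_r is called.
         std::memset(&mData, 0, sizeof(mData));
         initstate_r(s, mState, sizeof(mState), &mData);
      }

      static int getRandom()
      {
         int32_t result = 0;
         random_r(&mData, &result);   // fills result with a non-negative value
         return result;
      }

   private:
      static char mState[64];            // state buffer handed to glibc
      static struct random_data mData;   // bookkeeping that points into mState
};

char ReentrantRandom::mState[64];
struct random_data ReentrantRandom::mData;
```

Usage would be one call to ReentrantRandom::seed() at startup (for example from the same place the existing seed is set), followed by ReentrantRandom::getRandom() wherever a cheap pseudo-random value is needed.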

 

-Aron

 

 

From: Byron Campen [mailto:bcampen at estacado.net] 
Sent: Wednesday, March 19, 2008 2:55 PM
To: Alan Hawrylyshen
Cc: Aron Rosenberg; resiprocate-devel
Subject: Re: [reSIProcate] Random.cxx and MultiCore systems

 

Upon some discussion, it seems that this is happening _in a
single thread_, so using a thread-id will not help with this specific
problem (although it is probably a good idea anyway). It does seem
likely that we are running into a caching problem, although I am not
sure what can be done about this.

 

Anyone have any ideas?

 

Best regards,

Byron Campen

 





More to Byron's point: if you substitute the crypto random function, you
will notice a significant increase in latency. A cheap pseudo-random
solution that doesn't exhibit this effect is needed. We stumbled across
something similar years ago when we ported to a then-exotic dual
processor machine. The problem occurred very rarely then, but with 4+
cores and today's speeds I believe the problem will happen predictably, as
you have seen, Aron.

 

Alan
--

Sorry this is terse; it is from my handheld device. 


On 19 Mar 2008, at 10:14, Byron Campen <bcampen at estacado.net> wrote:

	I wouldn't re-implement Random::getRandom() with
getCryptoRandom(), since the contract on it is for providing cheap,
pseudo-random numbers. It would be more reasonable to change the code
that generates transaction-ids and tags (in fact, the code that
generates Call-Ids has been tweaked to help with this very problem that
you're seeing). The tweak in the Call-Id generation code involves
throwing the thread-id into the generated bits, which solves the
collision issue you're seeing. Maybe we could alter Random::getRandom()
to xor the current thread-id with everything it returns (this would be
in keeping with "cheap, pseudo-random numbers")? Or maybe we could add a
Random::getRandomReentrant() function?
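
A rough sketch of the xor idea, written in modern C++ for illustration only. The function name mixThreadId is hypothetical, and std::hash over std::thread::id is an assumption standing in for however the stack obtains a thread-id. Note that this only separates values produced on different threads; it does nothing for repeats that occur within one thread.

```cpp
#include <cstddef>
#include <functional>
#include <thread>

// Hypothetical helper: folds the calling thread's id into a value that
// came from the cheap pseudo-random generator.
inline unsigned int mixThreadId(unsigned int value)
{
   // Hash the opaque thread id down to an integer, then xor it in.
   const std::size_t tid =
      std::hash<std::thread::id>()(std::this_thread::get_id());
   return value ^ static_cast<unsigned int>(tid);
}
```

Because xor is its own inverse, applying mixThreadId twice on the same thread gives back the original value, so the mixing loses no bits from the generator output.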

	 

	            Anyone have an opinion on this?

	 

	Best regards,

	Byron Campen

	
	
	

	So this bug report concerns a very strange issue that we noticed
on our brand-new dual quad-core machine (8 CPUs) involving duplicate
Call-Ids, transaction IDs and tags being generated for independent
INVITEs. This behavior would then result in assert failures all over
the stack.

	 

	We have a single instance of DUM/reSIProcate running on its own
thread. Our application generates 4 independent INVITE requests at
exactly the same time, which results in sequential calls eventually being
made to Random.cxx and then glibc's random() function. Of the four calls,
we get the following random values returned:

	 

	Call 1: aaaaaaaaaaa 

	Call 2: bbbbbbbbbb

	Call 3: aaaaaaaaaaa   (same exact sequence of random values as
the first call)

	Call 4: bbbbbbbbbb  (same exact sequence of random values as the
second call)

	 

	Sometime later, various assert failures would occur due to
duplicate TID values and all sorts of other issues.
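
One way to check for the pattern above in isolation is a small diagnostic like the following sketch. The function countRepeatedSequences and its parameters are hypothetical names, not part of reSIProcate; it just draws several fixed-length sequences from whatever generator is passed in and counts how many are exact copies of an earlier one.

```cpp
#include <cstddef>
#include <functional>
#include <set>
#include <vector>

// Hypothetical diagnostic: draws `batches` sequences of `length` values
// from `next` and counts the sequences that exactly repeat an earlier one.
inline std::size_t countRepeatedSequences(const std::function<int()>& next,
                                          std::size_t batches,
                                          std::size_t length)
{
   std::set<std::vector<int> > seen;
   std::size_t repeats = 0;
   for (std::size_t b = 0; b < batches; ++b)
   {
      std::vector<int> seq;
      for (std::size_t i = 0; i < length; ++i)
      {
         seq.push_back(next());
      }
      // insert() reports false when an identical sequence was already stored.
      if (!seen.insert(seq).second)
      {
         ++repeats;
      }
   }
   return repeats;
}
```

Passing a lambda that wraps random() (or Random::getRandom()) with batches set to 4 would correspond to the four calls listed above; a healthy generator should essentially always give 0.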

	 

	If we pause or sleep the thread for 1 ms, then the problem
disappears. So what the heck is going on?

	 

	We think that the DUM thread is being migrated across CPUs between
the different invocations of glibc's random() function and the "seed"
value is stale in one of the CPU caches.

	 

	So how do we fix this? When we dug into the reSIProcate
Random.cxx code, we noticed that although we had linked against OpenSSL,
the OpenSSL random functions were not being used to generate values at
all. They would be used to initialize the seed, but not to actually
generate the random values.

	 

	If we used the crypto versions of the functions, the repetition
issue went away completely.

	 

	Here is a small patch which will use the crypto version if
USE_OPENSSL is defined:

	 

	--- rutil/Random.cxx.orig              2008-03-14 23:21:29.000000000 -0700
	+++ rutil/Random.cxx    2008-03-15 00:26:59.000000000 -0700
	@@ -149,8 +149,9 @@
	 Random::getRandom()
	 {
	    initialize();
	-
	-#ifdef WIN32
	+#if USE_OPENSSL
	+   return getCryptoRandom();
	+#elif WIN32
	    assert( RAND_MAX == 0x7fff );
	    int r1 = rand();
	    int r2 = rand();

	 

	 

	-Aron

	 

	Aron Rosenberg

	SightSpeed

	_______________________________________________
	resiprocate-devel mailing list
	resiprocate-devel at resiprocate.org
	https://list.resiprocate.org/mailman/listinfo/resiprocate-devel

 
