> Glibc rand uses a seed, how is the seed accessed or protected?
If you look at the source I linked to earlier, random() is implemented as:
long int
__random ()
{
  int32_t retval;

  __libc_lock_lock (lock);

  (void) __random_r (&unsafe_state, &retval);

  __libc_lock_unlock (lock);

  return retval;
}

So the state is in "unsafe_state" and the variable "lock" is used to
protect it.  The "lock cmpxchg" assembly from before is what
__libc_lock_lock(lock) compiles into.  Anywhere unsafe_state is set or
accessed, it's protected by that lock as far as I can tell.

So looking at this code, I don't see how random() itself could be
causing this problem, unless the contract on what __libc_lock_lock
ensures is different from what I think it is.
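
To make the locking argument concrete, here is a minimal C++ sketch of the
pattern. It is not glibc's code: CasLock, lockedNext, countDistinct and the
toy LCG are all names I made up for illustration, and glibc's real lock is
more elaborate than a bare spin loop. On x86, std::atomic's compare-exchange
typically compiles to the same "lock cmpxchg" instruction. The toy generator
has full period, so any duplicate among the drawn values would mean two
threads stepped from the same state:

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <set>
#include <thread>
#include <vector>

// Hypothetical stand-in for glibc's internal lock: spin on a compare-and-swap.
class CasLock {
public:
    void lock() {
        int expected = 0;
        // Try to swap 0 -> 1. On failure 'expected' is overwritten with the
        // current value, so reset it before retrying.
        while (!flag_.compare_exchange_weak(expected, 1, std::memory_order_acquire)) {
            expected = 0;
            std::this_thread::yield();
        }
    }
    void unlock() { flag_.store(0, std::memory_order_release); }
private:
    std::atomic<int> flag_{0};
};

static CasLock g_lock;
static std::uint32_t g_state = 12345;  // shared state, playing the role of unsafe_state

// One step of a toy full-period 32-bit LCG, done entirely under the lock.
std::uint32_t lockedNext() {
    g_lock.lock();
    g_state = g_state * 1664525u + 1013904223u;
    std::uint32_t result = g_state;
    g_lock.unlock();
    return result;
}

// Draw perThread values on each of nThreads threads and return how many
// distinct values were seen across all threads.
std::size_t countDistinct(int nThreads, int perThread) {
    std::vector<std::vector<std::uint32_t>> drawn(nThreads);
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; ++t) {
        workers.emplace_back([&drawn, t, perThread] {
            for (int i = 0; i < perThread; ++i) drawn[t].push_back(lockedNext());
        });
    }
    for (auto& w : workers) w.join();
    std::set<std::uint32_t> unique;
    for (const auto& v : drawn) unique.insert(v.begin(), v.end());
    return unique.size();
}
```

If the lock does what I think it does, countDistinct(4, 20000) comes back as
80000 every time. Dropping the lock()/unlock() calls is the experiment that
should start producing repeats.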


> I'm baffled by how this could be happening.  Looking at random.o in
> libc on the fedora and ubuntu machines I have handy right now (neither
> ia64, but both smp's), the lock in random() is implemented:
>  17:   f0 0f b1 0d 00 00 00    lock cmpxchg %ecx,0x0
> my understanding is that operation is considered smp-safe.  So unless
> I'm missing something very basic about the guarantees that
> operation provides (which is entirely possible, I don't claim to be an
> expert on low-level memory operations), I don't understand how random
> could be causing the problem.  I don't see how anything else coming in
> from makeInviteSession could be causing it, either.  I'd be interested
> in whether a collision is seen if you logged the callId in
> BaseCreator.cxx right after computeCallId is called, but of course
> that might change behavior...
> There are some possible race conditions that have never been fixed in
> Condition.cxx, and it's possible to do some stupid things with
> pointers to temporaries with some of the code in rutil (Data::c_str
> being the best example), but I don't see any of that involved in
> makeInviteSession.
> Bruce
>> We see the issue on a gentoo stock glibc 2.6.1 version on a dual
>> quad-core Intel server.
>> I spent a little bit of time looking at this, but it's left me more
>> confused than I was before.
>> Have you determined what platforms people are actually seeing the
>> CallID problem with?  In particular, what libc are they using?  To get
>> a duplicate callid, it looks like you would have to get 4 consecutive
>> calls to random() to return the same result.  The only way I can see
>> that would happen would be if two threads run their calls in parallel
>> starting with the same state, but without sharing any updates to the
>> random state.
>> With glibc, I believe this is virtually impossible.  The glibc
>> implementation of rand and random imposes a mutex around all of the
>> calls that access the static state.
>> http://sourceware.org/cgi-bin/cvsweb.cgi/libc/stdlib/random.c?rev=1.18&content-type=text/x-cvsweb-markup&cvsroot=glibc
>> so unless there's something I'm not seeing like a peculiar cache
>> setting being used for the lock and memory random() uses, I don't see
>> how this problem is possible there.
>> Based on that, I'm wondering if a different libc implementation is
>> being used here, and the reason switching to SSL fixes the problem is
>> that the openssl implementation actually forces thread safety
>> (ssleay_rand_bytes does locking, and it ultimately is the default rand
>> function in openssl).  My conclusion would be that the right thing to
>> do is to add a mutex to getRandom() that is used if an unsafe C
>> library is being used (not entirely sure how to check for that, but
>> could probably identify a set of known-safe C libraries that can be
>> detected).  That way, the concern about other uses of Random that
>> aren't being detected goes away.
>> Bruce
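
For reference, the mutex idea quoted above could look roughly like the
following. This is only a sketch under assumptions: lockedRandom is a
hypothetical name, not the actual resip Random::getRandom(), and std::rand()
stands in for whatever generator the C library provides. Deciding when the
lock can be skipped for a known-safe libc is left out entirely:

```cpp
#include <cstdlib>
#include <mutex>

namespace sketch {

// Hypothetical wrapper: serialize every call into the C library generator so
// that two threads can never step it from the same internal state.
inline long lockedRandom() {
    static std::mutex mtx;                    // one process-wide lock
    std::lock_guard<std::mutex> guard(mtx);   // released when guard goes out of scope
    return std::rand();                       // stand-in for the unsafe libc call
}

}  // namespace sketch
```

The cost is one uncontended lock per call on a libc that already locks
internally, which is why detecting the safe libraries would be worth doing.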
>>> As we've seen in the past, the Call-ID generation code that DUM uses
>>> (resip/stack/Helper.cxx:625 on head) can generate colliding Call-IDs under
>>> high-load conditions. The current code looks like this:
>>>   Data
>>>   Helper::computeCallId()
>>>   {
>>>      static Data hostname = DnsUtil::getLocalHostName();
>>>      Data hostAndSalt(hostname + Random::getRandomHex(16));
>>>   #ifndef USE_SSL // .bwc. None of this is neccessary if we're using openssl
>>>   #if defined(__linux__) || defined(__APPLE__)
>>>      pid_t pid = getpid();
>>>      hostAndSalt.append((char*)&pid,sizeof(pid));
>>>   #endif
>>>   #ifdef __APPLE__
>>>      pthread_t thread = pthread_self();
>>>      hostAndSalt.append((char*)&thread,sizeof(thread));
>>>   #endif
>>>   #ifdef WIN32
>>>      DWORD proccessId = ::GetCurrentProcessId();
>>>      DWORD threadId = ::GetCurrentThreadId();
>>>      hostAndSalt.append((char*)&proccessId,sizeof(proccessId));
>>>      hostAndSalt.append((char*)&threadId,sizeof(threadId));
>>>   #endif
>>>   #endif // of USE_SSL
>>>      return hostAndSalt.md5().base64encode(true);
>>>   }
>>> I spoke to Byron just now, and he thinks the comment about "USE_SSL" is
>>> not accurate. (It would be if the code under getRandomHex() called into
>>> OpenSSL -- currently, it does not).
>>> To help refresh memories, we've visited this problem in detail before,
>>> most recently here:
>>> http://list.resiprocate.org/archive/resiprocate-devel/msg06605.html
>>> The conclusion of that thread left me confused -- Alan demonstrated that
>>> we'll have collisions (albeit rarely) on just about any architecture, and
>>> that such collisions don't require multithreading to occur. From my read
>>> of things, Aron's problem (and Ilana's; see
>>> http://list.resiprocate.org/archive/resiprocate-users/msg00642.html)
>>> occurs more frequently than in Alan's test program.
>>> It seems to me that there are a few things we can do to try and address
>>> this:
>>>  1. If we're using OpenSSL, make computeCallId call through to OpenSSL
>>>     for its random numbers (there are a few paths to get there, so
>>>     I'm just throwing out the general idea at this point).
>>>  2. Remove the "#ifndef USE_SSL" guards from computeCallId() -- is
>>>     this sufficient?
>>>  3. Do #2, but also salt in a 32-bit thread-local serial number to
>>>     prevent intra-thread collisions.
>>> Thoughts? (If no one expresses an opinion in a reasonable amount of time,
>>> I'll probably do #3).
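
As a rough illustration of option 3 in the list quoted above, a thread-local
serial number can be salted in like this. Everything here is an assumption
for the sake of the sketch: makeSalt is a hypothetical helper, std::string
stands in for resip's Data, std::rand() stands in for getRandomHex(), and the
md5/base64 step of computeCallId() is omitted. The pid would be appended the
same way as the thread tag:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <string>
#include <thread>

// Hypothetical helper: builds only the pre-hash salt for a Call-ID.
std::string makeSalt(const std::string& hostname) {
    // Each thread gets its own counter, so no locking is needed to bump it.
    thread_local std::uint32_t serial = 0;
    const std::uint32_t mySerial = serial++;

    // A per-thread tag separates threads whose serial numbers coincide.
    const std::size_t threadTag =
        std::hash<std::thread::id>{}(std::this_thread::get_id());

    std::string salt(hostname);
    salt += '-';
    salt += std::to_string(std::rand());   // random part, may repeat under load
    salt += '-';
    salt += std::to_string(threadTag);
    salt += '-';
    salt += std::to_string(mySerial);      // guarantees intra-thread uniqueness
    return salt;
}
```

Even if the random part repeats, two salts built on the same thread differ in
their final field until the 32-bit serial wraps.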
>>> [It occurs to me that we must have a similar problem with tags and branch
>>> IDs, albeit without any assert()s being triggered -- I would presume that
>>> any fix made to Call-ID should also be made to them as well, in
>>> Helper::computeUniqueBranch() and Helper::computeTag()]
>>> /a
>> it
>>> down.
>>> Previous reports can be found here:
>>> http://list.resiprocate.org/archive/resiprocate-devel-old/msg03200.html
>>> http://list.resiprocate.org/archive/resiprocate-devel/msg06605.html
>>> Aron's solution -- shunting "getRandom" over to "getCryptoRandom" -- worked
>>> for him. Of course, you impose a higher load on your CPU when you do so, so
>>> you may want to try tracking the problem down and addressing it in a more
>>> efficient way.
>>> The problem does not seem to surface except when using DUM.
>>> /a
>>>> Hello
>>>> I have just started to use dum in our application and noticed that if I
>>>> run calls at a very high rate the call id repeats itself.
>>>> What am I doing wrong? I have a separate thread that calls buildFdSet,
>>>> stack process and dum process. There is a semaphore before it and a
>>>> semaphore for all the api calls that come from my application.
>>>> I have run a call to computeCallId from the same thread (the thread that
>>>> runs the dum and stack) and the value returned seems to be fine. But when
>>>> it gets called from the api makeInviteSession, which is called from the
>>>> context of my application thread, the value repeats itself for around 8
>>>> calls.
>>>> The calls are created one after another in a very high volume. If the
>>>> calls are created in a low volume (let's say one per second) everything is
>>>> fine.
>>>> Has anyone seen this problem?
>>>> Thanks
