There is only a single thread running. Keep in mind that there are
a lot of other calls happening in the call stack between successive invocations
of dum->makeInvite.

In our initial tests it was 100% repeatable that duplicate
call-ids and tids would be generated. If we stuck in a usleep(1), the problem
went away.

What I think is happening is that the OS is moving the thread to a different
physical CPU between successive calls, but the cache isn’t
updated. If we used sched_setaffinity, the initial location of
the duplicate would move, but other assert and random() failures would occur
later.

-Aron

From: Alan Hawrylyshen
[mailto:alan@xxxxxxxxxxxx]

Now that I've slept on it - I suppose that some level of
parallelism would make this problem worse. Is there any chance that your application
is calling the Helper functions that generate the CallIDs from more than one
thread?

Your results below are exactly in line with what I see on my system. I'm
not convinced this is a problem (specifically, that these results illustrate
the problem). More consideration is required.

Can you rule out that you call random() (via a routine that
makes a callid) from more than one thread?

Thanks

Alan

On 20-Mar-08, at 10:44, Aron Rosenberg wrote:
First run – Count was around 70

mp-test ~ # ./a.out
tot: 778262873
l1: 2060261465
l2: 2060261465
Aborted

Second Run – Count was at 400

mp-test ~ # ./a.out
tot: 4033371507
l1: 1314891622
l2: 1314891622
Aborted

Third Run – Count was at 130

mp-test ~ # ./a.out
tot: 1427405301
l1: 475005228
l2: 475005228
Aborted

mp-test ~ # ./a.out
tot: 1309167503
l1: 71029242
l2: 71029242
Aborted

-Aron

From: Alan
Hawrylyshen [mailto:alan@xxxxxxxxxxxx]

I am still quite tempted to prove
what the failure is with a minimal test driver. I fear that it might be
something slightly more insidious. So, once we can cause this to happen
at will, we can address the appropriate root cause.

Is this something that can
be checked easily? Anyone?

I have a test driver that fails on
a dual-core Intel platform, gcc 4.0.1, Mac OS X 10.5.2.

This will fail around the 100 mark
in the progress output (but I have waited much longer). Let it run for a while and see.

This will abort when two
successive calls to random() match. I would expect this to be
unlikely, but should we check this on a single-processor / single-core system? Does it happen more often on dual-core
or SMP systems?

Aron - can you try this on your
platform? Please run it a LOT and see if the
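time-to-run varies greatly or if it fails reliably.

Thanks

Alan

--

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <string.h>

    int main()
    {
        /* Progress is printed once every `modulator` calls. */
        const unsigned long long modulator = 10000000ULL;
        unsigned long long t = 0;   /* calls made inside the loop */
        unsigned long l1;           /* previous value from random() */
        unsigned long l2 = 0UL;     /* current value from random() */

        /* Seed before the first draw so every value comes from the
           seeded sequence. */
        srandom(time(0));
        l1 = (unsigned long)random();

        while (1) {
            l2 = (unsigned long)random();
            /* Abort as soon as two successive calls return the same value. */
            if (l1 == l2) {
                printf("tot: %llu\nl1: %lu\nl2: %lu\n", t, l1, l2);
                abort();
            }
            l1 = l2;
            t++;
            if (!(t % modulator)) {
                printf("%llu...\r", t / modulator);
                fflush(stdout);
            }
        }
        return 0;
    }

Alan

On 19-Mar-08, at 15:56, Aron Rosenberg wrote: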
The only thing that I could think of is to use the random_r
and srandom_r functions instead of random and srandom. The glibc _r variants force the
application to keep the “seed” value, which might make it immune to
the caching problem. The issue with this approach is that the entire Random() class
is static, although you could just add a class-wide static variable to hold the
new userland data.
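
As a rough illustration of the random_r idea above, here is a minimal sketch in C, assuming glibc on Linux (random_r and initstate_r are glibc extensions, not POSIX, and may not be available on other platforms such as Mac OS X). The caller_rng type and the caller_rng_init / caller_rng_next helpers are hypothetical names made up for this example; they are not part of glibc or of the stack discussed here. The only point is that the generator state lives in memory the caller owns, so each thread (or a single class-wide static) can hold its own copy.

```c
#define _DEFAULT_SOURCE   /* ask glibc to expose random_r / initstate_r */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: generator state owned by the caller.
   Do not copy or move this struct after initialisation, because
   `data` keeps pointers into `statebuf`. */
typedef struct {
    struct random_data data;
    char statebuf[64];    /* 64 bytes is one of the state sizes glibc accepts */
} caller_rng;

/* Seed the caller-owned generator. Returns 0 on success, -1 on error. */
static int caller_rng_init(caller_rng *rng, unsigned int seed)
{
    assert(rng != NULL);
    /* glibc requires struct random_data to be zeroed before initstate_r. */
    memset(rng, 0, sizeof *rng);
    return initstate_r(seed, rng->statebuf, sizeof rng->statebuf, &rng->data);
}

/* Draw the next value, in the same 0 .. 2^31-1 range as random(). */
static int32_t caller_rng_next(caller_rng *rng)
{
    int32_t value = 0;
    assert(rng != NULL);
    random_r(&rng->data, &value);
    return value;
}
```

A caller would run caller_rng_init(&rng, seed) once and then call caller_rng_next(&rng) wherever random() was used before. Two generators seeded with the same value produce the same sequence, so on its own this does nothing to rule out duplicate ids if the seeds collide.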