Sunday, June 2, 2013

A further look at gcc's __sync_bool_compare_and_swap.

Continuing with the promising-looking gcc atomic CAS functions, the question now is how to carry those results into real applications and make the nice benchmark numbers actually help. To understand the behavior more clearly, some more tests have been done.

Following a similar pattern, consider this synchronization scenario:
Two threads must each modify a shared datum only after the other one has finished its own modification, except for the initial state.
This time, with a couple of extensions (a sketch of the resulting loop follows the list):
  1. Each thread does a bit of extra work after modifying the shared data, while still holding the modification right. This makes the other thread spend more time (and, in some cases, resources) acquiring the right to modify the shared data.
  2. Thinking of real-world applications, threads should hand the CPU back after some amount of time, especially busy-looping ones. So each thread sleeps for a microsecond whenever it fails to acquire the modification right 10 times in a row.
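
To make the setup concrete, here is a minimal, self-contained sketch of the pattern with both extensions as I understand them. The original harness is not shown in this post, so the structure and all names (Turn, Shared, Worker, DoDummyJob, kRounds) are my own assumptions:

    /* Build: gcc -O2 -pthread cas_handshake.c -o cas_handshake */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    #define kRounds 100000          /* handovers per thread */

    static volatile int Turn = 0;   /* 0/1: who holds the right; -1: claimed */
    static volatile int Shared = 0; /* the shared data */

    static void DoDummyJob(void)    /* extension 1: extra work while holding */
    {
        volatile int DummyCnt;
        for(DummyCnt = 0; DummyCnt < 1000; ++DummyCnt)
            ; /* stand-in for the dummy loop shown below */
    }

    static void *Worker(void *Arg)
    {
        int Me = (int)(long)Arg;    /* thread id: 0 or 1 */
        int Other = 1 - Me;
        int Round;

        for(Round = 0; Round < kRounds; ++Round)
        {
            int Tries = 0;
            /* Claim the modification right with CAS; extension 2:
             * after every 10 failed tries, sleep one microsecond. */
            while(!__sync_bool_compare_and_swap(&Turn, Me, -1))
            {
                if(++Tries == 10)
                {
                    Tries = 0;
                    usleep(1);
                }
            }

            ++Shared;               /* modify the shared data */
            DoDummyJob();           /* keep holding the right a bit longer */

            /* Hand the modification right over to the other thread. */
            __sync_bool_compare_and_swap(&Turn, -1, Other);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t T0, T1;
        pthread_create(&T0, NULL, Worker, (void *)0L);
        pthread_create(&T1, NULL, Worker, (void *)1L);
        pthread_join(T0, NULL);
        pthread_join(T1, NULL);
        printf("Shared=%d (expected %d)\n", Shared, 2 * kRounds);
        return 0;
    }

Claiming the right with a CAS from Me to -1 and releasing it with a CAS from -1 to Other makes both transitions full memory barriers, which is what keeps the plain ++Shared in between safe.
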
Then the result follows:
  1. For the dummy-job-after-modifying case, we run something like this:
    for(DummyCnt = 0; DummyCnt < 1000; ++DummyCnt)
    { // Dummy loop standing in for extra work
        if(DummyCnt == 123)
        {
            ++Cnt; // Count the number of test iterations.
        }
    }
    
    
    after each modification of the shared data. This time the results differ within the pthread family:
    ---
    Mutex:[25.290001] us, Atom=0
    ---
    CAS:[6.710000] us, Atom=0
    ---
    Cond:[18.620001] us, Atom=0
    ---
    Spin:[19.959999] us, Atom=0
    
    
    It seems that __sync_bool_compare_and_swap still holds a clear advantage over the others. Moreover, as a rough run-time observation with top, both the CAS and Cond cases sit at about 14% CPU usage, the mutex case a little higher at about 17%, while the spin lock burns about 120%.
  2. And for the sleep-a-microsecond-while-waiting case, the results follow:
    ---
    Mutex:[7.590000] us, Atom=0
    ---
    CAS:[7.030000] us, Atom=0
    ---
    Cond:[29.820000] us, Atom=0
    ---
    Spin:[7.610000] us, Atom=0
    
    
    Even with CPU usage taken into account, the condition variable version still shows the worst performance here. That makes an even stronger case for carefully adopting CAS as a replacement.
The testing environment for the tests above is an 8-core machine with 4 GB of RAM, running a 64-bit OS on Linux kernel 2.6.32. It is reasonable to question these results or conclusions when applying them to a setup with many more threads, tighter competition over the shared data, and relatively fewer CPU resources to share.
In particular, when there are more threads than CPU cores, a thread may have to wait for the CPU (context-switching in and being context-switched out) even while it is not accessing the shared/critical data. A condition variable might be the better choice in that case, since releasing the CPU becomes important and heavily affects performance; as the sketch below shows, a thread blocked in pthread_cond_wait consumes no CPU at all until it is signaled. In my opinion, if low latency still has top priority in such a case, getting a machine with more cores or redesigning the multithreading pattern is the only right move, rather than deliberating between CAS and condition variables.
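
For reference, here is a minimal sketch of a condition-variable version of the same handshake. Again this is my own illustration under the same assumptions as the CAS sketch above (the names Lock, Cond, Turn, Shared and HandshakeOnce are hypothetical), not the original test harness:

    #include <pthread.h>

    static pthread_mutex_t Lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  Cond = PTHREAD_COND_INITIALIZER;
    static int Turn = 0;   /* whose turn it is: 0 or 1 */
    static int Shared = 0; /* the shared data */

    /* One handover: wait for our turn, modify, pass the turn on. */
    static void HandshakeOnce(int Me, int Other)
    {
        pthread_mutex_lock(&Lock);
        while(Turn != Me)
        {   /* Sleeps in the kernel; consumes no CPU until signaled. */
            pthread_cond_wait(&Cond, &Lock);
        }
        ++Shared;      /* modify the shared data */
        Turn = Other;  /* hand the modification right over */
        pthread_cond_signal(&Cond);
        pthread_mutex_unlock(&Lock);
    }

The wakeup path goes through the kernel scheduler, which would explain both the lower CPU usage observed for the Cond case and its higher per-handover latency.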
