2013年5月22日 星期三

A rough survey on gcc builtin atomic operations.

When my colleague suggested me the gcc __sync_* functions, not much interest to me for which was treated as a low-level lock rather than a lock-free pattern. But it does surprise me after some survey.




gcc offers a series of builtin atomic functions which have name starting from `__sync_' after version 4.1. Simple tests have been done to show the great opportunities in application. Condsider a synchronized situation, a blocking producer-consumer model:
Two threads need to modify a shared data only after the other one has done the modification, except the initial status.
Let's say, thread A has a strong desire to make the data D be 1, and thread B insists D must be 0, thus each of two would monitor D and modify it when D has been modified by the counterpart, or more gracefully, to wait the other has done the job first. Condition variable usually be the common choise for the implementation:
static void *Cond_BeOne(void *Argu)
{
    CondStruct_t    *pStruct = (CondStruct_t *)Argu;
    int             Cnt = 0;
    int             *pAtom = pStruct->pAtom;
    pthread_mutex_t *pMtx = pStruct->Lock;
    pthread_cond_t  *pCond = pStruct->Cond;

    while(!*(pStruct->StartWorking))
    {
    }

    while(Cnt < TOTAL)
    {
        pthread_mutex_lock(pMtx);
        while((*pAtom) != 0)
        {
            pthread_cond_wait(pCond, pMtx);
        }
        (*pAtom) = 1;
        pthread_cond_signal(pCond);
        pthread_mutex_unlock(pMtx);

        ++ Cnt;
    }

    return NULL;
}

static void *Cond_BeZero(void *Argu)
{
    CondStruct_t    *pStruct = (CondStruct_t *)Argu;
    int             Cnt = 0;
    int             *pAtom = pStruct->pAtom;
    pthread_mutex_t *pMtx = pStruct->Lock;
    pthread_cond_t  *pCond = pStruct->Cond;

    while(!*(pStruct->StartWorking))
    {
    }

    while(Cnt < TOTAL)
    {
        pthread_mutex_lock(pMtx);
        while((*pAtom) != 1)
        {
            pthread_cond_wait(pCond, pMtx);
        }
        (*pAtom) = 0;
        pthread_cond_signal(pCond);
        pthread_mutex_unlock(pMtx);

        ++ Cnt;
    }

    return NULL;
}
And place it to a thread testing:
#define TOTAL 1000000
StartWorking = 0;
pthread_create(&OneThread, NULL, Cond_BeOne, &CondStruct);
pthread_create(&ZeroThread, NULL, Cond_BeZero, &CondStruct);

Start = clock();
StartWorking = 1;

pthread_join(OneThread, 0);
pthread_join(ZeroThread, 0);

Stop = clock();

printf("Cond:[%f] us, Atom=%d\n", (float)(Stop - Start)/CLOCKS_PER_SEC, Atom);

We then have a correct result:
Cond:[16.379999] us, Atom=0
And a busy-mutex version with similar pattern make big improvement:
while(Cnt < TOTAL)
{
    pthread_mutex_lock(pMtx);
    if(*pAtom == 0)
    {
        (*pAtom) = 1;
        ++ Cnt;
    }
    pthread_mutex_unlock(pMtx);
}
Mutex:[6.110000] us, Atom=0
Now the gcc builtin CAS has a amzing great result:
while(Cnt < TOTAL)
{
    while(!__sync_bool_compare_and_swap(pAtom, 0, 1))
    {
    }

    ++ Cnt;
}
CAS:[0.700000] us, Atom=0

The phrase `amazing great result' here implies two important factors which are not shown above:
  1. With the same of spin lock, pthread_spinlock_t causes a much worse result:
    while(Cnt < TOTAL)
    {
        pthread_spin_lock(pSpin);
        if(*pAtom == 0)
        {
            (*pAtom) = 1;
            ++ Cnt;
        }
        pthread_spin_unlock(pSpin);
    }
    
    Spin:[14.990000] us, Atom=0
    
    Looks like being the short job inside spin lock's critical section, the if-condition with some more assignments be too long for it.
  2. It sounds nothing special that busy-loop has better response time than blocking one, by paying huge CPU usage. But from the manual observation, the gcc builtin CAS version even has fewer CPU usage than the conditon variable counterpart.
It's worthy so far as using `__sync_bool_compare_and_swap' to replace condition variable for the implementation in `notify' feature, especially the short-time-waiting situation. Side effects might be bigger in the real complex implement, but as a latency-concern system, the above benefit would be charming enough for some resource wasting.

沒有留言:

張貼留言