Is there any way to speed up this OpenMP program further?

I am converting a serial program into an OpenMP program. The program counts how many pixels are covered by a circle of radius r. The parallel program produces the same pixel count as the serial program and runs faster, but it still does not meet the time limit. Changing the g++ flags or adjusting the chunk size did not help, so I think the code inside the parallel block needs to be revised. What can I do to speed it up further?

Any response will be appreciated.

serial program

for (unsigned long long x = 0; x < r; x++) {
        unsigned long long y = ceil(sqrtl(r*r - x*x));
        pixels += y;
        pixels %= k;
}

OpenMP program

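unsigned long long x, y;
unsigned long long pixels = 0;
unsigned long long r, k; // input parameters
// Each thread accumulates a partial sum in its own private copy of pixels;
// the partial sums are added together when the loop ends.
#pragma omp parallel for reduction(+: pixels) private(y)
for (x = 0; x < r; x++) {
        y = ceil(sqrtl(r*r - x*x));
        pixels += y;
        pixels %= k;
}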
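
For anyone who wants to reproduce the comparison, here is a minimal sketch that wraps the two loops above in functions so their results can be checked against each other and timed. The function names `count_serial` and `count_openmp` are hypothetical and not part of my real program, and the sketch assumes that `r*r` fits in an `unsigned long long`. Because the reduction adds up per-thread partial sums that were each reduced modulo `k`, the sketch applies one more `% k` after the loop.

```cpp
#include <cmath>

// Hypothetical helper: the serial loop from the question, wrapped in a function.
// Assumes r * r does not overflow unsigned long long.
unsigned long long count_serial(unsigned long long r, unsigned long long k) {
    unsigned long long pixels = 0;
    for (unsigned long long x = 0; x < r; x++) {
        // Height of the circle at column x, rounded up to whole pixels.
        unsigned long long y = static_cast<unsigned long long>(
            std::ceil(std::sqrt(static_cast<long double>(r * r - x * x))));
        pixels += y;
        pixels %= k;
    }
    return pixels;
}

// Hypothetical helper: the OpenMP loop from the question, wrapped in a function.
// If the file is compiled without -fopenmp the pragma is ignored and the loop
// simply runs serially.
unsigned long long count_openmp(unsigned long long r, unsigned long long k) {
    unsigned long long pixels = 0;
#pragma omp parallel for reduction(+ : pixels)
    for (unsigned long long x = 0; x < r; x++) {
        unsigned long long y = static_cast<unsigned long long>(
            std::ceil(std::sqrt(static_cast<long double>(r * r - x * x))));
        pixels += y;
        pixels %= k;  // keeps each thread's partial sum below k
    }
    // The combined partial sums can be k or larger, so reduce once more.
    return pixels % k;
}
```

Calling both functions with the same `r` and `k` (for example from a small `main` built with `g++ -O2 -fopenmp`) should give identical values, which makes it easy to time each version separately.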