mmap of /dev/mem is very slow when using O_SYNC

Brief description of our project: we are using a Cyclone V. The FPGA writes data to DDR over the AXI bus, and our application needs to send that data out over Ethernet. We benchmarked the Ethernet throughput with iperf and it reaches about 700 Mbps, but when we test our application we only get about 400 Mbps.

To narrow this down, we wrote a simple server that does not use /dev/mem: we populate a file with random data using the dd command, and the application reads that file and sends it out. Its throughput is close to the iperf benchmark. We also found that if we remove O_SYNC when opening /dev/mem, the throughput comes close to the iperf figure. The problem is that without O_SYNC we intermittently get wrong data.

We allocate the contiguous memory using dma_alloc_coherent:

p_ximageConfig->fpgamem_virt = dma_alloc_coherent(NULL, Dma_Size, &(p_ximageConfig->fpgamem_phys), GFP_KERNEL);

and we pass the physical address to user space through an ioctl, where it is used as the mmap offset:

uint32 DMAPHYSADDR = getDmaPhysAddr();
pImagePool = ((volatile unsigned char*)mmap( 0,MAPPED_SIZE_BUFFER, PROT_READ|PROT_WRITE, MAP_SHARED, _fdFpga, DMAPHYSADDR));
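For clarity, the user-space side can be sketched roughly as below. This is a simplified illustration, not our exact code: the names `struct mapping`, `map_region` and `unmap_region` and the `use_sync` parameter are made up for this post, and the path is passed in so the same helper works on /dev/mem or on an ordinary file. The only difference between our fast and slow cases is whether O_SYNC is set in the open flags.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one mapping (names invented for this post). */
struct mapping {
    volatile unsigned char *base; /* start of the mapped window */
    size_t len;                   /* length passed to mmap */
    int fd;                       /* descriptor kept open while mapped */
};

/*
 * Open `path` (e.g. /dev/mem) and map `len` bytes starting at `phys`.
 * `use_sync` selects whether O_SYNC is added to the open flags.
 * Returns 0 on success, -1 on failure.
 */
int map_region(const char *path, off_t phys, size_t len, int use_sync,
               struct mapping *out)
{
    long page = sysconf(_SC_PAGESIZE);

    /* mmap requires the file offset to be a multiple of the page size. */
    if (page <= 0 || len == 0 || phys % page != 0)
        return -1;

    int fd = open(path, O_RDWR | (use_sync ? O_SYNC : 0));
    if (fd < 0)
        return -1;

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, phys);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    out->base = (volatile unsigned char *)p;
    out->len = len;
    out->fd = fd;
    return 0;
}

/* Undo map_region: unmap the window and close the descriptor. */
void unmap_region(struct mapping *m)
{
    munmap((void *)m->base, m->len);
    close(m->fd);
}
```

In our application this corresponds to calling `map_region("/dev/mem", DMAPHYSADDR, MAPPED_SIZE_BUFFER, 1, &m)` for the slow but correct case, and passing 0 for `use_sync` in the fast case that intermittently returns wrong data.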

We have tried the following methods:

  1. Writing our own mmap in our driver: we still get wrong data intermittently if we do not sync. The sync methods we tried are pgprot_noncached and pgprot_dmacoherent, but with them we only reach about 300 Mbps.

  2. Using dma_mmap_coherent: the throughput we get is about 500 Mbps.

Is there any method that would let us achieve throughput close to the iperf figure?