milosoftware
Scholar

dma_alloc_coherent versus streaming DMA, neither works satisfactorily

I'm using a Linux driver to transfer large blocks of data to/from logic at high speed. The self-written DMA controller sits in logic, can queue up several transfers, and operates fine.

 

The trouble I have with the driver is that for the DMA transfers, I want to use a zero-copy style interface, because the data usually consists of video frames, and moving them around in memory is prohibitively expensive.

 

I based my implementation on what IIO (industrial IO) does, and implemented IOCTL calls to the driver to allocate, free, enqueue and dequeue blocks that are owned by the driver. Each block can be mmapped into user space.
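To give an idea of the interface, the ioctl set boils down to something like this (the names here are illustrative, not the actual driver's):

/* Illustrative sketch of the block-based ioctl interface; all names are made up */
#include <linux/ioctl.h>
#include <linux/types.h>

struct dma_block_req {
        __u32 index;    /* which driver-owned block */
        __u32 size;     /* block size in bytes */
};

#define MYDRV_IOC_MAGIC         'd'
#define MYDRV_IOC_ALLOC_BLOCKS  _IOW(MYDRV_IOC_MAGIC, 0, struct dma_block_req)
#define MYDRV_IOC_FREE_BLOCKS   _IO(MYDRV_IOC_MAGIC, 1)
#define MYDRV_IOC_ENQUEUE       _IOW(MYDRV_IOC_MAGIC, 2, struct dma_block_req)
#define MYDRV_IOC_DEQUEUE       _IOR(MYDRV_IOC_MAGIC, 3, struct dma_block_req)

User space allocates a set of blocks, mmaps each one, and then cycles them through the enqueue/dequeue calls.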

 

Using dma_alloc_coherent to allocate the blocks, and then just using them without any extra measures, works just fine. The system can transfer data at 600MB/s between DDR and logic with very little CPU intervention. However, the memory returned by dma_mmap_coherent appears to be uncached, because accessing this area from userspace is horribly slow (e.g. reading 200MB byte-by-byte in a simple for loop takes 20 seconds, while it takes about a second in malloced memory).
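The coherent variant is essentially this (simplified sketch, error handling and block bookkeeping stripped; the struct and function names are mine, and file->private_data pointing at the block is an assumption):

/* Sketch of the coherent-buffer path */
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

struct my_block {
        struct device *dev;
        void          *virt;
        dma_addr_t     handle;
        size_t         size;
};

static int my_block_alloc(struct my_block *b, struct device *dev, size_t size)
{
        b->dev  = dev;
        b->size = size;
        /* On ARM this returns uncached (or write-combined) memory */
        b->virt = dma_alloc_coherent(dev, size, &b->handle, GFP_KERNEL);
        return b->virt ? 0 : -ENOMEM;
}

/* Assumes filp->private_data points at the block being mapped */
static int my_block_mmap(struct file *filp, struct vm_area_struct *vma)
{
        struct my_block *b = filp->private_data;

        /* Propagates the non-cacheable attributes into the user mapping */
        return dma_mmap_coherent(b->dev, vma, b->virt, b->handle, b->size);
}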

 

After reading some documentation, I decided that I should use a streaming DMA interface because my driver knows exactly when logic or CPU "owns" the data blocks. So instead of the "coherent" functions, I just kmalloc these buffers and then use dma_map_single_* to initialize them. Before and after DMA transfers, I call the appropriate dma_sync_single_for_* method.

By mapping the kmalloced memory into user space, I once again have speedy access to this memory, and caching is enabled. Extensive testing shows that data transfers also work correctly. However, the in-kernel performance is now completely crippled: the system spends so much time in the dma_sync_single_* calls that the CPU becomes the limiting factor. This limits the transfer speed to only about 180MB/s. This method is only about 20% less CPU intensive than simply copying the data from the DMA buffer into a user buffer with copy_to_user.
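For comparison, the streaming variant looks roughly like this (again a simplified sketch, no locking or error paths; names are mine):

/* Streaming path: kmalloc'ed (cached) buffers, mapped once, ownership
 * passed back and forth per transfer with dma_sync_single_*(). */
#include <linux/dma-mapping.h>
#include <linux/slab.h>

struct my_sblock {
        void                    *virt;
        dma_addr_t               handle;
        size_t                   size;
        enum dma_data_direction  dir;   /* DMA_TO_DEVICE or DMA_FROM_DEVICE */
};

static int my_sblock_init(struct device *dev, struct my_sblock *b,
                          size_t size, enum dma_data_direction dir)
{
        b->size = size;
        b->dir  = dir;
        b->virt = kmalloc(size, GFP_KERNEL);    /* cached, unlike dma_alloc_coherent */
        if (!b->virt)
                return -ENOMEM;

        b->handle = dma_map_single(dev, b->virt, size, dir);
        if (dma_mapping_error(dev, b->handle)) {
                kfree(b->virt);
                return -EIO;
        }
        return 0;
}

/* Hand ownership of the block to logic, before starting a transfer */
static void my_sblock_give_to_device(struct device *dev, struct my_sblock *b)
{
        dma_sync_single_for_device(dev, b->handle, b->size, b->dir);
}

/* Take ownership back for the CPU, after the transfer has completed */
static void my_sblock_give_to_cpu(struct device *dev, struct my_sblock *b)
{
        dma_sync_single_for_cpu(dev, b->handle, b->size, b->dir);
}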

 

linnj
Xilinx Employee

I can't say I have any brilliant insights, as it sounds like you're seeing reality, but thinking through it is good.

 

It sounds like you've done your homework. I've been saying that whether to cache or not depends on how many times you want to access the data and whether it is worth the cost of the cache operations. For big buffers those operations are expensive, and I don't believe this is anything specific to Zynq. Big buffers are also likely to evict other code/data from the L2 cache, which is why the ACP is not likely to help (but it is another option).

 

I'm eager to hear from others if there's something we're forgetting here.

 

Thanks

John

 

 

 

linnj
Xilinx Employee

One thing I did forget: if you are doing writes, making sure the memory is write-combined does help a lot for uncached buffers, but it doesn't help for reads in my experience.
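Roughly, that looks like this on the kernel side (just a sketch; the exact helper names depend on the kernel version, newer kernels use dma_alloc_attrs() with DMA_ATTR_WRITE_COMBINE instead):

/* Sketch: write-combined instead of plainly uncached coherent memory */
#include <linux/dma-mapping.h>
#include <linux/mm.h>

static void *alloc_wc_block(struct device *dev, size_t size, dma_addr_t *handle)
{
        return dma_alloc_writecombine(dev, size, handle, GFP_KERNEL);
}

static int mmap_wc_block(struct device *dev, struct vm_area_struct *vma,
                         void *virt, dma_addr_t handle, size_t size)
{
        /* User mapping gets write-combined attributes as well */
        return dma_mmap_writecombine(dev, vma, virt, handle, size);
}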
milosoftware
Scholar

I've been digging through the kernel DMA code for the past few hours, and snippets like this one suggest that write combining is indeed attempted:

 

/* Picks write-combined or "dmacoherent" (uncached) page protection
 * for the mapping, depending on the DMA attributes */
static inline pgprot_t __get_dma_pgprot(struct dma_attrs *attrs, pgprot_t prot)
{
        prot = dma_get_attr(DMA_ATTR_WRITE_COMBINE, attrs) ?
                            pgprot_writecombine(prot) :
                            pgprot_dmacoherent(prot);
        return prot;
}

 

That also seems to match my observations: writing to the memory works fine and I cannot notice any speed difference, but reading it (in any way other than sequentially with big accesses, like memcpy does) is horribly slow.

 

Judging from the CPU usage and behaviour, it is almost as if the DMA buffer data is actually being copied to some other location and back when calling the dma_sync... methods.

 

I'm probably the first to try to really get the promised 600MB/s transfer speeds; otherwise people would have already reported this...

linnj
Xilinx Employee

I've sometimes seen bounce buffers in the kernel, and in the DMA area specifically. I don't think I ever saw them used in our case, but maybe I missed something and there's a kernel configuration option that causes it?
milosoftware
Scholar

Since my DMA controller is pretty smart, I also experimented with transfers directly from user memory. This boiled down to calling "get_user_pages", constructing a scatter-gather list with sg_init_table, adding those user pages, and then calling "dma_map_sg" to translate and coalesce the pages into DMA requests. Just this page-table housekeeping took about the same amount of processing time as the copy_from_user call, which made me abandon that code before even getting to the point of actually transferring any data.
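For reference, that experiment looked roughly like this (simplified sketch: assumes a page-aligned user buffer of whole pages, and omits unpinning and error cleanup; note also that the get_user_pages* signatures differ between kernel versions):

/* Pin user pages and build a scatter-gather list for the DMA controller */
#include <linux/dma-mapping.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>

static int map_user_buffer(struct device *dev, unsigned long uaddr,
                           size_t size, struct scatterlist *sgl)
{
        int nr_pages = size >> PAGE_SHIFT;
        struct page **pages;
        int i, got, nents;

        pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return -ENOMEM;

        /* Pin the user pages so they cannot be swapped out or moved */
        got = get_user_pages_fast(uaddr, nr_pages, 1 /* write */, pages);
        if (got != nr_pages)
                return -EFAULT;

        /* One scatterlist entry per pinned page */
        sg_init_table(sgl, nr_pages);
        for (i = 0; i < nr_pages; i++)
                sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);

        /* Translate/coalesce into bus addresses for the DMA controller;
         * this walk is where a surprising amount of CPU time went */
        nents = dma_map_sg(dev, sgl, nr_pages, DMA_FROM_DEVICE);
        return nents ? nents : -EIO;
}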

 

Based on that experience, I'd think the dma_sync calls do similar things (walking page tables and changing some attributes) and that is where they spend so much time.

 

I also tried cheating by not calling the dma_sync methods at all, but this (surprisingly) led to hangups. I'm still investigating that route.

 

I also tried replacing the "dma_mmap_coherent" call with a simple "remap_pfn_range", so as to avoid setting the cache attributes on that region, but that didn't have any effect at all; it appears the non-cacheable property was already applied in dma_alloc_coherent itself.
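That experiment was essentially this (sketch only, and it assumes the DMA address equals the physical address on this platform):

#include <linux/mm.h>

static int my_block_mmap_pfn(struct vm_area_struct *vma, dma_addr_t handle)
{
        size_t size = vma->vm_end - vma->vm_start;

        /* Deliberately NOT applying pgprot_noncached()/pgprot_dmacoherent();
         * makes no difference, since dma_alloc_coherent() already remapped
         * the pages non-cacheable on the kernel side */
        return remap_pfn_range(vma, vma->vm_start, handle >> PAGE_SHIFT,
                               size, vma->vm_page_prot);
}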

milosoftware
Scholar

I added some timing code to the "sync" calls; this is what I get (numbers in microseconds) when using 1MB blocks with streaming DMA transfers:

 

dma_sync_single_for_device(TO_DEVICE): 3336
dma_sync_single_for_device(FROM_DEVICE): 1991
dma_sync_single_for_cpu(FROM_DEVICE): 2175
dma_sync_single_for_cpu(TO_DEVICE): 0
dma_sync_single_for_device(TO_DEVICE): 3152
dma_sync_single_for_device(FROM_DEVICE): 1990
dma_sync_single_for_cpu(FROM_DEVICE): 2193
dma_sync_single_for_cpu(TO_DEVICE): 0

 

As you can see, the system spends 2 or 3 ms on "housekeeping" for each transition, except for the cpu(TO_DEVICE) one, which appears to be free. That is perfectly logical, because returning an outgoing buffer to the CPU should not need any special cache handling. I would have expected for_device(FROM_DEVICE) to be free as well (intuitively there is nothing to flush before the device overwrites the buffer), but surprisingly this one takes about 2 ms too.

 

Adding up the numbers, there is over 7 ms of overhead to transfer 1MB of data, hence 1MB / 0.007s, or about 150MB/s, would be the maximum possible transfer rate.
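For reference, the numbers above were taken with instrumentation along these lines (a sketch, not the literal code in my driver):

#include <linux/dma-mapping.h>
#include <linux/ktime.h>

/* Wraps one sync call and returns its duration in microseconds */
static s64 timed_sync_for_device(struct device *dev, dma_addr_t handle,
                                 size_t size, enum dma_data_direction dir)
{
        ktime_t start = ktime_get();

        dma_sync_single_for_device(dev, handle, size, dir);

        return ktime_us_delta(ktime_get(), start);
}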
