04-21-2015 02:55 AM
I'm using a Linux driver to transfer large blocks of data to/from logic at high speed. The self-written DMA controller is in logic, can queue up several transfers and operates fine.
The trouble I have with the driver is that for the DMA transfers I want a zero-copy style interface, because the data usually consists of video frames, and moving them around in memory is prohibitively expensive.
I based my implementation on what IIO (industrial IO) does, and implemented IOCTL calls to the driver to allocate, free, enqueue and dequeue blocks that are owned by the driver. Each block can be mmapped into user space.
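For context, the ioctl surface looks roughly like this (a simplified sketch; the command names and the block descriptor structs here are illustrative, not the exact ones from my driver):

/* Hypothetical ioctl definitions for the block-based interface,
 * shared between the driver and user space (names are illustrative). */
#include <linux/ioctl.h>
#include <linux/types.h>

struct mydrv_block_req {
    __u32 count;   /* number of blocks to allocate */
    __u32 size;    /* size of each block in bytes */
};

struct mydrv_block {
    __u32 index;   /* which block; also used as the mmap offset in pages */
    __u32 bytes;   /* number of valid bytes in a dequeued block */
};

#define MYDRV_IOC_MAGIC    'x'
#define MYDRV_ALLOC_BLOCKS _IOW(MYDRV_IOC_MAGIC, 0, struct mydrv_block_req)
#define MYDRV_FREE_BLOCKS  _IO(MYDRV_IOC_MAGIC, 1)
#define MYDRV_ENQUEUE      _IOW(MYDRV_IOC_MAGIC, 2, struct mydrv_block)
#define MYDRV_DEQUEUE      _IOR(MYDRV_IOC_MAGIC, 3, struct mydrv_block)

The point of this interface is that after the initial mmap only small block handles cross the user/kernel boundary, never the frame data itself.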
Using dma_alloc_coherent to allocate the blocks, and then just using them without any extra measures, works just fine. The system can transfer data at 600MB/s between DDR and logic with very little CPU intervention. However, the memory returned by dma_mmap_coherent appears to be uncached, because accessing this area from user space is horribly slow (e.g. reading 200MB byte-by-byte in a simple for loop takes 20 seconds, while it takes about a second in malloced memory).
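For reference, the coherent-buffer path was essentially the following (a minimal sketch with a made-up "mydrv" structure and only a single block; the real driver manages a list of blocks and has proper error handling):

/* Sketch only: "mydrv" and its fields are illustrative. */
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>

struct mydrv {
    struct device *dev;
    void *vaddr;            /* kernel virtual address of the block */
    dma_addr_t dma_handle;  /* bus address handed to the logic */
    size_t size;
};

static int mydrv_alloc_block(struct mydrv *drv, size_t size)
{
    drv->vaddr = dma_alloc_coherent(drv->dev, size, &drv->dma_handle,
                                    GFP_KERNEL);
    if (!drv->vaddr)
        return -ENOMEM;
    drv->size = size;
    return 0;
}

/* mmap handler: hand the block to user space. The mapping comes back
 * uncached/write-combined, which is exactly the slowness described above. */
static int mydrv_mmap(struct file *filp, struct vm_area_struct *vma)
{
    struct mydrv *drv = filp->private_data;

    return dma_mmap_coherent(drv->dev, vma, drv->vaddr,
                             drv->dma_handle, drv->size);
}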
After reading some documentation, I decided that I should use the streaming DMA interface, because my driver knows exactly when logic or the CPU "owns" the data blocks. So instead of the "coherent" functions, I just kmalloc these buffers and then use dma_map_single to map them. Before and after DMA transfers, I call the appropriate dma_sync_single_for_* method.
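Stripped down, the streaming variant looks like this (again a sketch using the same hypothetical "mydrv" structure; the real code queues several blocks and handles errors and teardown):

/* Streaming-DMA variant (sketch). */
#include <linux/slab.h>
#include <linux/dma-mapping.h>

static int mydrv_setup_streaming(struct mydrv *drv, size_t size)
{
    drv->vaddr = kmalloc(size, GFP_KERNEL);
    if (!drv->vaddr)
        return -ENOMEM;

    drv->dma_handle = dma_map_single(drv->dev, drv->vaddr, size,
                                     DMA_BIDIRECTIONAL);
    if (dma_mapping_error(drv->dev, drv->dma_handle)) {
        kfree(drv->vaddr);
        return -EIO;
    }
    drv->size = size;
    return 0;
}

/* Around each transfer: hand the buffer to the device, run the DMA,
 * then give the buffer back to the CPU when the transfer completes. */
static void mydrv_do_transfer(struct mydrv *drv, enum dma_data_direction dir)
{
    dma_sync_single_for_device(drv->dev, drv->dma_handle, drv->size, dir);
    /* ... kick off the transfer in logic and wait for completion ... */
    dma_sync_single_for_cpu(drv->dev, drv->dma_handle, drv->size, dir);
}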
By mapping the kmalloced memory into user space, I once again have speedy access to this memory, and caching is enabled. Extensive testing shows that data transfers also work correctly. However, the in-kernel performance is now completely crippled: the system spends so much time in the dma_sync_single_* calls that the CPU becomes the limiting factor, which caps the transfer speed at only about 180MB/s. This method is only about 20% less CPU intensive than simply copying the data from the DMA buffer into a user buffer with copy_to_user.
04-21-2015 05:55 AM
I can't say I have any brilliant insights, as it sounds like you're seeing reality, but thinking through it is good.
It sounds like you've done your homework. I've been saying that whether to cache or not depends on how many times you want to access the data and whether that is worth the cost of the cache maintenance operations. For big buffers those operations are expensive, and I don't believe this is anything specific to Zynq. Big buffers are also likely to evict other code/data from the L2 cache, which is why the ACP is not likely to help (but it is another option).
I'm eager to hear from others if there's something we're forgetting here.
Thanks
John
04-21-2015 06:04 AM
I've been digging through the kernel DMA code for the past few hours, and snippets like this suggest that write combining is indeed attempted:
static inline pgprot_t __get_dma_pgprot(struct dma_attrs *attrs, pgprot_t prot)
{
    prot = dma_get_attr(DMA_ATTR_WRITE_COMBINE, attrs) ?
           pgprot_writecombine(prot) :
           pgprot_dmacoherent(prot);
    return prot;
}
That also seems to match my observations: writing to the memory works fine and I cannot notice any speed difference, but reading it (in any way other than sequentially, with big accesses like memcpy uses) is horribly slow.
Judging from the CPU usage and behaviour, it is almost as if the DMA buffer data is actually being copied to some other location and back when calling the dma_sync... methods.
I'm probably the first to try to really get the promised 600MB/s transfer speeds; otherwise people would have already reported this...
04-21-2015 10:47 PM
Since my DMA controller is pretty smart, I also experimented with transfers directly from user memory. This boiled down to calling get_user_pages, constructing a scatter-gather list with sg_init_table, adding those user pages to it, and then calling dma_map_sg to translate and coalesce the pages into DMA requests. Just this page-table housekeeping took about the same amount of processing time as the copy_from_user call, which made me abandon that approach before even getting to the point of actually transferring data.
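In outline, that experiment looked like the following (a sketch only, using the same hypothetical "mydrv" structure; get_user_pages_fast and friends have changed signature between kernel versions, partial first/last pages are glossed over, and unpinning/cleanup is omitted):

/* Zero-copy from user memory (sketch, roughly 3.x-era API). */
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/dma-mapping.h>

static int mydrv_map_user_buf(struct mydrv *drv, unsigned long uaddr,
                              size_t len, enum dma_data_direction dir)
{
    int nr_pages = DIV_ROUND_UP(len + (uaddr & ~PAGE_MASK), PAGE_SIZE);
    struct page **pages;
    struct scatterlist *sgl;
    int i, pinned, nents;

    pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
    sgl = kcalloc(nr_pages, sizeof(*sgl), GFP_KERNEL);
    if (!pages || !sgl)
        return -ENOMEM;

    /* Pin the user pages... */
    pinned = get_user_pages_fast(uaddr, nr_pages,
                                 dir == DMA_FROM_DEVICE, pages);
    if (pinned < nr_pages)
        return -EFAULT;

    /* ...wrap them in a scatter-gather list... */
    sg_init_table(sgl, nr_pages);
    for (i = 0; i < nr_pages; i++)
        sg_set_page(&sgl[i], pages[i], PAGE_SIZE, 0);

    /* ...and let the DMA API translate and coalesce the entries.
     * This housekeeping alone cost about as much CPU time as
     * copy_from_user did. */
    nents = dma_map_sg(drv->dev, sgl, nr_pages, dir);
    if (!nents)
        return -EIO;

    /* sgl and nents would then be handed to the DMA controller */
    return nents;
}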
Based on that experience, I'd think the dma_sync calls do similar things (walking page tables and changing some attributes) and that is where they spend so much time.
I also tried cheating by not calling the dma_sync methods at all, but this (surprisingly) led to hangups. I'm still investigating that route.
I also tried replacing the dma_mmap_coherent call with a simple remap_pfn_range, so as to avoid setting the cache attributes on that region, but that didn't have any effect at all; it appears the non-cacheable property was already applied in dma_alloc_coherent.
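The remap_pfn_range experiment was essentially this (sketch; it assumes the DMA address equals the physical address, which is the case on Zynq without an IOMMU):

/* mmap via remap_pfn_range instead of dma_mmap_coherent (sketch).
 * This made no difference here, presumably because the underlying
 * pages were already set up as non-cacheable by dma_alloc_coherent. */
#include <linux/mm.h>

static int mydrv_mmap_pfn(struct file *filp, struct vm_area_struct *vma)
{
    struct mydrv *drv = filp->private_data;
    unsigned long pfn = drv->dma_handle >> PAGE_SHIFT;

    return remap_pfn_range(vma, vma->vm_start, pfn,
                           vma->vm_end - vma->vm_start,
                           vma->vm_page_prot);
}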
04-22-2015 11:33 PM - edited 04-22-2015 11:36 PM
I added some timing code to the "sync" calls; this is what I get (numbers in microseconds) when using 1MB blocks for streaming DMA transfers:
dma_sync_single_for_device(TO_DEVICE): 3336
dma_sync_single_for_device(FROM_DEVICE): 1991
dma_sync_single_for_cpu(FROM_DEVICE): 2175
dma_sync_single_for_cpu(TO_DEVICE): 0
dma_sync_single_for_device(TO_DEVICE): 3152
dma_sync_single_for_device(FROM_DEVICE): 1990
dma_sync_single_for_cpu(FROM_DEVICE): 2193
dma_sync_single_for_cpu(TO_DEVICE): 0
As you can see, the system spends 2 or 3 ms on "housekeeping" for each transition, except for_cpu(TO_DEVICE), which appears to be free. That is perfectly logical, because returning an outgoing buffer to the CPU should not need any special cache handling. I would have expected for_device(FROM_DEVICE) to be free as well, but surprisingly it also takes about 2 ms.
Adding up the numbers, there is over 7 ms of overhead to transfer 1MB of data; hence 1MB/0.007s, or about 150MB/s, would be the maximum possible data transfer rate.
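For reference, the numbers above were taken by wrapping each sync call with something like this (sketch; an analogous wrapper was used for dma_sync_single_for_cpu):

/* Timed wrapper around the sync call (sketch). */
#include <linux/ktime.h>

static void timed_sync_for_device(struct mydrv *drv, size_t len,
                                  enum dma_data_direction dir)
{
    ktime_t t0 = ktime_get();

    dma_sync_single_for_device(drv->dev, drv->dma_handle, len, dir);

    pr_info("dma_sync_single_for_device(%s): %lld us\n",
            dir == DMA_TO_DEVICE ? "TO_DEVICE" : "FROM_DEVICE",
            (long long)ktime_to_us(ktime_sub(ktime_get(), t0)));
}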