wzab
Mentor
858 Views
Registered: 08-24-2011

How to pass efficiently the hugepages-backed buffer to the BM DMA device in Linux?

I need to provide a huge circular buffer (a few GB) for a bus-mastering DMA PCIe device implemented in an FPGA.

The buffer should not be reserved at boot time. Therefore, it may not be contiguous.

The device supports scatter-gather (SG) operation, but for performance reasons the addresses and lengths of the consecutive contiguous segments of the buffer are stored inside the FPGA. Therefore, using standard 4KB pages is not acceptable (there would be up to 262144 segments for each 1GB of the buffer).

The right solution should allocate a buffer consisting of 2MB hugepages in user space (reducing the maximum number of segments by a factor of 512). The virtual address of the buffer should be passed to the kernel driver via ioctl. Then the addresses and lengths of the segments should be calculated and written to the FPGA.
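
For illustration, a minimal user-space sketch of that allocation path could look as follows (the /dev/fpga_dma device node, the DMABUF_IOC_SET_BUFFER ioctl and its argument structure are hypothetical placeholders for whatever the driver would define, not an existing API):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/ioctl.h>

    /* Hypothetical ioctl and argument structure - the real driver would define these. */
    struct dmabuf_desc {
        uint64_t vaddr;   /* user virtual address of the hugepage-backed buffer */
        uint64_t length;  /* buffer length in bytes */
    };
    #define DMABUF_IOC_SET_BUFFER _IOW('D', 1, struct dmabuf_desc)

    int main(void)
    {
        size_t size = 1UL << 30;  /* 1 GB - a multiple of the 2MB hugepage size */

        /* Allocate the buffer from the 2MB hugepage pool (hugepages must be
         * reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages). */
        void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (buf == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");
            return 1;
        }
        memset(buf, 0, size);  /* touch the pages so they are really populated */

        int fd = open("/dev/fpga_dma", O_RDWR);  /* hypothetical device node */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Hand the virtual address to the driver, which pins the hugepages,
         * builds the segment list and writes it to the FPGA. */
        struct dmabuf_desc desc = {
            .vaddr = (uint64_t)(uintptr_t)buf,
            .length = size,
        };
        if (ioctl(fd, DMABUF_IOC_SET_BUFFER, &desc) < 0)
            perror("ioctl");

        /* ... consume the data arriving in buf ... */

        close(fd);
        munmap(buf, size);
        return 0;
    }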

In theory, I could use get_user_pages to create the list of pages, and then call sg_alloc_table_from_pages to obtain an SG list suitable for programming the DMA engine in the FPGA. Unfortunately, in this approach I must prepare an intermediate list of page structures, 262144 entries per 1GB of the buffer. This list is stored in RAM, not in the FPGA, so it is less problematic, but it would still be good to avoid it.

In fact, I don't need to keep the pages mapped for the kernel, as the hugepages are protected against being swapped out, and they are mapped for the user-space application that will process the received data.

So what I'm looking for is a function, say sg_alloc_table_from_user_hugepages, that could take such a user-space address of a hugepages-backed memory buffer and convert it directly into the right scatterlist, without performing an unnecessary and memory-consuming mapping for the kernel. Of course, such a function should verify that the buffer indeed consists of hugepages.

I have found and read these posts: (A), (B), but couldn't find a good answer. Is there any official method to do it in the current Linux kernel?

Thank you in advance,
with best regards,
Wojtek

PS. I have also asked this question on Stack Overflow. However, maybe here there is a better chance of finding somebody who has faced and successfully solved such a problem.

15 Replies
wzab
Mentor
734 Views
Registered: 08-24-2011

At the moment I have a very inefficient solution based on get_user_pages_fast:

 

    int sgt_prepare(const char __user *buf, size_t count, 
                struct sg_table * sgt, struct page *** a_pages,
                int * a_n_pages)
    {
        int res = 0;
        int n_pages;
        struct page ** pages = NULL;
        const unsigned long offset = ((unsigned long)buf) & (PAGE_SIZE-1);
        //Calculate number of pages
        n_pages = (offset + count + PAGE_SIZE - 1) >> PAGE_SHIFT;
        printk(KERN_ALERT "n_pages: %d",n_pages);
        //Allocate the table for pages
        pages = vzalloc(sizeof(* pages) * n_pages);
        printk(KERN_ALERT "pages: %p",pages);
        if(pages == NULL) {
            res = -ENOMEM;
            goto sglm_err1;
        }
        //Now pin the pages
        //res = get_user_pages_fast(buf, n_pages, rw == READ ? FOLL_WRITE : 0, pages);
        res = get_user_pages_fast(((unsigned long)buf & PAGE_MASK), n_pages, 0, pages);
        printk(KERN_ALERT "gupf: %d\n", res);
        if(res < n_pages) {
            int i;
            for(i=0; i<res; i++)
                put_page(pages[i]);
            res = -ENOMEM;
            goto sglm_err1;
        }
        //Now create the sg-list
        res = sg_alloc_table_from_pages(sgt, pages, n_pages, offset, count, GFP_KERNEL);
        printk(KERN_ALERT "satf: %d",res);   
        if(res < 0)
            goto sglm_err2;
        *a_pages = pages;
        *a_n_pages = n_pages;
        return res;
    sglm_err2:
        //Here we jump if we know that the pages are pinned
        {
            int i;
            for(i=0; i<n_pages; i++)
                put_page(pages[i]);
        }
    sglm_err1:
        if(sgt) sg_free_table(sgt);
        if(pages) vfree(pages); //allocated with vzalloc, so vfree (not kfree)
        * a_pages = NULL;
        * a_n_pages = 0;
        return res;
    }
    
    void sgt_destroy(struct sg_table * sgt, struct page ** pages, int n_pages)
    {
        int i;
        //Free the sg list
        if(sgt->sgl)
            sg_free_table(sgt);
        //Unpin pages
        for(i=0; i < n_pages; i++) {
            set_page_dirty(pages[i]);
            put_page(pages[i]);
        }
    }

 

The sgt_prepare function builds the sg_table structure sgt that I can use to create the DMA mapping. I have verified that it contains a number of entries equal to the number of hugepages used.
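
For completeness, a sketch (not a drop-in part of my driver) of how such a table would typically be DMA-mapped and its segments pushed to the FPGA - write_fpga_segment() is a hypothetical helper that writes one (bus address, length) pair into the BRAM table via the BAR registers:

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>

    /* Hypothetical helper: writes one (bus address, length) pair into the
     * segment table kept in the FPGA BRAM (e.g. via BAR0 registers). */
    void write_fpga_segment(struct pci_dev *pdev, int idx,
                            dma_addr_t addr, unsigned int len);

    /* Sketch: map the pinned buffer for streaming DMA and program the
     * resulting coalesced segments into the FPGA. */
    int program_fpga_segments(struct pci_dev *pdev, struct sg_table *sgt)
    {
        struct scatterlist *sg;
        int i, nents;

        nents = dma_map_sg(&pdev->dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
        if (nents == 0)
            return -EIO;

        for_each_sg(sgt->sgl, sg, nents, i)
            write_fpga_segment(pdev, i, sg_dma_address(sg), sg_dma_len(sg));

        return nents;
    }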

Unfortunately, it requires that the list of pages be created (allocated and returned via the a_pages pointer argument) and kept as long as the buffer is in use.

Therefore, I really dislike that solution. At the moment I use 256 2MB hugepages as a DMA buffer. It means that I have to create and keep 128*1024 unnecessary page structures. I also waste 512 MB of kernel address space on an unnecessary kernel mapping.

The interesting question is whether a_pages may be kept only temporarily (until the sg-list is created). In theory it should be possible, as the pages are still pinned...
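
If that is the case (the elevated page reference counts are the only thing that must outlive the array), a possible sketch of a teardown that walks the sg_table itself instead of a_pages would be the following, so the pointer array could be vfree'd right after sg_alloc_table_from_pages returns:

    #include <linux/mm.h>
    #include <linux/scatterlist.h>

    /* Sketch: release the pinned pages by iterating the sg_table page by page,
     * so the separate a_pages array does not have to be kept around. */
    void sgt_destroy_from_table(struct sg_table *sgt)
    {
        struct sg_page_iter piter;

        for_each_sg_page(sgt->sgl, &piter, sgt->orig_nents, 0) {
            struct page *page = sg_page_iter_page(&piter);

            set_page_dirty(page);  /* the device has written into the buffer */
            put_page(page);        /* drop the reference taken by get_user_pages_fast */
        }
        sg_free_table(sgt);
    }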

wzab
Mentor
687 Views
Registered: 08-24-2011

According to the kernel sources (https://elixir.bootlin.com/linux/v5.11.13/source/include/linux/mm.h#L152), struct page is between 56 and 80 bytes long. I need one such structure for each page in the a_pages array. It means an overhead of up to 2%. Not very bad, but I'd like to avoid it.

hokim
Scholar
677 Views
Registered: 10-21-2015

Hi 

Use dma_alloc_coherent for CMA (contiguous memory allocation).

https://www.kernel.org/doc/Documentation/DMA-API.txt
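
For reference, a minimal sketch of such an allocation (pdev and the size are placeholders; with CONFIG_DMA_CMA enabled, large allocations are served from the CMA region):

    #include <linux/pci.h>
    #include <linux/dma-mapping.h>

    /* Sketch: allocate a physically contiguous, coherent DMA buffer.
     * cpu_addr is the kernel virtual address (it can be mmap'ed to user space),
     * dma_handle receives the bus address to program into the DMA engine. */
    static int example_alloc(struct pci_dev *pdev, size_t size)
    {
        dma_addr_t dma_handle;
        void *cpu_addr;

        cpu_addr = dma_alloc_coherent(&pdev->dev, size, &dma_handle, GFP_KERNEL);
        if (!cpu_addr)
            return -ENOMEM;

        /* ... program dma_handle into the device, export cpu_addr via mmap ... */

        dma_free_coherent(&pdev->dev, size, cpu_addr, dma_handle);
        return 0;
    }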

There is a nice example driver under project-spec/meta-user/recipes-modules/dpcma at

https://github.com/Xilinx/Vitis-AI/blob/master/dsa/DPU-TRD/prj/Vivado/dpu_petalinux_bsp/download_bsp.sh

wzab
Mentor
641 Views
Registered: 08-24-2011

Hi Hokim,

Yes, I know about dma_alloc_coherent and I have used it in a few applications (e.g., https://gitlab.com/WZab/versatile-dma1/-/blob/87af337ec264488f6e845654b1d16b50f43b95da/dma_mover1/software/driver/ax_dma1.c#L534 ).
Here, however, I really need to work with userspace-allocated hugepages-based buffers.

With best regards,
Wojtek

 

hokim
Scholar
594 Views
Registered: 10-21-2015

Hi

I don't understand why you want to use a non-contiguous (segmented) buffer on purpose.

In principle, access to a segmented buffer is slower than access to a contiguous buffer.

You can allocate the CMA memory space you want and mmap it into user space using the driver.

wzab
Mentor
560 Views
Registered: 08-24-2011

When the segments are of reasonable size (and that's why we want to use hugepages), we can keep the addresses of the segments in internal BRAM.
Then the access to such a segmented buffer is not slower than the access to a contiguous buffer.

What is important, in that configuration the memory may be allocated after the system has started and does not have to be reserved at boot time.
In particular, when the device is not in use, the memory is free for other purposes.

hokim
Scholar
551 Views
Registered: 10-21-2015

Hi

 

What is important, in that configuration the memory may be allocated after the system has started and does not have to be reserved at boot time.
In particular, when the device is not in use, the memory is free for other purposes.

 

When you don't allocate memory for DMA using the driver, the CMA reserved region can be used by other general programs.

If the remaining CMA memory is smaller than your request for DMA, CMA migrates the pages occupied by other programs out of the CMA region and allows the DMA allocation to use that region.

Refer to https://www.slideshare.net/PankajSuryawanshi3/linux-memory-management-with-cma, page 29.

wzab
Mentor
518 Views
Registered: 08-24-2011

OK, I can try this approach as well. However, it requires me to use a non-standard kernel.

The machine hosting the PCIe cards will be a x86_64 server running Debian Linux.

The end user prefers to use the standard distribution kernel, not the customized one.

The standard Debian kernels have CMA switched off ( # CONFIG_CMA is not set ).

 

hokim
Scholar
482 Views
Registered: 10-21-2015

By the way, why don't you use the Xilinx PCIe driver?

You can find the answer in the following links:

https://github.com/Xilinx/dma_ip_drivers/blob/master/XDMA/linux-kernel/xdma/cdev_sgdma.c#L247-L414
https://github.com/Xilinx/dma_ip_drivers/blob/master/XDMA/linux-kernel/tools/dma_utils.c#L43-L146
https://github.com/Xilinx/dma_ip_drivers/blob/master/XDMA/linux-kernel/tools/dma_from_device.c#L231-L279

I don't think you need to keep the pages.

get_user_pages_fast before the DMA transfer and put_page after the DMA transfer are used in the read/write functions of the first link above.

We don't need to worry about releasing the pages, because get_user_pages takes reference counts on them:

https://github.com/Xilinx/linux-xlnx/blob/master/mm/gup.c#L745-L748

 

wzab
Mentor
460 Views
Registered: 08-24-2011

In my application the DMA transfer will be continuous and may keep running even for a week or so. I need to run the DMA in a "cyclic mode". It is just another system like those described in http://koral.ise.pw.edu.pl/~wzab/artykuly/DMA_architectures_for_FPGA.pdf , where I had to solve the problems mentioned in https://forums.xilinx.com/t5/PCIe-and-CPM/DMA-Bridge-Subsystem-for-PCI-Express-v3-0-usage-in-cyclic-mode/m-p/751088#M8456 .

I need to service a few (4, maybe 8) boards, where I need up to 4GB of cyclic buffer for each (up to 32 GB of DMA buffers in total). At the same time, those buffers should allow in-place data preprocessing. Therefore, they should not be mapped as coherent: the cache should be on, and the part filled with data should be synchronized for the CPU before processing. With a coherent, cache-free mapping the processing performance would drop dramatically.
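
For illustration, a sketch of that streaming-mapping pattern (assuming the buffer has been mapped with dma_map_sg in the DMA_FROM_DEVICE direction; a real driver would sync only the scatterlist entries covering the newly filled part, not the whole table):

    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>

    /* Sketch: hand ownership of the freshly filled buffer back to the CPU
     * before it is processed, then return it to the device. */
    void claim_data_for_cpu(struct device *dev, struct sg_table *sgt)
    {
        dma_sync_sg_for_cpu(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);

        /* ... the application processes the data in place ... */

        dma_sync_sg_for_device(dev, sgt->sgl, sgt->orig_nents, DMA_FROM_DEVICE);
    }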

I hope it also explains why I want to use hugepages, not normal pages. I need to keep the bus address of each segment in BRAM in the FPGA (so that the FPGA does not need to access host memory to learn where to put the next chunk of data). With a 4GB buffer and 4KB pages, I would need to store about a million addresses! With 2MB hugepages, I need to store only about two thousand, which is affordable.

In the links you have sent, the char_sgdma_map_user_buf_to_sgl function ( https://github.com/Xilinx/dma_ip_drivers/blob/2c01de2211bcf452e90cb65b2ee49fcaafda375c/XDMA/linux-kernel/xdma/cdev_sgdma.c#L272 ) looks really interesting. If I could modify it to work with hugepages, without the intermediate use of normal pages, it could be perfect. (However, it is similar to the solution that I posted in my second message...)

hokim
Scholar
437 Views
Registered: 10-21-2015

I don't understand why you worry about pages, which is just an array of pointers to pages.

For a 512MB DMA buffer, the size of pages is 128 * 1024 * 8 bytes = 1MB.

For pages, use kmalloc instead of vmalloc, because vmalloc has the overhead of remapping.

wzab
Mentor
416 Views
Registered: 08-24-2011

For a 4GB buffer it is 8MB anyway. I'm not sure it can be kmalloc'ed.

hokim
Scholar
367 Views
Registered: 10-21-2015

https://github.com/Xilinx/dma_ip_drivers/blob/master/XDMA/linux-kernel/xdma/cdev_sgdma.c#L247-L414

In the above, pages can be used not for the total buffer but for one element (segment) of the buffer in your case.

You seem to be using the Xilinx PCIe DMA IP (XDMA, DMA/Bridge Subsystem for PCI Express), right?

If so, why don't you use or modify the XDMA driver (my link points to the driver repository)?

Why don't you implement cyclic-mode DAQ in user space?

I think it can be implemented with a thread-safe queue which is accessed by the producer and consumer threads.

The queue stores a buffer index which is used to calculate the real memory address.

The producer checks whether the queue is full and reads data from the C2H device if the queue is not full.

After that, the producer enqueues the read buffer index into the queue and cyclically increments the buffer index for the next read.

The consumer processes the buffer data using the dequeued buffer index.
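
For example, a minimal user-space sketch of this scheme with pthreads (the queue capacity, the number of buffer slots and the read_from_c2h/process_buffer helpers mentioned in the comments are illustrative placeholders; the capacity is kept one below the number of slots so the producer never refills the slot the consumer is still processing):

    #include <pthread.h>

    #define NUM_BUFFERS 16                 /* placeholder: slots in the DMA buffer ring */
    #define QUEUE_DEPTH (NUM_BUFFERS - 1)  /* one below NUM_BUFFERS, see note above */

    /* A minimal thread-safe bounded queue of buffer indices. */
    struct index_queue {
        int slots[QUEUE_DEPTH];
        int head, tail, count;
        pthread_mutex_t lock;
        pthread_cond_t not_full, not_empty;
    };

    static struct index_queue q = {
        .lock = PTHREAD_MUTEX_INITIALIZER,
        .not_full = PTHREAD_COND_INITIALIZER,
        .not_empty = PTHREAD_COND_INITIALIZER,
    };

    static void queue_wait_not_full(struct index_queue *iq)
    {
        pthread_mutex_lock(&iq->lock);
        while (iq->count == QUEUE_DEPTH)
            pthread_cond_wait(&iq->not_full, &iq->lock);
        pthread_mutex_unlock(&iq->lock);
    }

    /* Only the single producer pushes, and it has already waited for space. */
    static void queue_push(struct index_queue *iq, int idx)
    {
        pthread_mutex_lock(&iq->lock);
        iq->slots[iq->tail] = idx;
        iq->tail = (iq->tail + 1) % QUEUE_DEPTH;
        iq->count++;
        pthread_cond_signal(&iq->not_empty);
        pthread_mutex_unlock(&iq->lock);
    }

    static int queue_pop(struct index_queue *iq)
    {
        int idx;

        pthread_mutex_lock(&iq->lock);
        while (iq->count == 0)
            pthread_cond_wait(&iq->not_empty, &iq->lock);
        idx = iq->slots[iq->head];
        iq->head = (iq->head + 1) % QUEUE_DEPTH;
        iq->count--;
        pthread_cond_signal(&iq->not_full);
        pthread_mutex_unlock(&iq->lock);
        return idx;
    }

    static void *producer(void *arg)
    {
        int idx = 0;

        (void)arg;
        for (;;) {
            queue_wait_not_full(&q);       /* read only if the queue is not full */
            /* read_from_c2h(idx): hypothetical helper reading one chunk from the
             * C2H device into buffer slot idx */
            queue_push(&q, idx);           /* hand the filled slot to the consumer */
            idx = (idx + 1) % NUM_BUFFERS; /* cyclic buffer index for the next read */
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        for (;;) {
            int idx = queue_pop(&q);       /* blocks while the queue is empty */
            /* process_buffer(idx): hypothetical in-place processing of slot idx */
            (void)idx;
        }
        return NULL;
    }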

This is an example of a thread-safe queue:

https://github.com/Xilinx/Vitis-AI/blob/master/tools/Vitis-AI-Runtime/VART/vart/util/include/vitis/ai/shared_queue.hpp

wzab
Mentor
290 Views
Registered: 08-24-2011

No, I'm not going to use XDMA (OK, to some extent it may be used in the initial version as an AXI MM to PCIe bridge, but with an independent driver). The main part of the core should be prepared as portable HDL, not Xilinx-specific.

The producer is in hardware (data arriving via dedicated MGT links). The cyclic mode must be implemented in hardware.

The driver must support sleeping until the required amount of data arrives. Then, however, the application should enter an "active polling" mode, without interrupts, until the buffer is empty. The whole solution must be "zero copy".

The solution should be similar to https://gitlab.com/WZab/versatile-dma1 ; however, now the segment addresses should be stored in the hardware (not delivered by the driver from PC memory).

I want to minimize the usage of proprietary cores (finally it should communicate with the PCIe primitive at the TLP level, so that it can be easily adapted to various FPGA vendors; initially it may use the AXI MM slave via https://www.xilinx.com/support/documentation/ip_documentation/xdma/v4_1/pg195-pcie-dma.pdf or the Avalon MM slave - https://www.intel.com/content/www/us/en/programmable/documentation/sox1520633403002.html ).

 

hokim
Scholar
244 Views
Registered: 10-21-2015

Hi

I think you had better modify your hardware (IP) design.

The Xilinx DMA hardware engine fetches only 2048 descriptors (page info) at a time from host memory and transfers the related buffers.

It means the DMA engine always works with temporary descriptors (it continually updates them).

See Scatter Gather Example in 

https://www.xilinx.com/video/technology/getting-the-best-performance-with-dma-for-pci-express.html

So there is no limit on processing a hugepages-backed buffer.

0 Kudos