
How to reduce PCI latency to read host memory (Linux)

Started by TheCyrus, 2 years ago · 4 replies (latest reply 2 years ago) · 529 views

Hello experts, I hope you folks can give me some help.

I have a PCI device and a Linux device driver to interact with it. I use a buffer in host memory to transfer a lot of data to the device. Once the buffer is ready I write to the device's registers, specifying the host memory address (obtained using virt_to_bus) and the amount of data to read.

After this the device uses DMA to get the buffer data.
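
For concreteness, the setup is roughly the sketch below. The register offsets are made-up placeholders, not the real device layout, and I've written it with dma_map_single(), which I gather is the preferred replacement for virt_to_bus():

/* Rough sketch of the transfer setup. REG_* offsets are made-up placeholders
 * for the real device layout; buf is the kmalloc'ed host buffer. */
#include <linux/kernel.h>
#include <linux/pci.h>
#include <linux/dma-mapping.h>
#include <linux/io.h>

#define REG_DMA_ADDR_LO 0x00        /* placeholder register offsets */
#define REG_DMA_ADDR_HI 0x04
#define REG_DMA_LEN     0x08
#define REG_DMA_START   0x0c

static int start_transfer(struct pci_dev *pdev, void __iomem *bar,
                          void *buf, size_t len)
{
    dma_addr_t handle;

    /* Map the buffer for device reads (CPU -> device); this also does any
     * cache flushing the architecture needs. */
    handle = dma_map_single(&pdev->dev, buf, len, DMA_TO_DEVICE);
    if (dma_mapping_error(&pdev->dev, handle))
        return -ENOMEM;

    /* Tell the device where the buffer is and how much to read. */
    iowrite32(lower_32_bits(handle), bar + REG_DMA_ADDR_LO);
    iowrite32(upper_32_bits(handle), bar + REG_DMA_ADDR_HI);
    iowrite32((u32)len, bar + REG_DMA_LEN);
    iowrite32(1, bar + REG_DMA_START);   /* device now DMAs the buffer */

    /* After the device signals completion:
     * dma_unmap_single(&pdev->dev, handle, len, DMA_TO_DEVICE); */
    return 0;
}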

The problem is that this transfer is taking a very long time. Based on the PCI bus speed we expected it to take a few dozen nanoseconds, but we are getting hundreds (500 to 800 ns).

This is what we have tried:

- Using kmalloc instead of vmalloc, and reducing the buffer size, so that we get physically contiguous memory
- Disabling the IOMMU

Those modifications made no difference.

Do you guys have any idea where we could look for ways to reduce the latency? Could it be related to caching? If so, how can we verify that, and what can be done about it? What else could be causing this delay? This is my first project with Linux (previously I have only worked with bare-metal devices), so any help is welcome. Thanks.

Reply by jmford94 · May 2, 2022

There are a lot of things to consider.  Most are architecture specific, and you don't say what processors, chipsets, etc. you are using.

Are you sure it's the transfers that are slow, or is it the starting and stopping, that is, notifications, that are slow?  Have you put a scope on the signals to verify?
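
If a scope isn't handy, you can also timestamp it in the driver, something like the sketch below (my_dev, the register offset and the completion interrupt are placeholders for whatever your driver actually has):

/* Sketch: timestamp kick-off and completion. The gap includes interrupt
 * delivery, so it separates "transfer + notification" time from whatever
 * else the driver is doing. my_dev and REG_DMA_START are placeholders. */
#include <linux/ktime.h>
#include <linux/interrupt.h>
#include <linux/io.h>
#include <linux/printk.h>

#define REG_DMA_START 0x0c          /* placeholder register offset */

struct my_dev {
    void __iomem *bar;
    u64 t_start;
};

static void kick_dma(struct my_dev *dev)
{
    dev->t_start = ktime_get_ns();
    iowrite32(1, dev->bar + REG_DMA_START);
}

static irqreturn_t dma_done_irq(int irq, void *data)
{
    struct my_dev *dev = data;

    pr_info("DMA done in %llu ns\n", ktime_get_ns() - dev->t_start);
    return IRQ_HANDLED;
}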

Are the memory blocks allocated at the correct alignment for best speed? (This depends on the architecture: 32-, 64-, or 128-bit alignment.)

Are the memory blocks allocated "close" to the DMA controller and the PCI controller?  Not on a memory bank far on the other side of the system.

Are your memory, process, and interrupt handlers locked to a core?  Is it the "right" core?  When you allocate the memory, it needs to be touched right away so it is mapped in, then locked with mlockall().  I guess this doesn't apply to kmalloc'ed memory.
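
For a userspace buffer, the touch/lock/pin part I mean looks roughly like this (the core number and buffer size are arbitrary examples):

/* Sketch: pin the process to a core, lock memory, and pre-touch the buffer. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    cpu_set_t set;
    size_t size = 1 << 20;          /* 1 MiB example buffer */
    char *buf;

    /* Pin the process to core 2 (pick whichever core is "right" for your box). */
    CPU_ZERO(&set);
    CPU_SET(2, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        perror("sched_setaffinity");

    /* Lock current and future pages so they cannot be paged out. */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");

    buf = malloc(size);
    memset(buf, 0, size);           /* touch every page so it is mapped in now */

    /* ... hand buf to the driver and run the transfers ... */

    free(buf);
    return 0;
}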

Look at what else is going on on your system while stress testing it.  Look at the IRQs, CPU usage and allocations, etc.

It may be that your host isn't so good for the purpose.  Surprisingly, there's a wide variation in the abilities of the various PC motherboards.

If you can supply some more details, we may be able to help more.

EDIT:  Check this out.  An excellent discussion of the problem.  And unfortunately, it appears that you may be out of luck.  The median PCIe latency they measure is about 450 ns.

https://gianniantichi.github.io/files/presentation...

Reply by CustomSarge · May 2, 2022

Does the latency vary? Is there a task priority manager in Linux? Are there other tasks running that you could pause or do without?

What's the overall nature of the process - how big a data block has to be transferred, how fast, and how often? Presuming you're not overrunning the system bus bandwidth, the "accordion" effect may require an intermediary accumulator buffer: the device bursts a data block, and the system acquires multiple blocks frequently enough to keep up.
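
Conceptually something like the sketch below: a single-producer / single-consumer ring of fixed-size blocks (block size and count are arbitrary examples):

/* Sketch of an accumulator: producer drops in blocks as they arrive,
 * consumer drains them at its own pace. */
#include <stdatomic.h>
#include <string.h>

#define BLOCK_SIZE 4096             /* arbitrary example sizes */
#define NUM_BLOCKS 64

static char blocks[NUM_BLOCKS][BLOCK_SIZE];
static atomic_uint head;            /* written by the producer */
static atomic_uint tail;            /* written by the consumer */

/* Producer: store one incoming block; returns -1 if the ring is full. */
int ring_put(const void *data)
{
    unsigned int h = atomic_load(&head);

    if (h - atomic_load(&tail) == NUM_BLOCKS)
        return -1;                  /* consumer is not keeping up */
    memcpy(blocks[h % NUM_BLOCKS], data, BLOCK_SIZE);
    atomic_store(&head, h + 1);
    return 0;
}

/* Consumer: take one block out; returns -1 if the ring is empty. */
int ring_get(void *out)
{
    unsigned int t = atomic_load(&tail);

    if (t == atomic_load(&head))
        return -1;
    memcpy(out, blocks[t % NUM_BLOCKS], BLOCK_SIZE);
    atomic_store(&tail, t + 1);
    return 0;
}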

Caveat: I've never done high speed / low latency transfers with an OS, but many between uC units. Those all ran assembler, so I had total control.

Good Hunting - OS makes it not trivial   <<<)))