Reply by December 14, 2016
I have received an answer on Stack Overflow.
It appears that the flag ARCH_HAS_DMA_MMAP_COHERENT has not been used since kernel 3.6.
Therefore, one should always use dma_mmap_coherent to mmap coherent buffers.
There is one likely pitfall. When mapping multiple areas (as in my case, multiple DMA buffers), the area is selected by the "offset" parameter of the mmap function.
In this case it is important to zero the vma->vm_pgoff field before calling dma_mmap_coherent, because dma_mmap_coherent interprets that field as an offset into the buffer being mapped.
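
A minimal kernel-side sketch of such an mmap handler follows. It reuses my_pdev, virt_buf, phys_buf and BUF_SIZE from the original question; the handler name my_mmap, the N_BUFS constant, the BUF_SIZE value, and the convention that the mmap offset (in pages) carries the buffer index are assumptions made for illustration, not details confirmed in the thread.

```c
#include <linux/dma-mapping.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/platform_device.h>

#define N_BUFS   4          /* hypothetical number of DMA buffers */
#define BUF_SIZE (1 << 20)  /* hypothetical; the post does not state the size */

static struct platform_device *my_pdev; /* assumed to be set in probe() */
static void *virt_buf[N_BUFS];          /* CPU addresses of the coherent buffers */
static dma_addr_t phys_buf[N_BUFS];     /* matching DMA addresses */

/* Hypothetical mmap handler: maps one of several coherent DMA buffers. */
static int my_mmap(struct file *filp, struct vm_area_struct *vma)
{
	/* Assumption: user space passes the buffer index as the mmap offset in pages. */
	unsigned long off = vma->vm_pgoff;
	unsigned long vsize = vma->vm_end - vma->vm_start;

	if (off >= N_BUFS || vsize > BUF_SIZE)
		return -EINVAL;

	/* The offset has been consumed as a buffer index; clear it so that
	 * dma_mmap_coherent() does not treat it as an offset into the buffer. */
	vma->vm_pgoff = 0;

	return dma_mmap_coherent(&my_pdev->dev, vma,
				 virt_buf[off], phys_buf[off], vsize);
}
```

Under the same assumption, user space would select buffer n by passing n times the page size as the last argument of mmap.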

Thanks & regards,
Wojtek
Reply by December 12, 2016
Hi,

I am developing firmware and software for an UltraScale+ based embedded data acquisition system.
The code was ported from the 32-bit Zynq platform, where it worked perfectly.

The acquired data are transferred by DMA to buffers allocated with:
dma_zalloc_coherent(&pdev->dev, BUF_SIZE, &phys_buf[i], GFP_KERNEL);

The buffer is memory-mapped to user space with:
remap_pfn_range(vma, vma->vm_start, phys_buf[off] >> PAGE_SHIFT, vsize, pgprot_noncached(vma->vm_page_prot));
because
dma_mmap_coherent(&my_pdev->dev, vma, virt_buf[off], phys_buf[off], vsize);
is not available on that platform (ARCH_HAS_DMA_MMAP_COHERENT is not set).

The essential point is that the DMA buffer is mmapped with caching disabled (necessary to ensure data coherency).

The transferred data are then sent in UDP packets with the "sendto" function.
The length of each packet is limited to MAX_DGRAM (originally 572).

int i = 0;
int bleft = nbytes;
while (i < nbytes) {
    /* Send at most MAX_DGRAM bytes per datagram. */
    int bts = bleft < MAX_DGRAM ? bleft : MAX_DGRAM;
    if (sendto(fd, &buf[nbuf][i], bts, 0, res2->ai_addr, res2->ai_addrlen) == -1) {
        printf("%s", strerror(errno));
        exit(1);
    }
    bleft -= bts;
    i += bts;
}

The code usually works, but sometimes it generates the following error:

[  852.703491] Unhandled fault: alignment fault (0x96000021) at 0x0000007f82635584
[  852.710739] Internal error: : 96000021 [#4] SMP
[  852.715235] Modules linked in: axi4s2dmov(O) ksgpio(O)
[  852.720358] CPU: 0 PID: 1870 Comm: a4s2dmov_send Tainted: G      D    O    4.4.0 #3
[  852.728001] Hardware name: ZynqMP ZCU102 RevB (DT)
[  852.732769] task: ffffffc0718ac180 ti: ffffffc0718b8000 task.ti: ffffffc0718b8000
[  852.740248] PC is at __copy_from_user+0x8c/0x180
[  852.744836] LR is at copy_from_iter+0x70/0x24c
[  852.749261] pc : [<ffffffc00039210c>] lr : [<ffffffc0003a36a8>] pstate: 80000145
[  852.756644] sp : ffffffc0718bba40
[  852.759935] x29: ffffffc0718bba40 x28: ffffffc06a4bae00 
[  852.765228] x27: ffffffc0718ac820 x26: 000000000000000c 
[  852.770523] x25: 0000000000000014 x24: 0000000000000000 
[  852.775818] x23: ffffffc0718bbe08 x22: ffffffc0710eba38 
[  852.781112] x21: ffffffc0718bbde8 x20: 000000000000000c 
[  852.786407] x19: 000000000000000c x18: ffffffc000823020 
[  852.791702] x17: 0000000000000000 x16: 0000000000000000 
[  852.796997] x15: 0000000000000000 x14: 00000000c0a85f32 
[  852.802292] x13: 0000000000000000 x12: 0000000000000032 
[  852.807586] x11: 0000000000000014 x10: 0000000000000014 
[  852.812881] x9 : ffffffc0718bbcf8 x8 : 000000000000000c 
[  852.818176] x7 : ffffffc0718bbdf8 x6 : ffffffc0710eba2c 
[  852.823471] x5 : ffffffc0710eba38 x4 : 0000000000000000 
[  852.828766] x3 : 000000000000000c x2 : 000000000000000c 
[  852.834061] x1 : 0000007f82635584 x0 : ffffffc0710eba2c 
[  852.839355] 
[  852.840833] Process a4s2dmov_send (pid: 1870, stack limit = 0xffffffc0718b8020)
[  852.848134] Stack: (0xffffffc0718bba40 to 0xffffffc0718bc000)
[  852.853858] ba40: ffffffc0718bba90 ffffffc0006a1b2c 000000000000000c ffffffc06a9bdb00
[  852.861676] ba60: 00000000000005dc ffffffc071a0d200 0000000000000000 ffffffc0718bbdf8
[  852.869488] ba80: 0000000000000014 ffffffc06a959000 ffffffc0718bbad0 ffffffc0006a2358
[...]
[  853.213212] Call trace:
[  853.215639] [<ffffffc00039210c>] __copy_from_user+0x8c/0x180
[  853.221284] [<ffffffc0006a1b2c>] ip_generic_getfrag+0xa4/0xc4
[  853.227011] [<ffffffc0006a2358>] __ip_append_data.isra.43+0x80c/0xa70
[  853.233434] [<ffffffc0006a3d50>] ip_make_skb+0xc4/0x148
[  853.238642] [<ffffffc0006c9d04>] udp_sendmsg+0x280/0x740
[  853.243937] [<ffffffc0006d38e4>] inet_sendmsg+0x7c/0xbc
[  853.249145] [<ffffffc000651f5c>] sock_sendmsg+0x18/0x2c
[  853.254352] [<ffffffc000654b14>] SyS_sendto+0xb0/0xf0
[  853.259388] [<ffffffc000084470>] el0_svc_naked+0x24/0x28
[  853.264682] Code: a88120c7 a8c12027 a88120c7 36180062 (f8408423) 
[  853.270791] ---[ end trace 30e1cd8e2ccd56c5 ]---
Segmentation fault
root@Xilinx-ZCU102-2016_2:~#

Obviously, the problem is related to using sendto to send data starting from an address that is not 8-byte aligned (which happens quite often, as the selected MAX_DGRAM=572 is NOT a multiple of 8).
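
To illustrate the arithmetic: 572 mod 8 = 4, so, assuming the buffer itself starts on an 8-byte boundary, every second datagram starts at an offset that is only 4-byte aligned. The following is a minimal sketch that counts such chunk starts using the same loop shape as the sender. The helper count_unaligned_starts is hypothetical and not part of the original program, and the comparison with a datagram size of 568 only shows the arithmetic; it is not a confirmed fix for the fault.

```c
#include <stddef.h>

/* Count chunk start offsets that are not 8-byte aligned when nbytes are
 * split into chunks of at most max_dgram bytes. Offsets are relative to
 * the start of the buffer, which is assumed to be 8-byte aligned. */
static size_t count_unaligned_starts(size_t nbytes, size_t max_dgram)
{
    size_t unaligned = 0;
    size_t i = 0;

    if (max_dgram == 0)
        return 0; /* avoid an endless loop on a zero chunk size */

    while (i < nbytes) {
        size_t left = nbytes - i;
        size_t bts = left < max_dgram ? left : max_dgram;

        if (i % 8 != 0)
            unaligned++;
        i += bts;
    }
    return unaligned;
}
```

For a 4096-byte buffer, count_unaligned_starts(4096, 572) counts the chunks starting at offsets 572, 1716, 2860 and 4004, while a datagram size of 568 (a multiple of 8) yields no unaligned starts.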

The strange thing is the relatively low rate of the error. It occurs only after a few hundred transfers (equivalent to a few thousand unaligned block transfers).

Is this a bug in the Linux kernel __copy_from_user implementation for AArch64, or have I missed something when memory-mapping the DMA buffer to user space?

TIA & regards,
Wojtek

PS. The question was also asked at http://stackoverflow.com/questions/41093013/linux-on-arm64-sendto-causes-unhandled-fault-alignment-fault-0x96000021-wh