Incorrect alignment of SIMD vectors returned by value from functions
I thought, that force-inlining all functions which return a SIMD vector would be the solution, but I was wrong. The same issue reveals when storing a SIMD vector in a local variable. Here is the disassembly of such SIGSEGV crash:
0x00000000005520db <+101>: vmovdqa ymm0,ymm1 => 0x00000000005520df <+105>: vmovdqa YMMWORD PTR [rbp-0x40],ymm0 ... (gdb) p $rbp $1 = (void *) 0xf7a750 (gdb) p $rbp-0x40 $2 = (void *) 0xf7a710
The local variable's address 0xf7a710 is not properly aligned to 32 bytes, so vmovdqa crashes.
This was compiled with -O0. When compiled with -O3, it doesn't crash, probably because it keeps all computations in registers and doesn't store intermediate results in local variables at all. However, there are cases when storing vector registers to local variables is inevitable.
Thank you for your report, but may I ask why you have raised the issue here? You are using a (trademark-infringing) x86_64-w64-mingw32 GCC compiler suite, (which, in spite of its anomalous mingw32 host identification, appears to generate 64-bit code); this compiler suite appears to originate from the MSYS2 project, which is not, in any way, associated with the MinGW Project, (the legitimate owner of the MinGW trademark). Consequently, I am inclined to close this, as an "invalid" ticket.
Before I do close it, however, I will offer the following comments:
$ mingw32-g++ -S -O0 -DARRAY_SIZE=32 -o- -masm=intel -mavx2 main.cc | grep -v cfi | less -FX ... .text .globl __Z16crash_vector_retPh .def __Z16crash_vector_retPh; .scl 2; .type 32; .endef __Z16crash_vector_retPh: LFB5437: push ebp mov ebp, esp and esp, -32 sub esp, 64 mov eax, DWORD PTR [ebp+8] mov DWORD PTR [esp+28], eax mov eax, DWORD PTR [esp+28] vmovdqu xmm0, XMMWORD PTR [eax] vinserti128 ymm0, ymm0, XMMWORD PTR [eax+16], 0x1 vmovdqa YMMWORD PTR [esp+32], ymm0 vmovdqa ymm0, YMMWORD PTR [esp+32] leave retIgnoring that this is 32-bit assembly, (and if I actually try to run it, it crashes, not with the SIGSEGV you report, but with a SIGILL "illegal instruction" exception), do note the additional:
and esp, -32instruction, at the beginning of the local stack frame set-up, for which no corresponding instruction appears in your GDB disassembly: this ensures that the local frame itself, and thus any variable addressed at a 32-byte offset within it, will be correctly aligned on a 32-byte boundary.
$ g++ -S -O0 -DARRAY_SIZE=32 -o- -masm=intel -mavx2 main.cc | grep -v cfi | less -FX ... .text .size _ZlsRSoRKDv4_x, .-_ZlsRSoRKDv4_x .globl _Z16crash_vector_retPh .type _Z16crash_vector_retPh, @function _Z16crash_vector_retPh: .LFB5667: push rbp mov rbp, rsp and rsp, -32 mov QWORD PTR -56[rsp], rdi mov rax, QWORD PTR -56[rsp] mov QWORD PTR -40[rsp], rax mov rax, QWORD PTR -40[rsp] vmovdqu xmm0, XMMWORD PTR [rax] vinserti128 ymm0, ymm0, XMMWORD PTR 16[rax], 0x1 vmovdqa YMMWORD PTR -32[rsp], ymm0 vmovdqa ymm0, YMMWORD PTR -32[rsp] leave retObviously, this is now 64-bit Linux-native code, but it too has that corresponding:
and rsp, -32instruction, to achieve the required stack frame alignment. (Alas, if I try to run this, it too crashes with a SIGILL exception, on the first vmovdqu instruction; I guess the Intel Celeron N4000 64-bit processor, in my laptop, either doesn't support the AVX2 instructions, or they are disabled in the firmware configuration).
Thanks a lot for your answer, @keith, and for testing it on linux in the cross-compile mode. I wasn't aware of any trademark issues between MinGW and MSYS2. I'm sorry to hear that. I thought that msys2 includes the mingw gcc compiler package just like it includes other open source packages. I know that they apply some patches on top of MinGW, but I thought that this kind of bug is so low level that it has to be directly in MinGW sources, not in msys2 patches. Eventually, it seems this bug is msys2 specific. Anyway, I like your idea of cross-compiling on linux for windows. Thanks, you can close this issue.
Summary of the Issue
Working with AVX2 SIMD vectors. When a __m256i value is returned from a function, the SIMD register ymm0 is copied to memory using the aligned move instruction vmovdqa, so the destination memory address must be aligned to 32 bytes. This requirement is not met in my test case and the program crashes with SIGSEGV. More info follows in the "Minimal Test Case" section below.
I tested it also on Ubuntu's gcc, but it didn't crash there, neither did with clang++ on Windows and Ubuntu, so therefore I am submitting this issue to the MinGW tracker. But I might be wrong and the issue may exist in gcc as well.
Host Operating System Information and Version
Windows 10 Home, Version 10.0.17134 Build 17134
with latest updated MSYS2.
GCC Version
Binutils Version
MinGW Version
I don't know what should I look for in the mingw/include/_mingw.h file.
Build Environment
Minimal Self-Contained Test Case
main.cpp:
Makefile:
Trying various compilation options:
GDB session
Program compiled by
g++ -mavx2 -std=c++17 -g -DARRAY_SIZE=32 -O0 -o main_O0_g++_s32.exe main.cpp
As you can see, the $rax address is not aligned to 32 bytes and therefore the vmovdqa crashes.
I thought that because of the alignof(__m256i) == 32, the compiler should guarantee that also the return value is aligned to 32 bytes.
I might be wrong. Could you please shed more light on this issue?
Thanks for any hints.