The SSE instruction set was introduced with the Pentium III processor in 1999 and is now supported almost everywhere. Compilers perform auto-vectorization, so all that's required to take advantage of SSE is a compiler switch (if it's not already enabled by default). SSE typically gives faster floating-point performance than the older x87 unit, and is useful when you want single-precision arithmetic without higher-precision intermediate results.
GCC is available under Windows with MinGW. It's a good compiler and has the advantage of making cross-platform development straightforward - it's available for all major platforms and you just need a single Makefile (with some minor platform-specific tweaks) to build the same code across Windows, macOS and Linux (and possibly others). You don't need to keep multiple project workspaces (e.g. Visual Studio, Xcode etc.) in sync just to do a build, and the operations and settings are all clear and unambiguous (in my experience Xcode is particularly prone to hiding what it does, even when you want to know about it).
However, there is a pitfall that may trip you up!
Some SSE instructions need their data to be aligned to certain boundaries in memory (e.g. to begin on a 16-byte boundary). For the casual SSE user (like me), writing no assembly code and relying on the compiler to auto-vectorize where it can, this should not matter - the compiler should sort out these details. Mostly it does, but not always.
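To make "16-byte aligned" concrete, here is a minimal sketch of how an address can be tested. The helper name is_aligned_16 is made up for illustration; it is not part of GCC or Windows.

```cpp
#include <cstdint>

// Hypothetical helper: true when p sits on a 16-byte boundary, which is
// what aligned SSE loads and stores such as movaps require.
inline bool is_aligned_16(const void* p) {
    // An address is a multiple of 16 exactly when its low four bits are zero.
    return (reinterpret_cast<std::uintptr_t>(p) & 0xF) == 0;
}
```

For example, if an array of floats starts on a 16-byte boundary, then element 0 and element 4 pass this test, while elements 1 to 3 do not (each float is 4 bytes).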
Some data used by SSE instructions is going to be put on the stack. The alignment (or not) of this is something that the compiler should take care of. Here is where there is a problem with the interaction between GCC and Windows.
Windows (32-bit) only guarantees that the stack is aligned on a 4-byte boundary. GCC decides to keep the stack aligned to a 16-byte boundary, so it fixes up what it gets from Windows (or whatever operating system) when it initially gets control in main() (or WinMain()). Then it assumes the stack remains 16-byte aligned throughout the rest of the program. This is to allow function entry and exit to be more efficient. I think the Microsoft compiler fixes up the stack to a 16-byte boundary on entry to every function that uses SSE instructions, so this problem does not occur there.
This strategy has a flaw though - when Windows calls back into your code (such as WndProc() or the thread function you pass to CreateThread()), your program gets control without GCC knowing it needs to fix up the stack to a 16-byte boundary. It assumes the stack is already 16-byte aligned, but it isn't, so your program may crash when it reaches an SSE instruction that requires 16-byte aligned data. For me, this was a movaps instruction, and since I don't know much about assembly, working out what was going wrong took me ages! Here's the debugger output:
    Program received signal SIGSEGV, Segmentation fault.
    Renderer::Initialise (clear=..., text=...) at ../../../Renderer.cpp:659
    659         GLfloat globalAmbient[4] = {0.0f, 0.0f, 0.0f, 1.0f};
    (gdb) disassemble
    Dump of assembler code for function Renderer::Initialise(Colour const&, Colour const&):
       0x0041c630 <+0>:     push   %ebp
       0x0041c631 <+1>:     push   %edi
       0x0041c632 <+2>:     push   %esi
       0x0041c633 <+3>:     push   %ebx
       0x0041c634 <+4>:     sub    $0x6c,%esp
       0x0041c637 <+7>:     cmpb   $0x0,0x50a0e0
       0x0041c63e <+14>:    movaps 0x4d96c0,%xmm0
       0x0041c645 <+21>:    mov    0x80(%esp),%esi
       0x0041c64c <+28>:    mov    0x84(%esp),%ebx
    => 0x0041c653 <+35>:    movaps %xmm0,0x50(%esp)
       0x0041c658 <+40>:    je     0x41cbe0 <Renderer::Initialise(Colour const&, Colour const&)+1456>
       0x0041c65e <+46>:    movl   $0x4bc820,(%esp)
       0x0041c665 <+53>:    lea    0x4c(%esp),%edi
       0x0041c669 <+57>:    movb   $0x0,0x50a0e0
       0x0041c670 <+64>:    call   0x425bd0 <Output(char const*, ...)>
    (gdb) info register esp
    esp            0x22f73c 0x22f73c
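GDB uses the AT&T assembly syntax by default, where instructions have the form operation source, destination. The faulting movaps is therefore storing %xmm0 to the address 0x50(%esp), and the problem is that 0x22f73c + 0x50 = 0x22f78c is not 16-byte aligned: a 16-byte aligned address ends in the hex digit 0, and this one ends in c.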
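The arithmetic can be checked mechanically. This is only a sketch: the function name misalignment_16 and the constant names are invented for illustration, and the two input values are copied from the GDB session above.

```cpp
#include <cstdint>

// Hypothetical helper: how many bytes past the previous 16-byte boundary
// an address sits. Zero means the address is 16-byte aligned.
constexpr std::uint32_t misalignment_16(std::uint32_t address) {
    return address & 0xFu;
}

constexpr std::uint32_t stack_pointer = 0x22f73c;  // from "info register esp"
constexpr std::uint32_t displacement  = 0x50;      // from "movaps %xmm0,0x50(%esp)"
constexpr std::uint32_t store_address = stack_pointer + displacement;

// The store lands 12 bytes past a 16-byte boundary, so movaps faults.
static_assert(store_address == 0x22f78c, "sum of esp and the displacement");
static_assert(misalignment_16(store_address) == 0xc, "not 16-byte aligned");

// esp itself is a multiple of 4 but not of 16, which matches a stack that
// is only 4-byte aligned on entry.
static_assert(stack_pointer % 4 == 0, "4-byte aligned");
static_assert(misalignment_16(stack_pointer) != 0, "not 16-byte aligned");
```

Because the checks are static_assert, simply compiling the file confirms the numbers.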
There are two ways you can fix this problem.
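1. Pass the -mstackrealign flag to GCC, which makes functions realign the stack on entry.
2. Mark each function that can be entered from outside your own code with __attribute__((force_align_arg_pointer)).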
The second option is more efficient and in my case was easy - there were only three entry points in my code: WinMain(), WndProc() and the function I pass to CreateThread().
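As a rough sketch of the second option, the attribute goes on the definition of each entry point. The macro REALIGN_STACK and the function callback_entry are hypothetical names used for illustration only; callback_entry stands in for a real entry point such as WndProc(), and the preprocessor guard is an assumption that lets the same file build with compilers or targets that lack the attribute.

```cpp
#include <cstdint>

// Assumption: GCC (or a compatible compiler) targeting x86. Elsewhere the
// macro expands to nothing so the code still compiles.
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#define REALIGN_STACK __attribute__((force_align_arg_pointer))
#else
#define REALIGN_STACK
#endif

// Hypothetical entry point standing in for WndProc() or a thread function.
// The attribute asks the compiler to realign the stack in this function's
// prologue instead of trusting the caller to have aligned it.
REALIGN_STACK bool callback_entry(float scale) {
    // A local that the compiler may initialise with an aligned SSE store.
    alignas(16) float ambient[4] = {0.0f, 0.0f, 0.0f, 1.0f};
    for (float& component : ambient) {
        component *= scale;
    }
    // Report whether the local really ended up on a 16-byte boundary.
    return (reinterpret_cast<std::uintptr_t>(ambient) & 0xF) == 0;
}
```

In a real Windows program the same macro would be placed on the actual WinMain(), WndProc() and thread function definitions, leaving every other function with GCC's cheaper default prologue.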
I hope this is useful to someone - I've found lots of useful information on the Internet that people have taken the time to write, and I decided that this was something I could do here.