SIMD instruction sets

The full performance of a modern CPU can be harnessed only through SIMD instruction sets. In particular, the peak floating-point performance figures reported by CPU vendors are achievable only by combining these instruction sets with multithreading. At Archaea Software, LLC, we have worked with these instruction set extensions since their inception in the mid-1990s. Though difficult to use, they can be critical to achieving maximum performance.

Starting in the mid-1990s, Intel and other CPU vendors began shipping CPUs that could perform the same operation on multiple data elements in a single instruction. Because the execution units of a modern CPU consume few transistors compared with the on-chip caches, the new registers and instructions could be added without making the CPU prohibitively expensive.

Intel's first foray into so-called "SIMD" (single instruction, multiple data) instruction sets was MMX ("MultiMedia eXtensions"), which allowed each of eight 64-bit registers to be treated as eight bytes, four 16-bit words, two 32-bit doublewords, or one 64-bit quadword. Instructions such as PADDW (packed add word) perform four additions on corresponding 16-bit quantities in these registers. The instructions can be accessed either by hand-coding assembly language or by using compiler "intrinsics": inline functions that correspond to the new low-level instructions but can be invoked from a high-level language such as C++.
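To make the idea of an intrinsic concrete, here is a minimal sketch of a packed 16-bit addition written in C++. It is an illustration under stated assumptions, not a definitive implementation: it uses the 128-bit SSE2 form of PADDW (the _mm_add_epi16 intrinsic, eight lanes rather than MMX's four), because current 64-bit toolchains expose that form most readily. The function name add8_words and the HAVE_SSE2 macro are hypothetical names chosen for this example.

```cpp
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>  // SSE2 intrinsics: __m128i, _mm_add_epi16
#define HAVE_SSE2 1     // hypothetical macro, used only in this sketch
#endif

// Adds eight pairs of 16-bit integers. Like PADDW, the addition wraps
// around on overflow rather than saturating.
void add8_words(const std::int16_t* a, const std::int16_t* b, std::int16_t* out) {
#ifdef HAVE_SSE2
    // Unaligned loads, so the caller's arrays need no special alignment.
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    // One intrinsic call corresponds to one packed-add instruction.
    __m128i sum = _mm_add_epi16(va, vb);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), sum);
#else
    // Scalar fallback for targets without SSE2: same wrapping behavior.
    for (int i = 0; i < 8; ++i) {
        std::uint16_t s = static_cast<std::uint16_t>(
            static_cast<std::uint16_t>(a[i]) + static_cast<std::uint16_t>(b[i]));
        out[i] = static_cast<std::int16_t>(s);
    }
#endif
}
```

The compiler handles register allocation and instruction scheduling around the intrinsic, which is the main practical advantage over hand-written assembly.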

After MMX, Intel introduced the SSE ("Streaming SIMD Extensions") instruction set, which added eight new registers (XMM0..XMM7) that each hold 128 bits of data. Initially, SSE supported only single-precision floating point operations; for example, the ADDPS instruction performs four floating point additions in one instruction. The subsequent SSE2 instruction set, first introduced with the Pentium 4, added MMX-like integer capabilities as well as support for double-precision floating point.
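As a rough illustration of ADDPS in practice, the sketch below adds two float arrays four elements at a time using the _mm_add_ps intrinsic, then handles any leftover elements with a scalar loop. The function name add_floats and the USE_SSE macro are hypothetical, and the code assumes nothing about the alignment of the arrays, so it uses unaligned loads and stores.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  // SSE intrinsics: __m128, _mm_add_ps
#define USE_SSE 1       // hypothetical macro, used only in this sketch
#endif

// Computes out[i] = a[i] + b[i] for i in [0, n).
void add_floats(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#ifdef USE_SSE
    // Main loop: each iteration performs four additions with one ADDPS.
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    // Tail loop: remaining elements (or all of them without SSE).
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The scalar tail loop is a common design choice: it keeps the function correct for any length without requiring callers to pad their arrays to a multiple of four.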

For applications targeting x86 chips, SSE2 is a reasonable "least common denominator" instruction set: it has been available on Intel CPUs since the Pentium 4 shipped in late 2000, and it provides full-featured support for packed integers with 8- to 64-bit elements as well as single- and double-precision floating point. On x86-64, the 64-bit extension of the x86 architecture, SSE2 is part of the baseline instruction set, and compilers use it by default for floating point arithmetic; the legacy x87 unit remains available but is rarely used.

Since the introduction of the Pentium 4, the x86 instruction set has been extended with new instructions beyond SSE2. For example, SSE3 added "horizontal add" instructions that operate on adjacent elements within a register instead of just performing parallel operations on corresponding elements of two registers. The SSE4 instruction set added instructions intended to help compilers vectorize code and broadened support for packed 32-bit integer computation.
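The difference between a conventional ("vertical") packed add and a horizontal add is easiest to see in a scalar reference model. The sketch below does not use SIMD instructions at all; it only models, in plain C++, which elements are combined by ADDPS and by the SSE3 HADDPS instruction. The type alias Vec4 and both function names are hypothetical names for this illustration.

```cpp
#include <array>

using Vec4 = std::array<float, 4>;  // models one 128-bit register of four floats

// Vertical add (ADDPS-style): element i of a is added to element i of b.
Vec4 vertical_add(const Vec4& a, const Vec4& b) {
    return {a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]};
}

// Horizontal add (HADDPS-style): adjacent pairs within each source are
// summed. The low half of the result comes from a, the high half from b.
Vec4 horizontal_add(const Vec4& a, const Vec4& b) {
    return {a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]};
}
```

For example, with a = {1, 2, 3, 4} and b = {10, 20, 30, 40}, the vertical add yields {11, 22, 33, 44}, while the horizontal add yields {3, 7, 30, 70}.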

Developers must use post-SSE2 instruction set extensions with care, depending on the CPUs they are targeting. To run on hardware that may or may not include the SSE4 instruction set, for example, the application must detect at run time whether SSE4 is available, typically by querying the CPUID instruction, and fall back to an SSE2 or scalar implementation if it is not. Executing an instruction the CPU does not support raises an invalid-opcode fault.
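The detect-and-fall-back pattern described above can be sketched as follows. This is one possible approach under explicit assumptions, not the only way to do it: it assumes GCC or Clang on x86, uses the compiler builtin __builtin_cpu_supports for detection (other toolchains would query CPUID through their own facilities), and uses the target("sse4.1") function attribute so that only the SSE4.1 routine is compiled with SSE4.1 enabled. The function names mul32_scalar, mul32_sse41, and select_mul32, and the HAVE_X86_DISPATCH macro, are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
#include <smmintrin.h>        // SSE4.1 intrinsics: _mm_mullo_epi32
#define HAVE_X86_DISPATCH 1   // hypothetical macro, used only in this sketch
#endif

// Portable fallback: out[i] = low 32 bits of a[i] * b[i].
void mul32_scalar(const std::uint32_t* a, const std::uint32_t* b,
                  std::uint32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * b[i];
}

#ifdef HAVE_X86_DISPATCH
// SSE4.1 version: four 32-bit multiplies per PMULLD instruction.
// The attribute enables SSE4.1 code generation for this function only.
__attribute__((target("sse4.1")))
void mul32_sse41(const std::uint32_t* a, const std::uint32_t* b,
                 std::uint32_t* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i),
                         _mm_mullo_epi32(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] * b[i];  // leftover elements
}
#endif

using Mul32Fn = void (*)(const std::uint32_t*, const std::uint32_t*,
                         std::uint32_t*, std::size_t);

// Chooses an implementation once, based on what the running CPU supports.
Mul32Fn select_mul32() {
#ifdef HAVE_X86_DISPATCH
    if (__builtin_cpu_supports("sse4.1")) return mul32_sse41;
#endif
    return mul32_scalar;
}
```

A caller would typically invoke select_mul32() once at startup, store the returned function pointer, and call through it thereafter, so the detection cost is not paid on every call.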

Archaea Software, LLC can help evaluate your application's suitability for acceleration using SIMD instructions. Fill out our questionnaire to get started.