SSE4 (Streaming SIMD Extensions 4) is a SIMD CPU instruction set used in the Intel Core microarchitecture and AMD K10 (K8L). It was announced on September 27, 2006, at the Fall 2006 Intel Developer Forum, with vague details in a white paper; more precise details of 47 instructions became available in a presentation at the Spring 2007 Intel Developer Forum in Beijing. SSE4 extended the SSE3 instruction set, which was released in early 2004. All software using previous Intel SIMD instructions (e.g., SSE3) is compatible with modern microprocessors supporting SSE4 instructions. All existing software continues to run correctly without modification on microprocessors that incorporate SSE4, as well as in the presence of existing and new applications that incorporate SSE4.
Like other previous-generation CPU SIMD instruction sets, SSE4 supports up to 16 registers, each 128 bits wide, which can hold four 32-bit integers, four 32-bit single-precision floating-point numbers, or two 64-bit double-precision floating-point numbers. SIMD operations, such as vector element-wise addition/multiplication and vector-scalar addition/multiplication, process multiple bytes of data in
A new divider with reduced latency, a new shuffle engine, and SSE4.1 instructions (some of which are enabled by the new single-cycle shuffle engine). Maximum L2 cache size per chip was increased from 4 to 6 MB, with L2 associativity increased from 16-way to 24-way. Cut-down versions with 3 MB L2 also exist, which are commonly called Penryn-3M and Wolfdale-3M as well as Yorkfield-6M, respectively. The single-core version of Penryn, listed as Penryn-L here,
A single CPU instruction. This parallelism yields noticeable performance gains. SSE4.2 introduced new SIMD string operations, including an instruction to compare two string fragments of up to 16 bytes each. SSE4.2 is a subset of SSE4 that was released a few years after the initial release of SSE4. Intel SSE4 consists of 54 instructions. A subset consisting of 47 instructions, referred to as SSE4.1 in some Intel documentation,
Is available in Penryn. Additionally, SSE4.2, a second subset consisting of the seven remaining instructions, is first available in Nehalem-based Core i7. Intel credits feedback from developers as playing an important role in the development of the instruction set. Starting with Barcelona-based processors, AMD introduced the SSE4a instruction set, which has four SSE4 instructions and four new SSE instructions. These instructions are not found in Intel's processors supporting SSE4.1, and AMD processors only started supporting Intel's SSE4.1 and SSE4.2 (the full SSE4 instruction set) in
Is not a separate model like Merom-L but a version of the Penryn-3M model with only one active core. The processors of the Core microarchitecture can be categorized by number of cores, cache size, and socket; each combination of these has a unique code name and product code that is used across a number of brands. For instance, code name "Allendale" with product code 80557 has two cores, 2 MB L2 cache and uses
Is now known as SSSE3 (Supplemental Streaming SIMD Extensions 3), introduced in the Intel Core 2 processor line, was referred to as SSE4 by some media until Intel came up with the SSSE3 moniker. Internally dubbed Merom New Instructions, Intel originally did not plan to assign a special name to them, which was criticized by some journalists. Intel eventually cleared up the confusion and reserved
Is shared among all cores. Nehalem is an architecture that differs radically from NetBurst, while retaining some of the latter's minor features. Nehalem later received a die-shrink to 32 nm with Westmere, and was fully succeeded by "second-generation" Sandy Bridge in January 2011. It has been reported that Nehalem has a focus on performance, thus the increased core size. Compared to Penryn, Nehalem has: Overclocking
The BSF (bit scan forward) or TZCNT instructions. Windows 11 24H2 requires the CPU to support POPCNT; otherwise, the Windows kernel is unbootable. The SSE4a instruction group was introduced in AMD's Barcelona microarchitecture. These instructions are not available in Intel processors. Support is indicated via the CPUID.80000001H:ECX.SSE4A[Bit 6] flag. X86-64 v2 CPUs: Penryn (microarchitecture) In Intel's Tick-Tock cycle,
The Bulldozer-based FX processors. With SSE4a, the misaligned SSE feature was also introduced, which meant unaligned load instructions were as fast as aligned versions on aligned addresses. It also allowed disabling the alignment check on non-load SSE operations accessing memory. Intel later introduced similar speed improvements to unaligned SSE in their Nehalem processors, but did not introduce misaligned access by non-load SSE instructions until AVX. What
The Nehalem-based Intel Core i7 product line, and complete the SSE4 instruction set. AMD, on the other hand, first added support starting with the Bulldozer microarchitecture. Support is indicated via the CPUID.01H:ECX.SSE42[Bit 20] flag. Windows 11 24H2 requires the CPU to support SSE4.2; otherwise, the Windows kernel is unbootable. (Various unofficial Windows 11 variants, such as Tiny11 and Parallels virtualization installations, bypass this requirement.) These instructions operate on integer rather than SSE registers, because they are not SIMD instructions, but appear at
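The CPUID flags quoted in this article can be tested with simple bit masks. The following is a minimal sketch, not a definitive detection routine: the type `sse4_features` and the function `decode_sse4_features` are hypothetical names chosen for illustration, and the POPCNT position (CPUID.01H:ECX bit 23) is taken from the vendor manuals rather than from the text above. The function only decodes register values passed in by the caller; it does not execute CPUID itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type; the field names are illustrative only. */
typedef struct {
    bool sse41;   /* CPUID.01H:ECX bit 19 */
    bool sse42;   /* CPUID.01H:ECX bit 20 */
    bool popcnt;  /* CPUID.01H:ECX bit 23 (assumption: from vendor manuals) */
    bool sse4a;   /* CPUID.80000001H:ECX bit 6 (AMD only) */
} sse4_features;

/* Decode feature bits from the ECX values returned by CPUID leaf 01H
 * and extended leaf 80000001H. The caller supplies both values. */
static sse4_features decode_sse4_features(uint32_t leaf1_ecx,
                                          uint32_t ext_leaf1_ecx)
{
    sse4_features f;
    f.sse41  = ((leaf1_ecx >> 19) & 1u) != 0;
    f.sse42  = ((leaf1_ecx >> 20) & 1u) != 0;
    f.popcnt = ((leaf1_ecx >> 23) & 1u) != 0;
    f.sse4a  = ((ext_leaf1_ecx >> 6) & 1u) != 0;
    return f;
}
```

With GCC or Clang on x86, the two ECX values could be obtained through `__get_cpuid` from `<cpuid.h>`; keeping the decoding separate from the query keeps the sketch portable.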
The 2007/2008 "Tick" was the shrink of the Core microarchitecture to 45 nanometers as CPUID model 23. In Core 2 processors, it is used with the code names Penryn (Socket P), Wolfdale (LGA 775) and Yorkfield (MCM, LGA 775), some of which are also sold as Celeron, Pentium and Xeon processors. In the Xeon brand, the Wolfdale-DP and Harpertown code names are used for LGA 771 based MCMs with two or four active Wolfdale cores. Architectural improvements over 65-nanometer Core 2 CPUs include
The 65 nm processors, the same product code can be shared by processors with different dies, but the specific information about which one is used can be derived from the stepping. In the model 23 (cpuid 01067xh), Intel started marketing steppings with full (6 MB) and reduced (3 MB) L2 cache at the same time, giving them identical cpuid values. All steppings have the new SSE4.1 instructions. Stepping C1/M1
The SSE4 name for their next instruction set extension. Intel is using the marketing term HD Boost to refer to SSE4. Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field and a set of instructions that take XMM0 as an implicit third operand. Several of these instructions are enabled by
The bits masked by SRC are set. SSE4.2 added STTNI (String and Text New Instructions), several new instructions that perform character searches and comparison on two operands of 16 bytes at a time. These were designed (among other things) to speed up the parsing of XML documents. It also added a CRC32 instruction to compute cyclic redundancy checks as used in certain data transfer protocols. These instructions were first implemented in
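As an illustration of the checksum that the SSE4.2 CRC32 instruction accelerates, the sketch below computes the same CRC in portable software, one bit at a time. It assumes the Castagnoli polynomial (CRC-32C, reflected constant 0x82F63B78), which is the polynomial documented for this instruction; the function names `crc32c_update` and `crc32c` are hypothetical. The initial and final inversion with 0xFFFFFFFF is the usual protocol convention applied around the raw accumulation step, not something the instruction performs by itself.

```c
#include <stddef.h>
#include <stdint.h>

/* Raw accumulation step: fold `len` bytes into a running CRC value.
 * This models what repeated byte-wise CRC32 instructions compute,
 * using the reflected CRC-32C polynomial 0x82F63B78. */
static uint32_t crc32c_update(uint32_t crc, const unsigned char *data,
                              size_t len)
{
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int k = 0; k < 8; k++) {
            /* If the low bit is set, shift and XOR in the polynomial. */
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
        }
    }
    return crc;
}

/* Conventional CRC-32C: start from all ones and invert the result. */
static uint32_t crc32c(const unsigned char *data, size_t len)
{
    return crc32c_update(0xFFFFFFFFu, data, len) ^ 0xFFFFFFFFu;
}
```

For the standard check string "123456789" this yields 0xE3069283. A hardware version would replace the inner loop with the CRC32 instruction (for example through a compiler intrinsic) after confirming SSE4.2 support.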
The desktop socket 775, but has been marketed as Celeron, Pentium, Core 2 and Xeon, each with different sets of features enabled. Most of the mobile and desktop processors come in two variants that differ in the size of the L2 cache, but the specific amount of L2 cache in a product can also be reduced by disabling parts at production time. Wolfdale-DP and all quad-core processors except Dunnington QC are multi-chip modules combining two dies. For
The older Core microarchitecture used on Core 2 processors. The term "Nehalem" comes from the Nehalem River. Nehalem is built on the 45 nm process, is able to run at higher clock speeds without sacrificing efficiency, and is more energy-efficient than Penryn microprocessors. Hyper-threading is reintroduced, along with a reduction in L2 cache size, as well as an enlarged L3 cache that
The same encoding path as the encoding of the BSR (bit scan reverse) instruction. As a result, LZCNT executed on some CPUs that do not support it, such as Intel CPUs prior to Haswell, may incorrectly execute the BSR operation instead of raising an invalid instruction exception. This is a problem because the result values of LZCNT and BSR are different. Trailing zeros can be counted using
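To make the difference between the LZCNT and BSR results concrete, the sketch below models both operations on 32-bit values in portable C. The function names `lzcnt32_model` and `bsr32_model` are hypothetical, and the code is a software model rather than a wrapper around the real instructions. For a nonzero input the two results are related by lzcnt = 31 - bsr, so silently executing one in place of the other gives wrong answers for almost every input.

```c
#include <stdint.h>

/* Model of LZCNT: number of zero bits above the highest set bit.
 * Returns 32 when x is zero. */
static unsigned lzcnt32_model(uint32_t x)
{
    unsigned n = 0;
    for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
        n++;
    return n;
}

/* Model of BSR: index (0..31) of the highest set bit.
 * The real instruction leaves its destination undefined for x == 0;
 * this model returns 0 in that case purely for convenience. */
static unsigned bsr32_model(uint32_t x)
{
    unsigned idx = 0;
    while ((x >>= 1) != 0)
        idx++;
    return idx;
}
```

For example, an input of 1 gives 31 under the LZCNT model but 0 under the BSR model.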
The same time and although introduced by AMD with the SSE4a instruction set, they are counted as separate extensions with their own dedicated CPUID bits to indicate support. Intel implements POPCNT beginning with the Nehalem microarchitecture and LZCNT beginning with the Haswell microarchitecture. AMD implements both, beginning with the Barcelona microarchitecture. AMD calls this pair of instructions Advanced Bit Manipulation (ABM). The encoding of LZCNT takes
The single-cycle shuffle engine in Penryn. (Shuffle operations reorder bytes within a register.) These instructions were introduced with the Penryn microarchitecture, the 45 nm shrink of Intel's Core microarchitecture. Support is indicated via the CPUID.01H:ECX.SSE41[Bit 19] flag. This is equivalent to setting the Z flag if none of the bits masked by SRC are set, and the C flag if all of
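The Z and C flag behaviour described here matches the SSE4.1 packed test instruction (PTEST). The sketch below models that flag logic in portable C under stated assumptions: `vec128` is a hypothetical stand-in for a 128-bit XMM register, and `ptest_model` is an illustrative name, not a real intrinsic. ZF is set when no bit selected by the SRC mask is set in the destination, and CF is set when every bit selected by the mask is set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit XMM register. */
typedef struct {
    uint64_t lo;
    uint64_t hi;
} vec128;

typedef struct {
    bool zf;  /* none of the bits masked by src are set in dst */
    bool cf;  /* all of the bits masked by src are set in dst */
} ptest_flags;

/* Software model of the flag results of a packed test. */
static ptest_flags ptest_model(vec128 dst, vec128 src)
{
    ptest_flags r;
    r.zf = ((dst.lo & src.lo) | (dst.hi & src.hi)) == 0;
    r.cf = ((~dst.lo & src.lo) | (~dst.hi & src.hi)) == 0;
    return r;
}
```

Both flags are set at once only when the mask itself is all zeros, which is why code typically tests one flag or the other depending on the question being asked.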
The usual two cores, which leads to an unusually large die size of 503 mm². As of February 2008, it has only found its way into the very high-end Xeon 7400 series (Dunnington). Nehalem (microarchitecture) Nehalem /nəˈheɪləm/ is the codename for Intel's 45 nm microarchitecture released in November 2008. It was used in the first generation of the Intel Core i5 and i7 processors, and succeeds
Was a bug fix version of C0/M0 specifically for quad core processors and only used in those. Stepping E0/R0 adds two new instructions (XSAVE/XRSTOR) and replaces all earlier steppings. In mobile processors, stepping C0/M0 is only used in the Intel Mobile 965 Express (Santa Rosa refresh) platform, whereas stepping E0/R0 supports the later Intel Mobile 4 Express (Montevina) platform. Model 29 stepping A1 (cpuid 106d1h) adds an L3 cache as well as six instead of