This year (perhaps in the third quarter), AMD will release processors with the new Zen 5 architecture. This will again be a big change, whereas the previous Zen 4 was basically an evolution of Zen 3, and according to various indirect indications, including statements by architect Mike Clark, it may be AMD’s most interesting architecture since the first Zen. Interestingly, until now the information about it came from a single source, a YouTuber. Now it has been officially confirmed directly by AMD.
All the concrete information about the nature of the Zen 5 cores (besides perhaps the fact that the desktop CPUs will use 4nm chiplets) comes from a leak by the YouTuber Moore’s Law Is Dead. He revealed a schematic of the core and also, as you may remember, gave a glimpse of the increase in performance per 1 MHz of clock speed. According to these slides, it should improve by 10-15+% (the plus is probably meant to say that 10-15% is a conservative lower bound).
While performance itself remains under wraps, AMD has now sent a patch to the GCC compiler (used mainly to build open-source software from source code) that adds optimization support for Zen 5 processors. The patch is interesting to us because it describes a number of aspects of the core and matches what the slides published by Moore’s Law Is Dead showed some time ago. This probably proves their authenticity, so we can now rely on them with a fairly decent degree of certainty.
Enhanced core: More ALUs and AGUs
The patches for GCC confirm that the Zen 5 core has been widened. From the first to the fourth generation, these architectures kept a very similar basic structure with four arithmetic logic units (ALUs), which execute most of the common instructions, while Zen 3–4 have three address generation units (AGUs; Zen 1–2 have only two), which handle memory reads and writes. This contrasts with the higher number of units in Intel cores, not to mention ARM cores, where the Cortex-X4, for example, already has eight ALUs.
It was interesting that AMD managed to extract relatively higher performance from this number of units than its competitors did, but Zen 5 finally raises it. The patch confirms that the core has six ALUs and four AGUs. This could be accompanied by a significant increase in IPC, although the utilization of the extra units may not be that high at first, and further IPC gains may be extracted gradually in subsequent generations, similar to how Zen 2, 3 and 4 gradually got more and more performance out of the four ALUs already present in Zen 1.
It is not yet clear whether the load/store pipelines (fed by the AGUs) have been widened from 256 bits to 512 bits so that a 512-bit vector for AVX-512 instructions can be read or written in one cycle.
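As a rough illustration of why the extra units matter, the following Python sketch computes an idealized lower bound on the cycles needed to issue a batch of independent operations when only the number of execution units limits throughput. The function name and the workload of 1200 operations are made up for illustration. Real cores are also limited by the front end, data dependencies and memory, so this is not a performance prediction.

```python
import math


def min_issue_cycles(num_ops: int, num_units: int) -> int:
    """Idealized lower bound: each unit starts one independent op per cycle."""
    if num_units <= 0:
        raise ValueError("num_units must be positive")
    return math.ceil(num_ops / num_units)


# Hypothetical workload: 1200 independent simple integer operations.
OPS = 1200
four_alu_cycles = min_issue_cycles(OPS, 4)  # four ALUs (Zen 1 to Zen 4)
six_alu_cycles = min_issue_cycles(OPS, 6)   # six ALUs (Zen 5, per the GCC patch)
print(four_alu_cycles, six_alu_cycles)  # 300 200
```

The bound only improves if the code actually contains enough independent work to keep six units busy, which is why the real gain may be smaller at first.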
Zen 5 has 6 ALUs and 4 AGUs
Author: GCC / AMD
By contrast, there is one area with no expansion: the number of instruction decoders remains four. However, x86 processors, including the Zen architectures, have an alternative in the so-called uOP cache, which stores already decoded instructions. Most of the time the processor should fetch instructions from the uOP cache, which can deliver significantly more instructions per cycle than the four decoders and is more energy efficient as well. The number of decoders is therefore not as important as it is for ARM processors without a uOP cache.
Native 512-bit AVX-512
The FPU (which also processes SIMD instructions) does not seem to have gained an additional pipeline compared with Zen 3 and 4; it will probably again have four pipelines for different operations. But the GCC patch confirms that Zen 5 has physical 512-bit SIMD units for the first time. It can thus process most AVX-512 instructions in a single pass, whereas Zen 4 has 256-bit units like the Zen 2 and Zen 3 cores (which could only do 256-bit AVX2).
The previous core therefore executed 512-bit AVX-512 instructions in two passes, each of which computed half of the vector width. This SIMD expansion in Zen 5 seems to be matched by the addition of a second port for floating-point store operations. However, the FP store units are apparently still 256-bit, so 512-bit stores are performed by combining both units.
Zen 5 Core FPU
Author: GCC / AMD
Simply widening the SIMD units to 512 bits doubles the theoretical compute throughput given in FLOPS (the same applies to operations on integer data types). With this, Zen 5 should catch up with the raw performance of Intel cores in all parameters, so the use of AVX-512, for example in servers, should now be an advantage for AMD, whereas until now it gave Intel a chance to catch up with the higher overall performance of Epyc processors (Intel could still have an advantage in the AMX matrix instructions).
But the patches for GCC show that there have been other improvements. In the SIMD units, three pipelines instead of two can now handle shuffle (permutation) operations, so three of these operations can be performed per cycle instead of two. Some operations also seem to have been redistributed between the FPU ports. The latency of floating-point addition has been reduced from three cycles to two, which should directly improve performance, as chains of dependent calculations can be processed more quickly.
The integer part also has an improvement of this kind. It looks like the two newly added ALUs can do more than just the simplest operations, or AMD has beefed up the existing ALUs. While the previous cores could process CMOV and SETCC on only two of their four ALUs, Zen 5, according to the GCC patch, can process these instructions on four of its six ALUs, i.e. up to four per cycle.
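The doubling can be checked with simple arithmetic. The sketch below is a minimal model, assuming a core with two FMA-capable pipelines and counting a fused multiply-add as two floating-point operations. The pipeline count and the function name are illustrative assumptions and are not taken from the patch; only the ratio between the two widths matters here.

```python
def peak_flops_per_cycle(unit_bits: int, element_bits: int = 32,
                         fma_pipes: int = 2) -> int:
    """Peak FLOPs per cycle per core; an FMA counts as two operations.

    fma_pipes=2 is an assumption made for illustration only.
    """
    lanes = unit_bits // element_bits  # elements processed per instruction
    return fma_pipes * lanes * 2


narrow = peak_flops_per_cycle(256)  # 256-bit units, as in Zen 4
wide = peak_flops_per_cycle(512)    # 512-bit units, as in Zen 5
print(narrow, wide, wide / narrow)  # 32 64 2.0
```

Multiplying the per-cycle figure by the clock frequency and the number of cores gives the theoretical FLOPS of a whole processor.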
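The effect of the shorter addition latency is easiest to see on a serial dependency chain, such as summing an array into a single accumulator, where each addition must wait for the previous result. The following sketch is a simplified model with a made-up chain length. It ignores loads, loop overhead and any compiler reordering.

```python
def chain_cycles(num_dependent_ops: int, latency: int) -> int:
    """Cycles for a chain in which every op waits for the previous result."""
    return num_dependent_ops * latency


N = 1000  # hypothetical length of a serial floating-point sum
old = chain_cycles(N, 3)  # three-cycle FP add
new = chain_cycles(N, 2)  # two-cycle FP add, per the GCC patch
print(old, new, old / new)  # 3000 2000 1.5
```

Code that already splits the sum across several independent accumulators hides the latency and would benefit less.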
Zen 5 still has 4 instruction decoders
Author: GCC / AMD
Furthermore, the patch indicates that division and square root calculations should be faster: for most data types, the latency of these instructions has been reduced by one or more cycles.
Zen 5 will support some AVX-512 instructions that Intel dropped (or doomed?)
The patch also shows that the Zen 5 core will additionally handle some AVX-512 instructions that Zen 4 does not yet support. AVX-512 is a relatively fragmented (and criticized) family of subsets. Zen 4 supports a good portion of them, but not the MOVDIRI, MOVDIR64B, PREFETCHI and AVXVNNI instructions. The last of these will be useful for AI, but it is a 256-bit version of the VNNI instructions that was added for Intel’s E-cores, while Zen 4 can already do VNNI in the original 512-bit version. AVXVNNI will therefore be useful mainly for compatibility with Intel’s big.LITTLE processors. Zen 5 adds these instructions.
In addition, Zen 5 supports an extension with the long-winded name AVX512VP2INTERSECT (AVX-512 Vector Pair Intersection to a Pair of Mask Registers), which has a curious history. Intel added these instructions in Tiger Lake processors (Willow Cove architecture), but then apparently changed its mind or found problems in the implementation, because subsequent architectures, including the current Sapphire Rapids server processors, no longer support AVX512VP2INTERSECT.
It is possible that AVX512VP2INTERSECT will return at Intel. Everything after Tiger Lake is still based on a single Intel architecture, Golden Cove, so it is possible that AVX512VP2INTERSECT is broken only there, or rather only in its server version. Interestingly, Alder Lake and Raptor Lake client CPUs appear to support the instruction, at least until Intel forcibly disabled AVX-512 on them. However, the recently presented plans to reorganize the 512-bit instructions under the AVX10 umbrella do not yet include this extension, so it is possible that Intel no longer plans to implement it.
Perhaps this will create a funny situation in which AMD once again supports something that Intel does not, as with the FMA4 instructions. The question is whether this will benefit Zen 5 or be more of a disadvantage. The instruction will consume transistors that Intel can save, while software may not want to use it because of the lack of support on Intel. This is a disadvantage that smaller competitors often have to contend with.
However, it is not as if Zen 5 can handle more AVX-512 instructions overall than Intel’s cores. It should have broader coverage than Intel’s desktop and notebook processors (Ice Lake, Rocket Lake, Tiger Lake), but the Sapphire Rapids server processors and the new Emerald Rapids support a few more instructions that Zen 5 cannot yet execute.
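To make the name less abstract, here is a scalar Python model of what a vector pair intersection computes, as far as the name and Intel’s public description suggest: two bit masks, one per input vector, marking the elements that also occur in the other vector. The function name and the four-element inputs are illustrative. The real instructions work on 512-bit registers and write the results to a pair of mask registers.

```python
from __future__ import annotations


def vp2intersect(a: list[int], b: list[int]) -> tuple[int, int]:
    """Return (mask_a, mask_b): bit i of mask_a is set if a[i] occurs in b,
    bit j of mask_b is set if b[j] occurs in a."""
    mask_a = 0
    mask_b = 0
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                mask_a |= 1 << i
                mask_b |= 1 << j
    return mask_a, mask_b


a = [1, 5, 7, 9]
b = [7, 2, 1, 8]
k1, k2 = vp2intersect(a, b)
print(f"{k1:04b} {k2:04b}")  # 0101 0101
```

Operations of this kind are typically useful for set intersection, for example when joining sorted lists of IDs.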
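Software that wants to use the newer instructions has to check for them at run time. The sketch below shows one hedged way to do this on Linux by parsing the text of /proc/cpuinfo. The flag names in `WANTED` are assumptions about how the Linux kernel reports these extensions, and the sample line is made up rather than taken from a real Zen 5 machine.

```python
from __future__ import annotations


def missing_features(cpuinfo_text: str, wanted: set[str]) -> set[str]:
    """Return the wanted flags that are absent from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            present = set(line.partition(":")[2].split())
            return wanted - present
    return set(wanted)  # no flags line found: treat everything as missing


# Flag names as assumed to be reported by Linux in /proc/cpuinfo.
WANTED = {"avx512_vp2intersect", "avx_vnni", "movdiri", "movdir64b"}

# Made-up sample line, not taken from a real processor.
sample = "flags\t\t: fpu sse2 avx2 avx512f avx_vnni movdiri"
print(sorted(missing_features(sample, WANTED)))
# ['avx512_vp2intersect', 'movdir64b']
```

On a real Linux system the first argument would be the contents of /proc/cpuinfo, for example `open("/proc/cpuinfo").read()`.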
Sources: GCC / AMD, AnandTech Forum (1, 2)
2024-02-13 11:02:07