ARM Cortex-X2, Cortex-A710 and Cortex-A510 CPU architectures

The high-performance Cortex-X2 core builds on the Cortex-X1 architecture (details about it can be found here) and has the same role: delivering the highest possible single-thread performance. ARM has always aimed to balance performance with good energy efficiency in its Cortex-A series cores, while keeping the core from growing too large in terms of chip area. The Cortex-X2 is designed to chase higher performance even where that means lower energy efficiency (power draw of the Cortex-X1 in mobile phones appears to have reached 3.5-5.0 W) and lower performance per unit of chip area.


ARM Cortex-X2 architecture. Source: ARM, via AnandTech

In mobile phones, therefore, a chip may sometimes carry only one of these cores, complemented by Cortex-A710 cores. However, ARM mentions that this core could also be used in “larger screen devices”, which may indicate that laptop chips with this architecture could appear from Qualcomm, possibly with several of these cores at once.

The first ARMv9 core, without 32-bit support

The Cortex-X2 does not seem to be a completely new architecture, but rather an evolutionary improvement built on the previous X1 (designed by the ARM team in Austin, Texas). Of course, adding support for all ARMv9 features required a significant overhaul across the design. ARM states that it has made many targeted adjustments in every part of the core.


ARM Cortex-X2 architecture. Source: ARM, via AnandTech

Another big change is that all support for running 32-bit applications has been removed from the core, so it now supports only 64-bit mode, which probably simplified the design and saved some transistors.

Tip: ARM introduces a new generation of CPU architecture. ARMv9 has SVE, SVE2 and security innovations

Emphasis on branch prediction

ARM states that the branch predictor, which it considers key to efficiency, should be significantly improved. The predictor works with a larger volume of information and has a larger effective Branch Target Buffer capacity. Its accuracy should be significantly better than in the older generation: the number of mispredicted branches per given number of executed instructions should be significantly lower than on the X1.

Branch prediction also runs decoupled from the Fetch stage, something previous generations already did. This approach aims to hide the “bubbles” in execution-unit utilization caused by mispredictions. The goal is for these parts of the frontend to run ahead of the rest of the pipeline so that, after a bad prediction, they have a few extra cycles to start processing the correct code path before the later pipeline stages need it.


ARM Cortex-X2 architecture. Source: ARM, via AnandTech

Interestingly, ARM reduced the depth of the entire pipeline from 11 to 10 cycles: one stage was eliminated in the dispatch phase, which previously took two cycles. Cortex cores already had a rather short pipeline by today's standards, which is also why they do not reach frequencies well above 3 GHz. However, shortening the pipeline reduces the penalty for a mispredicted branch and thus directly improves IPC.

According to ARM, the performance gain was large enough to justify the relatively large impact on chip area as well as the far-reaching core redesign that this required. Despite the shorter pipeline, the core's frequency potential reportedly does not suffer, so it should run at roughly the same clock speeds as the X1, or at least that is what ARM promises.
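
The effect of a shorter pipeline on IPC can be illustrated with a very rough model: cycles per instruction equal a base value plus the misprediction rate times the penalty per misprediction. The sketch below assumes that the penalty simply equals the pipeline depth (11 versus 10 cycles), which is a simplification. The base CPI and the misprediction rate are made-up illustrative values, not ARM figures.

```python
def cycles_per_instruction(base_cpi, mpki, penalty_cycles):
    """CPI = base CPI + (mispredictions per instruction) * (penalty in cycles)."""
    return base_cpi + (mpki / 1000.0) * penalty_cycles


# Hypothetical workload parameters (assumptions, not published by ARM).
BASE_CPI = 0.25  # CPI the core would reach with perfect branch prediction
MPKI = 5.0       # mispredicted branches per 1000 instructions

# Assumption: the misprediction penalty equals the pipeline depth.
cpi_11 = cycles_per_instruction(BASE_CPI, MPKI, penalty_cycles=11)
cpi_10 = cycles_per_instruction(BASE_CPI, MPKI, penalty_cycles=10)

# IPC is the inverse of CPI, so the IPC ratio is cpi_11 / cpi_10.
ipc_gain = cpi_11 / cpi_10 - 1.0
print(f"IPC gain from the shorter pipeline: {ipc_gain:.1%}")
```

With these assumed inputs the gain stays below 2%, which suggests that the shorter pipeline is only one of several contributors to the overall IPC improvement; a workload with more mispredictions would benefit more.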


ARM Cortex-X2 architecture. Source: ARM, via AnandTech

Deeper queues

ARM does not seem to have added execution units or widened the core to handle more operations per cycle (the number of ALUs thus remains at four, for those who track this parameter). The out-of-order queues, on the other hand, have been enlarged: the reorder buffer has grown from 224 to 288 entries and is therefore larger than in Zen 3 (which has a very shallow 256 entries for its performance, while Intel's Ice Lake / Tiger Lake have 352). The effective depth is even better, because the core is also said to have an improved ability to fuse operations together, so that they take up only one entry instead of two.
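
To put the reorder buffer sizes quoted above side by side, the sketch below expresses each one relative to the Cortex-X1. The effective_window helper is a hypothetical illustration of the fusion effect: it assumes that a given fraction of instructions is fused into pairs sharing one entry. ARM has not published such a fraction, so the 10% used here is purely an assumption.

```python
# Reorder buffer sizes as quoted in the text above.
rob_entries = {
    "Cortex-X1": 224,
    "Zen 3": 256,
    "Cortex-X2": 288,
    "Ice Lake / Tiger Lake": 352,
}

baseline = rob_entries["Cortex-X1"]
relative = {name: size / baseline for name, size in rob_entries.items()}

for name, size in sorted(rob_entries.items(), key=lambda kv: kv[1]):
    print(f"{name:<22} {size:>4} entries  {relative[name]:.2f}x of X1")


def effective_window(entries, fused_fraction):
    """Instructions in flight if fused_fraction of them share entries in pairs.

    A fused instruction costs half an entry, an unfused one a whole entry,
    so the average cost per instruction is 1 - fused_fraction / 2.
    """
    return entries / (1.0 - fused_fraction / 2.0)


# Hypothetical fusion rate of 10%; not a figure from ARM.
x2_effective = effective_window(rob_entries["Cortex-X2"], fused_fraction=0.10)
print(f"Cortex-X2 effective window at 10% fusion: {x2_effective:.0f} instructions")
```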

The various queues of the Load/Store part have also been deepened, i.e. the buffers that hold read and write operations to memory (cache). They should be up to 33% larger, so the core can keep more of these operations in flight at once and thus optimize them better. The L1 data TLB has been increased from 40 to 48 entries. Enlarging these queues generally increases IPC. The Macro-Op cache is unchanged and holds 3,000 entries.


ARM Cortex-X2 architecture. Source: ARM, via AnandTech

In addition, the prefetchers should be improved as well, including an improved temporal prefetcher, which the X1 and A78 cores already had and which was more or less unique to them. Like branch predictors, prefetching is an area of continuous incremental improvement and at the same time a highly important factor in raising IPC. Like branch prediction, prefetching can decide whether the core keeps its units busy or sits idle waiting for data and thus performs poorly.

SVE and SVE2 at last

Along with ARMv9, the new SIMD instruction architecture in the FPU/SIMD unit is finally supported as well: SVE and SVE2, with flexible vector width. But it turns out that this flexibility is somewhat more limited than one might expect. At first glance, flexible width might seem to elegantly solve the problem of big.LITTLE processors, where small cores cannot have wide vectors but large ones can (at least that is how I personally imagined it).

However, even if that is possible, ARM has not gone that far yet. Precisely for compatibility with the small cores, the Cortex-X2's units, and the SVE vectors they process, are only 128 bits wide, i.e. the same as the Neon units in the Cortex-X1 (this is comparable to the SSE* instructions; AVX* is already 256-bit). There are four FPU/SIMD pipelines, so raw compute throughput has not changed.

Where performance does improve is in AI applications running on these SIMD units. The core will support 16-bit bfloat16 and 8-bit INT8, allowing up to twice the performance in AI applications. These must, of course, be recompiled and/or rewritten to use the appropriate new instructions.
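
Why narrower data types help can be seen from simple lane counting. The sketch below uses the 128-bit vector width and the four FPU/SIMD pipelines mentioned above, and assumes (as a simplification not confirmed by ARM) that every pipeline can issue one full-width operation per cycle for each data type. It counts elements only and does not model the actual instructions.

```python
VECTOR_BITS = 128  # SVE/Neon vector width on the Cortex-X2, per the text
PIPELINES = 4      # FPU/SIMD pipelines, per the text

ELEMENT_BITS = {"fp32": 32, "bfloat16": 16, "int8": 8}


def elements_per_cycle(element_bits, vector_bits=VECTOR_BITS, pipelines=PIPELINES):
    """Peak elements processed per cycle under the one-op-per-pipeline assumption."""
    return (vector_bits // element_bits) * pipelines


lanes = {name: elements_per_cycle(bits) for name, bits in ELEMENT_BITS.items()}

for name, count in lanes.items():
    ratio = count / lanes["fp32"]
    print(f"{name:<9} {count:>3} elements/cycle  ({ratio:.0f}x fp32)")
```

Under these assumptions bfloat16 packs twice as many elements into each vector as FP32, which is consistent with an "up to 2x" claim; the real speedup depends on which instructions the hardware and the recompiled software actually use.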

The resulting performance gain per MHz

So how did all these changes affect the core's performance? It seems that the Cortex-X2 will not be a big leap in which IPC improves dramatically. The next generation may deliver that, but this first ARMv9 series is more conservative. ARM states that the Cortex-X2 should have up to 16% higher performance at the same frequency, which means 16% better IPC.
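
The link between the 16% figure and IPC follows from the usual first-order relation that performance is IPC times clock frequency. The sketch below applies it; the 0.90 clock ratio is a hypothetical value chosen only to show what happens if the newer core cannot sustain the same frequency.

```python
def relative_performance(ipc_ratio, freq_ratio):
    """First-order model: performance scales with IPC times clock frequency."""
    return ipc_ratio * freq_ratio


IPC_RATIO = 1.16  # ARM's "up to 16%" claim at equal frequency, per the text

same_clock = relative_performance(IPC_RATIO, freq_ratio=1.0)

# Hypothetical: the X2 sustains only 90% of the X1's clock (assumed value).
throttled = relative_performance(IPC_RATIO, freq_ratio=0.90)

# Clock ratio at which the IPC advantage is exactly cancelled out.
break_even_freq_ratio = 1.0 / IPC_RATIO

print(f"Same clock:       {same_clock:.3f}x")
print(f"At 90% clock:     {throttled:.3f}x")
print(f"Break-even clock: {break_even_freq_ratio:.3f}x of the X1 frequency")
```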


ARM Cortex-X2 architecture. Source: ARM, via AnandTech

However, this is a slightly embellished comparison. It is a projection for cores with the same 64 KB L1 and 1 MB L2 caches, but ARM paired the Cortex-X2 with an 8 MB L3 cache and the Cortex-X1 with only 4 MB. This alone can account for a few percent of improvement, so the real increase in IPC is probably lower and marketing has massaged the figure a little (admittedly in the open). In addition, ARM promises up to 2x higher performance in AI, which comes from the bfloat16 / INT8 support.

ARM also showed this graph, according to which the Cortex-X2 achieves its higher performance partly at the cost of somewhat higher power consumption, assuming both cores are built on the same process.


ARM Cortex-X2 architecture. Source: ARM, via AnandTech

In mobile devices, therefore, there may be a problem: Cortex-X2 cores still built on a 5nm process may not be able to deliver the claimed performance, because power consumption (already problematic for the Cortex-X1 in the 5nm Qualcomm Snapdragon 888 and Samsung Exynos 2100) will prevent the maximum frequency from being sustained for long. In notebooks, however, higher power consumption (we are probably still talking about only 6-8 W, for example) in exchange for better performance would probably go largely unnoticed and pay off.
