Qualcomm is finally talking about the tech of their CPUs with the new Snapdragon X2 Elite Extreme CPUs. SemiAccurate likes the tech and silicon a lot but the final product isn’t really functional.
Author's Note: This is a massive story, longer than anything I have done in years, so it will be broken up into pieces and posted in the coming days. If it seems a bit negative at first, that is only the beginning, and the end; the vast middle is the tech, and we like that a lot. Part 2 can be found here.
Overview and History:
Qualcomm has been making PC CPUs for over a decade with, let's be charitable, mixed results. The reference SemiAccurate found after almost several minutes of searching is here, from 2012. Ignore those early attempts at making an ARM based Windows laptop, aka WART or Windows on ARM RT. What we are trying to point out is that this is not a new program, it is 10+ years in, and that is important to consider when evaluating the end product, not the silicon. We will do both.
Last summer the belated Snapdragon X Elite CPU came out and the silicon was really good; the Nuvia acquisition delivered on its promise. SemiAccurate liked the silicon but found the platform as a whole unacceptable. Why? We received a sample the day after release, a Microsoft Surface Laptop. It flat out didn't work. After dozens of hours of trying to get the device to a state where it did something stably, we gave up, having gotten no further than a boot loop.
When Qualcomm asked us what we wanted to test, we told them we were going to strip Windows off and install Linux, then see what the experience was like. About six weeks ago we finally got the device to the point where it would load a desktop and open a browser. If you moved the mouse wheel, the whole system crashed, and the trackpad and many other things were MIA. We have since gotten our hands on an Acer Swift 14 and had a much better experience, as long as things like sound and a few other bits are not on your functionality checklist, but, well, this is only a few days in.
Before you click away because of the Linux bit or us not running Windows, this is relevant to you even if you run Windows. Why? Because Qualcomm's software enablement is something between non-existent and antagonistic to developers regardless of OS. After a decade plus of work, these WARTBooks are not close to functionally compatible with the x86 Windows software base. If you have a problem with the devices, good luck to you.
About a year ago, Dell and Microsoft went on a sales call to a close relative who works at an engineering firm. The presentation was pushing WARTBooks hard and, to be blunt, the duo outright lied to the customer. How do we know? He was texting me the whole time, fact-checking the presentation's claims. They claimed compatibility with x86 software was 82%, likely higher now. I asked him to get a copy of the presentation so he could look at the fine print on what that 82% encompassed. They declined. They also declined to answer the question directly. And they lied about what Intel was releasing and when. This isn't Qualcomm's fault, it just illustrates the woeful state the devices were in about six months after launch.
The real kicker was when my relative asked about management. Why? Because I asked him to. Qualcomm X Elite systems have no hardware management, no vPro equivalent. The duo claimed this wasn’t necessary and software could do everything, no one needed hardware management anymore. I don’t think my relative would find the humor in flying out to a bridge building project in rural Oman to reimage a dead laptop. On the bright side, this has been addressed in the Snapdragon X2 Elite Extreme CPUs, more on that much later.
In the end, the current version of the Snapdragon X Elite CPU is solid silicon but the devices it goes into are simply unacceptable. They are not laptops, they are big phones, and they just don't work right. There is no software support, vendors lie for sales, and Qualcomm shows no sign of addressing the situation. Will things change with the new Snapdragon X2 Elite Extreme CPU? Based on conversations with dozens of Qualcomm people last week, we highly doubt it, but we hope to be pleasantly surprised. In the meantime, avoid the X Elite and wait for independent testing before you add any X2 devices to your consideration list.
Before we go on to the details of the new platform, one more thing to think about. Nvidia has now launched their N1X/GB10 devices, the Spark platform. It basically doesn't work, the release was a PR stunt, and it is priced into stupidity. That said, it is a better product than the Snapdragon X and X2 Elites. Why? Because the hardware may be broken but the software support is there. By the time the consumer devices ship, currently set for announcement after CES and availability in June-ish, things will be better. Qualcomm has lost the entire dev community to Nvidia and squandered a decade's lead. No sane developer will engage with Qualcomm over Nvidia at this point; their antagonism over the past 18 months has ensured that. Self-inflicted wounds like this are why we can't recommend any Qualcomm laptop products until we see direct evidence of a change, and a lasting one at that.
The Platform:
On that happy note, on to the CPU itself, and it is a good thing. The basic platform is called the Snapdragon X2 Elite Extreme with a lesser non-Extreme variant in the wings. Jokes abounded about waiting for the Snapdragon X2 Elite Extreme Pro Plus Platinum before you buy, at least Qualcomm is not short on suffixes, and can prove it.

A long long list of the details to save a lot of typing
The Extreme is an 18 core CPU with a four slice Adreno GPU while the non-Extreme SKUs have either 12 or 18 cores. Those cores are divided into clusters of six that share caches and other structures. The cores come in two categories, Prime and Performance, but are otherwise similar. Prime are the high end cores, Performance are a more efficient core but not a ‘small’ core in the phone sense of the term. Think high end and middle, not high and low. This is a good choice; true small cores make almost no sense in a laptop form factor with a laptop battery. Phones, sure; a big device, no.
Prime cores clock to 5.0GHz in the Extreme, one core at that clock per cluster, 4.7GHz in the 18C non-Extreme, and 4.7/4.4GHz in the 12C SKU. There are a lot of tricks that Qualcomm did with the clocking, Cluster-Level Multi-Level Boost (CLMLB – Rhymes with ‘orange’ while sliding off the tongue) is the main one, but again more on that in a bit.

A Sadly Deceased X2 Elite Extreme
Memory is the main difference between the Extreme and non-Extreme packages, but all use LPDDR5x. The Extreme devices have memory on package, three stacks totaling 48GB on a 192b bus running at a max of 9523MT/s. We were hoping to see the magic 9527MT/s figure but Qualcomm let us down on this performance front. Yes that was a joke, the memory is pretty damn fast. All other devices don't have memory on package and only use a 128b bus, but they theoretically run at the same transfer rate. If you know anything about modern CPU power, the off-package memory SKUs will chew through a lot more energy and deliver less bandwidth while doing it. For this reason, if you go for any Qualcomm CPU, the Extreme is the only sane choice. Or buy an Intel Lunar Lake like the one this is being written on; as a bonus, it also works right every time.
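For a sense of scale, here is the back-of-envelope bus math from those disclosed numbers. This is a theoretical peak with no efficiency factors applied, so sustained numbers will land lower.

```python
# Theoretical peak bandwidth from the disclosed bus width and transfer rate.
# Real sustained bandwidth will be lower; this is just the bus math.
def peak_gb_s(bus_bits: int, mt_per_s: int) -> float:
    return bus_bits / 8 * mt_per_s / 1000  # bytes per transfer x MT/s -> GB/s

print(peak_gb_s(192, 9523))  # Extreme, on-package memory: ~228.6 GB/s
print(peak_gb_s(128, 9523))  # non-Extreme, off-package: ~152.4 GB/s
```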
The new devices also sport a new Adreno GPU, a new Hexagon NPU capable of up to 80 TOPS, a new Spectra ISP, and an Adreno DPU and VPU which support 8K in hardware, plus bigger caches. The last level cache is a 9MB shared system cache; the CPUs and GPUs each have their own caches which will be covered later.
More important is Snapdragon Guardian, the hardware management system we referenced earlier. Because of this, the system has explicit ties to either an X75 5G modem or a 4G cellular IoT modem, which is a double-edged sword. Again, more on this when we go into details later, but it has issues and advantages.
Toss in connectivity via Wi-Fi 7 with 6GHz band support, multi-link support, and dual Bluetooth connectivity. There are also 3x 40Gbps USB4 ports, 12x PCIe5 lanes, 4x PCIe4 lanes, and SD card support. The PCIe lanes can support 2x NVMe drives while leaving four lanes for an external GPU, which Qualcomm says is possible. Don't hold your breath to see a device with one but technically it is supported. Overall it is a pretty solid set of I/O capabilities, nothing to complain about here.
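If you want to sanity check that external GPU claim against the lane count, the arithmetic works out, assuming the usual x4 link per NVMe drive, something Qualcomm did not spell out.

```python
# Carving up the 12 PCIe5 lanes for the two-NVMe-plus-eGPU case Qualcomm
# described. The x4-per-drive split is our assumption, not a disclosed spec.
PCIE5_LANES = 12
NVME_DRIVES = 2
LANES_PER_NVME = 4  # assumed, typical for NVMe SSDs

egpu_lanes = PCIE5_LANES - NVME_DRIVES * LANES_PER_NVME
print(egpu_lanes)  # 4 lanes left over for the external GPU
```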

Block Diagram of the X2 Elite SoC
All of this is built on a TSMC 3nm process, exact variant unspecified, which means it should be pretty efficient. The die size wasn't officially disclosed but they did give us a sample as a keepsake. As you can see above, there is a bit of epoxy on the edges but the die is about 220mm^2. Not bad; Intel's Lunar Lake is a dual die of 146mm^2 and 46mm^2 on TSMC N3B and N6 respectively, and we don't count the 22nm interposer. Overall the new Snapdragon X2 Elite CPU is about what you would expect in most ways, but the details are where it gets interesting.
The CPU Cores and Cluster:

The Prime Cluster Layout
The first thing to talk about on the CPU side is the clustering. As we said above, there are three clusters on this die, 2x Prime and 1x Performance, six cores each. The Prime cores have 16MB of shared L2 cache per cluster, the Performance cluster gets 12MB, and all of it is fully coherent. Both cluster types also have a Matrix Engine (ME) coupled to them. More on this later, but just realize there is one per cluster even if software sees one as part of each core.
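To make the layout concrete, here is the Extreme's topology as described, sketched as a data structure. The structure and names are ours, not Qualcomm's.

```python
# The 18-core X2 Elite Extreme as described above: two Prime clusters and one
# Performance cluster, six cores each, one Matrix Engine per cluster. This is
# an illustrative sketch, not anything Qualcomm ships.
from dataclasses import dataclass

@dataclass
class Cluster:
    kind: str            # "Prime" or "Performance"
    cores: int           # cores sharing the cluster's L2
    l2_mb: int           # shared, coherent L2 per cluster
    matrix_engines: int  # physically one per cluster, exposed per core to software

x2_elite_extreme = [
    Cluster("Prime", cores=6, l2_mb=16, matrix_engines=1),
    Cluster("Prime", cores=6, l2_mb=16, matrix_engines=1),
    Cluster("Performance", cores=6, l2_mb=12, matrix_engines=1),
]

print(sum(c.cores for c in x2_elite_extreme))  # 18 cores
print(sum(c.l2_mb for c in x2_elite_extreme))  # 44MB of L2 across the die
```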

Prime Cluster Fetch and Decode
For the front end, things get rather wide rather fast. Fetch is 16 instructions per cycle, helped along by a 192KB L1I$, 6-way and fully coherent if you care about those details. No L1D$ size was given here; that comes up later with the load/store unit. The L1 iTLB is 256-entry, 8-way, and supports page sizes from 4KB to 1MB. Pretty standard stuff to feed the wide bits that follow.
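For what those 256 iTLB entries buy you, here is the rough reach math, ignoring associativity conflicts and assuming uniform page sizes, which real workloads never have.

```python
# Rough iTLB reach for a 256-entry structure at a few page sizes in the
# supported 4KB-1MB range. Real reach depends on what the OS actually maps.
ENTRIES = 256
for name, size_kb in (("4KB", 4), ("64KB", 64), ("1MB", 1024)):
    print(name, "pages:", ENTRIES * size_kb // 1024, "MB of code covered")
# 4KB pages: 1MB, 64KB pages: 16MB, 1MB pages: 256MB
```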
There are four branch predictors: a 1-cycle predictor for the BTB, a 2-cycle conditional branch predictor, and a 2-cycle indirect target predictor. Before you point out that 1+1+1=3, not 4, there is a fourth that was mentioned but wasn't fully explained; three are mainly used but a fourth exists. The mispredict penalty is 13 cycles and, as you can see above, that all feeds a 9-wide decoder.
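To put that penalty in front-end terms, here is the worst-case cost of a single mispredict against the 9-wide decoder; the actual damage depends on how full those slots would have been.

```python
# Worst-case front-end cost of one branch mispredict: 13 dead cycles in front
# of a 9-wide decoder. In practice not every slot would have been filled.
PENALTY_CYCLES = 13
DECODE_WIDTH = 9
print(PENALTY_CYCLES * DECODE_WIDTH)  # up to 117 decode slots thrown away
```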
As if by magic, there is also a 9-wide register rename which totally matches the decoder width, coincidence or conspiracy? Jokes aside, this isn't anything too out of the ordinary but it is wide, and there are separate physical registers for each type of register. The integer and vector units have 400+ registers for rename, each. The most interesting bit about this portion of the CPU is that it is checkpointed, so a branch mispredict won't necessarily mean a full flush with the attendant penalties; you can just rewind it a bit and carry on.
Other bits: the CPU now uses micro-ops and can dispatch up to 14 per cycle to the reservation stations. That is one each to the 6 integer, 4 vector, and 4 load/store pipes, plus none left over because one more would make 15 and there are only 14. What do you think this is, branch prediction? And while we are on the subject of micro-ops, the X2 can also fuse them when needed, and it supports memory disambiguation. In case you are wondering, I am indeed getting flashbacks of the Pentium 4 briefings and it is a bit unpleasant. Retirement is out of order with up to 9 micro-ops per cycle and a 650+ entry reorder buffer.
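Those disclosed widths also set the per-core ceilings; a quick sketch of where the bottleneck sits, at least on paper.

```python
# The disclosed stage widths: 9-wide decode, 14-wide dispatch across the
# 6 integer + 4 vector + 4 load/store reservation stations, 9-wide retire.
# Sustained micro-op throughput can't exceed the narrowest stage.
decode = 9
dispatch = 6 + 4 + 4   # 14
retire = 9
print(min(decode, dispatch, retire))  # 9 micro-ops per cycle, best case
```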

Prime Core Integer Pipeline
Looking at the integer execution pipeline, there are six pipes as previously mentioned. Each is fed by its own 20-entry reservation station, those draw on the 400+ rename registers mentioned earlier, and the results are muxed to the pipes. Rather than type out the capabilities of each, just look above and realize the latencies are one cycle for ALU ops and branches and three cycles for multiplies.
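A quick illustration of why those latencies matter, with the caveat that Qualcomm did not break down which of the six pipes handle multiplies, so the two-pipe figure below is purely our assumption.

```python
# The 3-cycle multiply latency only bites on dependent work. The assumption
# that 2 of the 6 integer pipes can multiply is ours; Qualcomm gave no
# per-pipe breakdown, so treat the second figure as illustrative.
MUL_LATENCY = 3
N = 120  # multiplies to execute

dependent_chain = N * MUL_LATENCY        # each result feeds the next: 360 cycles
independent = N / 2 + (MUL_LATENCY - 1)  # pipelined over 2 assumed pipes: ~62 cycles
print(dependent_chain, independent)
```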

Prime Core FP and Vector Pipeline
As we mentioned above there are four pipes for FP, but each is 128b wide as opposed to the 64b width of the Int pipes. The reservation stations are a bit larger at 48 entries, which is understandable due to the nature of their work and the likely higher, but unspecified, latencies. Each vector pipe can be split into 4x 32b ops, supported by predication. There are also crypto and matrix multiply instructions. See why the latencies might be a tad longer than the 1-3 cycles on the Int side?
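Here is the peak rate those four 128b pipes imply at the Extreme's top clock, counting one op per 32b lane per cycle; whether the pipes do fused multiply-adds wasn't spelled out, so double it yourself if they do.

```python
# Peak FP32 rate implied by 4x 128b vector pipes at the Extreme's 5.0GHz boost
# clock, counting one op per 32b lane per cycle. FMA would double this, but
# that detail wasn't disclosed, so we won't claim it.
pipes = 4
lanes_per_pipe = 128 // 32   # 4x FP32 lanes per pipe
clock_ghz = 5.0
print(pipes * lanes_per_pipe * clock_ghz)  # ~80 GFLOPS per core, best case
```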
That brings us to the Load Store Unit, which has its own 96KB 6-way L1 cache with 64B coherency granularity if you care for such details. It translates addresses for loads and stores and supports 4KB and 64KB granules. The four pipes can all do loads or stores in any combination, which is not unique, but most modern CPUs tend to emphasize loads. For queues there is a 192-entry load queue and a 56-entry store queue. Qualcomm mentions advanced prefetching techniques but for obvious reasons doesn't go into the secret sauce any more than what was said earlier.
On the MMU front it is pretty much what you would expect, starting with page sizes from 4KB to 1GB; anything larger is chunked into 1GB pieces. Virtualization is now 2-stage so you can pretend you are in Inception while you are doing undocumented things with VMMs. Joking aside, this can be useful if you are silly enough to run Windows for real work.
The TLB setup is a little less common, with a dedicated 256-entry 8-way TLB for each of the I$ and D$, while the L2 gets its own dedicated 256K 8-way structure. As you might expect, the L2 TLB is slower at 2 cycles, while the L1 TLBs can do the job in 1 cycle. There is also hardware table walk support for 16 requests in flight, and it can serve intermediate requests. This is necessary overhead to allow the 2-stage virtualization to work.
Moving on to the L2 cache, it is 16MB per cluster, or 32MB across the two Prime core clusters, fully coherent, MOESI, and inclusive of the L1 caches. Since it is tightly coupled with the cluster it runs at CPU clock frequency, but don't confuse this with the frequency the individual cores run at. The cache serves six cores and the Matrix Engine, more on this soon, and has a 21-cycle average latency. Each core can have more than 50 instructions pending on the cache and the cache itself can support 220+ in-flight instructions. To use the old joke, we were hoping for 223+. Sick of that one yet? The cache can be partitioned and allocated to a thread or core for performance reasons, as most modern CPUs are wont to do.
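To see why those 220+ in-flight requests matter against a 21-cycle latency, here is a minimal Little's law sketch, assuming each in-flight op moves a full 64B line, which is our simplification rather than anything Qualcomm stated.

```python
# Little's law: sustainable bandwidth = outstanding requests x bytes per
# request / latency. Treating each of the 220+ in-flight ops as a 64B line
# fill is our simplification, but it shows the latency isn't the limiter.
outstanding = 220
line_bytes = 64
latency_cycles = 21
print(outstanding * line_bytes / latency_cycles)  # ~670 bytes per cycle into the cluster
```

S|A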
Part 2 can be found here.
