AVX-512 Mask Registers, Again

  • 时间: 2020-05-27 08:36:42

Afew posts ago we looked at the AVX-512 mask regsiters. Specifically, the number of phyiscal registers underlying the eight architectural ones, and some other behaviors such as zeroing idioms. Recently, a high resolution die shot of SKX appeared, and I thought it would be cool to verify our register count by visual inspection. While trying to do that, I ran across something else interesting instead…

We’re interested in this die shot, recently released by Fritzchens Fritz on Flickr . We’ll be focusing on the highlighted area, which seems to have all the register files on the chip. If you want a full breakdown of the core, you can check guesses here on Twitter , on RWT and on Intel’s forums ( thread ).

Here’s a close-up of that section, with the assumed purpose of the various register files labeled.

We guess the register file identities based on:

  • The general purpose register files are the right relative width (64 bits), and are in the right position below the integer execution units.
  • The SIMD registers are obvious from their size and positioning underneath the vector pipes.
  • The uppper 256 bits of the 512-bit zmm registers (labelled ZMM on the closeup) can be determined from comparing the SKL (no AVX-512) and SKX (has AVX-512) dies and noting that the bottom file is not present in SKL (a large empty area is present at that spot in SKL ).

This leaves the unknown block in red. This block is in a prime spot below the vector execution units. Could it be the mask registers (kregs)? We found in thefirst post that the mask aren’t shared with either the scalar or SIMD registers, so we expect them to have their own physical register file. Maybe this is it?

Let’s compare the mystery register file to the integer register file, since they should be similar in size and appear to be implemented similarly:

Loooking at the general purpose register file on the left, each block (6 of which are numbered on the general purpose file) seems to implement 16 bits, as if you zoom in you see a repeating structure of 16 elements, and 4 blocks makes 64 bits total which is the expected width of the file. We know from published numbers that the integer register file has 180 entries, and since there are 6 rows of 4 blocks, we expect each row to implement 180 / 6 = 30 registers.

Now we turn our attention to the mystery file, with the idea that it may be the kreg file. There are a total of 30 blocks. Looking at the general purpose registers, we determined each block can hold 16 bites from 30 registers. So 30 blocks gives us: 30 blocks * 30 registers/block * 16 bits / 64 bits = 225 registers. Hmmm, we calculated last time that there are ~142 physical mask registers, so this seems way too high.

There’s another problem: we only have three columns of 16-bit blocks, for a total of 48 bits, horizontally. However, we know that a mask register must be up to 64 bits (when using a byte-wise for a full 512-bit vector register). Also, while our calculation above worked out to a whole number, the number of blocks (30) is not divisible by 4, so even if you assumed the arrangement of the blocks didn’t matter, there is no posible mapping from each register to 4 distinct blocks. Instead, we’d need something weird like 2 blocks providing 15 registers (instead of 30), but 64 bits wide (instead of 32). That seems very unlikely.

So let’s look just at the two paired columns on the left for now. If we take the SIMD registers as an example, it is not necessary that the full width of the register is present horizontally in a single row: the SIMD registers have 256 bits in a row (split into two 128 lines), but the other 256 bits in a 512-bit zmm register appears vertically below, in the register file maked ZMM in the diagram.

Since the mask registers are associated with elements of zmm registers, maybe they are split up in the ame way? That is, a 64-bit mask register uses one 2x16 (32-bit) chunk from the top half and one from the bottom half, to make up 64 bits? This is 20 total blocks, giving 150 registers by the same calculation above. This is much closer to the 142 we found from experiment.

Still… that nagging feeling. 142 is not equal to 150, and what about that third colum of blocks? I started doubting this was the mask register file after all. What could it be then? I realized there was one register file unaccounted for: the file for legacy x87 floating point and MMX. x87 floating point and MMX are the same file because MMX registers are architecturally aliased onto the x87 registers. So where is this file? I looked all aroundthe die shot. There are no good candidates.

So maybe this thing we’ve been looking at is actually the x87/MMX register file? In one way it’s a better fit: the x87 FP register file needs to be ~80 bits wide, so that would explain the extra column: if we assume each row is half of a register as before, that’s 96 bits. That’s enough to hold 80 bit extended precision values, and the 16 bits left over are probably could be used to store the FPU status word accessed by ftstw and related instructions. This status word is updated after every operation so must also be renamed for reasonable performance.

Additional evidence that this might be the x87/MMX register file comes from this KBL die shot also from Fritz:

Note that while the high 256 bits of the register file are masked out (this chip supports only AVX2, not AVX-512 so there are no zmm registers), the register file we are considering is present in its entirety.

Cool theory bro, but aren’t we back to square zero? If this is the file for the x87/MMX registers, where do the mask registers live?

There’s one possibility we haven’t discussed although some of you might be scraming it at your monitors by now: maybe the x87/MMX and the kreg register files are shared . That is, physically aliasedto the same register file, shared competitively.

The good news? We can test for this.

We’ll use the test method originally described by Henry Wong and which we used in thelast post on this topic. Here’s the quick description of the technique, directly copy and pasted from that post.

We separate two cache miss load instructions by a variable number of filler instructions which vary based on the CPU resource we are > probing. When the number of filler instructions is small enough, the two cache misses execute in parallel and their latencies are overlapped so the > total execution time is roughly as long as a single miss.

However, once the number of filler instructions reaches a critical threshold, all of the targeted resource are consumed and instruction allocation > stalls before the second miss is issued and so the cache misses can no longer run in parallel. This causes the runtime to spike to about twice the > baseline cache miss latency.

Finally, we ensure that each filler instruction consumes exactly one of the resource we are interested in, so that the location of the spike indicates the size of the underlying resource. For example, regular GP instructions usually consume one physical register from the GP PRF so are a good choice to measure the size of that resource.

The trick we use to see if two register files are shared is first to use a test the size of each register file alone, using a test that uses filler that targets only that register file, then to run a third test whose filler alternates between instructions that use each register file. If the register files are shared, we expect all tests to produce the same results, since they are all drawing from the same pool. If the register files are not shared, the third (alternating) test should result in a much higher apparent resource limit, since two different pools are being drawn from and so it will take twice as many filler instructions to hit the RF limit.

Enough talk, let’s do this. First, we look at Test 38 , which uses MMX instructionsto target the size of the x87/MMX register file:

[raw data]

We see a clear spike at 128 instructions, so it seems like the size of the speculativex87/MMX register file is 128 entries.

Next, we have Test 43 which follows the same pattern as Test 38 but using kaddd as a filler instruction so targets the mask (kreg) register file:

[raw data]

This is mostly indistinguishable from the previous chart and we conclude that the size of the speculative mask register file is also 128.

Let’s see what happens when alternate MMX and another instruction type. Test 39 alternates MMX with integer SIMD instructions, and Test 40 alternatives with general purpose scalar instructions:

[raw data]

[raw data]

Both of these show the same effect: the effective resource limitation is much higher: around 210 filler instructions. This indicates strongly that the x87/MMX register is not shared with either the SIMD or scalar register files.

Finally, we get to the end of this tale, Test 41 . This test mixes MMX and mask register instructions:

[raw data]

This one is definitely not like the others. We see that the resource limit is now 128, same as for the single-type tests. We can immediately conclude that this means that mask registers and MMX registers are allocated from the same resouce pool: they use the same physical register file .

This resolves the mystery of the missing register file: nothing is missing but rather this one register file simply serves double duty.

Normally a shared register file might be something to watch out for, performance-wise, but it is hard to imagine this having an impact in any non-artificial example. Who is going to be making heavy use of x87 or MMX (both obsolete) along with AVX-512 mask registers (the opposite end of the spectrum from “obsolete”)? It seems extremely unlikely. In any case, the register file is still quite large so hitting the limit is unlikely in any case.

So sharing these files is a cool trick to reduce power and area: the register files aren’t all that big, but they live in pretty prime real-estate close to the execution units.

What’s cool about this one though is that is the first time that I’ve looked at a chip (that this is even possible is remarkable to me) and come up with a theory about the hardware we can test and confirm with a targeted microbenchmark. Here, it actually happened that way. I was already aware of the possibility of register file sharing (Henry had tests for this right in robsize from the start) – but although I considered other sharing scenarios I never considered sharing between x87/MMX and the mask registers until I tried to identify the register files on Franz’s die shots.

Thanks

Thanks to Fritzchens Fritz who created the die shots analyzed here, and who graciously put them into the public domain.

Henry Wong who wrote the original article which introduced me to this technique and subsequently shared the code for his tool, which is now hosted on github .

Nemez who did a breakdown of the die shot, noting the register file in question as some type of integer register file.

Thanks to Daniel Lemire who provided access to the SKX hardware used in this post.