From: Andy 'Krazy' Glew on
On 5/6/2010 7:28 PM, Andy 'Krazy' Glew wrote:
> On 5/6/2010 9:32 AM, MitchAlsup wrote:
>> On May 5, 9:34 pm, Mark Brehob<bre...(a)gmail.com> wrote:
>>> Does anyone know the history of cache inclusion on Intel processors?

>> I suspect that if the sets of associativity of the interior caches
>> are larger than the sets of associativity of the outermost cache, then
>> it is unwise to try to guarantee inclusion.

> Mitch has identified one of the glass jaws of inclusive caches, wrt
> Multi/ManyCore and associativity.

Explaining with an example:

Nehalem
Core i7
L1$: 32KiB 8-way associative
L2$ / MLC: 256KiB 8-way
L3$ / LLC: 2MiB/core, 4-way associative per core

Let's assume 4 cores.
For a given 8-way set in the L2$/MLC, each of the 4 cores could be holding 8 different addresses => 4*8 = 32 different
addresses. But it could happen that all of those lines map to the same set in the L3$/LLC, which is only 4*4 = 16-way
associative.

I.e. worst case, inclusion could mean that you can only use half of the L2$, or, worse, only half of the L1$.
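
A quick back-of-the-envelope sketch (my own numbers: I'm assuming 64-byte lines and plain bit-field indexing, which
gives 512 L2 sets and 8192 sets in the 4*4 = 16-way L3; illustrative only, not Nehalem's actual index hash):

#include <stdio.h>
#include <stdint.h>

#define LINE_BITS 6        /* assumed 64-byte lines               */
#define L2_SETS   512      /* 256 KiB / 64 B / 8 ways             */
#define L3_SETS   8192     /* 4 * 2 MiB / 64 B / 16 ways          */

static unsigned l2_set(uint64_t pa) { return (pa >> LINE_BITS) % L2_SETS; }
static unsigned l3_set(uint64_t pa) { return (pa >> LINE_BITS) % L3_SETS; }

int main(void)
{
    /* 4 cores x 8 L2 ways: give each line a distinct tag above the
       L3 index (bit 19 and up), identical index bits below.       */
    for (int core = 0; core < 4; core++)
        for (int way = 0; way < 8; way++) {
            uint64_t pa = (uint64_t)(core * 8 + way + 1) << 19;
            printf("core %d way %d: L2 set %u, L3 set %u\n",
                   core, way, l2_set(pa), l3_set(pa));
        }
    /* All 32 lines land in L2 set 0 of their own core and in L3 set 0:
       32 candidates for a 16-way L3 set, so inclusion forces half of
       them out of the inner caches.                                */
    return 0;
}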

That's a glass jaw. It may be unlikely, but if it happens it is very bad.

How likely is it to happen?
From: nmm1 on
In article <4BE3B99B.20303(a)patten-glew.net>,
Andy 'Krazy' Glew <ag-news(a)patten-glew.net> wrote:
>
>>> I suspect that if the sets of associativity of the interior caches
>>> are larger than the sets of associativity of the outermost cache, then
>>> it is unwise to try to guarantee inclusion.
>
>> Mitch has identified one of the glass jaws of inclusive caches, wrt
>> Multi/ManyCore and associativity.
>
>Explaining with an example:
>
>Nehalem
>Core i7
>L1$: 32KiB 8-way associative
>L2$ / MLC: 256KiB 8-way
>L3$ / LLC: 2MiB/core, 4-way associative per core
>
>Let's assume 4 cores.
>For a given 8-way set in the L2$/MLC, each of the 4 cores could be holding 8 different addresses => 4*8 = 32 different
>addresses. But it could happen that all of those lines map to the same set in the L3$/LLC, which is only 4*4 = 16-way
>associative.
>
>I.e. worst case, inclusion could mean that you can only use half of the L2$, or, worse, only half of the L1$.
>
>That's a glass jaw. It may be unlikely, but if it happens it is very bad.
>
>How likely is it to happen?

More likely than many people think. Consider a parallel application
working on very large matrices whose dimensions are multiples of
large powers of two.

In addition to knackering caches, those can also be very bad news for
TLBs, and then one scarcely notices the cache problems ....


Regards,
Nick Maclaren.
From: Quadibloc on
On May 6, 8:28 pm, Andy 'Krazy' Glew <ag-n...(a)patten-glew.net> wrote:

> Another issue with inclusive caches is maintaining inclusion.  Typically we use an LRU or pseudo-LRU policy to determine
> what cache line should be evicted when a new cache line fill must be done.  However, LRU bits in the L3 cache will not
> be updated for lines that are constantly hitting in the L1 and L2 caches.

One would expect that an inclusive cache includes a bit that says "oh,
this cache line has been checked out", and those cache lines are
completely immune from eviction until they're returned from the L1 or
L2 cache that has a copy. (The owner of the cache line would also need
to be identified, and, yes, for inclusion to be useful, the interior
caches would probably need to be write-through, although keeping the
L3 cache's dirty bit up to date could be made to work too.)
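
Roughly like this, as a sketch (one checked-out bit per core; the field and function names are made up for
illustration, not any vendor's directory format):

#include <stdint.h>

struct l3_tag_entry {
    uint64_t tag;
    uint8_t  checked_out;  /* bit per core: that core's L1/L2 holds the
                              line, so it may not be victimized in L3  */
    uint8_t  dirty;        /* accurate only if the inner caches write
                              through, or write their dirty state
                              through, to the L3                       */
};

/* A line is a legal L3 victim only if no inner cache has it checked out. */
int l3_is_evictable(const struct l3_tag_entry *e)
{
    return e->checked_out == 0;
}

/* Core 'core' returns ("checks in") its copy when it evicts the line.    */
void l3_check_in(struct l3_tag_entry *e, int core)
{
    e->checked_out &= (uint8_t)~(1u << core);
}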

John Savard
From: Andy 'Krazy' Glew on
On 5/7/2010 5:04 AM, Quadibloc wrote:
> On May 6, 8:28 pm, Andy 'Krazy' Glew<ag-n...(a)patten-glew.net> wrote:
>
>> Another issue with inclusive caches is maintaining inclusion. Typically we use an LRU or pseudo-LRU policy to determine
>> what cache line should be evicted when a new cache line fill must be done. However, LRU bits in the L3 cache will not
>> be updated for lines that are constantly hitting in the L1 and L2 caches.
>
> One would expect that an inclusive cache includes a bit that says "oh,
> this cache line has been checked out", and those cache lines are
> completely immune from eviction until they're returned from the L1 or
> L2 cache that has a copy. (The owner of the cache line would also need
> to be identified, and, yes, for inclusion to be useful, the interior
> caches would probably need to be write-through, although keeping the
> L3 cache's dirty bit up to date could be made to work too.)
>
> John Savard


Think about it.

If you have P processors with N-way associative inner caches, then the outer cache needs to be at least P*N way
associative for such "check out" to work. In fact, that's the general problem: you need P*N way associativity to
guarantee no glass jaw.

If you want to get away with lower associativity:

a) you probably want to spread the requests across more sets in the outer cache, e.g. by a better hash function - not
just extracting a bit field, but skewing in some manner. I particularly like prime moduli (see the sketch after (b)).

b) and/or you need to be able to choose a victim that might possibly be in an inner cache, i.e. not strict check-out,
which is the whole issue that leads to backwards invalidations, etc.
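
For (a), a minimal sketch of what I mean by a prime-modulus index (8191 is just the largest prime below 2^13, chosen
for illustration; real hardware would compute the modulus more cheaply than a general divide):

#include <stdio.h>
#include <stdint.h>

#define LINE_BITS     6
#define L3_SETS_POW2  8192u   /* plain bit-field index: 2^13 sets        */
#define L3_SETS_PRIME 8191u   /* largest prime below 2^13, illustrative  */

static unsigned l3_set_bitfield(uint64_t pa)
{
    return (unsigned)((pa >> LINE_BITS) & (L3_SETS_POW2 - 1));
}

static unsigned l3_set_prime(uint64_t pa)
{
    return (unsigned)((pa >> LINE_BITS) % L3_SETS_PRIME);
}

int main(void)
{
    /* Addresses differing only above bit 19 all hit bit-field set 0,
       but spread out under the prime modulus.                         */
    for (uint64_t k = 1; k <= 4; k++) {
        uint64_t pa = k << 19;
        printf("pa=%#llx: bit-field set %u, prime set %u\n",
               (unsigned long long)pa, l3_set_bitfield(pa), l3_set_prime(pa));
    }
    return 0;
}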

--

Also, such a check-out protocol requires that you do not have silent evictions from the inner caches. I.e. if an inner
cache (L1 or L2 in the Nhm example) needs to fill a new line in a set, and the chosen victim is clean (in S or E or a
similar state), in many protocols you don't need to inform the outer cache. In the presence of such silent evictions,
the outer cache knows that a line was pulled into an inner cache, but it does not know for sure that the line is still
in the inner cache, if the line is clean. I.e. the outer cache's inclusion tracking is conservative.

You can make the inclusion tracker accurate, but then you need to "write through" information about such clean evictions.

As well as "writing through" LRU state, as already described.

More likely, "trickling through" such updates.
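
In code, roughly (the presence mask and the notification names are illustrative, not any particular protocol):

#include <stdint.h>

struct l3_dir_entry {
    uint8_t present;   /* bit per core: the line *may* be in that core's
                          inner caches                                   */
};

/* A fill into core c's inner cache is always visible to the outer cache. */
void on_inner_fill(struct l3_dir_entry *d, int c)      { d->present |= (uint8_t)(1u << c); }

/* A dirty eviction arrives as a writeback, so the bit can be cleared.    */
void on_inner_writeback(struct l3_dir_entry *d, int c) { d->present &= (uint8_t)~(1u << c); }

/* A silent clean eviction sends no message: 'present' stays set, and the
   tracking is conservative (a superset of the truth).  Making it exact
   means "writing through" an explicit clean-evict notification:          */
void on_clean_evict_notify(struct l3_dir_entry *d, int c) { d->present &= (uint8_t)~(1u << c); }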

--


In Nehalem, all levels of the cache - 32K L1, 256K L2, 2M/core L3 - are write-back, not write-through.

IBM seems sometimes still to build inner caches with write-through. I have long suspected that it is due to their
reliability concerns, and/or their strong memory ordering model.
