One point the paper emphasizes is that very weakly ordered processors, such as Alpha, can reorder dependent loads, and that this appears to be a side effect of partitioned caches.
Excerpt from the paper:
 1 struct el *insert(long key, long data)
 2 {
 3     struct el *p;
 4     p = kmalloc(sizeof(*p), GFP_ATOMIC);
 5     spin_lock(&mutex);
 6     p->next = head.next;
 7     p->key = key;
 8     p->data = data;
 9     smp_wmb();
10     head.next = p;
11     spin_unlock(&mutex);
12 }
13
14 struct el *search(long key)
15 {
16     struct el *p;
17     p = head.next;
18     while (p != &head) {
19         /* BUG ON ALPHA!!! */
20         if (p->key == key) {
21             return (p);
22         }
23         p = p->next;
24     };
25     return (NULL);
26 }
> There are two processors, cpu0 and cpu1.
> Each CPU has two cache banks, CB0 (odd addresses) and CB1 (even addresses).
> head is in CB0 and p is in CB1.
> insert() has a write barrier that ensures the invalidations for lines 6-8 go out on the bus before the invalidation for line 10.
> However, the other processor executing search() may have CB0 lightly loaded and CB1 heavily loaded.
> This means that the processor picks up the latest value of head but stale values for p's fields (because the invalidation request for p has not yet been processed by CB1).
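For reference, the usual fix for search() on Alpha is to insert a read-side dependency barrier between loading a pointer and dereferencing it. A minimal sketch, assuming the older Linux kernel primitive smp_read_barrier_depends() (newer kernels fold this into READ_ONCE()/rcu_dereference()):

struct el *search(long key)
{
    struct el *p;

    p = head.next;
    while (p != &head) {
        /*
         * On Alpha, force any pending invalidations for p's fields to
         * be processed before p is dereferenced; this is a no-op on
         * every other architecture.
         */
        smp_read_barrier_depends();
        if (p->key == key)
            return (p);
        p = p->next;
    }
    return (NULL);
}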
Question:
It looks like all architectures except Alpha honor dependent load ordering.
For example, IA64 can reorder all of the following, but not dependent loads:
> Loads reordered after loads
> Loads reordered after stores
> Stores reordered after stores
> Stores reordered after loads
> Atomic instructions reordered with loads
> Atomic instructions reordered with stores
One possible answer is that all the other architectures (e.g., IA64) do not have partitioned caches, so they never run into this problem and need no explicit hardware support.
Any insights?
Short answer: In out-of-order processors, the load-store queue is used to track and enforce memory ordering constraints. Processors such as the Alpha 21264 have the hardware necessary to prevent dependent-load reordering, but enforcing this dependency could add overhead to inter-processor communication.
Long answer:
Background on dependency tracking
This is probably best explained with an example. Imagine you had the following sequence of instructions (pseudo-code instructions used for simplicity):
ST R1, A       // store value in register R1 to memory at address A
LD B, R2       // load value from memory at address B to register R2
ADD R2, 1, R2  // add immediate value 1 to R2 and save result in R2
In this example there is a dependency between the LD and the ADD instruction: the ADD reads the value of R2, so it cannot execute until the LD makes that value available. This dependency is through a register, and it is something that the processor's issue logic can track.
However, there could also be a dependency between the ST and the LD if addresses A and B were the same. But unlike the dependency between the LD and the ADD, the possible dependency between the ST and the LD is not known at the time the instructions are issued (begin executing).
Instead of trying to detect memory dependencies at issue time, the processor tracks them with a structure called the load-store queue. This structure keeps track of the addresses of pending loads and stores for instructions that have been issued but have not yet retired. If a memory ordering violation occurs, it can be detected and execution can be restarted from the point where the violation happened.
So going back to the pseudo-code example, you can imagine a case where the LD executes before the ST (perhaps the value needed in R1 was not ready for some reason). But when the ST executes, it sees that addresses A and B are the same. So the LD should really have read the value produced by the ST, rather than the stale value that was already in the cache. As a result, the LD needs to be re-executed, along with any instructions that came after it. There are various optimizations possible to reduce some of this overhead, but the basic idea holds.
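To make that violation check concrete, here is a toy sketch in plain C; the structure and function are invented purely for illustration and are nothing like real hardware state. It shows an older store whose address resolves late discovering that a younger load to the same address already executed and must be replayed:

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, heavily simplified load-store queue entry. */
struct lsq_entry {
    bool     is_store;
    bool     addr_known;   /* address already computed? */
    bool     executed;     /* memory access already performed? */
    unsigned addr;
};

/* When a store finally computes its address, check whether any younger
 * load to the same address has already executed; if so, that load (and
 * everything after it) must be squashed and replayed. */
static bool store_resolves(struct lsq_entry *lsq, int n, int store_idx,
                           unsigned addr)
{
    lsq[store_idx].addr = addr;
    lsq[store_idx].addr_known = true;

    for (int i = store_idx + 1; i < n; i++) {   /* younger entries */
        if (!lsq[i].is_store && lsq[i].executed &&
            lsq[i].addr_known && lsq[i].addr == addr)
            return true;                        /* ordering violation */
    }
    return false;
}

int main(void)
{
    /* ST R1, A (address not yet known) followed by LD B, R2 (already done). */
    struct lsq_entry lsq[2] = {
        { .is_store = true,  .addr_known = false, .executed = false },
        { .is_store = false, .addr_known = true,  .executed = true, .addr = 0x40 },
    };

    /* The store's address turns out to equal the load's (A == B), so the
     * early load read a stale value and must be replayed. */
    if (store_resolves(lsq, 2, 0, 0x40))
        printf("violation: replay the load and younger instructions\n");
    return 0;
}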
As I mentioned before, the logic to detect this dependency exists in all out-of-order processors that allow speculative execution of memory instructions (including Alpha processors).
Memory ordering rules
However, memory ordering rules do not just constrain the order in which a processor sees the results of its own memory operations. They also constrain the relative order in which memory operations performed on one processor become visible to other processors.
Alpha example
In the case of dependent-load reordering, the processor has to track this information for its own use, but the Alpha ISA does not require it to make sure that other processors see this ordering. One example of how this can happen is the following (quoted from this link):
Initially: p = &x, x = 1, y = 0

    Thread 1         Thread 2
--------------------------------
y = 1            |
memoryBarrier    |    i = *p
p = &y           |
--------------------------------
Can result in: i = 0
The anomalous behavior is currently only possible on a 21264-based system. And obviously you have to be using one of our multiprocessor servers. Finally, the chances that you actually see it are very low, yet it is possible.

Here is what has to happen for this behavior to show up. Assume T1 runs on P1 and T2 on P2. P2 has to be caching location y with value 0. P1 does y=1 which causes an "invalidate y" to be sent to P2. This invalidate goes into the incoming "probe queue" of P2; as you will see, the problem arises because this invalidate could theoretically sit in the probe queue without doing an MB on P2. The invalidate is acknowledged right away at this point (i.e., you don't wait for it to actually invalidate the copy in P2's cache before sending the acknowledgment). Therefore, P1 can go through its MB. And it proceeds to do the write to p. Now P2 proceeds to read p. The reply for read p is allowed to bypass the probe queue on P2 on its incoming path (this allows replies/data to get back to the 21264 quickly without needing to wait for previous incoming probes to be serviced). Now, P2 can dereference p to read the old value of y that is sitting in its cache (the inval y in P2's probe queue is still sitting there).

How does an MB on P2 fix this? The 21264 flushes its incoming probe queue (i.e., services any pending messages in there) at every MB. Hence, after the read of p, you do an MB which pulls in the inval to y for sure. And you can no longer see the old cached value for y.

Even though the above scenario is theoretically possible, the chances of observing a problem due to it are extremely minute. The reason is that even if you set up the caching properly, P2 will likely have ample opportunity to service the messages (i.e., inval) in its probe queue before it receives the data reply for "read p". Nonetheless, if you get into a situation where you have placed many things in P2's probe queue ahead of the inval to y, then it is possible that the reply to p comes back and bypasses this inval. It would be difficult for you to set up the scenario though and actually observe the anomaly.

The above addresses how current Alphas may violate what you have shown. Future Alphas could violate it due to other optimizations. One interesting optimization is value prediction.
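The same litmus test can also be written with C11 atomics to show where the reader-side ordering comes from. This is only a sketch, using pthreads, with the quoted "memoryBarrier" mapped onto a release store; memory_order_consume is the C11 name for exactly this address-dependency ordering (compilers in practice usually strengthen it to acquire), and consume/acquire is what makes the compiler emit a real barrier on Alpha:

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

static int x = 1, y = 0;
static _Atomic(int *) p;            /* initialized to &x in main() */

static void *writer(void *arg)
{
    (void)arg;
    y = 1;
    /* The release store plays the role of "memoryBarrier" in the quote. */
    atomic_store_explicit(&p, &y, memory_order_release);
    return NULL;
}

static void *reader(void *arg)
{
    (void)arg;
    /*
     * consume preserves the address dependency from the load of p to the
     * load of *q.  With memory_order_relaxed here, i == 0 would be allowed
     * by the C11 model, which is what the quoted Alpha scenario produces.
     */
    int *q = atomic_load_explicit(&p, memory_order_consume);
    intptr_t i = *q;
    return (void *)i;
}

int main(void)
{
    pthread_t t1, t2;
    void *r;

    atomic_init(&p, &x);
    pthread_create(&t1, NULL, writer, NULL);
    pthread_create(&t2, NULL, reader, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, &r);
    printf("i = %ld\n", (long)(intptr_t)r);  /* 1 in every allowed execution */
    return 0;
}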
Summary
The basic hardware needed to enforce dependent-load ordering already exists in all out-of-order processors. But making sure that all other processors see this ordering adds extra constraints to the handling of cache-line invalidations, and it may add constraints in other scenarios as well. In practice, however, it seems that the potential advantages of the weak Alpha memory model for hardware designers were not worth the cost in software complexity and the added overhead of requiring more memory barriers.