Branch prediction in the Pentium family
How the branch
prediction mechanism in the Pentium has been uncovered with all its quirks,
and the incredibly more effective branch prediction in the later versions.
By Agner Fog
What is branch prediction?
Imagine a simple microprocessor where all instructions
are handled in two steps: decoding and execution. The microprocessor can
save time by decoding one instruction while the preceding instruction
is executing. This assembly line-principle is called pipelining. In advanced
microprocessors, the pipeline may have many steps so that many consecutive
instructions are underway in the assembly line at the same time, one at
each stage in the pipeline.
The problem now occurs when we meet a branch instruction.
A branch instruction is the implementation of an if-then-else construct.
If a condition is true then jump to some other location; if false then
continue with the next instruction. This gives a break in the flow of
instructions through the pipeline because the processor doesn't know which
instruction comes next until it has finished executing the branch instruction.
The longer the pipeline, the longer time it will have to wait until it
knows which instruction to feed next into the pipeline. As modern microprocessors
tend to have longer and longer pipelines, there has been a growing need
for doing something about this problem.
The solution is branch prediction. The microprocessor
tries to predict whether the branch instruction will jump or not, based
on a record of what this branch has done previously. If it has jumped
the last four times then chances are high that it will also jump this
time. The microprocessor decides which instruction to load next into the
pipeline based on this prediction, before it knows for sure. This is called
speculative execution. If the prediction turns out to be wrong, then it
has to flush the pipeline and discard all calculations that were based
on this prediction. But if the prediction was correct, then it has saved
a lot of time.
The detective work
Intel manuals have never been very specific about
how the branch prediction works. However, since mispredictions are expensive
in terms of execution time, I found it important to know how the prediction
works in order to optimize my programs. I started to do a lot of experiments
together with some clever persons I had met on the Internet, most importantly
Karki J. Bahadur at the university of Colorado, and Terje Mathisen in
Norway, the guy who reverse engineered system software to find out how
to get access to the performance monitor counters on the Pentium chip.
Well, my first finding was that the Pentium predicts a branch instruction
to jump if it has jumped any of the last two times. This fitted all my
experiments, but Karki pointed out that a branch which jumps every third
time is predicted one time out of six, where, according to my first model,
it should never be predicted correctly. Then followed a series of new
experiments until Karki and I independently came out with the same state
diagram, shown in fig. 1a. While we agreed on
this mechanism, we disagreed on the interpretation and in particular on
why it was asymmetric. In the meantime, another guy had found an old article
in Microprocessor Report claiming that the mechanism was a symmetric one
as illustrated in fig 1b. My opinion was that the designers had actually intended
the mechanism to be as in fig. 1b and that the asymmetry was a bug. But
Karki and Terje maintained that there had to be an intention behind this
asymmetry. It didn't convince them that I demonstrated how the symmetric
mechanism was superior to the asymmetric one in almost all cases.
|
| Now I discovered a powerful
tool to dig deeper into this mechanism. The Pentium has a set of test registers
that make it possible to read or write directly into the area that holds
the history information for all branches, the branch target buffer (BTB).
I had found this information on the home page of another hacker, Christian
Ludloff. His page was shut down (rumors say that this was due to pressure
from Intel) but fortunately I had downloaded his page before it was too
late. Having direct access to the BTB, I was able to see exactly what happened:
When a branch does not have an entry in the BTB it is predicted to not jump.
The first time it jumps it gets an entry in the BTB and immediately goes
to state 3. The complication is that the designers have equated state 0
with 'vacant BTB entry'. This makes sense because state 0 is predicted to
not jump anyway. But since it cannot distinguish between state 0 and a vacant
BTB entry it will go to state 3 next time the branch jumps rather than to
state 1. This is where the quirk comes from. Apparently, somebody at the
design labs has done a lot of research to find a good branch prediction
scheme, and then somebody else has messed it all up by letting state 0 mean
vacant BTB entry without realizing the consequence. And the consequence
is that a branch which seldom jumps will have three times as many mispredictions
as it would with the symmetric design.
Karki and Terje were still not convinced that this
design was a blunder. The convincing proof came when I discovered that
tight loops behave differently. In a small loop the microprocessor doesn't
have enough time to update the BTB entry for a branch instruction before
it meets the same branch instruction again. In order to avoid a delay,
it bypasses the BTB and reads the branch prediction state directly from
the pipeline. And in this case it is actually able to go from state 0
to state 1, as in fig. 1b.
|
a. asymmetric design in the Pentium:
The state follows the +arrows when the branch
instruction jumps, and the -arrows when not jumping. The branch instruction
is predicted to jump next time if in state 2 or 3, and to not jump when
in state 0 or 1.
b. symmetric design:
This is how the branch prediction should
work. The state is incremented when jumping (+arrows) and decremented
when not jumping (-arrows).
|
| More quirks
We soon found that there were more strange things about the
Pentium's branch prediction. We couldn't make sense of what happened when
more branch instructions came close after one another. This time Karki
and Terje came with the 'wild' ideas that led to the solution, while I
played the role of the sceptic. After a hectic period where we exchanged
results by E-mail every day, we found that the BTB information may actually
be stored several instructions ahead of the branch it refers to. If there
happens to be another branch in between then the BTB information is likely
to be misapplied to somewhere in the wrong branch. This can lead to many
funny phenomena: a branch instruction can have more than one BTB entry;
two branches can share the same BTB entry so that one branch is predicted
to go to the target of the other one; an unconditional jump instruction
can be predicted to not jump; and a non-jumping instruction can be predicted
to jump. I will not go into detail with all these quirks here, but you
can find it all on my homepage. None of these quirks are fatal, because
all mispredictions eventually get corrected.
A much more powerful mechanism
The later processors in the Pentium family: the Pentium
MMX, Pentium Pro, Pentium II, Celeron, and Xeon, all have a much more
advanced branch prediction mechanism. I will refrain from more detective
stories here and go right to the mechanism.
|
| This mechanism is based on
the same fundamental idea of the state diagram in fig 1b. This is simply
a two-bit counter with saturation. The counter is incremented when jumping
and decremented when not jumping. The branch instruction is predicted to
jump next time if the counter is in state 2 or 3, and to not jump if in
state 0 or 1. This mechanism makes sure that the branch has to deviate twice
from what it does most of the time before the prediction changes.
The improvement in the later processors comes from
the so-called two-level branch prediction. The first level is a shift
register that stores the history of the last four events for any branch
instruction. This gives sixteen possible bit patterns. You get a pattern
of 0000 if the branch did not jump the last four times, and a pattern
of 1111 after four times of jumping. The second level in the branch prediction
mechanism is constituted of sixteen 2-bit counters of the type in fig.
1b. It uses the 4-bit pattern in the first level to choose which of the
sixteen counters to use in the second level. See fig. 2.
|
Level two consists of 16 two-bit counters
of the type in fig. 1b. Level one is a four bit shift register storing
the history of the last four events. This four bit pattern is used to
select which of the 16 two-bit counters to use for the next prediction.
|
| The advantage of this mechanism
is that it can learn to recognize repetitive patterns. Imagine a branch
that jumps every second time. You can write this pattern as 01010101 where
0 means no-jump and 1 means jump. After 0101 always comes an 0. Every times
this happens, the counter with the binary number [0101] will be decremented
until it reaches its lowest state. It has now learned that after 0101 comes
an 0 and will therefore make this prediction correctly the next time. Similarly,
counter number [1010] will be incremented until state three so that it will
always predict a 1 after 1010. The remaining fourteen counters for this
branch are never used as long as the pattern is the same.
This mechanism is quite powerful as it can handle
complex repetitive patterns like 00101-00101-00101. In fact, it can handle
any repetitive pattern with a period of up to five, most patterns of period
six and seven, and even some patterns with periods as high as sixteen.
To see if a pattern of period n can be handled without misprediction,
write down the n 4-bit sub-sequences in the pattern. If they are
all different, then you will have no mispredictions after an initial learning
time of two periods.
But the two-level mechanism is more powerful than
that. It is also extremely good at handling deviations from a regular
pattern. If a branch instruction has an almost regular pattern
with occasional deviations, then the processor will soon learn what the
deviations look like, so that it can handle almost any kind of recurrent
deviation with only one misprediction.
Furthermore, it can handle a situation where you alternate
between two different repetitive patterns. Assume that you have given
the processor one repetitive pattern until it has learned to handle it
without mispredictions. Then another pattern. And then return to the first
pattern. If the two patterns do not have any 4-bit subsequences in common,
then they do not use the same counters, so the processor doesn't have
to re-learn the first pattern. Therefore, it can handle the transitions
back and forth between the two patterns with a minimum of mispredictions.
Conclusion
The first microprocessor in the Pentium family introduced
a simple one-level branch prediction mechanism with many ludicrous quirks.
The later versions, Pentium MMX, Pentium Pro, Pentium II, etc. have longer
pipelines and therefore a higher need for effective branch prediction.
This need has been met by the incredibly powerful two-level mechanism
with its ability to learn and recognize repetitive patterns and even deviations
from the regular patterns. This mechanism is also quite economical in
terms of chip area as the history of a branch can be stored in only 32
bits.
The most important shortcoming of the two-level branch
prediction is that it is not very good at predicting the branch pattern
of a loop control. If, for example, you have a program with a loop that
always repeats ten times, then the control instruction at the bottom of
the loop will branch back nine times and fall through the tenth time at
the cost of one misprediction. For the Pentium Pro and Pentium II, where
branch misprediction costs a lot of time, it may acually be advantageous
to replace a loop that executes ten times with two nested loops that execute
five and two times, in order to avoid mispredictions.
|
Technical details can be found at my
homepage: www.announce.com/agner/assem.
Back to Books and
Articles home page
|