The Prefetch Queue
The 80386 is documented as having a 16-byte prefetch queue. At
one time, it did, but due to a bug in the pipelining
architecture, Intel had to abandon the 16-byte queue, and only
use a 12-byte queue. The change occurred (I believe) between the
D0, and D1 step of the '386. The '386SX wasn't affected by the
bug, and therefore hasn't changed.
Determining the size of the prefetch queue is no trivial task.
The theory is simple, but the reality is different.
Theoretically, one could write an algorithm which employs
self-modifying code to determine the depth of the queue. The
algorithm would execute code that writes exactly "n"
bytes beyond the beginning of the prefetch queue. Once the
modified code gets executed, then the prefetch queue size has
been determined. This is not as easy in practice as it is in
principal. There are many aspects of the internal architecture
that make the task very difficult:
- The Bus Interface Unit (BIU) only prefetches code when it
is has no other bus cycles pending. Memory and I/O cycles
take a higher priority than prefetches.
- The Decode Unit (DU) can hold up to three fully decoded
instructions. It decodes them at the rate of one byte per
CPU clock.
- The Prefetch Unit (PU) and the DU must be fully loaded
and no bus cycles pending. The self-modifying instruction
should be at the head or the middle of the DU.
- The modifying instruction should lock the BUS to inhibit
any BUS activity (if some should occur).
- Synchronizing the starting of the algorithm with refresh.
What we need to meet the criteria of step 3 is a method to
fill the PU and DU and ensure that no BUS cycles are pending. To
do that, we need an instruction that takes longer to execute than
a prefetch. This will guarantee that any prefetch will be
absorbed in the execution time of the instruction. Next, that
instruction must not perform any bus cycles. If it performs any
bus cycles, then they take priority over prefetches, and we can't
guarantee the state of the PU or DU. BSF or BSR are ideal
instructions for this purpose. Both instructions take longer to
execute than a prefetch, and neither requires any BUS activity.
To satisfy step 4, OR, XOR are probably best. Any
Read/Modify/Write instruction implicitly locks the BUS during
execution. This will guarantee that one last fetch doesn't come
through while the code is being modified. Finally, synchronizing
the beginning of the algorithm with refresh will guarantee that
the first few prefetches won't be interrupted, thereby throwing
the algorithm "out of sync."
Making any single algorithm work in all cases and CPU speeds
is very difficult. In a period of six months, I have spent over
100 hours (in attempts at) perfecting a general-purpose
algorithm. Just when I thought I understood it, I found a
different revision of the CPU at a speed I hadn't tried, that
gave me trouble. The culmination of this effort has led to my
current understanding of the CPU prefetch and decode units: Both
the prefetch unit, and decode unit have a "low-water"
point. Only when each unit gets to this point is a refill
triggered. The file PREFCHSZ.ASM
has an algorithm that works on '386's with CPU ID of 303, 305,
and 308. The algorithm assumes that the decode unit requests
instructions from the prefetch unit ONLY after it has exhausted
two slots; the prefetch unit ONLY requests more data after it has
exhausted all but four bytes. Therefore the assumptions are that
the decode unit has a 1-instruction "low-water" point,
and the prefetch unit has a 4-byte "low-water" point.
View and download source code:
ftp://ftp.x86.org/pub/x86/source/prefetch/prefchsz.asm
ftp://ftp.x86.org/pub/x86/dloads/PREFCHSZ.ZIP
Back to secrets and
bugs
|