Ideally, only iterations satisfying the prefetch predicate should
issue prefetch instructions. A naive way to implement this is to enclose
the prefetch instructions inside an IF statement with the
prefetch predicate as the condition. However, such a statement in the
innermost loop can be costly, and thus defeat the purpose of reducing
the prefetch overhead. We can eliminate this overhead by decomposing
the loops into different sections so that the predicates for all
instances for the same section evaluate to the same value. This
process is known as loop splitting. In general, the predicate
requires the first iteration of the loop to be peeled. The
predicate
requires the loop to be unrolled by a
factor of
. Peeling and unrolling can be applied recursively to
handle predicates in nested loops.
Going back to our example in Figure 2(a), the i = 0 predicate causes the compiler to peel the i loop. The (j mod 2) = 0 predicate then causes the j loop to be unrolled by a factor of 2-both in the peel and the main iterations of the i loop.
However, peeling and unrolling multiple levels of loops can potentially expand the code by a significant amount. This may reduce the effectiveness of the instruction cache; also, existing optimizing compilers are often ineffective for large procedure bodies. Our algorithm keeps track of how large the loops are growing. We suppress peeling or unrolling when the loop becomes too large. This is made possible because prefetch instructions are only hints, and we need not issue those and only those satisfying the prefetch predicate. For temporal locality, if the loop is too large to peel, we simply drop the prefetches. For spatial locality, when the loop becomes too large to unroll, we introduce a conditional statement. When the loop body has become this large, the cost of a conditional statement is relatively small.