In this section we had a close look at two broad classes of algorithms.
The first class is represented by algorithms we've implemented
in order to evaluate
,
,
and
y' = a + b x. These computations are outlined in
Figure 2.10 (page
).
Their characteristic feature is
the use of array operations and a total lack of explicit iterations,
i.e., no DO loop, even though in other languages, like C
or Lisp, we would have to use many iterative constructs in order
to perform the same computations. These operations are
explicitly parallel. This means that if we were to
run them on a parallel computer then
There is no point whatsoever in parallelising small programs or programs working on small data sets. The effort involved in parallelising such programs would never pay off.
The second class is represented by algorithms we've used to calculate
incomplete gamma functions. You can see these algorithms implemented
in Figures 2.12 (page
) and
2.13 (page
). In both cases the
algorithms are recursive, i.e., we have to enter a DO loop
within which we compute successive terms using terms evaluated
in previous iterations. Such algorithms are basically sequential.
They cannot be parallelised, although small optimisations may
be possible sometimes. The only way to make them run
fast, is to have a very fast sequential processor that is clocked
at a very high clock rate.
In case of the algorithm shown in Firgure 2.13 on page
a superscalar processor or an explicitly parallel instruction
set CPU such as Merced can speed up things just a little bit. For example
a0 in the first line of the do loop can be evaluated
at the same time as b0 in the second line. Similarly
lines 3 and 4 of the do loop can be evaluated simultaneously
too. Lines 5 through 7 must then be evaluated in sequence after
we have calculated b1 in line 4. But the last, 8th, line,
which increments
n can be evaluated at the same time as lines 5 through 7.
In summary, instead of 8 operations per loop, we can
do it in 5 parallel steps. The speed up could be of the order of
1.5 to 2 against a purely sequential processor.
There are other things that you can optimise when developing algorithms of this kind. You should try to minimise data movement in the first place. For example if you have an auxiliary variable, which is used to memorise an old value of some quantity to be then used in a next iteration, make that variable reside on the CPU itself, in a register. A good compiler should make such optimisations automatically. Sometimes you may have to use special compiler directives.
But a more important optimisation is in
the choice of an algorithm itself: use an algorithm that converges
quickly, so that you don't have to run many iterations. Continued
fraction expansions discussed in Section 2.2.3
(page
) are often very effective in this
respect. But you must watch for regions of validity: make sure that
you don't divide by zero, nor do you get close to a division by zero.
You can usually switch to a safer procedure in such a region,
as we have done in function goodness (cf. Figure 2.11,
page
).