Whereas subroutine expand_temp_profile is very relaxed when it
comes to checking the transform length, especially against the number
of nodes used by your HPF program, subroutine get_diffusion_matrix
goes about it quite meticulously. This is an overkill in this program
though. Observe that expand_temp_profile is called before
get_diffusion_matrix. Consequently, if transform length is
not a multiple of the number of nodes, we're going to bump out thanks
to fft in expand_temp_profile well before we get to all
that checking in get_diffusion_matrix. But, at least,
this subroutine shows how to do that, if you really have to.
Subroutine get_diffusion_matrix calls subroutine factor_nodes
on entry. Subroutine factor_nodes, in turn, checks the number of
processors used by the program, and then tests if that number can be
represented as
fft. On exit subroutine
factor_nodes returns an array:
After the return of factor_nodes we perform the following
computation:
Consider some examples to see how this works.
First assume that
np = 2h. Then
f1 = np / 2h = 1 and
.
Subroutine minpower2 will adjust
to the
nearest power of two greater than or equal to
and multiplication by f1 = 1 won't change anything. In short, in this
context all that we really do is:
nx = minpower2(4*(dif_nx + 1))and this is much the same as we have already seen in
expand_temp_profile. Furthermore we still don't check if
nx is the multiple of the number of nodes, sic!
Why is there 4*(dif_nx + 1) instead of just 2*(dif_nx + 1),
as was sufficient in expand_temp_profile?
The reason for this is that when we calculate matrix
Fwe will need terms such as
.
In other words, if k and k' run from 1 through 7, then k+ runs from
1 through 14. In principle k- would run from -6 through 6, but
we have already shown that in our case
,
and ditto for j. In short, this time we need to have Fourier coefficients
from 1 through 2 nx and from 1 through 2 ny,
and then, in order to stay away from negative frequencies we need to double the number
of Fourier coefficients still.
But since Fourier coefficients are cheap, and you only need to get them once, you can just as well avail yourself of more than you really need.
Now assume that
np = 2h 31. Then
f1 = np / 2h = 3 and
minpower2 will return 16. Now we multiply that 16 by 3 and get
48. Will this work for, say, 12 nodes?
But observe that the total number of nodes still falls out
of the equation. We'll be alright going for
and for
,
but not for
.
The subroutine doesn't check if nx and ny evaluated by that
process are indeed larger than the number of processors.
Admittedly, not many people can afford 96 node machines, and those who
can are unlikely to run this demonstration program on all 96 nodes either.
Still, this is an omission, or a bug, and, as you see, the procedure
could still be improved.
In summary, both ways to calculate the transform length in
expand_temp_profile and in get_diffusion_matrix are
botched. They demonstrate two techniques of attacking the problem,
but both have lose ends. My preference would be to go for the simpler
one, i.e., for a power of 2 and to forget about the complexities
associated with filling the gaps between the powers of 2 with
2h 3i 5j 7k 11m, unless you really, really need it.
Forcing the number of nodes to be a power of 2 too can be done easily
on program entry. Transform length can be then adjusted simply by
taking a multiple of the number of nodes and comparing that with
the required number of harmonics. On our SP you could then run the
program on 2, 4, 8, 16, or 32 nodes.
Another alternative would be to evaluate fft locally, i.e.,
not to make df or atmp distributed arrays. For our
problem those arrays are so small, and even sequential fft is
so fast, that we don't gain anything by parallelising this computation.