Whereas subroutine `expand_temp_profile`

is very relaxed when it
comes to checking the transform length, especially against the number
of nodes used by your HPF program, subroutine `get_diffusion_matrix`

goes about it quite meticulously. This is an overkill in this program
though. Observe that `expand_temp_profile`

is called *before*
`get_diffusion_matrix`

. Consequently, if transform length is
not a multiple of the number of nodes, we're going to bump out thanks
to `fft`

in `expand_temp_profile`

well before we get to all
that checking in `get_diffusion_matrix`

. But, at least,
this subroutine shows how to do that, if you really have to.

Subroutine `get_diffusion_matrix`

calls subroutine `factor_nodes`

on entry. Subroutine `factor_nodes`

, in turn, checks the number of
processors used by the program, and then tests if that number can be
represented as

where

`fft`

. On exit subroutine
`factor_nodes`

returns an array:
[*h*, *i*, *j*, *k*, *m*]

if the number of processors,

After the return of `factor_nodes`

we perform the following
computation:

Consider some examples to see how this works.

First assume that
*n*_{p} = 2^{h}. Then
*f*_{1} = *n*_{p} / 2^{h} = 1 and
.
Subroutine `minpower2`

will adjust
to the
nearest power of two greater than or equal to
and multiplication by *f*_{1} = 1 won't change anything. In short, in this
context all that we really do is:

nx = minpower2(4*(dif_nx + 1))and this is much the same as we have already seen in

`expand_temp_profile`

. Furthermore we still don't check if
Why is there `4*(dif_nx + 1)`

instead of just `2*(dif_nx + 1)`

,
as was sufficient in `expand_temp_profile`

?
The reason for this is that when we calculate matrix
**F**we will need terms such as
.
In other words, if *k* and *k*' run from 1 through 7, then *k*^{+} runs from
1 through 14. In principle *k*^{-} would run from -6 through 6, but
we have already shown that in our case
,
and ditto for *j*. In short, this time we need to have Fourier coefficients
from 1 through 2 *n*_{x} and from 1 through 2 *n*_{y},
and then, in order to stay away from negative frequencies we need to double the number
of Fourier coefficients still.

But since Fourier coefficients are *cheap*, and you only need to
get them once, you can just as well avail yourself
of more than you really need.

Now assume that
*n*_{p} = 2^{h} 3^{1}. Then
*f*_{1} = *n*_{p} / 2^{h} = 3 and

For

`minpower2`

will return 16. Now we multiply that 16 by 3 and get
48. Will this work for, say, 12 nodes?
,
,
and
,
so we have satisfied both
conditions, i.e., that transform length must be a multiple of the
number of nodes and that it must be of the form
2
But observe that the total number of nodes still falls out
of the equation. We'll be alright going for
and for
,
but not for
.
The subroutine doesn't check if *n*_{x} and *n*_{y} evaluated by that
process are indeed *larger* than the number of processors.
Admittedly, not many people can afford 96 node machines, and those who
can are unlikely to run this demonstration program on all 96 nodes either.
Still, this is an omission, or a bug, and, as you see, the procedure
could still be improved.

In summary, both ways to calculate the transform length in
`expand_temp_profile`

and in `get_diffusion_matrix`

are
botched. They demonstrate two techniques of attacking the problem,
but both have lose ends. My preference would be to go for the simpler
one, i.e., for a power of 2 and to forget about the complexities
associated with filling the gaps between the powers of 2 with
2^{h} 3^{i} 5^{j} 7^{k} 11^{m}, unless you really, really need it.
Forcing the number of nodes to be a power of 2 too can be done easily
on program entry. Transform length can be then adjusted simply by
taking a multiple of the number of nodes and comparing that with
the required number of harmonics. On our SP you could then run the
program on 2, 4, 8, 16, or 32 nodes.

Another alternative would be to evaluate `fft`

locally, i.e.,
not to make `df`

or `atmp`

distributed arrays. For our
problem those arrays are so small, and even sequential `fft`

is
so fast, that we don't gain anything by parallelising this computation.