- First observe that all arrays are declared `ALLOCATABLE`. The dimensions of the arrays are obtained from a *namelist* by subroutine `init_diffusion`. We'll see how that is done later. `dif_x` and `dif_y` are one-dimensional arrays that contain the *x* and *y* coordinates of points at which temperature is going to be evaluated for various times. The points are distributed on a 2D rectangular grid. There are `dif_numx` points in the *x* direction and `dif_numy` points in the *y* direction. The total number of points is thus `dif_numx * dif_numy`. The authors of the program chose to list all those points as one long one-dimensional array, rather than a 2D matrix. The one-dimensional index that numbers entries in the corresponding arrays *x*_{p} and *y*_{p} (where *p* stands for a point of the grid) is called `ij`, and the formula that converts the 2D matrix indexes `i` and `j` into `ij` is

  ```
  ij = dif_numx*(j-1) + i
  ```

  After having obtained basic parameters for the diffusion program we allocate space for `dif_x` and `dif_y` and then initialise their values so that *x*_{ij} = *i* Δ*x* and *y*_{ij} = *j* Δ*y*, where Δ*x* = π/(`dif_numx` + 1) and Δ*y* = π/(`dif_numy` + 1):

  ```
  ALLOCATE( dif_x(dif_numx*dif_numy) ); ALLOCATE( dif_y(dif_numy*dif_numx) )
  delx = PI / ( dif_numx + 1.d0 ); dely = PI / ( dif_numy + 1.d0 )
  DO i = 1, dif_numx
     DO j = 1, dif_numy
        ij = dif_numx*(j-1) + i
        dif_x(ij) = delx * i; dif_y(ij) = dely * j
     ENDDO
  ENDDO
  ```
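  The collapsed-index bookkeeping is easy to experiment with outside of Fortran. Here is a small Python sketch of the same scheme (my own illustration with made-up grid sizes, not part of the original program):

  ```python
  import math

  # Hypothetical small grid: 4 points in x, 3 in y (stand-ins for dif_numx, dif_numy).
  numx, numy = 4, 3
  delx = math.pi / (numx + 1)   # delx = PI/(dif_numx + 1.d0)
  dely = math.pi / (numy + 1)

  # One long 1D list of grid coordinates, exactly as the program lays them out.
  x = [0.0] * (numx * numy)
  y = [0.0] * (numx * numy)
  for i in range(1, numx + 1):          # Fortran-style 1-based indices
      for j in range(1, numy + 1):
          ij = numx * (j - 1) + i       # ij = dif_numx*(j-1) + i
          x[ij - 1] = delx * i
          y[ij - 1] = dely * j

  # The first numx entries share the same y: they form the j = 1 row of the grid.
  print(x[0], y[0])
  print(y[:numx])
  ```

  Note how the 1-based Fortran formula drops neatly into 0-based Python lists via the `ij - 1` shift.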

  This way of doing things adds unnecessary complexity to the code and gains little, especially on a parallel machine and in HPF, so you can just as well drop it and use 2D grid arrays instead. This is especially so when modifying the program for writing images on HDF animation files.

- The next step is to allocate all the arrays used by the program. Since the call to `init_diffusion` should have returned the required dimensions of the problem, i.e., the required number of points in space and the required number of harmonics, we are now ready to allocate as much space as is needed. The parameters `dif_nx` and `dif_ny` yield the requested number of harmonics in the *x* and *y* directions respectively. Recall that our collapsed index *l* (equation 3.51) runs from 1 through *n*_{x} *n*_{y}, or, using the program notation, from 1 through `n = dif_nx * dif_ny`.

  The program uses a simple error control mechanism, accumulating the number of observed allocation problems on `num_errors` and checking whether it is greater than `0` at the end of this part of the code. The `ALLOCATE` statement returns its exit status on `stat=`.

- The allocated arrays have the following meaning:
- 1. `lambda(n)` is the array of eigenvalues of matrix **F**:

  ```
  ALLOCATE(lambda(n),stat=stat); IF(stat.NE.0) num_errors=num_errors+1
  ```

  Observe that `lambda` is not mentioned in the `DISTRIBUTE` directives. This means that this array is going to be *replicated* on every node.

- 2. `a(n)` corresponds to what in our derivations we have called **a**_{0} = **a**(0), i.e., the Fourier coefficients that correspond to the initial condition *T*_{0}(*x*, *y*) = *T*(*x*, *y*, 0):

  ```
  !HPF$ DISTRIBUTE (BLOCK) :: a
  ...
  ALLOCATE(a(n), stat=stat); IF(stat.NE.0) num_errors=num_errors+1
  ```
- 3. `ab(n)` is the initial condition, i.e., **a**_{0} ``rotated'' into the eigen-basis of **F**. As you remember, our formula for that was:

  **ab** = **b**^{-1} **a**_{0}

  In order to avoid overloading the word `lambda`, the program uses the symbol `b` for the matrix of eigenvectors. Recall that for orthonormal matrices (and hopefully `b` is going to be such):

  **b**^{-1} = **b**^{T}

  So in this case we can write that:

  *ab*^{k} = Σ_{k'} *b*^{k'}_{k} *a*^{k'}_{0}

  where *b*^{k}_{k'} is the matrix of eigenvectors returned by the eigen-problem subroutine. The trick, in other words, is to contract **a**_{0} with the *first* index of that matrix, and not with the second one, as we would normally do for matrix-vector multiplication. In the code this is done as follows:

  ```
  !HPF$ DISTRIBUTE (*,BLOCK) :: b
  !HPF$ DISTRIBUTE (BLOCK)   :: ab
  ...
  ALLOCATE( ab(n), stat=stat ); IF( stat .NE. 0 ) num_errors = num_errors + 1
  ...
  ALLOCATE( b(n,n), stat=stat ); IF( stat .NE. 0 ) num_errors = num_errors + 1
  ...
  DO i = 1, n
     ab(i) = 0.d0
     DO j = 1, n
        ab(i) = b(j,i)*a_rep(j) + ab(i)
     ENDDO
  ENDDO
  ...
  ```

  (Note that the zeroing of `ab(i)` must sit inside the loop over `i`.)

  The array `a_rep` is a replicated version of `a`. Observe the following:

  ```
  ALLOCATE( a_rep(n), stat=stat ); IF( stat .NE. 0 ) num_errors = num_errors + 1
  ...
  CALL expand_temp_profile(a)
  a_rep = a
  ...
  ```

  What do we gain by replicating `a_rep(j)`? Observe that `b` is distributed in the second index only. Since `a_rep` is *replicated* and not distributed, all those multiplications in `b(j,i)*a_rep(j)` will be performed locally, as will the outer loop (over `i`), because `ab(i)` is distributed the same way that `b(*,i)` is, i.e., in a `BLOCK` fashion.
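  The ``contract with the first index'' trick is pure linear algebra and can be checked in a few lines of Python (a sketch with a hypothetical 2×2 orthonormal matrix, nothing PESSL-specific):

  ```python
  import math

  # A 2x2 rotation matrix plays the role of the orthonormal eigenvector matrix b.
  t = 0.3
  b = [[math.cos(t), -math.sin(t)],
       [math.sin(t),  math.cos(t)]]
  a0 = [1.0, 2.0]

  # ab(i) = sum_j b(j,i)*a0(j): contraction over the FIRST index, i.e. b^T a0.
  ab = [sum(b[j][i] * a0[j] for j in range(2)) for i in range(2)]

  # Applying b to ab must recover a0, since b^T = b^(-1) for orthonormal b.
  back = [sum(b[i][j] * ab[j] for j in range(2)) for i in range(2)]
  print(back)
  ```

  The round trip recovers `a0`, confirming that contracting with the first index really does apply **b**^{-1}.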
- 4. The next two arrays we allocate space for are `xsines` and `ysines`. These arrays are used in the final evaluation of *T*(*x*, *y*, *t*) and are defined as follows:

  *xsines*_{p,k} = √(2/π) sin(*k* *x*_{p})  (3.59)

  *ysines*_{p,j} = √(2/π) sin(*j* *y*_{p})  (3.60)

  where *p* = 1, ..., *n*_{p}, *k* = 1, ..., *n*_{x}, and *j* = 1, ..., *n*_{y}, and *n*_{x} is the number of harmonics in the *x* direction (`dif_nx`), *n*_{y} is the number of harmonics in the *y* direction (`dif_ny`), and *n*_{p} is the total number of physical grid points (`dif_npts = dif_numx * dif_numy`). The first index of these arrays runs over the physical grid points, and the second index runs over the harmonics:

  ```
  ALLOCATE( xsines(dif_npts, dif_nx), stat=stat ); IF( stat .NE. 0 ) &
     num_errors = num_errors + 1
  ALLOCATE( ysines(dif_npts, dif_ny), stat=stat ); IF( stat .NE. 0 ) &
     num_errors = num_errors + 1
  ```

  These arrays are *not* distributed either. That is, they are local to every node. The reason for this is the following computation that is performed towards the end of the program:

  ```
  DO it = 1, dif_ntemps
     DO i = 1, n
        iy = (i-1)/dif_nx + 1; ix = MOD(i-1,dif_nx) + 1
        DO k = 1, dif_npts
           temp(k,it) = temp(k,it) + xsines(k,ix) * ysines(k,iy) * x(i,it)
        ENDDO
     ENDDO
  ENDDO
  ```

  where the variables `temp` and `x` are *distributed* in the *second* index only. This means that the two inner loops, over `i` and `k`, will be performed locally, and only the outer loop, over `it`, will be distributed. Observe the somewhat unusual work partitioning: it is the time slices that we're distributing between the various processors, so that the whole history is calculated in parallel, simultaneously for various time slices.
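  The decoding of the collapsed index into its *x* and *y* harmonic numbers, `iy = (i-1)/dif_nx+1` and `ix = MOD(i-1,dif_nx)+1`, relies on Fortran's integer division. A Python sketch of the same arithmetic (made-up sizes, for illustration only):

  ```python
  nx, ny = 4, 3                 # stand-ins for dif_nx and dif_ny
  n = nx * ny

  decoded = []
  for i in range(1, n + 1):     # i runs over the collapsed harmonic index
      iy = (i - 1) // nx + 1    # iy = (i-1)/dif_nx + 1 (integer division)
      ix = (i - 1) % nx + 1     # ix = MOD(i-1, dif_nx) + 1
      decoded.append((ix, iy))

  print(decoded[:nx])           # the first nx entries all have iy = 1
  ```

  The decode is the exact inverse of the encoding `ij = nx*(j-1) + i` used for the grid points.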

  The `i` index of the middle loop corresponds to the power space index *l*, whereas the `k` index of the innermost loop runs over all physical grid points. The variable `x(i,it)`

  corresponds to *a*^{l}(*t*_{it}).

- 5. Now we allocate space for temperature at various points in space and time:

  ```
  !HPF$ DISTRIBUTE (*,BLOCK) :: temp
  ...
  ALLOCATE(temp(dif_npts, dif_ntemps), stat=stat); IF( stat .NE. 0 ) &
     num_errors = num_errors + 1
  ```

  The first index of this array runs over all points of the physical grid, and the second index numbers consecutive instances of time.
- 6. The next array is what we have called *F*^{l}_{l'} or **F**. This is a square array with dimensions *n* × *n*:

  ```
  !HPF$ DISTRIBUTE (*,BLOCK) :: f
  ...
  ALLOCATE( f(n,n), stat=stat ); IF( stat .NE. 0 ) num_errors = num_errors + 1
  ```

- 7. The diagonalisation of **F** will return a matrix of eigenvectors, apart from the array of eigenvalues. We have denoted that matrix in our derivations by **b**, and in the code it is called `b`. We have discussed the allocation of this distributed array above, when describing `ab`.
- 8. Variable `abt` contains the whole history of the coefficients **a**(*t*), for various instances of time, in the eigen-basis. That is, these coefficients have not been rotated back onto the original basis yet. In other words they are:

  *abt*^{l}(*t*) = e^{-λ_{l} *t*} *ab*^{l}

  `abt`, like `temp`, is a distributed array, where it is the various time instances that have been distributed over multiple nodes.

  ```
  !HPF$ DISTRIBUTE (*,BLOCK) :: abt
  ...
  ALLOCATE( abt(n,dif_ntemps), stat=stat ); IF( stat .NE. 0 ) &
     num_errors = num_errors + 1
  ...
  DO it = 1, dif_ntemps
     DO i = 1, n
        abt(i,it) = EXP( -lambda(i)*(it-1)*dif_delta_t ) * ab(i)
     ENDDO
  ENDDO
  ```
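  Numerically this is nothing more than a pointwise exponential damping of each eigen-coefficient; a Python sketch with invented values for the eigenvalues, coefficients and time step:

  ```python
  import math

  lam = [1.0, 2.5, 4.0]        # stand-in eigenvalues lambda(i) (made up)
  ab = [0.7, -0.2, 0.1]        # rotated initial coefficients ab(i) (made up)
  dt = 0.1                     # stand-in for dif_delta_t
  ntemps = 5                   # stand-in for dif_ntemps

  # abt(i,it) = EXP( -lambda(i)*(it-1)*dif_delta_t ) * ab(i)
  abt = [[math.exp(-lam[i] * (it - 1) * dt) * ab[i] for it in range(1, ntemps + 1)]
         for i in range(len(lam))]

  # At it = 1 no time has elapsed yet, so the first column is ab itself.
  print([row[0] for row in abt])
  ```

  Each coefficient decays independently, which is exactly why working in the eigen-basis pays off.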

- 9. The last array we allocate space for is `x`, the array that corresponds to the coefficients **a**(*t*) rotated back onto the original basis, i.e., **x**(*t*) = **b** **abt**(*t*). This array is returned by a call to subroutine `gemm`, which takes `abt` on input. Array `x` is distributed in the same way as `abt`:

  ```
  !HPF$ DISTRIBUTE (*,BLOCK) :: x
  ...
  ALLOCATE( x(n,dif_ntemps), stat=stat ); IF( stat .NE. 0 ) &
     num_errors = num_errors + 1
  ...
  CALL gemm(1.d0,b,abt,0.d0,x,'N','N')
  ...
  DO it = 1, dif_ntemps
     DO i = 1, n
        iy = (i-1)/dif_nx + 1; ix = MOD(i-1,dif_nx) + 1
        DO k = 1, dif_npts
           temp(k,it) = temp(k,it) + xsines(k,ix) * ysines(k,iy) * x(i,it)
        ENDDO
     ENDDO
  ENDDO
  ```

  I have quoted the computation of *T* again, this time in order to draw your attention to `x`, rather than to `xsines` and `ysines`.

- In summary we have the following arrays:

  ```
  lambda(harmonics)             replicated
  a(harmonics)                  distributed over harmonics
  ab(harmonics)                 distributed over harmonics
  a_rep(harmonics)              replicated (copy of a)
  xsines(grid, harmonics_in_x)  replicated
  ysines(grid, harmonics_in_y)  replicated
  temp(grid, time)              distributed over time
  f(harmonics, harmonics)       distributed over second harmonics arg
  b(harmonics, harmonics)       distributed over second harmonics arg
  abt(harmonics, time)          distributed over time
  x(harmonics, time)            distributed over time
  ```

- Once all the arrays have been allocated we check if there were any problems and bail out if there were:

  ```
  IF( num_errors .GT. 0 ) THEN
     WRITE(*,*) 'Error in allocating arrays in main'; STOP
  ENDIF
  ```

- The next two statements call two external procedures, which we're going to discuss later, and evaluate **a**_{0} and **F**. After the distributed array `a` (which stands for **a**_{0}) is returned, it is gathered on one of the processes and then scattered in full amongst all participating processes. In HPF this is simply done by saying:

  ```
  a_rep = a
  ```

  where `a` is a `DISTRIBUTED` array, and `a_rep` is a *replicated* array. All HPF objects are *replicated* by default, unless they are either `DISTRIBUTED` or `SEQUENCE`.

  In summary, the arrays **a**_{0} and **F** are evaluated by the following:

  ```
  CALL expand_temp_profile(a); a_rep = a
  CALL get_diffusion_matrix(f)
  ```

- The following two double-nested `DO` loops evaluate

  *xsines*_{j,i} = √(2/π) sin(*i* *x*_{j}) and *ysines*_{j,i} = √(2/π) sin(*i* *y*_{j})

  for *i* = 1, ..., *n*_{x} (respectively *n*_{y}) and *j* = 1, ..., *n*_{p}, where *n*_{p} is the total number of grid points. Function `dsin` is the `DOUBLE PRECISION` version of `sin`.

  ```
  DO i = 1, dif_nx
     DO j = 1, dif_npts
        t1 = i*dif_x(j); xsines(j,i) = SQRT(2.d0/pi) * dsin(t1)
     ENDDO
  ENDDO
  DO i = 1, dif_ny
     DO j = 1, dif_npts
        t2 = i*dif_y(j); ysines(j,i) = SQRT(2.d0/pi) * dsin(t2)
     ENDDO
  ENDDO
  ```
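  It is worth noting why the factor √(2/π) is there: on this grid, with Δ*x* = π/(*n* + 1) playing the role of the integration measure, the discretised sines are orthonormal. A quick Python check (my own sketch, hypothetical grid size):

  ```python
  import math

  numx = 200                           # stand-in for dif_numx (made up)
  delx = math.pi / (numx + 1)          # delx = PI/(dif_numx + 1.d0)
  xs = [delx * i for i in range(1, numx + 1)]

  def inner(k, kp):
      # Discrete version of (2/pi) * INTEGRAL sin(k x) sin(kp x) dx over (0, pi),
      # with delx playing the role of dx.
      return sum((2.0 / math.pi) * math.sin(k * x) * math.sin(kp * x) * delx
                 for x in xs)

  print(inner(1, 1), inner(1, 2), inner(2, 2))   # ~ 1, 0, 1
  ```

  In fact the discrete sine orthogonality on this particular grid is exact up to rounding, not just an approximation of the integral.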

- Now we're finally ready to solve the problem we have constructed. This is done simply by calling a PESSL routine called `syevx`, which solves the eigenvalue problem and returns both the eigenvalues and the eigenvectors:

  ```
  CALL syevx(f,lambda,'u', z=b)
  ```

  You will find this routine documented on page 598 of ``PESSL Guide and Reference''. The routine computes eigenvalues and eigenvectors of a *real symmetric* matrix.

  Is our matrix **F** real and symmetric? The matrix is certainly real, because of the way we have constructed it. The matrix is really based on sines and cosines, which we have replaced at some stage with equivalent Euler formulas employing exponentials. So we are not working here with auxiliary imaginary quantities that we would have to throw away at some stage. All entries of **F** are real.

  Is **F** symmetric? Consider two indexes: *l* = (*j* - 1) *n*_{x} + *k* and *l*' = (*j*' - 1) *n*_{x} + *k*'. Switching *l* and *l*' around corresponds to switching *j* and *j*' and then *k* and *k*' around. But this would not affect expressions such as *k k*' + *l*^{2}_{r} *j j*' and *k k*' - *l*^{2}_{r} *j j*' in front of the integrals. The integrals are expressed in terms of

  - *e*^{i(k + k')}, no change here
  - *e*^{i(j + j')}, ditto
  - *e*^{i(k - k')}, sign change here
  - *e*^{i(j - j')}, sign change here

  (3.61)

  so even though there is a change of sign in *k*^{-} and *j*^{-}, the integrals themselves are unaffected.

  To sum up, **F** is real and symmetric. Since it is symmetric, **F** = **F**^{T}, and thus **F** is also *normal*, hence the transformation **b** can be chosen to be orthonormal.
- Subroutine `syevx` has a somewhat complicated synopsis. It takes three mandatory arguments and then a plethora of optional arguments, all of which must be flagged with appropriate keywords, such as `abstol=` or `ifail=`.

  The mandatory arguments are:

  - **f** - the input matrix itself.
  - **lambda** - a preallocated space for a vector of eigenvalues, which will be returned in it on exit.
  - **uplo** - a character constant, which has to be either `'U'` or `'L'` (case doesn't matter), and which tells `syevx` whether it will find the valid values of **F** in the upper or in the lower part of matrix `f`. Because that matrix is symmetric, `syevx` references only half of it.

  The optional argument we use is `z=`. If this option is requested, subroutine `syevx` will return the orthonormal eigenvectors of **F** on exit, in preallocated (and correctly dimensioned) space given as an argument to this option. In our case this is:

  ```
  z=b
  ```
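  To see what `syevx` delivers, here is the eigen-decomposition of a 2×2 real symmetric matrix worked out in closed form in Python (an illustration of the mathematics only, not of the PESSL routine or its interface):

  ```python
  import math

  # A real symmetric 2x2 matrix standing in for F.
  f = [[2.0, 1.0],
       [1.0, 3.0]]

  # Closed-form eigenvalues of [[a, c], [c, d]].
  a, c, d = f[0][0], f[0][1], f[1][1]
  mean = (a + d) / 2.0
  r = math.hypot((a - d) / 2.0, c)
  lam = [mean - r, mean + r]                    # the analogue of lambda(:)

  # Orthonormal eigenvectors, stored column-wise like the array b.
  v1 = (lam[0] - d, c)
  n1 = math.hypot(*v1)
  v2 = (lam[1] - d, c)
  n2 = math.hypot(*v2)
  b = [[v1[0] / n1, v2[0] / n2],
       [v1[1] / n1, v2[1] / n2]]

  # Check F b_k = lambda_k b_k for both eigenpairs.
  for k in range(2):
      fb = [sum(f[i][j] * b[j][k] for j in range(2)) for i in range(2)]
      print([fb[i] - lam[k] * b[i][k] for i in range(2)])   # ~ [0, 0]
  ```

  The columns of `b` come out orthonormal, which is exactly the property the program relies on when it uses **b**^{T} in place of **b**^{-1}.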

- After a successful return from `syevx` we evaluate first

  *ab*^{l} = Σ_{l'} *b*^{l'}_{l} *a*^{l'}_{0}

  and store it on `ab`, and then

  *abt*^{l}(*t*) = e^{-λ_{l} *t*} *ab*^{l}

  and store it on `abt` for various values of *t*:

  ```
  DO i = 1, n
     ab(i) = 0.d0
     DO j = 1, n
        ab(i) = b(j,i)*a_rep(j) + ab(i)
     ENDDO
  ENDDO
  DO it = 1, dif_ntemps
     DO i = 1, n
        abt(i,it) = EXP( -lambda(i)*(it-1)*dif_delta_t ) * ab(i)
     ENDDO
  ENDDO
  ```

- All that we need to do next is to *unrotate* the solution away from the eigen-basis. Although this would be quite easy to do given matrix `b`, there is a special PESSL function called `gemm`, which stands for *General Matrix Matrix Product*, and which can do it for us easily and very efficiently. See page 497 of ``PESSL Guide and Reference'' for more information about it. This subroutine, depending on the various switches with which it's been called, will perform one of the following seven operations:

  - 1. **C** ← α**AB** + β**C**
  - 2. **C** ← α**A**^{T}**B** + β**C**
  - 3. **C** ← α**AB**^{T} + β**C**
  - 4. **C** ← α**A**^{T}**B**^{T} + β**C**
  - 5. **C** ← α**A**^{H}**B** + β**C**
  - 6. **C** ← α**AB**^{H} + β**C**
  - 7. **C** ← α**A**^{H}**B**^{H} + β**C**

  We call this function in the following way:

  ```
  CALL gemm(1.d0,b,abt,0.d0,x,'N','N')
  ```

  And the meaning of the various parameters passed to it is as follows:

  - **1.d0** - This is the value of α.
  - **b** - This is matrix **A**. In our case it is the matrix of eigenvectors `b`.
  - **abt** - This is matrix **B**. In our case this is the as yet unrotated eigensolution, which is stored on `abt`. The matrix here is of a peculiar type, because its second index numbers time slices.
  - **0.d0** - This is the value of β.
  - **x** - This is **C**, i.e., space allocated and correctly dimensioned for the result of the matrix multiplication. In our case these will be the coefficients *a*^{l}(*t*_{m}) (remember that the index *l* here is just an index, not a power - whenever I want to be very strict about what is a vector and what is a form, I place vector indexes upstairs, and at some stage we needed to get strict about these things in order to derive a correct rule for rotating the eigensolution).
  - **'N'** - This character constant tells `gemm` that matrix **A** should not be transposed.
  - **'N'** - This last parameter tells `gemm` that matrix **B** should not be transposed.
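  Stripped of its distribution machinery, the operation performed by this call, **C** ← α**AB** + β**C** with α = 1, β = 0 and neither matrix transposed, can be sketched in a few lines of Python (illustrative data; the real `gemm` is of course far more efficient):

  ```python
  def gemm(alpha, a, b, beta, c):
      """C <- alpha*A*B + beta*C for dense lists of lists (the 'N','N' case)."""
      n, m, p = len(a), len(b[0]), len(b)
      for i in range(n):
          for j in range(m):
              s = sum(a[i][k] * b[k][j] for k in range(p))
              c[i][j] = alpha * s + beta * c[i][j]
      return c

  b_mat = [[1.0, 2.0],
           [3.0, 4.0]]
  abt = [[1.0, 0.0, 2.0],       # one row per harmonic, one column per time slice
         [0.0, 1.0, 1.0]]
  x = [[0.0] * 3 for _ in range(2)]

  gemm(1.0, b_mat, abt, 0.0, x)  # analogue of CALL gemm(1.d0,b,abt,0.d0,x,'N','N')
  print(x)
  ```

  With β = 0 the previous contents of `x` are simply overwritten, which is exactly what the program wants.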

- Now we already have our solution in terms of *a*^{l}(*t*_{m}) (array `x`). All that needs to be done still is to convert it to values of *T*(*x*, *y*, *t*). This is done by the following two loops:

  ```
  DO it = 1, dif_ntemps
     DO k = 1, dif_npts
        temp(k,it) = 0.d0
     ENDDO
  ENDDO
  DO it = 1, dif_ntemps
     DO i = 1, n
        iy = (i-1)/dif_nx + 1; ix = MOD(i-1,dif_nx) + 1
        DO k = 1, dif_npts
           temp(k,it) = temp(k,it) + xsines(k,ix) * ysines(k,iy) * x(i,it)
        ENDDO
     ENDDO
  ENDDO
  ```
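  The same back-transformation, with hypothetical sizes and made-up input data, condenses into this Python sketch of the loop structure (zero-based indices, so the Fortran `+1`s disappear):

  ```python
  import math

  nx, ny = 2, 2                 # harmonics (stand-ins for dif_nx, dif_ny)
  n = nx * ny
  npts, ntemps = 6, 3           # grid points and time slices (made up)

  # Stand-in inputs: replicated sine tables and distributed coefficients x(i,it).
  xsines = [[math.sin((k + 1) * (p + 1) * 0.1) for k in range(nx)] for p in range(npts)]
  ysines = [[math.sin((j + 1) * (p + 1) * 0.2) for j in range(ny)] for p in range(npts)]
  xc = [[0.5 / (i + it + 1) for it in range(ntemps)] for i in range(n)]

  temp = [[0.0] * ntemps for _ in range(npts)]
  for it in range(ntemps):      # outer loop: the one distributed over time slices
      for i in range(n):        # collapsed harmonic index
          iy = i // nx          # (i-1)/dif_nx + 1, zero-based here
          ix = i % nx           # MOD(i-1, dif_nx) + 1, zero-based here
          for k in range(npts): # all physical grid points
              temp[k][it] += xsines[k][ix] * ysines[k][iy] * xc[i][it]
  ```

  Only the outer loop touches the distributed index, which mirrors why `xsines`, `ysines` and the inner loops can stay entirely local in the HPF version.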

- The following two loops with lots of `WRITE` statements simply write all the results on standard output. They're not worth quoting here. The only thing that you ought to remember is that in HPF programs only one node is responsible for I/O. There is no parallel I/O in standard HPF. Yet. So the HPF run-time system has to collect all those distributed arrays (and `temp` is distributed so that various time slices live on various processors) on a single node, and then that node does the printing on standard output or the writing on files.

  When trying to link HDF subroutines with HPF programs, you must not assume a priori that HPF will do that for you, though. You must declare all HDF-related variables as `SEQUENCE`.

- Last, but not least, we clean up, that is, free all the space that has been allocated for the arrays. This isn't strictly necessary, since we don't do anything more in this program, but in case we did, we might wish to reclaim and reuse that space for other tasks:

  ```
  DEALLOCATE(xsines); DEALLOCATE(ysines); DEALLOCATE(lambda)
  DEALLOCATE(temp); DEALLOCATE(a); DEALLOCATE(abt); DEALLOCATE(ab)
  DEALLOCATE(f); DEALLOCATE(dif_x); DEALLOCATE(dif_y)
  ```

  Observe that the various arrays used in this program have been very thoughtfully distributed or replicated, as the case may be, in order to minimize communication between processes.