next up previous index
Next: One Dimensional Parallel Fast Up: Fast Fourier Transform Previous: The Danielson Lanczos Algorithm

Fast Fourier Transform in Parallel

Below is the parallel version of our Danielson-Lanczos program, for a 2 dimensional data.

PROGRAM danielson_lanczos

  IMPLICIT NONE

  REAL(kind=8), PARAMETER :: pi = 3.141592653589793_8
  INTEGER :: i, j
  COMPLEX(kind=8), DIMENSION(0:7, 0:7) :: x, y
  REAL(kind=8) :: dx, dy

  DO i = 0,7
     DO j = 0,7
        dx = 2.0_8 * pi / 8.0_8 * i
        dy = 2.0_8 * pi / 8.0_8 * j
        x(i,j) = CMPLX(SIN(dx), 0.0_8, kind=8) * CMPLX(SIN(dy), 0.0_8, kind=8)
     END DO
  END DO
  y = x
  CALL fft(y); y = TRANSPOSE(y); CALL fft(y); y = TRANSPOSE(y)
  y = -2.0_8 * pi / 64.0_8 * y
  WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') x
  WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') y
  STOP 0

CONTAINS

  FUNCTION bit_reverse(i, size)

    IMPLICIT NONE

    INTEGER :: bit_reverse
    INTEGER, INTENT(in) :: i, size

    INTEGER :: length, temp

    temp = i
    DO length = size, 2, -1
       temp = ISHFTC(temp, -1, length)
    END DO
    bit_reverse = temp

  END FUNCTION bit_reverse

  SUBROUTINE swap (a, b)

    IMPLICIT NONE

    COMPLEX(kind=8), DIMENSION(:) :: a
    COMPLEX(kind=8), DIMENSION(SIZE(a)) :: b
    COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: hold

    ALLOCATE(hold(SIZE(a)))
    hold = a
    a = b
    b = hold
    DEALLOCATE(hold)

  END SUBROUTINE swap

  SUBROUTINE fft(x)

    IMPLICIT NONE

    COMPLEX(kind=8), DIMENSION(0:, 0:), INTENT(inout) :: x

    DOUBLE PRECISION, PARAMETER :: pi = 3.141592653589793_8
    INTEGER :: length, number_of_bits, i, j, mmax, istep, m
    COMPLEX(kind=8) :: wp, w, ws
    REAL(kind=8) :: theta
    COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: temp

    length = SIZE(x, 2)
    number_of_bits = INT(LOG(REAL(length))/LOG(2.0))
    IF (2**number_of_bits .NE. length) THEN
       WRITE(*,*) 'fft: error: input data length must be a power of 2'
       STOP 10
    ELSE
       DO i = 1, length-2
          j = bit_reverse(i, number_of_bits)
          IF (j .GT. i) CALL swap(x(:,i), x(:,j))
       END DO
    END IF

    ALLOCATE(temp(SIZE(x, 1)))
    mmax = 1
    DO
       IF ( length .LE. mmax) EXIT
       istep = 2 * mmax
       theta = - pi/(mmax)
       wp=CMPLX(-2.0_8 * SIN(0.5_8*theta)**2, SIN(theta), kind=8)
       w=CMPLX(1.0_8, 0.0_8, kind=8)
       DO m = 1, mmax
          ws=w
          DO i=m-1, length-1, istep
             j=i+mmax
             temp = ws*x(:,j)
             x(:,j) = x(:,i) - temp
             x(:,i) = x(:,i) + temp
          END DO
          w = w*wp + w
       END DO
       mmax = istep
    END DO
    DEALLOCATE(temp)

  END SUBROUTINE fft

END PROGRAM danielson_lanczos

This program differs very little from our original one-dimensional program.

I will outline those differences by discussing diffs between the two source files:

1.
6,16c6,10
<   INTEGER :: i, j
<   COMPLEX(kind=8), DIMENSION(0:7, 0:7) :: x, y
<   REAL(kind=8) :: dx, dy
< 
<   DO i = 0,7
<      DO j = 0,7
<         dx = 2.0_8 * pi / 8.0_8 * i
<         dy = 2.0_8 * pi / 8.0_8 * j
<         x(i,j) = CMPLX(SIN(dx), 0.0_8, kind=8) * CMPLX(SIN(dy), 0.0_8, kind=8)
<      END DO
<   END DO
---
>   INTEGER :: i
>   COMPLEX(kind=8), DIMENSION(8) :: x, y
> 
>   x = (/ (i, i=0,7) /); x = 2.0_8 * pi / 8.0_8 * x; x = SIN(x) + COS(2 * x) &
>       - SIN(3 * x)
This is the first difference: this time my arrays x and y are two dimensional. For clarity, instead of using an implicit DO loop, I have used explicit calculation, where dx and dy measure the distance from the origin, (0,0), and the input data is of the form of $\sin{x}\sin{y}$.
2.
18,21c12,15
<   CALL fft(y); y = TRANSPOSE(y); CALL fft(y); y = TRANSPOSE(y)
<   y = -2.0_8 * pi / 64.0_8 * y
<   WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') x
<   WRITE(*,'(8 ( 8 ("(", f6.2, ",", f6.2, ") "), /))') y
---
>   CALL fft(y)
>   y = 2.0_8 * pi / 8.0_8 * y
>   WRITE(*,'(8 ("(", f6.3, ",", f7.3, ")", /))') x
>   WRITE(*,'(8 ("(", f6.3, ",", f7.3, ")", /))') y
Here, instead of just calling fft once, we call it twice. This is still the same fft as before, but this time it operates on whole columns of matrix x, not on scalar entries of vector x. Every manipulation that we have discussed before is now performed on all column entries at once. This is where the parallelism comes in. There is no communication between different rows, hence, if we were to implement this program in High Performance Fortran, the matrix should be distributed row-wise, i.e., different rows should live on different processors. This way our column manipulations will not require expensive inter-process communications.

The 2-dimensional Fourier Transform works as follows: first we need to find Fourier Transform in the x direction, and then we need to find Fourier Transform on the first Fourier Transform in the y direction.

Because our subroutine fft does only one direction and it is always x, in order to evaluate Fourier Transform in the y direction we need to transpose the matrix, thus interchanging the directions, and then call subroutine fft again.

The returned result must be transposed back to normal.

This is the only place where communication is involved in our program. Matrix transpose is a fairly costly operation.

Observe that I have also changed the value of the scaling constant. In this case I wanted the value to match what we have used in our heat diffusion program for the initial temperature distribution, T0(x,y), so that we can make a quick check if the program runs and if it returns correct results.

3.
43a35,36
>     COMPLEX(kind=8) :: a, b
>     COMPLEX(kind=8) :: hold
45,51d37
<     IMPLICIT NONE
< 
<     COMPLEX(kind=8), DIMENSION(:) :: a
<     COMPLEX(kind=8), DIMENSION(SIZE(a)) :: b
<     COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: hold
< 
<     ALLOCATE(hold(SIZE(a)))
55d40
<     DEALLOCATE(hold)
This is the change from subroutine swap. This time we're swapping two columns of a matrix, no longer two scalar numbers. The columns must be of the same length, of course, and the auxialiary variable hold must be an array too.

This array must be allocated once we know the size of the arrays to be swapped, and then deallocated when it is no longer needed.

4.
63c48
<     COMPLEX(kind=8), DIMENSION(0:, 0:), INTENT(inout) :: x
---
>     COMPLEX(kind=8), DIMENSION(0:), INTENT(inout) :: x
66,67c51
<     INTEGER :: length, number_of_bits, i, j, mmax, istep, m
<     COMPLEX(kind=8) :: wp, w, ws
---
>     COMPLEX(kind=8) :: wp, w, ws, temp
69c53
<     COMPLEX(kind=8), DIMENSION(:), ALLOCATABLE :: temp
---
>     INTEGER :: length, number_of_bits, i, j, mmax, istep, m
Here we're already looking at declarations in subroutine fft. Observe that x is now an array of rank 2, or, as I prefer to put it, a two-dimensional array.

Variable temp must be made into an array this time. It is going to hold ws*x(:,j). And it must be made allocatable to, because, we don't know its size a priori.

5.
71c55
<     length = SIZE(x, 2)
---
>     length = SIZE(x)
The length of the array of columns is not obtained by calling SIZE on the second dimension of the matrix.
6.
79c63
<           IF (j .GT. i) CALL swap(x(:,i), x(:,j))
---
>           IF (j .GT. i) CALL swap(x(i), x(j))
Instead of swapping two scalar entries of an array, we swap two whole columns. But if each row lives on its own separate processor, this operation should take exactly the same time as swapping two scalar entries of a one-dimensional array.
7.
83d66
<     ALLOCATE(temp(SIZE(x, 1)))
95,97c78,80
<              temp = ws*x(:,j)
<              x(:,j) = x(:,i) - temp
<              x(:,i) = x(:,i) + temp
---
>              temp = ws*x(j)
>              x(j) = x(i) - temp
>              x(i) = x(i) + temp
103d85
<     DEALLOCATE(temp)
Array temp must be able to store the whole column, so the number of entries that we must allocate for it corresponds to the number of rows, which is the first dimension in an array (rember that the ordering is always rows followed by columns).

And this is it. This is all we had to do, to parallelise our one dimensional Fast Fourier Transform program. If the program is to run on the SP and if it is to be compiled by HPF, we need to add the following HPF directive in main:

!HPF$ DISTRIBUTE (BLOCK, *) x, y
This directive should be matched by corresponding directives in subroutines of our program.

Does this program work?

Here is how it compiles and runs:

gustav@blanc:../src 16:51:02 !575 $ f90 -o lanczos-p lanczos-p.f90
gustav@blanc:../src 16:53:40 !576 $ ./lanczos-p
(  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) 
(  0.00,  0.00) (  0.50,  0.00) (  0.71,  0.00) (  0.50,  0.00) (  0.00,  0.00) ( -0.50,  0.00) ( -0.71,  0.00) ( -0.50,  0.00) 
(  0.00,  0.00) (  0.71,  0.00) (  1.00,  0.00) (  0.71,  0.00) (  0.00,  0.00) ( -0.71,  0.00) ( -1.00,  0.00) ( -0.71,  0.00) 
(  0.00,  0.00) (  0.50,  0.00) (  0.71,  0.00) (  0.50,  0.00) (  0.00,  0.00) ( -0.50,  0.00) ( -0.71,  0.00) ( -0.50,  0.00) 
(  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) 
(  0.00,  0.00) ( -0.50,  0.00) ( -0.71,  0.00) ( -0.50,  0.00) (  0.00,  0.00) (  0.50,  0.00) (  0.71,  0.00) (  0.50,  0.00) 
(  0.00,  0.00) ( -0.71,  0.00) ( -1.00,  0.00) ( -0.71,  0.00) (  0.00,  0.00) (  0.71,  0.00) (  1.00,  0.00) (  0.71,  0.00) 
(  0.00,  0.00) ( -0.50,  0.00) ( -0.71,  0.00) ( -0.50,  0.00) (  0.00,  0.00) (  0.50,  0.00) (  0.71,  0.00) (  0.50,  0.00) 

(  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) 
(  0.00,  0.00) (  1.57,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) ( -1.57,  0.00) 
(  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) 
(  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) 
(  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) 
(  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) 
(  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) 
(  0.00,  0.00) ( -1.57,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  0.00,  0.00) (  1.57,  0.00) 

 STOP 0
gustav@blanc:../src 16:53:42 !577 $
Comparison with the result discussed in P573 for $T_0(x,y) = \sin{x}\sin{y}$ shows that we're right on the spot, i.e.,

\begin{eqnarray*}a_{1,1} &=& \pi/2 \approx 1.57 \\
a_{1,7} &=& -\pi/2 \\
a_{7,1} &=& -\pi/2 \\
a_{7,7} &=& \pi/2
\end{eqnarray*}


Although in our example we have worked with a square data matrix, subroutine fft should work with rectangular matrices too.


next up previous index
Next: One Dimensional Parallel Fast Up: Fast Fourier Transform Previous: The Danielson Lanczos Algorithm
Zdzislaw Meglicki
2001-02-26