Accelerating 3D FFT with Half-Precision Floating Point Hardware on GPU

> Students: Yanming Kang (HKUST) and Tullia Glaeser (Tulane) Mentors: Ed D'Azevedo (ORNL) and Stan Tomov (ICL, UTK)

Discrete Fourier Transform (DFT)

& Fast Fourier Transform (FFT)

$$X(k) = \sum_{n=0}^{N-1} x(n) e^{-(i\frac{2\pi nk}{N})}$$

DFT  $[O(N^2)]$ : for num. computations in digital signal processing (incl fast convolution, spectrum analysis)

- N discrete time series signals  $\rightarrow$ (into) N discrete frequency components (amplitude + Ο phase)
- FFT [O(NlogN)]: Fast algorithms for DFT -- widely used num. Algorithm -- plays vital role in many scientific and engineering applications (image processing, speech recog., data analysis, large scale simulations
  - Maj. time in large comp. apps Ο
  - Ο
- i. Symmetry of DFT:  $X_{N+k} = X_k = X_{k-1} x (n) e^{-(i\frac{2\pi nk}{N})}$ ii. Divide DFT alg. into odd + even  $p = \sum_{m=0}^{N/2-1} x_{2m} \cdot e^{-i2\pi k m / (N/2)} + e^{-i2\pi k / N} \sum_{m=0}^{N/2-1} x_{2m+1} \cdot e^{-i2\pi k m / (N/2)}$ 
  - $\rightarrow$  halved the computations to be O(2M) where M is half of  $N \rightarrow O(N)$

Keep doing this recursively  $\rightarrow$  halves computation cost every time  $\rightarrow O(NlogN)$ 1.

To keep improving performance/time -- implement it on GPU Ο

# Implementing 1D, 2D, & 3D FFT

• 1D FFT of x:

- a. x = 1D array, B (4 x N/4) matrices or 1 (4 x N/4 x B) tensor (B = # of batches)
- b. Find DFT of each of those matrices
- c. Multiply by twiddle factor (W =  $e^{-i2} kn/N$ )

• 2D FFT:

- a. x = (m x n x batch)
- b. Reshape x to be 1D array [m\*n\*batch, 1, 1]
- c. Call 1D FFT on it
- d. Transpose & do 1D FFT in other direction
- 3D (breakdown shown in pic):
  - a. Take 1D FFT in each direction OR
  - b. Take 2D FFT in 2 directions & 1D in last dir.
- MATLAB + CUDA
  - a. Currently use CUBLAS/CUTLASS and Radix-4





# Mixed Precision & Tensor Cores

- Tensor: "a mathematical object analogous to but more general than a vector, represented by an array of components that are functions of the coordinates of a space" -- large dense matrix
- NVIDIA Volta microarchitecture ft. specialized computing units, *Tensor Cores*
- tensore core support → mixed precision -- matrix multiplication operations done w/ halfprecision input data (FP16)-- the rest FFT done on single precision data (FP32)
- FP16 arithmetic enables Volta Tensor Cores which offer 125 TFlops of computational throughput on generalized matrix-matrix multiplications (GEMMs) and convolutions, an 8X increase over FP32
- Matrix entries multiplied in neural networks are small w/ respect to value of prev. Iter. → can
  use half precision, result is still small in val. → result accumulated to other much larger val., in
  single precision to avoid precision loss
- Deep neural network training = tolerant to precision loss up to certain degree



# Inefficiency with Transform -- volta\_sgemm\_fp16\_128x64\_nn



# The FFT (radix-n1) in matrix form



-1 1

N-1

 $X(k) = \sum x(n) e^{-\left(i\frac{2\pi nk}{N}\right)}$ 

-1

# The algorithm

//Batched 1d FFT of length N
Radix\_4\_FFT\_recursion(X, N, Batch):
 If N=4 then
 Return F4 \* X

(batched gemm) //See X as a (4 by N/4 by Batch) array permute(X, [2,1,3]) //X as a (N/4 by 4 by Batch) array Y <- Radix\_4\_FFT\_recursion(X, N/4, Batch\*4) Multiply elementwise Y with W\_N Return Y \* F4

(batched gemm)

## End

Splitting is done before gemm Combining is done after gemm  $x(32) = s1(32) * x_hi(16) + s2(32) * x_lo(16),$ Gemm is done to x\_hi, x\_lo



# CUTLASS (CUDA Templates for Linear Algebra Subroutines)

The most expensive step in the recursion: the second batched gemm

```
Result1 = X * F4_re; Result2 = X * F4_im
```

where

F4\_re, F4\_im: 4 by 4, fp16 X=[X\_re\_hi, X\_re\_lo, X\_im\_hi, X\_im\_lo]: m by 4 by Batch\*4, fp16 Result1, Result2: m by 4 by Batch\*4, fp32

For batch size = B, length = N input, will do gemms for: m = N, Batch = B m = N/4, Batch = 4B ... m = 4, Batch = NB/4

cuBlas is not optimized for slender matrix multiplication (volta\_sgemm\_fp16\_128x64\_nn)

# CUTLASS vs cuBlas m by 4 \* 4 by 4 matrix multiplication

| m     | Batch size | cuBlas(ms) | cutlass(ms) | Mean error  |
|-------|------------|------------|-------------|-------------|
| 64    | 1048576    | 40.7779    | 13.4457     | 1.23754e-12 |
| 256   | 65536      | 5.10469    | 3.07621     | 1.27887e-12 |
| 256   | 262144     | 20.4031    | 12.2688     | 1.24481e-12 |
| 1024  | 16384      | 5.07802    | 3.00108     | 1.23879e-12 |
| 1024  | 65536      | 20.2993    | 11.9628     | 1.24625e-12 |
| 4096  | 4096       | 5.08486    | 3.00046     | 1.26754e-12 |
| 4096  | 16384      | 20.2965    | 11.882      | 1.22616e-12 |
| 16384 | 4096       | 44.524     | 11.8838     | 1.23812e-12 |

# In the Near Future: Radix-2 vs. Radix-4

- Radix-2 algorithms: 2<sup>v</sup> data points
  - *a. decimation-in-time* (DIT): Simplest + most common form of Cooley-Tukey alg
    - i. DFTs of even- & odd-indexed inputs, repeat recursively (O(NlogN))
  - *b. Decimation-in-frequency* (DIF): (O(NlogN)) -- divide + conquer
    - i. split DFT into 2 summations  $[(0 \rightarrow N/2) + (N/2 \rightarrow N)]$
    - ii. Split those split summations into even & odd
    - iii. Repeat recursively
- Currently using radix-4 (4<sup>v</sup> data pts)
- Why radix-2?
  - a. DFT of identity [2,2] matrix = real matrix (not complex) & exactly representable in FP16
  - b. Use tensor cores to implement it
  - c. ALTHOUGH radix-4 = more efficient when  $N = 2^{v}$



# Citations

- <u>https://www.jics.utk.edu/files/images/recsem-reu/2018/fft/Report.pdf</u>
- https://jakevdp.github.io/blog/2013/08/28/understanding-the-fft/
- https://arxiv.org/pdf/1803.04014.pdf
- http://www.cmlab.csie.ntu.edu.tw/cml/dsp/training/coding/transform/fft.html