DSP_fft16x16
4-10
To vectorize the FFT, it is desirable to access the twiddle factor array using
double word wide loads and fetch the twiddle factors needed. To do this, a
modified twiddle factor array is created, in which the factors WN/4, WN/2,
W3N/4 are arranged to be contiguous. This eliminates the separation between
twiddle factors within a butterfly. However, this implies that we maintain a
redundant version of the twiddle factor array as the loop is traversed from one
stage to another. Hence, the size of the twiddle factor array increases as
compared to the normal Cooley Tukey FFT. The modified twiddle factor array
is of size “2 * N” where the conventional Cooley Tukey FFT is of size “3N/4”
where N is the number of complex points to be transformed. The routine that
generates the modified twiddle factor array was presented earlier. With the
above transformation of the FFT, both the input data and the twiddle factor
array can be accessed using double-word wide loads to enable packed data
processing.
The final stage is optimized to remove the multiplication as w0 = 1. This stage
also performs digit reversal on the data, so the final output is in natural order.
In addition, if the number of points to be transformed is a power of 2, the final
stage applies a radix-2 pass instead of a radix-4. In any case, the outputs are
returned in normal order.
The code performs the bulk of the computation in place. However, because
digit-reversal cannot be performed in-place, the final result is written to a
separate array, y[].
Benchmarks Cycles (6 * nx/8 + 19) * ceil[log
4
(nx) − 1] + 8*nx/8 + 30
Codesize 864 bytes