ApproxFFT: Fastest FFT Function for Arduino

1,753

11

20

Introduction: ApproxFFT: Fastest FFT Function for Arduino

FFT or Fast Fourier transform is one of the most important tools in signal processing. This can be used to detect frequencies with good speed. However, it is a computationally-intensive process and takes a good amount of processing power as well as memory especially on Arduino. This makes it difficult to use for real-time applications where multiple FFT results required every second.

In this instructable, the fastest program to perform FFT on Arduino is presented. I am naming it ApproxFFT as there are so many approximations taken at multiple stages. However, with very little compromise inaccuracy, I was able to achieve around 3 times speed.

If you want to only know how to use it directly go to step 5.

The complete project was also explained in a half-hour long explanation video. you may choose to go through that or read this tutorial.

I have prepared two code previously for the same application that can be accessed here:
EasyFFT: this code gives accurate output for FFT, however, it is relatively slow and consumes high memory.
It is recommended to go through this tutorial for background on FFT prior to this instructable.

QuickFFT: This code gives a very fast output. However, amplitudes are far off and not usable for most of the applications. It also gives multiple peaks that might mislead the results.

Step 1: Need of the ApproxFFT Function

If you refer to the attached image, it is very clear that QuickFFT has some significant advantages over conventional approaches (speed calculated with inbuilt sine/cosine functions for EasyFFT). It is almost 4 times faster but lacks accuracy. As can be seen in the second image, the amplitude is way off. The main reason behind this the fact that we have used the square wave (Approximate sine wave!!!) that consists of multiple frequencies (harmonics). there are multiple waves present in the result that makes it difficult for various applications.

Here we have assumed the sine/cosine wave as a square wave. If we take some better approximation for these waves we can reach close to the exact output. In this code better approximations were considered for these waves. we also have some provisions to control the accuracy of these waves. By tweaking this setting we can also play with various accuracy/speed relations.

Step 2: Fast Sine and Cosine Function

If you refer to the previous EasyFFT function, we had two options available. One of those was accurate one to make use of inbuilt sine/cosine function which are accurate but slow. If we go with another method where we have used a lookup table, speed was improved slightly with a small loss inaccuracy.

Here completely different approach was used to calculate this value, which is around 10-12 times faster than conventional ones. These function intensively make use of only bitshift operation for multiplication and (approximate) division. These operations are significantly fast than normal division or multiplication. however, these may cause a precision loss that we need to somehow take care of.

below two lines are the ones that made the speed difference between EasyFFT and QuickFFT. Typically first we have to calculate the sine/cosine for a given value. once the value is available that needs to be multiplied to some another float value. Both of these processes are slow and takes too much time.

tr=c*out_r[i10+n1]-s*out_im[i10+n1];<br>ti=s*out_r[i10+n1]+c*out_im[i10+n1];  // c is cosine and s is sine of some value

Here we have used something similar to the binary shoring algorithm. It will directly give Approximate output for A*sin(θ) and we do not need to do multiplication. In the first plot, ArcsinX vs X plot is shown. a similar plot is a store as the table in Arduino. where angles are mapped from 0-1024 which are equivalent to 0-360 degrees. Arcsine values are also considered as multiple of 128.

this is how the flow of calculation works: here we are ignoring negative values and will only consider for 0-90. (i.e. 500*sin 50 (in degree))

1. For any value (i.e. 500) sine multiple can be 0 (for 0 degrees) to that value (for 90 degrees) (i.e.0-500). we will check angle value for midpoint. we will half the value of the input (500) and check the angle for 0.5. based on that we can check out output will be 0-mid point or midpoint-max (in our case angle for 0.5 will be 30 degrees which is lower than 50, so we can conclude that the output will be somewhere in between 250-500).
At this point of time, we have halved the size of the possible range. (previous 0-500 to 250-500 )

2. newly defined range we will again do a similar step repeated. In our example, the midpoint will be 375, where we will check the angle for .75 that will be 48 degrees. so our answer will lie in between 375 to 500.

similar steps need to be repeated multiple times for accurate output. As can be shown in the second image (graph). The more level we go to, the output will be more close to the exact value.

Here we can perform all operations by only a bit shift and we also eliminated the floating multiplication.

Step 3: Fast Root of Squared Sum (FastRSS)

This function makes some approximation for the RSS value. its impact on overall speed is not very significant. However, it saves a few Milliseconds of time.

The attached plot shows the value of error if we assume root off squared sum output same as bigger value. The Error associated with this assumption decreases as the ratio increases. In this code, if the ratio exceeds it simply assumes a larger value as output that saves time. If the ratio is lower than that, based on the value some predefined scaling is set that is applied by doing some repetitions. All these values are stored as the RSSdata.

Step 4: Overall Calculation Flow:

Flow for overall calculation will be as follow.

1. Input data is scaled to +512 to -512: this step is important to preserve the precision of data. as we are using bit shift operation, there will be precision loss in case of an odd number. The bigger the number lower the loss.

2. Bit reversed order is generated,

3. Input data is ordered as per generated bit reverse order.

4. Frequency transformed is performed. fast sine and cosine were used wherever required. after every loop, the value of all integer checks if the amplitude is higher than 15000, both real and imaginary arrays are divided by 2.

wherever the scaling is performed, scale value is recorded.

5. Root of the squared sum is calculated using the FastRSS function.

6. Max value is detected and returned as output.

Step 5: Application

1. Lookup data needs to be declared as a global variable. we need to simply paste the below data at the top of the code.

//---------------------------------lookup data------------------------------------//<br>byte isin_data=
{0,  1,   3,   4,   5,   6,   8,   9,   10,  11,  13,  14,  15,  17,  18,  19,  20,
22,  23,  24,  26,  27,  28,  29,  31,  32,  33,  35,  36,  37,  39,  40,  41,  42,
44,  45,  46,  48,  49,  50,  52,  53,  54,  56,  57,  59,  60,  61,  63,  64,  65,
67,  68,  70,  71,  72,  74,  75,  77,  78,  80,  81,  82,  84,  85,  87,  88,  90,
91,  93,  94,  96,  97,  99,  100, 102, 104, 105, 107, 108, 110, 112, 113, 115, 117,
118, 120, 122, 124, 125, 127, 129, 131, 133, 134, 136, 138, 140, 142, 144, 146, 148,
150, 152, 155, 157, 159, 161, 164, 166, 169, 171, 174, 176, 179, 182, 185, 188, 191,
195, 198, 202, 206, 210, 215, 221, 227, 236};
unsigned int Pow2={1,2,4,8,16,32,64,128,256,512,1024,2048,4096};
//---------------------------------------------------------------------------------//

2. ApproxFFT, fast_sine,fast_cosine, and fastRSS function need to be pasted at the end of the code.

3. Using function:

float f=Approx_FFT(data,sample,sampling_rate);

This function will return the value of frequency with max amplitude by default. This is exactly the same as EasyFFT and QuickFFT function.

• first is the array of which we need to perform FFT,
• second is the number of samples: ideally, it should be 2^n, which can be 2, 4, 8, 16, 32, 64, 128,..
here the max number of sample is limited by the memory available
• the third input is the sampling rate.

4. Accuracy setting:

this value can be set from 1 to 7. higher the number better accuracy. here improving the accuracy will improve the fit of our approximate sine wave. By default, the accuracy value is set to 5. In most cases, this value works well. All result and speed claims are based on this value. Speed and accuracy can be tweaked by changing this value.

int fast_sine(int Amp, int th)<br>{
int temp3,m1,m2;
accuracy=5;    // set it value from 1 to 7, where 7 being most accurate but slowest
// accuracy value of 5 recommended for typical applicaiton
while(th>1024){th=th-1024;}   // here 1024 = 2*pi or 360 deg
while(th<0){th=th+1024;}
.
.
.

5. This code can be modified to display (and use) Raw output or amplitude of various frequencies. the output will as multiple of 2^n. As integer can only hold up to +-32000 values, the output is represented as multiple.

Step 6: Conclusion

All the test shown on the above image is done on Arduino mega with an accuracy level of 5. below are some of the significant observation on the test:

• Speed is more than 3 times faster than conventional FFT,
• Memory consumption is low (almost half),
• The output is comparable with exact values (low error),

Hope this code will useful for projects. In case of any query or feedback, do let me know by comment or Email.

Update (06/05/21): Code modified for Arduino due. Participated in the
Anything Goes Contest

• Recommendations

Thanks for the work to do this. I tried a dual frequency waveform and it worked accurately and quickly to find the peak frequency.

Is it possible to see the frequency bins vs the amplitude ? ( ie rather than just a peak find )

I thought maybe the code to uncomment, would be
serial print out_im[i] vs out_r[i] but the frequencies do not match. Ie you can see the max amplitude bin - but the frequency doesnt match the accurate peak frequency returned by the routine.

Thanks !

Hello. Thank you for sharing your code. I tried to upload the code to my ESP32 but an error occurs and it will reboot endlesly.

My code is the one in the example with an array of data
int data={0, 103, 195, 267, 313, 324, 308, 257, 181, 87, -16, -118, -207, -276, -317, -326};
This is de error

rst:0xc (SW_CPU_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:1
ho 0 tail 12 room 4
entry 0x400806b4
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0x400d0ec3 PS : 0x00060330 A0 : 0x800d119f A1 : 0x3ffb1ce0
A2 : 0x3ffbdbb4 A3 : 0x00000084 A4 : 0x3ffb1ce0 A5 : 0x00000001
A6 : 0x3ffb1df0 A7 : 0x3ffb1f00 A8 : 0xd6927248 A9 : 0x00000000
A10 : 0x00000000 A11 : 0x00000000 A12 : 0x3ffb1ce0 A13 : 0x00000000
A14 : 0x00000001 A15 : 0x3ffbdc08 SAR : 0x0000001c EXCCAUSE: 0x0000001c
EXCVADDR: 0xd6927248 LBEG : 0x400d0d48 LEND : 0x400d0d56 LCOUNT : 0x00000000

ELF file SHA256: 0000000000000000

Backtrace: 0x400d0ec3:0x3ffb1ce0 0x400d119c:0x3ffb1f90 0x400d1af1:0x3ffb1fb0 0x400860dd:0x3ffb1fd0

I have tried this, it is very nice and easy to use, however I have two problems, there are random data at output when input is idle and sometimes it outputs harmonic of the signal. ie. 2400 instead of 1200. I biased the analog input and I think hardware is okay. It is idling at 512-513. Are these normal or am I missing something?

1. it might capture minute noise and given the frequency of that data which does not have any meaning. To avoid this you may set some limits on amplitude (check my code on a project on arduino chord detection). Like, if the standard deviation of data is higher than some value, it should perform FFT and give output, else do nothing.

2. for many instruments it is possible to have the amplitude of harmonics being higher than the fundamental frequency. you can check the FFT of such data using some app. if you still face the same problem do share the raw data (along with timestamp) I will try to work on it.

Thank you, I have tried same method and worked fine, nothing on idle. However, I am trying to decode 5 tone selective calling signals on unsquelched radio channel. You may think like some single frequency notes(not dtmf) between 930-2400 hz, in broadband background noise. I tried to get results from 2 cycles and register it if they are equal, due to false positives it missed most of digits. I think I need something that requires signal which is longer than predetermined time. I am not sure is the fft right algorithm for this.

Hi, if you want to detect only 5 tone than doing fft is overkill (also might not be suitable for real time use). Goertzel algorithm are used for such applications. That's will consume far less mamory and computation. It is also good for real time application like you are doing.

Hi, 5 tone is a widely used name actually, it is generally 5 digit codes that may contain 12-16 different frequency codes. I have tried goertzel but, it takes too much computation time to run 16 scans on sample. You may get more information about it here: https://www.sigidwiki.com/wiki/CCIR_Selcall

hello, i'm trying but really not sure what i'm doing :(
it is way beyond my knowledge.

//---------------------------------lookup data------------------------------------//
byte isin_data=
{0, 1, 3, 4, 5, 6, 8, 9, 10, 11, 13, 14, 15, 17, 18, 19, 20,
22, 23, 24, 26, 27, 28, 29, 31, 32, 33, 35, 36, 37, 39, 40, 41, 42,
44, 45, 46, 48, 49, 50, 52, 53, 54, 56, 57, 59, 60, 61, 63, 64, 65,
67, 68, 70, 71, 72, 74, 75, 77, 78, 80, 81, 82, 84, 85, 87, 88, 90,
91, 93, 94, 96, 97, 99, 100, 102, 104, 105, 107, 108, 110, 112, 113, 115, 117,
118, 120, 122, 124, 125, 127, 129, 131, 133, 134, 136, 138, 140, 142, 144, 146, 148,
150, 152, 155, 157, 159, 161, 164, 166, 169, 171, 174, 176, 179, 182, 185, 188, 191,
195, 198, 202, 206, 210, 215, 221, 227, 236};
unsigned int Pow2={1,2,4,8,16,32,64,128,256,512,1024,2048,4096};
//---------------------------------------------------------------------------------//
int data_in={};
int data={};
long int t;
long int i;
long int f;
void setup()
{
Serial.begin(250000);
}
void loop() {
float f=Approx_FFT(data,512,100);
Serial.println(f);
t=micros();
for (int i = 0; i <= 255; i++){
//for (int i = 0 ; i { as per programer
data_in[i]=analogRead(A7); // Set you analog pin
delayMicroseconds(200); // depending on your application
}
t=micros()-t;
t=128000000/t;
ApproxFFT(data_in,128,t);
delay(1000); // as per application
}
//-----------------------------FFT Function----------------------------------------------//
/*
Code to perform High speed and Accurate FFT on arduino,
setup:
1. in[] : Data array,
2. N : Number of sample (recommended sample size 2,4,8,16,32,64,128,256,512...)
3. Frequency: sampling frequency required as input (Hz)
It will by default return frequency with max aplitude,
if you need complex output or magnitudes uncomment required sections
If sample size is not in power of 2 it will be clipped to lower side of number.
i.e, for 150 number of samples, code will consider first 128 sample, remaining sample will be omitted.
For Arduino nano, FFT of more than 256 sample not possible due to mamory limitation
Code by ABHILASH
Contact: abhilashpatel121@gmail.com
Documentation & details: https://www.instructables.com/member/abhilash_pat...
*/
float Approx_FFT(int in[],int N,float Frequency)
{
int a,c1,f,o,x,data_max,data_min=0;
long data_avg,data_mag,temp11;
byte scale,check=0;
data_max=0;
data_avg=0;
data_min=0;
for(int i=0;i<12;i++) //calculating the levels
{ if(Pow2[i]<=N){o=i;} }
a=Pow2[o];
int out_r[a]; //real part of transform
int out_im[a]; //imaginory part of transform
for(int i=0;i<a;i++) //getting min max and average for scalling
{ out_r[i]=0; out_r[i]=0;
data_avg=data_avg+in[i];
if(in[i]>data_max){data_max=in[i];}
if(in[i]<data_min){data_min=in[i];}
}
data_avg=data_avg>>o;
scale=0;
data_mag=data_max-data_min;
temp11=data_mag;
for(int i;i<128;i++) //scalling data from +512 to -512
if(data_mag>1024)
{while(temp11>1024)
{temp11=temp11>>1;
scale=scale+1;
}
}
if(data_mag<1024)
{while(temp11<1024)
{temp11=temp11<<1;
scale=scale+1;
}
}
if(data_mag>1024)
{
for(int i=0;i<a;i++)
{ in[i]=in[i]-data_avg;
in[i]=in[i]>>scale;
}
scale=128-scale;
}
if(data_mag<1024)
{ scale=scale-1;
for(int i=0;i<a;i++)
{
in[i]=in[i]-data_avg;
in[i]=in[i]<<scale;
}
scale=128+scale;
}
x=0;
for(int b=0;b<o;b++) // bit reversal order stored in im_out array
{
c1=Pow2[b];
f=Pow2[o]/(c1+c1);
for(int j=0;j<c1;j++)
{
x=x+1;
out_im[x]=out_im[j]+f;
}
}
for(int i=0;i<a;i++) // update input array as per bit reverse order
{
out_r[i]=in[out_im[i]];
out_im[i]=0;
}
int i10,i11,n1,tr,ti;
float e;
int c,s,temp4;
for(int i=0;i<o;i++) //fft
{
i10=Pow2[i]; // overall values of sine/cosine
i11=Pow2[o]/Pow2[i+1]; // loop with similar sine cosine
e=1024/Pow2[i+1]; //1024 is equivalent to 360 deg
e=0-e;
n1=0;
for(int j=0;j<i10;j++)
{
c=e*j; //c is angle as where 1024 unit is 360 deg
while(c<0){c=c+1024;}
while(c>1024){c=c-1024;}
n1=j;
for(int k=0;k<i11;k++)
{
temp4=i10+n1;
if(c==0) {tr=out_r[temp4];
ti=out_im[temp4];}
else if(c==256) {tr= -out_im[temp4];
ti=out_r[temp4];}
else if(c==512) {tr=-out_r[temp4];
ti=-out_im[temp4];}
else if(c==768) {tr=out_im[temp4];
ti=-out_r[temp4];}
else if(c==1024){tr=out_r[temp4];
ti=out_im[temp4];}
else{
tr=fast_cosine(out_r[temp4],c)-fast_sine(out_im[temp4],c); //the fast sine/cosine function gives direct (approx) output for A*sinx
ti=fast_sine(out_r[temp4],c)+fast_cosine(out_im[temp4],c);
}
out_r[n1+i10]=out_r[n1]-tr;
out_r[n1]=out_r[n1]+tr;
if(out_r[n1]>15000 || out_r[n1]<-15000){check=1;} //check for int size, it can handle only +31000 to -31000,
out_im[n1+i10]=out_im[n1]-ti;
out_im[n1]=out_im[n1]+ti;
if(out_im[n1]>15000 || out_im[n1]<-15000){check=1;}
n1=n1+i10+i10;
}
}
if(check==1){ // scalling the matrics if value higher than 15000 to prevent varible from overflowing
for(int i=0;i<a;i++)
{
out_r[i]=out_r[i]>>1;
out_im[i]=out_im[i]>>1;
}
check=0;
scale=scale-1; // tracking overall scalling of input data
}
}
if(scale>128)
{scale=scale-128;
for(int i=0;i<a;i++)
{out_r[i]=out_r[i]>>scale;
out_im[i]=out_im[i]>>scale;
}
scale=0;
} // revers all scalling we done till here,
else{scale=128-scale;} // in case of nnumber getting higher than 32000, we will represent in as multiple of 2^scale
/*
for(int i=0;i<a;i++)
{
Serial.print(out_r[i]);Serial.print("\t"); // un comment to print RAW o/p
Serial.print(out_im[i]);
Serial.print("i");Serial.print("\t");
Serial.print("*2^");Serial.println(scale);
}
*/
//---> here onward out_r contains amplitude and our_in conntains frequency (Hz)
int fout,fm,fstp;
float fstep;
fstep=Frequency/N;
fstp=fstep;
fout=0;fm=0;
for(int i=1;i<Pow2[o-1];i++) // getting amplitude from compex number
{
// Approx RSS function used to calculated magnitude quickly
out_im[i]=out_im[i-1]+fstp;
if (fout<out_r[i]){fm=i; fout=out_r[i];}
/*
// un comment to print Amplitudes (1st value (offset) is not printed)
Serial.print(out_r[i]); Serial.print("\t");
Serial.print("*2^");Serial.println(scale);
*/
}
float fa,fb,fc;
fa=out_r[fm-1];
fb=out_r[fm];
fc=out_r[fm+1];
fstep=(fa*(fm-1)+fb*fm+fc*(fm+1))/(fa+fb+fc);
return(fstep*Frequency/N);
}
//---------------------------------fast sine/cosine---------------------------------------//
int fast_sine(int Amp, int th)
{
int temp3,m1,m2;
accuracy=5; // set it value from 1 to 7, where 7 being most accurate but slowest
// accuracy value of 5 recommended for typical applicaiton
while(th>1024){th=th-1024;} // here 1024 = 2*pi or 360 deg
while(th<0){th=th+1024;}
temp1= 0;
temp2= 128; //2 multiple
m1=0;
m2=Amp;
temp3=(m1+m2)>>1;
Amp=temp3;
for(int i=0;i<accuracy;i++)
{ test=(temp1+temp2)>>1;
temp3=temp3>>1;
if(th>isin_data[test]){temp1=test; Amp=Amp+temp3; m1=Amp;}
else if(th<isin_data[test]){temp2=test; Amp=Amp-temp3; m2=Amp;}
}
return(Amp);
}
int fast_cosine(int Amp, int th)
{
th=256-th; //cos th = sin (90-th) formula
return(fast_sine(Amp,th));
}
//--------------------------------------------------------------------------------//
{ if(a==0 && b==0){return(0);}
int min,max,temp1,temp2;
byte clevel;
if(a<0){a=-a;}
if(b<0){b=-b;}
clevel=0;
if(a>b){max=a;min=b;} else{max=b;min=a;}
if(max>(min+min+min))
{return max;}
else
{
temp1=min>>3; if(temp1==0){temp1=1;}
temp2=min;
while(temp2<max){temp2=temp2+temp1;clevel=clevel+1;}
for(int i=0;i<temp2;i++){max=max+temp1;}
return(max);
}
}
//--------------------------------------------------------------------------------//

Thanks for this. It looks very useful.

I have been recasting it so it is a library.

The scaling part of the code puzzles me as there is no {} for its for loop (line 81) so only the first if condition is executed within the loop. It looks like the four if conditions should be part of the loop.

I'm also testing it on a esp8266 where a 512 transform takes 20mSec (with all scaling parts in the loop) and executed at the standard 80MHz clock.

Thanks for the feedback.

were you pointing this part?
I think there are already 2 loops present. Or probably i didn't get the question.

if(data_mag>1024)
{while(temp11>1024)
{temp11=temp11>>1;
scale=scale+1;
}
}

And on ESP8266, have you tried with accuracy level 5?

Time test on esp8266 was for accuracy 5. As a library I made the accuracy dynamically configurable.

The part I didn't understand that as the
for(int i;i<128;i++)

has no {} then only the next statement gets executed inside the for loop. That is the first
if(data_mag>1024)
{while(temp11>1024)
{temp11=temp11>>1;
scale=scale+1;
}
}

I think the other 3 if compound statements get executed outside the for loop.

Edit: Actually now I look at it more closely I'm not sure why the first for is there at all. It looks like the scaling should just be the four compound if statements. If I comment out the first for(int i;i<128;i++) I still get the same result.