Introduction: Analysis of Variance (ANOVA) in R

This is an Instructable on how to run an Analysis of Variance test, commonly called ANOVA, in the statistics software R.

ANOVA is a quick, easy way to rule out unneeded variables that contribute little to the explanation of a dependent variable. It is accessible and applicable to people outside of the statistics field.

This Instructable assumes no prior knowledge of R and gives basic software commands that may be trivial to an experienced user. If you are familiar with R, I suggest skipping to Step 4 and proceeding with a known dataset already in R.

R is free, open source, and ubiquitous in the statistics field. All R commands are typed as text, and the R language is a dialect of the computer language S. It is helpful, but by no means necessary, to have an elementary understanding of text-based computer languages. If you cannot stand working with text-based software, I suggest that you try the statistics software JMP.


What you need:
-Access to a computer
-A data set to analyze

Estimated time to complete an ANOVA Test:
-15 minutes for a new user.
-5 minutes for an experienced user.

Step 1: Getting Started:


If R isn't on your computer already, it can be downloaded for free from the official website at:
  • http://cran.r-project.org/bin/windows/base/ (Windows)
  • http://cran.r-project.org/bin/macosx/ (Mac)
  • http://cran.r-project.org/ (Linux)
Choose the version (32-bit or 64-bit) that matches your operating system.

Open R:

You will see the basic command console open. This is a log of the commands executed and their output. However, text already in the console can't be edited, which makes it hard to work with. Instead, open a Script window with the following:

  • File > New Script

This window acts as a basic word processor (close to Notepad) in which commands can be written, edited, and executed by right-clicking a line or selection and running it. Alternatively, the shortcut Control+R will also execute a line or selection.

Note:
You can write comments in R by putting a pound sign (#) at the start of the comment. An example of this is shown in Step 3.

Step 2: Reading Data:

.csv is perhaps the most prevalent file type when dealing with data files, and .csv files can be made easily from Excel. Alternatively, you can enter your data directly into R by assigning values to named variables (see the secondary image for help).
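
As a rough sketch of entering data directly, the lines below type in two short columns and combine them into one data set. The numbers and the name cereal are made up purely for illustration and are not the data shown in the images.

```r
# Hypothetical values, typed in by hand purely for illustration.
Sugars   = c(6, 8, 5, 0, 8, 10)           # independent variable (x)
Calories = c(70, 120, 70, 50, 110, 110)   # dependent variable (y)

# Combine the two columns into one data set (a data frame).
cereal = data.frame(Sugars, Calories)

cereal        # prints the whole data set
nrow(cereal)  # number of rows (observations)
```

Typing data in by hand is only practical for small data sets; for anything larger, a .csv file is usually easier.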

If you have a .csv file, great! Read it in using one of the following commands (mydata is a name of your choosing):
  • mydata = read.csv("appropriate web page or file directory")
  • mydata = read.csv(file.choose())
Once this is done, explore your data with the following commands:
  • dim(mydata)
  • str(mydata)
  • head(mydata)
  • attach(mydata)
Note:
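Running attach lets you refer to the columns of your data set by name alone. Without it, R will not find a column unless you write it with the data set's name in front (for example dataset$Calories) or pass the data set through a function's data argument.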
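
To see these commands run from start to finish, here is a minimal sketch that first writes a tiny .csv file to a temporary location and then reads it back in. The file contents and the name mydata are assumptions made for illustration; with your own file you would skip the first step and pass your file's location to read.csv.

```r
# Build a tiny example file so this sketch runs anywhere.
# The values and the name "mydata" are made up for illustration.
path = tempfile(fileext = ".csv")
writeLines(c("Sugars,Calories",
             "6,70",
             "8,120",
             "5,70",
             "0,50"), path)

mydata = read.csv(path)  # reads the file into a data frame

dim(mydata)   # number of rows, then number of columns
str(mydata)   # name and type of every column
head(mydata)  # first six rows (here, all four)
```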

Step 3: Running the ANOVA Test:


You’re doing great! You are close to being done with a single independent variable ANOVA test already.

Run the Analysis of Variance with the following R command:
name = aov(y ~ x) #runs the ANOVA test; replace y and x with your column names.
ls(name) #lists the items stored by the test.
summary(name) #gives the basic ANOVA output.

The example in the images uses Calories as the dependent variable, y, and compares it to one independent variable (Sugars in this example).

Note:
If R cannot find the variable specified, make sure the spelling, capitalization, and punctuation match and that you have executed the ‘attach(data)’ command.
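
If you do not have a data set on hand, the same three commands can be tried on PlantGrowth, a small data set that ships with R. This is only a sketch: the name fit is arbitrary, and the data argument is used here in place of attach so that the example runs on its own.

```r
# PlantGrowth ships with R: 30 plant weights across three treatment groups.
fit = aov(weight ~ group, data = PlantGrowth)  # data = tells aov where to look

ls(fit)       # lists the items stored by the test
summary(fit)  # ANOVA table: Df, Sum Sq, Mean Sq, F value, Pr(>F)
```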

Step 4: More Than One Independent Variable


The case with multiple independent variables x1, x2, ..., xn is a simple change.

Modify the code so that the independent variables are joined as a product, with an asterisk (*) between them. The asterisk includes each variable on its own as well as all of their interactions:

  • name=aov(y~x1*x2*xn)
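
To see what the asterisk expands to, the sketch below lists the terms of a two-variable model before fitting it. It assumes the warpbreaks data set that ships with R; the names model, term_names, and fit2 are arbitrary. Note that R prints an interaction with a colon, so wool:tension is the interaction of wool and tension.

```r
# warpbreaks ships with R: breaks in yarn, by wool type and tension level.
model = breaks ~ wool * tension

# List the terms that the asterisk expands to.
term_names = attr(terms(model), "term.labels")
term_names  # "wool" "tension" "wool:tension"

fit2 = aov(model, data = warpbreaks)
summary(fit2)  # one row per term, plus a Residuals row
```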

Step 5: Interpreting the Data:

Let's use the multiple-variable model from Step 4.

Here we are trying to describe Calories in terms of Sugars, Calories from Fat, Protein, and their interactions with each other (Sugars*CalFat, Sugars*Protein, CalFat*Protein, and Sugars*CalFat*Protein).

Focus on the last column, labeled Pr(>F): the probability that F is greater than the value listed in the previous column. This is often called the p value. In most cases significance is set at the alpha = .05 level; that is, we require the p value to be less than .05 for a term to be considered statistically significant.

Immediately we can see that the terms Sugars, Sugars*CalFat, and CalFat*Protein are not significant at the .05 level. In contrast, CalFat and Sugars*CalFat*Protein are the best terms, with p values much less than .05.
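
The same comparison against .05 can also be made in code rather than by eye. The sketch below is one way to do it, shown on the warpbreaks data set that ships with R rather than the data in the images; the names anova_table, p_values, and alpha are arbitrary.

```r
fit = aov(breaks ~ wool * tension, data = warpbreaks)

anova_table = summary(fit)[[1]]      # the ANOVA table as a data frame
p_values = anova_table[["Pr(>F)"]]   # one p value per row; NA for Residuals
names(p_values) = trimws(rownames(anova_table))  # label each value by its term

alpha = 0.05
p_values < alpha  # TRUE marks a term that is significant at the .05 level
```

The Residuals row has no p value, so it shows up as NA in the result.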

From this we can conclude that if your goal is to describe Calories, you only need to do a regression on CalFat or potentially Sugars*CalFat*Protein. If you plan to take more samples and all you care about is predicting or describing Calories, you now only have to gather Calories from Fat and can forgo gathering the other variables.

Congratulations! You have just completed an ANOVA test on your very own data set and are now able to interpret the results. This handy tool can save you and your company untold amounts of time, effort, and money!