Introduction: How to Create Attractive Statistical Graphics in R/RStudio
R is a powerful, free, open-source programming language and environment for statistical computing that is regarded as a standard in the statistics community, and RStudio is a development environment for working with it. In addition to exploring data and performing analyses, R can create graphics using its built-in graphics capabilities. Packages such as 'ggplot2' provide higher-level functions, so even beginners can create appealing graphics with relative ease. This Instructable introduces how to produce high-quality, effective graphics to enhance your reports, web pages, or other documents using several types of 'quickplots' from the ggplot2 package: scatter plots, histograms, box plots, bar plots, and stacked bar plots.
I recommend RStudio because its interface is much more user-friendly and appealing than the basic R console.
Skill Level: Beginner-Intermediate
You should know at least the basics of R programming. If you don't, here are some helpful introductions:
- Introduction to R - This can be pasted into an R script so that it is an 'interactive' tutorial.
- A (Very) Short Introduction to R
- R for Ecologists
You can find more by searching on Google.
Time Required: 30 minutes - 1 hour.
Additional time may be required if you need to troubleshoot or if you do any further exploration.
Step 1: Get Started
1. Create a new R script and save it to your preferred location
On the RStudio menu bar, click File - New File - R Script, then click File - Save As.
2. Install the 'ggplot2' and 'dplyr' packages
3. Load the packages
4. Explore the diamonds dataset
This command will pull up information about the diamonds dataset's variables and their meanings.
Note: Refer to the diamond colour variable in your code as 'color', which is its name in the dataset; 'colour' is a grouping option, as we will cover in Step 3.
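The commands for items 2 to 4 are not shown above. A minimal sketch of what they could look like, assuming both packages are available from CRAN and that you only want to install a package when it is missing:

```r
# Install each package only if it is not already available
for (pkg in c("ggplot2", "dplyr")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load the packages for this session
library(ggplot2)
library(dplyr)

# Open the help page for the diamonds dataset, then inspect its structure
?diamonds
str(diamonds)
```

The str() call is an optional extra; it lists each variable with its type, which helps when choosing variables for the plots that follow.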
Step 2: Create a Basic Scatter Plot
Use carat as the explanatory variable and price as the response variable.
qplot(carat, price, data=diamonds)
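qplot() is a shortcut around ggplot2's fuller ggplot() interface, and as far as I know recent ggplot2 releases mark qplot() as deprecated. As a hedged sketch, the same scatter plot can be written with ggplot() and geom_point(); the object name p is my own choice:

```r
library(ggplot2)

# Map carat to the x-axis and price to the y-axis, then draw points
p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()
print(p)
```

Storing the plot in an object is optional, but it lets you add layers or labels to it later with +.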
Step 3: Group Data on a Scatter Plot
You can group the data by variables with the following ggplot2 options: colour, size, shape, and fill. In this step, we will look at colour, size, and shape since these work well for scatter plots. The options colour and size can be set to either a continuous numerical variable or a categorical variable.
1. Group the scatter plot of price vs. carat by clarity using the option 'colour.'
qplot(carat, price, data=diamonds, colour=clarity)
Note: The ggplot2 option 'colour' is intentionally spelled in British English form because the creator, Hadley Wickham, is from New Zealand.
2. Group the scatter plot of price vs. carat by x using the option 'size.'
qplot(carat, price, data=diamonds, size=x)
Note: As you can see, it is possible to group by numeric variables as well as categorical variables because ggplot2 will automatically scale them.
3. Group the scatter plot of price vs. carat by cut using the option 'shape.'
qplot(carat, price, data=diamonds, shape=cut)
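The examples in this step use a categorical variable for colour and a numeric one for size. To illustrate that colour also accepts a continuous variable, here is a small sketch that colours the points by depth; the choice of depth and the name p_depth are mine, not part of the original steps:

```r
library(ggplot2)

# depth is numeric, so ggplot2 uses a continuous colour gradient
# instead of one distinct colour per category
p_depth <- qplot(carat, price, data = diamonds, colour = depth)
print(p_depth)
```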
Step 4: Facet the Data
Faceting is similar to grouping, but it creates a separate plot for each level of a variable by splitting the data into subsets. I suggest using only categorical variables to split up the data when faceting.
Facet the scatter plot of price vs. carat by the variable cut.
qplot(carat, price, data=diamonds, facets=~cut)
You can also combine faceting with the other grouping methods discussed in the previous step, though you should avoid having too much clutter.
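As a sketch of combining the two ideas, the call below facets by cut and colours by clarity within each panel. The particular pairing of variables is my own illustration:

```r
library(ggplot2)

# One panel per cut, with points coloured by clarity inside each panel
p_facet <- qplot(carat, price, data = diamonds, facets = ~cut, colour = clarity)
print(p_facet)
```

Adding a third grouping option such as size on top of this would likely be the kind of clutter worth avoiding.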
Step 5: Create a Histogram
Create a histogram of price.
Note: Adding the binwidth option (binwidth=) lets you specify how wide the intervals on the x-axis should be (not how often tick marks and value labels appear). You do not have to specify this option, but if you leave it out you will get a message saying that a default was used, roughly the range of the variable divided by 30.
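The command for this step is not shown above. A minimal sketch, assuming a bin width of 500 (an illustrative value of mine, in the same units as price):

```r
library(ggplot2)

# Approximate default bin width described in the note: range of price / 30
diff(range(diamonds$price)) / 30

# Histogram of price with an explicit bin width of 500
# (500 is an illustrative value, not one taken from the tutorial)
p_hist <- qplot(price, data = diamonds, binwidth = 500)
print(p_hist)
```

When qplot() receives only an x variable that is numeric, it draws a histogram by default, so no geom needs to be named here.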
Step 6: Create a Box Plot
Box plots are also known as box and whisker diagrams.
1. Create a box plot of depth by cut.
qplot(cut, depth, data=diamonds, geom="boxplot")
2. Create a box plot of depth by cut and add jittered points.
qplot(cut, depth, data=diamonds, geom=c("jitter", "boxplot"))
Note: The order in which you specify the geoms matters because the first one is drawn behind the second. You can experiment with the order to determine which arrangement is better for your data, though it doesn't look great either way in this case. A better example of how combining these two geoms can produce a good visualization (as well as more information about jittering) can be found near the bottom of this page.
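To compare the two arrangements side by side, a small sketch that builds both orders; the object names are hypothetical:

```r
library(ggplot2)

# Jitter listed first: points are drawn first, boxes on top of them
p_box_on_top <- qplot(cut, depth, data = diamonds,
                      geom = c("jitter", "boxplot"))

# Boxplot listed first: boxes are drawn first, points on top of them
p_points_on_top <- qplot(cut, depth, data = diamonds,
                         geom = c("boxplot", "jitter"))

print(p_box_on_top)
print(p_points_on_top)
```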
Step 7: Create a Bar Plot
1. Use dplyr statements to create a new variable for average carat called meancarat.
meancaratdata <- diamonds %>% group_by(clarity, cut) %>% summarise(meancarat = mean(carat))
Note: This creates a new dataset, which we will call meancaratdata, with one row for each combination of clarity and cut. The variable cut is included in the group_by() statement because we will use it to create a stacked bar chart later.
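To make the note concrete, the sketch below rebuilds the summary and inspects it. The .groups = "drop" argument is my addition; it assumes dplyr 1.0.0 or later, where it returns an ungrouped result and avoids the message about regrouping:

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Mean carat for every combination of clarity and cut
meancaratdata <- diamonds %>%
  group_by(clarity, cut) %>%
  summarise(meancarat = mean(carat), .groups = "drop")

# Look at the first few rows of the new dataset
head(meancaratdata)
```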
2. Create a bar plot of meancarat for each level of clarity, using the new data set meancaratdata.
qplot(clarity, meancarat, data=meancaratdata, geom="bar", stat="identity")
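Note: You must use the statement stat="identity" or it will cause an error. You may use stat="bin" if you want it to set the y-value to counts, like a histogram. As far as I know, the stat option of qplot() has been deprecated since ggplot2 2.0.0, so on newer versions this call may warn or fail; in that case, build the plot with ggplot() and geom_col() instead.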
3. Add color!
qplot(clarity, meancarat, data=meancaratdata, geom="bar", stat="identity", fill=clarity) + guides(fill="none")
The guides(fill="none") statement removes the legend for the fill colors because it would be redundant for a bar plot. Older code often writes guides(fill=FALSE), which recent ggplot2 versions deprecate.
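Two caveats seem worth sketching here. First, as far as I know the stat argument of qplot() has been deprecated since ggplot2 2.0.0, so the calls in this step may not run on a current installation. Second, meancaratdata holds one row per clarity and cut combination, so a bar built from it stacks five values for each clarity level. The sketch below assumes the goal is one mean per clarity level and uses ggplot() with geom_col(); the name clarity_means is hypothetical:

```r
library(ggplot2)
library(dplyr)

# One row per clarity level: the mean carat within that level
clarity_means <- diamonds %>%
  group_by(clarity) %>%
  summarise(meancarat = mean(carat))

# geom_col() uses the y values as bar heights, like stat = "identity"
p_bar <- ggplot(clarity_means,
                aes(x = clarity, y = meancarat, fill = clarity)) +
  geom_col() +
  guides(fill = "none")  # hide the redundant fill legend
print(p_bar)
```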
Step 8: Create a Stacked Bar Plot
Create a stacked bar plot of meancarat by clarity, grouped by cut.
qplot(clarity, meancarat, data=meancaratdata, geom="bar", stat="identity", fill=cut)
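If the qplot() call above does not run on your ggplot2 version, a hedged equivalent with ggplot() and geom_col() is sketched below. It also shows a side-by-side variant using position = "dodge", which is my addition and may be easier to read because each bar then shows a single group mean:

```r
library(ggplot2)
library(dplyr)

# Mean carat for every combination of clarity and cut
meancaratdata <- diamonds %>%
  group_by(clarity, cut) %>%
  summarise(meancarat = mean(carat), .groups = "drop")

# Stacked bars: geom_col() stacks the values for each clarity by default
p_stacked <- ggplot(meancaratdata,
                    aes(x = clarity, y = meancarat, fill = cut)) +
  geom_col()

# Side-by-side bars: one bar per clarity and cut combination
p_dodged <- ggplot(meancaratdata,
                   aes(x = clarity, y = meancarat, fill = cut)) +
  geom_col(position = "dodge")

print(p_stacked)
print(p_dodged)
```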
Step 9: Add/change Labels
For this step we will be using the scatter plot from Step 3 of price vs. carat, grouped by clarity using the option 'colour.' However, these label options can, and typically should, be applied to all types of plots.
1. Using the 'main' option, set the title as "Scatter Plot of Carats vs. Price."
qplot(carat, price, data=diamonds, colour=clarity, main="Scatter Plot of Carats vs. Price")
2. Change the x-axis label to "Carats" using the 'xlab' option.
qplot(carat, price, data=diamonds, colour=clarity, main="Scatter Plot of Carats vs. Price", xlab="Carats")
This is a small change, but capitalization looks more professional. This step matters more for variable names whose meaning most viewers would not know, such as 'x.'
3. Change the y-axis label to "Price (US $)" using the 'ylab' option.
qplot(carat, price, data=diamonds, colour=clarity, main="Scatter Plot of Carats vs. Price", xlab="Carats", ylab="Price (US $)")
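The same three labels can also be set with labs() when a plot is built with ggplot() rather than qplot(). This is a sketch of that alternative, not the tutorial's own method, and the name p_labeled is mine:

```r
library(ggplot2)

# Scatter plot of price vs. carat, coloured by clarity, with custom labels
p_labeled <- ggplot(diamonds, aes(x = carat, y = price, colour = clarity)) +
  geom_point() +
  labs(
    title = "Scatter Plot of Carats vs. Price",
    x = "Carats",
    y = "Price (US $)"
  )
print(p_labeled)
```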
Step 10: Conclusion
Now you should know how to create several types of basic plots using ggplot2 in RStudio: scatter plots, histograms, box plots, bar plots, and stacked bar plots. You also know a few ways to visualize the differences between groups of data, including faceting, stacking (with 'fill'), and grouping with colors, sizes, and shapes, and you can polish your graphics by creating and replacing plot labels. The attractive graphics produced with ggplot2 are a great way to visualize data and will be a valuable addition to reports, articles, or other documents.
Remember that the plots you just created are only basic plots. With more knowledge and time, you can customize your graphics even more to create better visualizations. Some visualizations can even be interactive when you use the package Shiny.
If you are interested in learning more about ggplot2, I recommend first browsing Hadley Wickham's site dedicated to ggplot2. Another great source is Cookbook for R, which is provided by the author of R Graphics Cookbook. Other helpful information can typically be found by searching on Google, and more is constantly being added as ggplot2 is updated.