Ocean Acidification: a Real Data Analysis With Free Software




Introduction

Oceans play a central role in regulating the concentration of carbon dioxide in the atmosphere, since they absorb a significant portion of atmospheric CO2. When seawater absorbs carbon dioxide, a series of chemical reactions occurs and, as a result, the pH of the water decreases, meaning that the ocean becomes more acidic. This phenomenon is known as ocean acidification. The acidification of seawater makes it difficult for calcifying organisms such as oysters, clams, and corals to build and maintain their structures, and it can also affect the behavior of other marine organisms. This in turn has negative consequences for people whose economies or diets depend on fish and shellfish.

It is well known that since the industrial revolution, emissions of carbon dioxide to the atmosphere have increased considerably. But how has this affected the oceans? Long time-series observations in the ocean, although extremely important, are rare. However, the Hawaii Ocean Time-series (HOT) project has been conducting observations in the North Pacific—at the ALOHA station located 100 km north of the island of Oahu (Hawaii)—since October 1988. They carry out measurements on the hydrography, chemistry, and biology of the seawater approximately once a month, and the measured data are free and open access.

Thanks to the Hawaii Ocean Time-series project our students can work with real data to analyze how the concentration of CO2 in the Pacific Ocean has changed in the last 30 years and how it has affected the acidity of seawater.

Secondary education (15-18 years)

Learning objectives

The main objective of this activity is to have students learn about ocean acidification by working with real data. Beyond that, the students will also learn a programming language that they can use afterward to analyze scientific data of many kinds. Specifically, the students will learn:

  • How ocean acidification has increased in recent years.
  • How the concentration of CO2 dissolved in water has changed in recent years.
  • How pH is related to the amount of CO2 dissolved in water.
  • How to create a .csv data file using a text editor.
  • How to use a spreadsheet program to edit a data table.
  • How to extract information from real data using a programming language.

Given these learning objectives, there are two possible contexts in which this activity can be carried out:

  • In Earth and Environmental Sciences: while learning about ocean acidification and environmental issues, students learn to use free software and a programming language for data analysis and graphing.
  • In Computer Science: while learning about spreadsheets and data analysis software, the use of this real data makes the students more concerned about the environmental consequences of their carbon footprint.

Step 1: The Software

All the programs used in this activity are available as free software.

The text editor: Atom

We will obtain the data from a web page, so the first step is to turn them into a .csv file. For this, we need a text editor that can create and edit plain text files. Do not use a word processor, because word processors embed special formatting symbols in the text. The text editor we have chosen is Atom, a free and open-source text and source code editor available for macOS, Linux, and Microsoft Windows. If you don’t want to (or can’t) install it on your computer, you can also work with the default text editor of your operating system (Notepad in Windows, TextEdit in macOS, or whichever editor your Linux distribution provides, such as gedit or Vim).

The spreadsheet: OpenOffice Calc

Once we have the data file, we’ll have to prepare the raw data for our purposes. A spreadsheet program lets us easily modify the .csv data file. We’ll work with the open-source, multiplatform Apache OpenOffice Calc spreadsheet, but any other spreadsheet program will work similarly.

The R programming language

Once the data are ready, we have to analyze them so we can extract conclusions. We’ll use R, a programming language and environment for statistical computing that is available as free software under the terms of the GNU General Public License. R is widely used in data analysis and scientific research to analyze, visualize and present data.

Although R has a command-line interface, several graphical user interfaces make it easier to work with. The most specialized one, the open-source RStudio IDE, runs on the desktop in Windows, macOS, and Linux. But, thanks to R’s popularity as a programming language, there are also many free online environments that let you run R programs without installing anything. Very handy, isn’t it? We’ll focus on one of these online IDEs: repl.it. According to their website, their mission is to make programming more accessible for educators, learners, and developers, or, in their own words, “to get people to code as soon as possible. What if teachers who want to teach programming don't have to also work as IT administrators? What if students can just code their homework without having to set up the development environment on every computer they wanted to code on?”. And they do make it easy to start coding: you don’t even need to sign up to start working with it! Sure, the desktop RStudio environment is more versatile, and some functionalities of R don’t work in the online version, but for simple programs it’s more than enough. Choosing the desktop or the online environment is up to you.

Step 2: Raw Data

The data measured by the Hawaii Ocean Time-series (HOT) project in the ALOHA station, as adapted from “Dore, J.E., R. Lukas, D.W. Sadler, M.J. Church, and D.M. Karl. 2009. Physical and biogeochemical modulation of ocean acidification in the central North Pacific (Proc Natl Acad Sci USA 106:12235-12240)”, are freely available. Go to this link: http://hahana.soest.hawaii.edu/hot/products/HOT_surface_CO2.txt, and you will find the values of the measured magnitudes as a tab-delimited text file with 22 data columns. The meaning of these values is explained in the HOT_surface_CO2_readme.pdf document in this link: http://hahana.soest.hawaii.edu/hot/products/HOT_surface_CO2_readme.pdf.

Select the text on this page and copy it. You can either use the mouse to select and copy the data, or the keyboard shortcuts Ctrl+A to select all the text and Ctrl+C to copy it (or Cmd+A and Cmd+C in Mac). Once the data are copied to the clipboard, in the next step you will paste them in the text editor.

Step 3: Creating the .csv File

In this step we are going to create the .csv file that we need to analyze the data with R. A .csv file is a plain text file that contains tabular data (a data table). Each line of the file is a data record (a row in the table), and each record contains several fields (the columns in the table). All the records must have an identical list of fields. The fields in each line are separated by commas (in fact, CSV stands for "comma-separated values", although sometimes the field delimiter can be a semicolon or a tab), and individual lines are separated by a newline. The first row of the file contains the header of each data field (the name of the columns of the data table).
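For instance, the first few lines of a small .csv file with two columns might look like this (the values here are made up purely for illustration):

```csv
date,pH
1988-10-31,8.11
1988-12-01,8.10
1989-01-12,8.12
```

The first line is the header naming the two fields, and each following line is one data record with its fields separated by a comma.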

To create the .csv file with the ALOHA station data, you have to proceed as follows:

  1. Open the text editor (Atom, or the one of your choice).
  2. Create a new file (File > New File).
  3. Paste the content of the clipboard that you copied in the previous step.
  4. Delete the first rows of the file, which only contain a description of the data. Just don’t delete the line containing the header of the data fields!
  5. The original file uses the tab as a delimiter between fields. In the Find menu (or the Edit menu, depending on the editor you’re using), find the tabs and replace them with commas.
  6. Save the file. You can use whatever name you want, but it is essential that you add the .csv extension to the name. For example, you could name it dataAloha.csv. Now you’re ready to move on to the next step!

If you want to skip this step, just download the .csv file here.

Step 4: Editing the Data File

The data file that we obtained in the previous step contains far more information than we need. As we want to study the evolution of the amount of CO2 dissolved in water and how it affects the acidity of the ocean water, the only columns that we are interested in (check the readme.pdf document accompanying the data) are the following:

  • cruise: integer number of the HOT cruise to the ALOHA station.
  • date: mid-day of the cruise expressed as a calendar date.
  • pHcalc_insitu: mean seawater pH.
  • pCO2calc_insitu: mean seawater CO2 partial pressure, in μatm.

Besides, there are some missing data points, which are indicated by the value -999. Our task now will be to prepare the data file, so it contains only relevant data. We will also change the format of the dates to make them understandable by the data analysis tools, and we’ll add one more column, so the data are more easily processed later on. To do all this, we’ll edit the data with a spreadsheet program. So let’s get to work.

  1. Open OpenOffice Calc, or the spreadsheet program you are going to use.
  2. Open the dataAloha.csv file (File > Open...). You will be asked to choose the delimiter between the data fields: in Separator options, choose Separated by → Comma. Now the file looks like an ordinary data table, doesn’t it?
  3. Select the unwanted columns and delete them. Keep only the columns named cruise, date, pHcalc_insitu and pCO2calc_insitu. Now you should have a table with 299 rows (the header plus 298 data records) and only four columns (four data fields for each record).
  4. Just to make things easier later, simplify the names of the last two columns: name them pH and pCO2, respectively.
  5. Some records contain unknown values of the magnitudes, represented by -999. We are going to eliminate these rows since they don’t add any useful information. To do so, filter (in Data > Filter > Standard Filter…) the values of column pH that are equal to -999. Select these rows and delete them. Once you have deleted them, remove the filter to show the remaining 292 rows (in Data > Filter > Remove Filter).
  6. Let’s focus now on the second column, containing the date on which the measurements were taken. The values in this column have a problem: they are not expressed in a proper date format. Look at the first one: 31-Oct-88. OpenOffice Calc doesn’t understand it as a date, although it would identify 31 Oct 88, 31 October 1988, 31/10/88, and some others. So we will have to transform the values in this column into valid dates. This can be done in several ways, but the simplest one is to find all the hyphens (-) and replace them with spaces. Do so (in Edit > Find and Replace…) and all the dates are automatically expressed in the 31/10/88 format.
  7. We’re not done yet with the dates. The format we want is a very specific one: the ISO 8601 format. According to Wikipedia, “The purpose of this standard is to provide an unambiguous and well-defined method of representing dates and times, so as to avoid misinterpretation of numeric representations of dates and times, particularly when data are transferred between countries with different conventions for writing numeric dates and times.” To obtain this format, select the date column and go to Format > Cells… This opens a window where you’ll have to select Category → Date, and Format → 1999-12-31. There it is! Now the dates are expressed in the YYYY-MM-DD format (four digits for the year, followed by two digits for the month and finally two more digits for the day).
  8. To better visualize the relationship between the amount of CO2 and the acidity of water in the subsequent analysis, we are going to group the measurements in periods of five years: 1988-1992, 1993-1997, 1998-2002, 2003-2007, 2008-2012 and 2013-2017. This can be easily done this way: add the header period in a new column. Type 1988-1992 in both the first and second cells of this new column. Select both cells, and drag the black square in the corner of the second cell up to the last record of this five-year period. This will automatically copy the content of the first two cells into the other cells. Now repeat this process with the remaining periods.
  9. Save your work, and we’re done!

So far, we have processed the data, and they are ready to be analyzed. If you want to avoid this long step of editing the data file, just skip it and download here the dataAloha.csv file already processed.
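Incidentally, all of this editing could also be done directly in R. As a sketch (assuming the raw tab-delimited data have been saved as dataRaw.txt — the file name is chosen here for illustration — and that the system locale uses English month abbreviations), the same cleaning might look like this:

```r
# Read the raw tab-delimited file (file name assumed for illustration)
raw <- read.delim("dataRaw.txt")

# Keep only the four columns we need, renaming the last two,
# and parse dates like "31-Oct-88" (%b requires an English locale)
aloha <- data.frame(cruise = raw$cruise,
                    date   = as.Date(raw$date, format = "%d-%b-%y"),
                    pH     = raw$pHcalc_insitu,
                    pCO2   = raw$pCO2calc_insitu)

# Drop the records flagged as missing (-999)
aloha <- aloha[aloha$pH != -999, ]

# Group the measurements into the same five-year periods as in the spreadsheet
breaks <- as.Date(c("1988-01-01", "1993-01-01", "1998-01-01",
                    "2003-01-01", "2008-01-01", "2013-01-01", "2018-01-01"))
labels <- c("1988-1992", "1993-1997", "1998-2002",
            "2003-2007", "2008-2012", "2013-2017")
aloha$period <- cut(aloha$date, breaks = breaks, labels = labels)

# Write the cleaned table out as a .csv file
write.csv(aloha, "dataAloha.csv", row.names = FALSE)
```

As noted in the discussion at the end, doing everything in R can be overwhelming for beginners, which is why the spreadsheet route is used in this activity.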

Step 5: The R Programming Language and Environment

As we said in the introduction, the data will be analyzed using the R programming language. We’ll work with the online environment repl.it, so we can immediately start programming without the need to download or set up anything. Click on this link: https://repl.it/languages/rlang (or, in https://repl.it/, go to the bottom of the page and, in Languages, select R). This link will open in your browser the programming environment we are going to work with.

As you can see, the window is divided into three panels:

  • The panel on the left has the name “Files”. All the files of a project are gathered in this panel. For example, the main.r program file contains the R program, you’ll upload here the file containing the data of the project, and the pdf document generated by R when plotting a graph will appear here.
  • The center panel is where you’ll type the code of your program.
  • When you run the code, the text output is presented in the panel with the black background.

The programs that we’re going to use in this activity are rather simple, but they are a good starting point for discovering R. If you want to learn more about this programming language, there are many online resources and books to choose from.

Step 6: Example Program

If you haven’t worked with R before, we recommend you to try this simple example, so you can get acquainted with the programming language and the repl.it environment.

Imagine you’ve made an experiment to study the functional relationship between two variables x and y, so you record the values of the x variable together with the corresponding values of the y variable. The values obtained from the experiment are the following:

x = 0.1 → y = 6.2
x = 1.4 → y = 5.2
x = 2.6 → y = 4.4
x = 3.1 → y = 3.9
x = 4.6 → y = 2.8

To analyze the behavior of the data, start by transforming each set of variables into a vector. In R this is done using the c() function. With the <- operator (the equal sign, =, would as well work), we assign the values of each variable to a vector:

x <- c(0.1, 1.4, 2.6, 3.1, 4.6)
y <- c(6.2, 5.2, 4.4, 3.9, 2.8) 

Type these two lines in the center panel of a new repl window. Be careful to copy the values in the correct order: the first value of the x vector corresponds to the first value of the y vector, and so on. Once you’ve copied them, click the green “Run” button just above this panel. Nothing happens yet. Good. If you get an error message, check that you have typed it correctly, with no missing commas or parentheses.

To see the relationship between x and y, plot the graph of y as a function of x. The following instruction:

plot(x, y)

plots y as a function of x, so type it below the previous instructions and run the program. A new file called Rplots.pdf should appear in the left panel. Click on its name. Do you see the scatter plot of y vs x?

Look at the graph: there’s a very strong linear relationship between the two variables. To perform a regression analysis of the data you’ll use a linear model, which in R is computed by the lm() function. The syntax is lm(y~x), which fits the data to the straight line y=a+bx (a is the intercept and b is the slope of the line). As you’ll use this fitted line twice in your program, assign it to a variable named model (or any other name you prefer). So go back to your main.r program and add one more instruction:

model = lm(y~x)

If you need the values of the coefficients of the linear model, just add the following instruction to your program:

print(model)

Now, in the output panel you should see these values: the intercept is 6.2803, and the slope is -0.7544, so the fitting line is y = 6.2803 - 0.7544x.

Just one more thing to finish this example: add the fitted line to the graph. This is accomplished with the abline() function, which adds a straight line to the current plot. Type this last instruction:

abline(model)

and run the program again to see the fitted line drawn over the scatter plot!

You can try the example program in this link: https://repl.it/@bpadin/example
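Putting all the pieces together, the complete example script reads:

```r
# Experimental data
x <- c(0.1, 1.4, 2.6, 3.1, 4.6)
y <- c(6.2, 5.2, 4.4, 3.9, 2.8)

# Scatter plot of y versus x
plot(x, y)

# Fit the straight line y = a + b*x
model <- lm(y ~ x)

# Show the fitted coefficients (intercept and slope)
print(model)

# Draw the fitted line on the current plot
abline(model)
```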

Step 7: Analyzing the Seawater Acidity

Now let’s go back to our ALOHA data. The first aspect that we are going to analyze is how the acidity of seawater has changed in the last 30 years. Scientific research shows that the oceans are gradually becoming more acidic, but do our data support this claim? To find out, we will plot the graph of the pH of seawater as a function of time. Remember that a lower pH means more acidity, and a higher pH means less acidity.

Let’s write the first program. The data to analyze are gathered in the dataAloha.csv file so, first of all, we need to upload this file. As we said in the previous step, the data files must be included in the left panel. So just drag the file from your hard drive and drop it in this panel. The dataAloha.csv file should appear there. Just to check that the file is properly uploaded, click on its name to see its content. You should see the data records, with the fields separated by commas.

Now click on the main.r file and type the program in the center panel. The program to make the scatter plot of pH versus date couldn’t be simpler:

aloha <- read.csv("dataAloha.csv")
aloha$date <- as.Date(aloha$date)
plot(aloha$date, aloha$pH)

That’s it! You can try this very basic program in this link: https://repl.it/@bpadin/pHbasic.

Let’s briefly explain the meaning of this code:

Line #1: The instruction read.csv("dataAloha.csv") reads the dataAloha.csv data file, and then the content of the file is assigned (using the assignment symbol <-) to a data object that we have called aloha (you can choose a different name if you want).

Line #2: When working with data in R, we’ll use the “dollar notation” a lot. The expression aloha$date refers to the date data field in the aloha data object. In other words, aloha$date stores the values of the date column. This column contains the dates in which the measurements were taken, but R understands them just as characters. So we use the as.Date() function to inform R that those values are actually dates.

Line #3: The function plot(x, y) plots the y variable as a function of the x variable. As a result, the plot(aloha$date, aloha$pH) instruction will plot the values of the date in the horizontal axis, with the corresponding values of pH in the vertical axis, thus making the scatterplot of pH vs time.

To actually plot the graph we have to run the code. Click on the green “Run” button above the code. Do you see the file Rplots.pdf in the left panel? Click on its name. The scatterplot should open as a pdf file. Great!

So far so good, but we can improve this scatterplot by joining the points with lines, so the temporal evolution of the pH is seen more clearly. Go back to the main.r program and add the parameter type = "l" ("l" stands for "line") to the plot() function. Run the program to see the plot. Much better now!
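For reference, with the type parameter added, the basic program reads:

```r
# Read the cleaned data and plot pH over time as a line
aloha <- read.csv("dataAloha.csv")
aloha$date <- as.Date(aloha$date)
plot(aloha$date, aloha$pH, type = "l")
```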

The graph we have just plotted gives us a clear idea of what’s going on: the seawater’s pH has been decreasing over the last 30 years. In other words, the ocean has become more acidic. To make this more obvious, we can add the regression line that shows the tendency of the data in this scatterplot. To do so, just add the following line to your code and see the result:

abline(lm(aloha$pH ~ aloha$date))

Ok, let's be honest. This is not the most gorgeous graph you've ever seen, right? And, besides, the labels of the axes are not appropriate, and the title of the graph is missing. And what about giving it some color? Type the following program to see the final version of the graph:

aloha <- read.csv("dataAloha.csv")
aloha$date <- as.Date(aloha$date)

pH = aloha$pH
date = aloha$date

plot(date, pH,
     type = "l",
     col = "#800000",
     main = "Hawaii Ocean Time-series surface CO2 system data product\nStation ALOHA (Hawaii), 1988-2017",
     xlab = "Year", 
     ylab = "Seawater pH")

grid(col = "lightblue4", lty = "dotted")

abline(lm(pH ~ date), col = "#800000")

You can try the program in this link: https://repl.it/@bpadin/alohapH.

The plot() function is highly customizable, so check the bibliography to make any further changes you want.

Step 8: Analyzing the Amount of Dissolved Carbon Dioxide

In the previous step, we studied the evolution of the water’s pH. Let’s do the same with the amount of carbon dioxide dissolved in the seawater. Copy this program to see the resulting graph (you can try the program in this link: https://repl.it/@bpadin/alohaCO2):

textTitle = "Hawaii Ocean Time-series surface CO2 system data product\nStation ALOHA (Hawaii), 1988-2017"
textXaxis = "Year"
textYaxis = expression(paste("Partial pressure of CO"["2"], " (", mu, "atm)"))

aloha <- read.csv("dataAloha.csv")
aloha$date <- as.Date(aloha$date)

pCO2 = aloha$pCO2
date = aloha$date

plot(date, pCO2, 
     type = "l",
     col = "#2d572c",
     main = textTitle,
     xlab = textXaxis,
     ylab = textYaxis)
grid(col = "lightblue4", lty = "dotted")
abline(lm(pCO2 ~ date), col = "#2d572c")   

Run the program and look at the graph. The data confirm the expected trend again: in the last 30 years, the amount of CO2 absorbed by the ocean has increased continually.

But there’s something else that catches our attention. The graph has a lot of spikes that appear in a rather regular fashion, showing that the amount of carbon dioxide doesn’t increase steadily, but has clear maxima and minima throughout the years. Do these values have a seasonal explanation? Are there months when the ocean absorbs more CO2, and months when it absorbs less? To answer these questions, try the programs in the attached files and draw your own conclusions!
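The attached files are not reproduced here, but one possible approach — a sketch assuming the cleaned dataAloha.csv file is available — is to extract the month from each date and draw a box plot of pCO2 per month:

```r
aloha <- read.csv("dataAloha.csv")
aloha$date <- as.Date(aloha$date)

# Extract the month ("01" ... "12") from each measurement date
month <- format(aloha$date, "%m")

# One box per month: any seasonal pattern becomes visible at a glance
boxplot(aloha$pCO2 ~ month,
        xlab = "Month",
        ylab = expression(paste("Partial pressure of CO"["2"], " (", mu, "atm)")))
```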

Step 9: Analyzing the Relationship Between CO2 and pH

Putting together the graphs of the evolution of pH and carbon dioxide, the conclusion is obvious: as the amount of CO2 dissolved in the water has increased, the pH of the seawater has decreased, meaning that its acidity has increased as well. So let’s plot one last graph that represents the pH of seawater as a function of the carbon dioxide dissolved in it. Copy this program (or try it in this link: https://repl.it/@bpadin/alohapH-CO2) and run it to see this relationship:

# Custom color palette
col.list <- c("#ffb8de","#ff72be","#ff1493","#cc1075","#7f0a49","#33041d")

# Text of the labels
textTitle = "pH of Water vs Dissolved Carbon Dioxide\nStation ALOHA (Hawaii), 1988-2017"
textXaxis = expression(paste("Partial pressure of CO"["2"], " (", mu, "atm)"))
textYaxis = "Seawater pH"

aloha <- read.csv("dataAloha.csv")
aloha$date <- as.Date(aloha$date)
aloha$period <- factor(aloha$period)

pH = aloha$pH
pCO2 = aloha$pCO2
period = aloha$period

plot(pCO2, pH,
     pch = 19,
     col = col.list[period],   # one color per five-year period
     main = textTitle,
     xlab = textXaxis,
     ylab = textYaxis)
grid(col = "lightblue4", lty = "dotted")
legend("bottomleft", legend = levels(period), col = col.list, pch = 19, title = "Period")

Each point is colored according to its five-year period, and the legend identifies the periods: the later the period, the higher the pCO2 and the lower the pH.

Step 10: Conclusions

The data are undeniable: ocean acidification is a fact, and it seems to be closely related to the increase in CO2 absorbed by the ocean. With this activity, we hope to raise our students’ environmental awareness while making them technologically competent to analyze real data and draw conclusions. Of course, working with real data can be hard, but it is a very strong motivational factor: students don’t have to blindly accept facts, they discover the relationships by themselves, which makes their learning more meaningful.

Participated in the Classroom Science Contest

    3 Discussions


    Question 1 year ago on Step 4

    Hi, thanks for sharing this lesson. I'm wondering why you haven't decided to run all the data manipulation and cleansing in R. Is this a personal choice?


    Answer 1 year ago

    Well, actually it was more of a "teaching" choice. I'm assuming the students have no prior knowledge of R, so I wanted to simplify the programs as much as possible. I think that doing all the data manipulation in R might be overwhelming for them, so I preferred to do it this way. Besides, I've noticed that most of them are not completely fluent in using a spreadsheet, so I thought this would be a good way to practice!


    1 year ago

    This is really interesting! Thanks for sharing.