suppressMessages(library(caret))
suppressMessages(library(rpart))
suppressMessages(library(tree))
suppressMessages(library(randomForest))
suppressMessages(library(e1071))
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
suppressMessages(library(rattle))
In this project, we will be working with data from the SeaFlow environmental flow cytometry instrument.
A flow cytometer delivers a flow of particles through a capillary. By shining lasers of different wavelengths and measuring the absorption and refraction patterns, you can determine how large a particle is, along with some information about its color and other properties, allowing you to identify it.
The technology was developed for medical applications, where the particles were potential pathogens in, say, serum, and the goal was to produce a diagnosis. The technology was later adapted for use in environmental science to understand microbial population profiles.
The SeaFlow instrument, developed by the Armbrust Lab at the University of Washington, is unique in that it is deployed on research vessels and takes continuous measurements of population profiles in the open ocean.
The scale of the data can be quite large, and is expected to grow significantly: A two-week cruise from one vessel can generate hundreds of gigabytes per day, and the vision is to deploy one of these instruments on not only research vessels but the commercial shipping fleet as well.
While there are a number of challenging analytics tasks associated with this data, a central task is classification of particles. Based on the optical measurements of the particle, it can be identified as one of several populations.
We use a dataset that represents a 21 minute sample from the vessel in a file seaFlow 21min.csv. This sample has been pre-processed to remove the calibration “beads” that are passed through the system for monitoring, as well as some other particle types.
The columns of this dataset are as follows:
· file_id: The data arrives in files, where each file represents a three-minute window; this field represents which file the data came from. The number is ordered by time, but is otherwise not significant.
· time: This is an integer representing the time the particle passed through the instrument. Many particles may arrive at the same time; time is not a key for this relation.
· cell_id: A unique identifier for each cell WITHIN a file. (file_id, cell_id) is a key for this relation.
· d1, d2: Intensity of light at the two main sensors, oriented perpendicularly. These sensors are used primarily in preprocessing to determine whether a particle is properly centered in the stream; they are unlikely to be useful for classification.
· fsc_small, fsc_perp, fsc_big: Forward scatter small, perpendicular, and big. These values help distinguish different sizes of particles.
· pe: A measurement of phycoerythrin fluorescence, which is related to the wavelength associated with an orange color in microorganisms.
· chl_small, chl_big: Measurements related to the wavelength of light corresponding to chlorophyll.
· pop: This is the class label assigned by the clustering mechanism used in the production system. It can be considered “ground truth”, but there are particles that cannot be unambiguously classified, so we should not aim for 100% accuracy. The values in this column are crypto, nano, pico, synecho, and ultra.
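As a quick sanity check on the schema above, one can verify that (file_id, cell_id) is indeed a key, i.e. that no (file_id, cell_id) pair occurs twice. A minimal sketch with a toy stand-in data frame (the column names follow the description above; the values are made up):

```r
# Toy stand-in for the SeaFlow table: two files, three cells each
toy <- data.frame(
  file_id = c(1, 1, 1, 2, 2, 2),
  cell_id = c(0, 1, 2, 0, 1, 2),
  pop     = c("pico", "ultra", "nano", "pico", "synecho", "pico")
)

# (file_id, cell_id) is a key iff every combination occurs at most once
key_counts <- table(toy$file_id, toy$cell_id)
is_key <- all(key_counts <= 1)
is_key
```

The same check applied to the real data frame would confirm the key property stated above.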
Step 1: Read and summarize the data
Read the file seaFlow 21min.csv and get the overall counts for each category of particle.
data <- read.csv("seaFlow 21min.csv")
data %>% select(pop) %>% table
## .
##  crypto    nano    pico synecho   ultra
##     102   12698   20860   18146   20537
Step 2: Split the data into test and training sets
We divide the data into two equal subsets, one for training and one for testing.
indices <- createDataPartition(y = data$pop, p = 0.5, list = FALSE)
Train <- data[indices, ]
Validation <- data[-indices, ]
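Because createDataPartition performs a stratified split, each population should appear in roughly the same proportion in both halves. A sketch of how one might check this, continuing from the Train and Validation objects created above (exact counts vary with the random seed):

```r
# Class proportions in each half; with stratified sampling these
# should closely match the proportions in the full data set
round(prop.table(table(Train$pop)), 3)
round(prop.table(table(Validation$pop)), 3)

# The two halves should together account for every row exactly once
nrow(Train) + nrow(Validation) == nrow(data)
```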
Step 3: Exploratory Analysis
We plot pe against chl_small and color the points by pop. The plot is interesting because it reveals obvious clustering: the populations separate into visibly distinct groups even in this two-dimensional projection.
ggplot(data, aes(chl_small, pe, color = pop)) + geom_point(na.rm = TRUE)
Step 4: Train a decision tree.
We train models of the form:
model <- train(formula, dataframe)
We then use the model object to make predictions by passing it to the predict function.
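As an illustration of this train-then-predict pattern, here is a minimal sketch using R's built-in iris data (the dataset and method choice here are purely illustrative, not part of the SeaFlow analysis):

```r
library(caret)

# Fit a model through caret's uniform interface; "rpart" is one of many
# supported methods
fit <- train(Species ~ ., data = iris, method = "rpart")

# predict() returns the predicted class for each row of new data
preds <- predict(fit, newdata = iris)
head(preds)
```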
A formula object describes the relationship between the independent variables and the response variable, and can be expressed with the syntax
pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small
and used with the formula function to construct the formula object itself:
fol <- formula(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small)
The rpart library uses this convention. With the training data Train and the formula object fol, we construct a decision tree. Here we use fancyRpartPlot from the rattle package to visualize the decision tree.
fol <- formula(pop ~ fsc_small + fsc_perp + fsc_big + pe + chl_big + chl_small)
tree_model <- rpart(fol, method = "class", data = Train)
fancyRpartPlot(tree_model)
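Having fit tree_model on Train, a natural next step is to measure its accuracy on the held-out set. A sketch, reusing the Validation set from Step 2 (exact numbers depend on the random partition):

```r
# Predict the class label for each held-out particle
pred <- predict(tree_model, newdata = Validation, type = "class")

# Confusion matrix of predicted vs. true populations
table(pred, Validation$pop)

# Overall accuracy; expect this to fall short of 100%, since some
# particles cannot be unambiguously classified
mean(pred == Validation$pop)
```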