ggplot2 graphics in a loop

A client has a specific audit they perform quarterly across 200 of their manufacturing plants. The audit has 8 distinct sections examining the different areas of the plant (shipping, receiving, storage, packaging,etc.) Instead of having one cumulative final score, the audit displays a final score for each section. I wanted to examine the distribution of section’s scores before and after removing outliers and extreme values. I usually use the MASS package’s truehist() for quick looks at data, but since I’m writing a detailed loop I will use ggplot2 for fine aesthetic control. The historical results of audits were imported into a data frame with the 8 score columns as well as other instance identifying columns. Basically I don’t want to waste time writing out “ggplot(df,aes(x=x)) + geom_histogram()” or “qplot(x,data = df))” for each section’s corresponding column in the df.

Below is the function in it’s entirety:

[sourcecode language=”r”]
plotHistFunc <- function(x, na.rm = TRUE, …) {
nm <- names(x)
for (i in seq_along(nm)) {
plots <-ggplot(x,aes_string(x = nm[i])) + geom_histogram(alpha = .5,fill = "dodgerblue")
ggsave(plots,filename=paste("myplot",nm[i],".png",sep=""))
}
}

plotHistFunc(df) ## execute function
[/sourcecode]

Line 1: The usual process of defining the function and assigning it a name, in this case I’m calling the function “plotHistFunc”. I’m also setting na.rm to TRUE so that NA entries are ignored. Here “x” will be a data frame that I specify when running the function later with plotHistFunc(“dataframe name here”)

Line 2: names() retrieves the column names of a given dataframe “x” to be used in the next step.

Line 3: This is where it gets tricky, here we define the typical “i in x” part of a generic function, however instead of putting “x” (our dataframe) here we will use the recently known to me seq_along() function. This function generates a regular sequence along object nm.

Line 4: The code to be iterated in the loop is within the inner set of brackets {}, here the ggplot function is assigned to object “plots”. plots aes_string which is useful when writing functions that create plots because you can use strings to define the aesthetic mappings, rather than having to mess around with expressions.

Line 5: ggsave is used to save plots to a file, along with paste I am able to generate unique file names for each plot. For example filename=paste(“myplot”,nm[i],”.png”,sep=””) will generate a file with name myplotshipping.png, “shipping” coming from the name of the column.

Here is the function if you wanted to generate/view the plots in your R environment, but not generate files.

[sourcecode language=”r”]
plotHistFunc <- function(x, na.rm = TRUE, …) {
nm <- names(x)
for (i in seq_along(nm)) {
print(ggplot(x,aes_string(x = nm[i])) + geom_histogram(alpha = .5,fill = "mediumseagreen")) }
}
[/sourcecode]

It is important to note that the ggplot function needs to be wrapped in print in order for it to display.

Data visualization with R and ggplot2

I’m working on a one-hour ggplot2 lecture for the San Diego R users group, which I will post here when I’m done. I think there are many great intro to R data visualization resources out there so I’ll only share working examples on my blog.

A retail chain client employs a few hundred field agents who perform a variety of standardized audits across thousands of their retail locations. This client wanted to identify how the tenure of a field agent (how long they have been with the company) affected the way they scored locations they audited.

The audit score and tenure information was extracted from two separate SQL databases and will be joined in R.
[sourcecode language=”r”]
df <- read.csv ("df.csv", header = TRUE, sep = ",", quote="\"", dec=".",fill = FALSE)
tenure <- read.csv ("tenure.csv", header = TRUE, sep = ",", quote="\"", dec=".",fill = FALSE)
[/sourcecode]
The audit dataset contains 2 years of completed audits categorized by 2 month “rounds”. That is to say every retail location was audited once every two months, thus there are 6 rounds in one year. Here I define my outlier function:
[sourcecode language=”r”]
remove_outliers <- function(x, na.rm = TRUE, …) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, …)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] – H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
[/sourcecode]
Variation is introduced as the audit changes slightly per round so I use the outlier function to remove outliers per round rather than apply it to the entire sample.
[sourcecode language=”r”]
df$score_outrm <- ave(df$Score,df$Round,FUN=remove_outliers) #apply f(x) over level combinations of factors
dfoutrm <- df[!(is.na(df$score_outrm) | df$score_outrm==""), ] #remove rows with NA in score_outrm column
df <- subset(dfoutrm, select = -c(score_outrm) ) #remove column used to identify outliers
remove(dfoutrm)
[/sourcecode]
The two datasets will be easy to join since the tenure data frame is structured as a lookup table:
[sourcecode language=”r”]
merged <- merge(df,tenure) #merge df and tenure data
merged <- subset(merged, Storetype != "Mall")#remove Mall locations as there are too few too visualize properly
#Just a quick examination of the relationship with a linear model:
lm1<- lm(Score ~ tenureYears,data=merged)
coef(lm1)
#Pasted output:
#(Intercept) tenureYearsone-two tenureYearstwo-three
#93.5756995 -0.6065633 -1.1380081
[/sourcecode]
This shows a negative relationship between tenure and score, that is score decreases as tenure increases. Let’s take a quick look at summaries with plyr’s ddply which allows us to apply a numeric function to a column subsetted by a value of another column. In this case we are looking at mean and median of the Score column by the tenureYears column:
[sourcecode language=”r”]
library(plyr)
ddply(merged, .(tenureYears), summarise, "mean" = mean(Score)) # get summary of means for scores by tenure
#Pasted output
# tenureYears mean
#1 one 93.57570
#2 one-two 92.96914
#3 two-three 92.43769
ddply(merged, .(tenureYears), summarise, "median" = median(Score)) # get summary of median for scores by tenure
#Pasted output
# tenureYears median
#1 one 94.220
#2 one-two 93.345
#3 two-three 92.980
[/sourcecode]
Now for ggplot2:
[code language=”r”]
library(ggplot2)
ggplot(merged, aes(x = tenureYears, y = Score, fill = storetype)) +
geom_boxplot(alpha = .6,size = 1) +
scale_fill_brewer(palette = "Set1") +
stat_summary(fun.y = "mean", geom = "point", shape= 23, size= 2, fill= "white") +
geom_smooth(alpha=2/10,aes(group=1),method = "lm", se=FALSE, color="black", size=.7) +
ggtitle("Audit scores by tenure and store type") +
theme(axis.title.y=element_blank()) + theme(axis.title.x=element_blank()) +
labs(fill=’store type’)
[/code]I added mean (white diamonds) to the boxplot to exhibit how extreme values have pulled it one way or the other away from the median. I could have used a more robust gam line instead of lm, but that wouldn’t be as salient unless this was a time series with geom_point. I might fine tune outlier detection and examine the distributions of the data a bit more before proceeding as this was just a preliminary look. We could gain more insights by faceting the visualization on other variables such state, time of day, management hierarchy, etc. with the facet_grid() parameter.
ggplot tenurescore