Replacing Excel with the Eighteenth Letter of the Alphabet

Every once in a while I have to graph data in order to better understand it. Most of the time, I use Microsoft Excel to generate graphs because it’s one of the easiest ways to produce them. Unfortunately, Excel’s ease of use quickly degrades once you move beyond pasting data into cells and clicking the graph button. I started looking for something to make the process of visualizing data more flexible. Something with pluggable libraries, helpful examples, and room for reproducibility — I ended up replacing Excel with R.

R

R is an open source programming language for statistical computing and publication-quality graphics. Different than a general purpose programming language, R’s core includes many features designed to empower statisticians. I’m no statistician, but I was intrigued by its approachable syntax and familiar data structures. Even more, it’s backed by a potent community that contributes numerous packages to solve all sorts common problems.

R code is typically interpreted in a REPL, but can also be captured in a file and executed as a script. To reduce my learning curve, I looked into an IDE for R. I settled on RStudio because it has a clean UI, was easy to install, and has its source code hosted on GitHub. If you’re going to explore R for the first time, I’d strongly encourage the use of an IDE. It makes searching documentation, viewing graphs, and inspecting output simple.

Graphing

My first task for R was to parse application logs and graph the frequency of specific user interactions. The application being logged provides a web interface to query financial data sets. The log records queries, so my goal was to plot each distinct data set and the number of times it was queried.

After reading through several examples of R’s standard graphing functions, I stumbled upon ggplot2. The ggplot2 package brands itself as the “Grammar of Graphics.” A graphing system that takes what’s good about R’s and omits the bad. In comparison to the base graphing library, ggplot2’s syntax is slightly more intuitive. This, and the fact that several answers on Stack Overflow recommend it, compelled me give it a try.

library(plyr)
library(ggplot2)

# Import a pipe-delimited file without a header row.
requests <- read.csv("requests.dat", header=FALSE, sep="|")

# Extract a subset of the requested data sets (column V8)
# with a frequency greater than 2000.
data_set_freq <- subset(count(requests, 'V8'), freq > 2000)

# Rename columns and rows.
colnames(data_set_freq) <- c('data_set', 'freq')
row.names(data_set_freq) <- data_set_freq$data_set

# Plot it.
ggplot(data_set_freq, aes(factor(data_set), freq)) + 
  geom_bar() + 
  labs(y = "Web Queries", x = "Data Sets")

The short snippet of code above produces the following graph:

2011 Web Queries by Data Set

Now, to take that a step further, I wanted to figure out which data set is associated with the most failures. Building on the code above, here’s how I accomplished that:

# Extract a subset of data set names (column V8) for
# failed requests (column V15).
data_set_errors_freq <- count(subset(requests,
  grepl('ERROR|killed', requests$V15), select = c(V8)), 'V8')

# Rename columns and rows.
colnames(data_set_errors_freq) <- c('data_set', 'freq')
row.names(data_set_errors_freq) <- data_set_errors_freq$data_set

# Merge the data set frequencies with errors and create
# a third column for percent error.
data_set_summary <- merge(data_set_freq, data_set_errors_freq,
  by.x = 'data_set', by.y = 'data_set')
data_set_summary <- ddply(data_set_summary, .(data_set), transform,
  percent_error = (freq.y / freq.x) * 100)

# Plot it.
ggplot(data_set_summary, aes(factor(data_set), freq.x, fill=percent_error)) + 
  geom_bar() + 
  labs(y = "Web Queries", x = "Data Sets", fill="Precent Error")

A similar graph to the one above, except that this one contains colors based on the percentage of errors:

2011 Web Queries by Data Set with Percent Error

Conclusion

Building graphs with R feels a lot like building servers with Chef. I can configure a server once manually, or I can write code that automates its deployment process forever. Likewise, I can paste data into Excel and point and click to build a graph, or I can write R code that reproduces a handful of steps with one command. There are a number of GUI tools that build graphs from data, but once you begin applying filters, merging data sets, or running calculations these tools break down. Writing code instead of clicking buttons has its downsides, but there is something comforting in knowing that as long as our log structure doesn’t change, I’ll be able to reproduce these graphs six months from now and immedietally know which data sets are most error prone.

My name is Hector Castro and I am a developer located in Philadelphia, PA. If you are interested in my services, please review my resume, GitHub account, and profile on Stack Overflow Careers.

You can get in contact with me via Twitter or E-mail.