Preseeding Ubuntu Server and Static IP Addresses

Setting up a cluster of computers for any purpose usually requires installing an operating system. The installation process typically consists of several questions and identical answers for each node in the cluster. Automating the submission of answers to these questions is desirable — not only to prevent inconsistencies, but for general convenience.

Preseeding

I spent the last few days working to stand up a proof-of-concept Riak cluster. The first step involved installing Ubuntu Oneiric Ocelot (11.10) on four virtual machines. Luckily, Ubuntu/Debian has a process called preseeding to facilitate automated installations. Surprisingly, it also has limited support for Red Hat’s Kickstart. Playing it safe, I went with preseeding.

There are three methods that can be used for preseeding: initrd, file, and network. I wasn’t interested in re-authoring ISOs or setting up a TFTP server, so I went with a web-accessible preseed file. The upside of this approach is that the configuration file is easy to modify, yet still accessible. The downside is that the file doesn’t become available to the installer until the network is configured.

The Static IP Assignment Problem

Because web-accessible preseed files aren’t available until the network is configured, the step to assign a static IP address gets missed. Below are several approaches I found to assign a static IP address with preseeding.

Boot Parameters

The boot prompt is where you tell the installer how to locate your preseed file. It is also where you can pass a fixed number of preseed directives. In our example of assigning a static IP address, you’d pass things like IP address, hostname, domain, and netmask. Ultimately, I wasn’t too interested in this approach because it required a lot of typing without clipboard access.
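
To give a sense of how much typing this involves, a boot line for a single node might look like the sketch below. The web server address, preseed file name, hostname, and domain are hypothetical placeholders. The netcfg/* names are the Debian installer's network questions as documented for installers of that era, so treat them as an assumption to verify against your release (newer installers rename netcfg/disable_dhcp to netcfg/disable_autoconfig). Everything is entered on one line; it is wrapped here for readability.

```text
url=http://192.168.1.2/preseed/node1.cfg
hostname=node1 domain=example.com
netcfg/disable_dhcp=true
netcfg/get_ipaddress=192.168.1.10
netcfg/get_netmask=255.255.255.0
netcfg/get_gateway=192.168.1.1
netcfg/get_nameservers=192.168.1.1
netcfg/confirm_static=true
```

Multiply that by every node in the cluster and the appeal of keeping these values in a file becomes obvious.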

Re-evaluating Network Configuration

The Ubuntu Help wiki has a suggested hack to trigger re-evaluation of preseeded network configuration settings by executing commands via preseed/run. Unfortunately, I was unable to get this to work. Every combination I tried caused the installer to fail. This related Ubuntu Forums post outlines the suggested steps pretty well.

Overwriting Network Configuration

This is the solution I eventually used to assign a static IP address. It’s a hack, but in my eyes it was the lesser of three evils. Alongside each node’s preseed configuration file, I created a corresponding shell script. The shell script gets executed before the installer triggers a reboot and overwrites /etc/network/interfaces with a static IP configuration:

echo "auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
 address 192.168.1.10
 netmask 255.255.255.0
 gateway 192.168.1.1
" > /etc/network/interfaces

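Keeping one nearly identical script per node gets repetitive, so a parameterized variant is one way to reduce the duplication. The sketch below is an assumption rather than the script described above: the write_interfaces function name and its arguments are hypothetical. One thing worth checking when wiring a script like this into the installer: commands run through preseed/late_command execute in the installer environment with the new system mounted at /target, so unless the command is run with in-target, the file to overwrite is /target/etc/network/interfaces.

```shell
#!/bin/sh
# Hypothetical helper: render a static interfaces file for one node.
# Usage: write_interfaces ADDRESS NETMASK GATEWAY [OUTPUT_FILE]
write_interfaces() {
  address="$1"
  netmask="$2"
  gateway="$3"
  # Default to the path used in the post; pass a different path
  # (for example /target/etc/network/interfaces) when needed.
  output="${4:-/etc/network/interfaces}"

  cat > "$output" <<EOF
auto lo
iface lo inet loopback

auto eth0
iface eth0 inet static
 address $address
 netmask $netmask
 gateway $gateway
EOF
}

# Example: render into a scratch file instead of the real path.
sample="$(mktemp)"
write_interfaces 192.168.1.10 255.255.255.0 192.168.1.1 "$sample"
cat "$sample"
```

Each node's script then shrinks to a single call with that node's address.
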
For completeness, I included my preseed configuration file as a Gist. If anyone has a better approach to setting a static IP address via preseeding or Kickstart, let me know!

Testing Command-line Applications with Aruba

Cucumber is often used to test web applications. Many developers hook it into their Rails projects to integration test site features. Wouldn’t it be great if there were a way to test command-line applications in a similar fashion? You can with Aruba.

Aruba

Aruba is a Cucumber extension for testing command-line applications written in any language. Passing arguments, interacting with the file system, capturing exit codes, and mimicking interactive usage are all features provided out of the box. Below is a basic scenario for the mv command. It passes because mv fails in the expected way when test.conf is missing:

Scenario: Backing up test.conf
  When I run `mv test.conf test.conf.bak`
  Then the output should contain:
  """
  mv: rename test.conf to test.conf.bak: No such file or directory
  """

Now let’s show off a few of Aruba’s built-in steps to prevent the command from failing:

Scenario: Backing up test.conf
  Given an empty file named "test.conf"
  When I run `mv test.conf test.conf.bak`
  Then the exit status should be 0
  And the following files should exist:
    | test.conf.bak |
  And the following files should not exist:
    | test.conf     |

The first step creates an empty file inside of Aruba’s sandbox directory, and the second executes mv there. After the mv command is executed, its exit status is compared to 0 and the existence of test.conf.bak (and non-existence of test.conf) is confirmed.
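
To make it concrete what those steps automate, here is roughly the same check written by hand in shell. This is only an illustration of the scenario's logic, not how Aruba is implemented; the temporary directory stands in for Aruba's sandbox.

```shell
#!/bin/sh
# A throwaway directory standing in for Aruba's sandbox.
sandbox="$(mktemp -d)"
cd "$sandbox" || exit 1

# Given an empty file named "test.conf"
: > test.conf

# When I run `mv test.conf test.conf.bak`
mv test.conf test.conf.bak
status=$?

# Then the exit status should be 0
[ "$status" -eq 0 ] || echo "unexpected exit status: $status"

# And the following files should exist / should not exist
[ -e test.conf.bak ] || echo "missing: test.conf.bak"
[ ! -e test.conf ] || echo "still present: test.conf"
```

The Aruba version says the same thing in four readable lines and handles the sandbox setup and teardown for you.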

It’s also worth noting that after each scenario Aruba clears out its sandbox — a temporary directory that becomes the current working directory for your command-line tool — unless you explicitly tag the scenario with @no-clobber. This tag preserves the previous scenario’s final state. Tying this back to the example above, the next scenario would begin with only test.conf.bak in the sandbox. Additional Aruba-specific tags can be found in the README.

Extending the Aruba API

As a command-line application evolves, other conditions not available in Aruba’s built-in API will require testing. For example, say you need to assert a file’s user and group attributes. Because Aruba’s API was built using Ruby modules, it can be reopened inside of Cucumber’s env.rb:

require 'etc' # Provides Etc.getpwuid and Etc.getgrgid

module Aruba
  module Api
    def check_file_owner_and_group(paths_and_users_and_groups)
      prep_for_fs_check do # Lower-level function provided by Aruba
        paths_and_users_and_groups.each do |path, user, group|
          stat = File.stat(path)

          Etc.getpwuid(stat.uid).name.should == user
          Etc.getgrgid(stat.gid).name.should == group
        end
      end
    end
  end
end

Then create a step definition that calls it:

Then /^the following files should have username "([^"]*)" and group "([^"]*)":$/ do |user, group, files|
  check_file_owner_and_group(files.raw.map { |file_row| (file_row << user) << group })
end

Now that step can be included to test the user and group attributes of files:

Scenario: Backing up test.conf
  Given an empty file named "test.conf"
  When I run `mv test.conf test.conf.bak`
  Then the exit status should be 0
  And the following files should exist:
    | test.conf.bak |
  And the following files should not exist:
    | test.conf     |
  And the following files should have username "hector" and group "staff":
    | test.conf.bak |

Conclusion

Using a behavior-driven development approach for building command-line applications with Cucumber and Aruba was a pleasure. Aruba’s API covers a decent amount of ground and was easily expandable. The source code was straightforward and after skimming its internals, I was able to expand the API to meet my needs. Hopefully reading this will help you do the same.

Replacing Excel with the Eighteenth Letter of the Alphabet

Every once in a while I have to graph data in order to better understand it. Most of the time, I use Microsoft Excel to generate graphs because it’s one of the easiest ways to produce them. Unfortunately, Excel’s ease of use quickly degrades once you move beyond pasting data into cells and clicking the graph button. I started looking for something to make the process of visualizing data more flexible: something with pluggable libraries, helpful examples, and room for reproducibility. I ended up replacing Excel with R.

R

R is an open source programming language for statistical computing and publication-quality graphics. Unlike a general-purpose programming language, R’s core includes many features designed to empower statisticians. I’m no statistician, but I was intrigued by its approachable syntax and familiar data structures. Even more, it’s backed by a potent community that contributes numerous packages to solve all sorts of common problems.

R code is typically interpreted in a REPL, but can also be captured in a file and executed as a script. To reduce my learning curve, I looked into an IDE for R. I settled on RStudio because it has a clean UI, was easy to install, and has its source code hosted on GitHub. If you’re going to explore R for the first time, I’d strongly encourage the use of an IDE. It makes searching documentation, viewing graphs, and inspecting output simple.

Graphing

My first task for R was to parse application logs and graph the frequency of specific user interactions. The application being logged provides a web interface to query financial data sets. The log records queries, so my goal was to plot each distinct data set and the number of times it was queried.

After reading through several examples of R’s standard graphing functions, I stumbled upon ggplot2. The ggplot2 package brands itself as an implementation of the “Grammar of Graphics”: a graphing system that takes what’s good about R’s existing graphics and omits the bad. In comparison to the base graphing library, ggplot2’s syntax is slightly more intuitive. This, and the fact that several answers on Stack Overflow recommend it, compelled me to give it a try.

library(plyr)
library(ggplot2)

# Import a pipe-delimited file without a header row.
requests <- read.csv("requests.dat", header=FALSE, sep="|")

# Extract a subset of the requested data sets (column V8)
# with a frequency greater than 2000.
data_set_freq <- subset(count(requests, 'V8'), freq > 2000)

# Rename columns and rows.
colnames(data_set_freq) <- c('data_set', 'freq')
row.names(data_set_freq) <- data_set_freq$data_set

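# Plot it. stat = "identity" uses the precomputed freq column as
# the bar height instead of counting rows.
ggplot(data_set_freq, aes(factor(data_set), freq)) +
  geom_bar(stat = "identity") +
  labs(y = "Web Queries", x = "Data Sets")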

The short snippet of code above produces the following graph:

Figure: 2011 Web Queries by Data Set
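
When numbers come out of a new tool, it can be reassuring to cross-check them another way. The sketch below tallies the same column with awk. It assumes the same pipe-delimited layout with the data set name in field 8; the data_set_freq function name and the tiny sample file are made up for illustration.

```shell
#!/bin/sh
# Count how often each value appears in field 8 of a pipe-delimited
# file and print the values seen more than THRESHOLD times.
# Usage: data_set_freq FILE [THRESHOLD]
data_set_freq() {
  file="$1"
  threshold="${2:-2000}"
  awk -F '|' -v min="$threshold" '
    { count[$8]++ }
    END {
      for (name in count)
        if (count[name] > min) print name "|" count[name]
    }
  ' "$file" | sort
}

# Example with made-up data and a threshold of 1.
sample="$(mktemp)"
printf '%s\n' \
  'a|b|c|d|e|f|g|prices|x' \
  'a|b|c|d|e|f|g|prices|x' \
  'a|b|c|d|e|f|g|rates|x' > "$sample"
data_set_freq "$sample" 1   # prints: prices|2
```

If these counts match the bar heights, the R pipeline is reading the file the way you expect.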

Now, to take that a step further, I wanted to figure out which data set is associated with the most failures. Building on the code above, here’s how I accomplished that:

# Extract a subset of data set names (column V8) for
# failed requests (column V15).
data_set_errors_freq <- count(subset(requests,
  grepl('ERROR|killed', requests$V15), select = c(V8)), 'V8')

# Rename columns and rows.
colnames(data_set_errors_freq) <- c('data_set', 'freq')
row.names(data_set_errors_freq) <- data_set_errors_freq$data_set

# Merge the data set frequencies with errors and create
# a third column for percent error.
data_set_summary <- merge(data_set_freq, data_set_errors_freq,
  by.x = 'data_set', by.y = 'data_set')
data_set_summary <- ddply(data_set_summary, .(data_set), transform,
  percent_error = (freq.y / freq.x) * 100)

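# Plot it. stat = "identity" uses the precomputed freq.x column as
# the bar height instead of counting rows.
ggplot(data_set_summary, aes(factor(data_set), freq.x, fill=percent_error)) +
  geom_bar(stat = "identity") +
  labs(y = "Web Queries", x = "Data Sets", fill="Percent Error")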

The result is a graph similar to the one above, except that the bars are colored by the percentage of errors:

Figure: 2011 Web Queries by Data Set with Percent Error
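
The percent error arithmetic is simply failed requests divided by total requests, times 100, per data set. As a worked illustration, here is the same calculation sketched in awk, assuming the data set name is in field 8 and the status text in field 15, as in the R code. The percent_error function name and the sample rows are hypothetical. Unlike the R version, this sketch applies no frequency threshold and also lists data sets with zero failures, which merge() drops by default because it keeps only rows present in both data frames.

```shell
#!/bin/sh
# For each data set (field 8), print: name|total|failed|percent_error.
# A request counts as failed when field 15 matches ERROR or killed.
# Usage: percent_error FILE
percent_error() {
  awk -F '|' '
    { total[$8]++ }
    $15 ~ /ERROR|killed/ { failed[$8]++ }
    END {
      for (name in total)
        printf "%s|%d|%d|%.1f\n", name, total[name], failed[name],
               100 * failed[name] / total[name]
    }
  ' "$1" | sort
}

# Example with made-up data: 1 of 2 "prices" requests failed.
sample="$(mktemp)"
prefix='1|2|3|4|5|6|7'
middle='9|10|11|12|13|14'
printf '%s\n' \
  "$prefix|prices|$middle|OK" \
  "$prefix|prices|$middle|ERROR: timeout" \
  "$prefix|rates|$middle|OK" \
  "$prefix|rates|$middle|OK" > "$sample"
percent_error "$sample"
# prints:
# prices|2|1|50.0
# rates|2|0|0.0
```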

Conclusion

Building graphs with R feels a lot like building servers with Chef. I can configure a server once manually, or I can write code that automates its deployment process forever. Likewise, I can paste data into Excel and point and click to build a graph, or I can write R code that reproduces a handful of steps with one command. There are a number of GUI tools that build graphs from data, but once you begin applying filters, merging data sets, or running calculations these tools break down. Writing code instead of clicking buttons has its downsides, but there is something comforting in knowing that as long as our log structure doesn’t change, I’ll be able to reproduce these graphs six months from now and immediately know which data sets are most error-prone.

My name is Hector Castro and I am a developer located in Philadelphia, PA. If you are interested in my services, please review my resume, GitHub account, and profile on Stack Overflow Careers.