When A Shell Script Is Doing Too Much

The engineering department has only begun to embrace automated code deployments since I joined. I still haven’t been able to convince them to fully adopt a solution like Chef with the development teams, but hey, baby steps.

With that being said, our current code deployment infrastructure is based entirely on shell scripts and Jenkins. It’s a pretty awesome complement of scripts, separating out the various bits of server rolling (marking a server as shutting down, monitoring user levels, shutting the server down, deploying files, starting it, etc.) and the lists of environment hosts from the high-level logic of “do a rolling deploy now pls thx”. Unfortunately, as is often the case, the simple use cases that were required early on have grown in complexity with time.

It’s not a perfect system by any stretch, and it hasn’t really gotten the full complement of love and affection that it needs. Recently, we had a bit of a SNAFU when a colleague made a change to the Jenkins job that resulted in the script breaking in a pretty serious way. I can’t help but take blame for it, as it was caused by a sharp corner I left in the deployment script. I had built this out assuming that the job calling it was totally stable and the universe was filled with sweetness and light… it turns out this was not the case.

Let’s take a look at what caused the problem!

The problem

Early on, we added support for monitoring user levels before shutting a server down, allowing us to perform the rolling restarts without disconnecting users. This was superfluous and painful for our non-production environments, so a switch was added to skip this step. I exposed this as a Jenkins parameterized build option.

In Jenkins, a number of built-in environment variables are set before a shell script is executed, and any parameterized build options are exposed as environment variables as well. Here’s how I implemented the check:

if [ "$SKIP_LOCKING" = "false" ]; then
  # Do the lock and monitor stuff
  : # placeholder so the branch is valid shell
else
  echo -e "\tLocking will be skipped for $HOST"
fi # End skip locking check

Can you spot the problem above? When the other developer removed the $SKIP_LOCKING option from the Jenkins job, the variable became undefined. An unset variable expands to an empty string, which does not equal “false”, so the check failed and the script defaulted to skipping the lock step! In production, this meant that one of the hosts went down abruptly, causing the players on that server to be unceremoniously given the boot.

There are a number of ways to fix this. The one I chose was to provide a default value when referencing the Jenkins variable. The default is false, so if nothing is provided, the script no longer assumes that locking should be skipped.
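As a rough sketch of that fix (not the exact code from our deployment script), shell parameter expansion can supply the fallback. The helper names resolve_skip_locking and lock_decision are made up for illustration; only $SKIP_LOCKING comes from the Jenkins job.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the effective value of SKIP_LOCKING.
# ${VAR:-default} expands to the default when VAR is unset or empty,
# so a missing Jenkins parameter is treated as "false".
resolve_skip_locking() {
  printf '%s\n' "${SKIP_LOCKING:-false}"
}

# Hypothetical helper: decide whether to lock, based on the effective value.
lock_decision() {
  if [ "$(resolve_skip_locking)" = "false" ]; then
    echo "lock"
  else
    echo "skip"
  fi
}

echo "Decision for this run: $(lock_decision)"
```

With this in place, removing the parameter from the job leaves the script on the safe path (locking) rather than the risky one.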

While the original behavior wasn’t incorrect, strictly speaking, it was a sharp corner, and I don’t like those. The impact to users ended up causing a number of meetings (sigh) about how to prevent this in the future. The outcome of these meetings was a desire to put a ton of monitoring, safety checks, and sanity checks into the deployment scripts, as well as some new features.
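To give a flavor of what one such sanity check might look like, here is a minimal sketch that rejects anything other than “true” or “false” for a switch like $SKIP_LOCKING. The function validate_bool is hypothetical and is not something that exists in our scripts today.

```shell
#!/usr/bin/env bash

# Hypothetical helper: validate_bool NAME VALUE
# Returns 0 when VALUE is exactly "true" or "false"; otherwise prints an
# error to stderr and returns 1 so the caller can abort the deploy.
validate_bool() {
  case "$2" in
    true|false)
      return 0
      ;;
    *)
      echo "ERROR: $1 must be 'true' or 'false', got '$2'" >&2
      return 1
      ;;
  esac
}

# Possible usage near the top of a deploy script:
#   validate_bool SKIP_LOCKING "${SKIP_LOCKING:-}" || exit 1
```

Failing loudly like this is the stricter alternative to a default: a missing or misspelled parameter stops the deploy before any server is touched.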

The path forward

At this point, the requirements have shifted and expanded so much that I don’t think bash is necessarily the best choice going forward. Instead, I’ve decided to use a more full-featured scripting language, Ruby. Using clamp, I can interact with the same command line utilities I had previously used, with better support for readable code (sand down those sharp corners!).