Salt Pillar-Driven Design Pattern

I am a gigantic fan of Salt/SaltStack. While working at YouVersion I migrated us from Puppet to Salt, and I never looked back. In fact, Salt made me so much faster at my job that I attended the first SaltConf this year and became one of the first ten SaltStack Certified Engineers.

I want to share a design pattern I use in my Salt code that helps me get things provisioned faster and with less chance of mistakes.

I use pillars heavily in our Salt code. For instance, I have some PostgreSQL servers that run 9.2 and others that run 9.3. I use a pillar to assign which version should be installed on a particular host, and my postgresql state uses that pillar data to decide which version gets installed.

What I found is that when I was adding new machines, I had to describe each host in two different places: the pillar top.sls and the state top.sls. This is what led me to a design pattern where my states are controlled by my pillars, so I no longer have to do double duty.

Here’s how I do it:


I start out with a pillar/top.sls that looks like this:
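(A sketch of the shape; the hostname globs are placeholders, so swap in whatever matches your own naming scheme.)

base:
  # pgbouncer hosts get the pgbouncer pillar
  'pgbouncer*':
    - pgbouncer
  # hosts that should run PostgreSQL 9.2 get the base postgresql pillar
  'pg92-*':
    - postgresql
  # hosts that should run PostgreSQL 9.3 get the pillar that overrides the version
  'pg93-*':
    - postgresql-93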

Nothing too crazy here. I have three pillars (pgbouncer.sls, postgresql.sls and postgresql-93.sls) and assign them to my hosts as needed.


pgbouncer.sls is very straightforward:
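(The keys below are hypothetical stand-ins for whatever your pgbouncer state actually reads.)

# pillar/pgbouncer.sls - example keys only
pgbouncer:
  listen_port: 6432
  max_client_conn: 500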


postgresql.sls uses slightly more advanced pillar data:
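(Again a sketch; pg_version is just my stand-in name for the Jinja variable, and the keys under postgresql are examples of what a postgresql state might read.)

# pillar/postgresql.sls
# If pg_version was not set by an including pillar, fall back to 9.2.
{% set pg_version = pg_version | default('9.2') %}

postgresql:
  version: {{ pg_version }}
  data_dir: /var/lib/postgresql/{{ pg_version }}/main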

In this pillar, I am using Jinja templating that allows me to assign default values and override them from other pillars. Basically, these lines mean “if this Jinja variable doesn’t exist, use the default value”.


postgresql-93.sls is a pillar that inherits all the values of postgresql.sls, but also sets the Jinja variable that controls which PostgreSQL version my postgresql state installs.
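Roughly, reusing the pg_version name from the sketch above:

# pillar/postgresql-93.sls
# Set the version first, then pull in everything from postgresql.sls.
# The Jinja include renders with this context, so the 9.2 default never kicks in.
{% set pg_version = '9.3' %}
{% include 'postgresql.sls' %}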


And now, the “secret sauce”, the state top.sls file:
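(Another sketch; the pillar keys match the examples above, and common is just a hypothetical state that every host gets.)

# state top.sls
base:
  '*':
    - common
    # No hostnames anywhere: a state is only assigned when the matching
    # pillar key exists for that minion.
{% if 'pgbouncer' in pillar %}
    - pgbouncer
{% endif %}
{% if 'postgresql' in pillar %}
    - postgresql
{% endif %}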

Notice the big *. I do not describe any hosts in my state top.sls; I just use pillar data to decide whether a state should be loaded.

Now, I only have to describe what should be on my hosts once, in my pillar: no more double work, and one less place to accidentally forget a step.

Compacting/Shrinking a VirtualBox image when using Vagrant

At YouVersion, we are happy users of Vagrant and VirtualBox to power our API engineers’ development environments. One of the reasons we love Vagrant is that there is a SaltStack provisioner for it, so our development environments use the exact same states that our production environment does, allowing us to always run our current configs.

Today, I had a need to make our Vagrant images smaller, as a lot of bloat had accumulated over the past year in the image creation process. Our image was up to 9 gigs. By deleting apt package caches and removing unnecessary packages and logs, I was able to shrink the total amount of used space to 5 gigs. AWESOME!

There’s a problem though: because I have a dynamically expanding disk, the virtual disk was still taking up 9 gigs on the host. Bummer.

Not a big deal, I thought: I’ll just compact it. Then I found out that Vagrant uses VMDK files (the VMware standard) instead of VDI files (the VirtualBox standard), so I was not able to compact the disk with VirtualBox. Bummer again.

So, I set out to discover the best way to shrink my VirtualBox disk so that I can package a base image to use with Vagrant.

Here is the process I ended up using.

On the VM itself, delete all the files you can and then:

dd if=/dev/zero of=file
rm file

This dd command will write zeros to your virtual disk until it runs out of free space. Once it fills up all the space, remove the file it just created and BAM! You just zeroed out all your free sectors, allowing for maximum compression of the free space.

Now, shut down your machine and head to the VM’s directory on the host. You can get this path by going into VirtualBox and checking the Location of the virtual disk in the VM’s settings.

Next, run this:

VBoxManage clonehd box-disk1.vmdk cloned.vdi --format vdi

This command will convert your VMDK file to a VDI file. As an added bonus, the conversion is aware of all those zeroed sectors, so we won’t even need to run a compact operation; it will already be done.

Now, we convert the disk back to a much smaller VMDK file:

VBoxManage clonehd cloned.vdi box-disk2.vmdk --format vmdk

Finally, we end up with a much smaller VMDK, even smaller than the VDI:

-rw------- 1 wplatnick 9.9G Sep  2 12:14 box-disk1.vmdk
-rw------- 1 wplatnick 5.0G Sep  2 12:42 box-disk2.vmdk
-rw------- 1 wplatnick 5.6G Sep  2 12:37 cloned.vdi

Now, we delete the old VM files:

rm box-disk1.vmdk
rm cloned.vdi

To wrap it up, open VirtualBox, go into the settings for the VM, and under Storage you’ll notice a red exclamation point. Select the drive entry and point it at the new VMDK file you just made (box-disk2.vmdk).

Note: I initially tried being slick and just renaming box-disk2 to box-disk1, but VirtualBox has some checksums in its config, so it knows that it’s not the right disk.

Now, you’re able to run vagrant package and your file will be much smaller!

My Favorite SSL Certificate: AlphaSSL

Back in the day, we didn’t have to worry about SSL acceptance on mobile phones as much. I have been a huge fan of RapidSSL certificates for many years, but it turns out their mobile acceptance rate isn’t all that great.  Thankfully, this was an issue that my predecessor at YouVersion had to deal with more than I did, and now I get to share this knowledge with you.

Initially, YouVersion used RapidSSL certificates, but we had problems with some Android phones not being able to connect to us because they didn’t have the root certificate that RapidSSL signs their certificates with.

My predecessor ended up finding this site, which has a tool that gathers data to figure out SSL acceptance rates. In a twist I didn’t see coming, it turns out GoDaddy SSL certificates have awesome acceptance rates, better than GeoTrust/RapidSSL certificates. The GoDaddy certificate was put into production and all the SSL issues went away. The problem, though, is that I really dislike GoDaddy as a company. I could go into why, but I really don’t want that much negativity in my life or my blog.

Thankfully, there is an SSL certificate provider on the SSL client compatibility list that has amazing acceptance rates as well: AlphaSSL, which is a brand of GlobalSign. AlphaSSL has great mobile acceptance and I have deployed their wildcard certificates in YouVersion’s environment with no problems at all.

The problem, though, is that the reseller ecosystem for AlphaSSL is not as large as RapidSSL’s, so I don’t know of a reputable place to get cheaper certificates. You can get the certificate directly from AlphaSSL and it will still be cheaper than GoDaddy (the next best certificate out there), but I know for a fact they offer resellers an amazing discount. Once hosting companies start to recognize how awesome AlphaSSL certificates are, this will change. In fact, I’m seriously thinking about setting up a turnkey shop and selling them myself.

Nagios/Icinga/Sensu Plugin to monitor SoftLayer Bandwidth

At YouVersion, we get a fair amount of traffic, and some months are busier than others. Because of that, we would sometimes forget to check our bandwidth usage and get hit with some pretty big fees. We wanted to make sure this was something we monitored so it would never happen again, and thus, this plugin was born.

High Load in Wheezy: High “Rescheduling interrupts” and “timer” Interrupts

Update #7: Earlier this month, I installed the 3.16 kernel from wheezy-backports. I was shocked to see that my high interrupt issue is gone. I still have a higher level of load on my systems, but I can’t find any explanation for it in any metric I track, so I think the load reporting algorithm has changed. Truthfully, the low load I experienced on 2.6.32 always made me suspicious, because these boxes do quite a bit. Wanting to make sure everything would be fine, I load tested the new kernel, and it performed the same for me.

Update #6: I have gotten absolutely nowhere with the Debian folks on this. It looks like we’re going to have to take this directly to the kernel developers and see what they have to say. I posted to the linux-kernel mailing list hoping somebody will be able to help. We’ll see how long it takes before they call me an idiot :)

Update #5: I have decided to take another look into this issue. The issue is with the 3.2+ kernels, as I can replicate it by installing a 3.2 kernel into Squeeze. This will hopefully make it easier to troubleshoot, since I can now isolate what is happening specifically to the kernel. I filed a new bug report at – Check it out and see if you’re seeing the same issues as I am.

Update #4: No traction at the Debian bug report. We ended up rolling back to Squeeze. I am a big proponent of Debian, but I’m definitely a little bummed out right now. We may end up trying to build an Ubuntu 12.04 environment just to see if we run into the same issue.

Update #3: I got nowhere with the debian-users mailing list, so I submitted a bug report to Debian at – We’ll see if anything comes of it.

Update #2: This morning I built the latest 2.6.32 kernel for wheezy and it got rid of all the load issues, so if for some reason you have to be on wheezy, that could be a way to go for you.

Update #1: I posted this to the debian-users mailing list, and we have a couple other people at this point who are having the same issue.

At YouVersion, we have been using Debian Squeeze to power our application servers. Personally, I’m a huge fan of Debian and have been for over 10 years now. Once Wheezy was officially released, we started the process of getting ready for it. We built new AMIs for our developers, set up testing environments using salt-cloud and built new versions of the components that help power our API, such as nginx, PHP, Python, Gearman and Twemproxy. Everything was going well until we put Wheezy in production.

Our plan was to only upgrade two boxes to Wheezy and then compare metrics to see how we were doing. The load on our application servers is normally between .5 and 1 under Squeeze. Under Wheezy, we were somewhere around 3, which troubled us greatly. Worse yet, the Wheezy boxes didn’t hold up under our Sunday traffic levels: php-fpm just wasn’t responding quickly enough, and monit had to restart it a few times before we took the boxes out of service.

During our troubleshooting, the first thing we noticed was that most of our stack (nginx, PHP, uWSGI/Python) was taking more virtual memory in Wheezy than in Squeeze. While this isn’t necessarily a big deal, it could be under the right circumstances. We decided that instead of doing an in-place upgrade to Wheezy, we’d do a fresh install. Thankfully, SoftLayer makes this super easy to do through their portal, and we had a new app server loaded with a fresh OS in less than an hour. This got rid of the virtual memory issue, but our load still remained high. The worst part was that we couldn’t easily attribute the high load to anything in particular. CPU usage was the same, memory usage was smaller in Wheezy and the I/O system all checked out fine. The only difference we found was that our interrupts are much higher in Wheezy than in Squeeze. Specifically, “Rescheduling interrupts” and “timer” are through the roof on Wheezy compared to Squeeze.

If you’re interested, you can check out what I found here.

We built a new server with a different board/CPU combination in hopes that the issue was somehow hardware related, but we saw the same numbers there.

Ultimately, we decided to load one server back to Squeeze and keep one with the fresh Wheezy install to see how it would hold up against our Sunday load. We use a custom C program that exports our HAProxy logs into JSON and ships them to Google’s BigQuery service, allowing us to easily and quickly query against them. Despite the high load, when we queried the average response time across all of our servers, the Wheezy box actually performed better than the rest of our app servers by about 3 ms. With the fresh reload, it was also able to stay up with no issues.

So, now we have a conundrum. From a performance perspective, we seem to be in good shape with Wheezy, but the high number of interrupts and the higher load are causing us tremendous unease about rolling it out to production. Right now, we’re not exactly sure what to do. The issue is bothering us so much that we’re thinking about spending the time to build out a test stack on Ubuntu Precise to see if we see the same thing there, since it’s on a kernel more similar to Wheezy’s.

So long Puppet, hello Salt Stack!

At YouVersion, we have been long-time users of Puppet. It works well and has suited our needs. We’re now at the point, however, where it’s time for us to start doing more involved things around our deployment processes, workflow automation and dynamic scalability. That is going to require a good amount of programming, and since nobody on our API or DevOps teams has the slightest desire to learn Ruby, Puppet and Chef were not ideal candidates for our particular use case. About four months ago we came across a product called Salt Stack that not only has the configuration management aspects of Puppet, but also a remote execution layer like Capistrano’s (but better, since it doesn’t use SSH). And the best part for us: it’s written in Python, so our team members can extend it and fix bugs easily. So, when comparing Puppet vs. SaltStack vs. Chef, Salt is a clear winner for us.

We did a few proof-of-concept exercises with Salt, and while it certainly has bugs from time to time, overall it works very well. We spent the last two weeks migrating all of our Puppet classes over to Salt states, provisioned a few new machines without Puppet, and we are happy with the results. We enjoy defining our states using YAML; it allows us to work in Salt faster than we ever did in Puppet. The remote execution layer has been awesome to work with as well. The fact that it uses a messaging layer and is tied directly into our configuration management platform makes it very easy to use, which in turn means it gets used much more often.

Salt isn’t perfect, but we’re willing to take a risk on it because we see a huge amount of potential. So far, it’s been able to do everything we’ve needed it to do, but we’ve barely scratched the surface. The community is very active. The mailing lists are very helpful and the people in their IRC channel are easy to talk to. New features are being added daily by the Salt community and the product gets better with every new release.

We are in the process of creating a number of projects that we will release to the community when they are complete, such as a Nagios/Icinga plugin to check for non-responding hosts. We are also going to create a Nagios/Icinga state and module that will allow modularized host/service definitions for Nagios/Icinga to be generated directly from your states. Some other projects in the pipeline are a module/state that will handle code deployments, and a Python wrapper around salt-cloud that will allow our developers to easily provision and self-manage virtual machines.

We are happy to be joining the Salt community and look forward to watching Salt grow and mature.