How We Built a Scalable SaaS Tech Stack
We don’t mean to brag (okay, maybe a little bit), but over the past five years, Geniuslink has maintained almost 100% uptime through our technology stack, even as the number of clicks and users has grown exponentially and the demand on our service has increased dramatically. Since we started monitoring our service back in April of 2011, we’ve been able to provide an average uptime of 99.9919%, or roughly 2,585,707.5 out of 2,585,917 total minutes. That’s right – from April 2011 until February 2016, Geniuslink has only had 3.5 hours of “downtime” total. To compare, most large SaaS companies promise 99.9% uptime, which allows about 8.76 hours of downtime per year. Not bad, huh?
You’re probably wondering why downtime is in quotes above. Well, we call ourselves out even when just a single region from our massive infrastructure goes offline for a single second – and downtime is rounded up to the nearest minute. What this means is that even when we’re still able to serve clicks (since we automatically route traffic to a different server or region when one goes down), we’ll still consider the service offline – just to keep ourselves honest.
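For the curious, here is a small sketch of the arithmetic behind those numbers. The minute counts come straight from this post; the function names are our own illustration, and the round-up helper is only an assumption about how "rounded up to the nearest minute" could be expressed in code.

```python
import math

# Figures quoted in this post (April 2011 to February 2016).
TOTAL_MINUTES = 2_585_917
UPTIME_MINUTES = 2_585_707.5


def uptime_percent(up_minutes: float, total_minutes: float) -> float:
    """Share of monitored time the service was up, as a percentage."""
    return 100.0 * up_minutes / total_minutes


def counted_downtime_minutes(outage_seconds: float) -> int:
    """Round an outage up to the next whole minute (illustrative helper)."""
    if outage_seconds <= 0:
        return 0
    return math.ceil(outage_seconds / 60)


def allowed_downtime_hours_per_year(sla_percent: float) -> float:
    """Downtime budget implied by an uptime promise over a 365-day year."""
    return (1 - sla_percent / 100.0) * 365 * 24


if __name__ == "__main__":
    downtime_minutes = TOTAL_MINUTES - UPTIME_MINUTES
    print(f"uptime: {uptime_percent(UPTIME_MINUTES, TOTAL_MINUTES):.4f}%")
    print(f"total downtime: {downtime_minutes / 60:.2f} hours")
    print(f"99.9% promise allows: {allowed_downtime_hours_per_year(99.9):.2f} hours/year")
    print(f"a 1-second blip counts as: {counted_downtime_minutes(1)} minute(s)")
```

Under this rule, a one-second regional blip is logged as a full minute of downtime, which is why the reported figure is a conservative one.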
However, our goal is to not just keep things “online” – that would be too easy. We also strive for blazing fast redirection for every click to its intended destination, immediate load times for our user interface, and speedy report load times for billions of clicks, thousands of clients, and hundreds of data points worldwide (where some services only have a single region).
How do we do it? Glad you asked. Let’s break the tools and resources we use into the different categories of the Geniuslink service: Infrastructure and click processing, Monitoring, and of course Development tools for the service.
The Geniuslink Service’s Resolution time – April 2011 – February 2016
Click Processing Infrastructure
This is the core of the Geniuslink service (and the longest section of this article) – making sure that each click is sent to the correct destination based on the rules that were created, that retargeting pixels are fired, and that the entire process is so quick that the person clicking the link still gets an awesome user experience.
Server infrastructure and hosting
We went with a “hybrid cloud” which means we have physical servers here in Seattle as well as a bunch of virtual servers across the world in “cloud” data centers. Here’s how our Geni.us (see what I did there?) CTO Jesse P set it up:
- The Seattle data centers store our core databases, message queue clusters, and post-click processing machines. These resources need to be online all the time but don’t have the same scale up/out requirements as our regional deployments. We calculated that purchasing these machines pays for itself in 3 – 6 months (depending on build) compared to the monthly fee for cloud services. Since we’re going to be around for a lot longer than that, it made the most sense cost-wise.
- The cloud-based data centers serve clicks across the globe (so not everything goes through a single location, which helps clicks resolve faster internationally) and handle data warehousing for offsite backups and archival. Cloud-based regional deployments let us quickly and cost-effectively change providers or regional deployment locations, scale capacity up or down, and adjust our global footprint as we please.
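The buy-versus-rent comparison for the Seattle machines boils down to a simple payback calculation. Here is a minimal sketch of it. The prices in the example are made up purely for illustration (they are not our real costs), and the function ignores things like power, colocation fees and staff time.

```python
def payback_months(purchase_cost: float, monthly_cloud_fee: float) -> float:
    """Months until buying a machine costs less than renting a cloud equivalent.

    Simplified on purpose: ignores power, colocation and maintenance costs.
    """
    if monthly_cloud_fee <= 0:
        raise ValueError("monthly_cloud_fee must be positive")
    return purchase_cost / monthly_cloud_fee


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to show the shape of the calculation.
    months = payback_months(purchase_cost=3000, monthly_cloud_fee=750)
    print(f"purchase pays for itself after {months:.1f} months")
```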
For the cloud-based pieces of our setup, we currently use regional data centers from the cloud providers Linode and DigitalOcean. We chose these guys after a bunch of research, and we’ve found their balance of performance, customer service, reliability and cost to be the best of the best. Which is exactly what we want for our clients.
Our infrastructure has been built to be provider agnostic (a fancy way of saying we don’t rely on any single service provider’s setup). Although it required a bit more upfront investment, building everything from bare metal and virtual machines gives us the flexibility to boot up a new box anywhere we need, instead of relying on wherever a single cloud provider offers their services. This gives us an advantage over other services stuck with a single provider – if that provider goes down for any reason, so does that service (and it shows in their uptime reports).
We also use a couple of tools for deploying new servers as the need arises – Puppet and Capistrano. Puppet allows us to configure specific types of virtual machine “images” so we can simply press a button and automatically set up a new VM with the exact settings and configurations that we want. Capistrano makes sure the software on the machines gets deployed in the same fashion every time, meaning once the VM is created, all software packages that we need will be installed quickly and seamlessly without any room for “human error.” So the machine is fully operational without the need for human interaction, making the process much quicker and easier for us.
Load-balancing is used to make sure a single server doesn’t get completely overloaded with requests. Think of it as a line at the grocery store – if there’s only one register open, it takes forever for everyone to get through. But if you have multiple registers open and there’s someone making sure each person is routed to the fastest line, your shopping gets finished much quicker. That someone would be a load-balancer.
Generally, load-balancing is done to spread out traffic to servers inside a single data center. We took this a step further by first sending traffic to the fastest available region, and then continuing with the normal load-balancing between the servers within that data center (what most services rely on). There are three levels of balancing we do:
- At the highest level, we use Dyn’s enterprise DNS based global traffic management tool (RTTM) to monitor our regional deployments, and route clicks to the fastest (not always the closest) available data center. This setup also allows us to easily add or remove entire data centers from serving clicks for things like adding a new region, or to safely stop traffic to one of our current regions for maintenance activities. So continuing the grocery store analogy – this would be a pair of guys that direct you to the fastest grocery store first. Why a pair? Because if one quits, the other jumps in and takes over without missing a beat.
- Next, within each regional data center, there’s a pair of HAProxy load balancers. This is the guy within the fastest grocery store pointing you to the fastest checkout line.
- Finally, we have backend Nginx servers that provide an extra level of filtering, security, flexibility and scaling over the top of our worker processes. This is like ensuring the person checking you out is the fastest checker possible.
With this setup, every click is routed to the fastest data center no matter where in the world the click comes from. It also means we can scale to higher throughputs (more clicks) and add new servers or regions as needed for much cheaper than running it ourselves. Why is this important? Because studies have shown that every extra second of link resolution time decreases conversions by 7% – and that’s just not something we can allow for our customers.
Here’s how fast we serve traffic in a few different regions of our stack. If we tried to serve traffic from a single data center, the total resolution time would increase by anywhere from half a second to several seconds, and that would be a bad user experience – no one wants to wait that long for a page to load!
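To make the grocery store analogy a bit more concrete, here is a rough sketch of the first two balancing tiers in plain Python. It is not our production code and does not use Dyn's or HAProxy's APIs. The class names, the latency numbers and the "fewest active connections" rule are all assumptions made for illustration, and the Nginx tier is left out.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Server:
    name: str
    active_connections: int = 0
    healthy: bool = True


@dataclass
class Region:
    name: str
    latency_ms: float  # measured response time, not geographic distance
    servers: List[Server] = field(default_factory=list)

    def is_available(self) -> bool:
        """A region can take traffic if at least one server is healthy."""
        return any(server.healthy for server in self.servers)


def pick_region(regions: List[Region]) -> Region:
    """Tier 1 (stand-in for DNS traffic management): fastest available region."""
    candidates = [region for region in regions if region.is_available()]
    if not candidates:
        raise RuntimeError("no available region")
    return min(candidates, key=lambda region: region.latency_ms)


def pick_server(region: Region) -> Server:
    """Tier 2 (stand-in for the in-region balancer): least busy healthy server."""
    candidates = [server for server in region.servers if server.healthy]
    return min(candidates, key=lambda server: server.active_connections)


def route_click(regions: List[Region]) -> Tuple[str, str]:
    """Route one click and return (region name, server name)."""
    region = pick_region(regions)
    server = pick_server(region)
    server.active_connections += 1
    return region.name, server.name


if __name__ == "__main__":
    example = [
        Region("us-west", 40.0, [Server("w1", 5), Server("w2", 2)]),
        Region("eu", 25.0, [Server("e1", healthy=False)]),  # fastest, but down
    ]
    print(route_click(example))
```

In the example, the "eu" region has the lowest latency but no healthy server, so the click falls through to "us-west" and lands on its least busy machine. That is the "fastest available, not always the closest" idea in miniature.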
You can think of a database as a massive warehouse filled with filing cabinets. In the warehouse, there’s a team of people writing down pieces of information and running around storing that info in the proper filing cabinet. Then, when that info is needed again, that team sprints around gathering the info again and passes it back to whomever asked for it.
For our warehouse, we primarily use MongoDB and have some additional info in MariaDB. Our MariaDB/MongoDB clusters (basically a series of identical warehouses, so if one catches on fire, the others are used instead) run on our own high-end hardware with RAID-backed SSD storage, which gives us better performance, reliability, consistency and cost than we were able to find with cloud providers. Also, instead of just a single storage location, we have multiple that are “mirrored” in each region to help with super-fast resolution times. Our stuff is also regularly backed up both onsite and offsite to Amazon S3 – which we use like a warehouse for storing warehouses.
Caching is the term used for storing information that is used often in “memory” so it’s super fast and easy to grab. So – think of the cache as a single filing cabinet in front of the data warehouse that has all of the stuff asked for most often. It doesn’t have every single detail that the warehouse has, but it can quickly find you common answers. In our cache we store things like configurations, shortlink definitions (what each link is and where it goes, including the Advanced Targets), and product information to keep stuff super fast.
Our caching is done through MongoDB, which provides us with a fast and reliable way to replicate our data to every region we support and keep a “warm copy” of that data (a filing cabinet in front of every warehouse we have). This means we can continue to serve clicks even if our database infrastructure were to go down for any reason, and serve clicks faster for links that get a ton of traffic. The reason we chose to use MongoDB for caching instead of a different tool like Redis was because of how well MongoDB replicates data to the other instances of itself. Redis does a fantastic job of quickly working with data, but we found that since it’s limited to memory, it became super expensive for us to try to have the same storage size that we currently have in our MongoDB caches.
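Here is a minimal sketch of the "filing cabinet in front of the warehouse" idea. It uses a plain Python dictionary as a stand-in for a regional warm copy and does not touch MongoDB at all. The class, its method names and the example short code are hypothetical.

```python
from typing import Callable, Dict


class RegionalWarmCopy:
    """Regional copy of link definitions that keeps working if the core is down."""

    def __init__(self, core_lookup: Callable[[str], str]) -> None:
        self._core_lookup = core_lookup   # stand-in for a query to the core database
        self._local: Dict[str, str] = {}  # stand-in for the regional replica

    def replicate(self, short_code: str, destination: str) -> None:
        """Apply an update pushed out from the core data set."""
        self._local[short_code] = destination

    def resolve(self, short_code: str) -> str:
        """Serve from the local copy first; only ask the core on a miss."""
        if short_code in self._local:
            return self._local[short_code]
        destination = self._core_lookup(short_code)  # may raise if the core is down
        self._local[short_code] = destination
        return destination


def core_is_down(short_code: str) -> str:
    """Simulates a core database outage."""
    raise ConnectionError("core database unavailable")


if __name__ == "__main__":
    cache = RegionalWarmCopy(core_is_down)
    cache.replicate("abc123", "https://example.com/product")
    # The click still resolves even though the core database is unreachable.
    print(cache.resolve("abc123"))
```

The point of the sketch is the ordering: clicks are answered from the regional copy, so a core database outage only affects links that were never replicated to that region.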
This whole setup of grocery stores and warehouses is how we’ve managed to keep our downtime so low and scale up so quickly. We’re pretty proud of it.
Next on the list is the “eyes and ears” of our service – monitoring. We take monitoring seriously. It not only alerts us to potential outages, but also allows us to proactively resolve problems before they turn into said outages – both good things. Here’s our bag o’ tools for keeping an eye on things:
- We have a bunch of alerts set up for different aspects of our service. So if something errors out, a server goes offline, or anything else that we consider a “not perfect situation,” we’ll know about it. All of our alerts funnel into PagerDuty, which handles on-call schedules and alert escalations, meaning we get calls, emails, texts and carrier pigeons 24/7/365 if anything goes down (but allows us to set alert filters so we don’t get woken up if it’s not an emergency).
- We monitor our service availability/performance as a whole with Pingdom, and make those checks available publicly so everyone else can keep us honest as well. This is what we use to check our uptime and make sure that all systems are running normally.
- Each regional data center is monitored with Pingdom as well as by Dyn, which provides data center failover not only for availability reasons, but also if another data center could serve a region’s requests faster. As explained above, Dyn is our grocery store traffic cop that sees the “best” place for a click to go.
- For each server and for our code within those servers, alerting is done through ScoutApp.com, which gives us alerts, historical reports and real-time monitoring/graphing. This means we can track number of clicks per minute, or see if something spikes/drops when it’s not supposed to. We also use their stats support to “push” metrics that we care about, like processing times per request to be more proactive in our monitoring/support of the service.
- Finally, we have server and application log monitoring, filtering, and searching through Papertrail. So when we run our billing, or fire off an email, Papertrail keeps a log of it for us to make sure it worked as it should for both auditing purposes and to find errors or warnings.
Example of Scout showing our load balancers shift traffic across some of our infrastructure. The thicker the line, the more traffic that specific server is handling. When one server is deemed slower, traffic is rerouted to another (faster) server instead.
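As an illustration of “pushing” a metric like processing time per request, here is a small self-contained sketch. It does not use Scout's API. The collector class, the metric name and the threshold are all hypothetical, and a real setup would ship the samples to the monitoring service instead of keeping them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import mean
from typing import DefaultDict, Iterator, List


class MetricsCollector:
    """Keeps timing samples in memory (a stand-in for a real metrics backend)."""

    def __init__(self) -> None:
        self._samples: DefaultDict[str, List[float]] = defaultdict(list)

    def record(self, name: str, value_ms: float) -> None:
        self._samples[name].append(value_ms)

    def count(self, name: str) -> int:
        return len(self._samples[name])

    def average(self, name: str) -> float:
        return mean(self._samples[name])

    @contextmanager
    def timed(self, name: str) -> Iterator[None]:
        """Time the wrapped block and record the duration in milliseconds."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.record(name, (time.perf_counter() - start) * 1000.0)


def should_alert(collector: MetricsCollector, name: str, threshold_ms: float) -> bool:
    """True when the average for a metric is above the threshold."""
    return collector.count(name) > 0 and collector.average(name) > threshold_ms


if __name__ == "__main__":
    metrics = MetricsCollector()
    with metrics.timed("click.processing_ms"):
        sum(range(10_000))  # placeholder for real request handling
    print(metrics.count("click.processing_ms"), "sample(s) recorded")
    print("alert?", should_alert(metrics, "click.processing_ms", threshold_ms=250.0))
```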
Just like every band has a bunch of musicians, and just like each musician has an instrument they use to make beautiful music, our team of engineers starts with C# running on the Mono runtime (a cross-platform, open source .NET framework implementation, for the technically inclined) as its instrument in our Orchestra of Awesome. Mono is essentially the bridge between our service (inputs from users and all of the click information) and our databases. We chose it because it’s free, it runs on Linux, and since we’re in Seattle, very good C# developers are everywhere when we need to scale up our team (thanks, Microsoft).
Mono is just the base of the bridge – we build our service on ServiceStack which powers all of the features we build in the dashboard. Why ServiceStack? They say it pretty dang well themselves. It’s fast, easy to use, and plays very nicely with all of the other pieces we like to build.
Our backend processing is done by MassTransit and RabbitMQ. These two take all requests that our service gets, translate them into understandable tasks, and pass them along to the next tool in the stack. They’re the equivalent of a person standing in front of a large auditorium full of people screaming their lunch orders, and successfully writing down each order so every person gets the scrumptious food they crave. We also use Redis here, even though we don’t use it for caching, to create things like our back end click reports.
Here’s an example of RabbitMQ doing its thing, showing how many messages (requests) it’s currently queueing up to process (top graph) and how many it’s working through per minute (bottom graph). Sorry – can’t show exact numbers here!
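If you have never worked with a message queue, the pattern looks roughly like the sketch below. It uses Python's standard `queue` and `threading` modules as a stand-in, not RabbitMQ or MassTransit, and the function name, worker count and message shape are assumptions made for illustration.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_STOP = object()  # sentinel telling a worker there is no more work


def process_messages(
    messages: Iterable[Any],
    handler: Callable[[Any], Any],
    worker_count: int = 2,
) -> List[Any]:
    """Queue up messages and let a pool of workers handle them."""
    pending: "queue.Queue[Any]" = queue.Queue()
    results: List[Any] = []
    results_lock = threading.Lock()

    def worker() -> None:
        while True:
            message = pending.get()
            try:
                if message is _STOP:
                    return
                try:
                    outcome = handler(message)
                except Exception as error:  # keep the worker alive on bad messages
                    outcome = error
                with results_lock:
                    results.append(outcome)
            finally:
                pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for message in messages:
        pending.put(message)
    for _ in threads:
        pending.put(_STOP)  # one stop signal per worker
    pending.join()
    for thread in threads:
        thread.join()
    return results


if __name__ == "__main__":
    orders = ["burger", "salad", "soup"]
    handled = process_messages(orders, lambda order: f"{order}: written down")
    print(sorted(handled))
```

Because the workers run in parallel, results can come back in any order, which is why the example sorts them before printing.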
Finally, we run all of our server systems on Linux – Ubuntu LTS. Because Ubuntu is easy to deploy, doesn’t have a licensing fee, and is incredibly stable for large scale deployments, it helps us quickly, easily and cheaply build up our infrastructure to meet our clients’ needs. Plus, we get to feel super geeky running Linux.
There it is – all of the different technologies we’ve built on top of each other to create an amazing experience for all Geniuslink clients and their customers (hence the term “stack”). We’re fortunate enough to have a team of obsessive engineers and a CTO who has dedicated his time (and lack of sleep) to ensuring our stack runs like a well-oiled machine, and we’ll continue to feed him coffee through his IV to make sure it stays that way.
One thing we didn’t touch on in this article is the dashboard side of the stack, which is all of the tools we use to make things pretty and interactive for our clients. If you’d be interested in reading about that (even after checking out this article), let us know and we’ll put together a post, just for you.
Want to get into the nitty gritty about a specific piece of our stack? Want to know more about how they all work together? Curious why we chose a specific tool or provider over others? Want to see what other cool stuff we’re looking at adding? Let us know! We love talking shop and are always happy to help. Thanks for reading!