Throughput impedance (mis)matching

Filed under golang, timehop

This post was originally written for the Building Timehop blog.

The importance of the morning push to the Timehop experience was briefly covered in another article.

In this post we'll focus on the architecture of our push notification subsystem: we'll cover some of the decisions we made, talk about our open source Go library for sending pushes through Apple's Push Notification service, and wrap up with an anecdote about how we ended up DDoS'ing ourselves with what we built.

Sunsetting the Rails Monolith

Over the past year at Timehop, we broke our big monolithic Rails app into a service-based architecture, written almost entirely in Go.

The Rails Monolith, an ultra-realistic portrait of our old stack.

Breaking a big system down into smaller parts makes it far more modular and, when done right, more available and fault-tolerant.

We ended up with more services but fewer single points of failure. There are now more potential points of failure but none of them can — or should — cause a complete halt.

One of the side effects of dividing to conquer is that communication becomes explicit. Functionality and error handling are now spread across multiple processes, over the network — which also makes them less reliable.

Impedance matching: buffering and throttling

At Timehop, we put a lot of effort into making sure that all communication between our systems is correctly buffered and/or throttled, so as to avoid internally DDoS'ing ourselves.

Whenever a system can offload some work for deferred processing by other systems, we use message queues as buffers. As those queues often grow to hold millions of records in a short amount of time, we keep a close eye on them through an extensive set of alarms.

Buffering — A fires and forgets, B will handle work at its own pace.
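To make the fire-and-forget pattern concrete, here's a minimal, hypothetical producer. It assumes a Redis list is used as the buffer queue; the queue technology, key name, and payload are purely illustrative, not a description of our actual setup:

package main

import (
  "log"

  "github.com/gomodule/redigo/redis"
)

func main() {
  // A connects to the queue, drops the work item, and moves on.
  conn, err := redis.Dial("tcp", "localhost:6379")
  if err != nil {
    log.Fatal(err)
  }
  defer conn.Close()

  task := `{"user_id": 42, "content": "Hello from the past!"}`

  // Fire and forget: B pops from the same list and processes the task
  // at its own pace, so a slow consumer never blocks the producer.
  if _, err := conn.Do("LPUSH", "queue:pushes", task); err != nil {
    log.Fatal(err)
  }
}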

Whenever a system needs a real-time response from another (e.g. an HTTP call or some other form of RPC), we use aggressive timeouts on the requesting side and throttling on the serving side. It's all designed to fail fast; the requester won't wait longer than a few seconds for a response and the server will immediately return an error if too many requests are already being served.

Throttling — B will rather discard requests than become overloaded and adversely affect median speed; A will cancel requests that aren't handled fast enough.

We would rather fail fast and keep median response times low, even if it comes at a small cost in the percentage of successful requests served.
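As a rough sketch of both sides of that contract (the timeout, the concurrency cap, and the endpoint below are made-up numbers and names, not our production values):

package main

import (
  "net/http"
  "time"
)

// Requesting side: give up after a few seconds instead of waiting around.
var client = &http.Client{Timeout: 3 * time.Second}

func callServiceB() error {
  resp, err := client.Get("http://b.internal/work") // hypothetical internal service
  if err != nil {
    return err // timeouts surface here: fail fast and move on
  }
  defer resp.Body.Close()
  return nil
}

// Serving side: a semaphore caps in-flight requests; anything beyond that is
// shed immediately instead of queueing up and dragging the median down.
var inFlight = make(chan struct{}, 100)

func throttled(next http.HandlerFunc) http.HandlerFunc {
  return func(w http.ResponseWriter, r *http.Request) {
    select {
    case inFlight <- struct{}{}:
      defer func() { <-inFlight }()
      next(w, r)
    default:
      http.Error(w, "over capacity", http.StatusServiceUnavailable)
    }
  }
}

func main() {
  http.HandleFunc("/work", throttled(func(w http.ResponseWriter, r *http.Request) {
    w.Write([]byte("ok"))
  }))
  http.ListenAndServe(":8080", nil)
}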

The push notification subsystem

We call it salt-n-pepa. I would have personally gone with static-x but Benny got dibs on it.

Whenever we need to send out a push notification to a Timehopper, we load up all her device tokens (one per phone) and then hit Google's GCM or Apple's APNS.

If you've never dealt with push notification systems, a device token is what Apple and Google use to uniquely identify your device(s) so that we can send you notifications.

With our monolithic system, we kept all these tokens in a PostgreSQL database, which was hidden behind the niceties of Rails' ActiveRecord. Grabbing the Apple device tokens for a user was as easy as calling a method on a User object — user.valid_apns_tokens.

As the need arose to perform the same tasks from multiple parts of our shiny new (but incredibly lean and minimalist) Go stack, multiple problems became apparent:

To solve those problems, we created a black-box service, salt-n-pepa, with message queues as its entry points. Messages (or tasks) in these queues are JSON documents whose most notable fields are a target user ID, some content and, optionally, a delivery time (so it supports both immediate and scheduled delivery).
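In Go terms, such a task might look roughly like this; the field names and types are illustrative, and the real schema carries more than this:

import "time"

// pushTask is a sketch of the JSON task documents described above.
type pushTask struct {
  UserID  int64  `json:"user_id"`
  Content string `json:"content"`

  // When absent, the push is sent immediately; when set, the task waits
  // in a scheduled set until the delivery time comes around.
  DeliverAt *time.Time `json:"deliver_at,omitempty"`
}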

The moving parts

Internally, the push system has multiple components, each with a single, very well defined responsibility.

Here's a high-level diagram of how the main flow looks:

salt-n-pepa architecture — How a push gets sent.

Seems like there's a lot going on — well, there is — but it essentially breaks down to four main components and their respective buffering queues and sets:

Aside from these, there are also a few other components related to token registration and invalidation:

salt-n-pepa architecture — Token registration and invalidation.

With this system we send, on average, 25 million push notifications every day.

Timehop's Go APNS library

One of the hardest parts of this whole system was writing the actual code that talks to APNS to send the pushes.

Whereas with GCM you perform an HTTP request and immediately know the result, Apple took a less common approach in which you have to open a TLS connection and speak their binary protocol. You write bytes to a socket instead of HTTP POSTs to a web server. And to gather feedback on which tokens have become invalid, you have to open up a separate connection to receive that information.
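To give an idea of what "writing bytes to a socket" means in practice, here's a stripped-down sketch of the binary framing APNS used at the time (the "command 2" format). It leaves out error handling, connection management, and the feedback connection, and only hints at the certificate setup:

package main

import (
  "bytes"
  "crypto/tls"
  "encoding/binary"
)

// frame builds a single "send notification" frame: a command byte, the frame
// length, then a list of items (token, payload, identifier, expiry, priority),
// each prefixed with its item ID and length.
func frame(token [32]byte, payload []byte, id, expiry uint32, priority byte) []byte {
  var items bytes.Buffer
  writeItem := func(itemID byte, data []byte) {
    items.WriteByte(itemID)
    binary.Write(&items, binary.BigEndian, uint16(len(data)))
    items.Write(data)
  }

  writeItem(1, token[:]) // 1: device token
  writeItem(2, payload)  // 2: JSON payload

  buf := make([]byte, 4)
  binary.BigEndian.PutUint32(buf, id)
  writeItem(3, buf) // 3: notification identifier
  binary.BigEndian.PutUint32(buf, expiry)
  writeItem(4, buf)              // 4: expiration date
  writeItem(5, []byte{priority}) // 5: priority (10 = deliver immediately)

  var out bytes.Buffer
  out.WriteByte(2) // command: send notification
  binary.Write(&out, binary.BigEndian, uint32(items.Len()))
  out.Write(items.Bytes())
  return out.Bytes()
}

func main() {
  conn, err := tls.Dial("tcp", "gateway.push.apple.com:2195", &tls.Config{
    // Certificates: your APNS client certificate would go here.
  })
  if err != nil {
    panic(err)
  }
  defer conn.Close()

  var token [32]byte // the device token for the target phone
  payload := []byte(`{"aps":{"alert":"Hello from the past!"}}`)
  conn.Write(frame(token, payload, 1, 0, 10))
}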

As we looked for good libraries, we realized the landscape was grim, so we decided to roll our own, which features:

So head over to the GitHub project page and give it a spin!

How we DDoS'ed ourselves with pushes

Every day, the system that prepares your next Timehop day (briefly discussed in this other article) enqueues about 15 million push notifications to be sent shortly before 9am in your local timezone. This scheduling is randomized within a 30-minute window so that, for every timezone, we get an evenly distributed traffic pattern rather than a massive influx of traffic when everyone opens the app at the exact same time.
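A minimal sketch of that randomization, assuming a window that ends at 9am; the helper name and the exact placement of the window are assumptions for illustration:

package main

import (
  "fmt"
  "math/rand"
  "time"
)

// scheduledFor picks a random instant inside the 30-minute window leading up
// to 9am in the user's local timezone, spreading the morning load evenly.
func scheduledFor(day time.Time, loc *time.Location) time.Time {
  nineAM := time.Date(day.Year(), day.Month(), day.Day(), 9, 0, 0, 0, loc)
  jitter := time.Duration(rand.Int63n(int64(30 * time.Minute)))
  return nineAM.Add(-jitter)
}

func main() {
  loc, _ := time.LoadLocation("America/New_York")
  fmt.Println(scheduledFor(time.Now(), loc))
}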

All of this happens well in advance of the actual push being sent, so we end up queueing plenty of messages, which the Deschedulers then move to the appropriate queues for immediate delivery when the time comes. It's normal to have a few million entries scheduled for later delivery.

The actual sending on the APNS side is pretty fast. It takes about 2ms to run a full cycle: pop a notification from the queue and send it to Apple's push servers. Rinse and repeat.

We run a single process, on a single machine, with 50 workers (each in its own goroutine). Back-of-the-envelope, at about 2ms per cycle those 50 workers give us a ceiling on the order of 25,000 pushes per second, so the queue never backs up, no matter what we throw at it.

A normal APNS push queue peak.

It's one of those things that has been so reliable for so long that you kind of forget about it when there are other fires to put out. So reliable and fast, in fact, that we never got around to putting alarms in place for when its queue starts backing up.

And then it got fun.

What goes around, comes around

We never really put thought into limiting the outbound rate of our pushes — as long as Apple could handle it, we'd hammer them.

What we naively overlooked was the fact that pretty much every push we send causes an indirect hit on our client-facing API, as the users open the app.

The morning push — Nobody can resist opening the app after one of these.

The higher the volume of immediate pushes sent, the higher the potential volume of hits on our API.

A week ago, due to a certificate problem, the 50 workers running on the APNS Pusher slowly started to die. We didn't really notice anything because, even with just a couple of workers left, we were still keeping up with the rate at which pushes were being generated.

Then, the last worker died. No more APNS pushes were sent.

While we did not have an alarm in place, the unusually low morning traffic on our dashboards was not a good sign; nor was the fact that we didn't get our own morning pushes.

As we investigated and reached the natural conclusion that the APNS Pusher was dead (by that point, the queue held over 6 million pushes and was still growing), we restarted it.

An abnormal APNS push queue peak.

Within 30 minutes, our client-facing API error rates went up by 60% and our inbound traffic nearly tripled. When we looked at the push queue, it was empty: over 6 million pushes sent in under 40 minutes. Most of those reached people who actually opened Timehop and hit our servers.

Temporary havoc, visible in the huge spike in the middle of the charts.

An incredibly simple rate limiter

All it took for this to never happen again were a few lines of code. The algorithm is pretty simple: before each cycle, every worker atomically increments a shared counter and only proceeds if the counter is still under the limit; a separate goroutine resets the counter every minute, letting pushes flow again.

Kinda like pushing the button in LOST.

Here’s what it looks like:

import "sync/atomic"

func NewLimiter(limit int64) *Limiter {
  return &Limiter{limit: limit}
}

type Limiter struct {
  limit   int64
  counter int64
}

// Atomically increments the underlying counter
// and returns whether the new value of counter
// is under the limit, i.e. whether the caller should
// proceed or abort.
func (s *Limiter) Increment() bool {
  return atomic.AddInt64(&s.counter, 1) <= s.limit
}

// Atomically resets the value of the counter to 0.
func (s *Limiter) Clear() {
  atomic.StoreInt64(&s.counter, 0)
}

The Limiter is then shared across all the workers (goroutines), and whenever they're about to begin a new cycle, they simply test whether they can proceed:

func (s *apnsWorker) workCycle() bool {
  if !s.limiter.Increment() {
    return false
  }
  // ...
}

Lastly, another goroutine calls Clear() on this shared Limiter every minute, which allows the workers to begin sending pushes again.
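One possible way to wire that up, assuming the shared Limiter from above (the actual wiring in our pusher may differ):

import "time"

// startLimiterReset clears the shared counter once a minute, so workers that
// backed off resume sending on their next cycle.
func startLimiterReset(limiter *Limiter) {
  go func() {
    for range time.Tick(time.Minute) {
      limiter.Clear()
    }
  }()
}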

A final note

When going distributed you'll invariably run into throughput impedance mismatches. Make sure you dedicate some time to understand how every part of your system will affect the next and how you can use different techniques, such as the ones we talked about in this article, to help mitigate the effects.

Oh, and always keep an eye out for how outbound traffic can get back at you so you don't end up nuking yourself like we did! 😬