External Garbage Collection
How To Clean Up Your Trash (Examples In Ruby)
A common pattern in SaaS apps is the need to create, and then later clean up, external resources. What do I mean by "external"? Basically, anything that isn't a row in your transactional database. In the normal course of running Census, we create and then later destroy lots of things:
Google Cloud Service Accounts for fine-grained access to customers' BigQuery data
AWS S3 objects storing temporary data during a Reverse ETL run
Snapshot tables in customers' data warehouses used to detect data changes
Auth0 connections so users can use their IdPs to sign in to Census
etc.
Some of these things have very long lifetimes (the lifetime of a customer's contract) while others have lifetimes measured in milliseconds, but they all share these properties:
They will eventually need to be removed.
Removal is expensive (takes more than a second in the p99 case).
Removal is not guaranteed to succeed - because we need to call an API or otherwise interact with an external system, we may need to retry removals multiple times before they take effect.
In this post I'll share a few of the patterns that we have previously used in Census (and in other applications our engineers have built) and provide some tips about how to implement them in your own apps.
A related issue (that we won't cover in this post) is how to perform cleanup of internal but costly resources (i.e. mass-deletion of items from your own database) - check out my colleague Nate's post on how we do this at Census, with a bonus gem to help implement it in your own apps.
The Right Tool for the Job
Some of these approaches are more expensive than others (in terms of development cycles), so I don't think there's a one-size-fits-all pattern. When deciding how much effort to put into cleanup, there are a few considerations:
Does the external resource contain sensitive data (PII, etc.)? If so, you probably don't want to keep it around for a moment longer than needed.
Does the external resource create costs (in dollars or complexity), and how bad are those costs? Keeping an S3 object around for a year is pennies for most objects, but if you have millions of them that will add up quickly.
Is there a limited-size resource pool that you risk exhausting? For example, Google Cloud allows up to 100 service accounts per project (ask me how I know).
Who "owns" the resource, you or your customer? This affects each of the criteria above - a table containing sensitive customer data is more critical to clean up if it's owned by you than if it's owned by your customer, whereas a cloud resource that costs non-trivial money might be okay to keep around if you're bearing the cost, but not okay if your customer is.
Choosing Your Approach
I'm fond of my cofounder Anton's meta-framework for questions like this; to every problem, there are three possible solutions:
Do nothing
Do something
Do the right thing
I'll outline a few different patterns for cleanup, all of which we've used at some point in Census, under these three categories.
Do Nothing
For cheap, non-sensitive resources, the right option (especially if you're just getting started) might be to kick the can down the road and deal with it later. You can get away with this for a surprisingly long time. It's okay to generate some "resource pollution" as long as you have a credible plan for dealing with it down the road. We used exactly this approach for Google Service Accounts for the first ~18 months of Census - we didn't have that many customers disconnecting BigQuery from Census, so when a disconnect happened and we left an account behind, it was no big deal - we were not close to running out of them, and those dangling accounts didn't cost any money or present a security risk (this might not be true for all IAM resources! your mileage may vary!).
The hardest part of this can be keeping track of your junk for later cleanup. Some kinds of external resources are easily enumerable, but it might not be easy to determine their "in-use" status. We actually made this mistake - we named our accounts using random GUIDs (e.g. service-account-c6450bcf-e2ab-4bbc-b165-75ccca2b50db) so when we finally did bite the bullet and start doing cleanup, we had to do an annoying cross-reference between the internal list of what was in use (stored in our Postgres) and the state of the world in Google Cloud. Other resources might not be easily enumerable (especially if they live in customer-controlled systems), so if you do elect the "do it later" approach, you at least might want to leave yourself a log table of what work needs to be done later.
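If you go this route, the log can be as simple as a table recording what you abandoned. Here's a minimal sketch - the PendingCleanup name and columns are hypothetical illustrations, not something Census ships:

class CreatePendingCleanups < ActiveRecord::Migration[7.0]
  def change
    create_table :pending_cleanups do |t|
      t.string :resource_type, null: false # e.g. "google_service_account"
      t.string :external_id, null: false   # the ID in the external system
      t.datetime :cleaned_up_at            # stays nil until cleanup succeeds
      t.timestamps
    end
    add_index :pending_cleanups, [:resource_type, :external_id], unique: true
  end
end

class PendingCleanup < ApplicationRecord; end

# Whenever you abandon an external resource, record the debt
# (service_account_email here is a stand-in for whatever ID you have):
PendingCleanup.create!(
  resource_type: "google_service_account",
  external_id: service_account_email
)

A table like this costs almost nothing to maintain and turns "do it later" from a hope into a worklist.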
Do Something
If you can't get away with doing nothing, there are increasing degrees of sophistication / complexity in cleanup strategies. All of the examples below are in pseudo-Ruby-on-Rails, using the earlier example of removing Auth0 SAML connections when they are no longer in use (i.e. when a customer disconnects their IdP).
Synchronous Cleanup
When it's time for a resource to go away, attempt to delete it inline:
class IdpConnectionController < ApplicationController
  def destroy
    @idp_connection = IdpConnection.find params[:id]
    # Make an API call to delete from Auth0
    Auth0.delete @idp_connection.auth0_connection_id
    @idp_connection.destroy
  end
end
This will work most of the time, until it doesn't. The perils of performing a third-party API call from your own web server are well-known:
API calls are slow (to be more precise, they have high tail latency), so the performance of your controller action will be out of your control.
These calls can fail, and because they don't exist within the cozy transactional confines of your own DB, they can leave your system in an indeterminate state (and ripple back to your customer).
These calls can fail silently (what if Auth0.delete returns a true/false bit instead of throwing on error? what if it returns a 200 but the destination service is ill-behaved?) and leave the system in a state where resources are leaked but nobody knows it.
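If you're stuck with a client that signals failure via a return value, one defensive move is to convert silent failures into loud ones. A sketch - this assumes a hypothetical boolean-returning Auth0.delete, which may not match the real client's behavior:

def delete_connection!(auth0_connection_id)
  success = Auth0.delete(auth0_connection_id)
  # Turn a silent false into a loud error so it gets logged (and retried)
  raise "Auth0 connection #{auth0_connection_id} was not deleted" unless success
end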
Asynchronous Cleanup
The next obvious thing to do is to defer cleanup to a background task - in Rails, you'd use ActiveJob:
class IdpConnectionController < ApplicationController
  def destroy
    @idp_connection = IdpConnection.find params[:id]
    # Make sure to pass in the ID, not @idp_connection,
    # because the @idp_connection will be gone by the
    # time the job runs!
    DeleteConnectionJob.perform_later(
      @idp_connection.auth0_connection_id
    )
    @idp_connection.destroy
  end
end
class DeleteConnectionJob < ApplicationJob
  def perform(auth0_connection_id)
    Auth0.delete auth0_connection_id
  end
end
This is a notable improvement over our first approach, and a great answer for many use cases. We've solved the performance issue by moving the work to a background queue where it won't block our HTTP request, and we've ensured that the object within our own database gets destroyed whether or not the Auth0 resource does. We still bear the risk that DeleteConnectionJob will fail, loudly or silently, but if it fails loudly, we'll at least have a log of it in our queuing system (Sidekiq or Resque or otherwise) and we can go manually deal with those failures. If you're using ActiveStorage (Rails' internal blob storage abstraction, usually backed by S3) you get this pattern for free - the framework gives you a purge_later method that enqueues a built-in ActiveJob implementation of the above pattern.
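For example, if a User model has an attached avatar (the attachment here is illustrative), deferring the blob deletion is a one-liner:

# Enqueues ActiveStorage::PurgeJob rather than deleting inline
user.avatar.purge_later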
If At First You Don't Succeed
A slightly evolved version of this is to add retries to our cleanup job; retry can be implemented within-job, or by using the job queueing system:
class DeleteConnectionJob < ApplicationJob
  # Use ActiveJob's queue-level retry system for unexpected errors
  retry_on RuntimeError

  def perform(auth0_connection_id)
    attempts = 0
    begin
      Auth0.delete auth0_connection_id
    rescue Auth0::RateLimitExceeded
      # In-job retry logic for a known, transient exception
      attempts += 1
      if attempts < 5
        sleep 2**attempts # exponential backoff
        retry
      end
      raise # out of attempts - surface the failure rather than swallow it
    end
  end
end
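Note that ActiveJob's retry_on can handle the backoff for you as well; the in-job loop above could be replaced with queue-level configuration like this (a sketch - tune the wait and attempts values to your API's rate limits):

class DeleteConnectionJob < ApplicationJob
  # Let the queue re-enqueue the job with increasing delays,
  # instead of sleeping inside a worker thread
  retry_on Auth0::RateLimitExceeded, wait: :exponentially_longer, attempts: 5

  def perform(auth0_connection_id)
    Auth0.delete auth0_connection_id
  end
end

This has the side benefit of not tying up a worker for the duration of the backoff sleep.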
Retries are great when failures are uncorrelated (random) - a temporary outage on the remote system, network issues, or other cosmic rays prevent your first attempt but allow the next one to succeed. However, in the real world failures are often correlated, meaning that if an operation fails the first time, it's much more likely to fail every time - a so-called "poison pill".
One class of poison pill errors is an easy one to solve - what if the resource has already been removed? We can evolve our job by subtly changing its responsibility from "delete this resource" to "ensure that this resource does not exist":
class EnsureConnectionRemovedJob < ApplicationJob
  # retry code omitted for brevity
  def perform(auth0_connection_id)
    if Auth0.exists? auth0_connection_id
      Auth0.delete auth0_connection_id
    end
  end
end
This is a small improvement, but it does remove a source of false positive errors and it's a stepping stone to our final, most evolved approach.
Do the Right Thing
Once you've reframed the problem from "delete this resource" to "the system should ensure there are no orphaned resources", we can make another, more powerful change: instead of ensuring the "no orphaned external resources" invariant for a single internal resource, try to ensure it across all internal resources. Instead of treating cleanup as an imperative action to take at a particular point in the workflow, turn it into garbage collection - a never-ending process that exists as its own top-level control loop. Here's an example:
class GarbageCollectAuth0Job < ApplicationJob
  def perform
    existing_connection_ids = Set.new(
      Auth0Client.all_connections.map(&:id)
    )
    connection_ids_in_use = Set.new(
      IdpConnection.pluck(:auth0_connection_id)
    )
    orphaned_connection_ids =
      existing_connection_ids - connection_ids_in_use
    orphaned_connection_ids.each { Auth0.delete(_1) }
  end
end
Unlike our previous jobs, GarbageCollectAuth0Job doesn't run in response to a particular input; instead, you'd use a scheduler to kick it off periodically (could be once a minute, could be once a week, depending on your needs). We've now moved to a fully GC-based system that can be monitored independently of the core application. Poison pills will still exist (and you'll need to monitor them and come up with some way to deal with them) but this approach is the most general and will cover you in the event of all kinds of issues (your post-destroy trigger didn't work for a while? no worries, the cleanup job will bail you out).
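How you schedule it depends on your stack; with the sidekiq-cron gem, for example, it might look like this (the 15-minute cadence is just an illustration):

# config/initializers/sidekiq_cron.rb
Sidekiq::Cron::Job.create(
  name: "Auth0 garbage collection",
  cron: "*/15 * * * *", # every 15 minutes
  class: "GarbageCollectAuth0Job"
)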
Best of Both Worlds
One thing we've lost in the above implementation is "quick cleanup" - if your GC period is long, you may have to wait a while before resources get removed, which may not always be acceptable to your product / business needs. The final form of our approach is a hybrid: a GC loop that can be "poked" to do a quick, low-latency cleanup. Presenting the Cadillac of cleanup jobs:
class IdpConnectionController < ApplicationController
  def destroy
    @idp_connection = IdpConnection.find params[:id]
    # Attempt an eager cleanup
    EnsureConnectionRemovedJob.perform_later(
      @idp_connection.auth0_connection_id
    )
    @idp_connection.destroy
  end
end

class GarbageCollectAuth0Job < ApplicationJob
  def perform
    existing_connection_ids = Set.new(
      Auth0Client.all_connections.map(&:id)
    )
    connection_ids_in_use = Set.new(
      IdpConnection.pluck(:auth0_connection_id)
    )
    orphaned_connection_ids =
      existing_connection_ids - connection_ids_in_use
    orphaned_connection_ids.each do
      EnsureConnectionRemovedJob.perform_later(_1)
    end
  end
end

# Enqueued by both the controller *and* the GC job
class EnsureConnectionRemovedJob < ApplicationJob
  def perform(auth0_connection_id)
    if Auth0.exists? auth0_connection_id
      Auth0.delete auth0_connection_id
    end
  end
end
This also brings the benefit of parallelism for expensive cleanups - you can have as many simultaneous cleanups as you have worker pool capacity. In fact, this two-tier job pattern is something we use extensively at Census - a high-level job that makes a list of work to do and fans out to low-level jobs that do the work.
Other Tricks & Next Steps
It's a relatively small step from here to generalize this even further, to a generic orphaned-resource cleaner that can work on any class of objects. While the full code is too long for a blog post, you can imagine a GeneratesGarbage interface that objects can implement in order to tell a GC framework (sketched after this list):
How to enumerate external resources
How to determine which of those resources are in use
How to remove those resources (possibly in bulk)
Special timing considerations - either not to remove resources too quickly (to allow archival or undo) or not to remove them too slowly (in order to meet retention SLAs)
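A rough shape for that interface might look like this - purely illustrative, with invented method names rather than the actual Census internals:

module GeneratesGarbage
  # Return the IDs of all resources that exist in the external system
  def enumerate_external_resources
    raise NotImplementedError
  end

  # Return the subset of IDs that are still referenced internally
  def resources_in_use
    raise NotImplementedError
  end

  # Remove the given orphaned resources (possibly in bulk)
  def remove_resources(orphaned_ids)
    raise NotImplementedError
  end

  # Minimum age before a resource is eligible for collection (undo window)
  # and maximum age it may linger (retention SLA)
  def collection_window
    { not_before: 1.hour, not_after: 30.days }
  end
end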
Given these parameters, the framework handles scheduling, enumeration, load balancing, fan-out, and monitoring of cleanups in a cross-cutting way. We've implemented a version of this internally at Census (it doesn't yet cover all our resources - we're working on it!) and it makes it much easier for product teams to declare how their resources should be treated, without having to reinvent the cleanup wheel each time.
One final trick that can be useful (in combination with one or more of the patterns above) is to make use of external cleanup when it exists. For example, while we use a version of the above to clean up temporary objects that Census stores in S3, that data is so sensitive that we also make use of S3's own lifecycle configuration rules to act as a "cleaner of last resort" in case Census's own cleanup system has an extended outage.
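With the aws-sdk-s3 gem, a last-resort expiration rule like that takes only a few lines to set up (the bucket name and prefix are placeholders, and the 7-day window is just an example):

require "aws-sdk-s3"

Aws::S3::Client.new.put_bucket_lifecycle_configuration(
  bucket: "my-temp-data-bucket",
  lifecycle_configuration: {
    rules: [{
      id: "expire-temp-objects",
      status: "Enabled",
      filter: { prefix: "tmp/" },
      # Delete anything our own GC missed after 7 days
      expiration: { days: 7 }
    }]
  }
)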
Have you seen other patterns for external resource cleanup? Gems or other (non-Rails) framework approaches? Let us know here or at @CensusDev on Twitter.