External Garbage Collection
How To Clean Up Your Trash (Examples In Ruby)
A common pattern in SaaS apps is the need to create, and then later clean up, external resources. What do I mean by "external"? Basically, anything that isn't a row in your transactional database. In the normal course of running Census, we create and then later destroy lots of things:
Google Cloud Service Accounts for fine-grained access to customers' BigQuery data
AWS S3 objects storing temporary data during a Reverse ETL run
Snapshot tables in customers' data warehouses used to detect data changes
Auth0 connections so users can use their IdPs to sign in to Census
etc.
Some of these things have very long lifetimes (the lifetime of a customer's contract) while others have lifetimes measured in milliseconds, but they all share these properties:
They will eventually need to be removed.
Removal is expensive (takes more than a second in the p99 case).
Removal is not guaranteed to succeed - because we need to call an API or otherwise interact with an external system, we may need to retry removals multiple times before they take effect.
In this post I'll share a few of the patterns that we have previously used in Census (and in other applications our engineers have built) and provide some tips about how to implement them in your own apps.
A related issue (that we won't cover in this post) is how to perform cleanup of internal but costly resources (i.e. mass-deletion of items from your own database) - check out my colleague Nate's post on how we do this at Census, with a bonus gem to help implement it in your own apps.
The Right Tool for the Job
Some of these approaches are more expensive than others (in terms of development cycles), so I don't think there's a one-size-fits-all pattern. When deciding how much effort to put into cleanup, there are a few considerations:
Does the external resource contain sensitive data (PII, etc.)? If so, you probably don't want to keep it around for a moment longer than needed.
Does the external resource create costs (in dollars or complexity), and how bad are those costs? Keeping an S3 object around for a year is pennies for most objects, but if you have millions of them that will add up quickly.
Is there a limited-size resource pool that you risk exhausting? For example, Google Cloud allows up to 100 service accounts per project (ask me how I know).
Who "owns" the resource, you or your customer? This affects each of the criteria above - a table containing sensitive customer data is more critical to clean up if it's owned by you than if it's owned by your customer, whereas a cloud resource that costs non-trivial money might be okay to keep around if you're bearing the cost, but not okay if your customer is.
Choosing Your Approach
I'm fond of my cofounder Anton's meta-framework for questions like this; to every problem, there are three possible solutions:
Do nothing
Do something
Do the right thing
I'll outline a few different patterns for cleanup, all of which we've used at some point in Census, under these three categories.
Do Nothing
For cheap, non-sensitive resources, the right option (especially if you're just getting started) might be to kick the can down the road and deal with it later. You can get away with this for a surprisingly long time. It's okay to generate some "resource pollution" as long as you have a credible plan for dealing with it down the road. We used exactly this approach for Google Service Accounts for the first ~18 months of Census - we didn't have that many customers disconnecting BigQuery from Census, so when a disconnect happened and we left an account behind, it was no big deal - we were not close to running out of them, and those dangling accounts didn't cost any money or present a security risk (this might not be true for all IAM resources! your mileage may vary!).
The hardest part of this can be keeping track of your junk for later cleanup. Some kinds of external resources are easily enumerable, but it might not be easy to determine their "in-use" status. We actually made this mistake - we named our accounts using random GUIDs (e.g. service-account-c6450bcf-e2ab-4bbc-b165-75ccca2b50db) so when we finally did bite the bullet and start doing cleanup, we had to do an annoying cross-reference between the internal list of what was in use (stored in our Postgres) and the state of the world in Google Cloud. Other resources might not be easily enumerable (especially if they live in customer-controlled systems), so if you do elect the "do it later" approach, you at least might want to leave yourself a log table of what work needs to be done later.
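If you go this route, the log can be as simple as a table recording what you abandoned. Here's a minimal sketch - the PendingCleanup name and columns are hypothetical illustrations, not something Census ships:

class CreatePendingCleanups < ActiveRecord::Migration[7.0]
  def change
    create_table :pending_cleanups do |t|
      t.string :resource_type, null: false # e.g. "google_service_account"
      t.string :external_id, null: false   # the ID in the external system
      t.datetime :cleaned_up_at            # stays nil until cleanup succeeds
      t.timestamps
    end
    add_index :pending_cleanups, [:resource_type, :external_id], unique: true
  end
end

class PendingCleanup < ApplicationRecord; end

# Whenever you abandon an external resource, record the debt
# (service_account_email here is a stand-in for whatever ID you have):
PendingCleanup.create!(
  resource_type: "google_service_account",
  external_id: service_account_email
)

A table like this costs almost nothing to maintain and turns "do it later" from a hope into a worklist.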
Do Something
If you can't get away with doing nothing, there are increasing degrees of sophistication / complexity in cleanup strategies. All of the examples below are in pseudo-Ruby-on-Rails, using the earlier example of removing Auth0 SAML connections when they are no longer in use (i.e. when a customer disconnects their IdP).
Synchronous Cleanup
When it's time for a resource to go away, attempt to delete it inline:
class IdpConnectionController < ApplicationController
  def destroy
    @idp_connection = IdpConnection.find params[:id]
    # Make an API call to delete from Auth0
    Auth0.delete @idp_connection.auth0_connection_id
    @idp_connection.destroy
  end
end
This will work most of the time, until it doesn't. The perils of performing a third-party API call from your own web server are well-known:
API calls are slow (to be more precise, they have high tail latency), so the performance of your controller action will be out of your control.
These calls can fail, and because they don't exist within the cozy transactional confines of your own DB, they can leave your system in an indeterminate state (and ripple back to your customer).
These calls can fail silently (what if Auth0.delete returns a true/false bit instead of throwing on error? what if it returns a 200 but the destination service is ill-behaved?) and leave the system in a state where resources are leaked but nobody knows it.
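If you're stuck with a client that signals failure via a return value, one defensive move is to convert silent failures into loud ones. A sketch - this assumes a hypothetical boolean-returning Auth0.delete, which may not match the real client's behavior:

def delete_connection!(auth0_connection_id)
  success = Auth0.delete(auth0_connection_id)
  # Turn a silent false into a loud error so it gets logged (and retried)
  raise "Auth0 connection #{auth0_connection_id} was not deleted" unless success
end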
Asynchronous Cleanup
The next obvious thing to do is to defer cleanup to a background task - in Rails, you'd use ActiveJob:
class IdpConnectionController < ApplicationController
  def destroy
    @idp_connection = IdpConnection.find params[:id]
    # Make sure to pass in the ID, not @idp_connection,
    # because the @idp_connection will be gone by the
    # time the job runs!
    DeleteConnectionJob.perform_later(
      @idp_connection.auth0_connection_id
    )
    @idp_connection.destroy
  end
end
class DeleteConnectionJob < ApplicationJob
  def perform(auth0_connection_id)
    Auth0.delete auth0_connection_id
  end
end
This is a notable improvement over our first approach, and a great answer for many use cases. We've solved the performance issue by moving the work to a background queue where it won't block our HTTP request, and we've ensured that the object within our own database gets destroyed whether or not the Auth0 resource does. We still bear the risk that DeleteConnectionJob will fail, loudly or silently, but if it fails loudly, we'll at least have a log of it in our queuing system (Sidekiq or Resque or otherwise) and we can go manually deal with those failures. If you're using ActiveStorage (Rails' internal blob storage abstraction, usually backed by S3) you get this pattern for free - the framework gives you a purge_later method that enqueues a built-in ActiveJob implementation of the above pattern.
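For example, if a User model has an attached avatar (the attachment here is illustrative), deferring the blob deletion is a one-liner:

# Enqueues ActiveStorage::PurgeJob rather than deleting inline
user.avatar.purge_later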
If At First You Don't Succeed
A slightly evolved version of this is to add retries to our cleanup job; retry can be implemented within-job, or by using the job queueing system:
class DeleteConnectionJob < ApplicationJob
  # Use ActiveJob's queue-level retry system for unexpected errors
  retry_on RuntimeError

  def perform(auth0_connection_id)
    attempts = 0
    begin
      Auth0.delete auth0_connection_id
    rescue Auth0::RateLimitExceeded
      # In-job retry logic for a known, transient exception
      attempts += 1
      if attempts < 5
        sleep 2**attempts # exponential backoff
        retry
      end
      raise # out of attempts - surface the failure rather than swallow it
    end
  end
end
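Note that ActiveJob's retry_on can handle the backoff for you as well; the in-job loop above could be replaced with queue-level configuration like this (a sketch - tune the wait and attempts values to your API's rate limits):

class DeleteConnectionJob < ApplicationJob
  # Let the queue re-enqueue the job with increasing delays,
  # instead of sleeping inside a worker thread
  retry_on Auth0::RateLimitExceeded, wait: :exponentially_longer, attempts: 5

  def perform(auth0_connection_id)
    Auth0.delete auth0_connection_id
  end
end

This has the side benefit of not tying up a worker for the duration of the backoff sleep.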
Retries are great when failures are uncorrelated (random) - a temporary outage on the remote system, network issues, or other cosmic rays prevent your first attempt but allow the next one to succeed. However, in the real world failures are often correlated, meaning that if an operation fails the first time, it's much more likely to fail every time - a so-called "poison pill".
One class of poison pill errors is an easy one to solve - what if the resource has already been removed? We can evolve our job by subtly changing its responsibility from "delete this resource" to "ensure that this resource does not exist":
class EnsureConnectionRemovedJob < ApplicationJob
  # retry code omitted for brevity
  def perform(auth0_connection_id)
    if Auth0.exists? auth0_connection_id
      Auth0.delete auth0_connection_id
    end
  end
end
This is a small improvement, but it does remove a source of false positive errors and it's a stepping stone to our final, most evolved approach.
Do the Right Thing
Once you've reframed the problem from "delete this resource" to "the system should ensure there are no orphaned resources", we can make another, more powerful change: instead of ensuring the "no orphaned external resources" invariant for a single internal resource, try to ensure it across all internal resources. Instead of treating cleanup as an imperative action to take at a particular point in the workflow, turn it into garbage collection - a never-ending process that exists as its own top-level control loop. Here's an example:
class GarbageCollectAuth0Job < ApplicationJob
  def perform
    existing_connection_ids = Set.new(
      Auth0Client.all_connections.map(&:id)
    )
    connection_ids_in_use = Set.new(
      IdpConnection.pluck(:auth0_connection_id)
    )
    orphaned_connection_ids =
      existing_connection_ids - connection_ids_in_use
    orphaned_connection_ids.each { Auth0.delete(_1) }
  end
end
Unlike our previous jobs, GarbageCollectAuth0Job doesn't run in response to a particular input; instead, you'd use a scheduler to kick it off periodically (could be once a minute, could be once a week, depending on your needs). We've now moved to a fully GC-based system that can be monitored independently of the core application. Poison pills will still exist (and you'll need to monitor them and come up with some way to deal with them) but this approach is the most general and will cover you in the event of all kinds of issues (your post-destroy trigger didn't work for a while? no worries, the cleanup job will bail you out).
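How you schedule it depends on your stack; with the sidekiq-cron gem, for example, it might look like this (the 15-minute cadence is just an illustration):

# config/initializers/sidekiq_cron.rb
Sidekiq::Cron::Job.create(
  name: "Auth0 garbage collection",
  cron: "*/15 * * * *", # every 15 minutes
  class: "GarbageCollectAuth0Job"
)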
Best of Both Worlds
One thing we've lost in the above implementation is "quick cleanup" - if your GC period is long, you may have to wait a while before resources get removed, which may not always be acceptable to your product / business needs. The final form of our approach is a hybrid: a GC loop that can be "poked" to do a quick, low-latency cleanup. Presenting the Cadillac of cleanup jobs:
class IdpConnectionController < ApplicationController
  def destroy
    @idp_connection = IdpConnection.find params[:id]
    # Attempt an eager cleanup
    EnsureConnectionRemovedJob.perform_later(
      @idp_connection.auth0_connection_id
    )
    @idp_connection.destroy
  end
end

class GarbageCollectAuth0Job < ApplicationJob
  def perform
    existing_connection_ids = Set.new(
      Auth0Client.all_connections.map(&:id)
    )
    connection_ids_in_use = Set.new(
      IdpConnection.pluck(:auth0_connection_id)
    )
    orphaned_connection_ids =
      existing_connection_ids - connection_ids_in_use
    orphaned_connection_ids.each do
      EnsureConnectionRemovedJob.perform_later(_1)
    end
  end
end

# Enqueued by both the controller *and* the GC job
class EnsureConnectionRemovedJob < ApplicationJob
  def perform(auth0_connection_id)
    if Auth0.exists? auth0_connection_id
      Auth0.delete auth0_connection_id
    end
  end
end
This also brings the benefit of parallelism for expensive cleanups - you can have as many simultaneous cleanups as you have worker pool capacity. In fact, this two-tier job pattern is something we use extensively at Census - a high-level job that makes a list of work to do and fans out to low-level jobs that do the work.
Other Tricks & Next Steps
It's a relatively small step from here to generalize this even further, to a generic orphaned-resource cleaner that can work on any class of objects. While the full code is too long for a blog post, you can imagine a GeneratesGarbage interface that objects can implement in order to tell a GC framework (sketched after this list):
How to enumerate external resources
How to determine which of those resources are in use
How to remove those resources (possibly in bulk)
Special timing considerations - either not to remove resources too quickly (to allow archival or undo) or not to remove them too slowly (in order to meet retention SLAs)
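A rough shape for that interface might look like this - purely illustrative, with invented method names rather than the actual Census internals:

module GeneratesGarbage
  # Return the IDs of all resources that exist in the external system
  def enumerate_external_resources
    raise NotImplementedError
  end

  # Return the subset of IDs that are still referenced internally
  def resources_in_use
    raise NotImplementedError
  end

  # Remove the given orphaned resources (possibly in bulk)
  def remove_resources(orphaned_ids)
    raise NotImplementedError
  end

  # Minimum age before a resource is eligible for collection (undo window)
  # and maximum age it may linger (retention SLA)
  def collection_window
    { not_before: 1.hour, not_after: 30.days }
  end
end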
Given these parameters, the framework handles scheduling, enumeration, load balancing, fan-out, and monitoring of cleanups in a cross-cutting way. We've implemented a version of this internally at Census (it doesn't yet cover all our resources - we're working on it!) and it makes it much easier for product teams to declare how their resources should be treated, without having to reinvent the cleanup wheel each time.
One final trick that can be useful (in combination with one or more of the patterns above) is to make use of external cleanup when it exists. For example, while we use a version of the above to clean up temporary objects that Census stores in S3, that data is so sensitive that we also make use of S3's own lifecycle configuration rules to act as a "cleaner of last resort" in case Census's own cleanup system has an extended outage.
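With the aws-sdk-s3 gem, a last-resort expiration rule like that takes only a few lines to set up (the bucket name and prefix are placeholders, and the 7-day window is just an example):

require "aws-sdk-s3"

Aws::S3::Client.new.put_bucket_lifecycle_configuration(
  bucket: "my-temp-data-bucket",
  lifecycle_configuration: {
    rules: [{
      id: "expire-temp-objects",
      status: "Enabled",
      filter: { prefix: "tmp/" },
      # Delete anything our own GC missed after 7 days
      expiration: { days: 7 }
    }]
  }
)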
Have you seen other patterns for external resource cleanup? Gems or other (non-Rails) framework approaches? Let us know here or at @CensusDev on Twitter.