Time-Series Data With Ruby on Rails and Postgres
Counting things can be tough. It’s a metric you need to show your users, but if you’re not careful at best you end up creating a bottleneck in your app. At worst, you bring the whole site down as every available resource is waiting for metrics to aggregate.
There’s time series databases and other tools at your disposal for this. But those take effort to add to your infrastructure. And your data requirements may not be that large to begin with.
What if you just need some simple graphs of counts, but don’t want to kill your app? Here’s how we do counting without that stress.
A key function in Census is building Segments. Segments are slices of datasets, like “All users who made a purchase in the last 30 days”. They are often synced to ad tools like Facebook or Google Ads.
When defining a Segment it’s helpful to see an estimate of the size of the segment and how it changes over time. We also want to give users flexibility to query various granularities like average Segment size per day over the last month, or average Segment size per week over the last year.
But it’s slow to compute this information if there are a ton of size datapoints, and it’s easy to duplicate aggregation work here if we don’t have some sort of cache for previously computed ranges.
So, whenever you create, update, or sync a Segment we log a size metric. Here’s a schema for the metrics table:
create_table "size_metrics", force: :cascade do |t|
t.bigint "parent_object_that_has_metrics_id", null: false
t.integer "size", null: false
t.datetime "timestamp", default: -> { "CURRENT_TIMESTAMP" }, null: false
t.datetime "created_at", precision: 6, default: -> { "CURRENT_TIMESTAMP" }, null: false
t.datetime "updated_at", precision: 6, default: -> { "CURRENT_TIMESTAMP" }, null: false
t.index ["parent_object_that_has_metrics_id", "timestamp"], name: "index_size_metrics_on_parent_object_that_has_metrics_id_and_timestamp"
end
Size metrics are then aggregated into a size aggregations table in a cron job that runs regularly. Here’s a schema for the aggregations table:
create_table "size_aggregations", force: :cascade do |t|
t.bigint "parent_object_that_has_metrics_id", null: false
t.integer "value", null: false
t.string "timescale", null: false
t.datetime "timestamp", default: -> { "CURRENT_TIMESTAMP" }, null: false
t.datetime "created_at", precision: 6, default: -> { "CURRENT_TIMESTAMP" }, null: false
t.datetime "updated_at", precision: 6, default: -> { "CURRENT_TIMESTAMP" }, null: false
t.index ["parent_object_that_has_metrics_id", "timescale", "timestamp"], name: "index_size_aggs_on_parent_object_that_has_metrics_id_and_timescale_and_timestamp", unique: true
end
Here’s a cron job that rolls up metrics into daily aggregations:
class AggregateSizeMetricsJob < ActiveJob::Base
queue_as :default
def perform(*_args)
beginning_of_today = Time.now.utc.beginning_of_day
# Compute metrics aggregations for the last 30 days if they do not exist.
(0..29).each do |i|
beginning_of_day = beginning_of_today - i.days
# Consider skipping this day's aggregation, unless it's today so we update the aggregation as metrics come in.
if i.positive?
num_objects_with_aggregations =
SizeAggregation
.select("DISTINCT parent_object_that_has_metrics_id")
.where(timescale: "daily")
.where(timestamp: beginning_of_day)
.count
num_objects_with_metrics =
SizeMetric
.select("DISTINCT parent_object_that_has_metrics_id")
.where("timestamp >= ? AND timestamp < ?", beginning_of_day, beginning_of_day + 1.day)
.count
# Skip this iteration if we have an aggregation for each object that has a metric for the day.
# Otherwise re-aggregate the metrics for all objects for that day.
#
# I considered grabbing IDs for objects that have metrics but not aggregations and only aggregating those,
# but this post suggests that it may not be worth the effort:
# https://stackoverflow.com/questions/1013797/is-sql-in-bad-for-performance
next if num_objects_with_aggregations == num_objects_with_metrics
end
timescale = "daily"
SizeMetric
.group(:parent_object_that_has_metrics_id)
.select("parent_object_that_has_metrics_id", "AVG(size) as avg_size")
.where("timestamp >= ? AND timestamp < ?", beginning_of_day, beginning_of_day + 1.day)
.each do |aggregated_metric|
aggregation =
SizeAggregation.find_or_initialize_by(
parent_object_that_has_metrics_id: parent_object_that_has_metrics_id,
timescale: timescale,
timestamp: beginning_of_day
)
aggregation.value = Integer(aggregated_metric.avg_size)
aggregation.save!
end
end
end
end
end
end
And then when we need those values for a graph, our controller looks something like this:
def size_metrics
# Get the last 30 days of aggregated metrics
beginning_of_day_30_days_ago = 30.days.ago.utc.beginning_of_day
parent_object_that_has_metrics.
.source_size_aggregations
.where(timescale: "daily")
.where("timestamp > ?", beginning_of_day_30_days_ago)
.order(timestamp: :asc)
end
Problems we faced
Empty State
When a segment is first created does the user have to wait for the aggregation cron job to run to see any size data? When we first built this that was the case. Realizing this wasn’t a great user experience, we launch an asynchronous aggregation when a segment is first created.
Cron Job Is Down
What happens if the aggregation cron job fails to run or errors out for long periods of time? In this case we realized days worth of data could be empty. To remedy this, we updated the cron job to fill in any missing aggregations for the last 30 days for timescales we care about.
Looking ahead
Right now we’re only showing daily aggregations in Census, but down the line we want to support different types of aggregations.
What if an aggregation doesn’t exist for the time range requested by the client? Or what happens when we want a new granularity like weekly aggregations.
It’s not realistic for the cron job to precompute every permutation of range a user might want. Options include:
When we recognize an aggregation doesn’t exist, we can fire off an asynchronous job that will compute that aggregation soon but not as part of the current request lifecycle. This way the client won’t block and risk any request timeouts, but on a subsequent query of the same data it should get something. In this case we could give an indicator in the UI that this data will be available soon.
As an optimization to the aforementioned point, we could return aggregations of already aggregated data as a stopgap. For example, if we have daily aggregations over the requested time range we could roll them up into weekly ones and return that to the client. This should be faster than aggregating across the raw metrics table since there will be less datapoints and will ensure that the client has something to work with while the asynchronous job populates the proper data.
There are many ways we can continue to improve this. One day our data needs might grow large enough we’ll explore data partitioning in Postgres or use something specifically built for high throughput time-series. But the important thing to keep in mind: solve what our users need these metrics for without spending countless hours prematurely optimizing. Do you have any thoughts on counting metrics? Let us know at @CensusDev.
P.S. Help us make these tools. We’re hiring. Come work with us!