Comment System Data

I’d like to spend some time tonight thinking about what kind of data we will need to store and serve for the comment system. First, let’s take a look at the feature wish list and see if this helps derive some detail.

Up/Down Votes

This feature provides a crowd-sourced form of moderation, for free! Basically, we want to give comments buoyancy: comments with a higher ranking rise to the top of their siblings, while comments with a negative ranking sink and may eventually be removed altogether.
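As a rough illustration of buoyancy, here is a minimal Python sketch that orders a list of sibling comments by score and drops the ones that have sunk too far. The `(text, score)` tuples, the `rank_siblings` name, and the `HIDE_BELOW` cutoff are all assumptions made for the example; nothing here is a settled design.

```python
from typing import List, Tuple

# Hypothetical cutoff: siblings scoring below this are dropped from the view.
HIDE_BELOW = -5


def rank_siblings(siblings: List[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Return (text, score) pairs, highest score first, minus buried comments."""
    visible = [c for c in siblings if c[1] >= HIDE_BELOW]
    # sorted() is stable, so comments with equal scores keep their original order.
    return sorted(visible, key=lambda c: c[1], reverse=True)
```

Whether a buried comment is hidden, collapsed, or actually deleted is a separate decision; the sketch only filters it out of the returned list.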

There are a couple of considerations here. Should anonymous up/down votes count for anything? If so, should they count the same as an OAuth’d user’s vote, or as some fraction of it? These questions matter for determining what a reasonable range of values might be. If anonymous users are allowed to vote, a page popular for drive-by trolling could need to count a very large number of votes. At the same time, a user is not really going to want to see that a particular comment has some outlandish number of upvotes. As for the second question, if anonymous votes count as some fraction of an authenticated user’s vote, we’ll need to account for fractional votes.

A 16-bit blob will be fine for this. Even assuming that we want to allow an anonymous vote to count for 1/8th of a full vote, we would still have 13 bits of space to track whole votes. We could use the 3 least significant bits to hold the fractional part, which is essentially a fixed-point number representation with a scaling factor of 1/8. This would still allow whole-vote totals up to 2^13 - 1 = 8191, assuming the field is unsigned; reserving a sign bit so that a score can go negative would roughly halve that range. If that’s not enough, it’s probably time to reconsider your use of the comment system.
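To make the fixed-point idea concrete, here is a minimal Python sketch. It assumes an unsigned 16-bit field, a full vote worth 8 units, and an anonymous vote worth 1 unit (1/8 of a vote). The names (`add_vote`, `to_votes`, `pack`, `unpack`), the little-endian byte order, and the choice to clamp at the ends of the range rather than wrap around are my own assumptions, not decisions already made above.

```python
import struct

SCALE = 8          # 3 fractional bits: one stored unit is 1/8 of a vote
MAX_RAW = 0xFFFF   # largest value an unsigned 16-bit field can hold

FULL_VOTE = SCALE  # assumed weight of an authenticated vote (8 units)
ANON_VOTE = 1      # assumed weight of an anonymous vote (1 unit)


def add_vote(raw: int, units: int) -> int:
    """Add units (negative for a down vote), clamping to 0..MAX_RAW."""
    return max(0, min(MAX_RAW, raw + units))


def to_votes(raw: int) -> float:
    """Convert the stored fixed-point value back to a vote count."""
    return raw / SCALE


def pack(raw: int) -> bytes:
    """Serialize the score as a 2-byte little-endian unsigned integer."""
    return struct.pack("<H", raw)


def unpack(blob: bytes) -> int:
    """Read a score back from its 2-byte form."""
    return struct.unpack("<H", blob)[0]
```

For example, one authenticated upvote followed by one anonymous upvote stores the raw value 9, which `to_votes` reports as 1.125 votes. Because the field is unsigned here, a score cannot drop below zero; tracking negative rankings would need a signed field or an offset.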

Threaded Comments

This requirement immediately makes me visualize a tree data structure, though we can make some optimizations given its particular use. Storing this data in a format that’s compact yet efficient to read seems tricky, so I might punt on this for a while and think on it.

I’m not sure I’d like to support full, N-level branching à la Reddit. I think I’ll build in an arbitrary max depth of three. That allows for a comment on a comment and a rebuttal. Beyond that depth, further descendants are not allowed; responses are added as siblings instead.

There are a few things to note about this:

  • Comments will be read depth-first for some subset of the top-level comments
  • There should be a max depth of three
  • Comments will be deleted, removed, and edited
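
The depth cap and the depth-first read can be sketched with a small in-memory tree in Python. This is only an illustration of the rules listed above, not a storage format: the `Comment` class, its fields, and the `reply` and `walk` helpers are hypothetical names, and top-level comments are assumed to sit at depth 1.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

MAX_DEPTH = 3  # top-level comment = depth 1, reply = 2, rebuttal = 3


@dataclass(eq=False)
class Comment:
    text: str
    depth: int = 1
    # repr=False avoids infinite recursion through the parent/children cycle.
    parent: Optional["Comment"] = field(default=None, repr=False)
    children: List["Comment"] = field(default_factory=list)


def reply(target: Comment, text: str) -> Comment:
    """Attach a reply to target; at MAX_DEPTH it becomes a sibling instead."""
    parent = target if target.depth < MAX_DEPTH else target.parent
    child = Comment(text, parent.depth + 1, parent)
    parent.children.append(child)
    return child


def walk(comment: Comment) -> Iterator[Tuple[int, str]]:
    """Yield (depth, text) pairs in depth-first, pre-order."""
    yield comment.depth, comment.text
    for child in comment.children:
        yield from walk(child)
```

Replying to a depth-three comment attaches the new comment to that comment's parent, so it shows up as the next sibling rather than as a fourth level. Deleting and editing are left out of the sketch.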

Subscribe to updates

I want to allow a drive-by user to subscribe to updates for the comment stream. Subscribing means the user will be notified by email about updates. This should probably be batched to some degree. It wouldn’t do to provide a system for blowing up somebody’s inbox by registering them for a heated debate.

This implies some additional storage considerations at the top level to track email addresses that have requested a subscription.
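
One way the batching could work is a per-subscriber digest: updates pile up in a queue, and each address gets at most one email per time window. The Python sketch below assumes a one-hour window and uses made-up names (`DigestQueue`, `notify`, `due`, `BATCH_SECONDS`); actually sending the email, and persisting the addresses, is out of scope here.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

BATCH_SECONDS = 3600  # assumed window: at most one email per address per hour


class DigestQueue:
    def __init__(self) -> None:
        self.pending: Dict[str, List[str]] = defaultdict(list)
        self.last_sent: Dict[str, float] = {}

    def notify(self, email: str, update: str) -> None:
        """Queue an update for a subscriber without sending anything yet."""
        self.pending[email].append(update)

    def due(self, now: float) -> List[Tuple[str, List[str]]]:
        """Return (email, updates) digests whose batch window has elapsed."""
        ready = []
        for email in list(self.pending):
            updates = self.pending[email]
            last: Optional[float] = self.last_sent.get(email)
            if updates and (last is None or now - last >= BATCH_SECONDS):
                ready.append((email, updates))
                del self.pending[email]
                self.last_sent[email] = now
        return ready
```

A caller would poll `due()` with the current time and send one message per returned pair. The first update for an address goes out on the next poll, and anything arriving within the following hour waits for the next digest.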

Summary

Well, I wanted to take a look at some of the potential data requirements in this post and I ended up getting pretty far down in the weeds. It’s my project though, so why not dig into the details where it’s interesting? :-)

We’ll revisit these points in a later post and start brainstorming some storage formats that would be well suited for this.