Monday, November 15, 2010

5 Minute Intro to Cassandra

I recently got a chance to play with Cassandra, and it was actually pretty enlightening. If you haven't heard of it, it's the distributed NoSQL key-value store open-sourced by Facebook, supported by Apache, and used by Twitter, Reddit, and more. If you need more buzzwords and name-dropping, you'll have to go elsewhere.

Cassandra is actually a pretty different model than designing for SQL-based DBs, so if you've never used it, you'll probably have to spend some time reading about it. (That is why you're here, right?) There are a ton of great tutorials out there (and I'll list some of them at the bottom), but they all have one problem:

They assume you care about how Cassandra works.

I'm not going to assume that. I'm going to assume you're trying to store some massive pile of data in there and you want to kick off the migration script you're about to write before your 4:30 tee time. So, here goes.

THE BASICS

You should think about Cassandra as a giant hashtable. It has 4 or 5 levels, depending on how you're using it. Those levels are:

  1. KeySpace - This is the name of your application. It's hardcoded into the schema file (storage-conf.xml, as of v0.6). If you're coming from SQL, it's like a database.
  2. ColumnFamily - This is a name for a set of data that you have. This is also hardcoded into the schema file -- if you're coming from SQL, it's like a table.
  3. Key - This is the piece of data in your database that forms some logical unit, and you probably have a ton of these. Cassandra knows how to spread keys over multiple machines, so pick something that you have a lot of, like users.
  4. Columns - Your actual data, as columnname/value pairs. These are just key-value pairs, and you don't have to know any of them ahead of time -- no schema needed!
  5. SubColumns - If you mark your ColumnFamily as "Super" in the schema, (4) becomes SuperColumns, and instead of just any old bytestring as the value, you get another level of hash. There's also no schema here -- use any name you want for your SubColumns.

And that's it! You want to get user 57's hashed password? Your query would look kind of like this:

['YourAppName']['ProfileInformation'][57]['password']

You want to see user 23's apartment number?

['YourAppName']['ProfileInformation'][23]['address']['aptNum']

It's pretty straightforward. Cassandra itself does a whole bunch of work to make sure you can do things like this, but the basic model is pretty easy.
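
If it helps to see the "giant hashtable" idea as something you can poke at, here's a minimal sketch using plain nested Python dicts. This is a toy model, not a Cassandra client: the `get` helper, the sample values, and the `ProfileDetails` name are all made up for illustration. I'm also assuming a ColumnFamily is either standard or Super (not both), so the sketch puts the address in a second, hypothetical Super family.

```python
# Toy model only: nested dicts standing in for Cassandra's levels.
# 'ProfileDetails' and every value below are invented for illustration.
store = {
    "YourAppName": {                      # KeySpace
        "ProfileInformation": {           # standard ColumnFamily
            57: {"password": "5f4dcc3b", "email": "user57@example.com"},
        },
        "ProfileDetails": {               # Super ColumnFamily (hypothetical name)
            23: {"address": {"aptNum": "4B", "city": "Springfield"}},
        },
    }
}


def get(store, *path):
    """Walk one dict level per path element and return whatever is there."""
    node = store
    for part in path:
        node = node[part]
    return node


# KeySpace -> ColumnFamily -> Key -> Column
password = get(store, "YourAppName", "ProfileInformation", 57, "password")

# KeySpace -> ColumnFamily -> Key -> SuperColumn -> SubColumn
apt_num = get(store, "YourAppName", "ProfileDetails", 23, "address", "aptNum")
```

Four path elements gets you a column value; five gets you a subcolumn value. That's the whole "4 or 5 levels" thing.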

OH GOD, THE PERFORMANCE BLOWS

OK, so that's not quite all you need to know. If you've done any amount of DB work before, you're probably wondering where to throw indexes. The astute will point out that hash tables don't get to have indexes, and they'll be right. Cassandra does sort some levels of the table for you, though, and that sorting decides which queries are cheap. Let's go through the levels and look at what's efficient at each one.

  1. Keyspace - Part of the schema -- no order.
  2. ColumnFamily - Also part of the schema -- no order.
  3. Keys - Sorted lexicographically (string comparison) or randomly, and you choose this in the schema[1]. Remember that this is the primary mechanism for distributing your keys over multiple computers, so unless you can guarantee me that all your reads or all your writes will happen evenly over the keyspace, use the Random partitioner [2].
  4. (Super)Columns - Always sorted. Default sorting is by byte order (you can't turn sorting off), but you can have it sort by a couple other things, like Long (useful for times) or UTF8. Since it's sorted, you can query by ranges.
  5. SubColumns - Always sorted, in the same way the SuperColumns can be sorted.
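
To make the partitioner choice in (3) a bit more concrete, here's a rough sketch of why ordered keys can pile load onto one machine. It's a simplification under loud assumptions: the four-node cluster, the `ordered_node` and `random_node` helpers, and the "newest 1,000 users are the busy ones" traffic pattern are all invented, and the real partitioners assign token ranges on a ring rather than doing the arithmetic below.

```python
import hashlib
from collections import Counter

NODES = 4  # hypothetical cluster size


def ordered_node(key, max_key):
    """Order-preserving placement: neighbouring keys land on the same node."""
    return min(key * NODES // max_key, NODES - 1)


def random_node(key):
    """Hash-based placement: an MD5 digest of the key picks the node."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODES


# Pretend the newest 1,000 of 100,000 user ids generate all the traffic.
max_key = 100_000
hot_keys = range(99_000, 100_000)

# Count how many hot keys each node ends up serving.
ordered_load = Counter(ordered_node(k, max_key) for k in hot_keys)
random_load = Counter(random_node(k) for k in hot_keys)
```

With ordered placement every hot key lands on the last node, while the hashed placement spreads the same keys over all four. The price of hashing is that keys are no longer stored in a meaningful order.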

One common case is to set up users as keys, and timestamps of events (like tweets or incoming mail) as columns. That way, you can get the first 10 messages in a user's inbox, or all tweets associated with a person or a Facebook wall, with a range query over columns.
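
Here's a small sketch of what that inbox pattern buys you, assuming integer timestamps as column names. The `Row` class and its `insert` and `slice` methods are hypothetical stand-ins, not Cassandra's API; the point is only that keeping column names sorted turns "first 10" and "everything between two times" into cheap lookups.

```python
import bisect


class Row:
    """One key's columns, kept sorted by column name (here: integer timestamps)."""

    def __init__(self):
        self._names = []    # column names, always in sorted order
        self._values = {}   # column name -> value

    def insert(self, name, value):
        # Only add the name to the sorted list the first time we see it.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start=None, finish=None, count=10):
        """Return up to `count` (name, value) pairs with start <= name <= finish."""
        lo = 0 if start is None else bisect.bisect_left(self._names, start)
        hi = len(self._names) if finish is None else bisect.bisect_right(self._names, finish)
        return [(n, self._values[n]) for n in self._names[lo:hi][:count]]


# One user's inbox: 50 messages with made-up timestamps 1000..1049.
inbox = Row()
for ts in range(1000, 1050):
    inbox.insert(ts, f"message at {ts}")

first_ten = inbox.slice(count=10)               # the 10 oldest messages
window = inbox.slice(start=1020, finish=1024)   # everything in a time range
```

Both calls find their starting point with a binary search over the sorted names instead of scanning the whole row.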

Anything you'll want to query as a single element, you want to set up as a key, (Super)Column, or SubColumn. Anything you'll want to query as a range, you want to set up as a (Super)Column or SubColumn.

THAT'S IT! ALSO, LINKS

That's all you need to know to get started! What you actually do with it is up to you.

I promised you better resources (for if it looks like it's going to rain, or your golfing buddy bailed), so here's a few I found useful:


[1] Think real hard about this one. You can't change it later.
[2] Foursquare uses MongoDB and the equivalent of the Ordered partitioner, and they discovered that more recent users tended to use the service more than older users. This caused their "latest users" server to crash, taking down their whole operation. Do you still think your writes will be distributed evenly across your keys?
