Testing Cassandra write performance

With the tests below I was trying to measure Cassandra write performance. First off, this is probably not a good performance comparison for a number of reasons, the main one being that I was running it on my laptop. If you ran this on server-grade hardware with a tuned Cassandra setup, you would probably get higher numbers. That said, I did this to get a general idea of write performance in Cassandra, ’cos “You can’t do much without measuring”.

My setup was,

  • Cassandra running in a VirtualBox VM with default parameters. Only the data directories were changed
  • VM was running Kubuntu 11.04
  • JDK 1.6.0_27
  • JVM was not warmed up before carrying out the test. I started Cassandra with an empty keyspace for each case
  • I was testing the code hosted here
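
A write-timing loop for this kind of test can be sketched roughly as below. This is a minimal Python sketch, not the code linked above: `measure_writes` and `InMemoryStore` are hypothetical names, the store is an in-memory stand-in rather than a real Cassandra client, and the way it fills three dictionaries named after the column families is only an assumed layout for illustration.

```python
import time


def measure_writes(write_row, n_rows, n_props, n_tags):
    """Time n_rows calls to write_row; return (elapsed_seconds, rows_per_second)."""
    start = time.perf_counter()
    for i in range(n_rows):
        # Each row carries n_props properties (P) and n_tags tags (T).
        props = {f"p{j}": f"value-{i}-{j}" for j in range(n_props)}
        tags = [f"t{j}" for j in range(n_tags)]
        write_row(f"row-{i}", props, tags)
    elapsed = time.perf_counter() - start
    rate = n_rows / elapsed if elapsed > 0 else float("inf")
    return elapsed, rate


class InMemoryStore:
    """Hypothetical stand-in for a Cassandra client (assumed layout, not the real schema)."""

    def __init__(self):
        self.reg_data = {}        # row key -> properties
        self.property_index = {}  # (property name, value) -> row keys
        self.tag_index = {}       # tag -> row keys

    def write_row(self, key, props, tags):
        self.reg_data[key] = dict(props)
        for name, value in props.items():
            self.property_index.setdefault((name, value), set()).add(key)
        for tag in tags:
            self.tag_index.setdefault(tag, set()).add(key)


if __name__ == "__main__":
    store = InMemoryStore()
    elapsed, rate = measure_writes(store.write_row, n_rows=10_000, n_props=5, n_tags=3)
    print(f"{elapsed:.3f}s, {rate:.0f} rows/s")
```

To time a real cluster, `store.write_row` would be replaced by a function that performs the actual inserts. Starting from a cold process, as in the setup above, means the first batches also include start-up costs.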

Here’s what the three column families look like.

REGData column family,

PropertyIndex column family,

TagIndex column family,

Here are the results,

The graphs look very similar. However, Cassandra 1.0.2 has faster write speeds. Here are the raw data. In the graphs, T represents the number of tags and P the number of properties.

The sudden spike in the write numbers is a bit scary. It may be due to an I/O bottleneck on the machine: at that point my disk started grinding heavily. I didn’t run any other disk-intensive tasks during the tests. On server-grade hardware with a few fast disks the results might be different, radically even.

CassandraSF 2011

I was at CassandraSF 2011 yesterday and was surprised by the 450 – 500 people who showed up for the conference. The community and the momentum behind Cassandra at the moment are exciting. Jonathan Ellis kicked off the conference with his keynote, where he laid out the progress so far and the features planned for the project.

After Jonathan’s keynote, I went to the CloudSandra presentation, where they’ve implemented a framework with Brisk. The presentation was intriguing, as they’ve developed a multi-tenant REST API for Cassandra. Their meaning of multi-tenancy seemed to differ from the one I’m used to, though. Multi-tenancy as I know it treats a particular organization (an entity, if you will) as a tenant, and you can have individual users inside that tenant. I felt the multi-tenancy in CloudSandra meant something more like a multi-user system. I may be wrong. Their use of Apigee to play with the REST API was very cool, as Apigee provides a neat UI for your REST APIs.

Next was an entertaining talk by Eric Evans. It was quite fun, as Eric is the guy who helped popularize the term NoSQL and the talk was about an SQL-like query language for Cassandra called CQL! It mostly covered the high-level rationale behind CQL, its current state, and some hints as to how it might evolve in time to come.

After that there was a session by Jake Luciani introducing Brisk and what it is all about.

Then David Strauss gave an entertaining talk where he demoed a highly available DNS server written using Cassandra. He showed how data propagates to 3 machines hosted in 3 geographical areas. Very cool!

During Adrian Cockcroft’s presentation he described how Netflix is using Cassandra and how they migrated to it from an Oracle DB, along with a plethora of other operational aspects of Cassandra clusters. One point he stressed during the talk was that, since they run exclusively on EC2 nodes, they have to take instance termination as a fact of life. So they write apps that stay resilient even if a couple of nodes go down. They also deliberately kill random instances and test whether the system functions as expected!
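
The idea of killing random instances and checking that things still work can be illustrated with a toy simulation. This is only a sketch under stated assumptions: it assumes quorum-style writes (a write succeeds when a majority of replicas acknowledge it), which the talk as summarized here did not spell out, and the function names and node names are hypothetical.

```python
import random


def quorum(replication_factor):
    """Majority of replicas: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def write_succeeds(alive_replicas, replication_factor):
    """Assumed rule: a write succeeds if a quorum of replicas is still alive."""
    return alive_replicas >= quorum(replication_factor)


def kill_random(nodes, count, rng):
    """Return the set of nodes that survive after `count` random terminations."""
    victims = set(rng.sample(sorted(nodes), count))
    return set(nodes) - victims


if __name__ == "__main__":
    rng = random.Random(42)
    replicas = {"node-a", "node-b", "node-c"}  # hypothetical node names
    for kills in range(len(replicas) + 1):
        survivors = kill_random(replicas, kills, rng)
        ok = write_succeeds(len(survivors), replication_factor=len(replicas))
        print(f"killed {kills}: survivors={sorted(survivors)} write_ok={ok}")
```

With three replicas and this rule, losing one node still leaves a quorum, while losing two does not.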

For the final session of the day I attended the Cassandra internals talk by Gary Dusbabek, who walked the audience through the Cassandra code base: how everything is wired together and where the critical components are.