Wednesday, July 6, 2016

Cassandra on SSD

This year’s Confitura was my first Confitura ever and my first conference in the role of a speaker. The conference was really great, congratulations to the organizers! My presentation was… OK, I suppose. I should’ve listened to my colleagues and:
  • never gone with a live demo,
  • not used Linux during a presentation (it’s asking for problems).
And there were problems, but now I know that keeping a cool head in such situations is more important than the OS. Anyway, it will be better next time. At least I’m not a virgin in that area anymore ;)

The good thing is that I received a lot of questions. I would like to refer to one of them, because I didn’t answer it in full. Unfortunately, I don’t remember the original version, but it can be paraphrased to something like:
"You showed many Cassandra optimizations on a spinning disk, but it’s 2016 and SSDs are basically a standard, so are those optimizations still useful?"
A few years ago I watched a very interesting presentation about JVM optimization, and one sentence was especially haunting: “Treat your storage as a cassette tape and you will gain maximum performance on it”. This metaphor is quite obvious in the case of a spinning disk, but is it also true for SSDs? As a matter of fact, yes it is!

SSD memory is organized into units of pages and blocks. Typically, an SSD page is 4 KiB in size and an SSD block holds 128 flash pages (thus 512 KiB). Writing a page takes about 100 µs, which is very fast. The problem arises when you want to update existing data, because the only way for an SSD to update an existing page is to:
  • copy the contents of the entire block into memory,
  • erase the block,
  • and then write back the contents of the old block + the updated page.
This phenomenon is called write amplification. Erasing a block normally takes 2 ms, which is 20 times slower than a normal page write! Performance loss is only one of the issues caused by this effect. Another serious problem is SSD endurance: more writes = shorter life of the SSD.
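The steps above can be put into a back-of-the-envelope model. This is just an illustrative sketch using the figures from the text (4 KiB pages, 128 pages per block, 100 µs page write, 2 ms block erase); the names and the model itself are my own simplification, not how any real SSD firmware is written:

```python
# Toy model of SSD write amplification for a single in-place page update.
# All numbers come from the post; real drives vary.

PAGE_KIB = 4
PAGES_PER_BLOCK = 128
BLOCK_KIB = PAGE_KIB * PAGES_PER_BLOCK   # 512 KiB per block

PAGE_WRITE_US = 100     # ~100 us to write one page
BLOCK_ERASE_US = 2000   # ~2 ms to erase one block

def update_one_page():
    """Cost of logically updating a single 4 KiB page:
    erase the whole block, then rewrite all 128 pages."""
    written_kib = BLOCK_KIB  # 512 KiB physically written for a 4 KiB change
    time_us = BLOCK_ERASE_US + PAGES_PER_BLOCK * PAGE_WRITE_US
    amplification = written_kib / PAGE_KIB
    return written_kib, time_us, amplification

written, t, amp = update_one_page()
print(f"{written} KiB written, {t} us total, amplification {amp:.0f}x")
# → 512 KiB written, 14800 us total, amplification 128x
# Compare with a plain sequential page write: just 100 us.
```

So a 4 KiB update can cost roughly 150 times more wall-clock time than a sequential page write, which is exactly why write patterns matter even on SSDs.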

Cassandra engineers were very clever (or lucky :)), because most of the optimizations, although designed to overcome the problems of spinning disks, still apply to SSDs. Cassandra SSTables (which store all data) are immutable, meaning no in-place updates. The compaction process, after combining (and cleaning) multiple SSTables into a single one, writes the result of compaction sequentially. For a sequential write workload, write amplification is equal to 1, which means there is effectively no write amplification.
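To make the append-only idea concrete, here is a minimal sketch of the principle behind SSTables. This is not Cassandra’s actual file format or API, just a hypothetical toy writer showing the pattern: an update never modifies an existing file; each flush writes a new, sorted, immutable file in one sequential pass:

```python
# Toy illustration of the immutable-SSTable write pattern (not Cassandra's
# real on-disk format): data is only ever written sequentially to new files.

import os
import tempfile

class TinySSTableWriter:
    """Hypothetical writer: sorts key/value pairs and writes them out in
    one sequential pass, then never touches the file again."""
    def __init__(self, path):
        self.path = path

    def write(self, rows):
        with open(self.path, "w") as f:
            for key in sorted(rows):
                f.write(f"{key}\t{rows[key]}\n")  # purely sequential I/O

d = tempfile.mkdtemp()
# Two generations of data; the "update" to key 'a' lands in a new file
# instead of rewriting sstable-1 in place.
TinySSTableWriter(os.path.join(d, "sstable-1.txt")).write({"a": 1, "b": 2})
TinySSTableWriter(os.path.join(d, "sstable-2.txt")).write({"a": 3})
# Compaction would later merge these, again sequentially, into a third
# immutable file containing only the newest value of 'a'.
```

Because every file is written front to back and never edited, the SSD only ever sees fresh sequential page writes, which sidesteps the erase-block penalty described above.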

That was a really good question. I hope that its author collected a T-shirt or a cup. If not, please contact me and I will try to fix this somehow :)

1 comment:

  1. I found that DataStax has very good docs about planning hardware:

    BTW, I am still curious about the performance of repartitioning a large amount of production data, like 100 GB :)