"Cassandra" is one of the various "Big Data" data storage and retrieval platforms that are available currently, especially for use with Cloud-based applications. Many companies with Web applications serving many simultaneous users that have significant data requirements are now utilizing the platform to serve as the data foundation for these applications. It helps to achieve the performance levels they require to support their large, active user bases, while also available at an attractive price (zero licensing costs, minimal hardware costs) as compared to some of the commercial offerings currently available.
Cassandra was originally developed by Facebook to handle the massive number of parallel reads and writes that were required by their user base when interacting with various Facebook pages, especially when searching. As you can imagine, every time you pull up your Facebook page, many different tables of data are accessed for various purposes to provide the content for a given page, with potentially millions of people accessing this same data at the same time. Performance was obviously a key challenge that required a non-traditional solution. Hence, Cassandra was born.
The key to the Cassandra approach is that it drops full SQL in favor of its own reduced query language (CQL) over a simple data model in which rows are written and read by key, does not support data joins, and leaves out other performance-heavy database "overhead" features that are present in Oracle, SQL Server, MySQL, and other traditional database platforms. This feature reduction makes it ideal for storing and retrieving data at high speeds in Web applications with heavy data access loads. It is also architected and optimized for running in the Cloud within multi-tenant, heavy-use applications. Ultimately, it is a sacrifice of features and capability in exchange for speed and simplicity.
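To make that trade-off concrete, here is a minimal sketch of what key-based access without joins looks like from the application's side. It is a toy in-memory model in plain Python, not the Cassandra driver or its storage engine; `ToyTable`, `posts_by_user`, and `add_post` are hypothetical names invented for this illustration.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table that is written and read by key."""

    def __init__(self):
        # partition key -> {clustering key -> row}
        self._partitions = defaultdict(dict)

    def insert(self, partition_key, clustering_key, row):
        # Writing the same key pair again overwrites the earlier row.
        self._partitions[partition_key][clustering_key] = row

    def select(self, partition_key, limit=None):
        # One key lookup, then rows returned in clustering-key order.
        partition = self._partitions.get(partition_key, {})
        rows = [partition[key] for key in sorted(partition)]
        return rows if limit is None else rows[:limit]


# Hypothetical table built for one query: "show the posts on a user's page".
posts_by_user = ToyTable()


def add_post(user_id, posted_at, author_name, text):
    # The author's name is copied into every row (denormalized), so reading
    # a page never needs a join against a separate users table.
    row = {"posted_at": posted_at, "author_name": author_name, "text": text}
    posts_by_user.insert(user_id, posted_at, row)


add_post("u1", 2, "Ada", "second post")
add_post("u1", 1, "Ada", "first post")
add_post("u2", 1, "Bob", "hello")

page = posts_by_user.select("u1")
print([post["text"] for post in page])  # ['first post', 'second post']
```

The design choice to notice is that the table is shaped around the query it has to answer. Data is duplicated at write time so that each read is a single lookup by key, which is the kind of feature-for-speed exchange described above.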
Cassandra is especially well suited to Web-scale applications that require high levels of I/O (disk reads and writes). This differs from Hadoop (another Big Data platform you might have heard of), for example, where the optimization is more around number-crunching and analytics than mass data reads and writes. In fact, these two popular Big Data platforms are more complementary than they are competitive.
Cassandra is an open-source platform available here:
Here are some benefits of Cassandra:
- Massively scalable
- High performance
- Highly reliable and available
- Redundant: the distributed node approach guards against failure and data loss (each piece of data is replicated across multiple nodes, according to a configurable replication factor)
- No single point of failure
- All of the distributed data storage is abstracted away from the applications, so more distributed nodes can be added at any time for increased performance, and the interface to the data access remains the same and very simple
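The replication and add-a-node points in the list above can be sketched with a toy hash ring. This is a simplified illustration in plain Python, not Cassandra's implementation: the real system differs in many details (its partitioner, virtual nodes, replication strategies). `ToyRing`, `token`, the node names, and the use of MD5 are all assumptions made for the example.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is used only for illustration)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class ToyRing:
    """Toy ring: each key is stored on the next few nodes clockwise from its token."""

    def __init__(self, nodes, replication_factor=3):
        self.replication_factor = replication_factor
        # Sorted list of (token, node name) pairs.
        self._ring = sorted((token(node), node) for node in nodes)

    def add_node(self, node):
        # A new node takes a position on the ring; callers of replicas()
        # do not change how they ask for data.
        bisect.insort(self._ring, (token(node), node))

    def replicas(self, key):
        """Return the nodes that would hold copies of the given key."""
        tokens = [node_token for node_token, _ in self._ring]
        start = bisect.bisect_right(tokens, token(key))
        count = min(self.replication_factor, len(self._ring))
        return [self._ring[(start + i) % len(self._ring)][1] for i in range(count)]


ring = ToyRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
before = ring.replicas("user:42")

ring.add_node("node-e")
after = ring.replicas("user:42")

# Same call before and after the cluster grows; each key still has 3 copies.
print(before)
print(after)
```

Two things carry over from the list: every key lives on more than one node, so losing a single node does not lose the data, and the application-facing call (`replicas` here, a read or write in practice) looks the same however many nodes sit behind it.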
There are four primary reasons people use Cassandra:
- High volume performance needed for massive number of reads and writes in multi-tenant Web applications.
- Data architecture is fairly simple, not requiring extensive querying capabilities.
- Cost versus commercial database platforms (commercial providers for RDBMS and storage platforms would charge $$$$ for anything near these kinds of performance results).
- Cassandra works across commodity hardware (PCs) - no high end RAID servers, etc. required, keeping costs on the hardware side very low as well.
The popularity of Cassandra is exploding. At the Cassandra Summit last week in Santa Clara, hosted by DataStax (one of the commercial entities focused on Cassandra, along with Acunu), there were over 800 attendees: four times as many as at the initial event two years ago, and twice as many as last year. Over 1,000 companies (such as Netflix) have Cassandra-based applications in production now. Other companies with Cassandra in production environments include Constant Contact, Twitter, Digg, Walmart Labs, and Cisco Webex. It is clearly finding real usage scenarios and catching on.
Here is a simple tutorial on building a Cassandra-based application from scratch. It will help you understand the basics of implementing it, and it demonstrates how easy it is to put to use:
So if you are developing Web applications that have a significant associated data requirement, it might be time to give Cassandra a look.