"Cassandra" is one of the various "Big Data" data storage and retrieval platforms that are available currently, especially for use with Cloud-based applications. Many companies with Web applications serving many simultaneous users that have significant data requirements are now utilizing the platform to serve as the data foundation for these applications. It helps to achieve the performance levels they require to support their large, active user bases, while also available at an attractive price (zero licensing costs, minimal hardware costs) as compared to some of the commercial offerings currently available.
Cassandra was originally developed by Facebook to handle the massive number of parallel reads and writes that were required by their user base when interacting with various Facebook pages, especially when searching. As you can imagine, every time you pull up your Facebook page, many different tables of data are accessed for various purposes to provide the content for a given page, with potentially millions of people accessing this same data at the same time. Performance was obviously a key challenge that required a non-traditional solution. Hence, Cassandra was born.
The key to the Cassandra approach is an elimination of the SQL query language (these are instead simple key-value pair text file writes with its own reduced query language), non-support of data joins, and elimination of other performance-heavy database "overhead" features that are present within Oracle, SQL Server, MySQL, and other traditional database plaforms. This feature reduction makes it ideal for storing and retrieving data at high speeds in Web applications with heavy data access loads. It is also architected and optimized for running in the Cloud within multi-tenant, heavy-use applications. Ultimately, it is a sacrifice of features and capability in exchange for speed and simplicity.
Cassandra is especially ideal for Web-scale applications where extensive/high levels of I/O (disk reads and writes) are required. This is different than Hadoop "Big Data" applications (another Big Data platform you might have heard of) for example, where the optimization is more around number-crunching and analytics rather than mass data reads and writes. In fact, these two popular Big Data platforms are more complementary than they are competitive.
Cassandra is an open-source platform available here:
Here are some benefits of Cassandra:
- Massively scalable
- High performance
- Highly reliable and available
- Redundant: distributed node approach eliminates failure/data loss (data is replicated across all nodes)
- No single point of failure
- All of the distributed data storage is abstracted away from the applications, so more distributed nodes can be added at any time for increased performance, and the interface to the data access remains the same and very simple
There are four primary reasons people use Cassandra:
- High volume performance needed for massive number of reads and writes in multi-tenant Web applications.
- Data architecture is fairly simple, not requiring extensive querying capabilities.
- Cost versus commercial database platforms (commercial providers for RDBMS and storage platforms would charge $$$$ for anything near these kinds of performance results).
- Cassandra works across commodity hardware (PCs) - no high end RAID servers, etc. required, keeping costs on the hardware side very low as well.
The popularity of Cassandra is exploding. At the Cassandra Summit last week in Santa Clara hosted by DataStax (one of the commercial entities focused on Cassandra, along with Acunu), there were over 800 attendees, four times as many that were at the initial event two years ago, and twice as many as last year. Over 1000 companies (such as NetFlix) have Cassandra-based applications in production now. Other companies with Cassandra in production environments include Constant Contact, Twitter, Digg, Walmart Labs, and Cisco Webex. It’s clearly finding usage scenarios, growing in popularity and catching on.
Here is a simple tutorial on building a Cassandra-based application from scratch (will help to understand the basics of implementing it) that demonstrate how easy it is to put to use:
So if you are developing Web applications that have a significant associated data requirement, it might be time to give Cassandra a look.
Late last week, Amazon released an update
to its DynamoDB
service, a fully managed NoSQL
offering for efficiently handling extremely large amounts of data in Web-scale (generally meaning very high user volume) application environments. The DynamoDB offering was originally launched in beta back in January, so this is its first update since then.
The update is a "batch write/update" capability, enabling multiple data items to be written or updated in a single API call. The idea is to reduce Internet latency by minimizing trips back and forth to Amazon's various physical data storage entities from the calling application. According to Amazon, this was in response to developer forum feedback requests.
This update to help address what was already an initial key selling point of DynamoDB tells us that latency is still a significant challenge for cloud-based storage. After all, one of the key attributes of DynamoDB when first launched was speed and performance consistency, something that their NoSQL precursor to DynamoDB, SimpleDB
, was unable to deliver, at least according to some developers and users who claimed data retrieval response times ran unacceptably into the minutes. This also could have been a primary reason for SimpleDB's lower adoption rates. Amazon is well aware of these performance challenges, and hence the significance of its first DynamoDB update.
Another key tenant of DynamoDB is that it is a managed offering, meaning the details of data management requirements such as moving data from one distributed data store to another is completely abstracted away from the developer. This is great news, as complexity of cloud environments was proving to be too challenging for many developers trying to leverage cloud storage capabilities. The masses were scratching their heads as to how to overcome storage performance bottlenecks, attain replication, achieve response latency consistency, and perform other operations-related data management challenges when it was in their purview to do so. By the way, management complexity will likely still be a major challenge for other NoSQL vendors, and there are many "big data" startups offering products in this category, who do not offer the same level of abstraction that DynamoDB offers. It will be interesting to see if the launch of DynamoDB becomes a significant threat to many of these startups.
We learned this reduction of complexity lesson at StrikeIron
within our own niche offerings as well. We gained a much bigger uptake of our simpler, more granular Web services APIs, such as email verification
, address verification
, and other products such as reverse address and telephone lookups
as single, individual services, rather than complex services with many different methods and capabilities. This proved true even if the the more complex services provided more advanced power within a single API. In other words, simplified remote controls for television sets are probably still the best idea for maximum television adoption, as initial confusion and frustration tends to be inversely proportional to the adoption of any technology.
Another interesting point is that this is the fifth class of database product offerings in Amazon's portfolio. Along with DynamoDB, there is also still the aforementioned SimpleDB, a schemaless NoSQL offering for "smaller" datasets. There is also the original S3
offering with a simple Web service based interface for storing, retrieving, and deleting data objects in a straightforward key/value pair format. Next, there is Amazon RDS
for managed, relational database capabilities that utilize traditional SQL for manipulating data and is more applicable for traditional applications. Finally, there are the various Amazon Machine Image (AMI) offerings on EC2
(Oracle, MySQL, etc.) for those who don't want a managed relational database and would rather have complete control over their instances (and not have to utilize their own hardware) and the RDBMs that run on them.
This tells us that the world is far from one-size-fits-all cloud database management systems, and we can all expect to be operating in hybrid storage environments that will vary from application to application for quite some time to come. I suppose that's good news for those who make a living on the operations teams of information technology.
And along with each new database offering from Amazon also comes a different business model. In the case of DynamoDB for example, Amazon has introduced the concept of "read and write capacity units", where charges will be based on the combination of frequency of usage and physical data size. This demonstrates that the business models are still somewhat far from optimal, and will likely change again in the future. Clearly they are not yet quite right for the major vendors trying to figure it all out as business model adjustments in the Cloud are not just limited to Amazon.
In summary, following the Amazon database release timeline over the years yields some interesting information, namely that speed/latency, reduction of complexity, the likelihood of hybrid compute and storage environments for some time to come, and ever-changing cloud business models are the primary focus of cloud vendors responding to the needs of their users. And as any innovator knows, the challenges are where the opportunities are.
2011 has been the year of the Cloud database. The idea of shared database resources and the abstraction of underlying hardware seems to be catching on. Just like Web and application servers, paying-as-you-go and eliminating unused database resources, licenses, hardware, and all of the associated cost is proving to have attractive enough business models that the major vendors are betting on it in significant ways.
The recent excitement has not been limited to just the fanfare around "big data" technologies. Lately, most of the major announcements have come around the traditional relational, table-driven SQL environments Web applications make use of much more widely than the key-value pair data storage mechanisms "NoSQL" technology uses for Web-scale data-intensive applications such as Facebook, NetFlix, etc.
Here are some of the new Cloud database offerings for 2011:
Saleforce.com has launched Database.com, enabling developers in other Cloud server environments such as Amazon's EC2 and the Google App Engine to utilize its database resources, not just users of Salesforce's CRM and Force.com platforms. You can also build applications in PHP or on the Android platform and utilize Database.com resources. The idea is to reach a broader set of developers and application types than just CRM-centric applications.
At Oracle Open World a couple of weeks ago, Oracle announced the Oracle Database Cloud Service, a hosted database offering running Oracle's 11gR2 database platform available in a monthly subscription model, accessible either via JDBC or its own REST API.
Earlier this month, Google announced Google Cloud SQL, a database service that will be available as part of its App Engine offering based on MySQL, complete with a Web-based administration panel.
Amazon, to complement its other Cloud services and highly used EC2 infrastructure, has made the Amazon Relational Database Service (RDS) available to enable SQL capabilities from Cloud applications, giving you a choice of underlying database technology to use such as MySQL or Oracle. It is currently in beta.
Microsoft also has its SQL Azure Cloud Database offering available in the Cloud, generally positioned as suited for applications that use the Microsoft stack for developers that will want to leverage some of the benefits of the Cloud.
Some of the above offerings have only been announced so far, and not actually launched. Or, they have limited preview access available now. Also, even the business models in some of these cases have not even been completely divulged, or if so are very likely to change.
Clearly there is a considerable marketshare land grab existing now. All of the major vendors are recognizing that traditional-SQL Cloud storage infrastructure will be an important technology going forward. Adding a solid database layer to the Cloud architecture story seems like an important step in the continuing enterprise and commercial software move to the Cloud, and these new vendor offerings should in turn accelerate this move.
So, is this really the wave of the future? Some of the major questions that will have to be answered include those around latency. When data requests have to hop from a client application, then to the application server, to the database, and then back to the server and client, even multiple times within a single request, it can result in quite a performance hit. Likely, these machines exist far from each other geographically and might really slow things done, annoying an end-user with the slow page loads. This is probably why most infrastructure providers realize that they have to have the corresponding database capabilities available and accessed natively to reduce this latency. However, performance, along with security issues (perceived or otherwise) still could be a significant barrier to mainstream adoption.
Also, most of the relational database environments that exist in the Cloud only have a subset of SQL capabilities available and in some cases can be quite limited. For example, many of these Cloud SQL platforms don't support cross-table joins, at least not yet. This is a very common requirement for SQL applications. The lack of support is primarily because joins can consume a lot of resources, another performance-killer in shared environments.
Once most of this storage and Cloud database infrastructure gets in place however, incorporating more content-oriented data services such as customer data verification will become commonplace and easy to leverage. We may even see them incorporated into the database offerings themselves as they look to differentiate themselves from vendor to vendor. Cloud-based database offerings have the advantage of making much larger libraries of data-oriented add-on capabilities available right out of the box, so the story here is much more than just cost.
While SQL Cloud offering announcements are all the rage in 2011, 2012 will undoubtedly tell the adoption tale. No doubt these offerings will be ideal and cost-effective for many use cases out there. But will demand be large enough quickly enough to support all of these vendors and drive the innovation at a speed that will make these platforms viable in the near future for enterprise and commercial applications? The answer is likely yes, but the next twelve months or so will give us a lot of the supporting data to measure the extent of the trend.
One thing that's clear as we pass the halfway point of 2010 is that the Cloud Computing movement is not only gaining momentum, but the usage trends of the Web that are driving Cloud Computing are only increasing in influence and contributing to its momentum at a faster pace than ever.
For example, Facebook's Chief Technical Officer reported last month that they were seeing as many as one million photos being served up per second
through the entirety of their Web-based social application, and that they expect this to increase ten-fold
over the next twelve months.
Also, how many of us watched some streaming World Cup soccer games over the past month as Spain proved supreme in South Africa? Or at least highlights on YouTube and various other video outlets? Currently, it is estimated that 50% of all Web traffic is video. That's not surprising, but with High Definition (HD) Web technology and the like emerging, video is expected to represent 90% of all traffic in just a few years. This is going to require bandwidth levels that were largely unthinkable years ago.
On another front, mobile infrastructure is not keeping pace with demand. Right now, some estimates have shown mobile infrastructure requirements growing at about 50% per year, while actual mobile network infrastructure capacity is only growing at 20% per year. This is going to be a real problem, and one of the reasons some mobile carriers such as AT&T have begun capping usage and introducing fees for premium levels of bandwidth that were standard issue up until now, and other carriers may likely follow suit. It's the only way to help curtail demand to meet capacity in their eyes.
So what does all of this mean?
One of the reasons we have Cloud Computing in the first place is that innovative Web companies such as Amazon and Google had to build out enough computing capacity to handle peak periods of Web traffic and activity, especially Amazon during its Christmas holiday crunch.
As a result, they found themselves not only experts at building out distributed computing capacity, load-balancing, and data synchronization, but also found that most of the time they had all sorts of computing power that they had invested in for peak periods "shelved" and not in use, far from cost-optimized. This led them to think of ways to monetize this excess capacity (servers and disk space lying around idle) and led to some of the early thinking and innovation around Web-based centralized computing. The same is true with Google and others with all of their excess Web computing power, as they looked for ways to monetize large, excess amounts of capacity and leverage their expertise at building out server farms and developing highly-distributed, yet high performing levels of computing.
This same necessity-is-the-mother-of-invention phenomenon is playing out now as Facebook develops new technology to serve up its millions of photos per second, and is spawning new data storage and retrieval technology such as the NoSQL paradigm shift, with new non-SQL and "not only" SQL architecture approaches such as Cassandra, BigTable, Neptune, Google Fusion Tables, and Dynamo that are more finely tuned to the needs of Web-scale Cloud Computing.
In parallel, the bandwidth demands of video and mobile infrastructure are seeding new innovation around capacity and distribution of bandwidth as well, including much more efficient and easier to implement elastic computing capabilities to handle these variable bandwidth demands as much of mobile's required computing requirements are moved to and answered via the Web (and this also makes SmartPhones ideal Cloud Computing clients, also pushing the paradigm).
While not only mind-boggling and exciting, these trends are the cornerstones of a revolution already in progress. All of this demand-driven innovation is only causing more and more build-out of the foundation from which the future Internet and "Cloud" will emerge. A few years from now, we will look back and see how the Web computing demands of today, whether from Facebook, Google, Twitter, or others, enabled a whole new generation of Web applications to emerge. And of course, huge amounts of data were gobbled up in the process, a lot of which will have come from StrikeIron's own data delivery innovation in the Cloud.
No doubt about it, the Cloud is a good place to be.