
StrikeIron Blog


Cassandra Emerging as the Premier Web-scale Database Platform


"Cassandra" is one of the various "Big Data" data storage and retrieval platforms that are available currently, especially for use with Cloud-based applications. Many companies with Web applications serving many simultaneous users that have significant data requirements are now utilizing the platform to serve as the data foundation for these applications. It helps to achieve the performance levels they require to support their large, active user bases, while also available at an attractive price (zero licensing costs, minimal hardware costs) as compared to some of the commercial offerings currently available.

Cassandra was originally developed by Facebook to handle the massive number of parallel reads and writes generated by its user base when interacting with various Facebook pages, especially when searching. As you can imagine, every time you pull up your Facebook page, many different tables of data are accessed to provide the content for that page, with potentially millions of people accessing the same data at the same time. Performance was obviously a key challenge that required a non-traditional solution. Hence, Cassandra was born.

The key to the Cassandra approach is the elimination of the SQL query language (data is instead written as simple key-value pairs and accessed through Cassandra's own reduced query language), no support for data joins, and the removal of other performance-heavy database "overhead" features present in Oracle, SQL Server, MySQL, and other traditional database platforms. This feature reduction makes it ideal for storing and retrieving data at high speeds in Web applications with heavy data access loads. It is also architected and optimized for running in the Cloud within multi-tenant, heavy-use applications. Ultimately, it sacrifices features and capability in exchange for speed and simplicity.
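To make that concrete, here is a minimal sketch of what reads and writes look like through Cassandra's reduced query language, using the open-source DataStax Python driver (the keyspace, table, and column names are hypothetical examples, not from any particular application):

    # Minimal sketch with the DataStax Python driver (pip install cassandra-driver).
    # Keyspace, table, and column names are hypothetical.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])      # one contact point; the driver finds the rest
    session = cluster.connect("app_ks")   # hypothetical keyspace

    # Writes and reads are simple statements keyed by the partition key --
    # no joins, no subqueries, just fast inserts and lookups.
    session.execute(
        "INSERT INTO users (user_id, name) VALUES (%s, %s)",
        ("u123", "Alice"),
    )
    row = session.execute(
        "SELECT name FROM users WHERE user_id = %s", ("u123",)
    ).one()
    print(row.name)

    cluster.shutdown()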

Cassandra is especially well suited to Web-scale applications where high levels of I/O (disk reads and writes) are required. This differs from Hadoop (another Big Data platform you may have heard of), for example, which is optimized more for number-crunching and analytics than for mass data reads and writes. In fact, these two popular Big Data platforms are more complementary than competitive.

Cassandra is an open-source platform available here:
http://cassandra.apache.org/

Here are some benefits of Cassandra:

- Massively scalable
- High performance
- Highly reliable and available
- Redundant: the distributed node approach guards against failure and data loss (data is replicated across multiple nodes)
- No single point of failure
- All of the distributed data storage is abstracted away from the applications, so more nodes can be added at any time for increased performance while the data access interface stays the same and remains very simple (see the sketch below)
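As a rough illustration of those last points, replication is declared once per keyspace rather than managed by the application, and the driver discovers nodes on its own, so adding nodes never changes application code. A sketch (the keyspace name and replication factor are just example values):

    # Replication is configured declaratively, per keyspace; here each row
    # is stored on 3 nodes. Application code never changes when nodes are
    # added -- the driver discovers the cluster topology itself.
    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS app_ks
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
    """)
    cluster.shutdown()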

There are four primary reasons people use Cassandra:

- High-volume performance needed for massive numbers of reads and writes in multi-tenant Web applications.

Amazon's NoSQL and Database Evolution: What Can Be Learned

Late last week, Amazon released an update to its DynamoDB service, a fully managed NoSQL offering for efficiently handling extremely large amounts of data in Web-scale (generally meaning very high user volume) application environments. DynamoDB originally launched in beta back in January, so this is its first update since then.

The update adds a "batch write/update" capability, enabling multiple data items to be written or updated in a single API call. The idea is to reduce Internet latency by minimizing round trips between the calling application and Amazon's various physical data storage entities. According to Amazon, this was in response to requests on its developer forums.
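For a sense of what batching looks like from the calling application, here is a sketch using boto3, a later AWS SDK than what existed when this was written, so treat it purely as an illustration (the table name and item attributes are hypothetical):

    # Sketch of a DynamoDB batch write via boto3 (pip install boto3).
    # Table name and attributes are hypothetical.
    import boto3

    table = boto3.resource("dynamodb").Table("Users")

    # batch_writer() buffers put/delete requests and flushes them as
    # BatchWriteItem calls of up to 25 items each -- one round trip per
    # batch instead of one per item.
    with table.batch_writer() as batch:
        for i in range(100):
            batch.put_item(Item={"user_id": f"u{i}", "status": "active"})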

The fact that this update shores up what was already a key selling point of DynamoDB tells us that latency remains a significant challenge for cloud-based storage. After all, one of DynamoDB's key attributes at launch was speed and performance consistency, something its NoSQL precursor, SimpleDB, was unable to deliver, at least according to some developers and users who reported data retrieval response times running unacceptably into the minutes. This may also have been a primary reason for SimpleDB's lower adoption rates. Amazon is well aware of these performance challenges, hence the significance of this first DynamoDB update.

Another key tenet of DynamoDB is that it is a managed offering, meaning the details of data management requirements, such as moving data from one distributed data store to another, are completely abstracted away from the developer. This is great news, as the complexity of cloud environments was proving too challenging for many developers trying to leverage cloud storage capabilities. The masses were scratching their heads over how to overcome storage performance bottlenecks, attain replication, achieve consistent response latency, and handle other operations-related data management challenges when it was in their purview to do so. Management complexity will likely remain a major challenge for other NoSQL vendors, and there are many "big data" startups offering products in this category who do not offer the level of abstraction that DynamoDB does. It will be interesting to see whether DynamoDB's launch becomes a significant threat to many of these startups.

We learned this reduction-of-complexity lesson at StrikeIron within our own niche offerings as well. We saw much bigger uptake of our simpler, more granular Web services APIs, such as email verification, address verification, and reverse address and telephone lookups offered as single, individual services, than of complex services with many different methods and capabilities. This proved true even when the more complex services provided more advanced power within a single API. In other words, simplified remote controls are probably still the best idea for maximum television adoption, as initial confusion and frustration tend to be inversely proportional to the adoption of any technology.

Another interesting point is that this is the fifth class of database product offering in Amazon's portfolio. Along with DynamoDB, there is still the aforementioned SimpleDB, a schemaless NoSQL offering for "smaller" datasets. There is also the original S3 offering, with a simple Web service interface for storing, retrieving, and deleting data objects in a straightforward key/value pair format. Next, there is Amazon RDS for managed, relational database capabilities that use traditional SQL for manipulating data and are more applicable to traditional applications. Finally, there are the various Amazon Machine Image (AMI) offerings on EC2 (Oracle, MySQL, etc.) for those who don't want a managed relational database and would rather have complete control over their instances (without having to run their own hardware) and the RDBMSs that run on them.
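To illustrate just how straightforward that S3 key/value interface is, here is a sketch, again using the later boto3 SDK purely for illustration (the bucket and key names are hypothetical):

    # S3 as a plain key/value store: put, get, delete.
    # Bucket and key names are hypothetical.
    import boto3

    s3 = boto3.client("s3")
    s3.put_object(Bucket="my-bucket", Key="users/u123", Body=b'{"name": "Alice"}')
    obj = s3.get_object(Bucket="my-bucket", Key="users/u123")
    print(obj["Body"].read())
    s3.delete_object(Bucket="my-bucket", Key="users/u123")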

This tells us that the world is far from a one-size-fits-all cloud database management system, and we can all expect to operate in hybrid storage environments that vary from application to application for quite some time to come. I suppose that's good news for those who make a living on the operations teams of information technology.

And along with each new database offering from Amazon comes a different business model. In the case of DynamoDB, for example, Amazon has introduced the concept of "read and write capacity units," where charges are based on a combination of usage frequency and physical data size. This suggests the business models are still somewhat far from optimal and will likely change again in the future; the major vendors are clearly still figuring it out, and business model adjustments in the Cloud are not limited to Amazon.

In summary, following Amazon's database release timeline over the years yields some interesting insights: speed and latency, reduction of complexity, the likelihood of hybrid compute and storage environments for some time to come, and ever-changing cloud business models are the primary themes as cloud vendors respond to the needs of their users. And as any innovator knows, the challenges are where the opportunities are.


Cloud Landscape: Cloud Databases Emerging Everywhere


2011 has been the year of the Cloud database. The idea of shared database resources and the abstraction of the underlying hardware seems to be catching on. Just as with Web and application servers, paying as you go and eliminating unused database resources, licenses, hardware, and all of the associated cost is proving to be an attractive enough business model that the major vendors are betting on it in significant ways.

The recent excitement has not been limited to the fanfare around "big data" technologies. Lately, most of the major announcements have centered on the traditional relational, table-driven SQL environments that Web applications use far more widely than the key-value pair storage mechanisms "NoSQL" technology provides for Web-scale, data-intensive applications such as Facebook and Netflix.

Here are some of the new Cloud database offerings for 2011:

Salesforce.com has launched Database.com, enabling developers in other Cloud server environments, such as Amazon's EC2 and the Google App Engine, to utilize its database resources, not just users of Salesforce's CRM and Force.com platforms. You can also build applications in PHP or on the Android platform and utilize Database.com resources. The idea is to reach a broader set of developers and application types than just CRM-centric applications.

At Oracle Open World a couple of weeks ago, Oracle announced the Oracle Database Cloud Service, a hosted database offering running Oracle's 11gR2 database platform available in a monthly subscription model, accessible either via JDBC or its own REST API.

Earlier this month, Google announced Google Cloud SQL, a database service that will be available as part of its App Engine offering based on MySQL, complete with a Web-based administration panel.

Amazon, to complement its other Cloud services and highly used EC2 infrastructure, has made the Amazon Relational Database Service (RDS) available to enable SQL capabilities from Cloud applications, giving you a choice of underlying database technology to use such as MySQL or Oracle. It is currently in beta.

Microsoft also offers its SQL Azure Cloud Database, generally positioned for applications built on the Microsoft stack by developers who want to leverage some of the benefits of the Cloud.

Driving Factors of Cloud Computing Only Becoming More Emphatic

One thing that's clear as we pass the halfway point of 2010 is that the Cloud Computing movement is not only gaining momentum, but the Web usage trends driving Cloud Computing are growing in influence and feeding that momentum at a faster pace than ever.

For example, Facebook's Chief Technology Officer reported last month that they were seeing as many as one million photos served up per second across the entirety of their Web-based social application, and that they expect this to increase ten-fold over the next twelve months.

Also, how many of us watched streaming World Cup soccer games over the past month as Spain proved supreme in South Africa? Or at least highlights on YouTube and various other video outlets? Currently, an estimated 50% of all Web traffic is video. That's not surprising, but with High Definition (HD) Web technology and the like emerging, video is expected to represent 90% of all traffic within just a few years. This will require bandwidth levels that were largely unthinkable only a few years ago.

On another front, mobile infrastructure is not keeping pace with demand. Right now, some estimates show mobile infrastructure requirements growing at about 50% per year, while actual mobile network capacity is growing at only 20% per year. This is going to be a real problem, and it is one reason some mobile carriers, such as AT&T, have begun capping usage and introducing fees for premium levels of bandwidth that were standard issue until now; other carriers will likely follow suit. In their eyes, it's the only way to curtail demand to meet capacity.
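To see how quickly that gap compounds, here is a quick back-of-the-envelope calculation using the growth estimates cited above (50% and 20% per year), with both curves starting from the same baseline:

    # Back-of-the-envelope: demand growing ~50%/year vs. capacity
    # growing ~20%/year (the estimates cited above), same starting point.
    demand, capacity = 1.0, 1.0
    for year in range(1, 6):
        demand *= 1.50
        capacity *= 1.20
        print(f"Year {year}: demand is {demand / capacity:.1f}x capacity")
    # By year 5, demand is roughly 3x capacity.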

So what does all of this mean?

One of the reasons we have Cloud Computing in the first place is that innovative Web companies such as Amazon and Google had to build out enough computing capacity to handle peak periods of Web traffic and activity, especially Amazon during its Christmas holiday crunch.

As a result, they found themselves not only experts at building out distributed computing capacity, load balancing, and data synchronization, but also sitting on computing power, invested in for peak periods, that sat "shelved" and idle most of the time, far from cost-optimized. This led them to think of ways to monetize the excess capacity (servers and disk space lying around idle), which drove some of the early thinking and innovation around Web-based centralized computing. The same is true of Google and others with their excess Web computing power, as they looked for ways to monetize large amounts of spare capacity and leverage their expertise at building out server farms and developing highly distributed yet high-performing computing.

This same necessity-is-the-mother-of-invention phenomenon is playing out now as Facebook develops new technology to serve up its millions of photos per second, spawning new data storage and retrieval technology such as the NoSQL paradigm shift, with non-SQL and "not only SQL" architectural approaches such as Cassandra, BigTable, Neptune, Google Fusion Tables, and Dynamo that are more finely tuned to the needs of Web-scale Cloud Computing.

In parallel, the bandwidth demands of video and mobile infrastructure are seeding new innovation around capacity and the distribution of bandwidth as well, including much more efficient and easier-to-implement elastic computing capabilities to handle these variable demands, as much of mobile's computing load moves to, and is answered via, the Web (which also makes smartphones ideal Cloud Computing clients, further pushing the paradigm).

Mind-boggling and exciting, these trends are also the cornerstones of a revolution already in progress. All of this demand-driven innovation is driving ever more build-out of the foundation from which the future Internet and "Cloud" will emerge. A few years from now, we will look back and see how the Web computing demands of today, whether from Facebook, Google, Twitter, or others, enabled a whole new generation of Web applications to emerge. And of course, huge amounts of data will have been gobbled up in the process, a lot of it coming from StrikeIron's own data delivery innovation in the Cloud.

No doubt about it, the Cloud is a good place to be.