The general premise of data warehousing hasn't changed much over the years. The idea is still to aggregate as much relevant data as possible from multiple sources, centralize it in a repository of some kind, catalog it, and then utilize it for reporting and analytics to make better business decisions. An effective data warehousing strategy seamlessly enables trend analysis, predictive analytics, forecasting, decision support, and just about anything else we now categorize under the umbrella of "data science."
As you may know, StrikeIron is an Informatica Cloud partner. We recently won another customer account that will be using the StrikeIron Contact Record Verification suite to clean their records as they move between Salesforce.com, a proprietary marketing database, and Eloqua via Informatica Cloud. To help this customer get started, we wanted to be able to run Informatica Cloud on a Mac as well as have a test platform that was remotely accessible from anywhere.
The update is a "batch write/update" capability, enabling multiple data items to be written or updated in a single API call. The idea is to reduce Internet latency by minimizing trips back and forth to Amazon's various physical data storage entities from the calling application. According to Amazon, this was in response to developer forum feedback requests.
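To make the round-trip savings concrete, here is a minimal sketch of the batching idea. It does not call DynamoDB at all: `FakeRemoteStore`, `write_one_by_one`, and `write_batched` are hypothetical names invented for illustration, and the 25-item cap per call is an assumption that should be checked against Amazon's documentation.

```python
from typing import Dict, Iterator, List, Tuple

Item = Tuple[str, str]

# Assumed cap on items per batch call; the real limit is set by the service.
MAX_ITEMS_PER_BATCH = 25


class FakeRemoteStore:
    """In-memory stand-in for a remote data store that counts API calls."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.calls = 0

    def put(self, key: str, value: str) -> None:
        # One network round trip per item.
        self.calls += 1
        self.data[key] = value

    def batch_put(self, items: List[Item]) -> None:
        # One network round trip for the whole batch.
        self.calls += 1
        self.data.update(items)


def chunks(items: List[Item], size: int) -> Iterator[List[Item]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_one_by_one(store: FakeRemoteStore, items: List[Item]) -> None:
    for key, value in items:
        store.put(key, value)


def write_batched(store: FakeRemoteStore, items: List[Item],
                  batch_size: int = MAX_ITEMS_PER_BATCH) -> None:
    for batch in chunks(items, batch_size):
        store.batch_put(batch)


if __name__ == "__main__":
    items = [(f"key-{i}", f"value-{i}") for i in range(60)]
    single, batched = FakeRemoteStore(), FakeRemoteStore()
    write_one_by_one(single, items)
    write_batched(batched, items)
    print(single.calls, batched.calls)
```

Both stores end up holding the same data. The only difference is how many trips were made to get it there, which is exactly the cost the batch capability is meant to reduce.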
This update to help address what was already an initial key selling point of DynamoDB tells us that latency is still a significant challenge for cloud-based storage. After all, one of the key attributes of DynamoDB when first launched was speed and performance consistency, something that their NoSQL precursor to DynamoDB, SimpleDB, was unable to deliver, at least according to some developers and users who claimed data retrieval response times ran unacceptably into the minutes. This also could have been a primary reason for SimpleDB's lower adoption rates. Amazon is well aware of these performance challenges, and hence the significance of its first DynamoDB update.
Another key tenet of DynamoDB is that it is a managed offering, meaning the details of data management requirements such as moving data from one distributed data store to another are completely abstracted away from the developer. This is great news, as the complexity of cloud environments was proving to be too challenging for many developers trying to leverage cloud storage capabilities. The masses were scratching their heads as to how to overcome storage performance bottlenecks, attain replication, achieve response latency consistency, and handle other operations-related data management challenges when it was in their purview to do so. By the way, management complexity will likely still be a major challenge for other NoSQL vendors that do not offer the same level of abstraction that DynamoDB offers, and there are many "big data" startups offering products in this category. It will be interesting to see if the launch of DynamoDB becomes a significant threat to many of these startups.
We learned this reduction of complexity lesson at StrikeIron within our own niche offerings as well. We gained a much bigger uptake of our simpler, more granular Web services APIs, such as email verification, address verification, and other products such as reverse address and telephone lookups as single, individual services, rather than complex services with many different methods and capabilities. This proved true even if the more complex services provided more advanced power within a single API. In other words, simplified remote controls for television sets are probably still the best idea for maximum television adoption, as initial confusion and frustration tend to be inversely proportional to the adoption of any technology.
Another interesting point is that this is the fifth class of database product offerings in Amazon's portfolio. Along with DynamoDB, there is also still the aforementioned SimpleDB, a schemaless NoSQL offering for "smaller" datasets. There is also the original S3 offering with a simple Web service based interface for storing, retrieving, and deleting data objects in a straightforward key/value pair format. Next, there is Amazon RDS for managed, relational database capabilities that utilize traditional SQL for manipulating data and are more applicable for traditional applications. Finally, there are the various Amazon Machine Image (AMI) offerings on EC2 (Oracle, MySQL, etc.) for those who don't want a managed relational database and would rather have complete control over their instances and the RDBMSs that run on them (without having to utilize their own hardware).
This tells us that the world is far from one-size-fits-all cloud database management systems, and we can all expect to be operating in hybrid storage environments that will vary from application to application for quite some time to come. I suppose that's good news for those who make a living on the operations teams of information technology.
And along with each new database offering from Amazon also comes a different business model. In the case of DynamoDB for example, Amazon has introduced the concept of "read and write capacity units", where charges will be based on the combination of frequency of usage and physical data size. This demonstrates that the business models are still somewhat far from optimal, and will likely change again in the future. Clearly the major vendors are still trying to figure it all out, and business model adjustments in the Cloud are not limited to Amazon.
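As a rough illustration of how frequency of usage and data size might combine into a single billable number, here is a small sketch. The function name `capacity_units` and the formula are assumptions made for illustration, as is the 1 KB unit size. Amazon's documentation defines the real rules and any pricing.

```python
import math


def capacity_units(ops_per_second: int, item_size_kb: float,
                   unit_size_kb: float = 1.0) -> int:
    """Estimate the capacity units needed for a steady workload.

    Assumption: one unit covers one operation per second on an item of up
    to `unit_size_kb`; larger items consume one unit per started block.
    """
    if ops_per_second < 0 or item_size_kb < 0 or unit_size_kb <= 0:
        raise ValueError("arguments must be non-negative, unit size positive")
    blocks_per_item = math.ceil(item_size_kb / unit_size_kb)
    return ops_per_second * blocks_per_item


if __name__ == "__main__":
    # 100 writes per second of 2.5 KB items under the assumed 1 KB unit size.
    print(capacity_units(100, 2.5))
```

The point of the sketch is only that the same request rate costs more as items grow, which is what makes this model different from paying for storage alone.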
In summary, following the Amazon database release timeline over the years yields some interesting information, namely that speed/latency, reduction of complexity, the likelihood of hybrid compute and storage environments for some time to come, and ever-changing cloud business models are the primary focus of cloud vendors responding to the needs of their users. And as any innovator knows, the challenges are where the opportunities are.
As the "Cloud" has evolved and matured from its roots the past few years, the alternatives for deploying a cloud-based solution have been almost entirely proprietary and commercial. They typically have required at least a credit card to even get started "renting" servers and storage that might be needed for only short periods of time and to achieve more flexible scalability models. With the success and momentum of OpenStack, an open source cloud operating system for deploying, hosting, and managing public and private clouds within a data center, this appears to be changing.
The OpenStack project, launched initially with code contributions from Rackspace and NASA, provides the software components for making cloud management functionality available from within any data center, including one's own, similar to what Amazon, VMware, Microsoft and other cloud vendors are now offering commercially. Deploying OpenStack enables cloud-based applications and systems utilizing virtual capacity to be launched without the associated run-time fees the current slate of vendors require, as all of the software is freely distributable and accessible.
At first glance, this seems to be an ideal solution for larger enterprise IT organizations to offer up traditional cloud functionality, such as virtual servers and storage availability, to their constituents within the organization without the fear of vendor lock-in and ever-increasing vendor costs. This approach also provides for access to implementation details and the ability to customize based on specialized needs - also important in many scenarios and something not typically or easily offered by the larger commercial vendors. So the benefits to the private cloud space to those who find it appropriate to build and manage their own cloud environments are clear.
However, Rackspace itself just announced making public cloud services available using OpenStack, and others are likely to follow in the not-too-distant future, leveraging community-developed innovation in the areas of scalability, performance, and high availability that might ultimately be difficult for any single proprietary vendor to match. This should enable public service providers, especially in niche markets, to proliferate as well.
Major high tech vendors are also backing and aligning with OpenStack. In addition to Rackspace and NASA, Deutsche Telekom, AT&T, IBM, Dell, Cisco, and Red Hat all have much to gain from the success of OpenStack and have signed on as partners, code contributors, and sources of funding. Commercial distributions such as StackOps have already emerged. Funding for OpenStack-oriented companies has begun to arrive from the venture community, and events such as the OpenStack Design Summit and Conference this week in San Francisco are getting larger and selling out quickly.
All of the foundational pieces are in place for OpenStack to have quite a run towards achieving its goal of becoming the universal cloud platform of the future and the leader of the "open era" of the Cloud. This is an exciting development for companies like StrikeIron and our cloud-based data-as-a-service and real-time customer data validation offerings, as the data layer of the Cloud will become even more promising and fertile as OpenStack continues to accelerate organizations towards easier adoption of cloud computing models and all of their benefits.
Much of cloud computing terminology is based on the notion of ‘as a Service’ (or ‘aaS’).
In a report last week, the Open Data Center Alliance published that its members plan to triple Cloud deployments in the next two years according to a recent membership survey. This significantly outpaces the adoption forecasts from several different analyst firms and is another indicator where the I.T. industry is headed.
Of course, there are different ways to measure Cloud adoption, and while adoption rates may always be debated, there is little question of the Cloud's growing significance in I.T. Even though some Cloud forecasts combine Infrastructure-as-a-Service (IAAS) with Software-as-a-Service (SAAS) and others keep them separate, in either case the trend is upward.
So here are four primary reasons why this trend is occurring and likely to continue for a long time to come:
- Cost. When deploying to the Cloud, one only has to deploy the needed I.T. resources at any given time. Capacity can be added or reduced as needed and whenever necessary. With this cost-savings "elastic" approach, usage spikes can be handled as well as increased resource demand over time. It's the difference between renting a server by-the-minute versus committing to two-year contracts with a data center provider at maximum capacity requirements. The latter, traditional approach front-loads application costs and requires significant capital expenditure. These heavy up-front costs go away in pay-for-what-you-use Cloud scenarios, including the ability to get things up and running more cheaply. Many startups deploying to the Cloud are spending less money on hardware and software investments than just a few years ago and getting up and running faster.
- Abstraction. Cloud deployments hide the details of the hardware, bandwidth resourcing, underlying software, load management, and ongoing maintenance of the given platform. This frees up resources to focus on one's own business rather than endless architecture meetings and decisions - unnecessary for a large majority of applications. This is why Salesforce.com has found success. Customers no longer have to deal with software upgrades for sales people, database choices, syncing data from laptops to servers, hardware deployment decisions, etc. It's just easier in a Cloud SAAS model.
- Innovation. An organization can leverage the innovation and expertise of those who specialize in a given Cloud-based platform such as within data-as-a-service offerings like StrikeIron provides. This continual innovation can be leveraged as a Cloud platform becomes more advanced without any effort of the organization's own resources. The platform improves daily, and these incremental improvements are put to use immediately for the benefit of customers and without company-wide software upgrades and rollouts. Instead, it's built-in and essentially automatic with the Cloud model. Another example is Amazon's EC2, where an increasing number of new features and capabilities can be leveraged without application redeployment.
- Platform Independence. When deploying to the Cloud, many different types of devices and clients can leverage the application via APIs or other interfaces, from PCs, tablets, smart phones, and other systems, as all communication between machines is via the ubiquitous Web, available just about any time anywhere. This makes interoperability easier, and extensive "middleware" investments of the past to make things work together can be dramatically reduced. This is one of the primary reasons why tablets such as the iPad for example have grown considerably in adoption now versus ten years ago – they work with the Cloud and can access a broad array of useful applications from just about anywhere.
These benefits of the Cloud aren't going away, and this is why the adoption trend is accelerating upward.
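The cost argument in the list above can be sketched with some simple arithmetic. Every number here is hypothetical, and the sketch assumes the same hourly rate for both models, which real providers may not offer. The function names `fixed_cost` and `elastic_cost` are made up for illustration.

```python
from typing import Sequence


def fixed_cost(peak_servers: int, hourly_rate: float, hours: int) -> float:
    """Cost of keeping peak capacity running for the whole period."""
    return peak_servers * hourly_rate * hours


def elastic_cost(servers_per_hour: Sequence[int], hourly_rate: float) -> float:
    """Cost of paying only for the servers actually running each hour."""
    return sum(servers_per_hour) * hourly_rate


if __name__ == "__main__":
    hours_in_month = 720
    spike_hours = 72  # hypothetical: traffic spikes 10% of the time
    # 10 servers during spikes, 2 servers the rest of the month.
    usage = [10] * spike_hours + [2] * (hours_in_month - spike_hours)
    rate = 0.50  # hypothetical dollars per server-hour

    print(fixed_cost(10, rate, hours_in_month))  # provisioned for the peak
    print(elastic_cost(usage, rate))             # pay for what you use
```

Under these made-up numbers the peak-provisioned month costs 3600.0 and the elastic month costs 1008.0. The gap widens the spikier the workload is, which is why the elastic model tends to appeal most to applications with uneven demand.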
2011 has been the year of the Cloud database. The idea of shared database resources and the abstraction of underlying hardware seems to be catching on. Just like Web and application servers, paying-as-you-go and eliminating unused database resources, licenses, hardware, and all of the associated cost is proving to have attractive enough business models that the major vendors are betting on it in significant ways.
The recent excitement has not been limited to just the fanfare around "big data" technologies. Lately, most of the major announcements have centered on the traditional relational, table-driven SQL environments that Web applications use much more widely than the key-value pair data storage mechanisms that "NoSQL" technology provides for Web-scale, data-intensive applications such as Facebook, Netflix, etc.
Here are some of the new Cloud database offerings for 2011:
Salesforce.com has launched Database.com, enabling developers in other Cloud server environments such as Amazon's EC2 and the Google App Engine to utilize its database resources, not just users of Salesforce's CRM and Force.com platforms. You can also build applications in PHP or on the Android platform and utilize Database.com resources. The idea is to reach a broader set of developers and application types than just CRM-centric applications.
At Oracle Open World a couple of weeks ago, Oracle announced the Oracle Database Cloud Service, a hosted database offering running Oracle's 11gR2 database platform available in a monthly subscription model, accessible either via JDBC or its own REST API.
Earlier this month, Google announced Google Cloud SQL, a database service that will be available as part of its App Engine offering based on MySQL, complete with a Web-based administration panel.
Amazon, to complement its other Cloud services and highly used EC2 infrastructure, has made the Amazon Relational Database Service (RDS) available to enable SQL capabilities from Cloud applications, giving you a choice of underlying database technology to use such as MySQL or Oracle. It is currently in beta.
Microsoft also has its SQL Azure Cloud Database offering available in the Cloud, generally positioned as suited for applications that use the Microsoft stack for developers that will want to leverage some of the benefits of the Cloud.
I attended the Data 2.0 Conference this week in San Francisco. There is a lot to be excited about in this emerging, growing and quickly-accelerating industry. However there are still some significant obstacles that have to be overcome for the vision of the data-driven world and the “great data highway in the sky” to truly be realized.
Amazon's new SES (Simple Email Service) product is a scalable, transaction-based offering for programmatically sending large amounts of email. This is accomplished using Amazon's Web-scale architecture, most especially for applications that already use EC2 (server rental) and S3 (storage rental). By utilizing SES you are essentially leveraging the "Cloud" to send emails from applications and Web sites rather than investing in your own software and hardware infrastructure to do so. This process substantially reduces cost and complexity as do most Cloud services and in this case requires only a simple API call. There is no network configuration or email server setup required in this process.
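To give a feel for what "a simple API call" looks like, here is a sketch that builds the parameters for an SES send-email request. The addresses are placeholders and `build_send_email_request` is a hypothetical helper. The sketch only assembles the request and sends nothing, so it runs without an AWS account.

```python
from typing import Any, Dict, Sequence


def build_send_email_request(sender: str, recipients: Sequence[str],
                             subject: str, body_text: str) -> Dict[str, Any]:
    """Assemble the parameters for a plain-text SES SendEmail request."""
    if not recipients:
        raise ValueError("at least one recipient is required")
    return {
        "Source": sender,
        "Destination": {"ToAddresses": list(recipients)},
        "Message": {
            "Subject": {"Data": subject},
            "Body": {"Text": {"Data": body_text}},
        },
    }


if __name__ == "__main__":
    request = build_send_email_request(
        sender="noreply@example.com",       # placeholder address
        recipients=["customer@example.com"],  # placeholder address
        subject="Your order has shipped",
        body_text="Thanks for your purchase.",
    )
    # With the boto3 SDK installed and AWS credentials configured, the
    # request could be sent with:
    #   boto3.client("ses").send_email(**request)
    print(request)
```

There is no mail server to configure anywhere in that flow, which is the whole appeal.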
The "Cloud" has been seeing a lot of momentum this past year, and one place where that is readily apparent is in the stock price of companies making major strategic investments in Cloud technology and associated offerings, as well as aggressive go-to-market plans with those offerings.
Most vendors, not surprisingly, line up behind the approach that best suits their product offerings.
For example, SAAS vendors (Salesforce, NetSuite, SuccessFactors) say that multi-tenant applications are the Cloud, arguing that a business solution with shared, multi-tenant software resources, including databases, is needed to truly make the Cloud useful. Yet many of these vendors are often criticized for not providing "open" models, so some long-term questions still remain. Yes, these Clouds are easy to get into, but how do you get out of them if necessary?
The infrastructure-as-a-service crowd (Amazon's EC2, Google App Engine, Rackspace) will suggest that only infrastructure is the "true" Cloud, meaning essentially renting clean servers by the minute and storage by the byte represent the original "open" Cloud vision, enabling applications to be moved from Cloud to Cloud without difficulty. However, this is just servers and storage in the end (at least for now), so the user still has to build everything themselves. Ok for some, not entirely useful for most.
And of course the enterprise software folks (Oracle, SAP, IBM) often claim that the Cloud can and should be "Private" because it's a better security model and enables you to manage it within the organization. This enables them to capitalize on the hype of the Cloud without having to change too much of their actual offerings. Of course the challenge with this model is that not sharing licenses or hardware across organizations becomes quite expensive, and quite frankly we have had this model before under other names such as "mainframe", "client-server" and other "in-house" architectures. Sure, there is some incremental innovation and usefulness, but it's not too much different than what has always been offered, just another iteration.
So while there are valid use cases for each of the above scenarios, there is one thing I want to point out with Public versus Private Cloud discussions when businesses are unsure which route to go. It goes all the way back to the birth of the Cloud as a concept itself.
The reason we even have the Cloud in the first place is that heavily-trafficked Web sites such as Google and Amazon found they had to build massive, high performance, scalable systems to be able to handle the processing load at peak times (Amazon at Christmas for example). This meant that during non-peak times, they found themselves with lots of excess, unused computing capacity.
This of course spawned the idea that they could leverage this excess capacity, as well as their expertise in managing high-performance, distributed, "Web scale" computing technology as an additional line of revenue, and possibly launching a brand new industry of opportunities. Hence, the Cloud was born.
The one key piece of this Cloud concept is "expertise". This is something that you get in Public Cloud environments that you don't get in Private Clouds. With Private Clouds, you get all of the hardware and software (and the corresponding purchased licenses) that you need, but you don't have a team of experts that have been running that platform for years monitoring, managing, and supporting that platform in real-time while you use it, including having visibility into it as it runs. By definition you therefore don't have engineers supporting the success of your application systems on a minute-by-minute basis.
This real-time team of experts, and their associated expertise developed over time, is something you get inherently in the Public Cloud scenario. The folks who run these systems have as their core mission in life to keep the platform up and running, battle test it over time, improve it, enhance it, test it, analyze operational data, review performance charts, improve and enhance it again, and on and on, day after day.
Although a bit overused, the electric generator is a good example of demonstrating the difference. If you have your own electrical generators powering your home, it doesn't matter that thousands of other people have one just like it in their homes. If it goes down, you are on your own, and it's your responsibility to keep the electricity flowing from room to room. But if you plug into the electric grid run by your local power company, and there is an outage while you are having dinner somewhere, likely it will be fixed before you even get home from the restaurant. And you might not even notice there was a problem since you weren't at home (you were out dining in the "Dinner Cloud" and outsourcing the washing of dishes). This is because the system was monitored, a problem was detected, and a team was ready to spring into action once the outage occurred.
How long would it have taken to call the generator repairman to get him scheduled to come out with a power outage in your own generator? There's a reason electricity grids have evolved the way they have.
Oh, and all of the innovation occurring behind the scenes at the power company on a day-to-day basis? It comes to you automatically, often while you sleep, as opposed to a new giant chunk of hardware arriving every 18-24 months that you have to figure out how to configure and get up and running again.
So how is this relevant to StrikeIron?
Well, the same is also true in our case. While we are more the Software-as-a-Service variety of Cloud Computing (and in our case "data-as-a-service"), we recognize that users have a choice in the way to obtain the type of functionality we offer. A lot of the powerful capabilities we have in our Cloud-managed Contact Record Verification Suite, such as real-time telephone, address, and email verification, could also be purchased and brought in-house as software applications and raw data sources, and a similar result could be achieved in terms of better, more usable customer data assets. The approach would just be a heck of a lot different.
In the latter scenario, all of the verification reference data would have to be managed and maintained internally. One would have to acquire the software and data files, and then get the functionality up and running. It would then have to be designed and delivered in such a way to be able to handle the various loads of data verification that might appear from different applications at different times, and often in high volume scenarios. Also, all of the other expertise around availability, testing, updating, and the usual effort associated with in-house solutions would have to be developed internally.
With us, all we do day in and day out is focus on verifying and delivering our real-time data verification capabilities to thousands of applications simultaneously with a very high level of performance at all times, delivering 24x7x365. All you need to do, just like the electric company, is plug into us. All of the data management, updating, software maintenance, and performance testing and improving is done by us, with all of the heavy lifting abstracted from you.
Since we launched our system in 2005, we have constantly improved our finely-tuned delivery and fault-tolerant capabilities, including load-balancing, high speed data I/O, redundancy, external monitoring, and everything else we have to provide to be able to support our customers and their production applications. And we are getting smarter and better about how we go about it every day. This expertise is something that each and every one of our customers gets to leverage with every single call to our system. This is why we have only had minutes of downtime over the last four years.
So could in-house solutions provide the same end result? Maybe in the sense that yes you could end up with good clean customer data somehow on your own. But at what cost, effort, and with what missed opportunities? Focus on your core business, and leave the external data verification effort to us. We will keep the lights on. Guaranteed.
Informatica's Cloud Data Integration Solutions enable the management of data from the cloud or within the cloud for a broad range of applications and use cases.
For example, data can be moved using Informatica's Cloud Data Integration solution from legacy systems during an initial implementation of Salesforce.com's sales force automation product so an organization can leverage existing data when they launch Salesforce.com internally. Also, it can be used to import lead data from a broad range of file formats into Salesforce.com for existing implementations, greatly simplifying this process.
Also, data can be synchronized between other applications such as Oracle's eBusiness Suite, Microsoft Dynamics, SAP, SAS, JD Edwards, Siebel, PeopleSoft, Pivotal and other enterprise applications, cloud-based applications and enterprise data sources, as well as many different databases and data formats. It can even be integrated into Amazon's EC2 cloud platform. This is achieved via "adapters" that Informatica has developed over time, many of which come from their PowerCenter solution. These adapters enable the moving of data assets between and/or into these different applications that are present within an organization.
Use cases include data/back office synchronization, data importing and exporting, customer master synchronization (MDM), CRM integration, data replication, data archiving, and legacy application retirement to name a few.
Now, as all this data is being moved around between applications and sources, this is an ideal time to ensure validity of the data, as well as ensuring it is accurate, standardized, complete, and current. This is where StrikeIron comes in with its Contact Record Verification Suite, validating, correcting, and enhancing addresses, phone numbers, email addresses, and other geographic information. This ensures that the best possible data is running at the core of all of an organization's business processes.
StrikeIron's Contact Record Verification Suite is essentially a set of cloud-based plug-ins that Informatica has integrated into its platform as another "adapter" or "plug-in" that can sit in the middle of data moving between applications to or in the cloud, including multiple data formats and databases.
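The "sit in the middle" idea can be sketched as a verification step that records pass through on their way from a source to a target. This is not StrikeIron's or Informatica's API. `normalize_us_phone` is a hypothetical, purely local format check standing in for a real verification service, which would consult reference data rather than just count digits.

```python
import re
from typing import Callable, Dict, Iterable, Iterator, Optional

Record = Dict[str, str]


def normalize_us_phone(raw: str) -> Optional[str]:
    """Hypothetical stand-in check: accept 10-digit US numbers only."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the country code
    if len(digits) != 10:
        return None
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


def verify_in_flight(
    records: Iterable[Record],
    check: Callable[[str], Optional[str]] = normalize_us_phone,
) -> Iterator[Record]:
    """Yield each record with its phone standardized and flagged."""
    for record in records:
        cleaned = check(record.get("phone", ""))
        out = dict(record)  # leave the source record untouched
        out["phone_valid"] = "true" if cleaned else "false"
        if cleaned:
            out["phone"] = cleaned
        yield out


if __name__ == "__main__":
    leads = [
        {"name": "Pat Example", "phone": "919-555-0100"},
        {"name": "Sam Example", "phone": "12345"},
    ]
    for lead in verify_in_flight(leads):
        print(lead)  # in a real flow, this is where the target gets loaded
```

Because the step is just a function applied to records as they stream by, it does not care whether the source is a spreadsheet or a database, or which application the data lands in.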
In this particular video, the phone number validation component is being used as part of the Informatica Cloud Data Integration Solution to validate phone numbers and add other geographic information such as zip code and time zone as lead data is being moved into Salesforce.com from an Excel spreadsheet.
For example, Facebook's Chief Technical Officer reported last month that they were seeing as many as one million photos being served up per second through the entirety of their Web-based social application, and that they expect this to increase ten-fold over the next twelve months.
Also, how many of us watched some streaming World Cup soccer games over the past month as Spain proved supreme in South Africa? Or at least highlights on YouTube and various other video outlets? Currently, it is estimated that 50% of all Web traffic is video. That's not surprising, but with High Definition (HD) Web technology and the like emerging, video is expected to represent 90% of all traffic in just a few years. This is going to require bandwidth levels that were largely unthinkable years ago.
On another front, mobile infrastructure is not keeping pace with demand. Right now, some estimates have shown mobile infrastructure requirements growing at about 50% per year, while actual mobile network infrastructure capacity is only growing at 20% per year. This is going to be a real problem, and one of the reasons some mobile carriers such as AT&T have begun capping usage and introducing fees for premium levels of bandwidth that were standard issue up until now, and other carriers may likely follow suit. It's the only way to help curtail demand to meet capacity in their eyes.
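The compounding effect of those two growth rates is easy to underestimate, so here is a quick calculation. It assumes demand and capacity start out equal today, which is an assumption made only to keep the arithmetic simple. The function name is made up for illustration.

```python
def demand_to_capacity_ratio(years: int, demand_growth: float = 0.50,
                             capacity_growth: float = 0.20) -> float:
    """Ratio of demand to capacity after `years` of compound growth.

    Assumes the ratio is 1.0 at year zero.
    """
    if years < 0:
        raise ValueError("years must be non-negative")
    return ((1 + demand_growth) / (1 + capacity_growth)) ** years


if __name__ == "__main__":
    for year in range(1, 6):
        print(year, round(demand_to_capacity_ratio(year), 2))
```

At the growth rates quoted above, demand pulls ahead of capacity by 25% each year, so after three years demand would be nearly double what the network can carry.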
So what does all of this mean?
One of the reasons we have Cloud Computing in the first place is that innovative Web companies such as Amazon and Google had to build out enough computing capacity to handle peak periods of Web traffic and activity, especially Amazon during its Christmas holiday crunch.
As a result, they found themselves not only experts at building out distributed computing capacity, load-balancing, and data synchronization, but also found that most of the time they had all sorts of computing power that they had invested in for peak periods "shelved" and not in use, far from cost-optimized. This led them to think of ways to monetize this excess capacity (servers and disk space lying around idle) and led to some of the early thinking and innovation around Web-based centralized computing. The same is true with Google and others with all of their excess Web computing power, as they looked for ways to monetize large, excess amounts of capacity and leverage their expertise at building out server farms and developing highly-distributed, yet high performing levels of computing.
This same necessity-is-the-mother-of-invention phenomenon is playing out now as Facebook develops new technology to serve up its millions of photos per second, and is spawning new data storage and retrieval technology such as the NoSQL paradigm shift, with new non-SQL and "not only" SQL architecture approaches such as Cassandra, BigTable, Neptune, Google Fusion Tables, and Dynamo that are more finely tuned to the needs of Web-scale Cloud Computing.
In parallel, the bandwidth demands of video and mobile infrastructure are seeding new innovation around capacity and distribution of bandwidth as well, including much more efficient and easier to implement elastic computing capabilities to handle these variable bandwidth demands as much of mobile's required computing requirements are moved to and answered via the Web (and this also makes SmartPhones ideal Cloud Computing clients, also pushing the paradigm).
Beyond being mind-boggling and exciting, these trends are the cornerstones of a revolution already in progress. All of this demand-driven innovation is only causing more and more build-out of the foundation from which the future Internet and "Cloud" will emerge. A few years from now, we will look back and see how the Web computing demands of today, whether from Facebook, Google, Twitter, or others, enabled a whole new generation of Web applications to emerge. And of course, huge amounts of data will have been gobbled up in the process, a lot of which will have come from StrikeIron's own data delivery innovation in the Cloud.
No doubt about it, the Cloud is a good place to be.
The reasons for this growth are the advantages that cloud computing provides, including faster deployment, smoother scalability, pay-for-what-you-use business models, and no capital expenditure on the hardware and software that make up the architecture. Amazon, Microsoft, IBM, Google, Opsource, and Rackspace are all companies offering public cloud infrastructure for rent. A myriad of vendors such as RightScale have lined up to add layers of capabilities on top of these offerings, and the ecosystems that can take advantage of these architectures, such as StrikeIron's, are continuing to invest in the space as well. Unfortunately Sun's promising efforts in this space have been discontinued by Oracle for one reason or another.
This public computing resource trend has been great for startups because new companies can launch on cloud infrastructure "virtually" overnight, without the traditional costs tied to software, hardware, and the management of those resources, which traditionally has required them to seek and spend time on obtaining private funding. Reducing startup "start friction" has in turn created a bubbling sea of innovation as of late.
However, there has been more reluctance in the enterprise space to move to the "Cloud" because of worries about security and losing control when utilizing these public resources. There are just some highly-valued sets of data and mission-critical business processes that many organizations just don't want to put in the hands of a third party.
As a result, many of these companies are now building out their own "private cloud" infrastructure that mirrors the public clouds in functionality. This "member-only" infrastructure can then be shared across business units and geographies in an effort to eliminate IT redundancy, reduce costs, and increase efficiency, just as public clouds do for the masses.
Because of this trend, many of the cloud infrastructure providers are now offering virtual private capabilities. For example, Amazon's Virtual Private Cloud (Amazon VPC) is an effort to provide a "hybrid" solution for enterprises building out a private cloud where some public computing resources can be utilized where it makes sense to do so.
What's still not clear though is what actual separation of data really occurs on the public cloud servers, leading some to regard the concept as an exercise in marketing, at least so far. However, the enterprise market for cloud computing is potentially huge, so I am expecting a lot more to occur in this space.
There definitely are solid cases to be made for both public and private clouds (as well as hybrid solutions), so my guess is these two will co-exist for quite some time, and the line as to what separates the two will be somewhat blurred (as usual). The end result will be that whatever route or combination of routes companies employ in the new age of the Cloud, these efforts will leave more resources available for actual innovation rather than infrastructure management and repetitive IT exercises, and that can only be good for us all, right?