I attended the Data 2.0 Conference this week in San Francisco. There is a lot to be excited about in this emerging, growing and quickly-accelerating industry. However, there are still some significant obstacles that have to be overcome for the vision of the data-driven world and the "great data highway in the sky" to truly be realized.
First, there is the exciting stuff. New companies continue to emerge and grow in the space across multiple categories, including broad data sharing sites (FluidInfo, InfoChimps), purveyors of proprietary and hard-to-capture data (Navteq, Metamarkets), API infrastructure providers (WebServius, Apigee, Mashery), specific data category providers (SimpleGeo, Rapleaf, Socrata, DataSift), providers of API-based solutions (StrikeIron, Xignite) and slick data visualization tools (too numerous to list).
Also, the companies that have been in this space for five or more years are becoming larger and more sophisticated, and in many cases are continuing to raise significant amounts of capital from investors looking to capitalize on the data megatrend. Even stalwarts such as Microsoft, with its DataMarket, are eyeing future fruitful harvests in this space. Twitter also announced at the event the commercial licensing of its entire "fire hose" of Tweets, a move that providers of analytics tools are hailing.
More and more public, government-sourced data is coming online every day as fodder for this machine-to-machine information feeding frenzy. This data is coming from every level of government: cities such as San Francisco (datasf.org), state governments such as Oregon (data.oregon.gov, announced two weeks ago), and the federal government's data.gov initiative (rumors that its funding might be cut are false, according to keynote speaker Vivek Wadhwa, the charismatic and sometimes controversial industry insider).
All of these government-sourced data assets are being made available to the general public in the hope of engaging civic creativity to innovate and create public value without the traditional budgetary costs. This has already led to a proliferation of applications, from live online maps of San Francisco municipal transportation schedules (including on the iPhone and Android platforms) to sites like Chicago's citypayments.org that put municipal vendor contracts online for public discussion (it's amazing how many vendors that consistently come in late and over budget are awarded new contracts again and again).
Finally, the platforms (the fertile Cloud and all that is happening there with Google App Engine, Microsoft's Azure, and Amazon's various cloud offerings) and especially the devices (like smartphones and the iPad) that can make use of this data are also marching forward at a breathtaking pace.
This is all very exciting for those of us for whom data represents a livelihood.
However, the significant challenge around the accessibility and usability of these vast seas of data is that it remains largely a complex, IT-oriented developer's world. Most access to these data sources is via API, or in structures and formats as varied as the data itself. This limits the applicability of these valuable data sources to a very small group of dedicated engineers and leaves the rest of us with only a modicum of the space's true potential.
Sure, there are API protocol standards such as REST and SOAP, but these only scratch the surface. Most single-API vendors introduce a new set of behaviors, data structures, response codes and a new business model with each new API. This adds greatly to the complexity for anyone looking to put these data sources to use, both initially and on an ongoing basis. Until we can find a way to normalize the great data highway, it's going to be a bumpy road. Those who know me know I've been preaching this for years and have applied much of it to StrikeIron's various data and API offerings; however, it can be a difficult proposition to get adopted across the industry.
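To make the point concrete, here is a minimal sketch in Python; the vendor payloads, field names, and status conventions are entirely hypothetical:

```python
# Hypothetical sketch: two vendors report the same fact (an address
# verification) with different field names, nesting, and status
# conventions, so each API needs its own adapter before the data is usable.

def normalize_vendor_a(payload: dict) -> dict:
    # Vendor A: flat JSON with an HTTP-style numeric status code.
    return {
        "valid": payload["status"] == 200,
        "street": payload["addr_line_1"],
        "city": payload["city_name"],
    }

def normalize_vendor_b(payload: dict) -> dict:
    # Vendor B: nested JSON with a string status flag. Same data, new adapter.
    result = payload["response"]["result"]
    return {
        "valid": payload["response"]["ok"] == "true",
        "street": result["street"],
        "city": result["city"],
    }

a = normalize_vendor_a({"status": 200, "addr_line_1": "1 Main St",
                        "city_name": "Raleigh"})
b = normalize_vendor_b({"response": {"ok": "true",
                                     "result": {"street": "1 Main St",
                                                "city": "Raleigh"}}})
assert a == b  # identical facts, but only after per-vendor adapter code
```

Multiply that adapter by every vendor, each with its own response codes and business model, and the integration burden becomes clear.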
The consumption complexity issue is demonstrated by the term "mashup". Several years ago, this was the term du jour of an industry claiming that non-developers could combine datasets in interesting and exciting new ways without the assistance of their IT organizations. The term, however, has all but shriveled and died. Why? Because non-developers couldn't do it. The tools were cumbersome and complex, representing whole new learning curves that most people simply don't have the time or patience for, and the datasets themselves lacked standards. In fact, I never heard the term mentioned a single time at the event. A few years back, it would have been in nearly every discussion. May the term "mashup" rest in peace.
Until we clear this hurdle of data consumption complexity, and as long as vendors in the space merely pay lip service to these challenges (as was prevalent on several of the panels at the event), the data-driven world will be only a shell of what it could be.
One thing that's clear as we pass the halfway point of 2010 is that the Cloud Computing movement is not only gaining momentum, but the Web usage trends driving it are increasing in influence and feeding that momentum faster than ever.
For example, Facebook's Chief Technical Officer reported last month that they were seeing as many as one million photos being served up per second across the entirety of their Web-based social application, and that they expect this to increase ten-fold over the next twelve months.
Also, how many of us watched some streaming World Cup soccer games over the past month as Spain proved supreme in South Africa? Or at least highlights on YouTube and various other video outlets? Currently, an estimated 50% of all Web traffic is video. That's not surprising, but with High Definition (HD) Web technology and the like emerging, video is expected to represent 90% of all traffic within just a few years. This will require bandwidth levels that were largely unthinkable only a few years ago.
On another front, mobile infrastructure is not keeping pace with demand. Some estimates put mobile bandwidth demand growth at about 50% per year, while actual mobile network capacity is growing at only 20% per year. This is going to be a real problem, and it is one reason carriers such as AT&T have begun capping usage and charging premium fees for levels of bandwidth that were standard issue until now; other carriers will likely follow suit. In their eyes, it's the only way to curtail demand to meet capacity.
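A quick back-of-the-envelope compounding of those rough growth figures shows how fast the gap widens:

```python
# Back-of-the-envelope compounding of the rough figures above: demand
# growing ~50% per year against capacity growing ~20% per year.
demand, capacity = 1.0, 1.0
for year in range(1, 6):
    demand *= 1.50
    capacity *= 1.20
    print(f"Year {year}: demand is {demand / capacity:.2f}x capacity")
# After five years, demand outstrips capacity by roughly 3x, which is
# why carriers are reaching for usage caps and tiered pricing.
```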
So what does all of this mean?
One of the reasons we have Cloud Computing in the first place is that innovative Web companies such as Amazon and Google had to build out enough computing capacity to handle peak periods of Web traffic and activity, especially Amazon during its Christmas holiday crunch.
As a result, they not only became experts at building out distributed computing capacity, load balancing, and data synchronization, but also found that, most of the time, the computing power they had invested in for peak periods sat "shelved" and idle, far from cost-optimized. This led them to look for ways to monetize that excess capacity (servers and disk space lying around idle), which in turn led to some of the early thinking and innovation around Web-based centralized computing. Google and others did the same with their excess Web computing power, seeking ways to monetize large amounts of idle capacity while leveraging their expertise at building out server farms and delivering highly distributed yet high-performing computing.
This same necessity-is-the-mother-of-invention phenomenon is playing out now as Facebook develops new technology to serve up its millions of photos per second, spawning new data storage and retrieval technology such as the NoSQL paradigm shift, with non-SQL and "not only SQL" architectures such as Cassandra, BigTable, Neptune, Google Fusion Tables, and Dynamo that are more finely tuned to the needs of Web-scale Cloud Computing.
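For a flavor of the shift, here is a minimal sketch using a plain Python dict as a stand-in for a Cassandra- or BigTable-style wide-column store; the row keys and photo URLs are invented for illustration, and this is not any vendor's actual API:

```python
from collections import defaultdict

# Minimal sketch of the NoSQL idea: data is denormalized and keyed for
# the exact read path, trading joins and ad-hoc queries for predictable,
# horizontally scalable key lookups.
store = defaultdict(dict)  # row key -> {column name: value}

def put(row_key, column, value):
    store[row_key][column] = value

def get_row(row_key):
    return store[row_key]  # one key lookup serves the whole read path

# Photo URLs are stored under the owning user's row key, so
# "all photos for user 42" is a single lookup rather than a join.
put("user:42:photos", "photo:1", "http://cdn.example/p1.jpg")
put("user:42:photos", "photo:2", "http://cdn.example/p2.jpg")
print(get_row("user:42:photos"))
```

The design choice is to organize data around the read path, so a query becomes a key lookup that can be spread across many commodity machines rather than a join confined to a single database server.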
In parallel, the bandwidth demands of video and of mobile infrastructure are seeding new innovation around capacity and bandwidth distribution as well, including much more efficient and easier-to-implement elastic computing capabilities that handle variable demand as more of mobile's computing workload moves to, and is answered via, the Web (which also makes smartphones ideal Cloud Computing clients, further pushing the paradigm).
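The elastic idea itself is simple enough to sketch; the per-server capacity and headroom figures below are invented purely for illustration:

```python
import math

# A toy sketch of elasticity: size the fleet to observed load plus
# headroom instead of a fixed peak estimate. The per-server capacity
# and headroom figures here are invented for illustration.
def target_servers(requests_per_sec: int,
                   capacity_per_server: int = 1000,
                   headroom: float = 0.25) -> int:
    """Return how many servers to run for the current load plus headroom."""
    return max(1, math.ceil(requests_per_sec * (1 + headroom) / capacity_per_server))

# As load swings (say, a video going viral on a mobile network), the
# fleet grows and shrinks with it, so capacity follows demand.
for load in (300, 4200, 9500, 800):
    print(f"{load} req/s -> {target_servers(load)} servers")
```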
These trends are not only mind-boggling and exciting; they are the cornerstones of a revolution already in progress. All of this demand-driven innovation is driving more and more build-out of the foundation from which the future Internet and "Cloud" will emerge. A few years from now, we will look back and see how the Web computing demands of today, whether from Facebook, Google, Twitter, or others, enabled a whole new generation of Web applications to emerge. And of course, huge amounts of data will have been gobbled up in the process, a lot of it coming from StrikeIron's own data delivery innovation in the Cloud.
No doubt about it, the Cloud is a good place to be.