StrikeIron Blog


Data Warehousing 2013: A Changing Landscape

  
  
  
  
  

The general premise of data warehousing hasn't changed much over the years. The idea is still to aggregate as much relevant data as possible from multiple sources, centralize it in a repository of some kind, catalog it, and then utilize it for reporting and analytics to make better business decisions. An effective data warehousing strategy seamlessly enables trend analysis, predictive analytics, forecasting, decision support, and just about anything else we now categorize under the umbrella of "data science."

What is different these days is not the premise but the shifting nature of the data sources that the warehouse must draw from to capture as much useful information as possible. It's the data that's changed, not the goal.

First, there is the rapid proliferation of social-generated data in all of its unstructured forms, which makes the extraction and transformation steps of loading data into the warehouse more difficult than they have been in the past. But this isn't really groundbreaking for 2013: social data, and the Big Data technologies such as Hadoop that its growth has spawned, have been emerging for several years now.

Instead, what will likely be significantly different in 2013 is the accelerating deployment of a multitude of SaaS applications within the enterprise, especially in the larger, often slower-to-adopt companies that populate the Fortune 2000. As deal sizes grow, the SaaS footprint is clearly becoming significantly bigger.

This is where it becomes interesting. It's not just that an organization has several different SaaS applications such as Salesforce, Workday, and SuccessFactors in place across the enterprise, with a single instance of each in use by all. Instead, because these SaaS applications are so easy to adopt, many of them have come in through the back door, department by department and at different times, rather than through a centralized, IT-controlled rollout. This means that multiple instances of the same application are popping up everywhere.

For example, there are large enterprises that now have 10, 20, or even 50+ instances of Salesforce running across the entire organization. Each instance has its own customizations for data collection and storage, separate add-on applications installed, different data feeding these applications, and a unique implementation approach. This could be a case of the old adage: solving old problems while creating new ones.

This raises some questions. What kind of data collection and ETL challenges will this cause for those wishing to leverage a data warehousing strategy? Does the fact that the operational data from these various SaaS applications is stored and maintained by different vendors, each of which is incentivized to keep it that way, make things easier or more difficult for data warehousing and the analysis it enables? Will the data integration strategies that this fragmentation requires scale across all of these instances of SaaS applications? It will be interesting to see how organizations meet the "SaaS sprawl" challenge, especially as it relates to cross-enterprise data collection strategy.
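
To make the ETL challenge concrete, here is a minimal Python sketch of one piece of it: mapping records from differently customized instances of the same application onto a single warehouse schema. The instance names, field names, and sample records are all hypothetical and chosen only for illustration; a real integration would also have to handle type conversion, deduplication, and fields that exist in one instance but not another.

```python
# Hypothetical per-instance field mappings. Each instance has been customized
# differently, so the same concept lives under a different field name.
FIELD_MAPS = {
    "sales_emea": {"Email": "email", "AccountName": "account", "Amount__c": "amount"},
    "sales_us": {"EmailAddress__c": "email", "Account": "account", "DealSize__c": "amount"},
}


def normalize(instance, record):
    """Map one source record onto the shared warehouse schema."""
    mapping = FIELD_MAPS[instance]
    row = {target: record.get(source) for source, target in mapping.items()}
    row["source_instance"] = instance  # keep lineage so rows can be traced back
    return row


def consolidate(batches):
    """Flatten {instance: [records]} into one list of normalized rows."""
    rows = []
    for instance, records in batches.items():
        rows.extend(normalize(instance, record) for record in records)
    return rows


# Illustrative input: one record extracted from each hypothetical instance.
batches = {
    "sales_emea": [{"Email": "a@example.com", "AccountName": "Acme", "Amount__c": 1200}],
    "sales_us": [{"EmailAddress__c": "b@example.com", "Account": "Globex", "DealSize__c": 800}],
}
rows = consolidate(batches)
for row in rows:
    print(row)
```

Every new instance adds another mapping to maintain, which is one way to see why the approach may not scale gracefully to 50+ instances.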

Furthermore, SaaS applications have taken an ever-increasing hold of the enterprise of late, with larger and larger deals. With the Cloud and SaaS applications a major part of their 2013 strategies, Oracle, SAP, IBM, and the other more traditional software vendors have taken notice. SAP's Business ByDesign, Oracle's Fusion Applications, and recent SaaS acquisitions will surely add to what could become a hodgepodge of SaaS applications across the enterprise.

To meet these challenges, cloud data warehousing offerings such as BitYota and Amazon's Redshift are beginning to emerge, with the cloud as the centralized data storage repository as their core theme. ETL and data integration solutions such as Informatica Cloud and Dell's Boomi are racing to meet traditional data warehousing requirements in the cloud paradigm. Likewise, the traditional data cleansing requirements of data warehousing are being met by cloud-based counterparts that deliver better, more usable data to these new-age warehouses. One thing that will never change is that bad data will always equal bad analysis, and the need to invest in data quality strategies will continue to exist.

As the landscape of SaaS continues its rapid expansion, and the data within these applications continues to burgeon, 2013 will definitely be a pivotal year in the dawn of a new class of data warehousing technologies.

Running Informatica Cloud on Amazon Web Services

  
  
  
  
  

As you may know, StrikeIron is an Informatica Cloud partner. We recently won another customer account that will be using the StrikeIron Contact Record Verification suite to clean their records as they move between Salesforce.com, a proprietary marketing database, and Eloqua via Informatica Cloud. To help this customer get started, we wanted to be able to run Informatica Cloud on a Mac as well as have a test platform that was remotely accessible from anywhere.

Running Informatica Cloud on AWS accomplished both of these goals. We could run the secure agent on the EC2 instance and then access the Informatica Cloud web front end from a Mac or any of our customer's computers without worrying about firewalls, etc.

This tutorial will go step-by-step through how to create an AWS EC2 Windows Server instance and install the Informatica Cloud Secure agent.

The first step is to create your Amazon AWS account on this page by clicking the “Sign Up” button in the top right corner. The instance created in this tutorial will run in the free tier so if you are a new user, it should not cost you anything. Once your account is created and approved we are ready to start.

Create the instance:

1) Log into your AWS account at: https://console.aws.amazon.com/console/home

2) You should be on the AWS Management Console screen. Click the EC2 icon. This will take you to the EC2 Console Dashboard.

3) Click the “Launch Instance” button to display the Create New Instance Dialog.

[Screenshot: Launch Instance]

4) Make sure the Quick Launch Wizard radio button is selected. There are three key pieces of information you will enter on this screen:

  • In the “Name your Instance” field, type “InfaCloudTest” or whatever you would like to call this instance.

  • In the “Choose Your Key Pair” section, select the "Create New" radio button and name your security key pair “InfaCloudTest”. The key pair is used to create a secure password for your remote desktop. Click “Download” to download your PEM file to your computer. Note the location as you will need it later.

  • Finally, you will select the instance configuration. Choose the “Microsoft Server 2008 Base” with the 64 bit option selected.

Your "Create New Instance" dialog box should now look like this:

[Screenshot: Create New Instance]

5) Click “Continue” to see the next step in the "Create a New Instance" process.

6) The next dialog should look like the following. You should not need to change anything, but there are two important settings to note. First, make sure the Shutdown behavior is set to “Stop”. “Stop” means that if you shut down the instance, all of your data will persist, just like on a normal PC. If this option is set to “Terminate”, your instance will be effectively formatted and will also disappear from your instance table the next time Amazon does a cleanup sweep.

The next important item is the Security Group. Amazon creates a default security group for you. Depending on what endpoints you connect to, you may need to open up ports in the security group later.

[Screenshot: Create New Instance]

7) Click “Launch” to continue. You will receive a confirmation box saying that your instance is launching. Click “Close”.

8) You will be taken back to the EC2 Management Console. On the top-right hand side, you will see a section called “My Resources”. It should now show that you have 1 running instance (you may need to wait up to 2 minutes then click refresh for it to show up).

[Screenshot: Running Instance]

9) Click “1 Running Instance” and you will be taken to the “My Instances” page as seen below. Click the check box to the left of your instance name (InfaCloudTest) to display the instance information in the bottom pane. Take a look at this information which includes the full domain name, security groups, and elastic IP if you have linked one (note: we do not need an elastic IP for running Informatica Cloud).

[Screenshot: My Instances]

10) Right click on the instance and select “Connect” as seen below:

[Screenshot: Connect To Instance]

 

11) You will see a dialog box like below which contains the remote desktop login details for your instance.

[Screenshot: Console Connect]

12) Click the “Retrieve Password” link. You may get a warning saying “Not Available Yet”. If so, you will need to wait up to 15 minutes.

13) Click “Choose File” and find the PEM file you downloaded in step 4.

14) Click “Decrypt Password”. This will display a dialog box with the login information.

15) Note the Public DNS, username, and password, as you will need this information to Remote Desktop into the machine. You can also download a Remote Desktop shortcut file for the instance.

16) Now open your Microsoft Remote Desktop application. This will be in the Applications folder if you are on a Mac (Remote Desktop Connection comes with Office, or you can download it from: http://www.microsoft.com/mac/remote-desktop-client), or you can access it via "Program Files | Accessories | Remote Desktop Connection" if you are on a PC.

17) For the computer name, enter the Public DNS entry (note: this will change each time you stop and restart an instance).

18) Remote Desktop will pop up a login box. Enter “Administrator” as the User Name and the password you noted in step 15 above. Leave the domain field blank. If you are on a Mac, check the “Add this information in your keychain” box to remember your password.

19) You may receive a warning that the server name on the certificate is invalid. Click “Connect”.

[Screenshot: Ignore Certificate Incorrect]

20) You should now be logged into your AWS Windows instance and see a Windows desktop.
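
For readers who prefer scripting to clicking, the launch settings chosen in steps 4 through 7 could be expressed roughly as follows. This is only a sketch, assuming the boto3 Python SDK (which this tutorial does not use) and its EC2 `run_instances` call. The AMI ID is a placeholder, the instance type is an assumption about what the free tier currently covers, and the key pair is assumed to exist already. The code only builds the parameters; the actual launch call is left commented out because it needs AWS credentials.

```python
def build_launch_params(ami_id, key_name="InfaCloudTest", instance_type="t2.micro"):
    """Build keyword arguments for boto3's EC2 run_instances call.

    ami_id is a placeholder for a Windows Server AMI in your region.
    instance_type is an assumption; check what the free tier covers today.
    """
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,  # the key pair from step 4, assumed to exist already
        "MinCount": 1,
        "MaxCount": 1,
        # "stop" keeps the data when the instance is shut down (see step 6);
        # "terminate" would discard the instance.
        "InstanceInitiatedShutdownBehavior": "stop",
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [{"Key": "Name", "Value": "InfaCloudTest"}],
            }
        ],
    }


params = build_launch_params("ami-PLACEHOLDER")
print(params["InstanceInitiatedShutdownBehavior"])

# To actually launch (requires boto3 and configured AWS credentials):
#   import boto3
#   boto3.client("ec2").run_instances(**params)
```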

Installing Informatica Cloud:

21) Start up Internet Explorer. Select “Don’t use recommended settings” if prompted. Internet Explorer comes with very tight security settings on Windows Server, so I suggest you navigate to http://google.com/chrome and download Chrome to save some time and frustration. You will likely have to add several Google domains to the Trusted Sites list when prompted during the download.

22) Navigate to www.informaticacloud.com and click “Login Here” in the top right corner.

23) Login using your Informatica Cloud credentials.

24) Click “Configuration”. Click “Agents”.

25) Click the yellow “Download Agent” button.

[Screenshot: Download Agent]

26) Select “Windows” as the platform and click “Download”.

[Screenshot: Download Secure Agent]

27) When the agent_install.exe download is complete, click agent_install.exe and then “Run” in the Windows security box.

28) Select the default values for the Informatica Cloud Agent install wizard and click “Done” when complete.

29) Enter your Informatica Cloud credentials and click “Register” in the setup box.

[Screenshot: Secure Agent Signin]

30) After approximately 30 seconds, you should see that the Secure Agent is up and running on the Windows Server. 

31) You should see the Agent populate on the Informatica Cloud site in the Configuration | Agents section.

32) If you are going to use files or a database on the AWS Windows Server, you will also need to add a connection to the EC2 instance. For example, to read/write flat files on the Windows Server, in the Informatica Cloud web app, click “Configuration”, and then “Connections”. Click the yellow “New” button:

[Screenshot: Add New Connection]

33) Create a target directory on the Windows Server, "c:\infacloud" in this case, and fill out the new connection information as seen below:

[Screenshot: Add New Connection]

Your Informatica Cloud instance is now ready. You can create Contact Validation, Data Synchronization, and other tasks.

I hope you found this tutorial helpful. Please leave any questions or comments below or feel free to drop us an email at info@strikeiron.com 

Amazon's NoSQL and Database Evolution: What Can Be Learned

  
  
  
  
  
Late last week, Amazon released an update to its DynamoDB service, a fully managed NoSQL offering for efficiently handling extremely large amounts of data in Web-scale (generally meaning very high user volume) application environments. The DynamoDB offering was originally launched in beta back in January, so this is its first update since then.

The update is a "batch write/update" capability, enabling multiple data items to be written or updated in a single API call. The idea is to reduce Internet latency by minimizing trips back and forth to Amazon's various physical data storage entities from the calling application. According to Amazon, this was in response to developer forum feedback requests.
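
The effect on round trips is easy to sketch in Python. The example below does not call Amazon at all: `send_batch` is a hypothetical stand-in for whatever client function performs the actual batch write. The 25-item ceiling reflects the documented per-call limit of DynamoDB's BatchWriteItem operation.

```python
MAX_BATCH = 25  # BatchWriteItem accepts at most 25 put/delete requests per call


def chunked(items, size=MAX_BATCH):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_all(items, send_batch):
    """Send items in batches and return the number of API calls made.

    send_batch is a hypothetical callable standing in for the real client.
    """
    calls = 0
    for batch in chunked(items):
        send_batch(batch)
        calls += 1
    return calls


sent = []  # collects each batch instead of calling a real service
calls = write_all(({"id": i} for i in range(60)), sent.append)
print(calls)  # 3 calls instead of 60 single-item writes
```

A real client would also need to resend any items the service reports as unprocessed, which this sketch leaves out.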

That the first update targets what was already a key selling point of DynamoDB tells us that latency is still a significant challenge for cloud-based storage. After all, one of the key attributes of DynamoDB when first launched was speed and performance consistency, something that SimpleDB, Amazon's NoSQL precursor to DynamoDB, was unable to deliver, at least according to some developers and users who claimed data retrieval response times ran unacceptably into the minutes. This could also have been a primary reason for SimpleDB's lower adoption rates. Amazon is well aware of these performance challenges, hence the significance of its first DynamoDB update.

Another key tenet of DynamoDB is that it is a managed offering, meaning that the details of data management, such as moving data from one distributed data store to another, are completely abstracted away from the developer. This is great news, as the complexity of cloud environments was proving too challenging for many developers trying to leverage cloud storage capabilities. The masses were scratching their heads over how to overcome storage performance bottlenecks, attain replication, achieve consistent response latency, and handle other operations-related data management challenges when it was in their purview to do so. By the way, management complexity will likely still be a major challenge for other NoSQL vendors that do not offer the same level of abstraction as DynamoDB, and there are many "big data" startups offering products in this category. It will be interesting to see if the launch of DynamoDB becomes a significant threat to many of these startups.

We learned this reduction-of-complexity lesson at StrikeIron within our own niche offerings as well. We gained a much bigger uptake of our simpler, more granular Web services APIs, such as email verification, address verification, and reverse address and telephone lookups, offered as single, individual services, than of complex services with many different methods and capabilities. This proved true even if the more complex services provided more advanced power within a single API. In other words, simplified remote controls for television sets are probably still the best idea for maximum television adoption, as the adoption of any technology tends to be inversely proportional to the initial confusion and frustration it causes.

Another interesting point is that this is the fifth class of database product offerings in Amazon's portfolio. Along with DynamoDB, there is also still the aforementioned SimpleDB, a schemaless NoSQL offering for "smaller" datasets. There is also the original S3 offering, with a simple Web service based interface for storing, retrieving, and deleting data objects in a straightforward key/value pair format. Next, there is Amazon RDS for managed, relational database capabilities, which uses traditional SQL for manipulating data and is more applicable to traditional applications. Finally, there are the various Amazon Machine Image (AMI) offerings on EC2 (Oracle, MySQL, etc.) for those who don't want a managed relational database and would rather have complete control over their instances (without having to use their own hardware) and the RDBMSs that run on them.

This tells us that the world is far from one-size-fits-all cloud database management systems, and we can all expect to be operating in hybrid storage environments that will vary from application to application for quite some time to come. I suppose that's good news for those who make a living on the operations teams of information technology.

And along with each new database offering from Amazon comes a different business model. In the case of DynamoDB, for example, Amazon has introduced the concept of "read and write capacity units", where charges are based on a combination of frequency of usage and physical data size. This demonstrates that the business models are still somewhat far from optimal and will likely change again in the future. Clearly they are not yet quite right for the major vendors trying to figure it all out, as business model adjustments in the Cloud are not limited to Amazon.
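
To see how frequency and data size combine under such a model, consider this rough Python sketch. The formula (operations per second multiplied by the item size rounded up to whole units) and the default unit size of 1 KB are assumptions made for illustration, not Amazon's published pricing rules, so check the current DynamoDB documentation before relying on any of the numbers.

```python
import math


def capacity_units(ops_per_second, item_size_kb, unit_size_kb=1.0):
    """Estimate capacity units needed for a steady workload.

    unit_size_kb is an assumed value for how much data one unit covers.
    """
    units_per_op = math.ceil(item_size_kb / unit_size_kb)
    return ops_per_second * units_per_op


small = capacity_units(ops_per_second=100, item_size_kb=0.5)
large = capacity_units(ops_per_second=100, item_size_kb=2.5)
print(small)  # 100
print(large)  # 300
```

Under these assumptions, the same request rate costs three times as much once items grow from 0.5 KB to 2.5 KB, which is why both dimensions matter when estimating a bill.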

In summary, following the Amazon database release timeline over the years yields some interesting information, namely that speed/latency, reduction of complexity, the likelihood of hybrid compute and storage environments for some time to come, and ever-changing cloud business models are the primary focus of cloud vendors responding to the needs of their users. And as any innovator knows, the challenges are where the opportunities are.


OpenStack - Open Cloud Operating System Gaining Momentum

  
  
  
  
  

As the "Cloud" has evolved and matured over the past few years, the alternatives for deploying a cloud-based solution have been almost entirely proprietary and commercial. They have typically required at least a credit card to even get started "renting" servers and storage that might be needed for only short periods of time and to achieve more flexible scalability models. With the success and momentum of OpenStack, an open source cloud operating system for deploying, hosting, and managing public and private clouds within a data center, this appears to be changing.

The OpenStack project, launched initially with code contributions from Rackspace and NASA, provides the software components for making cloud management functionality available from within any data center, including one's own, similar to what Amazon, VMware, Microsoft, and other cloud vendors now offer commercially. Deploying OpenStack enables cloud-based applications and systems utilizing virtual capacity to be launched without the run-time fees the current slate of vendors require, as all of the software is freely distributable and accessible.

At first glance, this seems to be an ideal solution for larger enterprise IT organizations to offer traditional cloud functionality, such as virtual servers and storage, to their constituents within the organization without the fear of vendor lock-in and ever-increasing vendor costs. This approach also provides access to implementation details and the ability to customize based on specialized needs, which is important in many scenarios and not typically or easily offered by the larger commercial vendors. So the benefits in the private cloud space, for those who find it appropriate to build and manage their own cloud environments, are clear.

However, Rackspace itself just announced making public cloud services available using OpenStack, and others are likely to follow in the not-too-distant future, leveraging community-developed innovation in the areas of scalability, performance, and high availability that might ultimately be difficult for any single proprietary vendor to match. This should enable public service providers, especially in niche markets, to proliferate as well.

Major high tech vendors are also backing and aligning with OpenStack. In addition to Rackspace and NASA, Deutsche Telekom, AT&T, IBM, Dell, Cisco, and Red Hat all have much to gain from the success of OpenStack and have signed on as partners, code contributors, and sources of funding. Commercial distributions such as StackOps have already emerged. Venture funding for OpenStack-oriented companies has begun, and events such as the OpenStack Design Summit and Conference this week in San Francisco are getting larger and selling out quickly.

All of the foundational pieces are in place for OpenStack to have quite a run towards achieving its goal of becoming the universal cloud platform of the future and the leader of the "open era" of the Cloud. This is an exciting development for companies like StrikeIron and our cloud-based data-as-a-service and real-time customer data validation offerings, as the data layer of the Cloud will become even more promising and fertile as OpenStack continues to accelerate organizations towards easier adoption of cloud computing models and all of their benefits.


Don’t be an aaS

  
  
  
  
  

Much of cloud computing terminology is based on the notion of ‘as a Service’ (or ‘aaS’).

The ‘as a Service’ tag has migrated to several new uses. Here is my attempt at a set of definitions (and please comment if you disagree):

  • SaaS (Software as a Service) – I mainly see this as an application that runs in the cloud and requires the user to download no software (or very little, maybe a browser plugin) to use it. (e.g. Salesforce, Cisco WebEx, Google Apps)
  • DaaS (Data as a Service)* – This is providing data over the cloud either as the result of a query (is the email address me@acme.com valid) or involving a data transformation (correct the address 101 First Ave, Mytown, NC 2513). (e.g. StrikeIron!)
  • PaaS (Platform as a Service) – Providing a platform for running applications, data storage abstraction, etc. One step up the software stack from IaaS. (e.g. Google App Engine, Force.com/Heroku, PHP Fog)
  • IaaS (Infrastructure as a Service) – Providing a virtual machine and storage mechanisms that can be loaded with operating systems and software (custom, open source, commercial, etc). (e.g. Rackspace, Amazon AWS, GoGrid)


There are some proprietary aaS’s as well. My favorite is HP’s Everything as a Service. I am not sure what this really is but it sounds impressive.

Clear as mud? There is certainly some overlap between the different technologies, but in the end the trend is clear: leverage the efficiencies of scale, lower the barrier to entry, and speed up the time to implementation.

*DaaS can also refer to “Desktop as a Service” and “Database as a Service” in several sources.

 

Enterprise Cloud Adoption Accelerates - Four Reasons Why

  
  
  
  
  

In a report last week, the Open Data Center Alliance said that its members plan to triple their Cloud deployments in the next two years, according to a recent membership survey. This significantly outpaces the adoption forecasts from several different analyst firms and is another indicator of where the I.T. industry is headed.

Of course, there are different ways to measure Cloud adoption, and while adoption rates may always be debated, there is little question of the Cloud's growing significance in I.T. Even though some Cloud forecasts combine Infrastructure-as-a-Service (IaaS) with Software-as-a-Service (SaaS) and others keep them separate, in either case the trend is upward.

So here are four primary reasons why this trend is occurring and likely to continue for a long time to come:

- Cost. When deploying to the Cloud, one only has to deploy the needed I.T. resources at any given time. Capacity can be added or reduced as needed and whenever necessary. With this cost-savings "elastic" approach, usage spikes can be handled as well as increased resource demand over time. It's the difference between renting a server by-the-minute versus committing to two-year contracts with a data center provider at maximum capacity requirements. The latter, traditional approach front-loads application costs and requires significant capital expenditure. These heavy up-front costs go away in pay-for-what-you-use Cloud scenarios, including the ability to get things up and running more cheaply. Many startups deploying to the Cloud are spending less money on hardware and software investments than just a few years ago and getting up and running faster.

- Abstraction. Cloud deployments hide the details of the hardware, bandwidth resourcing, underlying software, load management, and ongoing maintenance of the given platform. This frees up resources to focus on one's own business rather than on endless architecture meetings and decisions, which are unnecessary for a large majority of applications. This is why Salesforce.com has found success. Customers no longer have to deal with software upgrades for sales people, database choices, syncing data from laptops to servers, hardware deployment decisions, etc. It's just easier in a Cloud SaaS model.

- Innovation. An organization can leverage the innovation and expertise of those who specialize in a given Cloud-based platform, such as the data-as-a-service offerings StrikeIron provides. As a Cloud platform becomes more advanced, this continual innovation can be put to use without any effort from the organization's own resources. The platform improves daily, and these incremental improvements benefit customers immediately, without company-wide software upgrades and rollouts. Instead, it's built-in and essentially automatic with the Cloud model. Another example is Amazon's EC2, where an increasing number of new features and capabilities can be leveraged without application redeployment.

- Platform Independence. When deploying to the Cloud, many different types of devices and clients can leverage the application via APIs or other interfaces, from PCs, tablets, smart phones, and other systems, as all communication between machines is via the ubiquitous Web, available just about any time anywhere. This makes interoperability easier, and extensive "middleware" investments of the past to make things work together can be dramatically reduced. This is one of the primary reasons why tablets such as the iPad for example have grown considerably in adoption now versus ten years ago – they work with the Cloud and can access a broad array of useful applications from just about anywhere.

These benefits of the Cloud aren't going away, and this is why the adoption trend is accelerating upward.
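
The cost argument in the first point can be made concrete with a little arithmetic. The Python sketch below compares provisioning for peak capacity around the clock against paying only for the servers in use each hour. The demand profile and the hourly rate are invented for illustration and do not reflect any vendor's actual pricing.

```python
def fixed_cost(peak_servers, hours, rate_cents):
    """Cost of keeping peak capacity running for every hour."""
    return peak_servers * hours * rate_cents


def elastic_cost(servers_per_hour, rate_cents):
    """Cost of paying only for the servers actually used each hour."""
    return sum(servers_per_hour) * rate_cents


# Hypothetical day: 2 servers for 20 quiet hours, 10 servers for a 4-hour spike.
demand = [2] * 20 + [10] * 4
rate_cents = 10  # assumed price per server-hour, in cents

fixed = fixed_cost(max(demand), len(demand), rate_cents)
elastic = elastic_cost(demand, rate_cents)
print(fixed)    # 2400 cents
print(elastic)  # 800 cents
```

With this made-up profile the elastic approach costs a third of the fixed one, and the gap widens as usage becomes spikier.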


Cloud Landscape: Cloud Databases Emerging Everywhere

  
  
  
  
  

2011 has been the year of the Cloud database. The idea of shared database resources and the abstraction of underlying hardware seems to be catching on. Just as with Web and application servers, paying as you go and eliminating unused database resources, licenses, hardware, and all of the associated cost is proving to be an attractive enough business model that the major vendors are betting on it in significant ways.

The recent excitement has not been limited to just the fanfare around "big data" technologies. Lately, most of the major announcements have centered on the traditional relational, table-driven SQL environments, which Web applications use much more widely than the key-value pair storage mechanisms that "NoSQL" technology provides for Web-scale, data-intensive applications such as Facebook, Netflix, etc.

Here are some of the new Cloud database offerings for 2011:

Salesforce.com has launched Database.com, enabling developers in other Cloud server environments such as Amazon's EC2 and the Google App Engine to utilize its database resources, not just users of Salesforce's CRM and Force.com platforms. You can also build applications in PHP or on the Android platform and utilize Database.com resources. The idea is to reach a broader set of developers and application types than just CRM-centric applications.

At Oracle OpenWorld a couple of weeks ago, Oracle announced the Oracle Database Cloud Service, a hosted database offering running Oracle's 11gR2 database platform, available in a monthly subscription model and accessible either via JDBC or its own REST API.

Earlier this month, Google announced Google Cloud SQL, a database service that will be available as part of its App Engine offering based on MySQL, complete with a Web-based administration panel.

Amazon, to complement its other Cloud services and highly used EC2 infrastructure, has made the Amazon Relational Database Service (RDS) available to enable SQL capabilities from Cloud applications, giving you a choice of underlying database technology to use such as MySQL or Oracle. It is currently in beta.

Microsoft also has its SQL Azure Cloud database offering available, generally positioned for developers whose applications use the Microsoft stack and who want to leverage some of the benefits of the Cloud.

"Marketecture?"

Some of the above offerings have only been announced so far and not actually launched, or they have only limited preview access available now. Also, in some of these cases the business models have not been completely divulged, or if they have, they are very likely to change.

Clearly there is a considerable market share land grab under way. All of the major vendors are recognizing that traditional-SQL Cloud storage infrastructure will be an important technology going forward. Adding a solid database layer to the Cloud architecture story seems like an important step in the continuing enterprise and commercial software move to the Cloud, and these new vendor offerings should in turn accelerate this move.

Latency?

So, is this really the wave of the future? Some of the major questions that will have to be answered include those around latency. When data requests have to hop from a client application to the application server, then to the database, and then back to the server and client, sometimes multiple times within a single request, it can result in quite a performance hit. Likely, these machines exist far from each other geographically, which might really slow things down and annoy an end-user with slow page loads. This is probably why most infrastructure providers realize that they have to have the corresponding database capabilities available and accessed natively to reduce this latency. However, performance, along with security issues (perceived or otherwise), still could be a significant barrier to mainstream adoption.
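
A back-of-the-envelope model shows why the location of the database matters so much. In the Python sketch below, a page load consists of one client round trip plus a number of sequential database round trips made by the application server. All of the millisecond figures are illustrative assumptions, not measurements.

```python
def request_latency_ms(db_round_trips, db_rtt_ms, client_rtt_ms, server_time_ms=0):
    """Total time for one page load when database calls run sequentially."""
    return client_rtt_ms + db_round_trips * db_rtt_ms + server_time_ms


# Same page, 12 sequential queries, 60 ms between client and app server.
remote_db = request_latency_ms(db_round_trips=12, db_rtt_ms=40, client_rtt_ms=60)
local_db = request_latency_ms(db_round_trips=12, db_rtt_ms=1, client_rtt_ms=60)
print(remote_db)  # 540
print(local_db)   # 72
```

Under these assumptions, moving the database from a distant data center (40 ms per round trip) to the same facility as the application server (1 ms) cuts the page load from 540 ms to 72 ms without changing a single query.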

Also, most of the relational database environments that exist in the Cloud only have a subset of SQL capabilities available and in some cases can be quite limited. For example, many of these Cloud SQL platforms don't support cross-table joins, at least not yet. This is a very common requirement for SQL applications. The lack of support is primarily because joins can consume a lot of resources, another performance-killer in shared environments.
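
When a platform does not support joins, the work typically moves into the application. Here is a minimal Python sketch of an inner hash join performed client-side over two result sets that were fetched separately. The table and column names are hypothetical, and in practice this approach moves more data over the network, which adds to the latency concerns discussed earlier.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner-join two lists of dict rows on a shared key column."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)  # build a lookup table on the right side
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**right_row, **left_row})  # left values win on clashes
    return joined


# Hypothetical result sets from two separate single-table queries.
customers = [
    {"customer_id": 1, "name": "Acme"},
    {"customer_id": 2, "name": "Globex"},
]
orders = [
    {"customer_id": 1, "total": 250},
    {"customer_id": 1, "total": 75},
    {"customer_id": 3, "total": 10},
]

result = hash_join(orders, customers, "customer_id")
for row in result:
    print(row)
```

Only the two orders belonging to customer 1 appear in the result; the order for the unknown customer 3 is dropped, as in a SQL inner join.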

Next?

Once most of this storage and Cloud database infrastructure is in place, however, incorporating more content-oriented data services such as customer data verification will become commonplace and easy to leverage. We may even see them incorporated into the database offerings themselves as vendors look to differentiate themselves from one another. Cloud-based database offerings have the advantage of making much larger libraries of data-oriented add-on capabilities available right out of the box, so the story here is much more than just cost.

While SQL Cloud offering announcements are all the rage in 2011, 2012 will undoubtedly tell the adoption tale. No doubt these offerings will be ideal and cost-effective for many use cases out there. But will demand be large enough quickly enough to support all of these vendors and drive the innovation at a speed that will make these platforms viable in the near future for enterprise and commercial applications? The answer is likely yes, but the next twelve months or so will give us a lot of the supporting data to measure the extent of the trend.

Current State of the Data Ecosystem - Data 2.0 Conference

I attended the Data 2.0 Conference this week in San Francisco. There is a lot to be excited about in this emerging, growing and quickly accelerating industry. However, there are still some significant obstacles that have to be overcome for the vision of the data-driven world and the “great data highway in the sky” to truly be realized.

First, there is the exciting stuff. New companies continue to emerge and grow in the space in multiple categories, including broad data sharing sites (FluidInfo, InfoChimps), purveyors of proprietary and hard-to-capture data (Navteq, Metamarkets), API infrastructure providers (WebServius, Apigee, Mashery), specific data category providers (SimpleGEO, Rapleaf, Socrata, DataSift), providers of API-based solutions (StrikeIron, Xignite) and slick data visualization tools (too numerous to list).

Also, the companies that have been in this space for five or more years are becoming larger, more sophisticated and are in many cases continuing to raise significant amounts of capital from investors looking to capitalize on the data megatrend. Even the stalwarts such as Microsoft with their DataMarket are eyeing future fruitful harvests in this space. Twitter also announced the commercial licensing of its entire “fire hose” of Tweets at the event, a move that the providers of analytics tools are hailing.

More and more public, government-sourced data is coming online every day as fodder for this machine-to-machine information feeding frenzy. This data is coming from every level of government too: cities such as San Francisco (datasf.org), state governments such as Oregon (data.oregon.gov, announced two weeks ago), and the federal government's data.gov initiative (rumors that funding might be cut are false, according to keynote speaker Vivek Wadhwa, the charismatic and sometimes controversial industry insider).

All of these government-sourced data assets are being made available to the general public in the hopes of civically engaging the creativity among us to innovate and create public value without the traditional budgetary costs. This has already led to a proliferation of applications such as online live maps of San Francisco municipal transportation schedules (including for the iPhone and Android platforms), as well as municipal vendor contracts posted online for public discussion, as on Chicago’s citypayments.org site (it's amazing how many vendors that consistently come in late and over budget get rewarded with new contracts over and over).

Finally, the platforms (the fertile Cloud and all that is happening there with the Google AppEngine, Microsoft’s Azure, and Amazon’s various cloud offerings) and especially the devices (like smartphones and the iPad) that can make use of this data are also marching forward at a breathtaking pace.

This is all very exciting for those of us for whom data represents a livelihood.

However, the significant challenge around the accessibility and usability of these vast seas of data is that it is still largely a complex, IT-oriented developer's world. Most of the access to these data sources is either via API, or available in structures and formats as varied as the data itself. This limits the applicability of these valuable data sources to a very small group of dedicated engineers and leaves us all with only a modicum of the true potential of this space.

Sure, there are API protocol standards such as REST and SOAP, but these only scratch the surface. Most single-API vendors introduce a new set of behaviors, data structures, response codes and a new business model with each new API. This adds greatly to the complexity for anyone looking to put these data sources to use, both initially and on an ongoing basis. Until we can find a way to normalize the great data highway, it's going to be a bumpy road. Those that know me know I’ve been preaching this for years and have applied much of it to StrikeIron’s various data and API offerings; however, it can be a difficult proposition to get adopted across the industry.

The consumption complexity issue is demonstrated by the term "mashup". Several years ago, this was the term-du-jour of an industry claiming that non-developers could combine datasets in interesting and exciting new ways without the assistance of their IT organizations. This term however has all but shriveled and died. Why? Because non-developers couldn’t do it. The tools were cumbersome, complex, and represented whole new learning curves that most people simply don't have the time or patience for, and the datasets themselves lacked any common standards. In fact, I never heard the term mentioned a single time at the event. A few years back, it would have been in nearly every discussion. May the term “mashup” rest in peace.

Until we surpass this hurdle of data consumption complexity, and as long as the vendors in the space only pay lip service to these challenges (an attitude prevalent on several of the panels at the event), the data-driven world will only be a shell of what it could be.

Intelligent Use of Amazon's Simple Email Service (SES) Using StrikeIron Email Verification

Amazon's new SES (Simple Email Service) product is a scalable, transaction-based offering for programmatically sending large amounts of email. It runs on Amazon's Web-scale architecture and is aimed especially at applications that already use EC2 (server rental) and S3 (storage rental). By utilizing SES you are essentially leveraging the "Cloud" to send emails from applications and Web sites rather than investing in your own software and hardware infrastructure to do so. As with most Cloud services, this substantially reduces cost and complexity, and in this case it requires only a simple API call. There is no network configuration or email server setup required.

However, there are some significant restrictions to consider that Amazon has imposed on the user. The SES service will only let you start out with a limited quota until you build a "good reputation" within their system. The initial limit is 200 emails per day. This will increase substantially once you build your reputation within Amazon’s service.

One criterion used to "build your reputation" within Amazon is the number of bouncebacks, that is, emails that could not be delivered because the email address is invalid or has been disabled. Having a clean, verified email list prior to and during your ongoing use of the service is extremely important to minimize the number of bouncebacks you receive. If your use of SES returns a large number of bouncebacks from non-working email addresses, your quota will not be raised and you may be disqualified from using the system. Multiple bouncebacks can really hurt your reputation and will prevent you from making full use of Amazon's SES product.

Fortunately, you can use another Cloud-based service (available from StrikeIron) to verify the validity of an email address before using Amazon's SES service (or any other email service). It is another simple API call that indicates whether a given email address is valid, using a non-intrusive, real-time check across the Web without ever sending an actual email. This is exactly one of the primary uses of StrikeIron's Email Verification Service: building email service provider reputation by significantly reducing bouncebacks and staying off of spam lists, which can kill your ability to communicate with customers and prospective customers electronically.
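
The verify-then-send workflow described above can be sketched in a few lines of Python. This is a minimal illustration rather than either vendor's actual API: `send_verified`, `is_deliverable`, and `send` are hypothetical names, with the two callables standing in for whatever verification service and email-sending client you actually use. The default `daily_quota` of 200 mirrors the initial SES limit mentioned earlier.

```python
from typing import Callable, Iterable, List, Tuple


def send_verified(
    addresses: Iterable[str],
    is_deliverable: Callable[[str], bool],  # stand-in for a verification API call
    send: Callable[[str], None],            # stand-in for an email-sending API call
    daily_quota: int = 200,
) -> Tuple[List[str], List[str], List[str]]:
    """Send only to addresses that pass verification, without exceeding the quota."""
    sent, rejected, deferred = [], [], []
    for address in addresses:
        if not is_deliverable(address):
            rejected.append(address)   # would have bounced; never sent
        elif len(sent) >= daily_quota:
            deferred.append(address)   # valid, but held for the next day
        else:
            send(address)
            sent.append(address)
    return sent, rejected, deferred


# Demo with stubs in place of real services.
known_bad = {"gone@example.com"}
outbox: List[str] = []

sent, rejected, deferred = send_verified(
    ["a@example.com", "gone@example.com", "b@example.com", "c@example.com"],
    is_deliverable=lambda addr: addr not in known_bad,
    send=outbox.append,
    daily_quota=2,
)
print("sent:", sent)
print("rejected:", rejected)
print("deferred:", deferred)
```

Because invalid addresses are filtered out before any send call is made, they never count against the quota and never generate a bounceback that could damage your sending reputation.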

[Image: Email Verification example]

As email technology becomes more sophisticated, so should those of us who make use of it, especially when doing so is this easy. The business upside can be dramatic.

Cloud Companies' Share Price Increase Dramatic Versus Dow

The "Cloud" has been seeing a lot of momentum this past year, and one place where that is readily apparent is in the stock price of companies making major strategic investments in Cloud technology and associated offerings, as well as aggressive go-to-market plans with those offerings.

To demonstrate this, take a look at the one-year stock price increase of eight major cloud vendors versus the Dow Jones Industrial Average. These eight growth companies were selected because of their software-as-a-service (SaaS) or infrastructure-as-a-service (IaaS) focus. They are Informatica (INFA), Salesforce.com (CRM), Amazon (AMZN), Netsuite (N), Rackspace (RAX), Success Factors (SFSF), Akamai (AKAM), and VMWare (VMW). These securities have seen on average an 81% price increase over the past year, versus a paltry 6% for the Dow Jones Industrial Average (which at least has gone up).

[Chart: one-year stock price change of the eight cloud vendors versus the Dow Jones Industrial Average]

Will it continue? There is still a long way to go in this space, so probably.