Wednesday, April 1, 2015

Moore’s Law : Data Warehousing and Business Intelligence


Moore’s Law:
I hope it’s safe to say that Moore’s Law has become as significant for data storage as Einstein’s mass-energy equivalence is for physics. A quick recap of Moore’s Law: Gordon Moore, co-founder of Intel, observed over several years, and then predicted, that the number of transistors per square inch on integrated circuits would keep doubling at a regular interval (popularly quoted as every 18 months) while the cost of the ICs decreases. In this post, we will examine how Moore’s Law has played out in Data Warehousing and Business Intelligence over the last few years.


Conventionally, Data Warehousing (EDW) applications were known for requiring more data storage and computing power than other applications in an organization. So, let’s look at how the cost of data storage has changed over the years:


  • From 1993 to 2013, the cost of a terabyte of disk storage has dropped from $2 million to less than $100. (This is the cost of the drives alone. Storage-management hardware and software are separate.)
  • Dynamic RAM (DRAM), the volatile memory of the computer, has seen a similar drop in price ($250 for every 4 megabytes of 8-bit memory versus $450 for every 32 gigabytes of 64-bit memory). Today’s servers can be fitted with 64,000 times as much memory for only twice the cost of 20 years ago.
  • Cost per gigabyte is only part of the story. There has been a similarly dramatic increase in density (from 1 MB to 8 GB per chip), allowing for smaller installations that take less space and use less energy.
  • A proprietary Unix server with four Intel 486 chips, 384 MB of RAM, and 32 GB of storage cost $650,000 in 1993, more than $1 million in constant dollars. Today’s mid-range laptops of the same capacity cost between $2,000 and $4,000.
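
As a rough check of how these figures compare with the 18-month interval quoted for Moore’s Law, the sketch below derives the implied halving time of cost from the two endpoints given for disk and DRAM. It assumes a smooth exponential decline and, for DRAM, that the two prices are exactly 20 years apart; neither is guaranteed by the figures above, and the function name is made up for illustration.

```python
import math

def halving_time_years(start_cost, end_cost, years):
    """Years needed for cost to halve, assuming a steady exponential decline."""
    return years * math.log(2) / math.log(start_cost / end_cost)

# Disk: $2,000,000 per TB (1993) -> $100 per TB (2013)
disk = halving_time_years(2_000_000, 100, 20)

# DRAM: $250 per 4 MB -> $450 per 32 GB, compared per megabyte.
# The 20-year gap is an assumption taken from the "20 years ago" remark.
dram = halving_time_years(250 / 4, 450 / (32 * 1024), 20)

print(f"Disk cost per TB halves about every {disk * 12:.1f} months")
print(f"DRAM cost per MB halves about every {dram * 12:.1f} months")
```

Under these assumptions both halving times land in the same general range as the 18-month figure, which is what makes the comparison with Moore’s Law tempting.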


These factors can be pictured as follows:


moores_Infographic_2-1.jpg


As storage costs have fallen over the years, many IT firms have started offering storage and computing power as a service (cloud computing) for Data Warehousing and Business Intelligence.


Cloud computing for DW/BI:
Cloud computing has emerged as one of the hot topics of the last few years, with the promise of affordable, “pay as you go” computing infrastructure designed to minimize both the upfront investment in infrastructure and the lead time required to deploy compute resources for new projects. With so many big players in IT providing cloud services, the time to provision even moderately complex environments can be reduced to under an hour, with entry-level costs of less than one dollar per hour. However, cloud-based environments for big data analytics, or more specifically, data warehousing analytics for structured data, are not appropriate for all use cases.


Here are some of the criteria that can be used to decide whether a DW/BI solution is a good fit for cloud computing:
  • Total volume of data
  • Volume of data to be loaded daily
  • Sensitivity of data/Regulatory and compliance requirements
  • Scope of Analytics (e.g. mart or full-scale EDW)
  • Primary environment use (e.g. dev/test/production)


Assuming a business fits the characteristics mentioned above, let us look at some of the major players offering cloud-based services for DW/BI. These services not only provide storage and computing power but have also changed the way organizations work with their data and analytics.


Some cloud services for DW/BI:
Amazon Redshift:
No discussion of cloud data warehousing would be complete without Amazon Redshift. The service was one of the first players in this area and, in my view, still offers some of the best services at highly competitive pricing, starting as low as 85 cents per hour.
What makes Redshift a good choice:
  • Fast - Optimized for DW and scalable
  • Cheap - No upfront costs, pay-for-use
  • Simple - Get started in minutes and auto-backed up
  • Secure - Encrypted, isolated network with CloudTrail
  • Compatible - With many SQL databases
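
To put the pay-for-use point in numbers, here is a minimal sketch based on the 85 cents per hour entry rate quoted above. It assumes a single node billed only while it is running and ignores storage, data transfer and any reserved pricing, so treat it as back-of-the-envelope arithmetic rather than a pricing model; the function and variable names are illustrative.

```python
HOURLY_RATE_USD = 0.85  # entry-level rate quoted above

def usage_cost(hours, rate=HOURLY_RATE_USD):
    """Cost in USD of running one node for the given number of hours."""
    return round(hours * rate, 2)

# Running around the clock for a 30-day month
always_on_month = usage_cost(24 * 30)

# Running 8 hours a day for 22 working days (assumes no charge while stopped)
business_hours_month = usage_cost(8 * 22)

print(f"Always on:      ${always_on_month:,.2f} per month")
print(f"Business hours: ${business_hours_month:,.2f} per month")
```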


IBM DB2 with BLU acceleration:
IBM DB2 with BLU Acceleration is the next generation database technology that changes the game for in-memory computing. Delivering a combination of innovations from IBM Research & Development labs, BLU Acceleration provides breakthrough performance by delivering instant insight from real-time operational data and historical data. 
Salient features:
  • Fast - claims 35x faster analytics with in-memory processing
  • Simple - Transactions and analytics together
  • Agile - Available on-premises and in the cloud alike, with extensive SQL compatibility


Future of DW/BI:
With 80% of data in unstructured format, extracting, transforming and loading data from unstructured sources and analyzing it will be the next big thing for Data Warehousing and Business Intelligence applications. As we discussed in previous blog posts, DW 2.0 is already making its way toward this goal, and these applications might be able to process petabytes of data in the cloud to provide business insights.





Wednesday, March 4, 2015

Visualization - The Effective Way

Overview:

The enormous growth of data, and of the capability to analyze it, has created demand for visualizations and visual reports that are more intuitive and interactive than traditional reports. Many BI reporting tools have introduced new and effective styles of visualization, including advanced geospatial visualization. However, what kind of visualization is best suited to present a dataset? Does one style depict all business metrics effectively? Traditionally, pivot tables, histograms and pie charts were among the few graphs used by almost all businesses and the departments within them. In the last few years, businesses have found other visualizations such as heat maps, tree maps and network visualizations more comprehensible.

In this post, let us randomly (not really) choose 3 business vignettes and discuss the best ways to visualize metrics in those businesses.

Human Resources Management: 

Human Resource Management, which has recently evolved into Human Capital Management with many additional features, will be the first business vignette we look at. Before exploring the best visualization, let us examine which metrics or data the reports in this domain consist of. One of the most common reports in this realm is the Employee Turnover Report (staffing analysis), along with Employee Transactional Reports, Employee Skills, Employee Demographics (including age) and Time & Labor reports.

When it comes to the Employee Turnover Report, I believe the best way to look at it is a line graph depicting the staffing activities by department over a period of time. Tools like Tableau® have options to create beautiful dashboards covering staffing trends, performance by group and so on.



source: http://www.tableau.com/solutions/hr-analytics

In this example, the line graphs display the number of hires/rehires, promotions, terminations and transfers, differentiated by color to make them easy to follow. The stacked histograms below them show the performance details by supervisor, the percentage of employees at each performance level, and the terminations at each performance level, respectively.

An extended version of this report is an ‘employment report’, of the kind generally published by firms like ADP LLC and Forbes. A time-series plot is a quick and effective way to grasp employment trends over a period of time.


source: http://www.adpemploymentreport.com/2015/February/NER/NER-February-2015.aspx

Healthcare: 

Healthcare is one of the most complex systems in the United States, and it needs informative visualizations for a variety of users including PCPs (primary care physicians), hospitals, insurance carriers and government. The metrics in this domain include monetary details such as average healthcare expenses, premiums or hospital bill amounts, along with other measures such as the percentage of people insured. Further analysis based on geographical location makes healthcare analytics more interesting.

Though all the numerical values can be shown effectively in histograms, I prefer bubble plots, which are more aesthetically appealing.

Here is an example of aggregated healthcare expenses shown in a bubble plot.


source: http://www.knowledgevaluechain.com/wp-content/uploads/2013/07/Bubble-chart.png

Another chart I found very interesting relates the analysis to geographical locations and presents it as a geospatial heat map, like the one below.


source: http://www.pewtrusts.org/en/multimedia/data-visualizations/2014/medicaid-spending-growth

E-Commerce:

E-commerce is another business that adopted the new Big Data and Data Analytics technologies at their nascent stages. This industry has many quantitative and qualitative metrics to visualize. The quantitative metrics include session time, conversion rate, average order value, churn rate, cost per click (CPC) and customer lifetime value (CLV), not to forget the traditional financial and revenue numbers.

I see a histogram, line graph or scatter plot as feasible and good visualizations for presenting quantities like conversion rate over time, grouped by day and time of day. However, this heat map makes it simpler and more attractive.


source: https://plot.ly/~hianalytics/16/google-analytics-organic-traffic-conversion-rate-heatmap-jan-2014.png
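
Behind a heat map like this sits a simple grid of conversion rates, one cell per day and hour. Here is a minimal sketch of that aggregation step, using a handful of made-up session records; the record layout and function name are assumptions for illustration, and a plotting tool would then color each cell by its rate.

```python
from collections import defaultdict

# Hypothetical session records: (day of week, hour of day, converted?)
sessions = [
    ("Mon", 9, True), ("Mon", 9, False), ("Mon", 14, False),
    ("Tue", 9, False), ("Tue", 14, True), ("Tue", 14, True),
]

def conversion_grid(records):
    """Return {(day, hour): conversion rate} for every cell that has sessions."""
    totals = defaultdict(int)
    converted = defaultdict(int)
    for day, hour, did_convert in records:
        totals[(day, hour)] += 1
        converted[(day, hour)] += int(did_convert)
    return {cell: converted[cell] / totals[cell] for cell in totals}

grid = conversion_grid(sessions)
for (day, hour), rate in sorted(grid.items()):
    print(f"{day} {hour:02d}:00  {rate:.0%}")
```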

All the other rates and numerical parameters can be presented in a similar plot. Adding to the list is a word cloud which presents the keywords that attract more customers to the website. Here is an example: 


source: http://searchenginewatch.com/IMG/854/245854/enterprise-search-2013-word-cloud.jpg?1358369836
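
The input to a word cloud is nothing more than a table of keyword frequencies, which a rendering tool then maps to font sizes. A minimal sketch of that counting step follows; the search queries and the stop-word list are invented for illustration.

```python
import re
from collections import Counter

# Small, hypothetical stop-word list; a real one would be much longer
STOP_WORDS = {"the", "a", "for", "and", "of", "to", "in"}

def keyword_weights(queries, top_n=5):
    """Count how often each keyword appears across the search queries."""
    words = []
    for query in queries:
        words.extend(
            w for w in re.findall(r"[a-z']+", query.lower())
            if w not in STOP_WORDS
        )
    return Counter(words).most_common(top_n)

# Hypothetical search queries that led visitors to the site
queries = [
    "running shoes for women",
    "trail running shoes",
    "waterproof running jacket",
    "shoes for trail hiking",
]
weights = keyword_weights(queries)
for word, count in weights:
    print(f"{word}: {count}")
```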

Final word: 

The views expressed in this blog are purely my personal opinions and I am sure there are many other interesting and attractive plots to display the same or similar metrics. 

Wear your creativity hat and generate amazing visualizations!!


References: 
The Data Warehouse Lifecycle Toolkit - Ross & Kimball
http://www.adpemploymentreport.com/2015/February/NER/NER-February-2015.aspx
http://www.tableau.com/solutions/hr-analytics
http://www.aquire.com/software/workforce_analytics_old
http://www.knowledgevaluechain.com/wp-content/uploads/2013/07/Bubble-chart.png
http://www.pewtrusts.org/en/multimedia/data-visualizations/2014/medicaid-spending-growth
http://blog.bigcommerce.com/6-vital-ecommerce-metrics/
https://plot.ly/~hianalytics/16/google-analytics-organic-traffic-conversion-rate-heatmap-jan-2014.png
http://searchenginewatch.com/IMG/854/245854/enterprise-search-2013-word-cloud.jpg?1358369836

Wednesday, February 18, 2015

Data Warehousing on Structured and Unstructured Data


Introduction to Structured and Unstructured data

As we were exploring and evaluating the BI tools in our previous blog post, we came across many tools boasting of compatibility with semi-structured and unstructured data. How is this unstructured data different from relational (structured) data? Why are organizations looking for tools that work with unstructured data? How different are the DW/BI applications for the two varieties of data? Let’s move on to answer a few, if not all, of these questions.

In the simplest of definitions, structured data refers to highly organized data. Structured data is data that is represented by numbers, tables, rows, columns, attributes and so on. This organization makes inserting data into and searching data in these structured datasets easy and convenient. Traditionally, this data is stored in databases, mostly relational databases. Some examples of structured data are point-of-sale data, call detail records and web server logs. With the evolution of unstructured and big data, experts estimate that structured data accounts for only 20% of the total data available today.

Structured Data


In contrast to structured data, unstructured data is data that is not organized in a predefined manner. In a general sense, structured data refers to the data stored in databases, while unstructured data refers to all the other data. It is often referred to as “Big Data”. The lack of structure makes it difficult to insert or search data in these unstructured sources. Despite the complexity of working with unstructured data, it has become vital to every organization’s analysis operations.

Unstructured Data


Data Availability for Organizations

Most technology industry experts estimate that 80 percent of the data in the world is unstructured right now. By that estimate, companies produce and have access to roughly four times as much unstructured data as structured data. Any data enthusiast will be intrigued to understand how such huge volumes of data are created. Let us examine a few sources and types of data generated in an organization, starting with structured data:

Computer- or machine-generated: Data that is generated by a machine or computer without any human interaction. This type of structured data can include (i) sensor data such as RFID reads and GPS logs, (ii) web log data including server, network and connection logs, and (iii) financial data such as stock-trading data.

Human-generated: Data generated by human interactions with computers. This data includes (i) input data such as demographic information and/or surveys, and (ii) click-stream data generated by tracking user clicks on websites.

By contrast, the very definition and nature of unstructured data make it difficult to categorize its sources. However, most unstructured data comes in the form of Word documents, PDFs and other text files, audio files, presentations, images and videos.

Here is a graph showing the variety of data and its volume over the last 5 years, with a prediction for the next 5 years.



Classic reporting and data analysis: Data Warehousing

A data warehouse is a federated repository for all the data that an enterprise's various business systems collect. The repository may be physical or logical. Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access. 

Traditionally, organizations were successful in applying Data Warehouse and OLAP technologies to build decision support systems for organizing and analyzing the huge amounts of structured data that companies store in their databases. However, with the evolving role of “big data” or unstructured data in organizations’ analytics, it is essential for Enterprise Data Warehouse (EDW) applications to extend their services to work with this type of data. Interestingly, analytics tools generally convert unstructured and semi-structured data to somewhat structured data before analyzing and reporting on it.
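
As a small illustration of that conversion step, the sketch below pulls a few fields out of a free-text note and turns them into a row that could be loaded into a table. The note, the field names and the patterns are all hypothetical; real tools use far more robust text analysis than a handful of regular expressions.

```python
import re

# Hypothetical free-text note, e.g. from a call-center system
note = "Customer called on 2015-02-10 about order #48213; refund of $59.99 requested."

# One illustrative pattern per field we want to extract
PATTERNS = {
    "date": r"\d{4}-\d{2}-\d{2}",
    "order_id": r"#(\d+)",
    "amount": r"\$(\d+(?:\.\d{2})?)",
}

def to_row(text):
    """Pull a few fields out of free text; fields that are not found become None."""
    row = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        # Use the captured group when the pattern has one, else the whole match
        row[field] = match.group(match.lastindex or 0) if match else None
    return row

row = to_row(note)
print(row)
```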


Limitations of Data Warehousing

Here is a quick look at the limitations of data warehousing in analyzing different types of data:
Structured Data
  • Required data not captured - The source systems and/or OLTP systems might not capture all the important data that a warehouse needs to generate useful reports.
  • Data homogenization - The basic principle of a data warehouse requires data from all sources to be in a similar format. Sometimes this results in leaving out important data points that do not conform to that format.

Processing unstructured data in a data warehouse has further limitations in addition to those listed above:
Unstructured Data
  • Data in textual format, images and other unstructured formats is hard for traditional Data Warehousing applications to read and analyze.
  • There is a lot more data to clean, filter, normalize, transform and load into the Enterprise Data Warehouse.
  • The algorithms for textual analysis are more complex than those for numerical analysis.

Future of Data Warehousing

Since 80% of the current data in the world is unstructured and most of it is text, Data Warehousing applications should have a means of processing this text data to provide analysis and reports. Integrating unstructured text data with a data warehouse would involve multiple steps, including building an unstructured database, loading data into that database, creating a relational data structure, and creating a probabilistic foundation for matching unstructured and structured data.
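
To make the last of those steps more concrete, here is a minimal sketch of matching a name found in free text against structured customer records. It uses plain string similarity from Python's difflib as a stand-in; this is not a true probabilistic model, and the customer names, threshold and function name are assumptions for illustration only.

```python
from difflib import SequenceMatcher

# Structured side: hypothetical customer dimension rows (id -> name)
customers = {101: "Acme Corporation", 102: "Globex Inc", 103: "Initech LLC"}

def best_match(mention, candidates, threshold=0.6):
    """Return (customer_id, score) for the closest name, or None below threshold."""
    scored = [
        (cid, SequenceMatcher(None, mention.lower(), name.lower()).ratio())
        for cid, name in candidates.items()
    ]
    cid, score = max(scored, key=lambda pair: pair[1])
    return (cid, score) if score >= threshold else None

# Unstructured side: names as they might appear in emails or documents
print(best_match("ACME Corp.", customers))
print(best_match("Umbrella Group", customers))
```

The threshold decides when a mention is left unmatched rather than forced onto the nearest record; where to set it is a judgment call that depends on how costly a wrong match is.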

The DW/BI industry has recognized the need for this kind of data warehouse technology, which experts call Data Warehouse 2.0 (DW 2.0).

Here is a pictorial representation of DW 2.0: 




Industry experts predict that the future of Data Warehousing will include the following:

  • Operational data warehouses: Real time analytics
  • Processing data and analytics in cloud will be a requirement soon
  • Big data projects will start with data warehouse optimization

Conclusion

Though the evolution of unstructured data has made data warehousing a more complex process, it still has the potential to expand its services and adapt to the latest technologies.


References: 
Integrating Unstructured Data In The Data Warehouse : Unlocking Business Value by Krish Krishnan

Tuesday, February 3, 2015

A closer look into 5 BI reporting tools

As more and more companies are inclined to explore the power of BI reporting and analytics, which reporting tools are leading the market by providing the most beneficial solutions to organizations? Let us examine a few of the well-known BI reporting tools and the features they offer to users. The views expressed in this blog are my own and are not affiliated in any way with the tools/companies mentioned here or with any other tools in the same domain.

Firstly, let us have a look at the features of a reporting tool that we consider in this post. 

Reporting from multiple sources: The ability to import data from multiple sources and generate reports easily. This means being able to connect to a variety of data sources such as flat files (e.g., CSV or TXT files), standalone relational databases like Oracle® and PostgreSQL, and more recently evolved semi-structured or unstructured (big data) stores like MongoDB®. It also covers how convenient it is for a moderately knowledgeable user to generate reports.

Accessibility: The availability of reporting tool for a wide range of users including small, medium and large companies, data enthusiasts, researchers and students. 

Interactive reports: The reports and dashboards give the user the opportunity to change the inputs and adapt the reports to their needs. A reporting analyst or developer is not required to customize the report or to generate what I call sub-reports: reports generated from a subset of the data represented/visualized.

Value for price: In business terminology, ROI (Return On Investment) is the equivalent of this feature. It refers to the value the user gains for the price of the product/service. Please note that the value in this blog is my perceived (subjective) value.

Security: In a BI reporting environment, security is clearly a high priority, as the data might relate to the entire organization and will often contain financial, marketing and strategic information. So, we will consider security as one of the features we review.

As the BI industry is growing rapidly, with new tools introduced at a gallop, it is not feasible for us to review all the tools. As the blog title suggests, we will review 5 BI tools that have established their footprint in the domain. We will discuss each of the features listed above with respect to each of the tools.


Tableau
Tableau® is intuitive BI software for any user, irrespective of technical knowledge. The platform is simple and easy to use, with drag-and-drop convenience. It also lets the user connect to a variety of data sources including flat files, relational databases and semi-structured databases. Tableau comes in different versions: Tableau® Public is a free version for everyone, while Tableau® Desktop can generate all kinds of reports and is available on purchase of a license. There is also a Tableau® Server version, which runs in a server or cloud environment and is accessible to multiple users within an organization. An interesting fact about Tableau® Desktop is that a free 1-year subscription is available for students.

With its novel design, Tableau® has created easy-to-use software with interactive reports and easily customizable dashboards. The online tutorials on Tableau’s website make it easy for anyone to become comfortable with the product within a few hours. With the features Tableau has to offer, I believe it gives more than full value for its price. It also comes with a nice set of security features for the Server version, including Access, Object, Data and Transmission security.



OBIEE
A well-established name in the BI industry is Oracle® Business Intelligence Enterprise Edition (OBIEE). The platform is robust, with many components including BI Server, BI Publisher and Hyperion Interactive Reporting. OBIEE also lets you connect to multiple sources of data, primarily structured. It can work with semi-structured and unstructured data too, with some complexity added at the Business Model layer (a.k.a. the logical layer). A few universities have access to Oracle Software Delivery Cloud, which may offer a free license and so make the software accessible to students. Unlike Tableau®, which can run on an average desktop machine, OBIEE might need a higher-end configuration to run without glitches.

Oracle delivers multiple elements along with OBIEE (as a suite), making interactive reports part of the package. Along with the visually appealing reports that OBIEE generates through its BMM (Business Model and Mapping) and Presentation layers, the Hyperion SQR reporting component works directly on the ERP databases to create text and small graphic reports. However, OBIEE might need a BI reporting analyst or a programmer analyst to create the logical layer and make the data ready for the presentation layer.



MicroStrategy
As per the company’s website and a few users, MicroStrategy Business Intelligence’s goal is to leverage data to help organizations find timely. The software supports connectivity to a wide range of data sources, mainly relational databases and flat files. It provides little, if any, support for semi-structured and unstructured data. I did not find a student license for MicroStrategy software apart from the standard license for enterprises, which makes MicroStrategy less accessible to researchers and students.

The software’s ability to connect to multiple transactional databases, including ERP and SCM systems, makes it convenient for the user to generate dynamic reports based on the latest data. It is also possible to generate interactive reports, which the company calls MicroStrategy Web reports. A distinguishing feature of MicroStrategy is its tablet and mobile apps, which make enterprise-level reporting software available on the go. With the latest release, MicroStrategy 9s, the company takes the security of data to the next level.



Pentaho
As the website boasts of its capability to blend any data, Pentaho® offers connectivity to many data sources, including big data tools. Its visual tools for every task reduce the need for coding and cut complexity. The software’s capability to generate anything from basic reports to predictive analytics makes it one of the best BI reporting tools on the market at present. As the company does not offer any free licenses to researchers and students, the tool is less accessible to those communities. Like many other BI tools, its capacity to generate interactive reports makes it easy to use for less technically knowledgeable managers.

The competitive price and the product’s capability to work with big data sets make it valuable and profitable for the user. Though the company provides security for the server as a separate product, Pentaho Security, it does not come with the default product.



Tibco Spotfire
Spotfire is one of the self-discovery BI tools on the market that helps users consume data, even big data, conveniently. Spotfire’s ability to connect to structured and unstructured data alike makes it one of the BI tool choices for working with big data. Similar to Tableau®, Spotfire comes in Desktop, Cloud and Platform versions. According to onthehub.com, Tibco® provides a free license for students, making the software accessible to researchers and students as well as to companies that use an enterprise license. Spotfire also provides the flexibility of generating sub-reports/interactive reports at your fingertips.

With a competitive price and feature-rich software, Spotfire also provides significant returns on the investment. Its ability to enforce security at the row level also helps protect your data.



Finally, let me try to quantify the importance of the features and how well each of the tools mentioned above fulfills them.

 

Here is a pictorial representation of the weights given to the features included in this report.


With the weights for each feature considered, the total score for each tool will be:



Table with numerical values for reference:
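
The scoring method itself is a plain weighted sum of feature ratings. Here is a minimal sketch of it, using made-up weights and ratings for two placeholder tools; these are not the figures behind the verdict below, and all names are illustrative.

```python
# Hypothetical feature weights (they sum to 1.0); not the weights used in this review
WEIGHTS = {
    "multiple_sources": 0.25,
    "accessibility": 0.15,
    "interactive_reports": 0.25,
    "value_for_price": 0.15,
    "security": 0.20,
}

# Hypothetical ratings on a 1-5 scale for two placeholder tools
RATINGS = {
    "Tool A": {"multiple_sources": 5, "accessibility": 4, "interactive_reports": 5,
               "value_for_price": 4, "security": 4},
    "Tool B": {"multiple_sources": 4, "accessibility": 2, "interactive_reports": 4,
               "value_for_price": 3, "security": 5},
}

def weighted_score(ratings, weights=WEIGHTS):
    """Sum of rating x weight over all features."""
    return sum(weights[feature] * ratings[feature] for feature in weights)

scores = {tool: round(weighted_score(r), 2) for tool, r in RATINGS.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print("Ranking:", ranking)
```

Changing the weights is the quickest way to adapt the comparison to your own priorities, for example by raising the security weight for a regulated industry.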



Verdict: 
Considering the features we discussed, it seems that Tableau® has the most to offer its users. However, it may not be the right solution for your industry and organizational needs. It is advisable to research more tools before choosing the right BI software for your organization.

References: 
http://www.tableausoftware.com
http://www.oracle.com/us/solutions/business-analytics/business-intelligence/enterprise-edition/overview/index.html
http://www.microstrategy.com/us/
http://www.pentaho.com
http://spotfire.tibco.com
http://www.docurated.com/all-things-productivity/50-best-business-intelligence-tools
http://www.gartner.com/technology/reprints.do?id=1-1QLGACN&ct=140210&st=sb
http://onthehub.com/download/free-software/tibco