Wednesday, February 18, 2015

Data Warehousing on Structured and Unstructured Data


Introduction to Structured and Unstructured data

While exploring and evaluating BI tools in our previous blog post, we came across many tools boasting compatibility with semi-structured and unstructured data. How is unstructured data different from relational (structured) data? Why are organizations looking for tools that work with unstructured data? How different are DW/BI applications for the two varieties of data? Let’s try to answer a few, if not all, of these questions.

In the simplest terms, structured data is highly organized data. It is represented by numbers, tables, rows, columns, attributes and so on. This organization makes inserting into and searching these datasets easy and convenient. Traditionally, this data is stored in databases, mostly relational databases. Examples of structured data include point-of-sale data, call detail records and web server logs. With the growth of unstructured and big data, experts estimate that structured data accounts for only 20% of the total data available today.

Structured Data
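To make the "easy to insert and search" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns and point-of-sale rows are made up for illustration; they are assumptions, not data from any real system.

```python
import sqlite3

# Hypothetical point-of-sale rows: (receipt_id, store, item, amount)
SALES = [
    (1, "Austin", "coffee", 3.50),
    (2, "Austin", "bagel", 2.25),
    (3, "Dallas", "coffee", 3.75),
]

def load_sales(rows):
    """Create an in-memory table with a fixed schema and insert the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "receipt_id INTEGER PRIMARY KEY, store TEXT, item TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn

def total_by_item(conn, item):
    """Search by column value; the schema tells the database where to look."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE item = ?", (item,)
    ).fetchone()
    return total

conn = load_sales(SALES)
print(total_by_item(conn, "coffee"))
```

Because every row follows the same schema, a single query answers the question without any knowledge of how the data was originally entered.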


In contrast to structured data, unstructured data is data that is not organized in a predefined manner. In a general sense, structured data refers to the data stored in databases, while unstructured data refers to all the other data. It is often loosely referred to as “Big Data”, although the two terms are not strictly synonymous. The lack of structure makes it difficult to insert into or search these unstructured data sources. Despite the complexity of working with it, unstructured data has become vital to every organization’s analysis operations.

Unstructured Data
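For contrast, here is a small sketch of what searching free text can look like. The notes and the regular expressions are hypothetical; the point is only that, without a schema, every way of phrasing a value needs its own handling.

```python
import re

# Hypothetical free-text notes; nothing guarantees where (or whether) an amount appears.
NOTES = [
    "Customer bought a coffee for $3.50 at the Austin store.",
    "Refund issued, bagel was stale. Paid 2.25 USD originally.",
    "Called to ask about opening hours.",
]

# One pattern per phrasing we happen to know about; new phrasings need new patterns.
AMOUNT_PATTERNS = [
    re.compile(r"\$(\d+\.\d{2})"),
    re.compile(r"(\d+\.\d{2})\s*USD"),
]

def extract_amount(note):
    """Return the first amount found in the note, or None if no pattern matches."""
    for pattern in AMOUNT_PATTERNS:
        match = pattern.search(note)
        if match:
            return float(match.group(1))
    return None

amounts = [extract_amount(note) for note in NOTES]
print(amounts)
```

The third note yields None: the text simply does not contain the value, and nothing in the data tells us that in advance.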


Data Availability for Organizations

Most technology industry experts estimate that 80 percent of the data in the world today is unstructured. By that estimate, companies produce and have access to roughly four times as much unstructured data as structured data. Any data enthusiast will be intrigued by how such huge volumes of data are created. Let us examine a few sources and types of structured data generated in an organization:

Computer- or machine-generated: Data generated by a machine or computer without any human interaction. This type of structured data can include i. sensor data such as RFID readings and GPS logs, ii. web log data including server, network and connection logs, iii. financial data such as stock-trading data.

Human-generated: Data generated by human interaction with computers. This includes i. input data such as demographic information and/or survey responses, ii. clickstream data generated by tracking user clicks on websites.

In contrast, the nature of unstructured data makes it difficult to categorize its sources. However, most unstructured data takes the form of Word documents, PDFs and other text files, audio files, presentations, images and videos.

Here is a graph showing the variety of data and its volume over the last 5 years, with a projection for the next 5 years.



Classic reporting and data analysis: Data Warehousing

A data warehouse is a federated repository for all the data that an enterprise's various business systems collect. The repository may be physical or logical. Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access. 

Traditionally, organizations have been successful in applying data warehouse and OLAP technologies to build decision support systems for organizing and analyzing the huge amounts of structured data that companies store in their databases. However, with the growing role of “big data” or unstructured data in organizational analytics, it is essential for Enterprise Data Warehouse (EDW) applications to extend their services to work with this type of data. Interestingly, analytics tools generally convert unstructured and semi-structured data into a somewhat structured form before analyzing and reporting on it.
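As an illustration of that conversion step, here is a minimal sketch that flattens a nested JSON record into a single row with dotted column names. The record and the flatten helper are hypothetical; real tools use their own, more elaborate mappings.

```python
import json

def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            # Lists are joined into one text column here; a real pipeline might
            # instead split them into a child table.
            flat[name] = ", ".join(str(item) for item in value)
        else:
            flat[name] = value
    return flat

# Hypothetical semi-structured input, e.g. one document from a document store.
raw = '{"user": {"id": 7, "city": "Austin"}, "tags": ["bi", "dw"], "rating": 4}'
row = flatten(json.loads(raw))
print(row)
```

Once every record is reduced to the same flat set of columns, it can be loaded into a table and reported on like any other structured data.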


Limitations of Data Warehousing

Here is a quick look at the limitations of data warehousing in analyzing different types of data - 
Structured Data
  • Required data not captured - The source systems and/or OLTP systems might not capture all the important data that the warehouse needs to generate useful reports.
  • Data homogenization - A basic principle of data warehousing is that data from all sources must be in a similar format. Sometimes this results in leaving out important data points that do not conform to the other data types.

Processing unstructured data in a data warehouse has further limitations in addition to the list above - 
Unstructured Data
  • Data in text, images and other unstructured formats is hard for traditional data warehousing applications to read and analyze.
  • There is a lot more data to clean, filter, normalize, transform and load into the Enterprise Data Warehouse.
  • The algorithms for textual analysis are more complex than those for numerical analysis.

Future of Data Warehousing

Since 80% of the current data in the world is unstructured and most of it is text, data warehousing applications should have a means of processing this text data to provide analysis and reports. Integrating unstructured text data with a data warehouse involves multiple steps, including building an unstructured database, loading data into it, creating a relational data structure and creating a probabilistic foundation for matching unstructured and structured data.
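The last step, matching unstructured text to structured records, can be sketched roughly as follows. This is only an illustration: it uses plain string similarity from Python's difflib as a stand-in for a probabilistic match, and the customer table, function names and threshold are all assumptions rather than the method described in the referenced book.

```python
from difflib import SequenceMatcher

# Hypothetical structured customer table: id -> name
CUSTOMERS = {101: "Acme Corporation", 102: "Globex Inc", 103: "Initech LLC"}

def similarity(a, b):
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_customer(mention, customers, threshold=0.6):
    """Return (customer_id, score) for the best match, or (None, score) below the threshold."""
    best_id, best_score = None, 0.0
    for customer_id, name in customers.items():
        score = similarity(mention, name)
        if score > best_score:
            best_id, best_score = customer_id, score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score

# A name as it might appear in an email or a support note.
print(match_customer("acme corp", CUSTOMERS))
```

The threshold decides how much uncertainty is acceptable: set it too low and text gets linked to the wrong record, too high and genuine mentions are left unmatched.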

The DW/BI industry has recognized the need for this data warehouse technology, which experts call Data Warehouse 2.0.

Here is a pictorial representation of DW 2.0: 




Industry experts predict the following for the future of data warehousing: 

  • Operational data warehouses: real-time analytics
  • Processing data and running analytics in the cloud will soon be a requirement
  • Big data projects will start with data warehouse optimization

Conclusion

Though the growth of unstructured data has made data warehousing a more complex process, data warehousing still has the potential to expand its services and adapt to the latest technologies.


References: 
Integrating Unstructured Data In The Data Warehouse : Unlocking Business Value by Krish Krishnan

Tuesday, February 3, 2015

A closer look into 5 BI reporting tools

As more and more companies explore the power of BI reporting and analytics, which reporting tools are leading the market by providing the most beneficial solutions to organizations? Let us examine a few well-known BI reporting tools and the features they offer to users. The views expressed in this blog are my own and are not related, in any way, to the tools/companies mentioned in this blog or to any other tools in the same domain.

First, let us look at the features of a reporting tool that we consider in this post.

Reporting from multiple sources: The ability to import data from multiple sources and generate reports easily. This means being able to connect to a variety of data sources such as flat files (e.g., CSV and TXT files), relational databases such as Oracle® Database and PostgreSQL, and more recently evolved semi-structured or unstructured (big data) stores such as MongoDB®. It also covers how convenient it is for a moderately knowledgeable report author to generate reports.

Accessibility: The availability of the reporting tool to a wide range of users including small, medium and large companies, data enthusiasts, researchers and students.

Interactive reports: The reports and dashboards give the user the opportunity to change the inputs and modify the reports to suit their needs. A reporting analyst or developer is not required to customize the report and generate what I call sub-reports: reports generated from a subset of the data represented/visualized.

Value for price: In business terminology, ROI (Return On Investment) is the closest equivalent of this feature. It refers to the value the user gains for the price of the product/service. Please note that the value in this blog is my perceived (subjective) value.

Security: In a BI reporting environment, security is a high priority, as the data might relate to the entire organization and most of the time contains financial, marketing and strategic information. So, security is one of the features we review in the list.

As the BI industry grows rapidly and new tools are introduced at a gallop, it is not feasible for us to review all of them. As the blog title suggests, we will consider 5 BI tools that have established their footprint in the domain, and discuss each of the features listed above with respect to each tool.


Tableau
An intuitive BI software for any user, irrespective of technical knowledge. The platform is simple and easy to use, with drag-and-drop convenience. It also lets the user connect to a variety of data sources including flat files, relational databases and semi-structured databases. Tableau comes in different versions: Tableau® Public is a free version for everyone, while Tableau® Desktop is capable of generating all kinds of reports and is available on purchase of a license. There is also a Tableau® Server version, which runs in a cloud environment and is accessible to multiple users within an organization. An interesting fact about Tableau® Desktop is the availability of a 1-year free subscription for students.

With its novel design, Tableau® has created easy-to-use software with interactive reports and easily customizable dashboards. The online tutorials on Tableau’s website make it easy for anyone to become comfortable with the product within a few hours. With the features Tableau has to offer, I believe it is more than full value for its price. It also comes with a good set of security features, including Access, Object, Data and Transmission security for the Server version.



OBIEE
A well-established name in the BI industry is the Business Intelligence Enterprise Edition from Oracle®. The platform is robust, with many components including BI Server, BI Publisher and Hyperion Interactive Reporting. OBIEE also lets you connect to multiple sources of data, primarily structured. It can work with semi-structured and unstructured data as well, with some complexity added at the Business Model layer (a.k.a. logical layer). A few universities have access to Oracle Software Delivery Cloud, which may offer a free license, making the software accessible to students. Unlike Tableau®, which can run on an average desktop machine, OBIEE might need a higher configuration to run without glitches.

Oracle delivers multiple elements along with OBIEE (as a suite), making interactive reports part of the package. Along with the visually appealing reports generated by OBIEE working on the BMM and Presentation layers, the Hyperion SQR reporting component works directly on the ERP databases to create text and small graphic reports. However, OBIEE might need a BI reporting analyst or a programmer analyst to create the logical layer and make the data ready for presentation in the presentation layer.



MicroStrategy
As per the company’s website and a few users, MicroStrategy Business Intelligence’s goal is to leverage data to help organizations find timely. The software supports connectivity to a wide range of data sources, mainly relational databases and flat files. It provides little, if any, support for semi-structured and unstructured data. I did not find a student license for MicroStrategy software apart from the standard enterprise license, which makes MicroStrategy less accessible to researchers and students.

The software’s ability to connect to multiple transactional databases, including ERP and SCM systems, makes it convenient for the user to generate dynamic reports reflecting the latest data. It’s also possible to generate interactive reports, which the company calls MicroStrategy Web reports. A distinguishing feature of MicroStrategy is its tablet and mobile apps, which make enterprise-level reporting software available on the go. With the latest release, MicroStrategy 9s, the company takes the security of data to the next level.



Pentaho
As its website boasts of the capability to blend any data, Pentaho® offers connectivity to many data sources including big data tools. Its visual tools for any task eliminate the need for coding and reduce complexity. The software’s capability to generate reports ranging from basic reporting to predictive analytics makes it one of the best BI reporting tools on the market at present. Although the company does not offer any free licenses to researchers and students, the tool is accessible across the communities. Like many other BI tools, its capacity to generate interactive reports makes it easy to use for less technically knowledgeable managers.

The company offers the product at a competitive price, and the product’s capability to work with big data sets makes it valuable and profitable for the user. Though the company provides security for the server as a separate product, Pentaho Security, it does not come with the default product.



Tibco Spotfire
Spotfire is one of the self-discovery BI tools in the market that helps users consume data, even big data, conveniently. Spotfire’s ability to connect to structured and unstructured data alike makes it one of the BI tools of choice for working with big data. Similar to Tableau®, Spotfire comes in Desktop, Cloud and Platform versions. According to onthehub.com, Tibco® provides a free license for students, making the software accessible to researchers and students as well as to companies that use an enterprise license. Spotfire also provides the flexibility of generating sub-reports/interactive reports at your fingertips.

With a competitive price and feature-rich software, Spotfire also provides significant returns on the investment. Spotfire’s ability to provide security at the row level also takes care of security for your data.



Finally, let me try to quantify the importance of the features and how well each of the above-mentioned software products fulfills each of them.

 

Here is a pictorial representation of weightage for the features included in this report.


With the weights for each feature considered, the total score for each tool will be:



Table with numerical values for reference:



Verdict: 
Considering the features we discussed, it seems that Tableau® has more to offer its users. However, it may not be the right solution for your industry and organizational needs. It is advisable to research more tools before choosing the right BI software for your organization.
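If you want to repeat this kind of comparison with your own priorities, the weighted scoring can be sketched as below. The weights, the 1-to-5 ratings and the tool names "Tool A" and "Tool B" are placeholders I made up for illustration; they are not the values behind the charts in this post.

```python
# Placeholder feature weights (they should sum to 1.0); NOT the weights used in this post.
WEIGHTS = {
    "multiple_sources": 0.25,
    "accessibility": 0.15,
    "interactive_reports": 0.25,
    "value_for_price": 0.15,
    "security": 0.20,
}

# Placeholder 1-5 ratings for two hypothetical tools.
RATINGS = {
    "Tool A": {"multiple_sources": 4, "accessibility": 5, "interactive_reports": 4,
               "value_for_price": 4, "security": 3},
    "Tool B": {"multiple_sources": 5, "accessibility": 2, "interactive_reports": 3,
               "value_for_price": 3, "security": 5},
}

def weighted_score(ratings, weights):
    """Sum of weight * rating over every feature in the weight table."""
    return sum(weights[feature] * ratings[feature] for feature in weights)

def rank_tools(all_ratings, weights):
    """Return (tool, score) pairs sorted from highest to lowest score."""
    scores = {tool: weighted_score(r, weights) for tool, r in all_ratings.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

for tool, score in rank_tools(RATINGS, WEIGHTS):
    print(f"{tool}: {score:.2f}")
```

Changing the weights to reflect what matters in your organization (for example, raising security for a regulated industry) can change the ranking, which is why the verdict above should be read as one perspective rather than a universal answer.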

References: 
http://www.tableausoftware.com
http://www.oracle.com/us/solutions/business-analytics/business-intelligence/enterprise-edition/overview/index.html
http://www.microstrategy.com/us/
http://www.pentaho.com
http://spotfire.tibco.com
http://www.docurated.com/all-things-productivity/50-best-business-intelligence-tools
http://www.gartner.com/technology/reprints.do?id=1-1QLGACN&ct=140210&st=sb
http://onthehub.com/download/free-software/tibco