Introduction to Structured and Unstructured Data
While exploring and evaluating BI tools in our previous blog post, we came across many tools that advertise compatibility with semi-structured and unstructured data. How is unstructured data different from relational (structured) data? Why are organizations looking for tools that work with unstructured data? How different are DW/BI applications for the two varieties of data? Let’s try to answer a few, if not all, of these questions.
In the simplest terms, structured data is highly organized data. It is represented by numbers, tables, rows, columns, attributes and so on. This organization makes inserting data into, and searching data in, structured datasets easy and convenient. Traditionally, this data is stored in databases, mostly relational databases. Examples of structured data include point-of-sale data, call detail records and web server logs. With the growth of unstructured and big data, experts estimate that structured data accounts for only 20% of the total data available today.
Structured Data
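To make the "easy to insert and search" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, columns and sample point-of-sale rows are hypothetical and only serve as an illustration.

```python
import sqlite3


def build_sales_db():
    """Create an in-memory table holding hypothetical point-of-sale rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales ("
        "id INTEGER PRIMARY KEY, store TEXT, item TEXT, amount REAL)"
    )
    rows = [
        ("Austin", "coffee", 3.50),
        ("Austin", "bagel", 2.25),
        ("Dallas", "coffee", 3.75),
    ]
    # Inserting is a single statement because every row has the same shape.
    conn.executemany(
        "INSERT INTO sales (store, item, amount) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn


def total_by_store(conn, store):
    """Search by a known column and aggregate the matching rows."""
    cur = conn.execute(
        "SELECT SUM(amount) FROM sales WHERE store = ?", (store,)
    )
    return cur.fetchone()[0]


if __name__ == "__main__":
    conn = build_sales_db()
    print(total_by_store(conn, "Austin"))
```

Because every row follows the same predefined columns, the query can filter and aggregate without first having to interpret the content of each record.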
In contrast to structured data, unstructured data is data that is not organized in a predefined manner. In a general sense, structured data refers to the data stored in databases, while unstructured data refers to all other data. It is often loosely referred to as “Big Data”. The lack of structure makes it difficult to insert data into, or search data in, unstructured data sources. Despite the complexity of working with unstructured data, it has become vital to every organization’s analytics.
Unstructured Data
Data Availability for Organizations
Most technology industry experts estimate that 80 percent of the data in the world today is unstructured. This implies that companies produce and have access to roughly four times as much unstructured data as structured data. Any data enthusiast will be curious about how such huge volumes of data are created. Let us first examine a few sources and types of structured data generated in an organization:
Computer- or machine-generated: Data that is generated by a machine or computer without any human interaction. This type of structured data can include (i) sensor data such as RFID readings and GPS logs, (ii) web log data, including server, network and connection logs, and (iii) financial data such as stock-trading data.
Human-generated: Data generated by human interactions with computers. This includes (i) input data such as demographic information and survey responses, and (ii) clickstream data generated by tracking user clicks on websites.
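As an illustration of why machine-generated web log data counts as structured, the sketch below parses one log line into named fields. It assumes the widely used Common Log Format. The sample line is made up, and real logs may need a different pattern.

```python
import re

# Pattern for the Common Log Format (an assumption about the log layout).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no response body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


if __name__ == "__main__":
    sample = (
        '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326'
    )
    print(parse_log_line(sample))
```

Each parsed line becomes a row with the same columns (host, time, method, path, status, size), which is the shape a relational table expects.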
For unstructured data, by contrast, its very definition and nature make it difficult to categorize the sources. However, most unstructured data takes the form of Word documents, PDFs and other text files, audio files, presentations, images and videos.
Here is a graph showing the variety of data and its volume over the last 5 years, with predictions for the next 5 years.
Classic Reporting and Data Analysis: Data Warehousing
A data warehouse is a federated repository for all the data that an enterprise's various business systems collect. The repository may be physical or logical. Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access.
Traditionally, organizations have successfully applied data warehouse and OLAP technologies to build decision support systems for organizing and analyzing the huge amounts of structured data that companies store in their databases. However, with the growing role of “big data” or unstructured data in organizations’ analytics, it is essential for Enterprise Data Warehouse (EDW) applications to extend their services to work with this type of data. Interestingly, analytics tools generally convert unstructured and semi-structured data to a somewhat structured form before analyzing and reporting on it.
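One simple way to picture this conversion is to turn free text into rows of (document, term, count). The sketch below is only an illustration and does not reflect how any particular analytics tool works. The stop-word list and the sample review text are hypothetical.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; real tools use much larger ones.
STOP_WORDS = {"the", "was", "and", "a", "is"}


def text_to_rows(doc_id, text, top_n=3):
    """Turn free text into (doc_id, term, count) rows for the top terms."""
    words = [
        w for w in re.findall(r"[a-z']+", text.lower())
        if w not in STOP_WORDS
    ]
    counts = Counter(words)
    # Sort by count (descending), then alphabetically, so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(doc_id, term, n) for term, n in ranked[:top_n]]


if __name__ == "__main__":
    review = "The delivery was late. Late delivery again, and the support was slow."
    for row in text_to_rows("r1", review):
        print(row)
```

The resulting rows have a fixed set of columns, so they can be loaded into a table and queried like any other structured data.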
Limitations of Data Warehousing
Here is a quick look at the limitations of data warehousing in analyzing different types of data:
Structured Data
- Required data not captured - The source systems and/or OLTP systems might not capture all the data that a warehouse needs to generate useful reports.
- Data homogenization - The basic principle of a data warehouse requires data from all sources to be in a similar format. Sometimes, this results in leaving out important data points that do not conform to the common format.
Processing unstructured data in a data warehouse adds further limitations on top of those listed above:
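The homogenization limitation can be sketched as follows: records from different sources are mapped onto one common schema, and any field without a place in that schema is left out. The schema, the field names and the sample record are all hypothetical.

```python
# Hypothetical common schema that every source must conform to.
COMMON_SCHEMA = ("customer_id", "order_date", "amount")


def conform(record, field_map):
    """Rename source fields to the common schema.

    Returns the conformed row and a sorted list of fields that were dropped
    because the common schema has no column for them.
    """
    renamed = {field_map.get(key, key): value for key, value in record.items()}
    row = {field: renamed.get(field) for field in COMMON_SCHEMA}
    dropped = sorted(key for key in renamed if key not in COMMON_SCHEMA)
    return row, dropped


if __name__ == "__main__":
    crm_record = {
        "cust": 17,
        "date": "2023-01-05",
        "total": 42.0,
        "loyalty_tier": "gold",
    }
    crm_map = {"cust": "customer_id", "date": "order_date", "total": "amount"}
    row, dropped = conform(crm_record, crm_map)
    print(row)
    print("dropped:", dropped)
```

In this example the loyalty_tier field is lost during loading, even though it might have been valuable for reporting.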
Unstructured Data
- Data in textual format, images and other unstructured formats is hard for traditional data warehousing applications to read and analyze.
- There is a lot more data to clean, filter, normalize, transform and load into the Enterprise Data Warehouse.
- The algorithms for textual analysis are generally more complex than those for numerical analysis.
Future of Data Warehousing
Since an estimated 80% of the world's data is unstructured and most of it is text, data warehousing applications need a means of processing this text data to provide analysis and reports. Integrating unstructured text data with a data warehouse involves multiple steps, including building an unstructured database, loading data into the database, creating a relational data structure, and creating a probabilistic foundation for matching unstructured and structured data.
The DW/BI industry has recognized the need for this kind of data warehouse technology, which experts call Data Warehouse 2.0 (DW 2.0).
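The last step, matching unstructured and structured data, can be sketched with a simple similarity score. The example below uses difflib.SequenceMatcher from Python's standard library as a stand-in. Its ratio is a string-similarity measure, not a true probability, and the customer names and the threshold are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(mention, candidates, threshold=0.6):
    """Link a name found in free text to the closest structured record.

    Returns (candidate, score) for the most similar candidate, or None
    when nothing reaches the threshold.
    """
    def score(name):
        return SequenceMatcher(None, mention.lower(), name.lower()).ratio()

    best = max(candidates, key=score, default=None)
    if best is None or score(best) < threshold:
        return None
    return best, round(score(best), 2)


if __name__ == "__main__":
    # Names as they might appear in a structured customer table.
    customers = ["Acme Corporation", "Globex Inc", "Initech LLC"]
    # A name as it might appear in an email or support ticket.
    print(best_match("acme corp", customers))
```

A production system would likely combine several signals rather than rely on one string score, but the idea is the same: each candidate link carries a confidence value, and low-confidence links are rejected.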
Here is a pictorial representation of DW 2.0:
Industry experts predict the following for the future of data warehousing:
- Operational data warehouses: real-time analytics
- Processing data and running analytics in the cloud will soon be a requirement
- Big data projects will start with data warehouse optimization
Conclusion
Although the rise of unstructured data has made data warehousing more complex, data warehousing still has the potential to expand its services and adapt to the latest technologies.
References:
Krish Krishnan, “Integrating Unstructured Data in the Data Warehouse: Unlocking Business Value”