Patterns swirl all around us. We are immersed in them. Some are easily observable, like the dawn of a new day or a baseball player collecting a base hit, while others are much more subtle and can only be found if we search for them. Every occurrence has a pattern. But here is the thing… if we were able to capture enough data about the pattern we want to study, we could use that data to predict its occurrence in the future. Think about that for a moment. The ability to do this is the essence of data science.
While many of the principles and methods of data science have been in use for decades, a perfect storm has led to the explosive growth of the practice: cheaper, more powerful hardware; an explosion in the volume of data collected by the likes of Google, Facebook and Amazon; and the realization that there is huge value in mining that data for profit. A formal definition would describe data science as a collection of principles and processes used to identify useful (and not always obvious) patterns in data. But how does this apply to running your business in the real world? To better understand that, we first need to take a closer look at how we used data, pre-data science, to drive our decision making.
The foundation of data capture for the vast majority of companies is their online transaction processing (OLTP) systems. Regardless of your industry, these systems record every transaction your company makes, along with all of that transaction's attributes. For a sales transaction, for example, the product(s), quantities, price, location, and any number of other attributes you might find valuable are recorded. Reports can be generated from your OLTP system, such as a sum of all sales for a given day. Or, if your company's product is process-oriented, like the origination of a mortgage loan, your OLTP system will track the progress of the loan through the various steps. In this example, your OLTP system could provide you with operational reports that call out loans that are experiencing delays in the process or that might have errors, based on rulesets you have defined in your reports.
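To make the daily sales report concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns and the sample rows are all assumptions made up for illustration; a real OLTP system would spread this data across many more tables.

```python
import sqlite3

# Hypothetical, heavily simplified OLTP table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "sale_date TEXT, product TEXT, quantity INTEGER, "
    "unit_price REAL, location TEXT)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [
        ("2024-01-02", "widget", 3, 10.0, "Denver"),
        ("2024-01-02", "gadget", 1, 25.0, "Boulder"),
        ("2024-01-03", "widget", 2, 10.0, "Denver"),
    ],
)
conn.commit()


def daily_sales_total(connection, sale_date):
    """Sum quantity * unit_price over every sale recorded on one day."""
    row = connection.execute(
        "SELECT COALESCE(SUM(quantity * unit_price), 0) "
        "FROM sales WHERE sale_date = ?",
        (sale_date,),
    ).fetchone()
    return row[0]


print(daily_sales_total(conn, "2024-01-02"))
```

The same query shape, with a different WHERE clause, would cover the "loans that are delayed" style of operational report.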
For many reasons, however, running reports and conducting analysis from your OLTP system is not optimal. Transactional systems are very good at capturing complete and accurate data sets for each transaction through integrity rules. A referential integrity rule, for example, would prevent you from entering a mortgage loan into your system unless it points to a borrower that already exists there, and required-field rules would prevent you from saving the loan without a property address, loan amount, note rate, etc. Additional validation rules that you put in place would also prevent you from entering invalid values, like a negative note rate or a loan amount of zero. Enforcing these rules, along with the normalized table design that transactional systems favor, results in a very complex database structure with numerous tables and not-so-intuitive ways for your analysts to draw data from them all.
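The kinds of rules described above can be sketched with Python's built-in sqlite3 module. The `borrower` and `loan` tables and the helper `try_insert_loan` are hypothetical names chosen for this illustration, not the schema of any particular loan system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE borrower (borrower_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute(
    """
    CREATE TABLE loan (
        loan_number      TEXT PRIMARY KEY,
        borrower_id      INTEGER NOT NULL REFERENCES borrower(borrower_id),
        property_address TEXT NOT NULL,
        loan_amount      REAL NOT NULL CHECK (loan_amount > 0),
        note_rate        REAL NOT NULL CHECK (note_rate >= 0)
    )
    """
)
conn.execute("INSERT INTO borrower VALUES (1, 'A. Borrower')")
conn.commit()


def try_insert_loan(row):
    """Return True if the database accepts the loan, False if a rule rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO loan VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert_loan(("L-1", 1, "1 Main St", 250000.0, 6.5)))   # valid loan
print(try_insert_loan(("L-2", 99, "2 Main St", 250000.0, 6.5)))  # unknown borrower
print(try_insert_loan(("L-3", 1, "3 Main St", 0, 6.5)))          # zero loan amount
```

Even in this tiny example the data already lives in two tables that must be joined to answer a simple question, which hints at why analysts find transactional schemas hard to query.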
The answer to this limitation was the data warehouse. A data warehouse is a data repository that is designed specifically to support the analysis of data. Typically, your data warehouse would take nightly 'snapshots' of your OLTP systems (and other pertinent data sources) and load them into a much more user-friendly database structure. The data warehouse would appear to have far fewer tables than your OLTP system, with table and field names that a business analyst could easily understand. The data warehouse would also resolve formatting differences across the various data sources being loaded into it and 'cleanse' the data. Cleansing includes dealing with things like erroneous values and missing values. The creation of the data warehouse gave the business units (as opposed to only IT) the ability to query and analyze data using their domain knowledge. A well-designed data warehouse is an absolute necessity for any business. It will serve as the foundation of analytic work within your business.
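As a rough illustration of what the cleansing step might involve, the sketch below normalizes a single incoming record. The field names (`amount`, `date`, `region`), the two accepted date formats and the choice to reject a record rather than repair it are all assumptions made for this example.

```python
from datetime import datetime


def cleanse(record):
    """Return a cleaned copy of a raw record, or None if it cannot be used."""
    # Erroneous values: strip currency formatting, reject non-numeric
    # or non-positive amounts.
    try:
        amount = float(str(record.get("amount")).replace("$", "").replace(",", ""))
    except ValueError:
        return None
    if amount <= 0:
        return None

    # Formatting differences: accept two source date formats, emit one.
    raw_date = record.get("date") or ""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            parsed = datetime.strptime(raw_date, fmt)
            break
        except ValueError:
            continue
    else:
        return None  # no known format matched

    # Missing values: fall back to an explicit placeholder.
    region = (record.get("region") or "UNKNOWN").strip().upper()

    return {
        "amount": amount,
        "date": parsed.strftime("%Y-%m-%d"),
        "region": region,
    }


print(cleanse({"amount": "$1,250.50", "date": "01/15/2024", "region": " west "}))
```

In practice, rejected records would usually be logged or routed to a review queue rather than silently dropped.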
The analysis-friendly design of data warehouses supported the implementation of online analytical processing (OLAP) applications, which allow end users to interactively analyze data. With OLAP, end users could select aggregated fields (e.g., sales amount) and drill down on them by various attributes (e.g., sales district, sales branch, salesperson). OLAP in turn facilitated the development of a multitude of powerful business intelligence tools with increased interactive and analytic capabilities, advanced data visualization (charts and graphs) functionality and dashboards.
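The drill-down idea can be sketched in a few lines of plain Python. The `SALES` rows and the `aggregate` helper are invented for illustration; a real OLAP tool would run this kind of grouping against the warehouse, usually over pre-aggregated data.

```python
from collections import defaultdict

# Hypothetical warehouse rows: one row per sale.
SALES = [
    {"district": "North", "branch": "Denver", "salesperson": "Ana", "amount": 100.0},
    {"district": "North", "branch": "Denver", "salesperson": "Ben", "amount": 50.0},
    {"district": "North", "branch": "Boulder", "salesperson": "Cy", "amount": 70.0},
    {"district": "South", "branch": "Pueblo", "salesperson": "Di", "amount": 30.0},
]


def aggregate(rows, dimensions):
    """Total the sales amount for each combination of the given attributes."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["amount"]
    return dict(totals)


# Start at the top level, then drill down one attribute at a time.
print(aggregate(SALES, ["district"]))
print(aggregate(SALES, ["district", "branch"]))
print(aggregate(SALES, ["district", "branch", "salesperson"]))
```

Each additional attribute in the list is one more level of drill-down on the same aggregated field.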
A company that has a well-designed data warehouse and fully developed business intelligence functionality has everything in place to analyze things… that have occurred in the past. But, back to our patterns, how do we take the next step and use the data we have accumulated over the years to help us predict future occurrences? We employ data science.
The genesis of today's data science was web-based companies collecting ever more data about their customers and their customers' behavior. With this data in hand, they began performing predictive behavior analysis on it. The most relatable example of this would be the "Frequently bought together" suggestion you see on Amazon, a recommendation built from the items that past customers purchased together. Another familiar example is the "Friend suggestion" on Facebook. Predictive analysis is the component of data science that businesses are most interested in. It is, however, only one part of the umbrella term data science.
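One simple way to produce a "bought together" style suggestion is to count how often pairs of items appear in the same order. The sketch below does only that; it is an illustration of the idea, not a description of how Amazon's system actually works, and the `orders` data and function names are made up.

```python
from collections import Counter
from itertools import combinations


def pair_counts(orders):
    """Count how many orders contain each unordered pair of items."""
    counts = Counter()
    for order in orders:
        # sorted(set(...)) removes duplicates and gives each pair a stable key.
        for pair in combinations(sorted(set(order)), 2):
            counts[pair] += 1
    return counts


def bought_together(orders, item, top_n=3):
    """Return the items that most often share an order with `item`."""
    scores = Counter()
    for (a, b), n in pair_counts(orders).items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]


orders = [
    ["camera", "sd card", "tripod"],
    ["camera", "sd card"],
    ["camera", "case"],
    ["tripod", "case"],
]
print(bought_together(orders, "camera"))
```

Production systems typically go further, for example by adjusting for how popular each item is overall, but raw co-occurrence counts are a common starting point.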
In addition to predictive analysis and the algorithms it uses, data science also includes all of the logistical and technical challenges of running these algorithms in real time on petabytes of data stored across numerous servers (Big Data). Being able to process huge amounts of data quickly requires applications that can manage the simultaneous processing of data across multiple servers (Hadoop MapReduce, Spark). Other significant components of data science include data sourcing and preparation (this critical process typically takes the most time and resources during an implementation), machine learning (ML), data visualization and, in some cases, artificial intelligence (AI).
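The programming model behind MapReduce can be shown with a toy word count. This sketch runs in a single Python process and uses no Hadoop or Spark APIs; the three function names are illustrative, and in a real cluster each chunk would be mapped on a different server.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn each line of one data chunk into (word, 1) pairs."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key across every chunk."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    # Each chunk stands in for the slice of data held by one server.
    return reduce_phase(shuffle(map_phase(chunk) for chunk in chunks))


chunks = [["big data big"], ["data science"]]
print(word_count(chunks))
```

Because the map step looks at each chunk independently, the work can be spread across as many machines as there are chunks, which is what makes the approach scale.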
Data science, in its pursuit of predictive analysis, also looks at a much larger scope of data. With our OLTP systems and data warehouses, we are looking at structured data that is neatly held in tables for us. In data science, we have the tools and methods to also examine unstructured data. Examples of unstructured data include emails, Word documents, spreadsheets, texts, videos and webpages. Think about the potential advantages of doing this within the context of your business.
Many of you who are reading this article may not need data science in its entirety. But clearly, there are components of it, such as predictive analysis and the expansion of your data sets to include unstructured data, that companies of any size could benefit from. Many of the predictive analysis methods that are used in big data are embedded as functions within the relational database management systems (RDBMSs) that your data warehouses are housed on (Oracle, SQL Server). Conceivably, your company could fairly easily start running predictive analysis routines straight from your data warehouse.
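To give a feel for what a very basic predictive routine looks like, here is a sketch that fits a straight-line trend to monthly sales totals and projects the next month. It is written in plain Python rather than with any database's built-in functions, and the monthly figures are invented; treat it as an illustration of the idea only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Project the fitted line to a new x value."""
    return slope * x + intercept


# Hypothetical monthly sales totals pulled from a warehouse.
months = [1, 2, 3, 4]
sales = [100.0, 110.0, 120.0, 130.0]

slope, intercept = fit_line(months, sales)
print(predict(slope, intercept, 5))
```

Real sales data is rarely this tidy, so a production model would also need to account for seasonality, outliers and the uncertainty of the forecast.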
I hope you have found this summary of data science useful. I would urge you to think creatively as you look at how your company could benefit from data science. Please feel free to email me with questions or set up an introductory session, if you would like to dig deeper. Stay tuned for a future blog post where we do a deeper dive on the tools of predictive analysis.