The Advent of Analytics Engineering

Data science has exploded in recent years, and whether the focus is machine learning, artificial intelligence, or citizen data science, the discipline is creating very high expectations.

There is indeed much promise for data science, where predictive models and decision engines can detect skin cancer in patient imagery, presciently recommend a new product that piques your interest, or power your self-driving car to evade a potential accident.

However, that promise requires much effort to realize. It takes a lot of work and brand new engineering disciplines that are not yet mature or widely employed. As recognition of the value of data science grows, and the generation of data increases at exponential rates, this engineering effort is beginning and will soon grow beyond its adolescence.

This is why we are at the advent of a new engineering discipline that can truly realize the promise of data science – a discipline that I call “analytics engineering”.

Consider the parallels.

Computer science became an established academic discipline in the 1950s and 1960s and serves as a foundation for today’s technology industry, articulating the theory and application of algorithms, computation, logic, and information representation in building real computing devices. Its applications have included code breaking in World War II and the creation of ENIAC and the IBM mainframe.

However, something happened in the 1970s that democratized the development of software – the announcement of the Altair. At the time, Intel sold a microprocessor for $10,000, but the Altair cost only $400. At this price, the microcomputer became accessible to individuals – geeks who wanted to build their own computers. Clubs started meeting in Silicon Valley, such as the Homebrew Computer Club and the Altair Users Group, to show what could be done with these computers and how they could be programmed.

Hackers took hold of an industry, and an explosion of innovation ensued. Steve Jobs and Steve Wozniak formed Apple, Bill Gates and Paul Allen launched Microsoft, and the personal computer was born.

Eventually, as industries were created and matured, a strong engineering discipline came to computer programming – a discipline we now refer to as software engineering.

Data science involves the analysis of data and the application of sound approaches to gain meaning or insight from it. There are several fundamentals to the field of data science, which I’ve elaborated on before.

That said, the maturity of data science as a discipline is following a similar trajectory to that of software. Top universities such as Columbia, Cornell, and the University of California, Berkeley now offer programs and degrees in data science, establishing the academic discipline.

With prototyping languages such as R and Python, both free to download, literally anyone can start programming, working with data, and applying data science principles. The barrier to entry for becoming a data scientist is now nearly zero.

However, just because someone can do something doesn’t mean they can do it well. Becoming a true practitioner is important, and learning the disciplines of a craft through experience and hard work is a must. Additionally, firms that leverage data science capabilities cannot afford to deploy a 24/7 operational capability based on a model developed for free on someone’s laptop. More engineering specialty is required, which is where the industry is heading, just as software and other engineering disciplines have gone before.

Throughout the history of innovation, this maturity curve has followed a common path: a great surge in capability and creativity, supported by solid engineering practice. With mechanical engineering, the Industrial Revolution was launched. Electrical engineering brought advancements such as electric power, radio, and television. Software engineering, of course, ushered in the age of computing and the internet. What will analytics engineering bring? Possibly what’s needed to support the age of artificial intelligence.

Analytics Engineering encompasses a key set of specialties that are not yet in common practice. Some promise that with data science, all one needs to do is “just push a button and the models get developed”. Others say that since there are many different models we could try, we can run a thousand of them on the data, evaluate them all, and “pick the best one”.

These approaches are symptoms of technology hype, so we should take them with a grain of salt.

For example, even after many years of developing computer applications, software still isn’t written with just a “push of a button”. Engineering practices are still needed (and need to be followed) for quality software to ship. Sure – hackers can prototype something quickly and demonstrate truly innovative capabilities. However, for that work to scale, be reliable, and ultimately run without frequent failures, engineering disciplines must be employed.

In this new age, true analytics engineering disciplines are what is needed, tailored to the needs of analytics and decision modeling.

Data Science isn’t magic, and never will be.  Yet, more focused analytics engineering disciplines can be developed to become part of decision model development and improvement moving forward.  The promise of data science, machine learning, and artificial intelligence will depend on this trend, which makes this an exciting time for the industry.

Why is this important for data science? Imagine a ROC curve where the false positive rate is very low – say 1% – at an acceptably high true positive rate.

Are we satisfied?  Consider a decision model – say, one that identifies high-risk customers at a financial institution – run on a database of 1 million customers.  A false positive rate of 1% would still flag roughly 10,000 customers for review, purely in error, since they are falsely identified as high risk.  When you are working at scale, with millions if not billions of tests run through your decision models, those models need to demonstrate incredibly low false positive rates to be worth using.
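
To make the arithmetic concrete, here is a minimal Python sketch; the population size, base rate, and model rates are illustrative assumptions, not figures from any real institution:

```python
# Illustrative only: how a small false positive rate scales with population size.
n_customers = 1_000_000      # hypothetical customer database
base_rate = 0.001            # assumed fraction of truly high-risk customers
fpr = 0.01                   # false positive rate of the decision model
tpr = 0.95                   # true positive rate of the decision model

n_positive = int(n_customers * base_rate)   # truly high-risk customers
n_negative = n_customers - n_positive       # everyone else

false_positives = n_negative * fpr   # innocent customers flagged in error
true_positives = n_positive * tpr    # high-risk customers correctly flagged

print(f"Flagged in error:  {false_positives:,.0f}")   # ~9,990
print(f"Correctly flagged: {true_positives:,.0f}")    # ~950
```

Note that the erroneous flags outnumber the correct ones by roughly ten to one, even with a model that looks excellent on paper.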

As Analytics Engineering matures, here are some of the developments that we can expect:

– New metrics will be developed to compare model performance in more accurate ways, superseding effective yet crude metrics such as Area Under the Curve (or AUC); the sketch after this list illustrates why AUC alone can mask performance in the region that matters.

– New analysis techniques will be leveraged to focus on insights gained from the tails of statistical distributions, which are the true drivers of false positive rates in decision models.

– Tools and technologies will be created and matured to manage models, control versions, and audit changes in model development and deployment.

– Standards, similar to CMMI or Agile in the software engineering world, will be developed and gain traction to provide for more explicit best practices around the creation, management, and engineering of decision models.
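
To illustrate the first point, here is a hedged numpy sketch using synthetic Gaussian scores (not real model output): the overall AUC looks healthy, yet the true positive rate achievable in the low false positive region – the region that matters at scale – is far less impressive.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical scores: 100k negatives ~ N(0,1), 1k positives ~ N(2,1).
neg = rng.normal(0.0, 1.0, 100_000)
pos = rng.normal(2.0, 1.0, 1_000)

# Sweep thresholds from high to low so FPR and TPR both increase.
thresholds = np.linspace(6.0, -6.0, 1_000)
fpr = np.array([(neg >= t).mean() for t in thresholds])
tpr = np.array([(pos >= t).mean() for t in thresholds])

# Trapezoidal area under the full ROC curve (the usual AUC).
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)

# Operating region that matters at scale: FPR <= 1%.
low = fpr <= 0.01
print(f"AUC over the whole curve:  {auc:.3f}")        # roughly 0.92
print(f"TPR achievable at FPR<=1%: {tpr[low].max():.3f}")  # roughly 0.37
```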

Companies such as Netflix, Tesla, Apple, Amazon, Google, Facebook, and others are already developing these disciplines in-house, as the success of their respective business models demands this advancement.  However, other businesses will soon need to leverage these capabilities to keep pace.

It’s an exciting time to recognize and help define what this new engineering discipline will become.  For data science, it’s currently like the Wild West of old – wide expanses, plenty of room to “stake your claim”, and a rush to “get in on” a hot field.  That said, we aren’t all cowboys, and the West is now being tamed.

Welcome to the advent of Analytics Engineering.

The Fundamentals of Data Science

Two of the biggest buzzwords in our industry are “big data” and “data science”. Big Data seems to have a lot of interest right now, but Data Science is fast becoming a very hot topic.

I think there’s room to really define the science of data science – what are the fundamentals needed to make data science truly a science we can build upon?

Below are my thoughts for an outline for such a set of fundamentals:

Fundamentals of Data Science

Introduction

The easiest thing for people within the big data / analytics / data science disciplines is to say “I do data science”. However, when it comes to data science fundamentals, we need to ask the following critical questions: What really is “data”, what are we trying to do with data, and how do we apply scientific principles to achieve our goals with data?

– What is Data?
– The Goal of Data Science
– The Scientific Method

Probability and Statistics

The world is a probabilistic one, so we work with data that is probabilistic – meaning that, given a certain set of preconditions, data will appear to you in a specific way only part of the time.  To apply data science properly, one must become familiar and comfortable with probability and statistics.

– The Two Characteristics of Data
– Examples of Statistical Data
– Introduction to Probability
– Probability Distributions
– Connection with Statistical Distributions
– Statistical Properties (Mean, Mode, Median, Moments, Standard Deviation, etc.)
– Common Probability Distributions (Discrete, Binomial, Normal)
– Other Probability Distributions (Chi-Square, Poisson)
– Joint and Conditional Probabilities
– Bayes’ Rule (see the worked example after this outline)
– Bayesian Inference
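
As a small worked example of Bayes’ Rule, here is a sketch with numbers I’ve chosen purely for illustration (a rare condition and an imperfect test, not data from any study):

```python
# Minimal Bayes' Rule sketch with illustrative numbers: a test for a rare
# condition with a 1% base rate, 95% sensitivity, and a 2% false positive rate.
p_condition = 0.01                # prior: P(condition)
p_pos_given_condition = 0.95      # likelihood: P(positive | condition)
p_pos_given_no_condition = 0.02   # P(positive | no condition)

# Total probability of a positive result.
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_no_condition * (1 - p_condition))

# Bayes' Rule: P(condition | positive).
posterior = p_pos_given_condition * p_condition / p_pos
print(f"P(condition | positive) = {posterior:.3f}")  # ~0.324
```

Even with a strong test, the posterior is only about one in three – the probabilistic nature of data at work.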

Decision Theory

This section is one of the key fundamentals of data science.  Whether applied in scientific, engineering, or business fields, we are trying to make decisions using data.  Data itself isn’t useful unless it’s telling us something, and we must decide what it is telling us.  How do we come up with those decisions?  What factors go into this decision making process?  What is the best method for making decisions with data?  This section tells us (and a small hypothesis testing sketch follows the outline)…

– Hypothesis Testing
– Binary Hypothesis Test
– Likelihood Ratio and Log Likelihood Ratio
– Bayes Risk
– Neyman-Pearson Criterion
– Receiver Operating Characteristic (ROC) Curve
– M-ary Hypothesis Test
– Optimal Decision Making
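
Here is a minimal sketch of a binary hypothesis test using the log-likelihood ratio, assuming two unit-variance Gaussian hypotheses with parameters chosen only for illustration:

```python
import numpy as np

# Binary hypothesis test: H0: x ~ N(0,1) vs H1: x ~ N(1,1).
# For equal-variance Gaussians the log-likelihood ratio is linear in x.
mu0, mu1, sigma = 0.0, 1.0, 1.0

def log_likelihood_ratio(x):
    # log[ p(x|H1) / p(x|H0) ]
    return ((x - mu0) ** 2 - (x - mu1) ** 2) / (2 * sigma ** 2)

rng = np.random.default_rng(1)
samples_h0 = rng.normal(mu0, sigma, 100_000)
samples_h1 = rng.normal(mu1, sigma, 100_000)

threshold = 0.0  # decide H1 when LLR > 0 (equal priors, Bayes-optimal)
fpr = (log_likelihood_ratio(samples_h0) > threshold).mean()
tpr = (log_likelihood_ratio(samples_h1) > threshold).mean()
print(f"FPR: {fpr:.3f}, TPR: {tpr:.3f}")  # roughly 0.31 and 0.69
```

Sweeping the threshold instead of fixing it at zero traces out the ROC curve listed above.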

Estimation Theory

Sometimes we make characterizations of data – averages, parameter estimates, etc.  Estimation from data is essentially an extension of decision making, making this a natural next section after Decision Theory (a short maximum likelihood sketch follows the outline).

– Estimation as Extension of M-ary Hypothesis Test
– Unbiased Estimation
– Minimum Mean Square Error (MMSE)
– Maximum Likelihood Estimation (MLE)
– Maximum A Posteriori Estimation (MAP)
– Kalman Filter
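
A short maximum likelihood sketch under an assumed Gaussian model (the true parameters below are chosen only for illustration):

```python
import numpy as np

# For i.i.d. Gaussian data, the MLE of the mean is the sample mean and the
# MLE of the variance is the (biased) sample variance that divides by n.
rng = np.random.default_rng(2)
true_mu, true_sigma = 3.0, 2.0
data = rng.normal(true_mu, true_sigma, 10_000)

mu_mle = data.mean()                      # argmax of the Gaussian likelihood
var_mle = ((data - mu_mle) ** 2).mean()   # divides by n, hence biased

print(f"MLE mean:     {mu_mle:.3f} (true {true_mu})")
print(f"MLE variance: {var_mle:.3f} (true {true_sigma**2})")
```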

Coordinate Systems

To bring various data elements together into a common decision making framework, we need to know how to align the data.  Knowledge of coordinate systems and how they are used becomes important to lay a solid foundation for bringing disparate data together.

– Introduction to Coordinate Systems
– Euclidean Spaces
– Orthogonal Coordinate Systems
– Properties of Orthogonal Coordinate Systems (angle, dot product, coordinate transformations, etc.)
– Cartesian Coordinate System
– Polar Coordinate System
– Cylindrical Coordinate System
– Spherical Coordinate System
– Transformations Between Coordinate Systems (sketched below)
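
A minimal sketch of one such transformation, between Cartesian and polar coordinates:

```python
import math

# Transforming between Cartesian and polar coordinates, a simple case of
# aligning data expressed in different systems.
def cartesian_to_polar(x, y):
    r = math.hypot(x, y)        # radial distance
    theta = math.atan2(y, x)    # angle in radians, correct in all quadrants
    return r, theta

def polar_to_cartesian(r, theta):
    return r * math.cos(theta), r * math.sin(theta)

r, theta = cartesian_to_polar(3.0, 4.0)
print(r, math.degrees(theta))        # 5.0, ~53.13 degrees
print(polar_to_cartesian(r, theta))  # back to (3.0, 4.0)
```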

Linear Transformations

Once we understand coordinate systems, we can learn how transforming the data gets at the underlying information.  This section describes how we can transform our data into other useful data products through various types of transformations, including the popular Fourier transform (a small sketch follows the outline).

– Introduction to Linear Transformations
– Properties of Linear Transformations
– Matrix Multiplication
– Fourier Transform
– Properties of Fourier Transforms (time-frequency relationship, shift invariance, spectral properties, Parseval’s Theorem, Convolution Theorem, etc.)
– Discrete and Continuous Fourier Transforms
– Uncertainty Principle and Aliasing
– Wavelet and Other Transforms
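
A small sketch of the discrete Fourier transform using numpy, including a Parseval’s Theorem check; the 5 Hz signal and 100 Hz sampling rate are illustrative assumptions:

```python
import numpy as np

# A noisy 5 Hz sinusoid sampled at 100 Hz for one second.
fs = 100.0
t = np.arange(0, 1.0, 1.0 / fs)
rng = np.random.default_rng(3)
signal = np.sin(2 * np.pi * 5.0 * t) + 0.5 * rng.normal(size=t.size)

# One-sided DFT of the real signal; find the dominant frequency.
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(t.size, d=1.0 / fs)
peak = freqs[np.argmax(np.abs(spectrum))]
print(f"Dominant frequency: {peak:.1f} Hz")  # ~5.0

# Parseval's Theorem: energy in time equals energy in frequency.
time_energy = np.sum(signal ** 2)
freq_energy = np.sum(np.abs(np.fft.fft(signal)) ** 2) / t.size
print(np.isclose(time_energy, freq_energy))  # True
```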

Effects of Computation on Data

An often overlooked aspect of data science is the impact that the algorithms we apply have on the information we are seeking.  Merely applying algorithms and computations to create analytics and other data products affects the effectiveness of our data-driven decision making.  This section takes us on a journey through more advanced aspects of data science (a small simulation follows the outline).

– Mathematical Representation of Computation
– Reversible Computations (Bijective Mapping)
– Irreversible Computations
– Impulse Response Functions
– Transformation of Probability Distributions (due to addition, subtraction, multiplication, division, arbitrary computations, etc.)
– Impacts on Decision Making
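
A minimal simulation of how a computation transforms a probability distribution: adding two independent uniform variables yields a triangular distribution, so a decision threshold tuned for the flat inputs no longer fits the output.

```python
import numpy as np

# Inputs are flat on [0, 1]; their sum concentrates around 1.0.
rng = np.random.default_rng(4)
u1 = rng.uniform(0, 1, 1_000_000)
u2 = rng.uniform(0, 1, 1_000_000)
s = u1 + u2  # the computation applied to the data

# Crude text histogram of the resulting triangular distribution.
hist, edges = np.histogram(s, bins=10, range=(0, 2), density=True)
for lo, density in zip(edges[:-1], hist):
    print(f"{lo:.1f}-{lo + 0.2:.1f}: {'#' * int(density * 20)}")
```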

Prototype Coding / Programming

One of the key elements of data science is the willingness of practitioners to “get their hands dirty” with data.  This means being able to write programs that access, process, and visualize data in languages important to science and industry.  This section takes us on a tour of these important elements (a tiny end-to-end sketch follows the outline).

– Introduction to Programming
– Data Types, Variables, and Functions
– Data Structures (Arrays, etc.)
– Loops, Comparisons, If-Then-Else
– Functions
– Scripting Languages vs. Compiled Languages
– SQL
– SAS
– R
– Python
– C++
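
As a tiny end-to-end illustration in Python of the access-process-summarize loop (the file name and column are hypothetical stand-ins, not part of any real dataset):

```python
import csv
import statistics

# Read a hypothetical CSV file with a numeric "value" column.
with open("measurements.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Process: convert the assumed column to numbers.
values = [float(row["value"]) for row in rows]

# Summarize: basic statistics of the data.
print(f"n     = {len(values)}")
print(f"mean  = {statistics.mean(values):.2f}")
print(f"stdev = {statistics.stdev(values):.2f}")
```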

Graph Theory

Graphs illustrate connections between different data elements, and they are important in today’s interconnected world (a minimal adjacency-list sketch follows the outline).

– Introduction to Graph Theory
– Undirected Graphs
– Directed Graphs
– Various Graph Data Structures
– Route and Network Problems
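
A minimal sketch of an undirected graph stored as an adjacency list, with breadth-first search solving a shortest path (route) problem; the node names are arbitrary:

```python
from collections import deque

# Undirected graph as an adjacency list.
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

def shortest_path(graph, start, goal):
    # BFS explores nodes in order of distance (edge count) from the start.
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph[path[-1]]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no route exists

print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```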

Algorithms

Key to data science is understanding the use of algorithms to compute important data-derived metrics.  Popular data manipulation algorithms are included in this section (a binary search sketch follows the outline).

– Introduction to Algorithms
– Recursive Algorithms
– Serial, Parallel, and Distributed Algorithms
– Exhaustive Search
– Divide-and-Conquer (Binary Search)
– Gradient Search
– Sorting Algorithms
– Linear Programming
– Greedy Algorithms
– Heuristic Algorithms
– Randomized Algorithms
– Shortest Path Algorithms for Graphs
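
As one divide-and-conquer example, here is a minimal binary search sketch, which runs in O(log n) by halving the search interval at each step:

```python
def binary_search(items, target):
    # Assumes items is sorted in ascending order.
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid
        if items[mid] < target:
            lo = mid + 1   # target can only be in the upper half
        else:
            hi = mid - 1   # target can only be in the lower half
    return -1              # not found

data = [2, 3, 5, 7, 11, 13, 17, 19]
print(binary_search(data, 11))  # 4
print(binary_search(data, 4))   # -1
```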

Machine Learning

No data science fundamentals course would be complete without exposure to machine learning.  However, it’s important to know that these techniques build upon the fundamentals described in previous sections.  This section gives practitioners an understanding of useful and popular machine learning techniques and why they are applied.

– Introduction to Machine Learning
– Linear Classifiers (Logistic Regression, Naive Bayes Classifier, Support Vector Machines)
– Decision Trees (Random Forests)
– Bayesian Networks
– Hidden Markov Models
– Expectation-Maximization
– Artificial Neural Networks and Deep Learning
– Vector Quantization
– K-Means Clustering (sketched below)
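
As a closing illustration, here is a minimal k-means sketch in numpy; the two synthetic blobs and all parameters are chosen purely for illustration:

```python
import numpy as np

# Two well-separated Gaussian blobs, clustered with k = 2.
rng = np.random.default_rng(5)
blob1 = rng.normal([0, 0], 0.5, size=(100, 2))
blob2 = rng.normal([4, 4], 0.5, size=(100, 2))
X = np.vstack([blob1, blob2])

k = 2
centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers

for _ in range(20):
    # Assignment step: label each point with its nearest center.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: move each center to the mean of its assigned points.
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    if np.allclose(new_centers, centers):
        break  # converged
    centers = new_centers

print(centers.round(2))  # approximately [[0, 0], [4, 4]] in some order
```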

Question:  Do you have any thoughts on the fundamentals of data science? You can leave a comment below.

A Data Science Lesson from Richard Feynman

Richard Feynman is one of the greatest scientific minds, and what I love about him, aside from his brilliance, is his perspective on why we perform science.   I’ve been reading the compilation of short works of Feynman titled The Pleasure of Finding Things Out, and I recently came across a section that really hit home with me.

In the world of data science, much is made of the algorithms used to work with data, such as random forests or k-means clustering.  However, I believe there is a missing component – one that deals with the fundamentals underlying data science, and that is the real science of data science.

10 Things To Know When Hiring Data Scientists

I’ve been doing data science since before there was a field called “data science”, so I’ve had the opportunity to work with and hire a lot of great people.  But if you’re trying to hire a data scientist, how do you know what to look for, and what should you consider in the interview process?

I’ve been doing what is now called “data science” since the early 1990s and have helped to hire numerous scientists and engineers over the years.  The teams I’ve had the opportunity to work with are some of the best in the world, tackling some of the most challenging problems facing our country.  These folks are also some of the smartest people I’ve ever had the opportunity to work with.

That said, not everyone is a good fit, and the discipline of data science requires certain key elements.  Hiring someone onto your team is incredibly important to your business, especially if you’re a small startup or building a critical internal data science team; mistakes can be expensive in both time and money.  This can be even more intimidating if you don’t have the background or experience in hiring scientists, especially someone responsible for this new discipline of working with data.

The Best Way To Learn New Things

Science and business seem like two very different disciplines, but is the best approach to learning any different in these two fields?  These areas of life seem so distinct, and the people in them can be quite different (one with the nerdy pocket protector, the other in the well-tailored suit).  However, both science and business require learning, and the best approach to learning in each is really the same.

The best approach to learning is generally through failure.  For example, Thomas Edison failed an astounding number of times before he invented a working lightbulb, and there are likely thousands of stories about how successes came as a result of many tries and many failures.

In many ways, this is really an application of the scientific method.  I’ve written a number of posts about Stephen Wolfram (such as using Wolfram|Alpha to look at your own social network, his views on big data, computing a theory of everything, and how he created his company).  In the effort to learn even more about how the world works, Wolfram has pushed scientific discovery to the next level, which he’s done with his book A New Kind of Science (NKS for short).

Beating Cancer and Favoritism with Data

I read a couple of items in this month’s Fortune magazine that I thought were worth passing along.

The first was a small article by Brian Dumaine about the work being done at Applied Proteomics to identify cancer before it develops.  At Applied Proteomics, they use mass spectrometry to capture and catalog 360,000 different pieces of protein found in blood plasma, and then let supercomputers crunch the data to identify anomalies associated with cancer.  The company has raised $57 million in venture capital and is backed by Microsoft co-founder Paul Allen.  You can read the first bit of the article here.

The second is from the Word Check callout, showing how access to information is making the world a better place:

wasa: Pronounced [wah-SUH]

(noun) Arabic slang:  A display of partiality toward a favored person or group without regard for their qualifications.  A system that drives much of life in the Middle East — from getting into a good school to landing a good job.

But on the Internet, there is no wasa.

– Adapted from Startup Rising: The Entrepreneurial Revolution Remaking the Middle East by Christopher M. Schroeder

8 Lessons from Nerd Culture

I found this set of business wisdoms in the August 2013 issue of Entrepreneur magazine.  While not perfect mantras by which to guide a business, I thought they were pretty fun.

=================

Chris Hardwick didn’t rely on just his nerdy instincts in founding his company; he also took inspiration from his heroes.  Super-power your business with these lessons from some epic nerd properties.

How Wolfram|Alpha Can Help You Discover Your Own Social Network

Ever wonder what your own personal network looks like?  You are likely connected to many different groups (family, friends, community, work), but do you know how they are connected?  Or are they connected at all?  Are you the glue that connects these various groups?

This is a great age we’re living in, and I’m glad to be involved with developing lots of really advanced technologies.  One of the technology areas that I’m really fascinated with has been pushed forward by Stephen Wolfram.  He created the industry standard computing environment Mathematica, which now serves as the engine behind his company’s newest creation, Wolfram|Alpha.  (I’ve written a few posts on Wolfram|Alpha in the past, and you can read them here and here).

How to Make Your Own Custom URL Shortener

This is a technical post about what I’ve discovered in creating my own custom URL shortener.  Hopefully, you can learn to do the same things I did, and my experience will save you some headaches if it’s something you’re interested in trying.

On my website, I focus a lot on decisions and discovery.  I love finding out how the world works and then applying what I’ve learned to make better decisions, and I also try to share what I can along the way.  I hope that it helps others.

How to Make Better Decisions: Recognize Uncertainty

It’s a complex world, and we are constantly making decisions.  Just imagine the number of decisions we make about breakfast:  How big a breakfast should I have?  Should I have coffee?  If so, how much?  Should I have toast?  Should I use butter?  Should I have one piece or two?  Should I cut the toast?  If so, into rectangles or triangles?  Should I keep the crust?  Should I have juice?  Should it be apple juice or orange juice?  How about milk?  I haven’t even gotten to the pancakes, waffles, syrup, sausage, cereal, bacon… (mmm, bacon…)

And these aren’t the really important ones!  How do we know we’re making good decisions, and can we make better ones?