Data science has exploded as a field in recent years, and whether your focus is machine learning, artificial intelligence, or citizen data science, the discipline is creating very high expectations.
There is indeed much promise for data science, where predictive models and decision engines can target skin cancer in patient imagery, presciently recommend a new product that piques your interest, or power your self-driving car to evade a potential accident.
However, realizing that promise requires a great deal of effort. It takes a lot of work and brand-new engineering disciplines that are not yet mature or even employed on a wide scale. As the value of data science gains recognition and the generation of data increases at exponential rates, this engineering effort is starting and will soon grow beyond its adolescence.
This is why we are at the advent of a new engineering discipline that can truly realize the promise of data science – a discipline that I call “analytics engineering”.
Consider the parallels.
Computer science became an established academic discipline in the 1950s and 1960s, and serves as a foundation for today’s technology industry, articulating the theory and application of algorithms, computation, logic, and information representation in building real computing devices. The applications of computer science have included code breaking in World War II, the creation of ENIAC and the IBM mainframe.
However, something happened in the 1970s to democratize the development of software – the announcement of the Altair. At the time, a minicomputer could cost $10,000 or more, but the Altair was only about $400. At this price, the microcomputer became accessible to individuals – geeks who wanted to build their own computer. Clubs started meeting in Silicon Valley, such as the Homebrew Computer Club and the Altair Users Group, to show what could be done with these computers and how they could be programmed.
Hackers took hold of an industry and an explosion of innovation ensued. Steve Jobs and Steve Wozniak formed Apple, Bill Gates and Paul Allen launched Microsoft, and the personal computer was born.
Eventually, as industries were created and matured, a strong engineering discipline came to computer programming – a discipline we now refer to as software engineering.
Data science includes the analysis of data and the application of sound approaches to gain meaning or insight from that data. There are several fundamentals to the field of data science, which I’ve elaborated on before.
That said, the maturity of data science as a discipline is following a trajectory similar to that of software. Top universities such as Columbia, Cornell, and the University of California, Berkeley now offer programs and degrees in data science, establishing the academic discipline.
With prototyping languages such as R and Python, which are free to download, anyone can start programming, working with data, and applying data science principles. The barrier to entry for becoming a data scientist is now nearly zero.
However, just because someone can do something doesn’t mean they can do it well. Becoming a true practitioner is important, and learning the disciplines of a craft through experience and hard work is a must. Additionally, firms that leverage data science capabilities cannot afford to deploy a 24/7 operational capability based on a model developed for free on someone’s laptop. More engineering specialization is required, which is where the industry is heading, just like software and other engineering disciplines.
Throughout the history of innovation, this maturity curve has followed a common path: a great surge in capability and creativity, supported by solid engineering practice. Mechanical engineering launched the Industrial Revolution. Electrical engineering led to advancements such as electric power, radio, and television. Software engineering, of course, ushered in the age of computing and the internet. What will analytics engineering bring? Possibly what’s needed to support the age of artificial intelligence.
Analytics Engineering encompasses a key set of specialties that are not yet in common practice. One popular promise of data science is that all one needs to do is “just push a button and the models get developed”. Others say that because there are many different models we could try, we can simply run a thousand of them on the data, evaluate them all, and “pick the best one”.
These approaches are symptoms of technology hype, so we should take them with a grain of salt.
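One reason to be wary of “pick the best one” can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any real tool: the labels are coin flips, each “model” is just a list of random guesses, and the function name and parameters are hypothetical.

```python
import random


def best_of_random_models(n_models=1000, n_samples=50, seed=0):
    """Pick the 'best' of many random-guess models on pure-noise labels.

    Hypothetical illustration: the labels are coin flips, so no model can
    truly do better than 50% accuracy in the long run.
    """
    rng = random.Random(seed)
    train_labels = [rng.randint(0, 1) for _ in range(n_samples)]
    fresh_labels = [rng.randint(0, 1) for _ in range(n_samples)]

    best_train_acc = -1.0
    best_fresh_acc = 0.0
    for _ in range(n_models):
        # Each "model" is a fixed list of random guesses with no real skill.
        guesses = [rng.randint(0, 1) for _ in range(n_samples)]
        train_acc = sum(g == y for g, y in zip(guesses, train_labels)) / n_samples
        if train_acc > best_train_acc:
            best_train_acc = train_acc
            # Score the winning model's guesses against labels it was not selected on.
            best_fresh_acc = sum(g == y for g, y in zip(guesses, fresh_labels)) / n_samples
    return best_train_acc, best_fresh_acc


train_acc, fresh_acc = best_of_random_models()
print(f"best model, accuracy on selection data: {train_acc:.2f}")
print(f"same model, accuracy on fresh data:     {fresh_acc:.2f}")
```

With a thousand tries, the winner tends to look well above 50% on the data used to select it, even though it has no skill at all; its accuracy on fresh data is just another coin-flip result. Selecting among many candidates without an independent check is part of what engineering discipline is meant to guard against.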
For example, even after many years of developing computer applications, software still isn’t written with just a “push of a button”. Engineering practices are still needed (and need to be followed) for quality software to be shipped. Sure – hackers can prototype something quickly and demonstrate truly innovative capabilities. However, for this to scale, be reliable, and ultimately run in operation without frequent failures, engineering disciplines need to be employed.
In this new age, true analytics engineering disciplines are what is needed, tailored to the needs of analytics and decision modeling.
Data Science isn’t magic, and never will be. Yet, more focused analytics engineering disciplines can be developed to become part of decision model development and improvement moving forward. The promise of data science, machine learning, and artificial intelligence will depend on this trend, which makes this an exciting time for the industry.
Why is this important for data science? Imagine an operating point on a ROC curve where the false positive rate is very low, say 1%, at an acceptably high true positive rate.
Are we satisfied? Consider the case where a decision model, say to identify high risk customers in a financial institution, is run on a database of 1 million customers. Since nearly all of those customers are not actually high risk, a false positive rate of 1% would still yield a list of roughly 10,000 customers that would need to be reviewed purely in error, because they were falsely flagged as high risk. When you are working at scale, with millions if not billions of cases being run through decision models, those models need to demonstrate incredibly low false positive rates to be worth using.
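The arithmetic can be sketched in Python. The 1 million customers and the 1% false positive rate come from the example above; the number of truly high risk customers (1,000) and the 90% true positive rate are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
def review_workload(population, actual_positives, fpr, tpr):
    """Estimate how many flagged cases are real and how many are errors.

    population       -- total number of customers scored
    actual_positives -- number that are truly high risk (assumed here)
    fpr              -- false positive rate of the model
    tpr              -- true positive rate of the model (assumed here)
    """
    actual_negatives = population - actual_positives
    false_positives = actual_negatives * fpr
    true_positives = actual_positives * tpr
    flagged = false_positives + true_positives
    # Precision: the share of flagged customers who are truly high risk.
    precision = true_positives / flagged if flagged else 0.0
    return {
        "false_positives": false_positives,
        "true_positives": true_positives,
        "flagged": flagged,
        "precision": precision,
    }


# Assumed for illustration: 1,000 truly high risk customers, 90% TPR.
result = review_workload(1_000_000, 1_000, fpr=0.01, tpr=0.90)
print(f"falsely flagged:   {result['false_positives']:,.0f}")
print(f"correctly flagged: {result['true_positives']:,.0f}")
print(f"precision:         {result['precision']:.1%}")
```

Under these assumptions, about 9,990 customers are flagged in error against 900 flagged correctly, so only around 8% of the review queue is worth the reviewers’ time, even though a 1% false positive rate sounds excellent on paper.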
As Analytics Engineering matures, here are some of the developments that we can expect:
– New metrics will be developed to compare model performance in more accurate ways, superseding effective yet crude metrics such as Area Under the Curve (or AUC).
– New analysis techniques will be leveraged to focus on insights gained from the tails of statistical distributions, which are the true drivers of false positive rates in decision models.
– Tools and technologies will be created and matured to manage models, control versions, and audit changes in model development and deployment.
– Standards, similar to CMMI or Agile in the software engineering world, will be developed and gain traction to provide for more explicit best practices around the creation, management, and engineering of decision models.
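To see why AUC can be a crude basis for comparison, and why the tails matter, here is a small Python sketch. The two sets of scores are invented for illustration, and the helper functions are hypothetical, written from the standard definitions rather than taken from any particular library.

```python
def auc(pos_scores, neg_scores):
    """AUC as the chance that a random positive outscores a random negative."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def tpr_at_fpr(pos_scores, neg_scores, max_fpr):
    """Best true positive rate reachable while keeping FPR at or below max_fpr."""
    best = 0.0
    for threshold in sorted(set(pos_scores) | set(neg_scores)):
        fpr = sum(n >= threshold for n in neg_scores) / len(neg_scores)
        if fpr <= max_fpr:
            tpr = sum(p >= threshold for p in pos_scores) / len(pos_scores)
            best = max(best, tpr)
    return best


# Invented scores. Model A gives its single highest score to a negative case.
model_a_pos = [0.90, 0.85, 0.80, 0.75]
model_a_neg = [0.95, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.02]

# Model B ranks two positives above every negative, but mixes the rest.
model_b_pos = [0.95, 0.90, 0.60, 0.55]
model_b_neg = [0.80, 0.70, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05]

for name, pos, neg in [("A", model_a_pos, model_a_neg),
                       ("B", model_b_pos, model_b_neg)]:
    print(name, "AUC:", auc(pos, neg),
          "TPR at FPR <= 1%:", tpr_at_fpr(pos, neg, 0.01))
```

Both invented models score an AUC of 0.9, yet with essentially no false positives allowed, model B still catches half of the positives while model A catches none. The difference lives entirely in the top tail of the score distribution, which a single summary number hides.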
Companies such as Netflix, Tesla, Apple, Amazon, Google, Facebook, and others are already developing these disciplines in-house, as the success of their respective business models demands this advancement. However, other businesses will need to leverage these capabilities soon to keep pace.
It’s an exciting time to recognize and help define what this new engineering discipline will become. For data science, it’s currently like the Wild West of old – wide expanses, plenty of room to “stake your claim”, and a rush to “get in on” the field that is hot. That said, we aren’t all cowboys, and the West is now being tamed.
Welcome to the advent of Analytics Engineering.