The Advent of Analytics Engineering

Data science has exploded in recent years, and whether you focus on machine learning, artificial intelligence, or citizen data science, the discipline is creating very high expectations.

There is indeed much promise for data science, where predictive models and decision engines can target skin cancer in patient imagery, presciently recommend a new product that piques your interest, or power your self-driving car to evade a potential accident.

However, promise requires much effort to be realized. It takes a lot of work and brand-new engineering disciplines that are not yet mature or widely employed. As recognition of the value of data science grows, and the generation of data increases at exponential rates, this engineering effort is starting and will soon grow beyond its adolescence.

This is why we are at the advent of a new engineering discipline that can truly realize the promise of data science – a discipline that I call “analytics engineering”.

Consider the parallels.

Computer science became an established academic discipline in the 1950s and 1960s, and serves as a foundation for today’s technology industry, articulating the theory and application of algorithms, computation, logic, and information representation in building real computing devices. The applications of computer science have included code breaking in World War II, the creation of ENIAC, and the IBM mainframe.

However, something happened in the 1970s to democratize the development of software – the announcement of the Altair. At the time, Intel sold a microprocessor for $10,000, but the Altair was only $400. At this price, the microcomputer became accessible to individuals – geeks who wanted to build their own computers. Clubs started meeting in Silicon Valley, such as the Homebrew Computer Club and the Altair Users Group, to show what could be done with these computers and how they could be programmed.

Hackers took hold of an industry and an explosion of innovation ensued. Steve Jobs and Steve Wozniak formed Apple, Bill Gates and Paul Allen launched Microsoft, and the personal computer was born.

Eventually, as industries were created and matured, a strong engineering discipline came to computer programming – a discipline we now refer to as software engineering.

Data science includes the analysis of data and the application of solid approaches to gain meaning or insight from it. There are several fundamentals to the field of data science, which I’ve elaborated on before.

That said, the maturity of data science as a discipline is following a similar trajectory as that of software. Top universities such as Columbia, Cornell, and University of California-Berkeley now offer programs and degrees in Data Science, establishing the academic discipline.

With prototyping languages such as R and Python, which are free to download, literally anyone can start programming, working with data, and applying data science principles. The barrier to entry for becoming a data scientist is now nearly zero.

However, just because someone can do something doesn’t mean they can do it well. Becoming a true practitioner is important, and learning the disciplines of a craft through experience and hard work is a must. Additionally, firms that leverage data science capabilities cannot afford to deploy a 24/7 operational capability based on a model developed for free on someone’s laptop. More engineering specialty is required, which is where the industry is heading, just like software and other engineering disciplines.

Throughout the history of innovation, this maturity curve has followed a common path, being part of a great surge in capability and creativity, supported by solid engineering practice. With mechanical engineering, the Industrial Revolution was launched. Electrical engineering led to advancements such as electricity, radio, and television. Of course, software engineering ushered in the age of computing and the internet. What will analytics engineering bring? Possibly what’s needed to support the age of artificial intelligence.

Analytics Engineering encompasses a key set of specialties that are not yet in common practice. One promise made for data science is that all one needs to do is “just push a button and the models get developed”. Others say that there are many different models we could try, so if we try a thousand different models on the data, we can evaluate all of them and “pick the best one”.

These approaches are symptoms of technology hype, so we should take them with a grain of salt.

For example, even after many years of developing computer applications, software even today isn’t written with just a “push of a button”. Engineering practices are still needed (and need to be followed) for quality software to be shipped. Sure – hackers can prototype something quickly and demonstrate truly innovative capabilities. However, for this to scale, be reliable, and ultimately operational without frequent failures, engineering disciplines need to be employed.

In this new age, true analytics engineering disciplines are what is needed, tailored to the needs of analytics and decision modeling.

Data Science isn’t magic, and never will be.  Yet, more focused analytics engineering disciplines can be developed to become part of decision model development and improvement moving forward.  The promise of data science, machine learning, and artificial intelligence will depend on this trend, which makes this an exciting time for the industry.

Why is this important for data science? Imagine a ROC curve where the false positive rate is very low, say 1% at an acceptably high true positive rate.

Are we satisfied?  Consider the case where a decision model, say to identify high risk customers in a financial institution, is run on a database of 1 million customers.  A false positive rate of 1% would still yield roughly 10,000 customers that would need to be reviewed purely in error, since these are falsely flagged as high risk.  When you are working at scale, with millions if not billions of records being run through decision models, these models need to demonstrate incredibly low false positive rates to be worth using.
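The review burden described above can be sketched with a little back-of-the-envelope arithmetic; the customer count and false positive rate below are the article's illustrative numbers, and the prevalence figure is a hypothetical assumption added for the sketch:

```python
def false_positive_burden(n_records, prevalence, fpr):
    """Expected number of records flagged purely in error.

    n_records:  total records scored by the model
    prevalence: fraction of records that are truly high risk
    fpr:        false positive rate (fraction of negatives wrongly flagged)
    """
    n_negatives = n_records * (1 - prevalence)
    return n_negatives * fpr

# The article's example: 1 million customers at a 1% false positive rate.
# Assuming, say, 0.1% of customers are truly high risk, nearly all records
# are negatives, so roughly 1% of 1 million land in the review queue in error.
print(round(false_positive_burden(1_000_000, prevalence=0.001, fpr=0.01)))  # 9990
```

Every one of those reviews costs analyst time, which is why the false positive rate, not overall accuracy, dominates the economics of a deployed decision model.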

As Analytics Engineering matures, here are some of the developments that we can expect:

– New metrics will be developed to compare model performance in more accurate ways, superseding effective yet crude metrics such as Area Under the Curve (or AUC).

– New analysis techniques will be leveraged to focus on insights gained from the tails of statistical distributions, which are the true drivers of false positive rates in decision models.

– Tools and technologies will be created and matured to manage models, control versions, and audit changes in model development and deployment.

– Standards, similar to CMMI or Agile in the software engineering world, will be developed and gain traction to provide for more explicit best practices around the creation, management, and engineering of decision models.
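To make the first two points above concrete, here is a minimal sketch (not from the article; the two models and their scores are hypothetical) showing how a conventional AUC can hide exactly the tail behavior that matters: it computes the AUC alongside the true positive rate achievable at a capped false positive rate, the kind of tail-focused metric anticipated here:

```python
def roc_points(scores, labels):
    """Sweep every score as a threshold and return (fpr, tpr) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))
    return pts

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def tpr_at_fpr(points, max_fpr):
    """Best true positive rate achievable without exceeding max_fpr."""
    return max(tpr for fpr, tpr in points if fpr <= max_fpr)

# Two hypothetical models scoring the same ten customers (three truly high risk).
labels  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model_a = [0.95, 0.90, 0.85, 0.96, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15]
model_b = [0.95, 0.90, 0.50, 0.80, 0.70, 0.65, 0.40, 0.35, 0.30, 0.25]

for name, scores in [("A", model_a), ("B", model_b)]:
    pts = roc_points(scores, labels)
    print(name, round(auc(pts), 3), tpr_at_fpr(pts, 0.0))
```

Both models have an identical AUC (about 0.857), yet before its first false positive, model A catches none of the high-risk customers while model B catches two-thirds of them. A tail-sensitive metric separates them; AUC alone does not.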

Companies such as Netflix, Tesla, Apple, Amazon, Google, Facebook, and others are already developing these disciplines in-house, as the success of their respective business models demands this advancement.  However, other businesses will need to leverage these capabilities soon to keep pace.

It’s an exciting time to recognize and help define what this new engineering discipline will become.  For data science, it’s currently like the Wild West of old – wide expanses, plenty of room to “stake your claim”, and a rush to “get in on” a field that is hot.  That said, we aren’t all cowboys, and the West is now being tamed.

Welcome to the advent of Analytics Engineering.

10 Things To Know When Hiring Data Scientists

I’ve been doing data science since before there was a field called “data science”, so I’ve had the opportunity to work with and hire a lot of great people.  But if you’re trying to hire a data scientist, how do you know what to look for, and what should you consider in the interview process?


I’ve been doing what is now called “data science” since the early 1990s and have helped to hire numerous scientists and engineers over the years.  The teams I’ve had the opportunity to work with are some of the best in the world, tackling some of the most challenging problems facing our country.  These folks are also some of the smartest people I’ve ever had the opportunity to work with.

That said, not everyone is a good fit, and the discipline of data science requires important key elements.  Hiring someone into your team is incredibly important to your business, especially if you’re a small startup or building a critical internal data science team; mistakes can be expensive in both time and money.  This can be even more intimidating if you don’t have the background or experience in hiring scientists, especially someone responsible for this new discipline of working with data.

The Best Way To Learn New Things

Science and business seem like two very different disciplines, but is the best approach to learning any different in these two fields?  These areas of life seem so unique, and the people in them can be quite different (one with the nerdy pocket protector, the other dressed in the well-tailored suit).  However, both science and business require learning, and the best approach to learning in each is really the same.

The best approach to learning is generally through failure.  For example, Thomas Edison failed an astounding number of times before he invented a working lightbulb, and there are likely thousands of stories about how successes came as a result of many tries and many failures.

In many ways, this is really an application of the scientific method.  I’ve written a number of posts about Stephen Wolfram (such as using Wolfram|Alpha to look at your own social network, his views on big data, computing a theory of everything, and how he created his company).  In the effort to learn even more about how the world works, Wolfram has pushed scientific discovery to the next level, which he’s done with his book A New Kind of Science (NKS for short).

Beating Cancer and Favoritism with Data

I read a couple of items in this month’s Fortune magazine that I thought were worth passing along.

The first was a small article by Brian Dumaine about the work being done at Applied Proteomics to identify cancer before it develops.  At Applied Proteomics, they use mass spectrometry to capture and catalog 360,000 different pieces of protein found in blood plasma, and then let supercomputers crunch on the data to identify anomalies associated with cancer.  The company has raised $57 million in venture capital and is backed by Microsoft co-founder Paul Allen.  You can read the first bit of the article here.

The second is from the Word Check callout, showing how access to information is making the world a better place:

wasa: Pronounced [wah-SUH]

(noun) Arabic slang:  A display of partiality toward a favored person or group without regard for their qualifications.  A system that drives much of life in the Middle East — from getting into a good school to landing a good job.

But on the Internet, there is no wasa.

– Adapted from Startup Rising: The Entrepreneurial Revolution Remaking the Middle East by Christopher M. Schroeder

8 Lessons from Nerd Culture

I found this set of business wisdoms in the August 2013 issue of Entrepreneur magazine.  While not perfect mantras by which to guide a business, I thought they were pretty fun.

=================

Chris Hardwick didn’t rely on just his nerdy instincts in founding his company; he also took inspiration from his heroes.  Super-power your business with these lessons from some epic nerd properties.

Introducing Wolfram|Alpha Pro

Stephen Wolfram is doing it again.  I’m a big fan of Wolfram (you can read some of my other posts here, here, and here…), and am always intrigued by what he comes up with.  A couple of days ago, Wolfram launched his latest contribution to data science and computational understanding – Wolfram|Alpha Pro.

Here’s an overview of what the new Pro version of Wolfram|Alpha can provide:

With Wolfram|Alpha Pro, you can compute with your own data. Just input numeric or tabular data right in your browser, and Pro will automatically analyze it—effortlessly handling not just pure numbers, but also dates, places, strings, and more.

Upload 60+ types of data, sound, text, and other files to Wolfram|Alpha Pro for automatic analysis and computation. CSV, XLS, TXT, WAV, 3DS, HDF, GXL, XML…

Zoom in to see the details of any output—rendering it at a larger size and higher resolution.

Perform longer computations as a Wolfram|Alpha Pro subscriber by requesting extra time on the Wolfram|Alpha compute servers when you need it.

Licenses of prototyping and analysis software go for several thousand dollars (Matlab, IDL, even Mathematica) – student versions can be had for a few hundred dollars, but you can’t leverage data science for business purposes on student licenses.

Wolfram|Alpha Pro lets anyone with a computer, an internet connection, and a small budget leverage the power of data science.  Right now, you can get a free trial subscription, and from there, the cost is $4.99/month.  This price is introductory, but it could be seductive enough to attract a lot of users (I’ve already signed up – all you need for the free trial is an e-mail address…)

One option that I find really interesting is Wolfram’s creation of the Computable Document Format (CDF), whose interactivity lets you get dynamic versions of existing Wolfram|Alpha output as well as access to new content using interactive controls, 3D rotation, and animation.  It’s like having Wolfram|Alpha embedded in the document.

I attended a Wolfram Science Conference back in 2006 and saw the potential for such a document format back then.  There were a number of presenters who later wrote up their work into papers, published by the journal Complex Systems.  Since many of the presentations relied on real interactivity with the data, I could see where much of the insight would be lost when people tried to write things down and limit their visualizations to simple, static graphs and figures.

I remember contacting Jean Buck at Wolfram Research, and recommending such a format.  Who knows whether that had any impact, but I’m certainly glad to see that this is finally becoming a reality.  I actually got the opportunity to meet Wolfram at the conference (he even signed a copy of his Cellular Automata and Complexity for me… – Jean was kind enough to arrange that for me – thanks, Jean!)

If you’re interested in data science and have a spare $5 this month, try out Wolfram|Alpha Pro!

Data Science Tidbits

Here are some data science nuggets that I thought were interesting for a mid-January day…

The first comes from TechMASH about data science being the next big thing.  The primary nugget of note is that the supply of employees with the needed skills as data scientists – those people who really understand how to pull relevant information out of data reliably – is going to have a tough time meeting demand.  Here’s an interesting infographic on the current disconnects – for example, while 37% of “business intelligence” professionals studied business in school, 42% of today’s “data scientists” studied computer science, engineering, and natural sciences.  This highlights the increasing demand for students that have solid mathematics backgrounds – it’s becoming more about knowing how you pull information from data, regardless of application.

Don’t get me wrong – to be effective applying data science, you need two things:  a subject matter expert that understands what makes sense and what doesn’t, and someone who really understands data to pull out the information.  Sometimes that can reside within one person, but it’s rare and takes many years of training to acquire the necessary excellence in both fields.   And as the demands for data analysis grow, these two areas will likely form into distinct disciplines with interesting partnership opportunities being created.

Data science is still being defined as a field, but I’m convinced it will have huge impact in the next five years.  And while the science aspects of data are starting to be defined, the engineering aspects of data and analytics are truly in their infancy…

On the same thread, here’s a Forbes article by Tom Groenfeldt on the need for data scientists, or Excel jockeys, or whatever they will be called in the future.  For some companies, the move to “data science” is quite apparent, but for others, the current assemblage of business professionals who have figured out the ins-and-outs of Excel spreadsheets works quite well.  This is likely a snapshot of where things are today, but I do believe that as the questions we ask of the data get more complicated, we will clearly see the need for a more rigorous science-based discipline to data wrangling…

The last tidbit is from the Wall Street Journal about the healthcare field being the next big area for Big Data.  I do think that healthcare is ripe for leveraging data, and I’ve written other posts on the subject.  One former Chief Medical Officer that I spoke with mentioned that one of the big problems is just getting the data usable in the first place.  He said that, as of today, 85% of all medical records are still in paper form.  The figure seems a bit high to me, but I don’t really know how many patient records in various individual doctor’s offices are still sitting in folders on shelves.

There has been a big push lately, spurred by financial support from the U.S. government, for upgrading to electronic health records (EHR).  This will help to solve the data collection problem – if you can’t get data into an electronic format, you can’t utilize information technologies to pull information out of the data.

Rise of the Algorithm

I ran across this article from the Independent today about the impacts of data algorithms, the ethics of data mining, and the future of our lives in an automated, data-crunching world.  Below is a quote from the article by Jaron Lanier, musician, computer scientist and author of the bestseller You Are Not a Gadget.

Algorithms themselves are a form of creativity. The problem is the illusion that they’re free-standing. If you start to think that information isn’t just a mask behind which people are hiding, if you forget that, you’ll pay a price for that way of thinking. It will cause you to be less creative.

“If you show me an algorithm that dehumanises, impoverishes, manipulates or spies upon people,” he continues, “that same core maths can be applied differently. In every case. Take Facebook’s new Timeline feature [a diary-style way of displaying personal information]. It’s an idea that has been proposed since the 1980s [by Lanier himself]. But there are two problems with it. One, it’s owned by Facebook; what happens if Facebook goes bankrupt? Your life disappears – that’s weird. And two, it becomes fodder for advertisers to manipulate you. That’s creepy. But its underlying algorithms, if packaged in a different way, could be wonderful because they address a human cognitive need.

I think this is a really great read for anyone who’s interested in data, algorithms, and their impact on society – there’s a lot of really good stuff to take in.  You can read the entire article here.

Forbes: Can Big Data Fix Healthcare?

This is the very question asked by Colin Hill, CEO and co-founder of GNS Healthcare, a healthcare analytics company.  Hill hopes to make the case that healthcare can benefit from what a recent McKinsey report calls “the next frontier for innovation, competition and productivity.”  

I think Hill is onto something, especially with this insight:

What will healthcare look like in the year 2020?  One thing is certain: we can’t afford its current trajectory.  Left unchecked, our $2.6 trillion in annual spending will grow to $4.6 trillion by 2020, one-fifth of GDP.  With almost 80 million Baby Boomers approaching retirement, economists forecast these trends will likely bankrupt Medicare and Medicaid in the near future.  And while healthcare reform ignites a number of important changes, alone it does not resolve our issues.  It’s critical we fix our system now.

Something’s got to give, and better decisions from better data can yield significant healthcare savings if done right.  Saving lives and reducing costs dramatically in healthcare would qualify as one of those hard problems where disciplined approaches can yield significant results.  Here is Hill’s post on Forbes…

Fast Company: Interview with LinkedIn’s Reid Hoffman

I ran across this interview by Fast Company with LinkedIn co-founder Reid Hoffman about his new book The Start-Up of You and the need for companies to have a data strategy, or risk losing “potentially a lot” in the future.  Here’s that brief bit from the Hoffman interview:

What do companies miss out on if they don’t have a data strategy?

Potentially a lot. If you say the way our products and services are constituted, how we determine our strategy and maintain a competitive edge against other folks–if data is a very strong element of each of these, and you’re not doing anything, it’s like trying to run a business without business intelligence. I’m not sure I have a broad enough view that I would say every company needs to have a data strategy. But I would say many companies do. I certainly think that any company that is over 20 people needs to have a technology strategy, and data is essential to where technology is going.

LinkedIn has already been on record as not worrying about Facebook taking over their business.  According to Hoffman, “People with advanced degrees are three times more likely to use LinkedIn.”

You can read the Fast Company interview here.