The Advent of Analytics Engineering

Data science has exploded in recent years, and whether the focus is machine learning, artificial intelligence, or citizen data science, the discipline is generating very high expectations.

There is indeed much promise for data science, where predictive models and decision engines can detect skin cancer in patient imagery, presciently recommend a new product that piques your interest, or enable your self-driving car to evade a potential accident.

However, that promise requires a great deal of effort to realize. It takes hard work and brand-new engineering disciplines that are not yet mature or widely employed. As recognition of the value of data science grows and data generation increases at exponential rates, this engineering effort is only getting started, and it will soon grow beyond its adolescence.

This is why we are at the advent of a new engineering discipline that can truly realize the promise of data science – a discipline that I call “analytics engineering”.

Consider the parallels.

Computer science became an established academic discipline in the 1950s and 1960s and serves as a foundation for today’s technology industry, articulating the theory and application of algorithms, computation, logic, and information representation in building real computing devices. Its applications have ranged from code breaking in World War II to the creation of ENIAC and the IBM mainframe.

However, something happened in the 1970s to democratize the development of software – the announcement of the Altair. At the time, Intel sold a microprocessor for $10,000, but the Altair cost only $400. At that price, the microcomputer became accessible to individuals – geeks who wanted to build their own computers. Clubs started meeting in Silicon Valley, such as the Homebrew Computer Club and the Altair Users Group, to show what could be done with these computers and how they could be programmed.

Hackers took hold of an industry and an explosion of innovation ensued. Steve Jobs and Steve Wozniak formed Apple, Bill Gates and Paul Allen launched Microsoft, and the personal computer was born.

Eventually, as industries were created and matured, a strong engineering discipline came to computer programming – a discipline we now refer to as software engineering.

Data science includes analyzing data and applying sound approaches to gain meaning or insight from it. There are several fundamentals to the field of data science, which I’ve elaborated on before.

That said, the maturity of data science as a discipline is following a trajectory similar to that of software. Top universities such as Columbia, Cornell, and the University of California, Berkeley now offer programs and degrees in data science, establishing the academic discipline.

With prototyping languages such as R and Python, both freely downloadable, anyone can start programming, working with data, and applying data science principles. The barrier to entry for becoming a data scientist is now nearly zero.
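
To make that concrete, here is a minimal sketch of how little it takes to get started – a few lines of Python using the freely available scikit-learn library (my choice here purely for illustration) that load a bundled sample dataset and fit a basic classifier:

    # A minimal getting-started sketch: free tools, a bundled dataset, one model.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Load a sample dataset that ships with scikit-learn.
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Fit a plain logistic regression and check accuracy on held-out data.
    model = LogisticRegression(max_iter=5000)
    model.fit(X_train, y_train)
    print("holdout accuracy:", model.score(X_test, y_test))

Of course, as the rest of this post argues, getting a few lines like this to run is a far cry from running a model reliably in production.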

However, just because someone can do something doesn’t mean they can do it well. Becoming a true practitioner matters, and learning the disciplines of a craft through experience and hard work is a must. Additionally, firms that leverage data science capabilities cannot afford to base a 24/7 operational capability on a model developed for free on someone’s laptop. More engineering specialization is required, and that is where the industry is heading, just as software and other engineering disciplines did.

Throughout the history of innovation, this maturity curve has followed a common path: a great surge in capability and creativity, supported by solid engineering practice. Mechanical engineering launched the Industrial Revolution. Electrical engineering brought advancements such as electricity, radio, and television. Software engineering, of course, ushered in the age of computing and the internet. What will analytics engineering bring? Possibly what’s needed to support the age of artificial intelligence.

Analytics Engineering encompasses a key set of specialties that are not yet in common practice. One promise made for data science is that all you need to do is “just push a button and the models get developed”. Others say that since there are many different models we could try, we should run a thousand of them against the data, evaluate them all, and “pick the best one”.

These approaches are symptoms of technology hype, so we should take them with a grain of salt.

For example, even after many years of developing computer applications, software still isn’t written with just a “push of a button”. Engineering practices are still needed (and need to be followed) for quality software to ship. Sure – hackers can prototype something quickly and demonstrate truly innovative capabilities. But for that work to scale, be reliable, and ultimately run in operations without frequent failures, engineering discipline has to be applied.

In this new age, what is needed are true analytics engineering disciplines, tailored to the needs of analytics and decision modeling.

Data science isn’t magic, and it never will be. Yet more focused analytics engineering disciplines can be developed and become part of how decision models are built and improved going forward. The promise of data science, machine learning, and artificial intelligence will depend on this trend, which makes this an exciting time for the industry.

Why is this important for data science? Imagine an ROC curve where the false positive rate is very low, say 1%, at an acceptably high true positive rate.

Are we satisfied? Consider a decision model, say one that identifies high-risk customers at a financial institution, run against a database of 1 million customers. A false positive rate of 1% would still flag roughly 10,000 customers for review purely in error, since these customers are falsely labeled as high risk. When you are working at scale, with millions if not billions of records being run through decision models, those models need to demonstrate incredibly low false positive rates to be worth using.
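
To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python (the 1 million customers and 1% false positive rate come from the example above; the 1% prevalence of truly high-risk customers is an assumption added purely for illustration):

    # Expected number of customers flagged in error at a "very low" 1% FPR.
    total_customers = 1_000_000
    assumed_high_risk_rate = 0.01   # illustrative assumption, not from the example
    false_positive_rate = 0.01      # the 1% from the ROC curve above

    negatives = total_customers * (1 - assumed_high_risk_rate)
    expected_false_positives = false_positive_rate * negatives
    print(f"customers flagged purely in error: {expected_false_positives:,.0f}")  # ~9,900

Even under generous assumptions, that is roughly 10,000 reviews triggered by nothing but model error.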

As Analytics Engineering matures, here are some of the developments that we can expect:

– New metrics will be developed to compare model performance in more accurate ways, superseding effective yet crude metrics such as the Area Under the Curve (AUC); a small sketch after this list shows one simple comparison along these lines.

– New analysis techniques will be leveraged to focus on insights gained from the tails of statistical distributions, which are the true drivers of false positive rates in decision models.

– Tools and technologies will be created and matured to manage models, control versions, and track and audit changes in model development and deployment.

– Standards, similar to CMMI or Agile in the software engineering world, will be developed and gain traction to provide for more explicit best practices around the creation, management, and engineering of decision models.
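
As a simple illustration of the first two points (my own sketch, not a specific proposed metric), one way to compare models is to look at the true positive rate each achieves while staying under a fixed, very low false positive rate budget – alongside, rather than instead of, AUC:

    # Sketch: compare models at a fixed low FPR budget, not just by overall AUC.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score, roc_curve
    from sklearn.model_selection import train_test_split

    # Synthetic, heavily imbalanced data (about 1% positives) for illustration.
    X, y = make_classification(n_samples=20000, weights=[0.99], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    def tpr_at_fpr(y_true, scores, max_fpr=0.01):
        # Best recall achievable while keeping the false positive rate under budget.
        fpr, tpr, _ = roc_curve(y_true, scores)
        return tpr[fpr <= max_fpr].max()

    for model in (LogisticRegression(max_iter=1000), GradientBoostingClassifier(random_state=0)):
        scores = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
        print(type(model).__name__,
              "AUC:", round(roc_auc_score(y_te, scores), 3),
              "TPR @ 1% FPR:", round(tpr_at_fpr(y_te, scores), 3))

Two models with similar AUC values can still differ meaningfully on the second number, which is the one that drives review workloads at scale.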

Companies such as Netflix, Tesla, Apple, Amazon, Google, Facebook, and others are already developing these disciplines in-house, as the success of their respective business models demands this advancement.  However, other businesses will soon need to leverage these capabilities to keep pace.

It’s an exciting time to recognize and help define what this new engineering discipline will become.  For data science, it’s currently like the Wild West of old – wide expanses, plenty of room to “stake your claim”, and a rush to “get in on” a field that is hot.  That said, we aren’t all cowboys, and the West is now being tamed.

Welcome to the advent of Analytics Engineering.

How Wolfram|Alpha Can Help You Discover Your Own Social Network

Ever wonder what your own personal network looks like?  You are likely connected to many different groups (family, friends, community, work), but do you know how they are connected?  Or are they connected at all?  Are you the glue that connects these various groups?

This is a great age we’re living in, and I’m glad to be involved with developing lots of really advanced technologies.  One of the technology areas that I’m really fascinated with has been pushed forward by Stephen Wolfram.  He created the industry standard computing environment Mathematica, which now serves as the engine behind his company’s newest creation, Wolfram|Alpha.  (I’ve written a few posts on Wolfram|Alpha in the past, and you can read them here and here).

How to Make Your Own Custom URL Shortener

This is a technical post about what I’ve discovered in creating my own custom URL shortener.  Hopefully, you can learn to do the same things I did, and my experience will save you some headaches if it’s something you’re interested in trying.

On my website, I focus a lot on decisions and discovery.  I love finding out how the world works and then applying what I’ve learned to make better decisions, and I try to share what I can along the way.  I hope it helps others.

Introducing Wolfram|Alpha Pro

Stephen Wolfram is doing it again.  I’m a big fan of Wolfram (you can read some of my other posts here, here, and here…), and I’m always intrigued by what he comes up with.  A couple of days ago, Wolfram launched his latest contribution to data science and computational understanding – Wolfram|Alpha Pro.

Here’s an overview of what the new Pro version of Wolfram|Alpha can provide:

With Wolfram|Alpha Pro, you can compute with your own data. Just input numeric or tabular data right in your browser, and Pro will automatically analyze it—effortlessly handling not just pure numbers, but also dates, places, strings, and more.

Upload 60+ types of data, sound, text, and other files to Wolfram|Alpha Pro for automatic analysis and computation. CSV, XLS, TXT, WAV, 3DS, HDF, GXL, XML…

Zoom in to see the details of any output—rendering it at a larger size and higher resolution.

Perform longer computations as a Wolfram|Alpha Pro subscriber by requesting extra time on the Wolfram|Alpha compute servers when you need it.

Licenses for prototyping and analysis software go for several thousand dollars (Matlab, IDL, even Mathematica) – student versions can be had for a few hundred dollars, but you can’t leverage data science for business purposes on a student license.

Wolfram|Alpha Pro lets anyone with a computer, an internet connection, and a small budget leverage the power of data science.  Right now, you can get a free trial subscription, and from there, the cost is $4.99/month.  This price is introductory, but it could be seductive enough to attract a lot of users (I’ve already signed up – all you need for the free trial is an e-mail address…)

One option that I find really interesting is Wolfram’s creation of the Computable Document Format (CDF), which lets you interact with dynamic versions of existing Wolfram|Alpha output as well as access new content using interactive controls, 3D rotation, and animation.  It’s like having Wolfram|Alpha embedded in the document.

I attended a Wolfram Science Conference back in 2006 and saw the potential for such a document format even then.  A number of presenters later wrote up their work into papers published by the journal Complex Systems.  Since many of the presentations relied on real interactivity with the data, I could see how much of the insight would be lost when people tried to write things down and limit their visualizations to simple, static graphs and figures.

I remember contacting Jean Buck at Wolfram Research, and recommending such a format.  Who knows whether that had any impact, but I’m certainly glad to see that this is finally becoming a reality.  I actually got the opportunity to meet Wolfram at the conference (he even signed a copy of his Cellular Automata and Complexity for me… – Jean was kind enough to arrange that for me – thanks, Jean!)

If you’re interested in data science and have a spare $5 this month, try out Wolfram|Alpha Pro!

Gartner Magic Quadrant Report on Big Data Integration Tools

Based on its Magic Quadrant analysis of data integration tools, Gartner rates Informatica Corp. and IBM as the top software vendors in the space.

Gartner uses a Magic Quadrant to rate companies as leaders, challengers, niche players and visionaries based on several criteria including “completeness of vision” and “ability to execute.”  From Gartner’s website:

  • Leaders execute well against their current vision and are well positioned for tomorrow.
  • Visionaries understand where the market is going or have a vision for changing market rules, but do not yet execute well.
  • Niche Players focus successfully on a small segment, or are unfocused and do not out-innovate or outperform others.
  • Challengers execute well today or may dominate a large segment, but do not demonstrate an understanding of market direction.

A post by Mark Brunelli, Senior News Editor at SearchDataManagement, has a more detailed analysis of the Gartner report.  Here’s what Brunelli wrote, relaying some of the thoughts of Ted Friedman, a Gartner vice president, information management analyst, and co-author of the report:

“You’re hearing a lot about big data and analytics around big data,” Friedman said. “To do that kind of stuff you’ve got to collect the data that you want to analyze and put it somewhere. [That] in effect is a job for data integration tools.”

It does seem that the main focus right now in this space is on data handling and data management.  A lot of work is being done by companies to create data visualization tools to gain insight from the data, but as the problems get much harder, better analytics approaches will need to be brought to bear.  The real key over the next few years will be the smart analysis of all this data, turning it into reliable, actionable information.

WSJ: The King of Big Data

On the Wall Street Journal website today, Ben Rooney posts an interview with Hortonworks CEO Eric Baldeschwieler, co-creator of Hadoop.  For anyone in or around the big data space: the Hadoop project develops open-source software for reliable, scalable, distributed computing, and Hortonworks is focused on accelerating Hadoop’s development and adoption.

In Rooney’s interview, Baldeschwieler describes the problem Hadoop is designed to solve:

At its base, it is just a way to take bulk data and store it in a way that is cheap and replicated and can pull up data very, very fast.

Hadoop is at one level much simpler than other databases. It has two primary components; a storage layer that lets you combine the local disks of a whole bunch of commodity computers, cheap computers. It lets you combine that into a shared file system, a way to store data without worrying which computer it is on. What that means is you can use cheap computers. That lets you strip a lot of cost out of the hardware layer.

The thing that people don’t appreciate when you drop to a lower price point is that it is not about saving money, it is about being able to do an order of magnitude more on the same budget. That is revolutionary. You can store five to 10 times more data and you can process it in ways that you can’t imagine. A lot of the innovation it opens up is just the speed of innovation. You get to an answer faster, you move into production faster, you make revenue faster.
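
To give a feel for what programming against Hadoop looks like, here is an illustrative word-count sketch in the Hadoop Streaming style, where the map and reduce steps are plain scripts that read stdin and write stdout (this is my own sketch to accompany the interview, not code from it):

    # wordcount.py – illustrative Hadoop Streaming job (mapper and reducer in one script).
    # Hadoop pipes data stored across commodity machines through these steps via stdin/stdout.
    import sys
    from itertools import groupby

    def mapper():
        for line in sys.stdin:
            for word in line.split():
                print(f"{word}\t1")              # emit (word, 1) pairs

    def reducer():
        pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
        for word, group in groupby(pairs, key=lambda kv: kv[0]):   # reducer input arrives sorted by key
            print(f"{word}\t{sum(int(count) for _, count in group)}")

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

A job like this would be submitted with the hadoop-streaming jar, passing the script as the -mapper and -reducer arguments along with -input and -output paths on the cluster (the exact jar location varies by installation).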

You can read Rooney’s WSJ interview here.

Popular Science: Wolfram on Big Data

I’m a big fan of Stephen Wolfram and his efforts in building Mathematica and pushing forward his approach to scientific discovery, A New Kind of Science.  In a recent post, Popular Science editor Mark Jannot talks to Wolfram about big data, human understanding, and the origin of the universe.

Here’s just one back-and-forth between Jannot and Wolfram:

Jannot:  A couple years ago at TED, Tim Berners-Lee led the audience in a chant of “More raw data now!” so he’s out there trying to create the data Web. And your project in Wolfram Alpha is not to deliver raw data but to deliver the interface that allows for raw data to turn into meaning.

Wolfram:  Yes, what we’re trying to do—as far as I’m concerned, the thing that’s great about all this data is that it’s possible to answer lots of questions about the world and to make predictions about things that might happen in the world, and so on. But possible how? Possible for an expert to go in and say, “Well, gosh, we have this data, we know what the weather was on such-and-such a date, we know what the economic conditions at such-and-such place were, so now we can go and figure something out from that.” But the thing that I’m interested in is, Can one just walk up to a computer and basically be able to say, “OK, answer me this question.”

This part of the Q&A is particularly interesting, since it highlights a difference in what some want from technology.  Berners-Lee seems to want more “raw data”, while Wolfram emphasizes that the data isn’t really important unless you can turn it into actionable information.  Wolfram|Alpha does just this – the technology uses Wolfram’s understanding of computation (built as part of his wildly successful Mathematica product line) to let us answer questions.

It’s an incredibly rich article – one worth reading (1) if you’re interested in data and where it’s taking us, and (2) if you’re interested in Wolfram and his take on science and technology.  I’m interested in both, so I think it’s well worth highlighting…

Here’s the Popular Science article, and another post from Wolfram|Alpha that highlights the history of computable knowledge (you can even order a poster of the timeline here…).   I’ve had a number of other posts on Wolfram and his scientific approach, which might be worth looking into as well…

Upcoming Data Science Conferences

Looking ahead, there are a number of interesting upcoming conferences within the “big data” and “data science” fields, so I thought I’d list a few in this post.

First, Strata is probably one of the biggest and fastest-growing conferences in the field, and the next one will be held in Santa Clara, CA, starting on February 28, 2012.  Featured speakers include danah boyd of Microsoft Research, Hal Varian, Chief Economist at Google, and Pete Warden, CTO of Jetpac.  They hold online conferences as well – here’s more info on the December 7 event from a previous post.

Next, Predictive Analytics World hosts a number of conferences around the world and for differing industries.  Their next conference is scheduled for November 30 in London, UK, in conjunction with the Marketing Optimization Summit and Conversion Conference as part of Data Driven Business Week.  PAW held a recent conference in New York this past October and will be hosting another one in San Francisco in March of 2012.

Also, the Big Data Summit team today announced session details for its upcoming technology event, November 8-10, 2011, in Miami, Florida.  The Summit gathers C-level executives involved in data storage, data management, and data analysis to discuss how companies can effectively manage, protect, and leverage the growing amounts of data in the enterprise.

Nerd Pride Friday: Triumph of the Nerds

This Friday, I thought I’d highlight one of my favorite and nerdiest documentaries about the technology industry.  Given the presence of the late Steve Jobs in the media lately – his loss to cancer, the new Walter Isaacson biography, and Apple’s re-creation of personal computing – I thought it would be nice to reflect on Jobs’ first major impact on our technological lives.

Fifteen years ago, PBS created a documentary called Triumph of the Nerds:  An Irreverent History of the PC Industry.  It was hosted by Bob Cringely, who at the time wrote a column for InfoWorld about the goings-on of Silicon Valley and now writes a weekly column, I, Cringely.  In it, Cringely humorously and effectively describes how Steve Jobs and Bill Gates created the PC industry as we know it.

I was always impressed with this documentary and the impact its subjects (especially Jobs) had on our lives through their business and technology pursuits.  I personally think it’s amazing and worth your time.  They made a second documentary about the history of the Internet, Nerds 2.0.1, in 1998 – when the Internet as we know it was probably only three years old or so…

NYT: The Future of Computing

Here’s a nice post from the New York Times about big data, speed, and the future of computing.  It talks a bit about the technology that makes IBM’s Watson computer so fantastic at beating Jeopardy! champions (we first wrote about this last year…), and it suggests that the need for speed will likely change computer architectures themselves.

There will likely be groundbreaking changes in hardware and software, where computation and decision-making become part of the same technology.  This could be where much of the advancement in analytics engineering comes from over the next decade.  Read more from the NYT post here.