Two of the biggest buzzwords in our industry are “big data” and “data science”. Big Data seems to have a lot of interest right now, but Data Science is fast becoming a very hot topic.
I think there’s room to really define the science of data science – what are those fundamentals that are needed to make data science truly a science we can build upon?
Below are my thoughts for an outline for such a set of fundamentals:
Fundamentals of Data Science
Introduction
The easiest thing for people within the big data / analytics / data science disciplines is to say “I do data science”. However, when it comes to data science fundamentals, we need to ask the following critical questions: What really is “data”, what are we trying to do with data, and how do we apply scientific principles to achieve our goals with data?
– What is Data?
– The Goal of Data Science
– The Scientific Method
Probability and Statistics
The world is a probabilistic one, so we work with data that is probabilistic – meaning that, given a certain set of preconditions, data will appear to you in a specific way only part of the time. To apply data science properly, one must become familiar and comfortable with probability and statistics.
– The Two Characteristics of Data
– Examples of Statistical Data
– Introduction to Probability
– Probability Distributions
– Connection with Statistical Distributions
– Statistical Properties (Mean, Mode, Median, Moments, Standard Deviation, etc.)
– Common Probability Distributions (Discrete, Binomial, Normal)
– Other Probability Distributions (Chi-Square, Poisson)
– Joint and Conditional Probabilities
– Bayes’ Rule
– Bayesian Inference
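To make Bayes’ Rule from the list above concrete, here is a minimal sketch in Python. The scenario and every number in it (a 1% base rate, a 95% detection rate, a 5% false-alarm rate) are assumptions chosen purely for illustration, and the function name posterior is hypothetical.

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """Bayes' rule: P(H | D) = P(D | H) * P(H) / P(D)."""
    # Total probability of seeing the data, whether or not H is true.
    evidence = p_data_given_h * prior + p_data_given_not_h * (1.0 - prior)
    return p_data_given_h * prior / evidence

# Hypothetical numbers: 1% base rate, 95% detection rate, 5% false-alarm rate.
p = posterior(prior=0.01, p_data_given_h=0.95, p_data_given_not_h=0.05)
print(f"P(H | D) = {p:.3f}")
```

With these assumed numbers the posterior comes out at roughly 16%, far below the 95% detection rate, because the low prior dominates.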
Decision Theory
This section is one of the key fundamentals of data science. Whether applied in scientific, engineering, or business fields, we are trying to make decisions using data. Data itself isn’t useful unless it’s telling us something, which means we’re making a decision about what it is telling us. How do we come up with those decisions? What are the factors that go into this decision making process? What is the best method for making decisions with data? This section tells us…
– Hypothesis Testing
– Binary Hypothesis Test
– Likelihood Ratio and Log Likelihood Ratio
– Bayes Risk
– Neyman-Pearson Criterion
– Receiver Operating Characteristic (ROC) Curve
– M-ary Hypothesis Test
– Optimal Decision Making
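As a sketch of the binary hypothesis test and log likelihood ratio items above, the Python below decides between two Gaussian hypotheses. The means, the standard deviation, the sample values, and the function names are all assumptions for illustration. A threshold of zero corresponds to equal priors and equal error costs.

```python
import math

def gaussian_log_pdf(x, mean, sigma):
    """Log of the normal density N(mean, sigma^2) evaluated at x."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mean) ** 2 / (2.0 * sigma**2)

def log_likelihood_ratio(samples, mean0, mean1, sigma):
    """Sum of log p(x | H1) - log p(x | H0) over independent samples."""
    return sum(
        gaussian_log_pdf(x, mean1, sigma) - gaussian_log_pdf(x, mean0, sigma)
        for x in samples
    )

def decide(samples, mean0=0.0, mean1=1.0, sigma=1.0, threshold=0.0):
    """Return 1 (choose H1) if the log likelihood ratio exceeds the threshold, else 0."""
    return 1 if log_likelihood_ratio(samples, mean0, mean1, sigma) > threshold else 0

print(decide([0.9, 1.2, 0.7]))   # samples clustered near mean1
print(decide([-0.3, 0.1, 0.2]))  # samples clustered near mean0
```

Raising or lowering the threshold trades false alarms against missed detections, which is the trade-off an ROC curve traces out.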
Estimation Theory
Sometimes we make characterizations of data – averages, parameter estimates, etc. Estimation from data is essentially an extension of decision making, which makes it a natural next section after Decision Theory.
– Estimation as Extension of M-ary Hypothesis Test
– Unbiased Estimation
– Minimum Mean Square Error (MMSE)
– Maximum Likelihood Estimation (MLE)
– Maximum A Posteriori Estimation (MAP)
– Kalman Filter
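To illustrate the maximum likelihood and unbiased estimation items above, here is a small Python sketch for Gaussian data. The data values and function names are made up for illustration. The maximum likelihood variance divides by n and is biased; dividing by n - 1 gives the unbiased estimate.

```python
def gaussian_mle(samples):
    """Maximum likelihood estimates of the mean and variance of Gaussian data."""
    n = len(samples)
    mean = sum(samples) / n
    var_mle = sum((x - mean) ** 2 for x in samples) / n  # divides by n (biased)
    return mean, var_mle

def unbiased_variance(samples):
    """Sample variance with the n - 1 correction (unbiased)."""
    n = len(samples)
    mean = sum(samples) / n
    return sum((x - mean) ** 2 for x in samples) / (n - 1)

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values only
mean, var_mle = gaussian_mle(data)
print(mean, var_mle, unbiased_variance(data))
```

The gap between the two variance estimates shrinks as the number of samples grows.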
Coordinate Systems
To bring various data elements together into a common decision making framework, we need to know how to align the data. Knowledge of coordinate systems and how they are used becomes important to lay a solid foundation for bringing disparate data together.
– Introduction to Coordinate Systems
– Euclidean Spaces
– Orthogonal Coordinate Systems
– Properties of Orthogonal Coordinate Systems (angle, dot product, coordinate transformations, etc.)
– Cartesian Coordinate System
– Polar Coordinate System
– Cylindrical Coordinate System
– Spherical Coordinate System
– Transformations Between Coordinate Systems
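A transformation between coordinate systems can be sketched in a few lines of Python. The function names here are my own, and the example covers only the two-dimensional Cartesian and polar cases.

```python
import math

def cartesian_to_polar(x, y):
    """Return (r, theta), with theta in radians measured from the positive x-axis."""
    return math.hypot(x, y), math.atan2(y, x)

def polar_to_cartesian(r, theta):
    """Inverse of cartesian_to_polar."""
    return r * math.cos(theta), r * math.sin(theta)

r, theta = cartesian_to_polar(3.0, 4.0)
x, y = polar_to_cartesian(r, theta)
print(r, theta, x, y)
```

Converting there and back recovers the original point up to floating-point rounding.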
Linear Transformations
Once we understand coordinate systems, we can learn why we transform the data to get at the underlying information. This section describes how we can transform our data into other useful data products through various types of transformations, including the popular Fourier transform.
– Introduction to Linear Transformations
– Properties of Linear Transformations
– Matrix Multiplication
– Fourier Transform
– Properties of Fourier Transforms (time-frequency relationship, shift invariance, spectral properties, Parseval’s Theorem, Convolution Theorem, etc.)
– Discrete and Continuous Fourier Transforms
– Uncertainty Principle and Aliasing
– Wavelet and Other Transforms
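As a sketch of the discrete Fourier transform and Parseval’s Theorem from the list above, the Python below uses a direct O(n^2) sum rather than an FFT, so it is for illustration only. The signal values and the function name dft are assumptions.

```python
import cmath

def dft(x):
    """Unnormalized discrete Fourier transform, computed directly from the definition."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

signal = [1.0, 2.0, 0.0, -1.0]  # illustrative values only
spectrum = dft(signal)

# Parseval's Theorem for this convention: sum |x|^2 == (1/n) * sum |X|^2.
energy_time = sum(abs(v) ** 2 for v in signal)
energy_freq = sum(abs(v) ** 2 for v in spectrum) / len(signal)
print(energy_time, energy_freq)
```

The two energies agree up to rounding, which is one way to see that the transform rearranges the information in the signal without losing any of it.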
Effects of Computation on Data
An often overlooked aspect of data science is the impact the algorithms we apply have on the information we are seeking to find. Merely applying algorithms and computations to create analytics and other data products has an impact on the effectiveness of data-driven decision making. This section takes us on a journey through advanced aspects of data science.
– Mathematical Representation of Computation
– Reversible Computations (Bijective Mapping)
– Irreversible Computations
– Impulse Response Functions
– Transformation of Probability Distributions (due to addition, subtraction, multiplication, division, arbitrary computations, etc.)
– Impacts on Decision Making
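Here is one small illustration of how a computation transforms a probability distribution, assuming two independent fair dice (my example, with hypothetical names). Adding two independent variables convolves their distributions, and the addition is irreversible: a total of 7 does not tell you which faces produced it.

```python
from collections import defaultdict
from fractions import Fraction

def distribution_of_sum(p, q):
    """PMF of X + Y for independent X ~ p and Y ~ q (a discrete convolution)."""
    out = defaultdict(Fraction)  # Fraction() is zero, so missing keys start at 0
    for x, px in p.items():
        for y, qy in q.items():
            out[x + y] += px * qy
    return dict(out)

die = {face: Fraction(1, 6) for face in range(1, 7)}  # uniform input
two_dice = distribution_of_sum(die, die)              # triangular output

for total in sorted(two_dice):
    print(total, two_dice[total])
```

The inputs are uniform but the output is not, so any decision threshold tuned on the raw data would need to be revisited after the computation.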
Prototype Coding / Programming
One of the key elements to data science is the willingness of practitioners to “get their hands dirty” with data. This means being able to write programs that access, process, and visualize data in important languages in science and industry. This section takes us on a tour of these important elements.
– Introduction to Programming
– Data Types, Variables, and Functions
– Data Structures (Arrays, etc.)
– Loops, Comparisons, If-Then-Else
– Functions
– Scripting Languages vs. Compilable Languages
– SQL
– SAS
– R
– Python
– C++
Graph Theory
Graphs are ways to illustrate connections between different data elements, and they are important in today’s interconnected world.
– Introduction to Graph Theory
– Undirected Graphs
– Directed Graphs
– Various Graph Data Structures
– Route and Network Problems
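A directed graph and a simple route problem can be sketched as follows. The adjacency-list dictionary, the node labels, and the function name fewest_hops are all assumptions for illustration; breadth-first search is one of several ways to find a route with the fewest edges.

```python
from collections import deque

def fewest_hops(graph, start, goal):
    """Breadth-first search; returns a path with the fewest edges, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk the parent links back to the start to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None

# Directed graph stored as an adjacency list (hypothetical example).
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(fewest_hops(graph, "A", "E"))
print(fewest_hops(graph, "E", "A"))
```

Because the edges are directed, a route from A to E exists while the reverse route does not.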
Algorithms
Key to data science is understanding the use of algorithms to compute important data-derived metrics. Popular data manipulation algorithms are included in this section.
– Introduction to Algorithms
– Recursive Algorithms
– Serial, Parallel, and Distributed Algorithms
– Exhaustive Search
– Divide-and-Conquer (Binary Search)
– Gradient Search
– Sorting Algorithms
– Linear Programming
– Greedy Algorithms
– Heuristic Algorithms
– Randomized Algorithms
– Shortest Path Algorithms for Graphs
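Binary search, listed above as the divide-and-conquer example, can be sketched like this in Python. The function name and the sample list are my own choices, and the input must already be sorted.

```python
def binary_search(sorted_values, target):
    """Return the index of target in sorted_values, or -1 if it is absent."""
    lo, hi = 0, len(sorted_values) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if sorted_values[mid] == target:
            return mid
        if sorted_values[mid] < target:
            lo = mid + 1  # target can only be in the upper half
        else:
            hi = mid - 1  # target can only be in the lower half
    return -1

values = [2, 3, 5, 7, 11, 13]
print(binary_search(values, 11))
print(binary_search(values, 4))
```

Each comparison halves the remaining range, so the search takes on the order of log n steps, versus n for an exhaustive scan.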
Machine Learning
No data science fundamentals course would be complete without exposure to machine learning. However, it’s important to know that these techniques build upon the fundamentals described in previous sections. This section gives practitioners an understanding of useful and popular machine learning techniques and why they are applied.
– Introduction to Machine Learning
– Linear Classifiers (Logistic Regression, Naive Bayes Classifier, Support Vector Machines)
– Decision Trees (Random Forests)
– Bayesian Networks
– Hidden Markov Models
– Expectation-Maximization
– Artificial Neural Networks and Deep Learning
– Vector Quantization
– K-Means Clustering
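To give a feel for K-Means Clustering, here is a minimal one-dimensional sketch in Python. The data points, the starting centers, the fixed iteration count, and the function name are assumptions for illustration; a real implementation would handle multiple dimensions, initialization, and convergence checks.

```python
def k_means_1d(points, centers, iterations=10):
    """Alternate between assigning points to the nearest center and re-averaging."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers

points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]  # illustrative values only
print(k_means_1d(points, centers=[0.0, 5.0]))
```

The assignment step is a nearest-center decision and the update step is a mean estimate, which is why the earlier sections on decision and estimation theory come first.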
Richard Feynman is one of the greatest scientific minds, and what I love about him, aside from his brilliance, is his perspective on why we perform science. I’ve been reading the compilation of short works of Feynman titled The Pleasure of Finding Things Out, and I recently came across a section that really hit home with me.
Look for strong math backgrounds. Data science requires a background in mathematics – this is really not negotiable. We made some mistakes in the past finding candidates that had strong software backgrounds, but didn’t really have the necessary math fundamentals. What happened? The software team didn’t have a strong appreciation for the data-crunching algorithms being developed, so there was a divide within our team; it became harder for the scientists and the software engineers to work together to achieve common goals. Knowledge of statistics, linear algebra, calculus, geometry, and trigonometry is a baseline requirement. I’ve even heard stories of a company (not ours, thank goodness…) that had a programmer implement algorithms incorrectly without really appreciating what was being done. An algorithm that used the sum of squared values was implemented by squaring the sum of the values, because it ran faster (by the way, these two algorithms are not the same!). This is a simple example, but if your implementation team doesn’t have a strong math background, they might not know the difference, causing you real problems down the road.
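The sum-of-squares mix-up is easy to check for yourself. This tiny Python sketch uses three made-up values to show that the two computations give different answers.

```python
values = [1, 2, 3]  # illustrative values only

sum_of_squares = sum(v * v for v in values)  # 1 + 4 + 9
square_of_sum = sum(values) ** 2             # (1 + 2 + 3) ** 2

print(sum_of_squares, square_of_sum)
```

For these values the results are 14 and 36, so swapping one for the other quietly changes every number downstream.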
Seek the willingness to program, not necessarily specific languages. Here’s the heretical statement – don’t pay as much mind to the actual programming languages someone has on their resume. They need to have experience programming and show that they can get their hands dirty with coding. However, data science is about learning and discovering; you need to be flexible. So, look beyond the recitation of R, Python, Pig/Hive, C/C++, Perl, MATLAB, IDL, SQL, SAS, Java, Unix shell, Ruby, Scala… Any good scientist who is willing and able to program can pick up a new language. It’s much harder to get someone who knows the ins-and-outs of a particular language to be a good scientist. Keyword searches of résumés targeting specific languages may exclude some strong candidates, while also including others that are weak yet know what to put on their résumé. Manual review can lead to the same result, since if you’re looking for Hadoop experience, you may hire that and onboard a less-than-stellar data scientist in the process.
Make sure candidates have a probabilistic/statistical view of the world. In data science, nothing is black and white. Data has two driving elements – the behavioral (characterized by what we know – our model for how the data is observed) and the statistical (everything else that we currently don’t understand). The job of the quality data science team is to characterize the behavioral and deal with the statistical. As your team gets to know the data better, they will find even more subtle drivers and nuggets of information, turning what used to be statistical into a more accurate behavioral model (this is the cool part of data science – amazing predictive ability!). Candidates must have an appreciation for a probabilistic view of the world, meaning that when a certain condition occurs, you expect the data to appear a given way with some probability (or only some of the time). A background in statistics is an absolute must in data science, so look for that on the résumé and test for it in your interviewing. With that said…
Look for people who are detail-oriented and want to get to the root cause. Statistics come from the lack of information about what drives the data we observe, which you can get at when you have more data. Sometimes there is a real root cause to what we see, and good data scientists try to figure out why. Technical staff members who aren’t detail-oriented tend to make more mistakes than those who are, leading to inaccurate results and incorrect conclusions. I’ve seen really smart people find some very confusing results in their data analysis and be stumped as to the cause. When we looked into it further, it was merely a bug in their algorithm (not necessarily in their implementation) which led to some subtle errors. A probabilistic view of the world is important, but having a taste for getting to the bottom of things is equally valuable.
Find people who can communicate effectively. An often overlooked quality for data science candidates is top communication skills. Even if someone is working alone on their data analysis, they have to communicate with someone, whether that is his boss or her colleague; no one works alone. I’ve written several articles about the importance of communicating (such as What We Can Learn From Stephen Hawking, Why Scientists Are Lousy Communicators, and tips on Job Interview Presentations), and it becomes especially important for those in the sciences. Math and science geeks think presenting is merely for marketers and sales people… Not so! If you want others to believe the results of your data science teams, your team has an obligation to communicate effectively.
Include your current scientific staff in interviews. We know that hiring is the job for managers. However, including your current staff in the interview process can yield real benefits. It can ensure that new candidates will be good fits for the organization and can even improve the company. In his 1998 letter to shareholders, Jeff Bezos, CEO of Amazon, detailed three questions that were asked of his hiring teams when evaluating candidates. Here’s what Bezos wrote about these questions:
Bring members of your current team in with the understanding that you’re looking for people who will make their team better, and the help from your current staff will be valuable in assessing talent.
Don’t get so hung up on brainteasers – whether they can or can’t answer them. I know that some companies like to put candidates on the spot and get them to solve brainteasers during their interview. Personally, I find this to be a waste of time and an inaccurate way to tell whether someone will work well as a data scientist on your team. Some people need a little time to work through a problem, but if they have that time, they nail it. Others get to the right answer by trying out many things, learning from their mistakes, and homing in on what works. Brainteasers would make these candidates look like they can’t do the job, so they’d get weeded out. Plus, if someone happened to solve a brainteaser quickly, it may mean that they’ve been exposed to that particular problem before, which is why they know it so easily (for example, here’s one: For any prime number p > 5, show me why p^2 - 1 is divisible by 24…). You aren’t hiring someone who can solve the problem – you are hiring someone who can find the solution to the problem. They may solve it themselves (which can be especially important when the problem has never been solved before), but if it has been solved, why would you want someone who is predisposed to solving it over again? Instead…
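For the curious, the prime-number claim in that brainteaser can be spot-checked numerically. The Python sketch below is my own and tests primes below 1000 only, so it is a sanity check and not a proof. The comment outlines the standard argument.

```python
def is_prime(n):
    """Trial division; adequate for the small range checked here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

# Why it holds: p^2 - 1 = (p - 1)(p + 1). Since p is odd, these are consecutive
# even numbers, so one is a multiple of 4 and the product is a multiple of 8.
# Since p is not a multiple of 3, one of p - 1 and p + 1 is, giving 8 * 3 = 24.
primes = [p for p in range(6, 1000) if is_prime(p)]
assert all((p * p - 1) % 24 == 0 for p in primes)
print(len(primes), "primes checked")
```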
Ask open-ended questions that reveal how people approach problems. There is a great book, Are You Smart Enough To Work at Google?, which details how Google evaluates candidates for their teams. There is even an insightful question they have asked: You are shrunk to the height of a nickel and thrown into a blender. Your mass is reduced so that your density is the same as usual. The blades start moving in 60 seconds. What do you do? (How would you answer this?…) Google posts on their website how they approach their interview process and what they look for. They generally look at four elements:
Use a group interview process when possible. Having an interview process that is back-to-back-to-back-to-back one-on-one interviews leads to repeat questions, making it tiring for the candidate. Additionally, when the interviewing team gets together to discuss the candidate (if they do at all), each member has a different perspective on the candidate because different questions may have been asked and different answers might have been given for the same repeat question. When you have multiple people (four to six) hearing the same thing as part of a group interview, you can get a better feel for the person coming on board. The information is the same, but different people pick up on different things, so it gives the team a more well-rounded perspective on the candidate. Something to keep in mind: These types of interviews can be intimidating for someone being interviewed, so it’s important to establish an environment of trust from the start. Make them comfortable so that you can get the best out of them.
Look for people that can tell you what they’ve learned, not just telling you what they did. Machine learning algorithms are great at exploiting separations in data. But, why are we looking for separations in data? To make better decisions with that data. The tools of data science are important to know, but if we don’t look for the “why” in the data science we are performing, then we are just using tools for the sake of using tools. Just because someone is an expert in hammers and nails doesn’t make him a carpenter. Extracting information out of data is all about context – what question are you asking of the data, and what drives what you see? This is about understanding, forming hypotheses, drawing conclusions. If a data scientist starts down the path of “we used this algorithm and the metrics came out like this…” without giving you some context or understanding of what it means, then you and your team could run into problems down the road (overfitting, building models that aren’t robust, etc.). It’s the difference between hearing about what you did on your summer vacation and what you learned on your summer vacation. In looking for what makes a good data scientist, DJ Patil talks about storytelling – the ability to use data to tell a story and to be able to communicate it effectively. Data scientists need to understand what they are trying to communicate and let their data science help them tell that story. No one really wants to hear what you did on your summer vacation, but they may want to know what you’ve learned and how you learned it.
I’ve launched books that’ve failed. I did a book called “E-mail Addresses of the Rich and Famous” – Roger Ebert got really mad at me. I’ve made videotapes that didn’t work; I’ve made books that didn’t work.
My lesson was: If I fail more than you do, I win, because built into that lesson is this notion that you get to keep playing. If you get to keep playing, you get to keep failing, and sooner or later, you’re going to make it succeed.
The people who lose are either the ones who don’t fail at all and get stuck, or ones who fail so big they don’t get to play again…
If you’re talking to a pacemaker assemblyman or an airline pilot, they don’t try stuff; they don’t say, “I wonder what happens if I do this,” and we’re really glad they don’t do that, because the cost of failing is greater than the cost of discovering what works and what doesn’t.
But almost no one I know builds pacemakers and I don’t know airline pilots. Most of us now live in a world where the kind of failure I’m talking about isn’t fatal at all. If you post a blog post and it doesn’t resonate with people, post another one tomorrow. If you tweet something and no one retweets it, tweet again in an hour. If you’re obsessed with doing what everyone else is doing, because of someone saying “you failed,” then you’re in really big trouble.
The first was a small article by Brian Dumaine about the work being done at Applied Proteomics to identify cancer before it develops. At Applied Proteomics, they use mass spectroscopy to capture and catalog 360,000 different pieces of protein found in blood plasma, and then let supercomputers crunch on the data to identify anomalies associated with cancer. The company has raised $57 million in venture capital and is backed by Microsoft co-founder Paul Allen. You can read the first bit of the article here.
The second is from the Word Check callout, showing how access to information is making the world a better place:
wasa: Pronounced [wah-SUH]
(noun) Arabic slang: A display of partiality toward a favored person or group without regard for their qualifications. A system that drives much of life in the Middle East — from getting into a good school to landing a good job.
But on the Internet, there is no wasa.
– Adapted from Startup Rising: The Entrepreneurial Revolution Remaking the Middle East by Christopher M. Schroeder
=================
Chris Hardwick didn’t rely on just his nerdy instincts in founding his company; he also took inspiration from his heroes. Super-power your business with these lessons from some epic nerd properties.
SUPERMAN: In this summer’s film Man of Steel, Superman’s dad tells his son, “You will give the people of Earth an ideal to strive towards.” Hardwick believes businesses can achieve that, too. “Altruism,” he says. “That’s what businesses should learn from Superman.”
LORD OF THE RINGS: With all the adventures of Frodo and the fellowship, J.R.R. Tolkien’s tale has plenty of lessons for startups. “Always do what you believe is right,” Hardwick advises. “No matter how much you think you can handle it, don’t pick up the ring… Don’t toy with darkness.”
X-MEN: As diverse as Wolverine and the gang were, they all had a unified vision, organized behind the infrastructure built by Professor X. “He gave them tools; he gave them a home base; he organized the community,” Hardwick says. “He’s the CEO of X-Men, basically.”
SUPER MARIO: Hardwick’s favorite lesson is that of perseverance. “No matter how many times the princess gets kidnapped, you’re going to rescue her,” he says.
STAR TREK: With species ranging from Vulcans to humans, the voyages of the starship Enterprise are all about diversity, Hardwick says. “Race or species was irrelevant,” he points out. “It was all about working together as a team.”
THE WALKING DEAD: While zombies might be out for themselves (and brains!), Hardwick believes the popular comic book and TV show is all about community. “Essentially they assemble a party of experts in their world who are all good at something and contribute to the group,” he says.
BATMAN: Remember when Bruce Wayne’s parents were killed? Well, Batman never forgot it. And by studying the Dark Knight, Hardwick notes, entrepreneurs can learn about “turning adversity into something constructive.”
STAR WARS: The sprawling epic of Darth Vader and son Luke Skywalker has a simple message. “It’s the David and Goliath story,” Hardwick says. “Fight for what you believe in, don’t stop at any cost, and you’ll triumph.”
========================
Here’s the link to the online version if you’re interested:
http://entrepreneur.coverleaf.com/entrepreneur/august_2013?pg=36#pg36