The Fundamentals of Data Science

 

Two of the biggest buzzwords in our industry are “big data” and “data science”. Big data is getting a lot of interest right now, but data science is fast becoming a very hot topic.

I think there’s room to really define the science of data science - what are those fundamentals that are needed to make data science truly a science we can build upon?

Below are my thoughts for an outline for such a set of fundamentals:

Fundamentals of Data Science

Introduction
- What is Data?
- The Goal of Data Science
- The Scientific Method

Probability and Statistics
- The Two Characteristics of Data
- Examples of Statistical Data
- Introduction to Probability
- Probability Distributions
- Connection with Statistical Distributions
- Statistical Properties (Mean, Mode, Median, Moments, Standard Deviation, etc.)
- Common Probability Distributions (Discrete, Binomial, Normal)
- Other Probability Distributions (Chi-Square, Poisson)
- Joint and Conditional Probabilities
- Bayes’ Rule
- Bayesian Inference

Decision Theory
- Hypothesis Testing
- Binary Hypothesis Test
- Likelihood Ratio and Log Likelihood Ratio
- Bayes Risk
- Neyman-Pearson Criterion
- Receiver Operating Characteristic (ROC) Curve
- M-ary Hypothesis Test
- Optimal Decision Making

Estimation Theory
- Estimation as Extension of M-ary Hypothesis Test
- Unbiased Estimation
- Minimum Mean Square Error (MMSE)
- Maximum Likelihood Estimation (MLE)
- Maximum A Posteriori Estimation (MAP)
- Kalman Filter

Coordinate Systems
- Introduction to Coordinate Systems
- Euclidean Spaces
- Orthogonal Coordinate Systems
- Properties of Orthogonal Coordinate Systems (angle, dot product, coordinate transformations, etc.)
- Cartesian Coordinate System
- Polar Coordinate System
- Cylindrical Coordinate System
- Spherical Coordinate System
- Transformations Between Coordinate Systems

Linear Transformations
- Introduction to Linear Transformations
- Properties of Linear Transformations
- Matrix Multiplication
- Fourier Transform
- Properties of Fourier Transforms (time-frequency relationship, shift invariance, spectral properties, Parseval’s Theorem, Convolution Theorem, etc.)
- Discrete and Continuous Fourier Transforms
- Uncertainty Principle and Aliasing
- Wavelet and Other Transforms

Effects of Computation on Data
- Mathematical Representation of Computation
- Reversible Computations (Bijective Mapping)
- Irreversible Computations
- Impulse Response Functions
- Transformation of Probability Distributions (due to addition, subtraction, multiplication, division, arbitrary computations, etc.)
- Impacts on Decision Making

Prototype Coding / Programming
- Introduction to Programming
- Data Types, Variables, and Functions
- Data Structures (Arrays, etc.)
- Loops, Comparisons, If-Then-Else
- Functions
- Scripting Languages vs. Compiled Languages
- SQL
- SAS
- R
- Python
- C++

Graph Theory
- Introduction to Graph Theory
- Undirected Graphs
- Directed Graphs
- Various Graph Data Structures
- Route and Network Problems

Algorithms
- Introduction to Algorithms
- Recursive Algorithms
- Serial, Parallel, and Distributed Algorithms
- Exhaustive Search
- Divide-and-Conquer (Binary Search)
- Gradient Search
- Sorting Algorithms
- Linear Programming
- Greedy Algorithms
- Heuristic Algorithms
- Randomized Algorithms
- Shortest Path Algorithms for Graphs

Machine Learning
- Introduction to Machine Learning
- Linear Classifiers (Logistic Regression, Naive Bayes Classifier, Support Vector Machines)
- Decision Trees (Random Forests)
- Bayesian Networks
- Hidden Markov Models
- Expectation-Maximization
- Artificial Neural Networks and Deep Learning
- Vector Quantization
- K-Means Clustering
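
To make one item from the outline concrete, here is a minimal sketch of Bayes’ Rule for a binary hypothesis, written in Python. The function name `posterior` and all of the numbers (a 1% prior, a 95% detection rate, a 5% false positive rate) are assumptions picked purely for illustration; they do not come from any real data set.

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' rule for a binary hypothesis H given a positive observation.

    prior               -- P(H)
    likelihood          -- P(positive | H)
    false_positive_rate -- P(positive | not H)
    Returns P(H | positive).
    """
    # Total probability of seeing a positive observation at all
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers, chosen only to illustrate the calculation
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(H | positive) = {p:.3f}")
```

With a rare hypothesis, even a fairly accurate test leaves the posterior well below one half, which is the kind of intuition the Probability and Decision Theory sections are meant to build.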

Question:  Do you have any thoughts on the fundamentals of data science? You can leave a comment below.

A Data Science Lesson from Richard Feynman

Richard Feynman

Richard Feynman is one of the greatest scientific minds, and what I love about him, aside from his brilliance, is his perspective on why we perform science. I’ve been reading The Pleasure of Finding Things Out, a collection of Feynman’s short works, and I recently came across a section that really hit home with me.

In the world of data science, much is made of the algorithms used to work with data, such as random forests or k-means clustering. However, I believe there is a missing component – one that deals with the fundamentals underlying data science, and that is the real science of data science.

10 Things To Know When Hiring Data Scientists

I was practicing data science before there was a field called “data science”, so I’ve had the opportunity to work with and hire a lot of great people. But if you’re trying to hire a data scientist, how do you know what to look for, and what should you consider in the interview process?

I’ve been doing what is now called “data science” since the early 1990s and have helped to hire numerous scientists and engineers over the years.  The teams I’ve had the opportunity to work with are some of the best in the world, tackling some of the most challenging problems facing our country.  These folks are also some of the smartest people I’ve ever had the opportunity to work with.

That said, not everyone is a good fit, and the discipline of data science requires certain key elements. Hiring someone into your team is incredibly important to your business, especially if you’re a small startup or building a critical internal data science team; mistakes can be expensive in both time and money. This can be even more intimidating if you don’t have a background or experience in hiring scientists, especially someone responsible for this new discipline of working with data.

Beating Cancer and Favoritism with Data

I read a couple of items in this month’s Fortune magazine that I thought were worth passing along.

The first was a small article by Brian Dumaine about the work being done at Applied Proteomics to identify cancer before it develops. At Applied Proteomics, they use mass spectrometry to capture and catalog 360,000 different pieces of protein found in blood plasma, and then let supercomputers crunch the data to identify anomalies associated with cancer. The company has raised $57 million in venture capital and is backed by Microsoft co-founder Paul Allen. You can read the first bit of the article here.

The second is from the Word Check callout, showing how access to information is making the world a better place:

wasa: Pronounced [wah-SUH]

(noun) Arabic slang:  A display of partiality toward a favored person or group without regard for their qualifications.  A system that drives much of life in the Middle East — from getting into a good school to landing a good job.

But on the Internet, there is no wasa.

- Adapted from Startup Rising: The Entrepreneurial Revolution Remaking the Middle East by Christopher M. Schroeder

A Review of The Signal and The Noise, A New Book by Nate Silver

Imagine a guy with glasses who used to model baseball stats and play online poker nailing the outcome of the 2012 elections. And when I say “nailing”, I mean that he correctly predicted the U.S. Presidential contest in every one of the 50 states (and nearly every U.S. Senate race, too). He even performed better than some of the most widely used polling firms. Now imagine that he gives his thoughts on making these types of predictions. That’s exactly what Nate Silver does in his new book The Signal and the Noise.

I’ve worked in what’s now being called “data science” for nearly twenty years. The title of Silver’s book – The Signal and the Noise – presents an important and sometimes overlooked part of this science. The “signal” is what we’re looking for in the data, and the “noise” is all the stuff in the data that gets in the way of what we’re looking for.
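
The signal and noise idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the signal is a single constant value, the noise is zero-mean Gaussian, and the names `TRUE_SIGNAL`, `NOISE_STD`, `observe`, and `estimate` are hypothetical, not taken from Silver’s book.

```python
import random
import statistics

random.seed(0)  # fixed seed so repeated runs give the same numbers

TRUE_SIGNAL = 5.0  # hypothetical value we are trying to recover
NOISE_STD = 2.0    # hypothetical spread of the noise


def observe(n):
    """Return n measurements: the signal plus zero-mean Gaussian noise."""
    return [TRUE_SIGNAL + random.gauss(0.0, NOISE_STD) for _ in range(n)]


def estimate(samples):
    """Estimate the signal by averaging, which shrinks zero-mean noise."""
    return statistics.mean(samples)


few = observe(10)
many = observe(10_000)
print(f"estimate from 10 samples:     {estimate(few):.2f}")
print(f"estimate from 10,000 samples: {estimate(many):.2f}")
```

Any single measurement can land far from the signal, but the average of many measurements tends to settle close to it. Real data is rarely this well behaved, since the noise may not be zero-mean or independent.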