Popular Science: Wolfram on Big Data

I’m a big fan of Stephen Wolfram and his efforts in building Mathematica and pushing forward his approach to scientific discovery, A New Kind of Science.  In a recent post, Popular Science editor Mark Jannot talks to Wolfram about big data, human understanding, and the origin of the universe.

Here’s just one back-and-forth between Jannot and Wolfram:

Jannot:  A couple years ago at TED, Tim Berners-Lee led the audience in a chant of “More raw data now!” so he’s out there trying to create the data Web. And your project in Wolfram Alpha is not to deliver raw data but to deliver the interface that allows for raw data to turn into meaning.

Wolfram:  Yes, what we’re trying to do—as far as I’m concerned, the thing that’s great about all this data is that it’s possible to answer lots of questions about the world and to make predictions about things that might happen in the world, and so on. But possible how? Possible for an expert to go in and say, “Well, gosh, we have this data, we know what the weather was on such-and-such a date, we know what the economic conditions at such-and-such place were, so now we can go and figure something out from that.” But the thing that I’m interested in is, Can one just walk up to a computer and basically be able to say, “OK, answer me this question.”

This part of the Q&A is particularly interesting, since it highlights a difference in what people want from technology.  Berners-Lee seems to want more “raw data”, while Wolfram is highlighting that the data isn’t really important unless you can turn it into actionable information.  Wolfram|Alpha does just this – the technology uses Wolfram’s understanding of computation (which he built up as part of his wildly successful Mathematica product line) and lets us answer questions.

It’s an incredibly rich article – one worth reading (1) if you’re interested in data and where it’s taking us, and (2) if you’re interested in Wolfram and his take on science and technology.  I’m interested in both, so I think it’s well worth highlighting…

Here’s the Popular Science article, and another post to Wolfram|Alpha that highlights the history of computable knowledge (you can even order a poster of the timeline here…).  I’ve had a number of other posts on Wolfram and his scientific approach, which might be worth looking into as well…

Data Science and Winemaking

David Smith is an author, blogger and R evangelist for Revolution Analytics, where he serves as VP of Marketing, and he blogs daily at blog.revolutionanalytics.com.  This weekend, he posted an article about the use of data science in the winemaking business (yes, data science is getting to be everywhere!…)

The article is about Palmaz Vineyards in Napa Valley, a couple of hours’ drive north of the Revolution offices.  There was an article in Food and Wine magazine about Palmaz’s tracking of data all through their winemaking process.

It’s an interesting article, even if you’re only interested in wine!…

O’Reilly’s List of the World’s Most Powerful Data Scientists

I came across this list through Bob Gourley’s post on sys-con.com.  Bob references a Forbes pictorial of Tim O’Reilly’s list of the world’s 7 most powerful data scientists.  O’Reilly is the founder and CEO of O’Reilly Media, which publishes plenty of technical resources and hosts the fast-growing Strata conferences.

Bob gives a nice overview of the Forbes list – you should read his analysis here…  But here are O’Reilly’s top seven (well, ten…):

  • Larry Page (Google)
  • Jeff Hammerbacher (Cloudera) and DJ Patil (Greylock Ventures)
  • Sebastian Thrun (Stanford) and Peter Norvig (Google)
  • Elizabeth Warren (Candidate, U.S. Senate, Massachusetts)
  • Todd Park (Dept. Health and Human Services)
  • Alex “Sandy” Pentland (MIT)
  • Hod Lipson and Michael Schmidt (Cornell)

Internet Evolution: Interview with Moneyball’s Lewis and Beane

Here’s a neat little interview of Michael Lewis and Billy Beane, conducted by Internet Evolution’s Todd Watson.  Watson was attending the Information On Demand event this past month, where one of the key themes was the idea of putting business analytics into practice to help improve business outcomes.  Watson felt that Beane did a great job of this in the business of baseball, and Lewis did a great job of writing about it, so he got both together for this interview.

Billy Beane is the general manager of the Oakland Athletics, and he changed the way that major league baseball teams use data to field their rosters.  Michael Lewis is the author of Moneyball, documenting Beane’s efforts to build a winning baseball franchise while being limited to a payroll that is dwarfed by those of his competition.

Lewis’ book was recently turned into a major motion picture featuring Brad Pitt as Beane and Jonah Hill as the statistics whiz who helps Beane turn the A’s around.

Here’s just a little bit from Watson’s interview on how Lewis got turned on to writing about Beane and the A’s:

Todd Watson: One of the key themes of the IOD event has been “turning insight into action,” and that seems to be a theme prevalent in some of your books — most notably Moneyball and The Big Short. I’m curious, in terms of baseball managers who are using sabermetrics to make more informed decisions, I’m really interested in how you got turned on to that topic and also just how that came to be and what inspired you to write the book?

Michael Lewis: It was really simple. I was living in Billy’s backyard in Berkeley so I was paying attention to the A’s. I didn’t know… I wasn’t a baseball fanatic, but I did know there was this payroll issue and I got interested in that.

I got interested in that in the first place, because at first I thought I was going to write a piece about the A’s. I think it was when Jose Canseco got this giant deal, and he was being paid something like $8 million, and the right fielder and left fielder were being paid something like $150,000, and I wanted to know if the outfielders were pissed!

And, how they felt when Jose Canseco dropped a fly ball. (Laughter) And I was going to come out and write about that, and then I started thinking about it, and I realized there were these huge discrepancies from team to team. And then I wondered, so how does the whole team feel about being poor?

I enjoyed Moneyball, both the movie and the book.  I have mentioned before that Lewis is a really great author – I wrote another post about Michael Lewis’ book on the 2008 global financial meltdown, called The Big Short.

You can read more of Watson’s interview with Lewis and Beane here.

Popular Science: Turning Raw Data into Useable Knowledge

Here’s a thought-provoking post from Popular Science about how we need to think in dealing with the upcoming ocean of data.  Below is one of the interesting segments from the post, as provided by Bill Anderson of the School of Information at the University of Texas at Austin and associate editor of the CODATA Data Science Journal:

Take the number 37, Anderson says. Other than stating a numerical order, it means little on its own. But with some more information — 37 degrees Celsius, for instance — it can take on more meaning. Now give it some context: 37 degrees C is normal body temperature. Now 37 represents something useful, something a doctor or researcher could use, and it becomes a piece of knowledge that could comfort a patient or answer a question.

Anderson says there really is no such thing as “raw data”, which is generally right.  What’s really important to understand is that data itself isn’t important without a question we’re trying to answer with the data.  A single piece of data has information about a lot of things, but only if we are asking the right questions of that data.

For example, with the number 37 above, we might want to ask if it’s hot outside.  We first need some expectation for what the data would look like if it was hot, and what it might look like if it wasn’t.  We don’t really need to know the units of measurement (Celsius, for example), but we do need to know if 37 is normal (although the units help you figure that out), and “hot” means that the data would be larger than normal.

With the right questions (and some questions are easier to answer with data than others…), we can make the data tell us what we want to know.
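To make the “is it hot?” question concrete, here is a minimal sketch in Python.  It assumes we have some past readings that count as “normal”, and that “hot” means a reading well above that baseline; the function name `is_hot`, the two-standard-deviation cutoff, and the sample readings are all made up for illustration and are not from the Popular Science post.

```python
from statistics import mean, stdev


def is_hot(reading, normal_readings, threshold=2.0):
    """Return True if `reading` is well above what is normal.

    "Well above" is an assumption here: more than `threshold` sample
    standard deviations above the mean of `normal_readings`.
    """
    baseline = mean(normal_readings)
    spread = stdev(normal_readings)
    return reading > baseline + threshold * spread


# Hypothetical "normal" outdoor readings from two different thermometers.
# The function never sees the units, only the expectation they imply.
celsius_normals = [20, 22, 21, 23, 19, 22, 21]
fahrenheit_normals = [68, 72, 70, 73, 66, 72, 70]

print(is_hot(37, celsius_normals))     # True: 37 is far above this baseline
print(is_hot(37, fahrenheit_normals))  # False: 37 is far below this baseline
```

The same number 37 gives opposite answers depending on the expectation we bring to it, which is the point: the question and the baseline carry the meaning, not the number on its own.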

Here’s the link to the Popular Science post…

Upcoming Data Science Conferences

Looking ahead, there are a number of interesting upcoming conferences within the “big data” and “data science” fields, so I thought I’d list a few in this post.

First, Strata is probably one of the biggest and fastest-growing conferences within the field, and they’ll be holding their next conference in Santa Clara, CA, starting on February 28, 2012.  Featured speakers include danah boyd from Microsoft Research, Hal Varian, Chief Economist for Google, and Pete Warden, CTO of Jetpac.  They hold online conferences as well – here’s more info on the December 7 event from a previous post.

Next, Predictive Analytics World hosts a number of conferences around the world and for differing industries.  Their next conference is scheduled for November 30 in London, UK, in conjunction with the Marketing Optimization Summit and Conversion Conference as part of Data Driven Business Week.  PAW held a recent conference in New York this past October and will be hosting another one in San Francisco in March of 2012.

Also, The Big Data Summit team announced today session details for the upcoming technology event, November 8-10, 2011 in Miami, Florida.  The Summit is for C-level executives who are involved in data storage, data management and data analysis to gather and discuss how companies can effectively manage, protect and leverage the growing amounts of data in the enterprise.

Nerd Pride Friday: Triumph of the Nerds

This Friday, I thought I’d highlight one of my favorite and nerdiest documentaries of the technology industry.  Given the presence of the late Steve Jobs in the media lately – his loss to cancer, the new Walter Isaacson biography, and Apple’s re-creation of personal computing – I thought it would be nice to reflect upon Jobs’ first major impact on our technological lives.

Fifteen years ago, PBS created a documentary called Triumph of the Nerds:  An Irreverent History of the PC Industry.  It was hosted by Bob Cringely, who at the time wrote a column for InfoWorld about the goings-on of Silicon Valley and now writes a weekly column, I, Cringely.  In it, Cringely humorously and effectively describes how Steve Jobs and Bill Gates created the PC industry as we know it.

I was always impressed with this documentary and the impact the subjects (especially Jobs) had on our lives through their business and technology pursuits.  I personally think it’s amazing and worth your time watching.  They made another documentary about the history of the Internet:  Nerds 2.0.1 - this second documentary was done in 1998 and the Internet as we know it was probably only three years old or so…

IBM Presentation at IOD

On the website Enterprise Irregulars, Evangelos Simoudis wrote a post on his talk at the recent Information On Demand conference in Las Vegas. 

Simoudis mentions that what is generally referred to today as “big data” is data that is large in size, semi-structured or unstructured in form, and real-time or near real-time in the way it is generated and consumed.  This melds well with another description of the two different types of big data – “fast” big data that is real-time and “slow” big data that is generally needed for analysis and hypothesis testing.

Simoudis is Managing Director at Trident Capital, focusing on investments in Internet and software businesses, and here’s what he had to say in the Enterprise Irregulars post about why he believes IBM’s Watson technology is important:

1. First, it uses a question and answer interaction which business users find more natural as it enables them to incrementally improve their understanding of a problem.

2. Second, it effectively combines structured with unstructured data some of which is curated, such as published articles of special or general interest, while other is dynamically collected from the open internet.

3. Third, Watson’s data analysis speed, that is the result of its underlying architecture, makes the system suitable for several application areas, particularly those where data remains useful for a short period such as medical analysis, financial analysis, and consumer sentiment analysis.

4. And finally, Watson’s concurrent use of many analysis and prediction techniques, not only provides a unique approach to machine learning and fact-prediction, but more importantly it enables the analytic application to explore more alternatives to a possible solution, thus increasing the probability of successfully addressing a problem.

I do believe that this is where analytics technology development is heading – we took note of IBM’s Watson technology some time back.  You can read more about Simoudis’ thoughts and his IOD talk here.

WSJ: Profile of 1010Data

The Wall Street Journal has a profile of 1010data, as part of their ongoing series of successful companies in the big data space.  According to the WSJ profile, the New York-based company’s customers pay a subscription and hand their data over to the company, which gives it back to them in a format that’s nearly identical to a spreadsheet.  This newly formatted data can then be manipulated and analyzed with additional advanced analytical capabilities.

1010data has been successful in raising capital, landing $35 million for a minority stake from Norwest Venture Partners early last year.  Here’s the WSJ profile to learn more about 1010data, and you can find their website here.

Kaggle Raises $11M for Data Science Spec Work

It’s been common practice within the graphic arts world to have competitions for logo design – companies put out a call for a logo, and designers submit examples in order to win the work.  It appears that this model is reaching the analytics development world…

It was announced today by VentureBeat that Kaggle just raised $11 million to support their algorithm bakeoff business.  Here’s a bit of the background on the concept from the VentureBeat piece:

Promulgated by sites like 99designs, the idea behind speculative work is that there are hordes of hungry creatives just raring for a chance at a job. A potential client stages a competition to determine who gets the contract. The creatives compete, each creating work specifically for the client, most wasting hours of time in the process.

For the client, however, it’s a win: They don’t risk anything on an unknown, and the result is often reasonably acceptable.

Kaggle is a new company that is bringing this concept — getting smart people to do specific work free of charge — from the creative industries into the sciences. The startup just announced an $11 million round of institutional funding, quite a large amount considering this is the company’s first round.

As with the creative graphics industry, this model works fine if what you want is something quick, which would then be cheap.  For many of the easy data science problems, this would work well enough, since textbook solutions can be implemented easily enough by smart programmers.

However, in this model, what you get is a bunch of “scientists” hacking up algorithms to win a gig, which will get the client most of the way there.  Eventually, though, the client will hit the wall and enter the data fog – not knowing how to proceed or how to improve upon the original work.  You can’t get past that point cheaply with crowdsourced algorithm work – it takes a specific analytics engineering discipline…

Read more about this shift in the data science industry here.

© 2011 Mic Farris. All rights reserved.
