Archives For Data Science

This section is dedicated to the burgeoning field of data science and its applications.

You know that data science is truly becoming a recognized scientific discipline when universities are prepared to spend billions of dollars on its future.

I wrote previously about Columbia’s effort to expand its Manhattan campus to build a data science and engineering center.  However, Stanford and Cornell are also in the race. 

Much of this comes from a desire by the City of New York to become home to a leading engineering and applied science campus.   NYC is willing to invest $100 million into infrastructure improvements for the winner.

Stanford and Cornell have put in bids for the project to use the land on Roosevelt Island, while Columbia will be expanding its Manhattanville campus.  Other schools are looking to expand in NYC as well: Carnegie Mellon is looking at the Brooklyn Navy Yard, and NYU wants to move into Downtown Brooklyn.

Edd Dumbill, the general manager for the Strata Conference, recently wrote a nice post on Google+ titled “Why Do We Need Data Science?”

Here is a really good insight from Dumbill on how data science applies to business:

Why is the scientific method applicable to business and data?

Every company’s business is complex in itself, and they operate in a complex world. The financial, economic and societal structures we live and do business in are complex. Because of this complexity and interactions, businesses can be viewed in the same light as organic, biological systems. They are complex entities within a complex system.

This is where science comes into play. Even assuming you could come up with a top-down mathematical model of your business, there’s too much interaction and randomness with complex systems for your model to be practical. Thus, the exploratory approach of science becomes useful to a business.

Your world and business is a giant laboratory, and ever more so as the world becomes more networked. By employing data scientists you can discover better how your business works, how it can be improved, and find new things you can do that you didn’t know of before. To do this, you must connect up three kinds of people: the business folk, the data scientists and the data engineers.

I do like Dumbill’s take here, and there is absolute merit in applying the scientific method to business activities.  Peter Wang of Streamitive commented on Dumbill’s post as well, and has some interesting points…

Ultimately, data doesn’t mean anything without trying to answer questions.  To get actionable information, you need data and you need to be asking the right questions. That’s why the scientific method is so important – it’s all about posing a hypothesis or asking a question, and then squeezing the right information out of the data in order to answer it.

Based upon their Magic Quadrant analysis of data integration tools, Gartner rates Informatica Corp. and IBM as the top software vendors in the space.

Gartner uses a Magic Quadrant to rate companies as leaders, challengers, niche players and visionaries based on several criteria including “completeness of vision” and “ability to execute.”  From Gartner’s website:

  • Leaders execute well against their current vision and are well positioned for tomorrow.
  • Visionaries understand where the market is going or have a vision for changing market rules, but do not yet execute well.
  • Niche Players focus successfully on a small segment, or are unfocused and do not out-innovate or outperform others.
  • Challengers execute well today or may dominate a large segment, but do not demonstrate an understanding of market direction.

A post by Mark Brunelli, Senior News Editor at SearchDataManagement, has a more detailed analysis of the Gartner report.  Here’s what Brunelli wrote, detailing some of the thoughts of Ted Friedman, a Gartner vice president, information management analyst, and co-author of the report:

“You’re hearing a lot about big data and analytics around big data,” Friedman said. “To do that kind of stuff you’ve got to collect the data that you want to analyze and put it somewhere. [That] in effect is a job for data integration tools.”

It does seem that the main focus right now in this space is on data handling and data management.  A lot of work is being done by companies to create data visualization tools to gain insight from the data, but as the problems get much harder, better analytics approaches will need to be brought to bear.  The real key over the next few years will be the smart analysis of all this data, turning the data into reliable, actionable information.

As the data science and big data technology booms start accelerating, it’s worth noting how these technologies will change our lives – both positively and potentially negatively.

I posted previously about the ongoing discussion of privacy, but I’ve found another post on GigaOM about the same topic.  According to the article, the Supreme Court of the United States heard oral arguments on Tuesday in a case that could decide how connected the concept of big data is to constitutional expectations of privacy.

The case, United States v. Jones, is specifically about whether police needed a search warrant to place a GPS device on a suspect’s car and monitor his movements for 28 days.  Several justices, however, seized upon a very important question: How much data is too much before allowable surveillance crosses the line into an invasion of privacy?  This is a really nice post, and if you’re interested in the constitutional issues regarding privacy (for example, an appellate court has found that warrantless GPS tracking is a violation of the Fourth Amendment), I’d recommend that you take time to read the article.

These two posts do highlight interesting differences in privacy and who controls our data.  We sometimes have a knee-jerk reaction to institutions that keep data on us and then use it for other purposes (whether they benefit us or not).  George Orwell’s 1984 and the Big Brother metaphors with which we’re all familiar deal with government controlling the data and what it can do with it – that’s what the US v. Jones case is really all about.

However, in the private world where we interact with companies and people more directly, it’s not really a Big Brother issue, because we give up our privacy all the time – there’s no legal requirement to give up data; we do it by choice.  We willingly give up our privacy in order to benefit from technology – little bit by little bit.  If we want a website to provide us great recommendations (say Netflix), the company is going to have to know more about us – what we like, and what we don’t like.  

It seems a bit “Big Brother”, but even people store data about us all the time – they’re called memories.  Some are good and some are bad; people remember what we enjoy and what we hate.  People who become our friends are the ones that become great matches for us – they enjoy our humor, they know what we like to discuss, and look out for us when we’re not around.

Companies will be trying to do that as well, but of course, it’s all about trust.  Just as we trust our friends with all that they know about us, we hope to trust companies with all the data they store about us.    That’s probably the biggest thing we need to wrestle with in the Age of Big Data – how to establish trust between people and the machines that will be keeping and using the data they have about us…

A couple of interesting notes today…  On PRNewswire, Kontagent, a leading enterprise user analytics company, announced that it has closed $12 million in Series B financing with a consortium of investors including Battery Ventures, Altos Ventures, and Maverick Capital.  Kontagent focuses on social and mobile web applications with its kSuite product, which combines a proprietary database with customized analytics and real-time monitoring to help customers identify and react to usage patterns in real time.  This continues the pattern of heavy entry-level funding into analytics and big data startups – data science applications are becoming the next big technology boom…

There’s another interesting article snippet at Bloomberg Businessweek about how the oncoming avalanche of data could change the nature of astronomy and physics.  According to the article, Johns Hopkins will be building a 100 gigabit-per-second network to shuttle data from the campus to other large computing centers at national labs and even to Google.  Here’s what Dr. Alex Szalay, Alumni Centennial Professor at Johns Hopkins and head of the network project, thinks about what this could mean for the future of science itself:

In his mind, the new way of using massive processing power to filter through petabytes of data is an entirely new type of computing that will lead to advances in astronomy and physics, much as how the microscope’s creation in the 17th century led to advances in biology and chemistry. In that light, the creation of a 100 gigabit-per-second research network at Johns Hopkins becomes not just a fast network but also an essential tool for research and discovery, a basic component of the 21st century microscope.
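To get a feel for why link speed matters at this scale, here is a rough back-of-the-envelope sketch in Python. It is my own illustration, not a figure from the article: it assumes a decimal petabyte (10^15 bytes), an ideal link with no protocol overhead, and a hypothetical 10 gigabit-per-second link for comparison.

```python
def transfer_time_seconds(data_bytes: float, link_bits_per_second: float) -> float:
    """Ideal time to move data_bytes over a link, ignoring protocol overhead."""
    return data_bytes * 8 / link_bits_per_second


PETABYTE = 10**15        # assumption: decimal petabyte, in bytes
LINK_100G = 100 * 10**9  # 100 gigabits per second, the speed quoted in the article
LINK_10G = 10 * 10**9    # hypothetical slower link, for comparison only

hours_100g = transfer_time_seconds(PETABYTE, LINK_100G) / 3600
hours_10g = transfer_time_seconds(PETABYTE, LINK_10G) / 3600

print(f"1 PB at 100 Gb/s: {hours_100g:.1f} hours")
print(f"1 PB at 10 Gb/s:  {hours_10g:.1f} hours")
```

Under these idealized assumptions, a single petabyte takes a bit under a day at 100 gigabits per second, and more than nine days at a tenth of that speed. Real transfers would be slower.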

You can read about the Kontagent financing deal here, and Dr. Szalay’s effort to build big data networks here.

On WebProNews, Chris Crum shows some interesting infographics about the big data onslaught.  The three topics covered are:

  • Big Data – just what the information deluge is going to look like (for example, by 2015, there will be 7.9 zettabytes of data created, enough to fill 1.8 million Libraries of Congress)
  • The Digital Promise – whether we’re doing enough to bring America’s classrooms into the 21st century
  • Saving Money – an argument that one might want to consider ending a relationship – a couple could save $743 per month

Interesting graphics (though I wouldn’t advocate breaking up just to save money…).  Take a look at the infographics here.

Interesting how you find things on the Internet, but I ran across this post by Jay Ulfelder, an American political scientist who studies the survival and breakdown of political regimes (among other things) and a self-described dart-throwing chimp (yes, this is how I got here…).

Jay writes an interesting piece about a couple of experiences he had recently that got him thinking statistically – the possible link of the deaths of three children to a self-published book on spanking and corporal punishment, and a debate he was engaged in regarding the effectiveness of U.S. government support for pro-democracy movements in countries under authoritarian rule (pretty heady stuff!…).

In doing so, Jay reflects upon a book he recently read – Numbers Rule Your World by Kaiser Fung – about thinking statistically, and recounts his experiences in that light.  Jay’s post is lengthy, but if you’re interested in social and political dynamics (like me) and how data science plays a role in understanding these behaviors (like me!), then you might find Jay’s post worth the read.  You can read his piece here.

I ran across this post on Analyst First by Stephen Samild, where he describes the world of Business Analytics as a prime area for the Lean Startup model.

I am a big fan of Eric Ries’ book The Lean Startup, where he advocates treating every new entrepreneurial venture, whether inside an existing company or as its own startup company, as a startup.  And further, since this is a startup venture, the uncertainty about whether this venture will succeed or fail is very high.

So, rather than put together detailed plans about building the product that the business will be based upon, a startup venture should be building the “minimum viable product” and getting feedback from customers quickly to see whether you’re on the right track.  The faster you get feedback, the better you’ll be able to build a business that is sustainable and meets the needs of your customers.

Effectively, Ries argues that you should treat the startup venture, every aspect of it, as a series of scientific experiments designed to inform you whether you are building a sustainable business consistent with your company’s vision.  It’s basically applying the scientific method to your business.

For one, I say, “Absolutely dead on!”  Most business activities, whether marketing or sales or even less-than-disciplined engineering, are performed via rules of thumb (“here’s what’s worked before…”) - there is no true “validated learning”, as Ries put it.   Generally, many businesses and engineering teams operate with the approach of “we made a number of changes last month, and our customers seem to like them, and our overall numbers are higher this month, so we must be on the right track”.  This might make a company feel good, but it gets them no closer to understanding why they might be succeeding, and what to do if things turn south.
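To make the gap between “our numbers are higher this month” and validated learning concrete, here is a minimal sketch in Python of one common way to check a change: a two-proportion z-test on a controlled split. This is my own illustration rather than anything from Ries’ book, and the visitor and conversion counts are made up.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Compare conversion rates of a control (a) and a variant (b).

    Returns the z statistic and the two-sided p-value under the
    normal approximation with a pooled rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided tail probability of a standard normal beyond |z|
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 5,000 visitors see the old page, 5,000 see the new one
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```

The point is not this particular test, but that the comparison is set up in advance against a control, so a month-over-month bump is not mistaken for evidence that a specific change worked.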

And what is worse, the internal workings of the business may be driven by managers more motivated by preserving the current business enterprise than creating a new one.  This puts entrepreneurial ventures at risk of never getting started in the first place, or of starting without the greatest possible chance of success.

And I like the way that Samild describes it in his post:

In the twenty-first century we can build almost anything that can be imagined. The challenge is not to build more stuff. It’s to build the right stuff. Most startups fail, says Ries, because they make the wrong things. The key activity of a startup should therefore be learning, not building. What creates value for a startup is it determining whether or not it’s on the path to a sustainable business.

If you’re interested in the Lean Startup approach to business (which, again, I highly recommend), you can find out more at Eric Ries’ website here, and you can buy his book The Lean Startup here…  Also, you can read more of Stephen Samild’s post here.

You might not think of a Brad Pitt movie and human resources as fitting together, but stay with me…

Josh Bersin of Bersin and Associates makes the case that human resources professionals need to exploit data science to support their businesses better.  Just as analyzing the statistics of baseball helped Billy Beane’s Oakland A’s beat their better-financed competition, so too can HR departments win the battles to attract and keep top talent for their organization through data science.

According to Bersin’s survey of 711 HR departments, “attracting and selecting the right talent” rates highest among HR skills and capabilities.  Yet even though better data analysis would help achieve this top goal, these HR departments rate “developing workforce analytics for management” and “measuring HR program effectiveness” at the bottom.  Basically, measurement, analytics, and objective performance assessment are ranked as the least important skills for HR departments.

Bersin is advocating that HR departments can raise the bar for their effectiveness by embracing data science (I agree!…).  Bersin’s presentation is on SlideShare and can be found here…  Whether you are an HR professional or a data science weenie, you’ll likely find the slides interesting.

On the Wall Street Journal website today, Ben Rooney posts an interview with Hortonworks CEO Eric Baldeschwieler, co-creator of Hadoop.  For all those in the big data space, the Hadoop project develops open-source software for reliable, scalable, distributed computing, and Hortonworks is focused on accelerating the development and adoption of Hadoop.

In Rooney’s interview, Baldeschwieler describes the problem Hadoop is designed to solve:

At its base, it is just a way to take bulk data and storage in a way that is cheap and replicated and can pull up data very, very fast.

Hadoop is at one level much simpler than other databases. It has two primary components; a storage layer that lets you combine the local disks of a whole bunch of commodity computers, cheap computers. It lets you combine that into a shared file system, a way to store data without worrying which computer it is on. What that means is you can use cheap computers. That lets you strip a lot of cost out of the hardware layer.

The thing that people don’t appreciate when you drop a lower price point is that it is not about saving money, it is about being able to do an order of magnitude more on the same budget. That is revolutionary. You can store five to 10 times more data and you can process it in ways that you can’t imagine. A lot of the innovation it opens up is just the speed of innovation. You get to an answer faster, you move into production faster, you make revenue faster.

You can read Rooney’s WSJ interview here.