ODBMS.org: Interview with Amazon CTO Vogels

Professor Roberto Zicari of ODBMS.org recently posted his interview with Amazon.com CTO and VP Dr. Werner Vogels about the future of big data, and where Amazon is going with it.

Within the article, Zicari reflects upon Vogels’ mention of a book called The Fourth Paradigm: Data-Intensive Scientific Discovery (you can download a free PDF version of the book here).  The book is a collection of essays that expands on the vision of pioneering computer scientist Jim Gray for a new, fourth paradigm of discovery based on data-intensive science and offers insights into how it can be fully realized.  Also, there’s an interesting article (titled Sailing in an Ocean of 0s and 1s) that talks about The Fourth Paradigm.

You can read the interview with Dr. Vogels here.

Strata Online Conference – Free Registration

On December 7, Strata will be hosting a free online conference about moving to big data.  Here’s a quick synopsis of the conference agenda, scheduled to run from 9 AM to 10:30 AM Pacific time:

  • Introduction:  Dealing with New Expectations – Alistair Croll (Bitcurrent)
  • Top-Down:  What CEOs Can Do to Accelerate Data Mindsets – Diego Saenz (Data Driven CEO), Jonathan Bruner (Forbes Media)
  • Web Analytics: The Enterprise Gateway Drug to Big Data? – Justin Cutroni (WebShare Design LLC)
  • Take a Lesson from the Research World – Kaitlin Thaney (Digital Science)
  • How Will Deep Q&A Impact the Use of Analytics in Business? – Christer Johnson (IBM)

Strata is sponsored by the technical publisher O’Reilly and has become one of the fastest growing forums for big data and data science applications.  Upcoming events include another Strata Online Conference, scheduled for January 25, 2012, and the Strata Conference, calendared for February 28-March 1, 2012, in Santa Clara, CA.

 

HP’s Project Moonshot – Addressing “Slow Big Data”

Wired’s Jon Stokes has a really interesting post on HP’s newly unveiled data server strategy to address two different big data challenges.

As Stokes describes (crediting a presentation by Twitter’s Nathan Marz for the distinction), there are “fast” big data problems and “slow” big data problems.  For the “fast” problems, you apply a set of pre-developed algorithms and tools to the incoming data stream, looking for events that match certain patterns so that your platform can react in real time.  Sometimes, however, you need to ask questions of the data and then analyze the results, which can’t be done effectively in real time.  These are the “slow” problems, or as Stokes puts it, where you gather information and test hypotheses by running queries against a vast backlog of historical data.
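To make the distinction concrete, here is a minimal Python sketch. It is not taken from Stokes’ post or from Marz’s presentation; the Event record, the purchase-threshold rule, and the sample data are all hypothetical, chosen only to illustrate the idea. The “fast” path checks each incoming event against a rule fixed in advance, while the “slow” path asks an ad hoc question of the whole stored history.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Event:
    # Hypothetical event record, used only for illustration.
    user: str
    amount: float


# "Fast" path: a rule decided in advance, checked against each event as it arrives.
def is_large_purchase(event: Event, threshold: float = 1000.0) -> bool:
    return event.amount >= threshold


def react_in_stream(events, rule):
    """Yield only the events that match the predefined rule, one at a time."""
    for event in events:
        if rule(event):
            yield event


# "Slow" path: an ad hoc question asked of the entire historical backlog.
def average_amount_by_user(history):
    """Scan all stored events and compute each user's average amount."""
    amounts = {}
    for event in history:
        amounts.setdefault(event.user, []).append(event.amount)
    return {user: mean(values) for user, values in amounts.items()}


# Made-up sample data standing in for both the stream and the backlog.
history = [
    Event("alice", 1200.0),
    Event("bob", 40.0),
    Event("alice", 800.0),
    Event("bob", 60.0),
]

alerts = list(react_in_stream(history, is_large_purchase))
averages = average_amount_by_user(history)
```

In this toy version the “fast” rule touches each event once and needs no history, whereas the “slow” query has to read everything before it can answer. That difference is roughly why the two kinds of problems tend to call for different platforms.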

It turns out that the natural evolution of analytics is to go from “slow” problems to “fast” problems: the inquisitive, analysis-heavy understanding of the data gets turned into faster number-crunching analytics.  Knowing the right way to generate these “fast” analytics requires a solid analytics engineering discipline, especially as the problems being answered get harder and harder.

Big data solutions are currently focused on speed and data management platforms, but there is still a need for understanding the science behind developing the right analytics.

ZDNet’s Kusnetzky on Big Data and More

In a post today, Dan Kusnetzky of ZDNet interviews MapR Technologies CEO, John Schroeder, about his views on big data, Hadoop, and what his company is doing in this space.

Here’s a little from Kusnetzky’s post on his thoughts on “big data” (which I think are “spot on”):

What is big data?

A while ago, I posted What is “Big Data?”. Here’s a segment of that post.

“Big Data” is a catch phrase that has been bubbling up from the high performance computing niche of the IT market. Increasingly suppliers of processing virtualization and storage virtualization software have begun to flog “Big Data” in their presentations. What, exactly, does this phrase mean?

If one sits through the presentations from ten suppliers of technology, fifteen or so different definitions are likely to come forward. Each definition, of course, tends to support the need for that supplier’s products and services. Imagine that.

In simplest terms, the phrase refers to the tools, processes and procedures allowing an organization to create, manipulate, and manage very large data sets and storage facilities. Does this mean terabytes, petabytes or even larger collections of data? The answer offered by these suppliers is “yes.” They would go on to say, “you need our product to manage and make best use of that mass of data.” Just thinking about the problems created by the maintenance of huge, dynamic sets of data gives me a headache.

You can read more of Kusnetzky’s most recent post here.

Hadoop World Approaching

Next week, Hadoop World will be taking over the Sheraton New York Hotel and Towers in Manhattan.  Sessions will be held on Tuesday and Wednesday, November 8 and 9, with training and certification sessions being held the other days that week.

Companies such as Facebook, bit.ly, Etsy, Netezza, Explorys, and CBS Interactive will be giving presentations on deploying Hadoop across the enterprise.  Also, Cloudera will be holding training and certification sessions focused on Hadoop essentials, administration, development and related technologies, including Apache Hive, Apache Pig and Apache HBase.

Click here for the full agenda, and here for the list of speakers…

Privacy in the Age of Big Data

Today, Audrey Watters of O’Reilly Radar posts her interview with Terence Craig, co-author of Privacy and Big Data, about the impacts of big data on personal privacy.  Craig makes the claim that data transparency will eventually trump anonymity, meaning that our lives will be less private in the future as we all take advantage of the technologies that come from the new information age.

Here’s a quick Q&A between Watters and Craig on the subject of privacy:

Assuming that data can’t be anonymized and companies don’t have malicious plans for our personal data, what expectations can we have for privacy?

Terence Craig: We’ve moved back to our evolutionary default for privacy, which is essentially none. Hunter-gatherers didn’t have privacy. In small rural villages with shared huts between multi-generational families, privacy just wasn’t really available there.

The question is how do we address a society that mirrors our beginnings, but comes with one big difference? Before, anyone who knew the intimate details of our lives were people we had met physically, and they were often related to us. But now the geographical boundary has been erased by the Internet, so what does that mean? And how are we as a society going to evolve to deal with that?

With that in mind, I’ve given up on the idea of digital privacy as a goal. I think you have to if you want to reap the rewards of being a full participant in a digitized society. What’s important is for us to make sure we have transparency from the large institutions that are aggregating data. We need these institutions to understand what they’re doing with data and to share that with people so we, in aggregate, can agree whether or not this is a legitimate use of our data. We need transparency so that we — consumers, citizens — can start to control the process. Transparency is what’s important. The idea that we can keep the data hidden or private, well … that horse has left the stable.

You can read more of Watters’ interview here.

 

Wall Street Moves Toward Leveraging Unstructured Data

Forbes has a post on where Wall Street is going in the big data field.  While capital markets have previously been focused on high velocity market data, trying to beat the next trader in taking advantage of market inefficiencies, they are now moving toward unstructured data that fall outside these high velocity data streams, according to the article.

Here’s an example that is mentioned in the article:

A futures trader watching the Florida orange juice market might want to track rain in Africa, its impact on storms over the Atlantic, and their impact on the orange crop, and the follow-on effect on the price of juice futures. That would require taking weather data from a number of sources and aggregating years of it looking for patterns that might provide an edge in the financial markets.

Wall Street has been using technical analysis for many years to take advantage of market swings, but this only focuses on the data of the trades themselves.  Now, they’re trying to leverage information from many disparate data sources, and fuse all this information together to make better and more timely decisions.  Turning all this data into information that’s actionable will certainly be a real feat…

You can read the Forbes article here.

Harnessing Big Data Identified as Key Health Care Innovation for 2012

The healthcare industry is hot on the big data bandwagon, judging by a recent list of key medical innovations presented by the Cleveland Clinic.  #8 on the list was “Harnessing Big Data to Improve Healthcare”; the rest of the top 10 included more obviously medical innovations, such as catheter-based renal denervation to control resistant hypertension and implantable devices to treat complex brain aneurysms.

Of course, even the healthcare industry is recognizing the need to pull more information out of the ocean of data they have.  In explaining why this ranked so high on their list, the Cleveland Clinic said:

The amount of data collected each day dwarfs human comprehension and even brings most computing programs to a quick standstill. It is estimated that 2.5 quintillion bytes of data are created daily, so much that 90% of the data in the world has been created in the last two years. This is what’s called big data, and hospitals, medical centers, hospital systems, pharmaceutical, biotechnology and medical device companies that comprise the trillion-dollar healthcare industry in this country are awash in it. Together, they easily amass terabytes and oftentimes petabytes of structured and unstructured data. Unfortunately, not enough of this deluge of big data sets has been systematically collected and stored, and therefore this valuable information has not been aggregated, analyzed, or made available in a format to be readily accessed to improve healthcare.

You can read more about what the healthcare industry hopes to obtain by leveraging big data here, and see the Cleveland Clinic’s complete Top 10 Medical Innovations list here.

Basho Raises $5M

The Boston Business Journal is reporting that big data storage firm Basho Technologies raised $5 million in its latest round of financing.  According to the report, this brings Basho’s total to $12.5M raised in 2011, adding to the $7.5M total they raised back in May.

Basho makes highly distributable and scalable non-relational databases, which are needed for handling and managing the incredibly large datasets now available.  These types of technologies are some of the hottest offerings in the market right now, as indicated by the ability of these firms to raise significant capital.  Basho, founded in 2008 by a group of software architects, engineers, and executives from Akamai, recently announced the licensing of its NoSQL database technology, Riak, to the National Board of E-Health in Denmark to operate its nationwide medical prescription card program.

NYT: The Future of Computing

Here’s a nice post from the New York Times about big data, speed, and the future of computing.  It talks a little bit about the technology that makes IBM’s Watson computer so fantastic at beating Jeopardy! champions (we first wrote about this last year…), and suggests that the need for speed will likely change the computer architectures themselves.

There will likely be groundbreaking changes in hardware and software, where computation and decision-making will both become part of the same technology.  This could be where much of the analytics engineering advances come from over the next decade.  Read more from the NYT post here.

© 2011 Mic Farris. All rights reserved.
