Can Lessons from Data Science Help Journalism?

New York Times newsroom 1942. Source: Library of Congress http://hdl.loc.gov/loc.pnp/cph.3c12969

 

You might think journalism and data science don’t really go together, but on that, I differ. Below are some thoughts on the topic and lessons we can draw from data science on how to make journalism better and more effective in these times.

Big Data and 1984?

As the data science and big data technology booms start accelerating, it’s worth noting how these technologies will change our lives – both positively and potentially negatively.

I posted previously about the ongoing discussion of privacy, but I’ve found another post on GigaOM about the same topic.  According to the article, the Supreme Court of the United States heard oral arguments on Tuesday in a case that could decide how connected the concept of big data is to constitutional expectations of privacy.

The case, United States v. Jones, is specifically about whether police needed a search warrant to place a GPS device on a suspect’s car and monitor his movements for 28 days.  Several justices, however, seized upon a very important question: How much data is too much before allowable surveillance crosses the line into an invasion of privacy?  This is a really nice post, and if you’re interested in the constitutional issues regarding privacy (for example, an appellate court has found that warrantless GPS tracking is a violation of the Fourth Amendment), I’d recommend that you take time to read the article.

These two posts do highlight interesting differences in privacy and who controls our data.  We sometimes have a knee-jerk reaction to institutions that keep data on us and then use it for other purposes (whether they benefit us or not).  George Orwell’s 1984 and the Big Brother metaphors with which we’re all familiar deal with government controlling the data and what it can do with it – that’s what the US v. Jones case is really all about.

However, in the private world where we interact with companies and people more directly, it’s not really a Big Brother issue, because we give up our privacy all the time – there’s no legal requirement to give up data; we do it by choice.  We willingly give up our privacy in order to benefit from technology – little bit by little bit.  If we want a website to provide us great recommendations (say Netflix), the company is going to have to know more about us – what we like, and what we don’t like.  

It seems a bit “Big Brother”, but even people store data about us all the time – they’re called memories.  Some are good and some are bad; people remember what we enjoy and what we hate.  People who become our friends are the ones that become great matches for us – they enjoy our humor, they know what we like to discuss, and look out for us when we’re not around.

Companies will be trying to do that as well, but of course, it’s all about trust.  Just as we trust our friends with all that they know about us, we hope to trust companies with all the data they store about us.    That’s probably the biggest thing we need to wrestle with in the Age of Big Data – how to establish trust between people and the machines that will be keeping and using the data they have about us…

Game Theory and the Health Care Debate

With the health care debate raging in the House and Senate, some Swampland blog posters at Time.com have linked these tough decisions to the classic game theory problem called the Prisoner’s Dilemma.  It’s an interesting read about how complex problems can be boiled down to a mathematical understanding…

You can read the thread at the Time.com website here, here and here

Odd Stats on Sarah Palin

This week’s Newsweek had a lot of things that piqued my interest – here’s one regarding Sarah Palin.

In Jon Meacham’s article, he talks about Why Palin Matters to Obama – And To You.  Certainly Palin is a polarizing figure – there are a lot of people who love her and a lot of people that think she’s dangerous.  Meacham relates Palin’s following within the Republican party to that of Arizona Senator Barry Goldwater in 1964.

But the part I found the most interesting were the numbers – at least poll numbers…  Here’s a quote from Meacham’s article:

“According to Gallup, Republicans are more likely to say they would seriously consider voting for Palin for president (65 percent) than to say she is qualified for the job (58 percent).”

So, what in the world can we make of this?!  If we analyze the numbers a bit more, we can see a couple of things…

One, some Republicans would seriously consider voting for Palin for President even though they did not say she was qualified for the job – at least 7% of Republicans (65% minus 58%).  It could be higher if there were some that thought she was qualified but wouldn’t consider voting for her…
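The 7% floor, and how much higher the figure could go, follows from simple bounds on two overlapping groups. Here is a minimal Python sketch, assuming both Gallup percentages describe the same pool of Republican respondents; the function name is my own, purely for illustration.

```python
def consider_but_not_qualified_bounds(consider, qualified):
    """Bounds on the share who would consider voting for a candidate
    without saying the candidate is qualified.

    Both arguments are whole-number percentages of the same group.
    """
    # Lowest case: as many "consider" respondents as possible also say "qualified".
    lowest = max(0, consider - qualified)
    # Highest case: everyone who does not say "qualified" still says "consider".
    highest = min(consider, 100 - qualified)
    return lowest, highest


low, high = consider_but_not_qualified_bounds(consider=65, qualified=58)
print(f"Between {low}% and {high}% of Republicans")
```

The poll alone cannot say where in that range the true number falls; it only guarantees the floor.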

And, probably of more interest, these people who might consider voting for an unqualified Palin must believe that the alternative – any Democratic nominee – would be a worse option…

I don’t know what previous polls have shown for these “qualified vs. unqualified” polls, but I’m willing to bet that most people believe that the nominees, Democrat or Republican, are qualified to be President.  And, it is likely that people vote for candidates that they do believe are qualified…

But the fact that there are at least 7% of Republicans that would consider voting for an unqualified Republican as opposed to a qualified Democrat is less about the candidate (Palin or anyone else…) and more, I think, about the polarized political environment America finds itself in.

Right now, Republicans think Democrats are out to ruin the country by taking it over.  Also, Democrats think Republicans are out to ruin the country purely to enrich themselves.   Both are wrong, but given the state of things right now, there’s no telling them that… 

To me, this seems to be what is driving these interesting numbers.  Regardless, Palin makes for good political theater…

Arnold’s F-Bomb Veto

So, I’m reading through Newsweek this weekend, and I run across a quote from Goucher College mathematician Robert Lewand, where he said:

“Somebody in the governor’s office was just having a little fun.”

What Dr. Lewand was describing was the result of a recent veto by California Governor Arnold Schwarzenegger.  So, what could have brought this so much attention?

Well, it turns out that the statement accompanying Gov. Schwarzenegger’s veto came with a subliminal, more obscene message.  Here’s what the actual statement said:

“For some time now I have lamented the fact that major issues are overlooked while many unnecessary bills come to me for consideration. Water reform, prison reform, and health care are major issues my Administration has brought to the table, but the Legislature just kicks the can down the alley.

Yet another legislative year has come and gone without the major reforms Californians overwhelmingly deserve. In light of this, and after careful consideration, I believe it is unnecessary to sign this measure at this time.”

Seems harmless enough, right?  Well, if you look at the actual statement, and especially the first letter in each line of this statement, you’ll see the intended message of “f–k you”.  (To read more about the story, you can go here…)

You might be asking – how does anyone know that anyone intended anything?  Couldn’t this just be a coincidence?  Well, it could be, but we have to ask ourselves a couple of questions:  One, how likely is it that this arrangement of letters occurred by accident?  And two, how likely is it that someone in the Governor’s office arranged the letters on purpose?

In science, we ask ourselves these questions all the time to figure out which explanation is the better one.  So, let’s start with the second question first – how likely do we think it is that a government staffer actually played with the sentences to make the message read this way?  I don’t know, but I’m sure it’s pretty unlikely, so let’s say that the odds of this happening are… oh… 1 in a million.  Let’s be even more conservative – let’s say the likelihood of this happening is about 1 in 1 billion.  That’s certainly pretty rare!…

Now, what we do next is to figure out how likely it is that these seven letters came out the way they did by accident.  Well, if you look at how often letters actually occur in the English language, you can calculate this likelihood.  I went to Wikipedia and found these occurrence probabilities, so I calculated the likelihood for myself.

(if you don’t mind, I’d rather not list the letters here – don’t want the obscenity police coming after my blog…)

But here’s something that was interesting.  Originally, when I heard that Professor Lewand said it was a 5.5 in 1 trillion shot for this arrangement of letters to be random, I didn’t believe him.  I thought he was off.

Starting simply, if there are 26 letters in the alphabet, then assuming that each letter is equally likely (which it isn’t, but this is where I started…), then 1/26 = 3.846%, and then multiplied seven times for the 7-letter message, this would be about 1 in 8 billion.  So I thought, “Wait!  This mathematician is wrong, isn’t he?!”…  Well, it turns out that I was…  But then again, I wasn’t (I’ll get to that below…)
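That first-pass estimate is easy to check. A quick Python sketch, assuming (as I did at first) that all 26 letters are equally likely; the variable names are just for illustration:

```python
ALPHABET_SIZE = 26
MESSAGE_LENGTH = 7

# Chance of any one specific letter if all letters were equally likely.
per_letter = 1 / ALPHABET_SIZE

# Number of equally likely seven-letter sequences; exactly one of them is the message.
one_in = ALPHABET_SIZE ** MESSAGE_LENGTH

print(f"{per_letter:.3%} per letter, 1 in {one_in:,}")
# prints: 3.846% per letter, 1 in 8,031,810,176
```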

So I did the calculation using the actual letter frequencies that I found on Wikipedia, and that’s where I got the same answer as Professor Lewand – 5.5 in 1 trillion, or 1 in 185 billion… (it turns out that the letter “k” is actually quite rare – only 0.77% as opposed to the letter “f” which occurs 2.23% of the time…)
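For anyone who wants to repeat the calculation, here is a rough Python sketch. The frequency values are the overall English letter frequencies as I recall them from the Wikipedia table, so treat them as assumed inputs rather than gospel. The message is given as letter counts instead of being spelled out, which works because the product doesn’t depend on the order of the letters.

```python
from math import prod

# Overall English letter frequencies in percent (assumed values, recalled
# from the Wikipedia letter-frequency table; only the letters needed here).
FREQ_PERCENT = {"c": 2.782, "f": 2.228, "k": 0.772, "o": 7.507, "u": 2.758, "y": 1.974}

# How many times each letter appears in the seven-letter message.
LETTER_COUNTS = {"f": 1, "u": 2, "c": 1, "k": 1, "y": 1, "o": 1}


def chance_of_letters(counts, freq_percent):
    """Multiply the per-letter probabilities, treating each letter as independent."""
    return prod((freq_percent[letter] / 100) ** n for letter, n in counts.items())


p = chance_of_letters(LETTER_COUNTS, FREQ_PERCENT)
print(f"{p * 1e12:.1f} in 1 trillion, or 1 in {1 / p / 1e9:.0f} billion")
```

With these inputs the result lands at roughly 5.4 in 1 trillion, in line with the 1 in 185 billion figure. Treating the letters as independent draws is itself a simplification.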

But wait, there’s more.  It turns out that Professor Lewand should in fact revise his calculations.  The letters of the obscene message actually occur as the first letter of each word, and the letter frequencies for being the first letter of a word are different from the letter frequencies regardless of where the letter shows up.

So, if you use the letter frequencies for the first letter of each word (also on the Wikipedia page), you get a likelihood of 1 in 485 billion or about 2 in 1 trillion, about 2.7 times less likely than Professor Lewand’s original calculation.

Now, which cause for the “FY” memo is more likely?  Well, now what we do is compare the likelihoods, and the one that has the greater likelihood is the best explanation.  So, even though it would be very rare for a staffer to embed coded vulgarity into the Governor’s statement (1 in a billion), it’s still nearly 500 times more likely that this is the cause than for these letters to have arranged themselves by accident.  Using our likelihood ratio analysis, we can have a lot of confidence that some staffer is going to get a spanking sometime soon…
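The comparison itself is one division. A small Python sketch using the two numbers from this post (the 1 in a billion figure is my deliberately stingy guess, not a measured value, and the variable names are only illustrative):

```python
from fractions import Fraction

# My deliberately pessimistic guess for a staffer doing this on purpose.
p_staffer = Fraction(1, 1_000_000_000)

# Chance of the letters lining up by accident, from the first-letter frequencies.
p_accident = Fraction(1, 485_000_000_000)

# Likelihood ratio: how many times more likely "on purpose" is than "by accident".
ratio = p_staffer / p_accident
print(f"On purpose is about {float(ratio):.0f} times more likely than by accident")
```

Using the more generous 1 in a million guess instead would push the ratio up by another factor of a thousand.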

So is this proof?  It depends on what you mean by proof, but nothing is absolute – you can only compare how likely things are to cause what we see.  This is how science works, and how we all make our decisions…  

Obscene messages happening at random and government staffers arranging for obscene messages to show up in official documents are both very, very rare events.  However, it’s far more likely that Gov. Schwarzenegger’s F-bomb veto is the result of purpose than of accident…