Back at the end of December, I was happily surfing through a few of the 'Best of 2010' and '2011 predictions' articles put forth by pundits, bloggers, etc., when a post on data prediction by Josh Jones-Dilworth caught my attention. The author outlines five data-driven trends to look for this year. His last point struck me as particularly prescient: "You'll be sick of hearing about data (if you're not already)."

Right on. It's only February, and I'm already feeling it. I can't seem to escape the deluge of articles about data. How we acquire it. How we store it. How we separate the wheat from the chaff. This week in particular sticks out.

Science special: Dealing With Data

The Feb. 11 issue of the journal Science includes a special section devoted to the challenges and opportunities of data collection, curation, and access. The entire collection of perspective articles is available online for free (registration required). From the introduction:

"We have recently passed the point where more data is being collected than we can physically store. This storage gap will widen rapidly in data-intensive fields. Thus, decisions will be needed on which data to archive and which to discard. A separate problem is how to access and use these data. Many data sets are becoming too large to download. Even fields with well-established data archives, such as genomics, are facing new and growing challenges in data volume and management. And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used."

here and here. You can also search for it. It's receiving a lot of attention in the news and in the blogosphere.

Here are a few of the gee-whiz points culled from this paper, written up by Suzanne Wu on

  • Looking at both digital memory and analog devices, the researchers calculate that humankind is able to store at least 295 exabytes of information. Put another way, if a single star is a bit of information, that's a galaxy of information for every person in the world. That's 315 times the number of grains of sand in the world. But it's still less than one percent of the information that is stored in all the DNA molecules of a human being.
  • In 2007, humankind successfully sent 1.9 zettabytes of information through broadcast technology such as televisions and GPS. That's equivalent to every person in the world reading 174 newspapers every day.
  • On two-way communications technology, such as cell phones, humankind shared 65 exabytes of information through telecommunications in 2007, the equivalent of every person in the world communicating the contents of six newspapers every day.
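Out of curiosity, I tried a back-of-the-envelope check on those newspaper comparisons. The sketch below is mine, not the researchers': it assumes decimal units (1 exabyte = 10^18 bytes, 1 zettabyte = 10^21 bytes) and a 2007 world population of roughly 6.6 billion, neither of which comes from the paper or the write-up. The variable and function names are just illustrative.

```python
# Back-of-the-envelope check of the per-person figures quoted above.
# Assumptions (mine, not from the paper): decimal SI units and a 2007
# world population of roughly 6.6 billion.
EXABYTE = 10**18
ZETTABYTE = 10**21
WORLD_POPULATION_2007 = 6.6e9
DAYS_PER_YEAR = 365

MB = 10**6
GB = 10**9


def per_person_per_day(total_bytes):
    """Spread a yearly total evenly over every person and every day."""
    return total_bytes / (WORLD_POPULATION_2007 * DAYS_PER_YEAR)


# Figures quoted in the list above.
broadcast = per_person_per_day(1.9 * ZETTABYTE)   # said to equal 174 newspapers
telecom = per_person_per_day(65 * EXABYTE)        # said to equal 6 newspapers
stored_per_person = 295 * EXABYTE / WORLD_POPULATION_2007

# Size of one "newspaper" implied by each comparison.
newspaper_from_broadcast = broadcast / 174
newspaper_from_telecom = telecom / 6

print(f"Broadcast: {broadcast / MB:.0f} MB per person per day")
print(f"Telecom:   {telecom / MB:.0f} MB per person per day")
print(f"Stored:    {stored_per_person / GB:.1f} GB per person")
print(f"Implied newspaper size: {newspaper_from_broadcast / MB:.1f} MB "
      f"(broadcast) vs {newspaper_from_telecom / MB:.1f} MB (telecom)")
```

Under those assumptions, both comparisons imply a "newspaper" of about 4.5 MB, so the two newspaper figures at least agree with each other. A different population estimate would shift the numbers a little but not the overall picture.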

Simulating Twitter, The Locker Project

But this wasn't the only fascinating data-centric news this week. MIT's Technology Review reports that researchers in Spain have constructed a simulated network called SONG (Social Network Write Generator) that can forecast Tweet behavior. Why would one want to do this?

Many groups are likely to be interested in using a virtual Twitterverse. Erramilli and co say it can be used to analyse the capacity of parts of a network and to benchmark its performance. But it's the ability to forecast tweeting activity and the effect of things like flash mobbing that is likely to generate the most interest.

Meanwhile, the O'Reilly Radar blog reports this week on a new company called Singly that aims to popularize the open source Locker Project, which will employ a new protocol called TeleHash. It took me a while to wrap my head around this. Essentially, it's about harnessing and sharing data in new, more personalized ways. Here's an excerpt from a recent post on ReadWriteWeb that helped:

The open source service will capture what's called exhaust data from users' activities around the web and offline via sensors, put it firmly in their own possession and then allow them to run local apps that are built to leverage their data.

Many prognosticators suggest that this will be the Next Big Thing for apps and online services. Web 3.0, in other words, will be all about me. It's about delivering a highly personalized data set that will draw together my online and (increasingly) offline activity. It'll be sort of like a data journal (or a locker). And by combining my data with other data sets, I'll presumably be able to find hidden patterns, correlations, and context that relate to my life in a very personal way.

As I understand it, the TeleHash protocol will permit decentralized P2P sharing and searching of data across the network. It's about me connecting with you, just as we do in today's social environment, but in a much more targeted and sophisticated way. While I'm sure I haven't grasped all of the nuances of this project, it sounds promising.

IBM's Watson on Jeopardy!

Smartest Machine on Earth. Apt to my theme, it's about the big three-day contest next week on Jeopardy! that pits two of the show's best-ever human contestants against IBM's Watson. If you're unable to watch Jeopardy! next week, Ph.D. students who worked on the Watson project are going to live-blog the contest as it airs.

Will the machine win? It's going to be fun to watch. Even if Watson doesn't win, it's amazing that a machine exists that can (quickly) answer abstruse Jeopardyesque questions. Talk about harnessing data. By the way, be sure to check out IBM's Watson website. They've done a good job with it.

Sending Data Offworld

So ... there are many interesting efforts going on to better process, use and understand the data we're collectively generating on planet Earth. But what about transmitting data off the planet? Yes, I'm talking about the search for extraterrestrial life. There's a preprint of a new study out this week about this pursuit, too.

It's a fascinating (and refreshingly readable) paper about METI. That's Messaging to Extraterrestrial Intelligence. The paper sums up the debate surrounding how, and if, we should try to send transmissions into the void. It argues that current attempts at transmission are probably too feeble to matter, and suggests that future laser and microwave systems may be more viable. The authors also advocate a moratorium on future METI transmissions until an international body addresses the risks associated with attempts to contact ET life.

Here's one excerpt that struck me:

In 2000, the International Academy of Astronautics sent a proposal to the UN Committee on the Peaceful Uses of Outer Space entitled "Declaration of Principles for Activities Following the Detection of Extraterrestrial Intelligence", also known as the First Protocol (Billingham and Heyns 1999). The proposal was received without objection. Principle 8 reads, in part "No response to a signal or other evidence of extraterrestrial intelligence should be sent until appropriate international consultations have taken place". No one seems opposed to having international consultations about transmitting after we detect them by standard SETI. Assuming this to be the case, it is surely even more important to have the consultations about transmitting before we detect them when we don't even have their signal in hand.

Good point.