The Rise of Big Data: The Utility of Datasets

Data visualization and machine learning will be key to analyzing large datasets in this new scientific revolution.

Published March 1, 2012

By Diana Friedman
Academy Contributor

The importance of observation—the crux of the scientific method—remains unchanged from the early days of scientific discovery. The methods by which observations are made, however, have changed greatly. Consider astronomy. In the early days, under a black expanse of night punctuated by brilliant fiery lights, a group of science-minded people looked up at the sky and recorded what they saw—the fullness of the moon, the locations and formations of the stars.

Observation with the naked eye was the norm until the 17th century, when the invention of the telescope revolutionized astronomy, allowing scientists to see beyond what their eyes could show them—a literal portal into the unknown.

A New Revolution

Now, a new revolution is taking place, in astronomy and across nearly all scientific disciplines: a data revolution. Scientific data collection has become almost entirely automated, allowing for the collection of vast amounts of data at record speed. These massive datasets allow researchers from various organizations and locales to mine and manipulate the data, making new discoveries and testing hypotheses from the contents of a spreadsheet.

“The astronomy community was able to switch to the idea that they can use a database as a telescope,” says Alex Szalay, Alumni Centennial Professor, Department of Physics and Astronomy, Johns Hopkins University, as well as a researcher in the Sloan Digital Sky Survey (SDSS), a 10+ year effort to map one-third of the sky.

Thanks to projects like the SDSS and open access data from the Hubble Space Telescope, would-be Galileos don’t need access to a telescope, or even a view of the night sky, to make discoveries about our universe. Instead, huge datasets (so-called “big data”) can provide the optimal view of the sky, or, for that matter, the chemical base pairs that make up DNA.

How Big is ‘Big Data’?

It is hard to estimate exactly how much data exists today compared to the early days of computers. But, “the amount of personal storage has expanded dramatically due to items like digital cameras and ‘intellectual prosthetics,’ like iPhones,” says Johannes Gehrke, professor, Department of Computer Science, Cornell University. “For example, if you bought a hard drive 20 years ago, you would have had 1.5 to 2 gigabytes of storage. Today, you can easily get 2 terabytes. That’s a factor of 1,000.”
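(As a quick back-of-the-envelope check of that figure, here is the arithmetic written out as a short Python snippet; the drive sizes are simply the ones Gehrke quotes.)

```python
# Gehrke's storage comparison, spelled out: a roughly 2 GB drive from two
# decades earlier versus a 2 TB drive today (treating 1 TB as 1,000 GB).
old_drive_gb = 2             # gigabytes, the upper end of Gehrke's 1.5-2 GB range
new_drive_gb = 2 * 1000      # 2 terabytes, expressed in gigabytes
print(new_drive_gb / old_drive_gb)  # 1000.0, the "factor of 1,000"
```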

It is not just the amount of data that has changed; the way we interact with and access that data has changed too, says Gehrke, a 2011 winner of the New York Academy of Sciences Blavatnik Awards for Young Scientists. “There is an entire industry that has sprung up around our ability to search and manage data—look at Google and Microsoft,” says Szalay.

But what is big data? Is a 2-terabyte file considered big data? Not anymore. “It’s a moving target,” says Szalay. “In 1992, we thought a few terabytes was very challenging.” Now, the average portable, external hard drive can store a few terabytes of data. An easy definition of big data is “more data than a traditional data system can handle,” says Gehrke.

Searching for Structure

Scientists working on large-scale projects, like the SDSS, or those in genomics or theoretical physics, now deal with many terabytes, even petabytes, of information. How is it possible to make sense of so much data?

“We have the data—we can collect it—but the bottleneck occurs when we try to look at it,” says Szalay. Szalay is currently working on a project at Johns Hopkins to build a data-driven supercomputer (called a data scope) that will be able to analyze the big datasets generated by very large computer simulations, such as simulations of turbulence. “We are able to provide scientists who don’t usually have access to this kind of computing power with an environment where they can play with very large simulations over several months; with this computer we are providing a home to analyze big data.”

The rub? Scientists need to be fluent in computation and data analysis to use such resources. “Disciplines in science have been growing apart because they are so specialized, but we need scientists, regardless of their specific niche, to get trained in computation and data analytics. We need scientists to make this transition to ultimately increase our knowledge,” says Szalay.

Two fields in particular are garnering attention from scientists for their ability to provide structure when data is overwhelming: data visualization and machine learning.

Picture This

Data visualization takes numbers that are either generated by a large calculation or acquired with a measurement and turns them into pictures, says Holly Rushmeier, chair and professor, Department of Computer Science, Yale University, and a judge for the Academy’s Blavatnik Awards for Young Scientists. For example, a project might take numbers representing flow going through a medium and turn them into an animation.
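To make that concrete, here is a minimal sketch (not drawn from Rushmeier’s projects; the swirling velocity field is invented purely for illustration) of how a grid of flow numbers can be turned into a picture with Python and matplotlib:

```python
# A toy example of data visualization: numbers describing flow through a
# medium are rendered as streamlines, so patterns in the data become visible.
import numpy as np
import matplotlib.pyplot as plt

# Synthetic "measurements": velocity components on a 2D grid (invented data)
y, x = np.mgrid[-2:2:40j, -2:2:40j]
u = -y / (x**2 + y**2 + 0.1)   # x-component of the flow
v = x / (x**2 + y**2 + 0.1)    # y-component of the flow

# Turn the raw numbers into a picture: streamlines colored by flow speed
speed = np.sqrt(u**2 + v**2)
strm = plt.streamplot(x, y, u, v, color=speed, cmap="viridis")
plt.colorbar(strm.lines, label="flow speed (arbitrary units)")
plt.title("Flow through a medium, rendered as streamlines")
plt.show()
```

A plot like this makes vortices and dead zones visible at a glance, which is exactly the kind of pattern-spotting Rushmeier describes.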

“Visualization allows you to look at a large volume of numbers and look for patterns, without having a preconceived notion of what that pattern is,” says Rushmeier. In this way, visualization is both a powerful debugging tool (allowing researchers to see, through the creation of a nonsensical picture, if there might be a flaw in their data) and an important means of communicating data, whether to other researchers or to the general public (as in the case of weather forecasts). So perhaps the old adage needs to be rewritten: Is a picture now worth a thousand lines of code?

“There are many flavors of visualization,” says Rushmeier. Information can be mapped onto a natural structure, such as valves being mapped onto the heart, or an entirely new picture can be created (data without a natural structure is referred to as high-dimensional data). The classic example of high-dimensional data is credit card data, says Rushmeier, “but there is a lot of high-dimensional data in science.”

Mapping Information

Rushmeier is currently immersed in 3D mapping, working closely with an ornithologist who studies bird vision. The ornithologist records the wavelengths of light to which birds are sensitive, from the ultraviolet to the infrared, to get a better sense of how bird vision evolved and for what purposes (e.g., mating and survival). Through 3D mapping, Rushmeier is able to take the ornithologist’s numerical data and simulate the bird’s actual viewpoint on different 3D surfaces.

“To stop a conversation dead in its tracks, I tell people I work in statistics. To get a conversation going, I say I work in artificial intelligence,” jokes David Blei. Both are true—Blei, associate professor, Department of Computer Science, Princeton University, works in machine learning, a field that encompasses both statistical and computational components.

The goal of machine learning is to build algorithms that find patterns in big datasets, says Blei. Patterns can either be predictive or descriptive, depending on the goal. “A classic example of a predictive machine-learning task is spam filtering,” says Blei. A descriptive task could, for instance, help a biologist pinpoint information about a specific gene from a large dataset.
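As a rough illustration of what that predictive task looks like in code (a toy sketch, not Blei’s own work, assuming the scikit-learn library and a handful of made-up messages), a spam filter learns word patterns from labeled examples and then classifies new mail:

```python
# A minimal spam filter: learn word-count patterns from labeled messages,
# then predict whether an unseen message is spam. The tiny dataset below
# is invented; real filters train on millions of labeled messages.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

messages = [
    "Win a free prize now",            # spam
    "Limited offer, claim your cash",  # spam
    "Meeting moved to 3pm tomorrow",   # not spam
    "Here are the notes from class",   # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)        # turn text into word counts
classifier = MultinomialNB().fit(X, labels)   # learn which counts signal spam

# Predict on a message the algorithm has never seen
new = vectorizer.transform(["Claim your free cash prize"])
print(classifier.predict(new))  # expected: [1], i.e., spam
```

Real filters use far richer features, but the shape of the task is the same: learn from labeled examples, then predict on new data.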

Part of our Daily Lives

Machine learning is not only used by technology companies and scientists—it is a part of our daily lives. The Amazon shopping and Netflix recommendations that pop up almost instantaneously on our computer and TV screens are the result of complex machine-learning algorithms, and the recommendations are often eerily spot-on. But it is important to remember that getting from raw data to real information requires an additional step of analysis and interpretation, says Blei. This is especially true when machine learning is applied to science and medicine.
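A stripped-down sketch of the idea behind those recommendations (this is not Amazon’s or Netflix’s actual system; the tiny ratings matrix is invented) is item-to-item collaborative filtering: an unrated item is scored by how similar it is to the items a user has already rated.

```python
# Toy item-to-item collaborative filtering on an invented ratings matrix:
# rows are users, columns are items, 0 means "not yet rated".
import numpy as np

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

# Cosine similarity between item columns
norms = np.linalg.norm(ratings, axis=0)
similarity = (ratings.T @ ratings) / np.outer(norms, norms)

# Score items for user 0 as a similarity-weighted sum of that user's ratings,
# then ignore items the user has already rated
user = ratings[0]
scores = similarity @ user
scores[user > 0] = -np.inf
print("recommend item", int(np.argmax(scores)))  # item 2 for this toy data
```

Production systems work with vastly larger matrices and more sophisticated models, but the underlying pattern-finding is the same kind Blei describes.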

“We need more work in exploratory data analysis,” says Blei, as well as careful validation of algorithms, to avoid reaching irresponsible conclusions. Interestingly, Blei says that the quality of data is not as important to the final result as it might seem; instead, the quantity of data is paramount when it comes to drawing conclusions through machine learning. And enormous datasets abound in science—just consider all of the raw data generated by the Human Genome Project.

Now, says Blei, the analysis of data sources (like Twitter) poses an equally big challenge. “Unlike a dataset, a data source has no beginning and no end.”

A prediction that doesn’t require a complex algorithm? The fields of data visualization and machine learning, as well as other forms of data science, will continue to grow in importance as datasets and data sources get bigger over time and everyone, from neuroscientists to corporations, looks for a way to turn data into meaningful information.

This story originally appeared in the Winter 2012 issue of The New York Academy of Sciences Magazine.
