The news is full of headlines describing the “rise of big data” and the consequent need for data scientists and “big data” professionals. Yet, as stewards of vast troves of printed and electronic information for generations, haven’t librarians always dealt with big data? Could it be that data science is just a “hype” term for what librarians have been doing all along?

Moore’s “Law”

Before we follow up on that question, please indulge a bit of recent history. Many people are familiar with the idea of Moore’s Law, popularly stated as the claim that the amount of computer processing power per chip doubles every 18 months. (Gordon Moore’s original observation concerned the number of transistors on a chip, with a doubling period closer to two years.) A similar idea, promoted by Mark Kryder, the former chief technology officer of Seagate (a hard disk manufacturer), suggests that the amount of data storage one can fit on a given area of a magnetic medium also doubles every 18 months. What is little understood about these “laws” is that doubling in a fixed amount of time creates an accelerating trend: growth looks slow at first, but each doubling adds as much as all the previous ones combined, so the gains eventually become enormous. Between 2005 and the present we leaped from reasonably affordable disk drives that could comfortably hold all of your family photos to free online “cloud” storage, available to anyone with a computer and an Internet connection, that is sufficient to hold a digitized floor of library books. Today, for less than the price of a nice meal at a fancy restaurant, one can buy a hard disk drive that has sufficient storage to hold the entire printed collection of the Library of Congress.
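To make the arithmetic of fixed-period doubling concrete, here is a minimal sketch in Python. The 18-month doubling period comes from the paragraph above; the function name growth_factor and the sample time spans are illustrative assumptions, not measurements of any actual hardware.

```python
def growth_factor(years: float, doubling_period_years: float = 1.5) -> float:
    """Return how many times larger a quantity becomes after `years`
    if it doubles once every `doubling_period_years` (18 months by default)."""
    return 2 ** (years / doubling_period_years)


# Illustrative time spans only; each one is a whole number of doubling periods.
for years in (1.5, 3, 6, 9, 15):
    print(f"after {years:>4} years: {growth_factor(years):,.0f}x")
```

Under this assumption, nine years is six doublings (a 64-fold increase) and fifteen years is ten doublings (a 1,024-fold increase), which is why a trend that looks modest in its first few years can feel like a sudden leap later on.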

Today’s Data Problems need Generalists and Specialists

Thanks to this accelerating trend, hospitals, schools, manufacturers, colleges, retailers, government agencies, and libraries have begun to collect and store truly enormous amounts of data. The goal in many cases is to make use of these data to provide valuable new services or to improve efficiency. The problem with reaching these goals is that as the amount of storage and processing has grown, the complexity of the data and the challenges of working with them have also grown. In the good old days a programmer would write a program, a user would use the program, a statistician would analyze the data that the user produced with the program, and a librarian would archive the report that the statistician created by analyzing the data. Those days are gone. The reason we now see lots of job advertisements for “data scientist” is that there is a pressing need for interdisciplinary bridge builders who understand every link in that chain: the Internet, databases, analytics, visualization, and data curation. These professionals have their specialties – some are good at working with numbers, others are database experts, still others have expertise in unstructured data (e.g., text) – but they also need generalist skills that let them blend the wide range of methods needed to manage today’s data problems.

Where does the New Librarian fit in?

Librarians have always been great at information management and organization. This is a core skill in data science; it manifests most strongly in the data curation component of the big data problem. Many librarians are also outstanding communicators and have been trained in the art and science of transforming user information needs into strategies and resources for investigation and learning. So librarians clearly have roles at the start and the finish of the big data problem. But what about the middle of the equation, where data transformation, analysis, and visualization are the heart of the data science endeavor? This brings us back to our original question of how library science and data science are connected.

The essential task of the data science professional is to transform raw, messy data into actionable knowledge that can be used by decision makers. To paraphrase my astute colleague R. David Lankes, ‘the mission of librarianship is to facilitate knowledge creation in communities.’ It is easy to see the overlap here. A librarian does not need to become a programmer, but every librarian interested in knowledge creation should have some essential familiarity with how various software tools can transform data. A librarian need not be a database engineer, but every librarian must understand the underpinnings of information retrieval tools. A librarian does not need to be a statistician, but every librarian should have a clear understanding of how descriptive summaries and basic tests of numeric data can be used and misused. Finally, a librarian does not need to be a graphic designer, but every librarian needs to recognize the features of effective data displays. In short, to fulfill their missions, librarians can exercise a range of sophisticated skills that squarely occupy the central ground between understanding information user needs on one end and data curation on the other.

When you consider some of the key values that drive librarianship, however, it becomes evident that librarians must take a leading role in working with big data lest this emerging specialty become the servant only of proprietary interests. Librarians stand for open access to information, for privacy rights, for serving the information needs of the community, for the importance of accurate information in a democratic society, and for the necessity of preserving the legacy of historical information for future generations. Public library users, students in school libraries, and faculty and students in university libraries all depend upon these bedrock values to support their missions of learning, exploration, and citizenship. We’ve known for quite a while that fulfilling these missions requires much more than choosing, shelving, and lending books. In the near future, the ability to fulfill the roles of citizenship will require finding, joining, examining, analyzing, and understanding diverse sources of data. For a citizen to become an effective advocate tomorrow, she might need to “mash-up” map data, census data, health data, and environmental data to develop a meaningful understanding of a challenge that the community faces. Who but a librarian will stand ready to give the assistance needed, to make the resources accessible, and to provide a venue for knowledge creation when the community advocate arrives seeking answers?

Information v. Data

We frequently hear the word “information” paired up with other words to describe our world – the information age, the information industry, the information society. In this light, data science almost seems like a step backwards from the place where most librarians get their professional education: in graduate programs of library and information science. The excitement and burgeoning interest in data science, however, arises from a recognition that data are the raw ingredients of knowledge, and that we urgently need more professionals who possess a deep understanding of how to transform, analyze, and present data to facilitate knowledge creation. Librarians are poised to become the core of future cadres of data scientists, but doing so will require filling in that middle ground in data education where too few librarians have gone so far. That will mean an additional educational commitment, and quite possibly less attention to certain traditional topics. The tradeoff will be worthwhile, as data science holds enormous potential as a focus area in the future of librarianship.

What are your thoughts on librarians as potential data scientists? Share in the comments.