By: Diane Stirling
A research project that will develop metadata models for managing heterogeneous workflows, and that involves Syracuse University experts in the fields of gravitational wave physics, information science, and computer science, has been funded by the National Science Foundation’s Advanced Cyberinfrastructure Division.
The principal investigator (PI) for the project, Duncan Brown, associate professor in the Physics Department at the College of Arts and Sciences at Syracuse University, is joined in the effort by co-principal investigators Jian Qin, professor at the School of Information Studies; Peter Couvares, senior scientist in the Physics Department; and Ewa Deelman, research associate professor in the Computer Science Department at the University of Southern California.
The grant will fund the design, development, and deployment of metadata-aware workflows and data-mining tools to enable the management of large, heterogeneous data sets produced by scientific analysis. The pilot effort targets the cyberinfrastructure used to search for gravitational waves by the Laser Interferometer Gravitational-Wave Observatory (LIGO). LIGO is part of a worldwide network of gravitational-wave observatories poised to probe black holes, neutron stars, and supernovae by using gravitational waves as a tool to study physics and astronomy. The gravitational-wave physics community has an immediate need for improved data management and analysis tools to accomplish its scientific goals, Brown wrote in his funding proposal.
Dr. Qin explained that the collaboration dates back to 2010, when she and iSchool Research Associate Professor Howard Turtle began discussing with Brown the possibility of applying a metadata and information retrieval project to this area of scientific research.
The project entails a holistic approach to scientific data management, Dr. Qin said, which is focused on scientists’ needs regarding the management of their data input and output, pipelines and workflows at each stage of a research lifecycle. “Because it is computational intensive, we have to understand how data flows from one point to another and the provenance information generated along the way, and then the whole workflow of the research,” she explained.
Dr. Qin’s portion of the work will entail studying the data flow, data structures, and the research lifecycle, as well as understanding the needs for data retrieval, data discovery, data tracking, and long-term data preservation for future access, she said. Metadata models will be developed to describe the data sets and other data artifacts, recording details such as who created them, who ran the analysis job, where the data originated, and where the output went, so that every data set generated can be tracked. This will ensure that the data source, output location, and algorithms and parameters used in the science are available to other researchers in the future, she added.
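To make the idea concrete, the kind of record such a metadata model might capture can be sketched in a few lines of Python. This is only an illustration built from the attributes Dr. Qin lists; the class name, field names, file paths, and the lookup helper are all hypothetical and are not drawn from the project or from LIGO's actual software.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProvenanceRecord:
    """Hypothetical metadata record for one generated data set."""
    dataset_id: str            # identifier of the generated data set
    created_by: str            # who created the data set
    job_run_by: str            # who ran the analysis job
    input_sources: List[str]   # where the input data originated
    output_location: str       # where the output was written
    algorithm: str             # algorithm used in the analysis
    parameters: Dict[str, float] = field(default_factory=dict)


def derived_from(records: List[ProvenanceRecord], source: str) -> List[str]:
    """Return the IDs of all data sets that list `source` as an input."""
    return [r.dataset_id for r in records if source in r.input_sources]


# Illustrative values only; none of these names or paths are real.
records = [
    ProvenanceRecord(
        dataset_id="candidates-001",
        created_by="analyst_a",
        job_run_by="analyst_a",
        input_sources=["raw/segment-42"],
        output_location="results/candidates-001",
        algorithm="example_search",
        parameters={"threshold": 5.5},
    ),
    ProvenanceRecord(
        dataset_id="summary-001",
        created_by="analyst_b",
        job_run_by="analyst_b",
        input_sources=["results/candidates-001"],
        output_location="results/summary-001",
        algorithm="example_summary",
    ),
]

print(derived_from(records, "raw/segment-42"))
```

Keeping the inputs, output location, algorithm, and parameters together in one record is what would let a later researcher trace a result back to its sources, which is the tracking need described above.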
“It’s a good collaboration,” the iSchool professor remarked. “You can imagine I don’t understand astrophysics, but along the way, I’ve learned a lot about astrophysics research, as well as the field’s data structure and data management needs, so I can relate those issues and needs back to my field, and think about how metadata and ontologies can help meet those needs.”
Dr. Qin said most of her work will take place in the first two years of the study, when she will work with the principal investigators to design, implement, and test the metadata models. The third year is expected to consist of evaluating the effectiveness and usefulness of the models developed.
The proposal describes how the completed research will benefit scientists in diverse fields by making existing research easier to track and find. In the proposal, Brown explained: “Efficient methods to access and mine the large data sets generated by LIGO’s diverse gravitational-wave searches are critical to the overall success of gravitational-wave physics and astronomy. Providing these capabilities will maximize existing NSF investments in LIGO, support new modes of collaboration within the LSC [LIGO Scientific Collaboration], and better enable LSC scientists to explain their results to the external scientific community, including the critical issue of data and analysis provenance for LIGO’s first detections.”