By laying the groundwork for harmonizing the various forms of data and open source software across different formats and repositories, this project will cut data processing time, give more researchers access to big data, and produce more robust, replicable, and accurate research.
While software engineering research has made great progress over the decades, the unavailability of data has long been a major limitation. Open source software (OSS) makes large quantities of data available, creating the promise of using modern quantitative techniques on very large data sets to understand how to create better software and improve productivity. However, the data exists in different formats, has many errors and omissions, is located in millions of repositories around the world, and requires extensive processing to render it in a form researchers can use. The planning grant aims to develop the requirements for creating World of Code (WoC), an infrastructure for enhancing software engineering research, and to build a research community centered around massive data gathered from all open source repositories around the world. This planning process will ensure that WoC, if realized, will enable researchers to easily access all open source data, resolve issues of replication, avoid drawing samples that are not representative, and reduce the risk of inappropriate conclusions based on erroneous data. The planning grant will establish a Steering Committee and an Advisory Board to guide the development and evolution of the resource in a way that provides the greatest value to the community of researchers. Hackathons will give both experienced and novice researchers an opportunity to learn how to use WoC's prototype capabilities, add to the tooling, and concretely express the desires and priorities of the research community WoC seeks to serve.
The resources needed to mine and maintain data on the entire OSS code base and its version histories far exceed what individual research groups can accomplish within the scope of conventional hypothesis-driven research. The pressing need to investigate the entirety of OSS will be addressed by creating requirements and building a community for an infrastructure that should contain: 1) complete, rapidly updated source code data for the entirety of OSS; 2) tooling for the data correction, augmentation, and curation essential for such data, e.g., identity disambiguation and extraction of dependencies; 3) services supporting research tasks that rely on the entirety of the collection; and 4) a self-governing community of researchers, developers, and companies who maintain, enhance, and operate this infrastructure. To elicit the precise requirements for this OSS-wide infrastructure, a wide audience of software researchers will be engaged.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.