This project will facilitate software engineering research by creating a public research infrastructure centered on a curated collection of source code and version control history data approximating the entirety of open-source software (OSS).
Real-world data from OSS development have catalyzed progress in software engineering research over the last two decades, and OSS version control data are public and detailed, recording both the actions of developers and the versions of the source code. Even so, the data are not suitable for research in their raw form. This is caused by the following factors:
- The sheer scale of the data and the need for their curation, which involves collection, contextualization, correction, augmentation, and integration.
- The data are distributed across many platforms, embedded in many tools and formats, and scattered over tens of millions of repositories.
- Curating data across the entire OSS ecosystem is beyond the capabilities of individual research groups, which leaves many important research questions unanswered.
- Individual OSS projects depend on each other and share source code and developers. This creates tremendous risks, for example, the spread of vulnerable source code and the ripple effects of volunteer maintainers disengaging.
The WoC team will create nearly complete, fully curated, and extensively cross-referenced version control data that will enable the research community to measure and understand the dynamics of OSS ecosystems and, thus, help identify and manage risk to OSS and to society in general.
This project will use input from the software engineering community to create a research infrastructure that contains: 1) a regularly updated and cross-referenced source code and version history resource approximating the entirety of OSS; 2) data curation capabilities, e.g., identity disambiguation and extraction of dependencies; 3) easy-to-use web services and applications to support common research tasks; 4) training, including tutorials, mentoring, hackathons, and seminars, to help researchers use the resource effectively and efficiently; and 5) a community of researchers, developers, and companies who maintain, guide, enhance, and operate this infrastructure.
This research infrastructure will thus enable answers to an entirely new set of research questions concerning OSS network structure defined by technical dependencies, code sharing, and knowledge flows. It will also provide accessible means for stratified sampling from the OSS universe of code, improving the generality of research findings.