Ultra-large-scale software repositories such as GitHub contain an enormous corpus of software and information about software. Scientists and engineers alike are interested in analyzing this wealth of information, both out of curiosity and to test important research hypotheses. However, the barrier to entry is prohibitive for several reasons. First, analyzing this information requires well-established infrastructure and deep expertise in programmatically accessing version control systems, data storage and retrieval, data mining, and parallelization. This requirement significantly increases the cost of scientific research that attempts to answer questions involving ultra-large-scale software repositories. As a result, experiments are often not replicable, and reusability of experimental infrastructure is low. Second, data associated with and produced by such experiments is often lost or becomes inaccessible and obsolete because there is no systematic curation. Finally, building analysis infrastructure that processes ultra-large-scale data efficiently can be very difficult.
This project will continue to enhance Boa, a CISE research infrastructure, to aid and assist with such research. The next version, called Boa 2.0, will continue to be globally disseminated. The project will also further develop the programming language, which is likewise called Boa. Boa allows scientists and engineers to focus on developing the program logic by handling the details of programmatically accessing version control systems, data storage and retrieval, data mining, and parallelization behind the scenes. The project will also enhance the data mining infrastructure for Boa, along with a BIGDATA repository containing millions of open-source projects, to support experiments that analyze ultra-large-scale software repositories. The project will integrate Boa 2.0 with the Center for Open Science’s Open Science Framework (OSF) to improve reproducibility and will use the national computing resource XSEDE to improve scalability.
The broader impacts of Boa 2.0 stem from its potential to enable developers, designers, and researchers to build intuitive, multi-modal, user-centric scientific applications that support scientific research on the individual, social, legal, policy, and technical aspects of open-source software development. This advance will primarily be achieved by significantly lowering the barrier to entry, thus enabling a larger and more ambitious line of data-intensive scientific discovery in this area.