Improving the Apache AsterixDB big data management system to harness the data generated by digital activity and sensors.
Researchers in all disciplines can gain tremendous insight from the big data generated on a daily basis through social networks, blogs, online communities, news sources, and mobile applications, as well as our increasingly sensed surroundings. Researchers exploring these big data sources need software to store, index, manage, and analyze the data, while researchers investigating new technical approaches for managing and analyzing it can benefit significantly from shared building blocks to use as a foundation for their efforts. Over the past ten years, the Apache AsterixDB scalable big data management system (BDMS) has been developed to address this need by providing a repository for semi-structured data that cannot be organized in tables.
Apache AsterixDB is a highly scalable BDMS that stores, indexes, and manages semi-structured data, much like MongoDB, but it also supports a full query language with the expressiveness of SQL and more. Unlike analytics engines such as Apache Hive or Spark, AsterixDB stores and manages the data itself, so it can use its knowledge of data partitioning and index availability to avoid scanning whole data sets when processing queries.
This project is enhancing AsterixDB to better meet community needs, including improved text handling, numerous query processing improvements, additional standards-based geospatial data support, user-defined functions for user-provided logic including machine-learned models, and a variety of improvements to the system's storage, indexing, data ingestion, and integration with other systems. In addition to enabling computer and information science and engineering research on big data management, Apache AsterixDB can be used to train students nationwide in big data management and analysis; such training is crucial to addressing the information explosion due to social media, the mobile Web, and the Internet of Things (IoT).