Overview: The Institute for Social Research (ISR) proposes to implement the COA3D (Collect, Organize, Archive, Access, and Analyze Data) infrastructure. It addresses two of NSF's Big Ideas: convergence across scientific domains and harnessing the data revolution. When implemented, COA3D will comprise a flexible system of standards, tools for collecting and managing data that implement and rely on those standards, and technological processes that leverage those standards for discovering, analyzing, archiving, and disseminating data. It will improve the quality, integrity, and safety of data while increasing access for, and collaboration among, users across all social science disciplines. COA3D, encompassing the full data life cycle, will ensure research data are FAIR (findable, accessible, interoperable, and reusable), and will make scientific analyses using those data more rigorous, transparent, and reproducible. COA3D will include stand-alone functional components for each stage of the research life cycle, and the components will be interoperable with one another.
Intellectual Merit: Twentieth-century data infrastructure cannot adequately support twenty-first-century social science. Diverse types of data enable path-breaking analyses of human behavior but also present challenges of scale, sensitivity, and structure, requiring new approaches to causal inference, storage and preservation, analysis, and privacy. There is consensus in the scientific community on the urgent need for new modes of access, confidentiality protection, methodological approaches, and tools so that research using a variety of data types meets accepted scientific standards. Current barriers include multiple incommensurate standards for data, lack of interoperability, and the inherent difficulty of managing data abundance. The designed and ready-to-implement infrastructure described in this proposal will enable social scientists across disciplines to conduct their work more efficiently and to create, organize, archive, access, and analyze data in ways that they cannot with existing infrastructure. The overarching principle of COA3D is to improve the quality, integrity, and safety of data while simultaneously increasing the accessibility of the data to more users from all disciplines.
The component parts of the infrastructure include: (1) repositories for research plan pre-registration, grants, preprints, consent statements, data use agreements, IRB approvals, and other research documents; (2) interactive software to facilitate data harmonization and to attach appropriate metadata that permit interoperability, maintain provenance, prepare data for re-use and discovery, and check for confidentiality issues; (3) repositories for analysis code and for social media and multimedia data (or linkages to those stored elsewhere); (4) tools for harvesting online data, tools for data visualization and integration, and user-friendly cloud-based infrastructure for analysis of large, non-designed datasets; (5) protection of privacy and confidentiality through a cloud-based virtual data enclave and carefully designed noise infusion; and (6) automated discovery and inclusion of scientific output as metadata.
Broader Impacts: COA3D will have both direct and indirect impacts beyond academic researchers. It will make social and behavioral data easier to find and use for researchers inside and outside of academia, and it will make data-driven analyses more reproducible. COA3D will have a transformative impact on social science research, not only at the Institute for Social Research or the University of Michigan but across empirical social science more broadly. These resources will increase the quantity, quality, and transparency of data available to researchers. They will allow researchers to focus their efforts on their research, as less time is required to repeat (over and over again) the descriptions of their research and