Data Collection for ML Models: Strategies and Protocols for Ensuring Dataset Integrity

cover
10 Jun 2024

Authors:

(1) TIMNIT GEBRU, Black in AI;

(2) JAMIE MORGENSTERN, University of Washington;

(3) BRIANA VECCHIONE, Cornell University;

(4) JENNIFER WORTMAN VAUGHAN, Microsoft Research;

(5) HANNA WALLACH, Microsoft Research;

(6) HAL DAUMÉ III, Microsoft Research; University of Maryland;

(7) KATE CRAWFORD, Microsoft Research.

1 Introduction

1.1 Objectives

2 Development Process

3 Questions and Workflow

3.1 Motivation

3.2 Composition

3.3 Collection Process

3.4 Preprocessing/cleaning/labeling

3.5 Uses

3.6 Distribution

3.7 Maintenance

4 Impact and Challenges

Acknowledgments and References

Appendix

3.3 Collection Process

As with the questions in the previous section, dataset creators should read through these questions prior to any data collection to flag potential issues and then provide answers once collection is complete. In addition to the goals outlined in the previous section, the questions in this section are designed to elicit information that may help researchers and practitioners to create alternative datasets with similar characteristics. Again, questions that apply only to datasets that relate to people are grouped together at the end of the section.

• How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If the data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

• What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)? How were these mechanisms or procedures validated?

• If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

• Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

• Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

• Were any ethical review processes conducted (e.g., by an institutional review board)? If so, please provide a description of these review processes, including the outcomes, as well as a link or other access point to any supporting documentation.

If the dataset does not relate to people, you may skip the remaining questions in this section.

• Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

• Were the individuals in question notified about the data collection? If so, please describe (or show with screenshots or other information) how notice was provided, and provide a link or other access point to, or otherwise reproduce, the exact language of the notification itself.

• Did the individuals in question consent to the collection and use of their data? If so, please describe (or show with screenshots or other information) how consent was requested and provided, and provide a link or other access point to, or otherwise reproduce, the exact language to which the individuals consented.

• If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? If so, please provide a description, as well as a link or other access point to the mechanism (if appropriate).

• Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? If so, please provide a description of this analysis, including the outcomes, as well as a link or other access point to any supporting documentation.

• Any other comments?

This paper is available on arxiv under CC 4.0 license.