Datasheets for Datasets

Author: Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, Kate Crawford

Publisher: Microsoft

Publication Year: 2019

Summary: This proposal observes that as the field of machine learning has grown, algorithmic processes and data are repeatedly misused in ways that reinforce biases or raise other ethical and legal concerns. It proposes that, as in the electronics industry, where every component ships with a datasheet, each dataset should be accompanied by a description prompted by questions in 7 categories. These cover, for example, the motivation for creating the dataset, the data collection process and preprocessing, and whether the dataset is maintained, and by whom. The paper presents this format, with a sample, as a systematic way to think about data and start conversations on ethical and legal concerns. Although it was prompted by machine learning concerns specifically, it could be useful to think through for any kind of modeling or work with data.