Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers

Authors: Kenny Peng, Arunesh Mathur, Arvind Narayanan

Publisher: 35th Conference on Neural Information Processing Systems

Publication Year: 2021

Summary: This paper examines why the most common remedy for an unethical dataset, removing it, does not work as most people expect or intend. Deleting the source is often not enough: for most public datasets, the data has already spread to so many places online that taking down the original changes little. The study discusses a Microsoft dataset that used the faces of 100,000 “celebrities” to match a user with the celebrity they most resemble. The dataset was taken down because the people included had not consented to being in it. Even though it is no longer publicly available, traces of its use remain in 1,000 papers on the internet.