Author: Lucy Vasserman, John Cassidy
Publisher: Medium
Publication Year: 2019
Summary: The following article discusses how the Jigsaw team at Google analyzed Perspective API’s toxicity model, which assigns toxicity scores to online comments drawn from a variety of sources. In the training data, these comments are rated by human annotators, who score each comment on a toxicity scale along with some other optional attributes. In this post about model transparency, the team analyzed the model’s scores and found that toxicity was often misjudged in cases where a term could be used either as an identity label or as an insult. For example, many comments containing “gay”, “lesbian”, or “queer” were scored as more toxic than they should have been, and similar results were found for terms like “white” or “black”. Conversely, terms such as “cis” and “straight” were scored as less toxic than they should have been, according to their analysis. This was a demonstration of model transparency on the first version of their algorithm, which has since been updated and re-released several times. It illustrates the idea that we need to look closely at how our models behave: even with humans helping to build the model, comments that admit several possible interpretations can be extremely nuanced and difficult to categorize.
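
The kind of identity-term probing described above can be sketched against the public Perspective API. The snippet below is a minimal illustration, assuming the v1alpha1 AnalyzeComment endpoint, an API key, and simple template sentences that differ only in the identity term; the templates and term list here are illustrative, not the authors’ exact data or method.

```python
# Illustrative sketch: probe Perspective API toxicity scores for template
# sentences that vary only an identity term. Assumes the v1alpha1
# AnalyzeComment endpoint and a valid API key (placeholder below).
import requests

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder
URL = "https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze"

def toxicity_score(text: str) -> float:
    """Return the TOXICITY summary score for a single comment."""
    payload = {
        "comment": {"text": text},
        "languages": ["en"],
        "requestedAttributes": {"TOXICITY": {}},
    }
    resp = requests.post(URL, params={"key": API_KEY}, json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]

# Otherwise-identical sentences, so score differences can be attributed
# to the identity term itself (terms drawn from the summary above).
identity_terms = ["gay", "lesbian", "queer", "straight", "cis", "white", "black"]
for term in identity_terms:
    sentence = f"I am a {term} person."
    print(f"{term:>10}: {toxicity_score(sentence):.3f}")
```

Comparing scores across such matched sentences is one simple way to surface the unintended bias the article reports, since any gap between terms cannot be explained by the rest of the sentence.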