Author: Joanna Kaminska
Publisher: Statice
Publication Year: 2022
Summary: This article discusses how data may be biased, and why that matters: if biased data is used to train a machine learning (ML) model, the resulting model will most likely be biased as well. A biased model can discriminate against certain groups of people. This often happens through systemic bias, where certain groups are underrepresented in a dataset, but bias can also arise through selection bias, which is why independence or randomization is so important throughout the data collection process. Models are also harmed by too much training, which causes overfitting and can lead to poor performance. One important point made by the author is the value of "diverse" and "representative" datasets: a model should not be built on data affected by sample bias, because some groups, individuals, or other factors would then be left out when drawing conclusions about a population.
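The underrepresentation the article warns about can be checked before any model is trained. The sketch below is a hypothetical illustration (the function name, groups, and tolerance are my own assumptions, not from the article): it compares each group's share of a collected sample against its known share of the target population and flags gaps that exceed a tolerance.

```python
from collections import Counter

def representation_gaps(sample_groups, population_shares, tolerance=0.05):
    """Compare each group's share of a sample against its share of the
    target population; return the groups whose gap exceeds `tolerance`.
    A negative gap means the group is underrepresented in the sample."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    gaps = {}
    for group, pop_share in population_shares.items():
        sample_share = counts.get(group, 0) / total
        gaps[group] = round(sample_share - pop_share, 3)
    return {g: d for g, d in gaps.items() if abs(d) > tolerance}

# Hypothetical data: group "B" makes up 30% of the population
# but only 10% of the collected sample.
sample = ["A"] * 90 + ["B"] * 10
flagged = representation_gaps(sample, {"A": 0.7, "B": 0.3})
print(flagged)  # both groups' shares are off: A is over-, B underrepresented
```

A check like this would support the article's recommendation: if `flagged` is non-empty, the dataset is not representative, and training should wait until more data is collected or the sample is rebalanced.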