Synthetic Data Generation

Author: Christian Schitton

Publisher: Medium

Publication Year: 2022

Summary: The article discusses synthetic data, a lesser-known area of data science. Synthetic data addresses the problem of insufficient representation in datasets. For example, models can become biased when trained on datasets in which some demographic groups are underrepresented. Generating synthetic records that are demographically equivalent to the real data fills these gaps and reduces the bias that underrepresentation would otherwise introduce into models. Because the records are synthetic, they also help protect the privacy of the real people behind the data: the synthetic data preserves the demographic characteristics of the original, so that information is retained, but it does not tie back to real individuals the way real data does. This helps organizations work within data protection rules and suits corporations that do not want to expose their business data. Some large companies, such as Google, Amazon, and American Express, already use synthetic data. The article then goes into more of the technical side of synthetic data.