Synthetic data is information that's artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models.[1]
Data generated by a computer simulation can be seen as synthetic data. This encompasses most applications of physical modeling, such as music synthesizers or flight simulators. The output of such systems approximates the real thing, but is fully algorithmically generated.
Synthetic data is used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data. In many sensitive applications, datasets theoretically exist but cannot be released to the general public;[2] synthetic data sidesteps the privacy issues that arise from using real consumer information without permission or compensation.
Synthetic data is increasingly being used for machine learning applications: a model is trained on a synthetically generated dataset with the intention of transfer learning to real data. Efforts have been made to enable more data science experiments via the construction of general-purpose synthetic data generators, such as the Synthetic Data Vault.[12] In general, synthetic data has several natural advantages:
This usage of synthetic data has been proposed for computer vision applications, in particular object detection, where the synthetic environment is a 3D model of the object,[13] and learning to navigate environments by visual information.
At the same time, transfer learning remains a nontrivial problem, and synthetic data has not become ubiquitous yet. Research results indicate that adding a small amount of real data significantly improves transfer learning with synthetic data. Advances in generative adversarial networks (GAN), lead to the natural idea that one can produce data and then use it for training. Since at least 2016, such adversarial training has been successfully used to produce synthetic data of sufficient quality to produce state-of-the-art results in some domains, without even needing to re-mix real data in with the generated synthetic data.