One area of increasing concern in Cyber Security is the intersection of AI, Big Data, personal privacy, and identity theft. Privacy concerns arise whenever data containing personal information is used. For example, financial institutions use Big Data to train AI algorithms to detect fraud by identifying transactions that fall outside an account's expected patterns; they also use similar techniques for marketing purposes.
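The idea of flagging transactions that fall outside an expected pattern can be sketched very simply. The following is a minimal illustration, not a production fraud detector: the account history and threshold are invented, and real systems use far richer features and models.

```python
import statistics

def is_anomalous(history, amount, threshold=3.0):
    """Return True if `amount` deviates more than `threshold` standard
    deviations from the mean of this account's past transactions."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(amount - mean) > threshold * stdev

# Invented example history for one account
history = [42.0, 55.0, 38.0, 61.0, 47.0, 52.0, 44.0]
print(is_anomalous(history, 49.0))    # False: close to the usual pattern
print(is_anomalous(history, 4900.0))  # True: far outside it
```

The point is only that the model learns what "normal" looks like from historical data, which is exactly why that data, real or synthetic, matters so much.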
An approach that is gaining traction is using synthetic data to reduce the risk of exposing personal information.
What is Synthetic Data?
In the broadest definition, synthetic data is artificially generated data. Bear in mind, though, that different scenarios call for different types of synthetic data.
Why do we need it?
There are two principal reasons for using synthetic data:
- Ease of collection. Where large volumes of anonymous data are needed, it is often easier to generate the data than to collect it. Generation can also be focused on a specific event, or series of events, that might rarely appear in real-world data. A good example is the data used to train the AI piloting driverless cars.
- Regulatory Compliance. New privacy regulations in the EU and the US place increasing restrictions on the collection, storage, and use of Big Data. Simply anonymizing it is no longer enough: it can be surprisingly easy to correlate or cross-reference two datasets to identify, or infer the identity of, specific individuals.
And that is where synthetic data comes in.
Synthetic data has been around for some thirty years. It has recently increased in popularity for two main reasons:
- The risks to personal privacy from Big Data have become more apparent;
- AI applications such as fly-by-wire systems and driverless vehicles demand enormous volumes of test data, far more than can be economically collected, and what is collected might not cover all the situations that need to be tested.
The first attempt to provide privacy-protecting synthetic data was in the early 1990s, when the US Decennial Census data was shared without disclosing any personal information.
It soon became evident that simple anonymization of data was not sufficient to protect privacy: anonymized data could be linked to other databases, and data mining software could then confirm or infer real identities.
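This kind of linkage attack is easy to demonstrate. In the hypothetical sketch below (all names and values invented), an "anonymized" medical dataset with direct identifiers removed is joined to a public voter roll on the quasi-identifiers the two datasets share:

```python
# "Anonymized" records: names removed, but quasi-identifiers remain
anonymized = [
    {"zip": "02138", "birth": "1945-07-21", "sex": "F", "diagnosis": "hypertension"},
    {"zip": "02139", "birth": "1971-03-02", "sex": "M", "diagnosis": "asthma"},
]
# A public dataset that carries the same quasi-identifiers plus names
voter_roll = [
    {"name": "J. Doe", "zip": "02138", "birth": "1945-07-21", "sex": "F"},
]

QUASI_IDS = ("zip", "birth", "sex")

def link(records, public):
    """Re-identify records by matching quasi-identifiers against a public dataset."""
    index = {tuple(p[q] for q in QUASI_IDS): p["name"] for p in public}
    return {
        index[key]: rec["diagnosis"]
        for rec in records
        if (key := tuple(rec[q] for q in QUASI_IDS)) in index
    }

print(link(anonymized, voter_roll))  # {'J. Doe': 'hypertension'}
```

No single field identifies anyone, but the combination of ZIP code, birth date, and sex is often unique, which is why removing names alone does not protect privacy.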
Studies by the World Economic Forum and the EU, among others, have shown both the potential and the shortcomings of synthetic data generation. They were particularly concerned about the Cyber Security impact of Big Data and sophisticated data mining.
Generating Synthetic Data
Generation techniques are becoming more sophisticated, moving well beyond simply anonymizing existing information. In some cases raw real-world data will still be better, but a generation process can be modelled to focus on the specific outliers you want to test.
Companies now generate anonymized synthetic datasets to order, usually from structured data such as financial or demographic records.
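A minimal sketch of the idea: fit summary statistics to the real data, then draw new records from those distributions, deliberately over-sampling the rare cases (here, fraud) that the model needs to see. All figures and category names below are invented for illustration; real generators use far more faithful statistical or ML-based models.

```python
import random

random.seed(0)  # reproducible draws for the example

# Summary statistics we would have measured from real transaction data
# (values invented for illustration)
mean_amount, stdev_amount = 48.0, 12.0
merchant_categories = ["grocery", "fuel", "retail", "dining"]

def synthetic_transactions(n, fraud_rate=0.02):
    """Draw synthetic records that mimic the real data's distributions,
    over-sampling the rare fraud cases we want to test against."""
    records = []
    for _ in range(n):
        is_fraud = random.random() < fraud_rate
        # Fraudulent amounts drawn from a shifted distribution
        amount = random.gauss(mean_amount * (50 if is_fraud else 1), stdev_amount)
        records.append({
            "amount": round(abs(amount), 2),
            "category": random.choice(merchant_categories),
            "fraud": is_fraud,
        })
    return records

sample = synthetic_transactions(1000)
```

Because every record is drawn from fitted distributions rather than copied from real accounts, the dataset carries the statistical shape of the original without containing any actual personal information.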
Deep Fake generation is a specific variant of synthetic data generation. A Deep Fake is a fully synthetic persona, complete with artificially generated images, videos, and background information. At first sight it looks like a real person, but on closer investigation it is completely fake. Deep Fakes frequently pass routine Cyber Security checks without trouble.
There has been much discussion about Deep Fakes subverting the electoral process, with claims that they were used in recent elections, including the 2016 US Presidential election and the UK EU referendum, as a political tool to influence voter intentions. The usual technique was to generate momentum for a particular view by using Deep Fakes to spread fake news and post comments on social media.
Using Synthetic Data
Synthetic data is used to:
- Analyse Big Data. Running BI analyses on synthetic stand-ins for sensitive data can take the processing outside the scope of GDPR, since no real personal data is involved;
- Train AI Algorithms;
- Make anonymous data available on the cloud; and
- Enable Safe Data Sharing. Synthetic data can be shared with business partners in collaborative projects without exposing personal information.
Does Synthetic Data Protect Privacy?
As with many things, the answer is both yes and no. In pure privacy terms it probably does, by removing identifiable personal data from datasets. On the other hand, the ability to create large bodies of Deep Fakes with seemingly live data raises significant Cyber Security questions about the use of false identities to perpetrate fraud.