A New Era of Healthcare: Navigating Synthetic Data

Published on

December 12, 2023

Authors

Sónia Marques

Data Scientist, Deeper Insights

In the age of data-driven healthcare, synthetic data is emerging as a transformative force. This innovative technology allows us to create artificial yet statistically meaningful data sets, offering a plethora of applications that can revolutionise healthcare as we know it. From studying patterns in stigmatised illnesses to simulating the impact of healthcare policies, synthetic data is breaking new ground.

While the concept of synthetic data has been around for over three decades, it is only now reaching a tipping point in adoption. The technology is changing not just medical research, but also the methods healthcare providers use to manage patient care. So, what is synthetic data, and how is it causing a stir in the healthcare sector?

What is Synthetic Data?

Synthetic data is artificially generated information that mimics the characteristics of real-world data. Created through algorithms and computational models, synthetic data serves as a stand-in for actual patient records, clinical studies, or other types of healthcare data. While it doesn't replace the depth and nuance of real data, it offers a valuable alternative for research and analysis. By replicating statistical patterns found in genuine data sets, synthetic data provides a lower-risk environment for healthcare professionals to test hypotheses, develop models, and make informed decisions.
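As a minimal sketch of what replicating statistical patterns can mean in practice, the Python example below fits a two-variable Gaussian to a handful of records and then samples new records with similar means, spreads and correlation. The toy records, the column meanings (age and systolic blood pressure) and the function names are all hypothetical, chosen only for illustration; real-world generators are typically far more sophisticated.

```python
import math
import random

# Hypothetical toy "real" records: (age in years, systolic blood pressure in mmHg).
TOY_RECORDS = [(34, 118), (45, 125), (52, 130), (61, 142),
               (29, 115), (70, 150), (48, 128), (56, 135)]


def fit_bivariate_gaussian(rows):
    """Estimate means, standard deviations and Pearson correlation of two columns."""
    n = len(rows)
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    return {"mean_x": mean_x, "mean_y": mean_y,
            "sd_x": sd_x, "sd_y": sd_y, "corr": cov / (sd_x * sd_y)}


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic (x, y) pairs that preserve the fitted correlation."""
    rng = random.Random(seed)
    r = params["corr"]
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - r * r))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = params["mean_x"] + params["sd_x"] * z1
        # Mixing z1 into y is what carries the correlation over.
        y = params["mean_y"] + params["sd_y"] * (r * z1 + residual * z2)
        rows.append((x, y))
    return rows


params = fit_bivariate_gaussian(TOY_RECORDS)
synthetic = sample_synthetic(params, n=1000, seed=42)
```

None of the sampled rows corresponds to an actual record, yet summary statistics computed on `synthetic` should be close to those of the original records. A Gaussian is only one possible modelling assumption and would be a poor fit for skewed or categorical clinical variables.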

Better than “real” patient data?

A great advantage of synthetic data is that it can be tailored to specific requirements that real data may not meet. Synthetic datasets can serve as a “simulation”, allowing researchers to account for unexpected results and devise a solution if the initial results are not satisfactory. Real patient data, in addition to being complicated and expensive to collect, can contain inaccuracies or biases that affect the quality of a machine learning model trained on it. Synthetic data can potentially provide balance and variety, and the generation process can automatically fill in missing values and apply labels, which may enable more accurate predictions. [source]

Synthetic Data Applications

Stigmatised Illnesses

Synthetic data offers significant advantages in enhancing data infrastructure and addressing emerging health challenges. In the field of mental health, data-sharing restrictions on conditions like opioid use disorder (OUD) have hindered researchers and public health departments. By generating synthetic longitudinal records of individuals diagnosed with OUD and of those who died of an opioid overdose, researchers can access valuable data for analysing patterns, identifying risks, simulating policy impacts, and evaluating program effectiveness. Synthetic datasets also prove beneficial for studying communicable diseases and stigmatised populations, such as individuals diagnosed with HIV, where barriers to data sharing exist. [source]

COVID-19 pandemic

In response to the COVID-19 pandemic, demand for timely and accessible data grew, leading to increased interest in synthetic data generation. Notably, the National Institutes of Health's National COVID Cohort Collaborative (N3C) employed synthetic data alongside restricted research identifiable files. A synthetic version of the collected electronic health records (EHR) was created to make the data more available to the wider research community and citizen scientists. More studies are now underway to validate the utility of synthetic data in research. Recent papers on the use of synthetic data for COVID-19-related clinical research indicate that synthetic data can serve as a proxy for real datasets, and that analyses of both synthetic and real data can yield statistically significant results, which strengthens the case for using synthetic data. [source]

Medical imaging

Several examples demonstrate the effectiveness of synthetic data in improving medical imaging algorithms. For instance, researchers have successfully used synthetic data to train deep-learning models for detecting abnormalities in chest X-rays [source]. Trained on a combination of synthetic images representing different pathologies and real patient data, the models achieved higher accuracy and sensitivity in identifying lung diseases.

Figure: Advantages and Limitations of Model Performance Enhancement (using data generated with GANs), from “Training Strategies for Radiology Deep Learning Models in Data-limited Scenarios”. [Source]

In another example, synthetic data has been generated to improve the performance of MRI-based brain tumour segmentation algorithms [source, source]. By generating synthetic MRI images with varying tumour characteristics, researchers are able to train models that accurately delineate tumours, assisting clinicians in diagnosis and treatment planning.

Source: https://arxiv.org/abs/2012.08604

Advantages and Limitations of Synthetic Data

While synthetic data holds immense promise, it's not without its challenges. Here's a quick look.

Advantages:

Diversity and Variety: Synthetic data can be generated with diversity in mind, so that it accurately represents the target population and captures different scenarios. For instance, the COVID-19 studies mentioned above indicate that analyses of synthetic data can yield statistically significant results, which adds to the value and usefulness of the data.
Enhanced Collaboration: Facilitates cross-institutional research, allowing organisations to examine each other's synthetic data without data-sharing agreements. The U.S. Department of Veterans Affairs' ARCHES program is a prime example, using synthetic data to collaborate on suicide prevention. [source]

Limitations:

Insufficient Representation: Synthetic data may not capture all the nuances of real data, especially for rare events or conditions. Because actual patient data for rare events is scarce, there are few diverse, high-quality examples to learn from, so the synthetic data might not adequately capture the intricacies associated with those events.
Disclosure Risk: The more realistic the synthetic data, the higher the risk of disclosure. [source] Reducing that risk can cost analytic value: for example, the CMS DE-SynPUF's synthesising process resulted in a significant reduction in the amount of interdependence among variables, making it less useful for analytics. [source]
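One simple way to spot a loss of interdependence among variables is to compare pairwise correlations in the real and synthetic tables. The Python sketch below does this for numeric columns; the column names, the toy values and the function names are hypothetical, and this is not necessarily the method used in the evaluation cited above.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_gaps(real, synthetic):
    """Absolute difference in correlation for every pair of shared columns.

    Both arguments map a column name to a list of numeric values.
    A large gap suggests the synthetic table lost (or invented) a relationship.
    """
    gaps = {}
    for a, b in combinations(sorted(real), 2):
        gaps[(a, b)] = abs(pearson(real[a], real[b])
                           - pearson(synthetic[a], synthetic[b]))
    return gaps


# Hypothetical toy tables: in the "real" one, cost rises with age.
real = {"age": [30, 40, 50, 60, 70],
        "cost": [100, 150, 210, 260, 330]}
# Same marginal values, but the cost column is shuffled, breaking the link.
synthetic_shuffled = {"age": [30, 40, 50, 60, 70],
                      "cost": [210, 330, 100, 260, 150]}

gaps = correlation_gaps(real, synthetic_shuffled)
```

Here each column of the shuffled table has exactly the same values as the real one, so per-column summaries look perfect, yet `gaps[("age", "cost")]` is large. This is why utility checks usually look at relationships between variables and not only at individual distributions.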

Implications:

Medical Research: The application of synthetic data in medical research, such as improving algorithms for detecting abnormalities, has demonstrated effectiveness. However, its limitations in predicting rare diseases based on various clinical variables have also been observed.
Ethical Considerations: The balance between realism and disclosure risk raises ethical considerations that must be carefully navigated to protect privacy while maintaining analytic value.

Key Takeaway

Synthetic data is becoming an increasingly valuable asset in healthcare, offering a wide range of applications from policy simulation to medical imaging. While the technology has been around for several decades, its current trajectory indicates broader adoption and more sophisticated uses in the near future. Challenges such as data representation and privacy concerns do exist, but ongoing research and collaborative efforts are addressing these issues. Initiatives like the National COVID Cohort Collaborative (N3C) and the U.S. Department of Veterans Affairs' ARCHES program exemplify the growing confidence in synthetic data's utility. As healthcare continues to evolve in a data-driven direction, synthetic data is likely to play a significant role in advancing both research and patient care.
