The ethical and privacy issues of data augmentation in the medical field
By Piercosma Bisconti
The ethical issues arising from the use of data augmentation in the field of medicine are increasingly evident. This technique, also called synthetic data generation, is a process in which artificial data are created to enrich a starting data set or to overcome some of its limitations. It is particularly useful when AI models have to be trained to recognise rare diseases, for which little training data is available. By means of data augmentation, further data can be added artificially while remaining representative of the starting sample.
From a technical point of view, data augmentation is performed using algorithms that modify existing data or generate new data based on existing data. For example, in the context of image processing, original images can be modified by rotating them, blurring them, adding noise or changing the contrast. In this way, different variants of an original image are obtained that can be used to train artificial intelligence models. Training on these variants makes AI models increasingly effective at recognising diseases, such as certain types of rare cancers.
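The four image transformations mentioned above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a production pipeline: it assumes a grayscale image stored as a list of rows with pixel values between 0 and 1, and the function names, the noise level and the contrast factor are hypothetical choices made for the example.

```python
import random


def clamp(value):
    """Keep a pixel value inside the valid range [0, 1]."""
    return min(1.0, max(0.0, value))


def rotate90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def box_blur(img):
    """Blur by averaging each pixel with its in-bounds 3x3 neighbours."""
    height, width = len(img), len(img[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbours = [
                img[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(sum(neighbours) / len(neighbours))
        out.append(row)
    return out


def add_noise(img, sigma, rng):
    """Add Gaussian noise with standard deviation sigma to every pixel."""
    return [[clamp(p + rng.gauss(0.0, sigma)) for p in row] for row in img]


def change_contrast(img, factor):
    """Scale each pixel's distance from mid-grey (0.5); factor > 1 raises contrast."""
    return [[clamp(0.5 + factor * (p - 0.5)) for p in row] for row in img]


def augment(img, seed=0):
    """Return four variants of one image: rotated, blurred, noisy, re-contrasted."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [
        rotate90(img),
        box_blur(img),
        add_noise(img, sigma=0.05, rng=rng),
        change_contrast(img, factor=1.5),
    ]


# Hypothetical 2x3 toy image, used only to show the call.
image = [[0.0, 0.2, 0.4], [0.6, 0.8, 1.0]]
variants = augment(image)
print(len(variants))
```

Each variant is derived from the same original, so one real image yields several training samples. In practice the transformations would be chosen so that the clinically relevant features of the image survive them.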
However, several ethical issues arise from the use of data augmentation in medicine. One of the main concerns relates to the quality of the data generated. If the source data are not representative of the population, or if they contain errors or biases, the application of data augmentation could amplify these issues. For example, if the original dataset contains only white males, there is a risk that the augmented data will be biased towards these individuals, transferring the inequalities present in the original data to the generated data.
The replication of bias is certainly the most critical issue with regard to data augmentation. If the artificial intelligence model is trained on generated data that are unrepresentative or carry inherent biases, the model itself may perpetuate these biases in its decisions. For this reason, the quality of the source dataset is an even more critical issue in synthetic data generation than in artificial intelligence in general.
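A small sketch can make the point about source data concrete. It assumes a deliberately simplified form of augmentation in which every generated sample is derived from one existing record and inherits that record's demographic group; the toy dataset, the group labels and the function names are hypothetical.

```python
from collections import Counter


def group_shares(records):
    """Return the fraction of records that belongs to each demographic group."""
    counts = Counter(record["group"] for record in records)
    total = len(records)
    return {group: count / total for group, count in counts.items()}


def augment_records(records, copies):
    """Stand-in for any augmentation that derives new samples from existing ones.

    Each generated variant inherits the group of the record it was derived from.
    """
    augmented = list(records)
    for record in records:
        for variant in range(copies):
            augmented.append(
                {"group": record["group"], "source": record["id"], "variant": variant}
            )
    return augmented


# Hypothetical toy dataset: nine records from group A, one from group B.
original = [{"id": i, "group": "A"} for i in range(9)] + [{"id": 9, "group": "B"}]
augmented = augment_records(original, copies=4)

print(len(original), group_shares(original))
print(len(augmented), group_shares(augmented))
```

The augmented dataset is five times larger, but the share of each group is the same as before. Under these assumptions, adding data does not add diversity: an imbalance in the source dataset is carried over unchanged into the generated data.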
Data privacy is another issue to consider. The use of data augmentation requires access to sensitive patient data, which might include personal or confidential information. It is crucial to ensure that this data is adequately protected and used only for specified purposes. To address these concerns, solutions such as federated learning and secure multiparty computation have been proposed. These approaches make it possible to train artificial intelligence models without having to transfer sensitive data to a single location, thus protecting patients’ privacy.
Federated learning is an innovative approach to training artificial intelligence models that addresses data privacy issues. Instead of transferring sensitive data from individual users or devices to a central server, federated learning allows models to be trained directly on users’ devices.
The federated learning process works as follows: initially, a global model is created and distributed to all participating users’ devices. Subsequently, these devices train the model using their own local data without sharing it with the central server. During local training, the models on the devices are constantly updated and improved.
Then, instead of sending the raw data to the central server, each device sends only its updated model parameters, which are aggregated into a new global model. Because only parameters are exchanged, the raw personal data never leaves the device on which it was collected, and the aggregation step itself can be further protected, for example through secure aggregation protocols.
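The process described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the model is a one-variable linear regression, the server combines client parameters with an average weighted by the number of local samples (the text does not specify an aggregation rule, so this choice is an assumption), and the two client datasets are hypothetical. The sketch applies no encryption or other protection to the parameters.

```python
def local_train(params, data, lr=0.05, epochs=20):
    """Train on one client's data with gradient descent, starting from the global model.

    params is a (weight, bias) pair; data is a list of (x, y) pairs that
    never leaves this function, mirroring data that stays on the device.
    """
    w, b = params
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def aggregate(updates):
    """Combine client parameters into a new global model.

    updates is a list of ((weight, bias), n_samples) pairs; each client's
    parameters are weighted by the size of its local dataset.
    """
    total = sum(n for _, n in updates)
    w = sum(params[0] * n for params, n in updates) / total
    b = sum(params[1] * n for params, n in updates) / total
    return (w, b)


def federated_round(global_params, clients):
    """One round: distribute the global model, train locally, aggregate parameters."""
    updates = [(local_train(global_params, data), len(data)) for data in clients]
    return aggregate(updates)


# Hypothetical local datasets for two clients, both drawn from y = 2x + 1.
clients = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
]

params = (0.0, 0.0)  # initial global model
for _ in range(50):
    params = federated_round(params, clients)
print(params)  # approaches (2.0, 1.0)
```

Only the `(weight, bias)` pairs and the sample counts cross the boundary between `local_train` and `aggregate`; the `(x, y)` records themselves are never passed to the server-side function.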
Finally, it is important to note that there are many other ethical issues related to the use of data augmentation in medicine. For instance, there is a risk that synthetic data generation may lead to an oversimplification of complex medical problems, ignoring the complexity of real-life situations. In the context of the AI Act and the European Commission’s ‘Ethics Guidelines for Trustworthy AI’, it is becoming increasingly crucial to analyse technologies as complex and as far-reaching as AI systems that support medical decision-making.