Today, we will talk about the FUNDAMENTAL role of data quality in Machine Learning, and why poor data quality leads to bias and discrimination.
Data quality plays a critical role in Machine Learning. Why? Because each Machine Learning model is trained and evaluated using datasets, and the characteristics and quality of these datasets will DIRECTLY influence the outcome of a model.
One definition of “data quality” is whether the data used are “fit for the purpose”. Consequently, the quality of data depends largely on the purpose of its use.
What causes discrimination and bias in the predictions of an algorithm? There are several reasons, but, roughly speaking, one of the main ones is the mismatch between the context in which those predictions will be applied and the quality of the data on which the algorithm was trained.
These mismatches can have VERY SERIOUS consequences when the predictions of Machine Learning algorithms are used in high-risk contexts such as predictive justice, hiring, finance, or insurance.
Of particular concern are recent examples showing that Machine Learning models can reproduce, or amplify, unwanted social biases reflected in the datasets.
Examples of these problems include gender discrimination in machine translation produced by natural language processing systems, and skin-tone discrimination in facial recognition systems caused by poor data quality.
As an example, Amazon cancelled the development of an automated recruitment system because the system amplified gender biases in the technology industry.
And the paper “Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings” showed that low-dimensional embeddings of English words trained on news articles reproduce gender stereotypes, completing the analogy “man is to computer programmer as woman is to X” with “homemaker”.
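As a hedged illustration (not the paper's exact setup), the sketch below shows how such an analogy can be queried with pretrained news-text embeddings, assuming the gensim library and its downloadable “word2vec-google-news-300” vectors (a large download):

```python
# Minimal sketch: querying a word-embedding analogy of the kind studied in
# "Man is to Computer Programmer as Woman is to Homemaker?".
# Assumes gensim and the pretrained Google News vectors are available;
# the exact tokens and results may differ from those in the paper.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # pretrained news embeddings

# Vector arithmetic: "man" is to "programmer" as "woman" is to ... ?
result = vectors.most_similar(positive=["woman", "programmer"],
                              negative=["man"], topn=5)
for word, score in result:
    print(word, round(score, 3))
```

Embeddings trained on biased text tend to place stereotypically gendered occupations close to the corresponding gendered words, which is exactly what this kind of query makes visible.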
Employers now use similar systems to select employees and to monitor their activity, in order to keep them productive and healthy and to predict their failure, success, resignation or even suicide, so that they can take early steps to mitigate those risks.
Facial Recognition Technology is seriously challenged by the discrimination it produces. In this video, Joy Buolamwini explains how facial recognition software fails to recognize her face because she is a Black woman, and why that happens.
In turn, Joy Buolamwini and Timnit Gebru, in the paper “Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification”, found that the systems of three companies developing Facial Recognition Technology classified lighter-skinned men (white men) almost perfectly, while error rates for darker-skinned women were as high as 33%.
Why? Because of the lack of data sets labeled by ethnicity.
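A hedged sketch of the kind of disaggregated evaluation behind that finding, using hypothetical data rather than the benchmark from the paper:

```python
# Minimal sketch (hypothetical data): computing error rates per demographic
# subgroup instead of a single aggregate accuracy, as "Gender Shades" does.
import pandas as pd

# One row per test image: the subgroup, the true label and the prediction.
results = pd.DataFrame({
    "group":      ["lighter_male", "lighter_male", "darker_female", "darker_female"],
    "true_label": ["male", "male", "female", "female"],
    "predicted":  ["male", "male", "male", "female"],
})

results["error"] = results["true_label"] != results["predicted"]
print(results.groupby("group")["error"].mean())
# A large gap between subgroups is exactly the disparity the paper reports.
```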
Discrimination in algorithmic data-based decision making can occur for several reasons.
Discrimination can occur during the design, testing and implementation of algorithms used for facial recognition, through biases that are incorporated, consciously or not, into the algorithm itself.
If an algorithm performs differently across groups, it is usually very difficult, and sometimes impossible, to eliminate the bias through purely mathematical or programmatic solutions. An important cause of discrimination is the quality of the data used to develop algorithms and software.
To be effective and accurate, facial recognition software needs to be fed with large amounts of facial images. More facial images lead, in principle, to more accurate predictions.
However, accuracy is not only determined by the number of face images processed but also by their quality. Data quality also requires a representative set of faces reflecting different groups of people.
But, as we said before, to date, the facial images used to develop algorithms in the Western world often over-represent white men, with fewer women and/or people of other ethnic origins.
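A hedged sketch of what checking that representativeness might look like in practice, using hypothetical metadata:

```python
# Minimal sketch (hypothetical metadata): checking how the demographic groups
# in a face-image training set are distributed before training on it.
from collections import Counter

# One demographic label per training image (hypothetical numbers).
labels = (["lighter_male"] * 700 + ["lighter_female"] * 150 +
          ["darker_male"] * 100 + ["darker_female"] * 50)

counts = Counter(labels)
total = sum(counts.values())
for group, n in counts.items():
    print(f"{group}: {n} images ({n / total:.0%})")
# A skew like this one (70% lighter-skinned men) is a warning that accuracy
# will likely be worse for the under-represented groups.
```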
Measuring life through algorithms means that predictions, classifications and decisions about people can be made on the basis of algorithmic models built from large sets of historical data.
The risk of unintended misuse of the dataset increases when the developers are not experts, either in Machine Learning or in the domain where Machine Learning will be used.
This concern is particularly important due to the increased prevalence of tools that “democratize AI” by providing easy access to datasets and models for general use.
And, for this very reason, it is SO IMPORTANT that organizations using predictive algorithms document the origin, creation and use of datasets as a first step to avoid discriminatory results.
But, despite the importance of data for Machine Learning, there is no standardized process for documenting machine learning datasets. In fact, it is a process that is rarely talked about.
In addition, ranking and scoring algorithms also pose challenges in terms of their complexity, opacity and sensitivity to data influence.
End users and even model developers face difficulties in interpreting an algorithm and its ranking results, and this difficulty is further compounded when the model and the data it is trained on are proprietary or confidential, as is often the case.
But how does the decision-making process of an algorithm actually work? It is explained very well, with the following diagram, in the paper “The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards”:

It all starts with a question or a goal. A set of data, labeled PAST ANSWERS, is selected to answer the guiding question.
These data are the ones used to train the algorithm so that it can answer the question asked. That question, or objective, can range from identifying people through facial recognition technology, to evaluating a candidate's personality in a test or a 30-second video, to scoring the default risk of a group of individuals living in a certain zip code, to predicting the risk that certain offenders will commit a crime, when a user is going to die, whether they are at risk of depression or suicide, or which movies they want to watch or songs they want to listen to…
In this way, PAST ANSWERS are used to PREDICT THE ANSWERS OF THE FUTURE. This is particularly problematic when the outcomes of past events are contaminated with (often unintentional) biases, and when we add to this the dubious ability of algorithms to predict behaviour and events.
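A hedged sketch of that loop, with hypothetical placeholder data and model, just to make the “past answers train the model that predicts future answers” step concrete:

```python
# Minimal sketch: "past answers" (historical, labelled records) train a model
# that then produces "future answers". Data and model are hypothetical.
from sklearn.linear_model import LogisticRegression

# Historical cases: features plus the decision/outcome recorded at the time.
# Any bias baked into these labels is learned and reproduced by the model.
past_features = [[25, 0], [42, 1], [31, 0], [55, 1]]   # e.g. age, some attribute
past_answers  = [0, 1, 0, 1]                           # e.g. past decisions

model = LogisticRegression().fit(past_features, past_answers)

# A "future answer": the patterns of the past applied to a new case.
print(model.predict([[37, 1]]))
```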
Models often come under scrutiny (i.e., they are reviewed), but only after they are built, trained and deployed. If a model is found to keep repeating a bias, for example over-indexing on a particular race or gender, the data specialist returns to the development stage to identify and address the problem.
PROBLEM: this feedback loop is costly and does not always mitigate the damage. The time and energy involved in this scrutiny are extremely costly and, if the algorithm is already in use, it may already have caused the very harm that was meant to be avoided: discrimination by sex, race, age, sexual orientation, etc.
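A hedged sketch of what that post-deployment scrutiny can look like, with hypothetical predictions and a hypothetical tolerance threshold:

```python
# Minimal sketch (hypothetical data): comparing the rate of favourable
# predictions across groups and flagging the model for rework if the gap
# is too large. The 0.2 threshold is illustrative, not a standard.
import pandas as pd

predictions = pd.DataFrame({
    "group":      ["A", "A", "A", "B", "B", "B"],
    "favourable": [1, 1, 0, 0, 0, 1],   # 1 = favourable decision for the person
})

rates = predictions.groupby("group")["favourable"].mean()
gap = rates.max() - rates.min()
print(rates, f"gap = {gap:.2f}", sep="\n")

if gap > 0.2:
    print("Disparity detected: send the model back to development for review.")
```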
Another big problem that affects data quality is the origin of the data. The paper “Data quality and artificial intelligence – mitigating bias and error to protect fundamental rights” by the European Union Agency for Fundamental Rights (FRA) presents comparative data on the use of Internet data by businesses and highlights, at a general level, the bias present in Internet data across the EU.
Internet data can only reflect a subset of the whole population, which is related to the limited access to the Internet and the different levels of participation in online services such as social networks.
Many organizations use data from the Internet, such as insurance companies using data from social networks to create a risk scoring system of potential customers, or the development of facial recognition algorithms based on Internet images.
Among companies using Big Data, the most important source is location data from devices, that is, information about where people are and how they move, measured mainly through smartphones.
The same paper states that almost every second company using Big Data makes use of such data (49%). Similarly, 45% of companies using Big Data draw on data from social networks.
Other data sources include companies' smart devices or sensors, which are used by 33% of all companies performing Big Data analysis.
This shows us that mobile and social network data are important sources for Big Data analysis, and they can potentially be used to develop Machine Learning algorithms and to support business decisions. In areas such as insurance and recruitment, for example, these unconventional data sources are increasingly used.
The use of Internet data raises many questions regarding who is included in the data and the extent to which the information included is appropriate for its purpose.
The application of data protection law to the issue of data quality in building AI technologies and algorithms is not clear.
Data protection law provides minimal guidance on the issue: the Data Accuracy Principle in the General Data Protection Regulation (GDPR) is related to data quality, but only in a very limited sense, as it focuses mainly on the obligation to keep personal data accurate and up to date.
At this point, what solutions could be provided? As I said in the article, there is no legal solution, but there are ethical solutions such as:
- Fundamental Rights Impact Assessment to ensure that technologies are applied in a way that complies with fundamental rights, regardless of the context in which they are used. Such an assessment must evaluate ALL the rights affected.
- Data Protection and Ethics Impact Assessment.
- As a way to mitigate the opacity of proprietary algorithms, organizations should show where the data they have used to train their algorithm comes from and how suitable it is for their purpose.
- Technical solutions, such as those developed in the papers “Datasheets for Datasets” and “The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards” (a minimal sketch of this kind of documentation follows below).
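As a hedged illustration of what that documentation could look like, here is a minimal, hypothetical sketch of a datasheet-style record; the field names are illustrative and are not the schema defined by either paper:

```python
# Minimal, hypothetical sketch of dataset documentation in the spirit of
# "Datasheets for Datasets" and "The Dataset Nutrition Label".
# Field names and values are illustrative only.
from dataclasses import dataclass, field

@dataclass
class DatasetDatasheet:
    name: str
    motivation: str             # why and by whom the dataset was created
    composition: str            # what the instances are and whom they represent
    collection_process: str     # how, when and from which sources it was gathered
    known_limitations: str      # skews, gaps, under-represented groups
    intended_uses: list = field(default_factory=list)
    prohibited_uses: list = field(default_factory=list)

datasheet = DatasetDatasheet(
    name="face-images-demo",
    motivation="Internal access-control prototype (hypothetical).",
    composition="100,000 face images scraped from public web pages.",
    collection_process="Web scraping, 2019-2020; no consent records kept.",
    known_limitations="Heavily skewed towards lighter-skinned men.",
    intended_uses=["research on data documentation"],
    prohibited_uses=["law enforcement", "hiring decisions"],
)
print(datasheet.name, "-", datasheet.known_limitations)
```

Documenting origin, composition and known limitations in this way is exactly the “first step to avoid discriminatory results” mentioned above.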
And I also recommend reading “The Big Data Challenge for Public Statistics” (Spanish) by Alberto González Yanes.
As always, thank you for reading.