What are the main legal and ethical issues in Machine Learning?
- Invasion of the privacy of individuals.
- Lack of transparency in automated decision making.
- Profiling, its lack of regulation and the resulting discriminations and biases.
- Threats to our autonomy, dignity, and freedom.
How can we better understand the BIG PICTURE? By explaining how and why Big Data Analytics was created.
Why are we currently in a situation where privacy and lack of transparency have become central legal issues?
Obviously, it is due to rapid technological development, but perhaps it is useful for our discussion on transparency, privacy and personal profiling to dig a little deeper. Thus, by understanding a little more about how technology has radically changed our world in recent years, we can find the best legal and ethical solutions.
To do this, we need to talk about certain milestones that have marked the history of technology:
1st Milestone: Moore’s Law:
“The number of transistors on each computer chip doubles every 18 months.” What does this mean in practical terms? That every 18 months you could buy double the computing power and storage for the same amount of money.
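The compounding effect of that statement can be sketched numerically. This is only an illustration of the doubling rule quoted above; the function and parameter names are our own:

```python
# Illustrative sketch of Moore's Law as stated above: capacity doubles
# every 18 months, so after n months you get 2**(n/18) times the
# starting capacity for the same money.
def capacity_multiplier(months, doubling_period=18):
    """Relative capacity after `months`, doubling every `doubling_period` months."""
    return 2 ** (months / doubling_period)

print(capacity_multiplier(18))  # 2.0 (double after one period)
print(capacity_multiplier(36))  # 4.0 (quadruple after three years)
```

Over a decade (120 months) this gives roughly a hundredfold increase, which is why the effect feels so dramatic.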
2nd milestone: Beginning of the cognitive era of technology.
The era of Big Data cognitive computing began in 2007, when it became widely possible to “upload data to the cloud”, because effective shared-memory software was available that let thousands of computers function as one.
How did this happen?
3 VERY important innovations:
- In 2003, Google published a paper describing a foundational innovation called the Google File System (GFS). This software allowed Google to access and manage large amounts of data across thousands of computers. At that time, Google’s main goal was to organize all the world’s information through its search engine.
- However, Google could not do so without its second foundational innovation, MapReduce, published in 2004. Together, these two innovations allowed Google to process and explore huge amounts of data in a manageable way.
- Google shared these two innovations with the open source community so that it could benefit from their ideas. Even better, the community improved on the software and, as a result, Hadoop was created in 2006. Hadoop is open source software that allows hundreds of thousands of computers to function as one giant computer.
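The MapReduce idea mentioned above can be shown in miniature. This is a toy, single-machine word count, not Google’s or Hadoop’s actual implementation; the function names are our own:

```python
from collections import defaultdict
from itertools import chain

# Map step: each document emits (word, 1) pairs.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

# Shuffle step: group the intermediate pairs by key (word).
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce step: sum the counts for each word.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data big", "data everywhere"]
result = reduce_phase(shuffle(chain.from_iterable(map_phase(d) for d in docs)))
print(result)  # {'big': 2, 'data': 2, 'everywhere': 1}
```

The power of the real systems comes from running the map and reduce steps in parallel across thousands of machines, which this sketch only hints at.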
3rd milestone: Big Data available to everyone.
What were the consequences of the creation of Hadoop?
Big Data became available to everyone. Why? Because Hadoop made large-scale storage and computing capacity easily accessible.
Thanks to Hadoop, Internet platforms were able to store all their data on many computers while still having access to their data.
For example, Facebook, LinkedIn, and Twitter already existed in 2006, and they started building on Hadoop immediately. This is why these internet platforms went global in 2007: they could see, store, and analyze every click of every user on every website.
This gave them a better understanding of what users were doing. That’s how Big Data Analytics was born.
Thanks to Hadoop, other companies were born around 2007, including Airbnb. Amazon also launched the Kindle, and the first iPhone was released.
According to AT&T, mobile data traffic on its national wireless network increased by more than 100,000 % between January 2007 and December 2014.
What were some other consequences?
As a very illustrative example, in 2007, the cost of DNA sequencing began to fall dramatically as the biotech industry shifted to new sequencing techniques and platforms, taking advantage of all the computing and storage power it was exploiting.
This change of tools was a turning point for genetic engineering and led to the rapid evolution of DNA sequencing technologies that we have seen in recent years.
As these two graphs indicate, the year 2007 was clearly a turning point.
To put this in perspective: in 2001, it cost $100 million to sequence the genome of a single person.
On September 30, 2015, Popular Science reported: “Yesterday, the personal genetics company Veritas Genetics announced that it had reached a milestone: participants in its limited but ever-expanding Personal Genetics Program can sequence their entire genome for only $1,000.”
Watson was also created in 2007: a special-purpose computer system designed to advance deep question answering, deep analysis, and natural language understanding.
Watson became the first cognitive computer, combining machine learning and artificial intelligence.
Today, Watson is busy ingesting all the known medical research on topics such as cancer diagnosis and treatments.
But Watson is not just a big search engine or a digital assistant. Nor can we limit its definition by saying that it is just a big computer programmed by software engineers to perform tasks they design.
Watson represents nothing less than the “cognitive age of computing”. What made Watson so fast and accurate was not that it was actually “learning” per se, but its ability to improve itself by using all its Big Data and networking capabilities to make ever faster statistical correlations over ever more raw material.
What were the consequences of this rapid growth?
Physical technologies, understood as new technologies, are moving in a totally different direction, and at a totally different speed, than social technologies, understood as our institutions, governments, culture, laws, education, etc.
How does the human species adapt to these technological changes?
Eric “Astro” Teller drew this graph in a conversation with Thomas Friedman, who reproduced it in his book “Thank You for Being Late”. It shows two curves: one represents scientific and technological progress, and the other represents humanity’s ability to adapt to these changes.
A thousand years ago, scientific and technological progress increased so gradually that it could take the world 100 years to undergo a dramatic change. For example, the longbow, as a weapon, took centuries to move from development to military use in Europe in the late 13th century.
In 1900, it took 20 to 30 years for the technology to take a big enough step to make the world comfortably different. For example, the introduction of the automobile or the airplane.
Then the slope of the curve began to go almost directly up and off the chart with the convergence of mobile devices, broadband connectivity and the cloud.
These great innovations quickly spread to millions of people around the globe, enabling them to drive change that went much further, faster and cheaper.
Today, the time frame for technological and scientific innovation has become very short, around 5 to 7 years, and the big problem is that it is outpacing the time humanity needs to adapt to these great changes.
1000 years ago it took 3 or 4 generations to adapt to something new. In the year 1900, the adaptation time was reduced to 1 generation. And now, that time frame is 9 to 15 years to get used to something new.
The black dot in the graph, indicated by the red arrow, illustrates that the rate of technological and scientific change is now faster than the average rate at which most people can absorb all these changes.
This has many negative consequences in our society that we are already experiencing.
Laws and public administrations are struggling to keep up. Technology companies do not want to abide by obsolete rules or, worse, there are no laws regulating their technology and they operate at will (this is what is happening right now). Meanwhile, citizens do not know what consequences all this progress, mismanaged by our institutions, is having on their personal lives.
If it now takes us 8 to 10 years to understand a new technology and develop new laws and regulations to safeguard society, how can we regulate a technology that comes and goes, or mutates, in 5 to 7 years?
This is a big problem and one of the big challenges is HOW WE EDUCATE OUR PEOPLE.
What is the solution to this complex situation?
We must rethink our social tools and institutions so that they allow us to keep pace. If we could improve our ability to adapt, just a little, it would make a significant difference.
Eric “Astro” Teller drew a second graph showing what he saw as the solution to this rapid growth.
The dotted line simulates our faster learning as well as more intelligent governance and therefore crosses the line of technology/science change at a higher point.
Government must be as innovative as innovators, and this is done by forming multidisciplinary teams.
The time for static stability is over. The new type of stability is dynamic stability, and this new dynamic stability began in 2007.
Where are we now?
Since 2007, the internet platforms that have become technology giants, such as Facebook, LinkedIn, Airbnb, Amazon, and Twitter, have been able, through Big Data analysis, to store all our data in one place and thus gain knowledge of the market far greater than that of traditional companies.
The main consequence for users was, on the one hand, the benefit of a series of new services but, on the other hand, a TOTAL LOSS OF CONTROL OF PERSONAL DATA.
And what has become dangerous and discriminatory in our society is the possibility of building profiles and drawing inferences through automated decisions, thanks to Big Data Analytics and Machine Learning algorithms.
But, how does automated decision making through Machine Learning and Artificial Intelligence algorithms work?
There has been a proliferation of the sources this data comes from:
- Advances in genomics have helped generate enormous amounts of data (only partially used).
- Data collected through applications, wearables, computers, mobile devices… used in all contexts (health, insurance, banks, governments, human resources, the mass market). The most dangerous kind is observed data, collected from users involuntarily: misuse of the mobile camera to capture biometric data of our face without explicit consent, fingerprints, Wi-Fi connection tracking, location data, our tweets, our “likes” on Facebook, our search history on Google or other search engines, which pages we visited and where we clicked…
We have no control over our personal data. This personal data, along with all other data, is collected, stored, analyzed, and used to make predictions about our behavior and to make automated decisions about us.
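To make the mechanism concrete, here is a deliberately simplified, hypothetical sketch of how observed data can feed an automated decision. The feature names, weights, and threshold are invented for illustration; real systems learn their weights from training data rather than hard-coding them:

```python
# Hypothetical scoring sketch: observed behavioural data is turned into a
# single score, and a fixed threshold makes the automated decision with no
# human review. All feature names and weights below are invented examples.
WEIGHTS = {"late_night_browsing": -0.5, "income_estimate": 0.8, "location_risk": -0.3}

def score(profile):
    """Weighted sum of whatever features were observed about the person."""
    return sum(WEIGHTS[f] * profile.get(f, 0.0) for f in WEIGHTS)

def automated_decision(profile, threshold=0.5):
    """Approve only if the score clears the threshold."""
    return "approve" if score(profile) >= threshold else "reject"

applicant = {"late_night_browsing": 1.0, "income_estimate": 1.0, "location_risk": 0.0}
print(automated_decision(applicant))  # score is about 0.3, so "reject"
```

Notice that the person never sees which features were collected, how they were weighted, or why the threshold sits where it does; that opacity is exactly the transparency problem discussed here.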
This violates one of the most important principles in the GDPR, the Data Minimization Principle, which states that only the data necessary to fulfil the purpose should be collected. This leads us to the Purpose Limitation Principle: data collected can only be used for the purpose that has been transparently communicated to users, precisely to prevent other uses that are INCOMPATIBLE with the initial one. That initial purpose should guide all phases of the data flow. THIS IS ALMOST NEVER FULFILLED.
The Transparency Principle obliges organizations to notify us of the entire process to which our data will be submitted, and also grants us the right to access our information. Unfortunately, other rules allow companies to be opaque, such as the Trade Secrets Directive and intellectual property rights, and we can only access our data ex ante, that is, before an automated decision or profile is made.
The current situation shows us that we have no control over our personal data.
The next point (3) concerns database security. The GDPR requires data to be pseudonymized, but privacy problems remain because profiles can still be built.
The most appropriate measure is to anonymize the databases. There are many anonymization techniques, and three criteria must be met:
- No possibility of making inferences.
- No possibility of linking and relating different databases.
- No possibility of singling out an individual.
But total anonymization does not exist, and the biggest problem is re-identification, in which anonymous data is linked to publicly available information in order to identify the individual to whom data belongs.
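The singling-out criterion can be checked mechanically. The sketch below is a minimal, illustrative k-anonymity test (the column names and records are hypothetical): a table resists singling out only if every combination of quasi-identifiers is shared by at least k records.

```python
from collections import Counter

# Illustrative k-anonymity check: count how often each combination of
# quasi-identifiers (e.g. zip code + age band) appears. If any combination
# is rarer than k, that record can be singled out and re-identified.
def is_k_anonymous(rows, quasi_identifiers, k):
    combos = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in combos.values())

records = [
    {"zip": "080**", "age_band": "30-39", "diagnosis": "A"},
    {"zip": "080**", "age_band": "30-39", "diagnosis": "B"},
    {"zip": "170**", "age_band": "40-49", "diagnosis": "C"},
]
print(is_k_anonymous(records, ["zip", "age_band"], k=2))  # False: the last record is unique
```

Even a table that passes this check can be re-identified by linking it to an outside dataset, which is why total anonymization does not exist in practice.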
The next point (4) is automated decision-making and how the system is trained to make these decisions.
Profiling and automated decision-making (5) is where predictive models of the behavior of individuals or groups are built, and where decisions can become discriminatory. Today, we cannot access these models.
That is why we have serious problems with transparency. Organizations should let us know:
- What data they have used to make the decision.
- What implications it has for the personal life of the individual or group of individuals.
- Whom they have shared the data with.
There is no answer from organizations to these questions.
Let’s take a look at examples of real cases:
1. Wearables and health apps
- They turn the human body into the subject of their scientific research and use our data for their own benefit.
- They collect SPECIAL CATEGORY personal data (super-protected by data protection law). They need EXPLICIT CONSENT from the user. NOT SOLVED.
- There is no clear, well-defined definition of the PURPOSE (what do you want my data for?).
- They make INFERENCES, BUILD PROFILES AND EXTRACT PATTERNS, but they do not describe what inferences they make, nor their implications for the real lives of their users, nor how we can access that information in order to correct it if it is wrong, or refuse its use. OPACITY.
- They transfer our data to third parties that PROCESS the information on their behalf. They do not name these organizations, they do not ask for explicit consent from the user, and they do not describe what these third parties will do with the users’ sensitive personal data.
- Interested parties: Insurance companies, Banks, Governments and Social Security Systems, among others.
2. Watson for Oncology
- DISCRIMINATION AND METHODOLOGICAL BIASES.
- Discrimination: a patient’s capacity to pay for a specific treatment, or their insurance status, may affect clinical recommendations.
- Methodological bias: biases in the training phase, in the data used to train the system, and in the choice of treatment protocols to be implemented.
- Discrimination against patients who are underrepresented in the data, which leads to erroneous conclusions if they are not sufficiently taken into account.
- EXPLAINABILITY: the system should explain why and how it scores treatments when recommending them. This is the “black box” problem.
- Ethical risks such as the patient’s PRIVACY, informed and EXPLICIT CONSENT, and the patient’s AUTONOMY, understood as the freedom to make a decision.
- RESPONSIBILITY in case of medical error (methodological bias). If an artificial intelligence system fails in its diagnosis, to whom is responsibility attributed? To the developer of the AI system? To the AI supplier? To the doctor who made the decision?
What are the possible solutions in this scenario?
THE SOLUTION does not yet exist, but we can make ethical decisions that will help advance transparency and ethics.
Three major actions can be taken:
- Create an ethical framework setting out the principles that should govern the technologies applied in the organization.
- A code of conduct for handling such data, which must be passed on to everyone who handles data.
- Data governance: management, strategy, analysis, business decisions…