Transparency on AI Provenance

Melodena Stephens
Jun 12
6 min read

One of the things we do not have enough discussions about is AI provenance.

Provenance is the place of origin.

The current narratives on AI origins are limiting, as AI is not a single product but a complex, globally aggregated and distributed technology, which suggests we need distributed governance. This is clear from the recent White House deal with China on rare earth metals, signed on 12 June 2025, which also shows how countries with critical minerals, talent and specialized manufacturing may have a national competitive edge. Countries may also have an edge if they have data (and know how to value it), funding abilities and raw manpower that can feed or hone AI.

In this article I will briefly go into the AI provenance of four things: motives, data, algorithm(s) and hardware. There are more, but this is a good starting point.

Motives

When building AI we tend to focus on lofty things like AI purpose - “save the world”, “better healthcare”, “democratizing AI”, “peace through security” etc. But purpose and motives are two different things.

Simply put – motive is the "why" behind the action, while the purpose is the "what". Motives are colored by emotions (which are not rational but an important part of being human).

Someone who immigrated from Nigeria fleeing violence and corruption may view security differently than, say, someone who works in the humanitarian sector with refugees of violence. Both may focus on security and safety, but the motives will color purpose differently. Take the example of NarxCare, which Wired did a great exposé on. The founders of NarxCare first created an AI in response to the trauma of their friend being killed by a released convict. This motive found its way into the healthcare sector to reduce opioid abuse. Does it matter? In the healthcare sector, doctors take the Hippocratic Oath, focusing on doing good for patients. If that were the motive, then even if you flagged a patient for opioid abuse, the focus would be on getting the patient help for substance abuse and changing the systems (insurance policies, more opioid clinics etc.), rather than reporting them for possible prosecution. The provenance of motives matters, especially as AI moves across systems. Unfortunately, we rarely focus on this topic, leading to dangerous and maleficent results.

Data

Data provenance matters because where data comes from can introduce bias. We need to ask: (1) is the data historically relevant? (2) is the data accurately measuring what we want to measure? (3) is the data contextually and culturally relevant?

Who or what collected the data, and how and when it was collected, matters.
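To make these questions concrete, here is a minimal sketch of what a provenance record attached to a dataset could look like. The class name, the field names and the sample values are all assumptions made up for illustration; this is not an existing metadata standard. The helper simply lists which provenance questions were left unanswered.

```python
from dataclasses import dataclass, fields
from datetime import date
from typing import List, Optional


@dataclass
class DataProvenance:
    """Hypothetical provenance record; field names are illustrative only."""
    collected_by: Optional[str] = None       # who or what gathered the data
    collection_method: Optional[str] = None  # how: survey, sensor, interview, scrape
    collected_on: Optional[date] = None      # when the data was gathered
    population: Optional[str] = None         # cultural and contextual scope
    instrument: Optional[str] = None         # tool used to measure the concept


def missing_provenance(record: DataProvenance) -> List[str]:
    """Return the names of the provenance fields that have no answer."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Made-up example: we know who collected the data and how, but nothing else.
record = DataProvenance(
    collected_by="field researcher",
    collection_method="household survey",
)
print(missing_provenance(record))
```

A record like this does not remove bias, but it makes the gaps visible before the data is reused somewhere else.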

I am a qualitative researcher, so when people come up with measures for fuzzy human things like intelligence, happiness or other vague concepts, I like to push back. How were these defined and measured? (Here the data on the tool used to measure them matters.) Often the source is a research paper, taken out of context. For example, a recent post on the use of synthetic data for market research cited one paper that used a simulation of 25 AI agents and another paper that simulated 1000 people [accuracy 85%, using data from a two-hour audio interview with an AI interviewer following an interview protocol from the American Voices Project]. They used measures like the Big Five Personality Inventory and the General Social Survey, for example. Is this a good measure?

Can this study's findings (data) be extended to other groups and project areas? Data provenance matters: in this case we may need more replication studies, but these are difficult to publish because they are not seen as novel, so they may end up not being done and the foundational study being used instead. Both studies are on arXiv, not in a journal yet.

Early in the Covid pandemic we used very sophisticated AI models to track the progression of the disease, and they were wrong because they were based on the Spanish flu (just because we have data does not mean it is correct). We were unable to predict the way the virus would morph, and the Delta variant spread rampantly across India. Predictive analytics, which we use AI for, is often based on past data or current big data (so we need to be careful about the model being contaminated with data that is not relevant). Do we ask enough questions about data? While there is some work on what is called metadata, it is not enough, especially as we open-source databases.

Let me give another example. In pharmaceutical research, data provenance in clinical trials is used to track origins, modifications and responsible individuals. It protects the integrity of clinical trials. Until recently, the purpose of human clinical trials was to see how the human body reacts to new drug and therapy interventions and to prevent unintended effects on the participant and the next generation (so there is a reason why we look at these trials over time). India is a popular place for clinical trials (think data provenance), partly because for a long time there were many marginalized people who were unable to understand their rights when participating in these trials or to report side-effects. Would this affect the data? India ended up changing its policy with respect to clinical trials, but that is another story.
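As a rough illustration of what tracking origins, modifications and responsible individuals can mean in practice, here is a minimal sketch of a tamper-evident change log. It is an assumption-laden toy, not how any real clinical trial system works: the function names, the record fields and the sample entries are all hypothetical. Each entry stores who did what, plus a hash that chains it to the previous entry, so a later silent edit becomes detectable.

```python
import hashlib
import json

FIELDS = ("actor", "action", "payload", "prev")


def _digest(body):
    """Hash the entry body in a stable (sorted-key) JSON form."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(trail, actor, action, payload):
    """Append a record of who did what, chained to the previous entry's hash."""
    prev_hash = trail[-1]["hash"] if trail else ""
    body = {"actor": actor, "action": action, "payload": payload, "prev": prev_hash}
    trail.append({**body, "hash": _digest(body)})
    return trail


def is_intact(trail):
    """Recompute every hash; return False if any entry was altered or reordered."""
    prev_hash = ""
    for entry in trail:
        body = {key: entry[key] for key in FIELDS}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True


# Made-up entries: an origin record followed by a modification.
trail = []
append_entry(trail, "site_nurse", "recorded", {"participant": "P-001", "dose_mg": 10})
append_entry(trail, "data_manager", "corrected", {"participant": "P-001", "dose_mg": 20})
print(is_intact(trail))
```

If someone later changes a stored value without adding a new entry, `is_intact` returns False, which is the whole point of keeping the trail.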

Where we use data, we also need discussions on what we are missing. Why do you think a bot which can answer questions on health and maths is more intelligent than a human? What are we missing (intuition? experience?)?

Algorithms

These are instructions for the AI (created by humans, by the AI, or by a combination of both) to solve a problem or to reach an objective (here we see how motives may affect the algorithm).

There are many provenance issues when looking at algorithms; here are a few (please feel free to add more).

1. What is the statistical model (or the maths) behind the model? Very often when you are using these models there is an error rate (false positives and false negatives): Who decides the acceptable rates? Each statistical model also has limitations. So you want some transparency on the model provenance.

2. Weights: as with any model or equation, you assign weights to model parameters and decide their importance in the equation. For example, in healthcare, are the model weights determined by productivity, or profits, or patient insurance liability? Each one of these, with a change in model weights, will lead to a different outcome. Who decides? Are we inferring this because it is a black box? Think of a private sector healthcare AI model whose focus is profits being introduced into a public healthcare system where the focus should not be profits. Model weights may be open source but are often proprietary. If you are paying, my guess is you have a right to know, especially if you are using it for a different purpose (facial recognition for security versus for employee verification).

3. The code itself. More and more we are using AI to code. So an amateur can use AI to code via a database and create a new algorithm. Is this a good thing? See my blog on Builder AI. The provenance of code is important for various reasons like security. It will help you understand how it “fits” into the larger ecosystem of code. We seem to be spending a lot of money on “debugging” and “cybersecurity” looking at code vulnerabilities and this is an outcome of algorithm provenance.
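Points 1 and 2 above can be made concrete with a small sketch. Everything in it is made up for illustration: the scores, the labels, the patient names, the feature names and the weights are assumptions, and the scoring function is a plain weighted sum, not any vendor's actual model. The first part shows that the same model outputs give different false positive and false negative rates depending on where someone sets the threshold. The second part shows that the same patients get ranked differently when the weights change.

```python
# Point 1: error rates depend on a threshold that somebody has to choose.
def error_rates(scores, labels, threshold):
    """Return (false positive rate, false negative rate) at a threshold."""
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return false_pos / labels.count(0), false_neg / labels.count(1)


scores = [0.1, 0.4, 0.6, 0.8, 0.3, 0.7]  # made-up model outputs
labels = [0, 0, 0, 1, 1, 1]              # made-up ground truth

for threshold in (0.2, 0.75):
    fpr, fnr = error_rates(scores, labels, threshold)
    print(f"threshold={threshold}: false positives {fpr:.2f}, false negatives {fnr:.2f}")


# Point 2: the same inputs, scored with different weights, give a different outcome.
def score(features, weights):
    """Weighted sum: each weight says how much a feature counts."""
    return sum(weights[name] * value for name, value in features.items())


def top_priority(patients, weights):
    """Return the patient ranked first under the given weights."""
    return max(patients, key=lambda name: score(patients[name], weights))


patients = {
    "patient_a": {"clinical_need": 0.9, "expected_revenue": 0.2},
    "patient_b": {"clinical_need": 0.4, "expected_revenue": 0.9},
}
care_first = {"clinical_need": 0.8, "expected_revenue": 0.2}
profit_first = {"clinical_need": 0.2, "expected_revenue": 0.8}

print(top_priority(patients, care_first))
print(top_priority(patients, profit_first))
```

With these toy numbers, the low threshold trades false negatives for false positives and the high threshold does the opposite, while the top-ranked patient flips when the weights move from care to profit. Neither choice is visible to the end user unless the provenance of the model is disclosed.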

Hardware

Here I refer not just to AI hardware but also to Digital Public Infrastructure hardware. It includes the sensors, robot actuators, cameras etc. that are needed to help AI “sense”.

Hardware provenance is an area of geopolitical tension. The issue of where your hardware is designed and manufactured or assembled is one fraught with much anxiety. Did you know that Apple has close to 100 direct suppliers and that they come from ~19 countries: China, Czech Republic, France, Germany, Ireland, Israel, India, Italy, Japan, Malaysia, Mexico, Netherlands, Philippines, South Korea, Singapore, Taiwan, Thailand, Vietnam, and the USA? However, if you take the components and critical minerals in the supply chain, the components of a silicon chip travel 40,000 km and cross 70+ international borders. Does the provenance of the hardware matter?

Yes, from an economic point of view, as supply chain fragmentation and choke points show us. It also matters from a security perspective. Take the example of the backdoor that lets hardware update, maintain or repair itself. While you may manage to print a chip in one country, other hardware (the gyroscope for the VR glasses, the modem for 5G access) may come from other countries, and these could all be security vulnerabilities. Provenance of hardware is currently captured through trade agreements or trade restrictions.

Are there other issues I have not mentioned? Yes, like the provenance of funding. Nothing is free, so most likely there is a trade, for example for access to data, or for power or profits. How about the provenance of the humans used to train models, and their expertise and cultural background? We don’t speak much about this either.

You will find more articles like this at: www.melodena.com
