When AI Gets It Wrong: Bias, Hallucination, and Why It Matters
Last time around we established that generative AI learned from us. From our words, our images, our published knowledge, and our publicly shared lives. That is an extraordinary foundation to build on. It is also, as we are about to explore, where some of the most serious problems begin.
There is a common assumption that AI systems are neutral. People assume that because a machine is doing the processing, the output must be objective. This assumption is wrong, and it is worth understanding exactly why. Bias affects AI broadly, traditional systems and generative ones alike, because both kinds learn from data shaped by human history. Hallucination is different and belongs specifically to generative AI. A facial recognition system or a credit scoring tool can get a classification wrong, but it does not invent a fluent, detailed, entirely fictional answer the way a large language model can. These are two different problems sitting inside two different kinds of technology.
Two Different Problems, Two Different Technologies
As we discussed in the first piece, the term AI covers a much wider range of technology than most people realise. The most familiar kind, Large Language Models (or chatbots, as they are colloquially known), is a type of generative AI. But AI also covers traditional systems such as facial recognition, credit scoring models, recommendation engines and most of the classification tools that have been making decisions about people for the best part of two decades. These systems are not generative. They are not writing sentences or generating images from scratch. They are sorting, scoring and predicting based on patterns learned from data, and they can absolutely be biased, in the sense that they perform less accurately for some groups of people than others.
Generative AI is a different animal. Large language models (LLMs) and image generators produce new content by predicting what is statistically most likely to come next: word by word for text and, loosely speaking, pixel by pixel for images. This is the technology that learned from the entire corpus of human writing and imagery we discussed last time, and it is also the technology that introduces a second, distinct failure mode that traditional AI does not really have in the same way: hallucination.
Bias is fundamentally a data and representation problem, and it shows up in both traditional and generative AI. Hallucination is fundamentally an architecture problem, specific to generative systems, and it cannot be solved simply by feeding the model better or more diverse data. We will come back to why that distinction matters later.
Bias: The Problem That Starts Before the Algorithm
AI outputs affect who gets a job interview, who is approved for a loan, who gets flagged by a security system, and whose medical symptoms get taken seriously. AI systems, whether traditional or generative, learn from our data. If that data reflects a world shaped by centuries of inequality, exclusion, and discrimination, the AI learns that world, and that is the baseline from which the outputs are generated.
Consider hiring. Researchers have documented multiple cases of AI recruitment tools systematically downranking candidates from certain backgrounds, not because a programmer wrote a rule saying to do so, but because the system learned from historical hiring decisions made by humans who, consciously or not, showed bias towards certain profiles. Amazon famously scrapped an AI recruiting tool in 2018 after discovering it had taught itself to penalise CVs that included the word "women's", as in women's college or sports team, because the majority of successful candidates in its training data were men.
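To make the mechanism concrete, here is a minimal Python sketch of how a system can pick up a penalty from historical decisions without anyone writing a rule. Everything in it is an assumption for illustration: the toy `history` records are invented, the function names `learn_word_weights` and `score` are hypothetical, and the crude word-weighting scheme is not a description of how Amazon's tool actually worked.

```python
import math
from collections import Counter

# Invented toy history: (words on a CV, was the candidate hired?).
# Illustrative only; this is not real hiring data or any real system's method.
history = [
    ({"python", "chess", "club"}, True),
    ({"java", "rugby", "team"}, True),
    ({"python", "rugby", "team"}, True),
    ({"python", "women's", "chess", "club"}, False),
    ({"java", "women's", "rugby", "team"}, False),
    ({"java", "chess", "club"}, True),
]


def learn_word_weights(rows):
    """Weight each word by how often it appeared on hired vs rejected CVs."""
    hired = Counter()
    rejected = Counter()
    for words, was_hired in rows:
        (hired if was_hired else rejected).update(words)
    vocabulary = set(hired) | set(rejected)
    # Add-one smoothing avoids dividing by zero or taking log(0).
    return {
        word: math.log((hired[word] + 1) / (rejected[word] + 1))
        for word in vocabulary
    }


def score(cv_words, weights):
    """Sum the learned weights of the words on a CV; unknown words count as 0."""
    return sum(weights.get(word, 0.0) for word in cv_words)


if __name__ == "__main__":
    weights = learn_word_weights(history)
    cv = {"python", "chess", "club"}
    print("without the word:", score(cv, weights))
    print("with the word:   ", score(cv | {"women's"}, weights))
```

Two otherwise identical CVs get different scores purely because of one word, and the only place that penalty came from is the pattern in the past decisions the system was given.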
Facial recognition is where this problem has been most rigorously exposed, and the research that exposed it changed the industry.
In 2018, Joy Buolamwini, then a graduate researcher at MIT Media Lab, and Timnit Gebru, then at Microsoft Research, published Gender Shades, a study testing the facial analysis systems sold by IBM, Microsoft and Face++. Across 1,270 unique faces, they found severe gender and skin-type bias in gender classification. The worst failure rate, on darker female faces, ran above one in three, on a binary task where random guessing would be right about half the time. IBM's error rate on darker-skinned women's faces came in at close to 35%, while all three companies' systems were around 99% accurate on lighter-skinned men.
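The arithmetic behind this kind of audit is simple: instead of reporting one overall accuracy figure, compute the error rate separately for each subgroup. The Python sketch below shows the idea. The records are invented for illustration and are not data from Gender Shades, and the function names `error_rates_by_group` and `overall_error_rate` are hypothetical.

```python
from collections import defaultdict

# Invented records: (subgroup, true_label, predicted_label).
# Illustrative only; these are NOT figures from the Gender Shades study.
records = [
    ("lighter_male", "male", "male"),
    ("lighter_male", "male", "male"),
    ("lighter_male", "male", "male"),
    ("lighter_male", "male", "male"),
    ("darker_female", "female", "female"),
    ("darker_female", "female", "male"),
    ("darker_female", "female", "female"),
    ("darker_female", "female", "male"),
]


def overall_error_rate(rows):
    """One aggregate figure, which can hide gaps between subgroups."""
    wrong = sum(1 for _, truth, predicted in rows if predicted != truth)
    return wrong / len(rows)


def error_rates_by_group(rows):
    """Return {subgroup: fraction of rows where the prediction was wrong}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, truth, predicted in rows:
        totals[group] += 1
        if predicted != truth:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


if __name__ == "__main__":
    print("overall:", overall_error_rate(records))  # 0.25
    for group, rate in error_rates_by_group(records).items():
        print(group, rate)  # lighter_male 0.0, then darker_female 0.5
```

In this toy example the headline figure looks tolerable at 25% error, while the breakdown shows that every single mistake falls on one group. That gap between the aggregate and the breakdown is what a disaggregated audit is designed to expose.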
A follow-up study by Inioluwa Deborah Raji and Buolamwini found that the companies named in the original Gender Shades audit substantially improved the accuracy of their gender classification systems in the year that followed, under direct public pressure to do so. IBM and Microsoft moved quickly to improve their algorithms in response to the paper, and that swift improvement became one of the clearest demonstrations of an individual researcher's influence on an entire industry. It was later reported that IBM's eventual decision to exit the facial recognition business altogether traced directly back to the questions Gender Shades had forced it to confront.
Buolamwini went on to found the Algorithmic Justice League, and has been clear that:
You have to be intentional about being inclusive, because those in power reflect the current inequities that we have.
Her term for the underlying problem, “the coded gaze”, describes what happens when the people building a system share a narrow set of assumptions about whose face, whose voice, and whose data counts as the default.
However, the issues that studies like these surfaced have not been universally accepted by major AI labs. While co-leading Google's Ethical AI team, Gebru co-authored a 2020 paper, On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?, with Emily M. Bender, Angelina McMillan-Major, and Margaret Mitchell. It highlighted the environmental and financial costs of large language models and their tendency to encode and amplify societal biases at scale. When Gebru refused to either retract the paper or remove her name from it, she was forced out of the company (although Google maintains that she resigned). Her departure triggered a significant internal walkout and a great deal of public scrutiny of how seriously the industry actually takes its own ethics research. Gebru went on to found the Distributed AI Research Institute, DAIR, built specifically to centre the needs of communities, with a particular focus on the African continent and the African diaspora, rather than the commercial priorities of the labs she had left.
More recently, Gebru has become one of the more prominent voices arguing against the industry's default assumption that bigger, more general-purpose models are automatically better. Her case, in short, is that a model trained to do one task well, with a clearly bounded purpose and a dataset you can actually account for, is easier to audit, easier to govern and very often more useful than a sprawling general model asked to do everything for everyone. It is a different kind of governance argument to the one she made in 2020, but it comes from the same place: systems are easier to hold accountable when you can actually see what they were built to do and what they were trained on.
Google has since partnered with Dr Ellis Monk, a sociologist at Harvard whose research focuses on the social and psychological consequences of skin tone, to develop the Monk Skin Tone Scale. It was built specifically to replace the older Fitzpatrick scale, which was designed for dermatology rather than image-based AI and badly under-represented the range of skin tones it was being used to categorise. The Monk Scale is now an open resource that other developers can use to test and train their own systems more fairly.
What ties Buolamwini, Gebru and Monk together is not just that they found problems. It is that the industry, under enough pressure, applied itself to try to find solutions it might otherwise not have considered. Less than 30% of the global AI workforce is female, and that figure drops further at senior levels. The teams building these systems are still not representative of the populations those systems affect, and that remains a governance failure. But the evidence from Gender Shades onward also tells us something genuinely encouraging: when people with the right expertise and the standing to be heard apply pressure, the industry can and does respond. Diversity is not just the right thing to aim for. It has a measurable track record of working.
Hallucination: A Different Kind of Wrong
The second thing generative AI gets wrong is different in character but equally significant in consequence. We touched on this at the end of the last piece, via the humble autocorrect fail. AI systems, including the most sophisticated generative ones, can produce outputs that are completely wrong, yet delivered with total confidence.
The technical term is hallucination. It means a generative system producing information that sounds entirely plausible and authoritative but is factually incorrect, sometimes spectacularly so. Lawyers have submitted AI-generated court documents citing cases that do not exist. Journalists have received AI-generated background briefings containing fabricated quotes attributed to real people. Medical information tools have produced treatment recommendations that were dangerously inaccurate. This is not a phenomenon traditional AI systems share. A facial recognition tool gets a classification wrong. It does not invent a confident, detailed, entirely fictional answer the way a large language model can.
This happens for the same reason as the autocorrect problem, just at far greater scale and sophistication. These systems are not checking facts. They are generating the most statistically probable response based on patterns in their training data. If the most probable-sounding answer happens to be wrong, the system has no internal alarm that triggers. It simply produces an incorrect output fluently, articulately and without hesitation.
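A toy example makes the point. The Python sketch below is a deliberately tiny next-word predictor, closer to autocorrect than to a real LLM, which conditions on far more context and would not make this particular mistake. The three-sentence `corpus` and the function names `train_bigrams` and `complete` are assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# A tiny invented training corpus. Every sentence in it is true.
corpus = [
    "the capital of france is paris",
    "the largest city of france is paris",
    "the capital of italy is rome",
]


def train_bigrams(sentences):
    """Count which word follows which: {word: Counter of next words}."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following


def complete(prompt, following, max_new_words=1):
    """Extend the prompt by repeatedly appending the most frequent next word.

    Nothing here checks whether the result is true; the only criterion
    is which word most often followed the previous one in training.
    """
    words = prompt.split()
    for _ in range(max_new_words):
        options = following.get(words[-1])
        if not options:
            break
        words.append(options.most_common(1)[0][0])
    return " ".join(words)


if __name__ == "__main__":
    model = train_bigrams(corpus)
    print(complete("the capital of italy is", model))
    # prints: the capital of italy is paris
```

The training data contains no false statement, yet the output is wrong, because "paris" is simply the word that most often followed "is". The sentence is grammatical, confident and incorrect, and nothing inside the model registers the difference.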
It is tempting to assume this is just another data problem, fixable in the same way bias is fixable: better data, more diverse data, more careful curation. I want to be honest that I do not think that holds up, and I have been persuaded out of an earlier version of my own thinking on this point by researchers who study the architecture directly rather than its outputs.
As Diverse AI’s founder, Toju Duke, stated:
The solution to AI hallucinations cannot be addressed by any existing AI governance framework or responsible AI process today. If there was a solution to it, it would have been addressed by now.
Instead, she points to the fundamental architecture of large language models as the source of the problem. By training on vast datasets using neural networks, these systems do not only predict the next word or match pattern sequences. They also fabricate and make up results in order to produce a convincing answer. Because the neural networks powering these models remain what the AI community calls a black box (opaque and largely resistant to interpretability), even the most rigorous governance framework often cannot pinpoint the source or cause of any specific hallucination after the fact.
That is a different category of problem to the one Buolamwini, Gebru and Monk were attempting to address. Their work succeeded because the failure was measurable, traceable to specific training data, and fixable by retraining on better-balanced datasets and holding companies accountable for the results. Hallucination does not offer that same foothold. The system is not drawing on a skewed sample of faces. It is generating the most statistically probable-sounding answer it can produce, and when that answer happens to be wrong, there is no internal alarm that fires. It comes out fluent, confident, and indistinguishable in tone from the times the system gets it right.
The practical implication is straightforward. AI outputs need human verification, every time, for anything that matters. Verification is a basic requirement of responsible use, in the same way that you would not publish a document without proofreading it, or present a paper written by someone else without checking it for quality and accuracy.
Where Diversity Helps, and Where It Cannot
Both of these problems sit under the same broad heading of "AI getting it wrong", but they need different responses.
For bias, diversity in the teams designing, developing, deploying, testing and governing AI systems is not a values statement. It is a mechanism with a track record. Buolamwini and Gebru's research forced IBM and Microsoft to fix measurable accuracy gaps within months. Monk's expertise gave Google a better tool than the one it had been using by default. Gebru's current push for smaller, task-specific models is itself an argument for governance built by people who understand what they are accountable for, rather than systems so large and general that nobody can fully audit them. When the people building a system have not experienced the harm it can cause, that harm is less likely to be caught, less likely to be prioritised and less likely to be fixed. When teams come from a wider range of backgrounds, disciplines, geographies, and lived experiences, the system gets interrogated from more angles, and it gets better.
For hallucination, diversity matters too, but differently and with real limits. A more diverse team is more likely to notice when hallucinated content causes disproportionate harm to a particular group, more likely to push for honest disclosure about a system's limitations, and more likely to design appropriately narrow, well-scoped use cases rather than open-ended general deployment. What diverse teams cannot do, on the evidence we currently have, is reach into the black box of a large neural network and remove the architectural tendency to fabricate. That is a harder problem, and pretending otherwise does a disservice to both the people working on it and the people relying on the output.
Next up: Agentic AI. What it means for an AI system to act autonomously on your behalf, and why everything we have covered so far makes the governance of that capability one of the most important conversations of our time.
