Large language models (LLMs) are often chosen as the default infrastructure for translation, conversation and content generation in artificial intelligence. They are now used to interpret contracts, deliver mental health advice, write educational materials and simulate dialogue across cultures. Yet the models powering tools such as Google Translate, ChatGPT and Gemini consistently reproduce Western-centric norms, gender stereotypes and hierarchical valuations of language.
These are algorithmic breadcrumbs of imperial knowledge systems: systems that prioritize certain voices while muting or erasing others. As Kamran argues in "Decolonizing Artificial Intelligence," these technologies “inherit and reinscribe logics of coloniality in digital form,” making modern AI a continuation of historical epistemic violence.
The Data Doesn’t Lie: LLMs Reflect Western and Gender Norms
Multiple studies reinforce Kamran’s concerns. For instance, Liu’s comprehensive 2024 analysis of cultural bias in LLMs found that output accuracy and appropriateness dropped significantly when prompts were grounded in non-Western settings or dialectal variants. Similarly, that same year, Tao, Viberg, Baker and Kizilcec showed that current models exhibit measurable alignment with Western cultural frameworks, particularly when tested against culturally diverse datasets covering educational and sociopolitical topics. These findings, among others, confirm that digital neutrality is a myth and that AI’s linguistic behavior reflects broader geopolitical inequities.
A similar pattern is evident in educational contexts. Boateng and Boateng, in their 2025 study "Algorithmic Bias in Educational Systems," found that algorithmic decision-making tools used in schools and universities reinforce structural inequalities by prioritizing Western-centric academic profiles and suppressing alternative educational trajectories. Their work provides critical insight into how generative AI-driven curricula perpetuate exclusion by favoring dominant cultural references and omitting global diversity.
One striking example comes from Prates, Avelar and Lamb, whose empirical work in 2020 revealed that AI translation systems exhibit a strong tendency toward male defaults — particularly for professions stereotypically associated with male dominance, such as those in science, technology, engineering and mathematics (STEM). Their study demonstrated that gender-neutral sentences in languages like Finnish and Turkish were disproportionately translated into English with male pronouns for professional roles (“he is a doctor”) and female pronouns for domestic roles (“she is a nurse”).
These outcomes are not incidental or circumstantial. They result from massive datasets that train machines to replicate entrenched cultural assumptions — thereby reinforcing bias at scale and speed.
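The pattern is straightforward to observe firsthand. The sketch below is a minimal illustration rather than a reproduction of the Prates et al. study: it feeds gender-neutral Turkish sentences (Turkish’s third-person pronoun "o" carries no gender) to an openly available translation model, assumed here to be Helsinki-NLP/opus-mt-tr-en loaded through the Hugging Face transformers library, and records which English pronoun the model supplies. The sentence list and the pronoun check are illustrative choices, not part of the cited research.

```python
# Rough probe of gendered defaults in machine translation, in the spirit of
# the audits described above. Turkish's third-person pronoun "o" is gender-
# neutral, so any "he" or "she" in the English output is added by the model.
from transformers import pipeline

# An openly available Turkish-to-English model; NOT the commercial systems
# audited in the cited studies.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-tr-en")

# Gender-neutral Turkish sentences, each naming a different occupation.
prompts = {
    "doktor (doctor)": "O bir doktor.",
    "mühendis (engineer)": "O bir mühendis.",
    "hemşire (nurse)": "O bir hemşire.",
    "öğretmen (teacher)": "O bir öğretmen.",
}

for role, sentence in prompts.items():
    english = translator(sentence)[0]["translation_text"]
    words = english.lower().replace(".", "").split()
    default = "he" if "he" in words else "she" if "she" in words else "neutral"
    print(f"{role:22} -> {english!r:32} default pronoun: {default}")
```

Because the Turkish source contains no gender information, any pronoun that appears in the English output is supplied by the model and, by extension, by its training data.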
How Training Data Reinforces Colonial Knowledge
The colonial dynamics embedded in LLMs are inseparable from their training datasets. Most commercial models are built on vast quantities of data scraped from English-language, Western-centric internet sources. Wikipedia, digitized news sites, open books, forums and social platforms constitute the bulk of this data — sources where English dominates and where global linguistic hierarchies remain largely unchallenged. In this context, languages spoken across the Global South are often grossly misrepresented. Their cultural references are mistranslated, stripped of nuance or omitted altogether.
Findings from my research, "AI-Driven Biases in Curriculum," demonstrate that this issue extends beyond language translation and into educational content. In a review of over 1,000 AI-generated syllabi, 72% of cultural references were drawn from Western traditions, compared to 50% in human-designed syllabi. Non-Western perspectives appeared in only 8% of AI-generated outputs. The discrepancies were mirrored across NLP tasks, where Indigenous, African and Southeast Asian dialects remained unsupported or inaccurately represented. These absences reflect a continuation of the marginalization enforced by colonial and postcolonial systems of knowledge.
What emerges from this data landscape is a digital knowledge regime that reiterates historical exclusions. AI does not just misunderstand marginalized languages — it often does not see them at all. When the training set becomes the canon, the legacy of colonial domination is maintained in the very models that promise global inclusion.
When Machines Misfire on Culture and Identity
As described in "Data Feedback Loops," written by two students from Stanford University, datasets scraped from the internet have been critical to large-scale machine learning (ML). Yet this very success introduces a new risk: as model-generated outputs begin to replace human annotations as sources of supervision, they feed back into the training loop.
This creates a self-reinforcing cycle. The more data systems ingest from dominant languages and ideologies, the more those systems reproduce and amplify those same patterns. For Indigenous languages, this absence is not just a technical limitation; it is a form of digital erasure. AI cannot preserve what it has never seen, cannot translate what was never labeled and cannot respect what was never validated within its training ecosystem.
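The dynamic is easier to see in miniature. The toy simulation below is a deliberately simplified caricature, not the formal analysis in "Data Feedback Loops": it assumes a corpus that is 90% dominant-language text, a model that fails to learn any language making up less than 5% of its training mix, and a new corpus each generation that is half model-generated. All of those numbers are invented for illustration.

```python
# Toy illustration of a data feedback loop. Every number here is invented:
# the point is the mechanism, not the magnitudes.
corpus = {"dominant_lang": 0.90, "minority_lang_a": 0.07, "minority_lang_b": 0.03}
COVERAGE_THRESHOLD = 0.05   # share below which the "model" fails to learn a language
SYNTHETIC_SHARE = 0.5       # fraction of each new corpus that is model-generated

def model_output(distribution):
    """The 'model' reproduces its training mix, but drops what it barely saw."""
    kept = {lang: share for lang, share in distribution.items()
            if share >= COVERAGE_THRESHOLD}
    total = sum(kept.values())
    return {lang: share / total for lang, share in kept.items()}

for generation in range(1, 6):
    synthetic = model_output(corpus)
    # The next corpus blends the previous corpus with the model's own output,
    # so anything the model dropped keeps shrinking round after round.
    corpus = {lang: (1 - SYNTHETIC_SHARE) * share
                    + SYNTHETIC_SHARE * synthetic.get(lang, 0.0)
              for lang, share in corpus.items()}
    print(f"generation {generation}: " +
          ", ".join(f"{lang}={share:.4f}" for lang, share in corpus.items()))
```

Even in this crude setup, the below-threshold language loses roughly half its share every generation while the dominant language's share creeps upward. No one decides to erase it; the loop simply never sees enough of it to keep it.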
The implications are significant. Misrepresenting a pronoun may seem like a minor translation flaw, but it signals which identities are normalized — and which are not. Omitting entire concepts limits intercultural understanding and restricts access to epistemologies that fall outside Western frames. It also constrains the AI-generated curricula students encounter in schools, universities and online learning platforms.
AI Doesn’t Just Reflect Bias — It Amplifies It
Beyond language, large language models perpetuate stereotypical associations between gender and occupation. As Prates et al. showed, AI tools routinely assign male pronouns to roles such as doctor, engineer and scientist, while assigning female pronouns to roles such as caregiver, teacher or cleaner. These choices may seem subtle, but across billions of interactions they become statistically and socially significant. They influence hiring systems, résumé-parsing algorithms and educational guidance tools, reinforcing biased career pathways.
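These associations can also be probed directly. The sketch below is an illustrative probe, not a reproduction of any of the cited studies: it asks a small public masked language model, assumed here to be bert-base-uncased loaded through the Hugging Face transformers library, how strongly it prefers "he" versus "she" in a simple occupational sentence frame. The template and the occupation list are invented for illustration.

```python
# Probe pronoun-occupation associations in a small public masked language model.
# The sentence template and the occupation list are illustrative, not a benchmark.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")
template = "The {occupation} said that [MASK] would finish the shift soon."

for occupation in ["doctor", "engineer", "scientist", "nurse", "teacher", "cleaner"]:
    # Restrict the fill to the two pronouns and compare the model's scores.
    results = unmasker(template.format(occupation=occupation), targets=["he", "she"])
    scores = {r["token_str"]: r["score"] for r in results}
    print(f"{occupation:10} he={scores.get('he', 0.0):.3f}  she={scores.get('she', 0.0):.3f}")
```

A systematic audit would use many templates, many occupations and statistical controls; the point here is only that the preference is measurable.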
The path forward requires a decolonial reorientation:
- AI systems must diversify their training data to include Indigenous, African, Asian and Latin American texts, not as token references, but as foundational sources. These texts must be labeled and interpreted by linguists and cultural scholars with epistemic fluency in the languages and worldviews they represent.
- NLP architectures must be paired with fairness-aware algorithms, adversarial debiasing and reweighting techniques capable of identifying and mitigating stereotypes in output in real time (a simplified sketch of reweighting and a data-card entry follows this list). Transparent model documentation, such as data cards and training summaries, should disclose linguistic distribution and cultural weighting. Ethical, opt-in frameworks must replace extractive scraping methods, allowing communities to contribute their languages on their own terms.
- AI education must evolve to include linguistic fairness as a core topic. Developers, designers and policy architects must be taught how language, power and history interact in digital systems. The colonialism embedded in code can only be addressed if it is first recognized.
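To ground the second item in that list, the sketch below shows, under simplified assumptions, one of the techniques it names, inverse-frequency reweighting by language, alongside a bare-bones data-card entry that discloses a corpus's linguistic distribution. The example corpus, language codes and field names are invented for illustration and do not describe any production system.

```python
# Minimal sketch of two ideas from the list above: inverse-frequency reweighting
# of training examples by language, and a bare-bones "data card" that discloses
# the corpus's linguistic distribution. The corpus contents are invented.
from collections import Counter

# Hypothetical training examples tagged with a language code.
examples = [("en", "..."), ("en", "..."), ("en", "..."), ("en", "..."),
            ("sw", "..."), ("qu", "...")]

counts = Counter(lang for lang, _ in examples)
total = sum(counts.values())

# Inverse-frequency weights: rarer languages get larger per-example weights,
# so they are not drowned out during training.
weights = {lang: total / (len(counts) * count) for lang, count in counts.items()}

# A minimal data-card entry disclosing the linguistic make-up of the corpus.
data_card = {
    "linguistic_distribution": {lang: round(count / total, 3) for lang, count in counts.items()},
    "reweighting": "inverse language frequency",
    "per_language_weight": {lang: round(w, 3) for lang, w in weights.items()},
}
print(data_card)
```

Reweighting alone does not fix a corpus that never included a language in the first place; it only keeps the languages that are present from being drowned out, which is why the first point in the list, sourcing, comes before it.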
Whose Language Does AI Speak?
When large language models translate, predict and teach, whose voice do they mirror? When they fail to represent Indigenous knowledge, feminized labor or non-Western frameworks, what kind of digital future are they building? Unless addressed, the linguistic and cultural patterns embedded in AI will calcify old hierarchies in new systems. In this way, the neutrality of AI becomes a myth. The fight for algorithmic fairness is technical, linguistic, historical and deeply political. To decolonize AI, we must first ask: Whose language are we teaching machines to speak, and whose silence are we encoding in return?