Whose Culture Does AI Learn? The EU’s Copyright Battle Is Ignoring the Languages That Need Protection Most

The European Parliament’s battle over AI training data is usually framed as a struggle between Big Tech and creative workers. But the deepest wound may be dealt to languages and traditions that neither side is seriously taking into account.

Martti Asikainen 27.3.2026 | Photo by Adobe Stock Photos

I have spent hours trying to get AI tools to work in Meänkieli, one of Sweden’s five officially recognised national minority languages, spoken along the border between Finland and Sweden. Swedish legislation explicitly obliges authorities to protect and promote it (SFS 2009:724; Council of Europe 2024; Pirinen 2025).

When I ask one of the leading language models to write a paragraph in Meänkieli, the results range from outright failure to a digital shrug. Ask the same in English, and it knows Shakespeare like the back of its hand. This gap is not a technical accident, but a predictable consequence of a system trained on data from dominant languages — a system in which Meänkieli, like dozens of other EU minority languages, barely appears even in footnotes.

This is also a problem for which the European Parliament’s resolution on copyright and generative AI, adopted on 10 March 2026, offers no solution. Nobody has even thought to ask what it means to protect a minority language in a world where digital capability is increasingly a prerequisite for a language being experienced as living and relevant at all. On the other side of the scales, meanwhile, sit creators’ rights to their own work.

The Law Is Written for Those Who Have a Lawyer

After the Parliament’s vote, headlines focused on familiar confrontations: artists versus algorithms, Hollywood versus Silicon Valley, Brussels versus the large American technology companies. Rapporteur Axel Voss called for clearer rules, legal certainty, and remuneration for rightsholders. The Parliament also demanded greater transparency and the establishment of trusted intermediary structures to support documentation and compliance.

These are legitimate concerns. They are also concerns raised by parties with lawyers, collecting societies, and market leverage. Frisian poetry was not mentioned in the speeches even once. That may sound like a joke, but it matters more than one might imagine at first glance, because the legal architecture is very real, and it works against small creators.

Article 4 of the 2019 Copyright in the Digital Single Market Directive created an exception for text and data mining, including for commercial purposes, unless the rightsholder explicitly reserves their rights (EU 2019/790). For works made publicly available online, that reservation is expected to be expressed through machine-readable means: metadata, website terms and conditions, and similar signals. The AI Act later added a requirement for providers of general-purpose AI models to publish sufficiently detailed summaries of their training content (EU 2024/1689).

Formally, this is a kind of opt-out system. In practice, it works by shifting responsibility onto individual creators and small institutions. They must know the rules, understand the technical protocols, express their rights — and then somehow verify whether the technology companies have actually complied with the law (EU 2025; EUIPO 2025; Ziaja 2024).
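To make the burden concrete, here is a minimal sketch in Python of one machine-readable signal a small archive might publish: a robots.txt file that turns away a single AI crawler while leaving the site open to everyone else. The crawler name "ExampleAIBot" and the archive address are invented for illustration. Whether robots.txt alone counts as a valid reservation under Article 4(3) is not something this article settles, and the Robots Exclusion Protocol is advisory, so a crawler that ignores it faces no technical barrier.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a small archive. "ExampleAIBot" is an invented
# crawler name used only for illustration, not a real product.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical address of one recording in the archive.
ARCHIVE_URL = "https://archive.example/recordings/song-001"


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The named AI crawler is asked to stay out; other agents are not.
ai_crawler_allowed = may_fetch(ROBOTS_TXT, "ExampleAIBot", ARCHIVE_URL)
other_agent_allowed = may_fetch(ROBOTS_TXT, "OrdinaryBrowser", ARCHIVE_URL)
print(ai_crawler_allowed, other_agent_allowed)
```

Even this small example assumes that the archive knows which crawlers exist, knows what they call themselves, and controls its own web server. It also says nothing about whether the signal was honoured, which is the part that cannot be checked from the outside.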

Europe Is Many Things

The Parliament’s own resolution acknowledges that rightsholders currently cannot easily or effectively exercise their opt-out right, and that the current situation creates a systemic imbalance (EP 2026a). Better registries and intermediary structures might help at the margins. The challenge, however, is that the proposed method imagines copyright management as something like a level playing field: create something, reserve your rights, negotiate a licence, get paid.

This model may work for large publishers, major music companies, collecting societies, and studios with in-house legal departments. It works far less well for a small Kashubian folk music archive, a Ladino literary journal, or a filmmaker who has produced one of the very few works ever made in Cornish (EP 2026a, 2026b; EU 2025). By favouring those with the technical capacity and legal resources, the opt-out system selectively protects the cultural output of dominant languages.

Yet Europe consists of far more than large commercial actors. It is also the poet writing in Võro, the novelist working in Aragonese, the keeper of oral tradition recording songs in Sorbian or Aromanian. The European Union has 24 official languages, but it also contains around 60 regional and minority languages, many of them vulnerable or endangered (Pasikowska-Schnass 2016, 2020; Council of Europe 1992/1998). For these creators, the proposed system is not just impractical. It is functionally invisible.

A Language Disappears When It Disappears from Daily Life

AI language models learn from data. The data on which they are trained shapes not only their capabilities but also their entire sense of what is normal, important, and worth expressing. Language use on the internet today is heavily concentrated. English alone accounts for roughly half of all websites whose content language can be identified (W3Techs 2026).

The research is unambiguous. Low-resource and marginalised languages receive weaker datasets, poorer benchmarks, and significantly worse model performance than high-resource languages (Grützner-Zahn & Rehm 2024; OECD 2023; Zhong et al. 2024; Alam et al. 2024; Micallef et al. 2025; Nuha et al. 2026). The consequences of this are not limited to specialists and enthusiasts, but affect each and every one of us.

As children, teachers, and ordinary users increasingly turn to AI assistants to explore literature, history, and cultural identity, the heritage they encounter will be the heritage of the majority — the heritage that is well represented in training data. The rest risks becoming harder to find, harder to generate, harder to search — and therefore easier to overlook. This is especially true as AI models increasingly displace traditional search engines such as Google Search.

Languages do not disappear only when people stop speaking them. They disappear when they cease to be useful in the institutions and tools around which our everyday lives and routines are organised. Search, translation, writing assistance, education, administration, and entertainment are increasingly mediated by AI. If these systems do not support a given language, users come to experience that language as unnecessary, inadequate, and perhaps even old-fashioned.

Yet this does not happen all at once. The experiences accumulate interaction by interaction: when a Sorbian writer finds that language models cannot reliably support her work, when a Sámi musician realises that his recordings appear in no AI training corpus, or when an Aromanian songwriter cannot get translation tools to handle his lyrics convincingly. These are not edge cases to be tidied up later, but part of a much larger problem that a market-driven system will not solve on its own.

The Parliament’s resolution explicitly links copyright, creativity, and Europe’s cultural diversity (EP 2026a). The Council of Europe’s language framework exists precisely because regional and minority languages require active support to remain living parts of European cultural identity (Council of Europe 1992/1998). And yet a market-centred system for AI licensing and opt-outs will, by its nature, favour those who already have scale — which means it will favour dominant languages, large rightsholders, and well-resourced collecting societies.

The market will not solve this problem. Cultural policy recognised decades ago that markets alone cannot preserve linguistic diversity, because demand for smaller languages is limited. That is precisely why active public support mechanisms exist. The copyright debate around AI has not yet woken up to this. It should.

Infrastructure Decides, If We Choose It

The solutions are not conceptually complicated, even if they are politically awkward. Registry mechanisms should carry an active cultural-preservation function, not merely a passive compliance one. A transparency registry designed to help large rightsholders document their opt-outs will, in practice, serve only large rightsholders.

If the same infrastructure were designed from the outset to include cataloguing support for minority-language archives — with technical assistance and multilingual guidance built in — it could serve the whole of Europe’s cultural ecosystem. In practice, this would mean, for example, that a Ladino literary journal or a Meänkieli sound archive could register its materials and express its opt-out rights without a lawyer or an IT specialist. At the moment, this is not realistic for either.

It would also be worth considering whether a portion of the licensing revenues arising from any AI copyright settlement should be reinvested in cultural diversity. The idea is not exotic. A targeted fund for digitisation, corpus building, educational materials, and creative production in regional and minority languages would be consistent with commitments the EU has already made. What is currently missing is simply the connection: the money comes from technology companies training their models on European cultural heritage, but nothing ensures that any of it finds its way back to the languages concerned. That connection could be made explicit.

The EU should treat digital usability as a matter of language survival, not merely language politics. Reviving a dead language is far more challenging than revitalising a dying one. Earlier work on the digital survival of lesser-used languages reached the same conclusion: if a language is absent from the infrastructure of digital life, it becomes harder to maintain as a living medium (Pasikowska-Schnass 2020).

In the age of AI assistants, that logic becomes sharper still. If models deployed across Europe systematically underperform in Europe’s own recognised minority languages, that is not a neutral technical outcome. It is a policy failure — one that can be named and addressed before the damage becomes irreversible (Grützner-Zahn & Rehm 2024; Pirinen 2025). None of this requires dismantling the copyright framework the Parliament is trying to build — it requires ensuring that framework is designed with the full range of European creators in mind, not just the ones who can afford to negotiate.

Before the Datasets Are Sealed

The EU’s AI copyright debate is, at bottom, about who gets to benefit from the AI transition and who gets left behind. The loudest participants are well-resourced, tightly organised, and overwhelmingly embedded in the architecture of major languages. They will probably secure some kind of settlement.

The question is ultimately whether, in securing that settlement, the EU will also remember what it has repeatedly emphasised in its own commitments. Europe’s languages are not relics but living instruments that carry daily life and build the culture around us. This conversation needs to happen inside the copyright debate, not beside it. Before the datasets are sealed, before the registries are finalised, before the licensing structures become the default architecture of Europe’s AI culture.

When a language disappears from the infrastructure of the digital age, no legislation can bring it back. If Meänkieli does not work in modern tools, it will begin to look unnecessary in the eyes of young people. And it will look that way especially to those who are only just learning what a language can do.

References

Alam, F., Hettiarachchi, H., Braud, C., Rani, P., Uyangodage, L., Abdul-Mageed, M., & Nakov, P. (2024). LLMs for low resource languages in multilingual settings: Tutorial. Association for Computational Linguistics. https://aclanthology.org/2024.eacl-tutorials.5/

Council of Europe. (1992/1998). European Charter for Regional or Minority Languages. https://rm.coe.int/16800cb5e5

Council of Europe. (2024). Eighth evaluation report on Sweden. https://rm.coe.int/sweden-eval-iria-8-en/1680aee227

European Parliament (EP). (2026a, March 10). Copyright and generative artificial intelligence — opportunities and challenges (Resolution P10_TA(2026)0066). https://www.europarl.europa.eu/doceo/document/TA-10-2026-0066_EN.html

European Parliament (EP). (2026b, March 10). Protecting copyrighted work and the EU’s creative sector in the age of AI (Press release). https://www.europarl.europa.eu/news/en/press-room/20260306IPR37511

European Union (EU). (2019). Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market. https://eur-lex.europa.eu/eli/dir/2019/790/oj

European Union (EU). (2024). Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689

European Union (EU). (2025). Generative AI and copyright: Training, creation, regulation. European Parliament. https://www.europarl.europa.eu/RegData/etudes/STUD/2025/774095/IUST_STU(2025)774095_EN.pdf

European Union Intellectual Property Office (EUIPO). (2025). The development of generative artificial intelligence from a copyright perspective. https://www.euipo.europa.eu/en/publications/genai-from-a-copyright-perspective-2025

Grützner-Zahn, A., & Rehm, G. (2024). Surveying the technology support of languages. Association for Computational Linguistics. https://aclanthology.org/2024.tdle-1.1.pdf

Micallef, K., Gatt, A., & van der Plas, L. (2025). Benchmarking large language models against smaller language-specific models on Maltese. https://aclanthology.org/2025.findings-acl.1053.pdf

Nuha, U., Fersini, E., & Passarotti, M. (2026). Towards the first NLP benchmark for Ladin. https://aclanthology.org/2026.findings-eacl.55/

OECD. (2023). AI language models: Technological, socio-economic and policy considerations. OECD Publishing. https://www.oecd.org/content/dam/oecd/en/publications/reports/2023/04/ai-language-models_46d9d9b4/13d38f92-en.pdf

Pasikowska-Schnass, M. (2016). Regional and minority languages in the European Union. European Parliamentary Research Service. https://www.europarl.europa.eu/EPRS/EPRS-Briefing-589794-Regional-minority-languages-EU-FINAL.pdf

Pasikowska-Schnass, M. (2020). Digital survival of lesser-used languages. European Parliamentary Research Service. https://www.europarl.europa.eu/RegData/etudes/BRIE/2020/652086/EPRS_BRI(2020)652086_EN.pdf

Pirinen, F. A. (2025). Language technology for the minority Finnic languages. https://aclanthology.org/2025.iwclul-1.6.pdf

SFS 2009:724 (2009). Lag om nationella minoriteter och minoritetsspråk. https://www.government.se/contentassets/16ba706f40854a87b910941caf3891d1/language-act-in-english.pdf

W3Techs. (2026). Usage statistics of content languages for websites. https://w3techs.com/technologies/overview/content_language

Zhong, T., et al. (2024). Opportunities and challenges of large language models for low-resource languages in humanities research. https://arxiv.org/pdf/2412.04497

Ziaja, G. M. (2024). Text and data mining opt-out in Article 4(3) CDSMD. Journal of Intellectual Property Law & Practice, 19(5), 453–466.

Authors

Martti Asikainen

Communications Lead
Finnish AI Region
+358 44 920 7374
martti.asikainen@haaga-helia.fi

This text has been created as part of the Artificial Intelligence and Equality in the Workplace project (ReiluAI), funded by Haaga-Helia University of Applied Sciences and the Finnish Work Environment Fund.
