AMÁLIA and the future of European Portuguese LLMs

jjohnbarron • 3 days ago • 27 comments • duarteocarmo.com

Discussion (27 Comments)

pu_pe•about 2 hours ago

I'm not sure the direction should be to finetune a small local model for each country or language. These models are already not particularly great at information retrieval, so I doubt anyone would use them for questions like the ones the author suggests (i.e., who was the president between X and Y). Similarly, they are a little too lightweight to be used for translation.

If the budget is indeed so modest (5.5 million euros!), I would focus completely on preparing datasets and making sure all open cultural artifacts that we can find are well documented in them. That way every model, private or open, that gets trained in the future could better represent the culture and language of your country.

iugtmkbdfil834•3 minutes ago

I agree; the research is complex enough as it is without having to worry about splitting it Babel-like into multiple languages.

swiftcoder•about 2 hours ago

It is definitely an interesting problem, because Portugal is a small enough country that the actual total corpus of available texts in (non-Brazilian) Portuguese is potentially problematic.

fy20•32 minutes ago

European Portuguese is the 13th most populous language in Europe. Not that small, there are many other European languages in use that are much smaller.

https://en.wikipedia.org/wiki/List_of_languages_by_number_of...

depaulagu•27 minutes ago

> European Portuguese is the 13th most populous language in Europe

that's not impressive

senko•5 minutes ago

Hello from 23rd

embedding-shape•about 2 hours ago

I don't think so. Portugal the country might be small, with a small population, but there are ~250 million "Lusophones" (native Portuguese speakers), making it the fifth-most spoken native language in the world. I'd hardly call that small :) And before everyone screams: yes, European Portuguese is different from Brazilian Portuguese, but they're still both Portuguese and their speakers understand each other, so it's not like text from one cannot be used to train a model for the other, or vice versa.

All in all, I don't think that's a major issue here.

swiftcoder•about 2 hours ago

The authors are pretty clearly trying to draw only from European Portuguese sources - I feel like there's a fairly widespread attitude here that the language is being overwhelmed by the sheer number of Brazilian speakers (which there is obviously at least some truth to).

I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English)

madaxe_again•about 1 hour ago

Man, there’s an attitude up here in Trás-os-Montes that the rest of Portugal has spoken unrecognisable trash for a century. It took me years to realise I’d learned hilariously antique Portuguese by moving here.

Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their concents to retreat to if they so choose.

philipwhiuk•about 1 hour ago

> I don't necessarily personally feel like preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).

That's easy to say when you're not on the other end of US defaultism.

mghackerlady•about 1 hour ago

Right, but most of those speak Brazilian Portuguese. There's so much less European Portuguese text that it becomes impossible for a model not to speak Brazilian Portuguese unless it is trained in a way that ignores Brazilian sources.

KK7NIL•about 2 hours ago

The whole point of this project is to have an LLM that speaks European Portuguese, not Brazilian Portuguese.

embedding-shape•about 2 hours ago

Right, and my point is that if you use 80% Brazilian Portuguese during base model training + 20% European Portuguese as post-training, you pretty much get exactly that, except with a ton more available training data.
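
A minimal sketch of that 80/20 split, assuming the percentages refer to shares of one total token budget (the comment does not say what they are shares of). The `Stage` class, the `two_stage_plan` helper, and the 10-billion-token figure are all hypothetical and only there for illustration; this is not how the AMÁLIA authors plan their training.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    variety: str  # which variety of Portuguese the stage draws from
    tokens: int   # token budget for the stage


def two_stage_plan(total_tokens: int, base_share: float = 0.8) -> list[Stage]:
    """Split a token budget into a base stage and a post-training stage."""
    if not 0.0 < base_share < 1.0:
        raise ValueError("base_share must be strictly between 0 and 1")
    base_tokens = round(total_tokens * base_share)
    return [
        # Assumption: the larger share is Brazilian Portuguese text.
        Stage("base", "pt-BR", base_tokens),
        # Assumption: the remainder is European Portuguese text.
        Stage("post-training", "pt-PT", total_tokens - base_tokens),
    ]


# Hypothetical budget, chosen only to make the arithmetic visible.
plan = two_stage_plan(total_tokens=10_000_000_000)
for stage in plan:
    print(f"{stage.name}: {stage.tokens:,} tokens of {stage.variety}")
```

The sketch only covers the budgeting arithmetic; whether the second stage is enough to keep the model from drifting back to Brazilian usage is exactly what the rest of the thread disputes.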

madaxe_again•about 2 hours ago

Mutually intelligible, yes, but far from perfectly so. I speak both, as a native anglophone, and the difference is not so much “US vs British English” as “Guyanese English vs British English”. Like, fundamental points of grammar differ, the spoken rhythm and syllabic stress differ (poetry does not translate well between them), never mind just vocabulary. Continental Portuguese people tend to find it easier to understand brasileiros than vice versa, largely due to mostly one-way cultural exports, but to try to roll both into a single model would create a creole at best.

embedding-shape•about 2 hours ago

I agree, they're not the same. But they're far closer than languages that don't come from the same family.

algoth1•about 2 hours ago

Wouldn't it be easier to fine-tune a model to convert the Brazilian Portuguese corpus into European Portuguese and then use that corpus?

hartator•about 2 hours ago

What a waste of time and money.

Trying to force an LLM into a specific language makes you missed out on most of the world knowledge.

embedding-shape•about 2 hours ago

What LLM isn't forced into a specific language? That'd be a weird language model no one could understand. You need to choose at least one language, ideally the one its creators speak.

Besides, there is knowledge that is locked behind languages, there are things known in Portuguese that aren't known in other languages, and the same for other languages too. More accessibility to those ideas wouldn't hurt.

Miraste•about 2 hours ago

To my knowledge, all major LLMs are multilingual. This article could really have used an evaluation of existing models' European Portuguese capabilities.

numpad0•about 1 hour ago

Yeah, they all seem confined to being an American-consultant-Chinese-authoritarian split personality with broad second language capabilities. I suppose they become too incoherent otherwise.

cess11•about 2 hours ago

E.g. gemma3:4b can fake simple conversations in several European languages, including Portuguese, Swedish and Finnish.

It's just a database. If you push text in one language into it, it'll likely crap out stuff in that same language, unless the system prompt that also goes in with your query causes it not to.

mistrial9•about 2 hours ago

> makes you missed out on most of the world knowledge

And who knows what will happen to grammar?

KK7NIL•about 1 hour ago

This is how Europe thinks it can catch up on tech: by having the government fund vanity projects that will be made obsolete by more general techniques in 6 months.

simianwords•about 1 hour ago

Domain-specific models will never be a thing. You don't get generalised intelligence with that.

https://simianwords.bearblog.dev/why-domain-specific-llms-wo...