Discussion (27 Comments) | Read Original on Hacker News
If the budget is indeed so modest (5.5 million euros!), I would focus completely on preparing datasets and making sure all open cultural artifacts that we can find are well documented in them. That way every model, private or open, that gets trained in the future could better represent the culture and language of your country.
https://en.wikipedia.org/wiki/List_of_languages_by_number_of...
that's not impressive
All in all, I don't think that's a major issue here.
I don't necessarily feel that preserving European Portuguese in amber is a worthwhile goal (any more than it is productive for Brits to be prickly about the meteoric rise of US English).
Then again, if you go to Miranda do Douro, they’ll say the rest of Portugal has been talking nonsense for the last 700 years, so the purists at least always have their concents to retreat to if they so choose.
That's easy to say when you're not on the other end of US defaultism.
Trying to force an LLM into a specific language makes you miss out on most of the world's knowledge.
Besides, there is knowledge that is locked behind languages, there are things known in Portuguese that aren't known in other languages, and the same for other languages too. More accessibility to those ideas wouldn't hurt.
It's just a database. If you push text in one language into it, it'll likely crap out stuff in that same language, unless the system prompt that also goes in with your query causes it not to.
And who knows what will happen to grammar?
https://simianwords.bearblog.dev/why-domain-specific-llms-wo...