Discussion (17 Comments)
I tried fuzzy matching using a cleverly assembled regexp approach, which works surprisingly well: https://github.com/leeoniya/uFuzzy
As usual, examples from my genealogy hobby: many sites allow you to upload your family tree as a gedcom file and compare it to other people's trees or a public tree. Most of these use Levenshtein distance on names to judge similarity, and it's terrible. Anne Nilsen and Anne Olsen could be the same person, right? No!! These tools are unfortunately useless to me because they give so many false positives.
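To make the false-positive problem concrete, here is a minimal sketch of plain Levenshtein distance applied to the two names above. The `similarity` helper and its normalization by the longer string's length are assumptions for illustration; the genealogy sites may score names differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical 0..1 score: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("Anne Nilsen", "Anne Olsen"))           # 2
print(round(similarity("Anne Nilsen", "Anne Olsen"), 2))  # 0.82
```

Two edits out of eleven characters looks like a strong match, even though Nilsen and Olsen are different surnames. The score only counts character edits and knows nothing about what the names mean.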
These days, an embedding model is the way to go. Even a small, bad embedding model is better than Levenshtein distance if you care about the meaning of the string.
What we did was combine a big data engine, such as Apache Spark, with a few comparison algorithms such as Levenshtein, and ML. AI was not treated as an option for such things at all! :)
We used Apache Spark to apply the static algorithms. When the results were confident, such as less than 10% or more than 90% equality, we treated them as sure signs that the records were or were not duplicates. Records that fell somewhere in the middle were sent to machine learning libraries for analysis. Of course, some education was needed for the statistical basis. Records that were still hard to analyze automatically were placed in a report for a human touch ;)
We got relatively good results. It was a Scala-based app, as far as I remember :)
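The threshold routing described above can be sketched in a few lines. This is not the original Scala/Spark code: the function names, the labels, and the use of `difflib.SequenceMatcher` as a stand-in similarity score are all assumptions for illustration. Only the 10% and 90% cut-offs come from the comment.

```python
from difflib import SequenceMatcher

# Cut-offs taken from the comment; everything else here is hypothetical.
LOW, HIGH = 0.10, 0.90


def score(a: str, b: str) -> float:
    """Stand-in similarity in 0..1; the real pipeline used Levenshtein and others."""
    return SequenceMatcher(None, a, b).ratio()


def triage(s: float) -> str:
    """Route a record pair by how confident the static score is."""
    if s < LOW:
        return "distinct"    # confidently not duplicates
    if s > HIGH:
        return "duplicate"   # confidently duplicates
    return "needs_ml"        # ambiguous: hand off to the ML stage


pairs = [
    ("Anne Nilsen", "Anne Nilsen"),
    ("Anne Nilsen", "Anne Olsen"),
    ("abc", "xyz"),
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {triage(score(a, b))}")
```

In a setup like the one described, pairs that the ML stage still cannot decide would go into a report for manual review. That last step is left out here.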
Now with AI, it is much easier... And boring! :D No complexities, no challenges.
I really hate it when people downvote without giving any feedback on what they don't like.
Levenshtein, in combination with machine learning and big data engines like Apache Spark, can do a good job comparing content as well ;)
I wanted to share another approach and some ideas with people who are interested in comparing strings, doing fuzzy searches, and searching for duplicated content.