Show HN: CLI tool for detecting non-exact code duplication with embedding models

rkochanowski•about 4 hours ago

I built Slopo to solve one specific problem: finding the similar code that is hardest for other tools, AI coding agents, and humans to detect.

It finds similar-looking code with embeddings. This detects more than just copy-paste clones or even clones with minor changes. Similar code is often not a clone to refactor, and this is a trade-off. Initial results need to be verified, but coding agents can do this quickly. Example prompts are available on https://slopo.dev
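As a rough illustration of the general idea (not Slopo's actual implementation), similarity search over code snippets can be sketched as cosine similarity between vectors. The `toy_embed` function below is a stand-in assumption: it counts identifier tokens instead of calling a real embedding model, and the names `similar_pairs` and `threshold` are hypothetical.

```python
import math
import re
from collections import Counter
from itertools import combinations


def toy_embed(code: str) -> Counter:
    """Stand-in for an embedding model: a bag of identifier/keyword tokens."""
    return Counter(re.findall(r"[A-Za-z_]\w*", code))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similar_pairs(snippets: dict[str, str], threshold: float = 0.6):
    """Return (name_a, name_b, similarity) for pairs at or above the threshold,
    most similar first. Every hit is only a candidate and still needs review."""
    vectors = {name: toy_embed(src) for name, src in snippets.items()}
    pairs = []
    for a, b in combinations(sorted(vectors), 2):
        score = cosine(vectors[a], vectors[b])
        if score >= threshold:
            pairs.append((a, b, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


snippets = {
    "a": "def total(items): return sum(i.price for i in items)",
    "b": "def total_cost(items): return sum(x.price for x in items)",
    "c": "class Logger: pass",
}
pairs = similar_pairs(snippets)
```

With a real embedding model the vectors would be dense floats, but the thresholding and the need to verify each candidate stay the same.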

Additionally, similar code distant in the codebase is ranked higher to focus on less obvious duplication.
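One way such a ranking could work, purely as a sketch: boost each pair's similarity by how far apart the two files sit in the directory tree. The distance measure, the `distance_weight` parameter, and the scoring formula are all assumptions for illustration, not how Slopo computes its ranking.

```python
from pathlib import PurePosixPath


def path_distance(a: str, b: str) -> int:
    """Number of path components that must be left and entered to get from a to b."""
    parts_a, parts_b = PurePosixPath(a).parts, PurePosixPath(b).parts
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    return (len(parts_a) - common) + (len(parts_b) - common)


def rank(pairs, distance_weight: float = 0.1):
    """Sort (path_a, path_b, similarity) tuples so that distant pairs rank higher.

    The score is similarity * (1 + distance_weight * path_distance),
    a hypothetical formula chosen only to show the idea.
    """
    def score(pair):
        path_a, path_b, similarity = pair
        return similarity * (1 + distance_weight * path_distance(path_a, path_b))

    return sorted(pairs, key=score, reverse=True)


candidates = [
    ("src/auth/check.py", "src/auth/util.py", 0.90),
    ("src/auth/check.py", "plugins/billing/acl.py", 0.85),
]
ranked = rank(candidates)
```

Here the cross-module pair ends up first even though its raw similarity is slightly lower, because neighbours in the same directory are the duplicates a reviewer is most likely to have noticed already.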

The results differ a lot depending on the codebase. I noticed that sometimes most of the detected duplicates are false positives, but the remaining ones are strong candidates for refactoring, or even bugs. Sometimes it reveals much more real duplication.

realxrobau•about 3 hours ago

If it did PHP I would love to run it over WordPress. What would it take to add that?

rkochanowski•about 3 hours ago

PHP support can be added easily; I will release a new version soon.

raro11•about 2 hours ago

Thank you

BrandiATMuhkuh•about 1 hour ago

What a simple and smart idea. Wonderful

murats•about 3 hours ago

Nice idea. I can see this being useful before refactors, especially when the duplication is semantic rather than copy paste.

philajan•about 1 hour ago

This is neat. Have you noticed any difference in duplicate detection between strongly typed and loosely typed languages / code bases?

rkochanowski•25 minutes ago

No. It depends mostly on general code quality and architecture. Some implementations require more code similarity by design. Some languages, like Java, may tend to have more duplication, but that's only a theoretical guess. It also depends on what kind of software is developed in which language.

If you are interested in data, you can check my article. The analysis was done with a previous version of this tool, which excluded exact-copy duplicates. https://rkochanowski.com/article/analysis-code-duplication/

forhadahmed•about 1 hour ago

self plug (for similar tool): https://github.com/forhadahmed/refactor

hdz•about 2 hours ago

Very nice. I can imagine putting this into a pre push hook to keep things clean after an initial sweep.

NYCHMPAI•about 3 hours ago

This is a great use case for embeddings. Code deduplication across distant modules is notoriously hard for traditional AST-based tools.

How do you handle chunking and parsing for different languages to make sure the embeddings capture semantic meaning effectively? For instance, do you chunk by functions/classes, or use a fixed token window? If a function is too long or too short, it can drastically skew the embedding similarity.
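For reference, the function/class option could look roughly like this for Python sources, using the standard `ast` module. This is only a sketch of one chunking strategy; the thread does not say how Slopo chunks code, and `chunk_by_definition` and `min_lines` are hypothetical names.

```python
import ast


def chunk_by_definition(source: str, min_lines: int = 3):
    """Split Python source into (name, start_line, text) chunks, one per
    function or class. Definitions shorter than min_lines are skipped,
    since very short chunks tend to look similar to everything."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = ast.get_source_segment(source, node)
            if text is None:
                continue
            line_count = node.end_lineno - node.lineno + 1
            if line_count >= min_lines:
                chunks.append((node.name, node.lineno, text))
    return sorted(chunks, key=lambda chunk: chunk[1])


src = "def tiny(): return 1\n\ndef big(x):\n    y = x + 1\n    return y\n"
chunks = chunk_by_definition(src)
```

A fixed token window would avoid the per-language parser, at the cost of chunks that start and end mid-function.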

SpyCoder77•about 1 hour ago

I think that this is pretty cool, but is there any reason why we would want to remove similar/possible duplicate code?

rkochanowski•4 minutes ago

Recently there was a popular article on HN saying that sometimes code duplication is better than abstraction, so I assume that this question is not a joke.

While testing this tool, one detected duplicate made for an interesting use case. Permission check logic was duplicated in distant places in the codebase. The code was similar but not identical, and the logic was not the same: one version had stricter checks. I analyzed this with the coding agent, and we found that both versions are used for the same thing, which means that in some cases validation is insufficient. With a single validation place, this bug could have been prevented or easily detected.
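A made-up example of that kind of divergence, with hypothetical names and rules (the real code is not shown in this thread): two near-identical permission checks, one of which forgot a condition, followed by the single shared check both call sites could use instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    role: str
    active: bool
    owns_resource: bool


def can_edit_in_api(user: User) -> bool:
    # Stricter copy: deactivated accounts are rejected.
    return user.active and (user.role == "admin" or user.owns_resource)


def can_edit_in_export(user: User) -> bool:
    # Near-duplicate living elsewhere: the "active" condition is missing.
    return user.role == "admin" or user.owns_resource


def can_edit(user: User) -> bool:
    """Single place for the rule; both call sites would delegate here."""
    return user.active and (user.role == "admin" or user.owns_resource)


deactivated_admin = User(role="admin", active=False, owns_resource=False)
api_allows = can_edit_in_api(deactivated_admin)
export_allows = can_edit_in_export(deactivated_admin)
```

The two copies look almost the same in a diff, yet they disagree for a deactivated admin, which is exactly the kind of gap that is easy to miss when the copies live far apart.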

rufius•27 minutes ago

(without sarcasm) Is this a serious question?

If so - maintainability, testability. This is old software engineering best practice at this point.

You shouldn’t hyper-optimize for deduplication, but it’s usually worth considering. It also means fewer places to fix issues or make improvements.

klibertp•5 minutes ago

I tend to follow the "rule of 3": a second similar implementation is OK, introducing the third triggers a refactor. As with everything, this isn't dogma, and sometimes the second implementation is already too much, while at other times you get tens of similar code sections (in codegen, repeating patterns with almost no changes is a virtue). But it's a good rule of thumb.

On testability: two implementations can be tested against each other, leading to greater coverage with less test code. It doesn't work that way for 3+ implementations, which is another reason not to have that many.
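A minimal sketch of testing two implementations against each other (often called differential testing), assuming two hypothetical `clamp` variants and randomly generated inputs:

```python
import random


def clamp_a(value: int, low: int, high: int) -> int:
    return max(low, min(value, high))


def clamp_b(value: int, low: int, high: int) -> int:
    if value < low:
        return low
    if value > high:
        return high
    return value


def find_mismatch(impl_a, impl_b, trials: int = 1000, seed: int = 0):
    """Feed both implementations the same random inputs.

    Returns the first (value, low, high) on which they disagree, or None.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        low = rng.randint(-50, 50)
        high = low + rng.randint(0, 50)  # keep low <= high
        value = rng.randint(-100, 100)
        if impl_a(value, low, high) != impl_b(value, low, high):
            return (value, low, high)
    return None


mismatch = find_mismatch(clamp_a, clamp_b)
```

No expected outputs are written by hand here; each implementation acts as the oracle for the other. With three or more variants a disagreement no longer tells you which one is wrong without extra work.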

Zopieux•31 minutes ago

Have you written software before?
