I’ve had a need lately to detect duplicate text files, or strings, using any means possible. It turns out that Ruby was the best answer. Here’s a review of what finally worked.
sudo gem install Text
string1 = "hello"
string2 = " hello"
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
string1 = ic.iconv(string1 + ' ')[0..-2]
string2 = ic.iconv(string2 + ' ')[0..-2]
Text::Levenshtein.distance(string1,string2)
I first tried the Levenshtein gem (not the Text) gem. It is broken as comparing ‘hello’ with ‘ hello’ returns 5 instead of 1. This is bad. The implementation in the Text gem works but is much slower. This makes sense because it’s doing a lot more work to find the real answer.
I also tried the Levenshtein algorithm written in C for mySQL, by Josh Drew . The only issue I had was that it doesn’t play well with text columns. You need to do things as varchar. It also doesn’t handle substr() results. At first I thought it was a bit slow. Now that I know the Ruby Levenshtein gem is broken, the mySQL implementation is beginning to look fast again.
My personal dupe detection technique is to take the middle 50% of the text body. This prevents too much noise created by any headers or footers that might be present but different in otherwise similar text. If you use the mySQL compiled solution (much much faster than Ruby Text), then you’ll want to create a separate native column containing the text excerpt you want to use for comparison.
The final step I took was after I decided the Text gem implementation took too long. I used Ruby Inline and found a C implementation of Levenshtein. Works fast and wonderfully well:
builder.c "
#include <stdlib.h>
#include <malloc.h>
#include <string.h>
static int levenshtein_distance(char *s,char*t)
/*Compute levenshtein distance between s and t*/
{
//Step 1
int min, k,i,j,n,m,cost,*d,distance;
n=strlen(s);
m=strlen(t);
if (n==0) return m;
if (m==0) return n;
d=malloc((sizeof(int))*(m+1)*(n+1));
m++;
n++;
//Step 2
for(k=0;k<n;k++)
{
d[k]=k;
}
for(k=0;k<m;k++)
{
d[k*n]=k;
}
//Step 3 and 4
for(i=1;i<n;i++)
{
for(j=1;j<m;j++)
{
//Step 5
if(s[i-1]==t[j-1])
{
cost=0;
} else {
cost=1;
}
//Step 6
min = d[(j-1)*n+i]+1;
if (d[j*n+i-1]+1 < min) min=d[j*n+i-1]+1;
if (d[(j-1)*n+i-1]+cost < min) min=d[(j-1)*n+i-1]+cost;
d[j*n+i]=min;
}
}
distance=d[n*m-1];
free(d);
return distance;
}
"
end
The C inline is easily 20x faster. Scary faster.
4 responses so far ↓
1
Mark Wilden // Oct 5, 2008 at 1:56 pm
I wonder if the
ic.iconv(string1 + ‘ ‘)[0..-2]
workaround is still necessary? The original blog article is a couple of years old.
2
admin // Oct 16, 2008 at 10:42 pm
I found the blog article because I observed the problem. It seemed to fix the issue, so I would venture to guess that it is still relevant. But I’m no expert in ruby+mysql+unicode
3
Roger Rohrbach // Dec 23, 2008 at 6:41 pm
Seth Schroeder has a non-recursive Ruby version (http://www.nearinfinity.com/blogs/page/seths?entry=counting_beignets_soundex_levenshtein_or) that’s acceptably fast, and uses the Damerau-Levenshtein method, which considers a transposed character a one-character edit.
4
Roger Rohrbach // Dec 23, 2008 at 7:04 pm
And Hal Fulton (”The Ruby Way”) has an even better version: http://www.informit.com/articles/article.aspx?p=683059&seqNum=36
Leave a Comment