N-gram Similarity is a powerful and flexible algorithm in the field of fuzzy text matching that calculates text similarity by breaking text into smaller chunks (N-grams). This documentation details the principles, implementation, and application scenarios of this algorithm.
N-grams are contiguous sequences of N items (characters, words, etc.) extracted from text. N-gram similarity calculates similarity by comparing the overlap of N-grams between two texts.
Using the Dice coefficient, the formula for calculating N-gram similarity is:
Similarity = 2 × |A ∩ B| / (|A| + |B|)

Where:
- A and B are the N-gram sets of the two texts
- |A ∩ B| is the size of their intersection
- |A| and |B| are the sizes of the two sets respectively
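As a quick worked example of the formula (plain TypeScript; the strings and helper name are illustrative), consider the bigram sets of "night" and "nacht":

```typescript
// Character-level bigrams: "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}
const bigrams = (s: string): Set<string> => {
  const out = new Set<string>();
  for (let i = 0; i < s.length - 1; i++) out.add(s.substring(i, i + 2));
  return out;
};

const a = bigrams("night"); // 4 bigrams
const b = bigrams("nacht"); // 4 bigrams
let shared = 0;
for (const g of a) if (b.has(g)) shared++; // only "ht" is shared

const dice = (2 * shared) / (a.size + b.size);
console.log(dice); // 2 × 1 / (4 + 4) = 0.25
```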
Below is the TypeScript implementation of the N-gram similarity algorithm used in our Fuzzy Text Matching Tool, using character-level bi-grams (N=2):
```typescript
/**
 * Calculate the N-gram similarity between two strings
 * using character-level bi-grams (n = 2).
 * @param str1 The first string
 * @param str2 The second string
 * @param threshold Optional threshold; if set, returns null when the similarity is below it
 * @returns Similarity in [0, 1], or null when below the threshold
 */
export function calculateNGramSimilarity(str1: string, str2: string, threshold?: number): number | null {
  // Strings shorter than 2 characters have no bi-grams,
  // so fall back to edit-distance similarity instead
  if (str1.length < 2 || str2.length < 2) {
    const maxLength = Math.max(str1.length, str2.length);
    if (maxLength === 0) return 1; // Two empty strings are considered identical
    // calculateEditDistance is implemented elsewhere in the tool;
    // it may return null when invoked with a threshold, so guard for that
    const editDistance = calculateEditDistance(str1, str2);
    if (editDistance === null) return null;
    const similarity = 1 - editDistance / maxLength;
    // Check similarity threshold
    if (threshold !== undefined && similarity < threshold) {
      return null;
    }
    return similarity;
  }

  // Generate the set of bi-grams for a string
  const getBigrams = (str: string): Set<string> => {
    const bigrams = new Set<string>();
    for (let i = 0; i < str.length - 1; i++) {
      bigrams.add(str.substring(i, i + 2));
    }
    return bigrams;
  };

  const bigrams1 = getBigrams(str1);
  const bigrams2 = getBigrams(str2);

  // Calculate intersection size
  let intersection = 0;
  for (const bigram of bigrams1) {
    if (bigrams2.has(bigram)) {
      intersection++;
    }
  }

  // Calculate the Dice coefficient
  const totalBigrams = bigrams1.size + bigrams2.size;
  if (totalBigrams === 0) return 1; // Prevent division by zero
  const similarity = (2 * intersection) / totalBigrams;

  // Check similarity threshold
  if (threshold !== undefined && similarity < threshold) {
    return null;
  }
  return similarity;
}
```
```typescript
// Calculate the N-gram similarity between two strings
const similarity = calculateNGramSimilarity("hello", "hallo");
console.log(similarity); // Output: 0.5 (they share the "ll" and "lo" bi-grams: 2 × 2 / (4 + 4) = 0.5)

// With a threshold of 0.6, similarities below 0.6 are returned as null
const thresholded = calculateNGramSimilarity("hello", "hallo", 0.6);
console.log(thresholded); // Output: null (the actual similarity, 0.5, is below the threshold of 0.6)
```
The choice of N has an important impact on N-gram similarity results: smaller values of N tolerate more variation between texts but lose precision, while larger values demand longer exact runs of matching characters. In our implementation, we chose N=2 (bigram) because it provides a good balance between fault tolerance and precision in most application scenarios.
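To illustrate how N changes the chunks being compared, here is a generalized extractor (a sketch; `getNGrams` with an `n` parameter is a hypothetical helper, not part of the tool's API):

```typescript
// Hypothetical helper: extract character-level N-grams for any N >= 1
function getNGrams(str: string, n: number): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i <= str.length - n; i++) {
    grams.add(str.substring(i, i + n));
  }
  return grams;
}

// Smaller N produces more, shorter chunks (higher fault tolerance);
// larger N produces fewer, longer chunks (higher precision).
console.log([...getNGrams("hello", 2)]); // ["he", "el", "ll", "lo"]
console.log([...getNGrams("hello", 3)]); // ["hel", "ell", "llo"]
```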
N-gram similarity can be used to identify and correct spelling errors by comparing the similarity between misspelled words and dictionary words to find the most likely correct spelling.
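A minimal sketch of this idea, assuming an illustrative dictionary (`bigramDice` below is a compact stand-in for the `calculateNGramSimilarity` function above, inlined so the snippet is self-contained):

```typescript
// Compact bigram Dice similarity, equivalent to the core of calculateNGramSimilarity
function bigramDice(a: string, b: string): number {
  const grams = (s: string): Set<string> => {
    const g = new Set<string>();
    for (let i = 0; i < s.length - 1; i++) g.add(s.substring(i, i + 2));
    return g;
  };
  const ga = grams(a);
  const gb = grams(b);
  let shared = 0;
  for (const g of ga) if (gb.has(g)) shared++;
  const total = ga.size + gb.size;
  return total === 0 ? 1 : (2 * shared) / total;
}

// Suggest the dictionary word most similar to a (possibly misspelled) input
function suggestSpelling(word: string, dictionary: string[]): string {
  let best = dictionary[0];
  let bestScore = -1;
  for (const candidate of dictionary) {
    const score = bigramDice(word.toLowerCase(), candidate.toLowerCase());
    if (score > bestScore) {
      bestScore = score;
      best = candidate;
    }
  }
  return best;
}

console.log(suggestSpelling("helo", ["hello", "help", "halo"])); // "hello"
```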
In large document repositories, N-gram similarity can be used to identify similar or duplicate content, even if the format or parts of the content differ.
By comparing the N-gram distribution of unknown text with samples of different languages, the most likely language of the text can be determined.
N-gram similarity can be used to detect similar sections between documents, helping to identify potential plagiarism.
In our implementation, setting a similarity threshold lets matches below the threshold be reported as null, so downstream processing of weak matches can be skipped, thereby improving overall efficiency.
For strings with lengths less than 2, we use edit distance similarity as a substitute, ensuring the algorithm works properly in all cases.
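The `calculateEditDistance` helper used by this fallback is defined elsewhere in the tool and not reproduced here; a minimal Levenshtein-distance sketch (omitting the tool's optional threshold handling) might look like:

```typescript
// Minimal Levenshtein distance using dynamic programming over a single row.
// Sketch only: the tool's calculateEditDistance also supports an optional
// threshold for early termination, which is omitted here.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// For one-character strings, e.g. "a" vs "b": distance 1, so similarity = 1 - 1/1 = 0
console.log(levenshtein("a", "b")); // 1
```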
We use JavaScript's Set data structure to store N-grams, which automatically removes duplicates and provides efficient lookup operations.
| Algorithm | Advantages | Disadvantages | Best Application Scenarios |
|---|---|---|---|
| N-gram Similarity | Strong fault tolerance, high computational efficiency | Not sensitive to character order | Spell correction, language identification |
| Edit Distance | Considers character position, intuitive results | High computational cost | Short text comparison, exact matching |
| Cosine Similarity | Considers word frequency, not affected by length | Ignores word order | Document classification, information retrieval |
| Jaccard Similarity | Simple and intuitive, suitable for set comparison | Ignores element frequency | Tag comparison, set similarity |
N-gram similarity is a powerful and flexible algorithm in the field of fuzzy text matching, particularly suitable for applications requiring high fault tolerance and computational efficiency. By breaking text into smaller chunks, it effectively captures local similarities and works well even in the presence of spelling errors or text variations.
In our Fuzzy Text Matching Tool, N-gram similarity is one of the core algorithms, providing users with efficient and accurate text similarity calculation capabilities. By combining it with other algorithms such as edit distance, cosine similarity, and Jaccard similarity, our tool can meet various text matching needs.
© 2023 Fuzzy Text Matching Tool. All rights reserved.