N-gram Similarity: Breaking Down Text Comparison
In the world of fuzzy text matching, N-gram similarity is a powerful and flexible technique that compares texts by breaking them into smaller chunks and measuring how many of those chunks they share. This approach is particularly effective when dealing with spelling errors, OCR output, and multilingual text.
What are N-grams?
N-grams are contiguous sequences of N items (characters, words, etc.) extracted from text. Depending on the value of N, we have different types of N-grams:
- Unigram (1-gram): Single items, such as individual characters or words
- Bigram (2-gram): Two consecutive items
- Trigram (3-gram): Three consecutive items
- And so on...
For example, for the string "hello", the character-level bigrams (2-grams) are: he, el, ll, lo.
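As a minimal sketch, a generic extraction function might look like the following (getNGrams is a hypothetical helper for illustration, not part of the tool's API):

  // Extract the set of character-level N-grams from a string.
  function getNGrams(str: string, n: number): Set<string> {
    const grams = new Set<string>();
    for (let i = 0; i + n <= str.length; i++) {
      grams.add(str.substring(i, i + n));
    }
    return grams;
  }

  // getNGrams("hello", 1) -> { "h", "e", "l", "o" }
  // getNGrams("hello", 2) -> { "he", "el", "ll", "lo" }
  // getNGrams("hello", 3) -> { "hel", "ell", "llo" }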
How N-gram Similarity Works
N-gram similarity calculates similarity by comparing the overlap of N-grams between two texts. The calculation process typically involves these steps:
- Extract N-grams from both texts
- Calculate the size of the intersection between the two N-gram sets
- Calculate the final similarity using a similarity coefficient (such as Dice coefficient, Jaccard coefficient, etc.)
Using the Dice coefficient as an example, the N-gram similarity formula is:
Similarity = 2 × |A ∩ B| / (|A| + |B|)
Where A and B are the N-gram sets of the two texts, |A ∩ B| is the size of the intersection, and |A| and |B| are the sizes of the two sets respectively.
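For example, for "hello" and "hallo", the bigram sets are {he, el, ll, lo} and {ha, al, ll, lo}. Their intersection {ll, lo} has size 2, so the similarity is 2 × 2 / (4 + 4) = 0.5.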
Advantages of N-gram Similarity
1. Strong Fault Tolerance
The N-gram method has strong fault tolerance for spelling errors and text variations. Even if there are errors in the text, many N-grams can still match, resulting in reasonable similarity scores.
2. Language Independence
The N-gram method does not rely on language-specific rules or dictionaries, so it can be applied to text in any language, including non-space-delimited languages like Chinese and Japanese.
3. Computational Efficiency
Compared to some algorithms that require complex dynamic programming (such as edit distance), N-gram similarity calculation is usually more efficient, especially for long texts.
4. Capturing Local Similarities
The N-gram method excels at capturing local similar patterns in text, even if the overall text structure differs significantly.
Practical Application Examples
Spell Correction
If a user inputs "programing" (missing an 'm'), N-gram similarity can identify that "programming" is the most similar correct spelling.
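A rough sketch of how such a ranking could work (the bigramDice helper and the candidate list here are hypothetical, used only for illustration):

  // Score a misspelled word against candidate corrections using bigram Dice similarity.
  function bigramDice(a: string, b: string): number {
    const grams = (s: string) => {
      const set = new Set<string>();
      for (let i = 0; i < s.length - 1; i++) set.add(s.substring(i, i + 2));
      return set;
    };
    const ga = grams(a);
    const gb = grams(b);
    let overlap = 0;
    for (const g of ga) if (gb.has(g)) overlap++;
    return ga.size + gb.size === 0 ? 1 : (2 * overlap) / (ga.size + gb.size);
  }

  const candidates = ["programming", "program", "programmer"];
  const ranked = candidates
    .map((word) => ({ word, score: bigramDice("programing", word) }))
    .sort((a, b) => b.score - a.score);
  // "programming" ranks first with a score of about 0.95 (18/19),
  // ahead of "program" (0.8) and "programmer" (about 0.67).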
Document Deduplication
In large document repositories, N-gram similarity can be used to identify similar or duplicate content, even if the format or parts of the content differ.
Language Identification
By comparing the N-gram distribution of unknown text with samples of different languages, the most likely language of the text can be determined.
Plagiarism Detection
N-gram similarity can be used to detect similar sections between documents, helping to identify potential plagiarism.
Our Implementation
In our Fuzzy Text Matching Tool, we've implemented a character-level bi-gram (2-gram) similarity algorithm. Here's the core implementation; it falls back to our calculateEditDistance helper for very short strings (the import path shown below is illustrative):
import { calculateEditDistance } from "./editDistance"; // edit-distance helper from the same tool (module path illustrative)

/**
 * Calculate the N-gram similarity between two strings
 * using character-level bi-grams (n=2).
 * @param str1 The first string
 * @param str2 The second string
 * @param threshold Optional threshold; if set, returns null when similarity is less than threshold
 * @returns Similarity (between 0 and 1), or null when below threshold
 */
export function calculateNGramSimilarity(str1: string, str2: string, threshold?: number): number | null {
  // If either string is too short to form bigrams, fall back to edit-distance similarity
  if (str1.length < 2 || str2.length < 2) {
    const maxLength = Math.max(str1.length, str2.length);
    if (maxLength === 0) return 1; // Two empty strings are considered identical
    const editDistance = calculateEditDistance(str1, str2);
    // If edit distance returns null due to threshold, return null here too
    if (editDistance === null) return null;
    const similarity = 1 - editDistance / maxLength;
    // Check similarity threshold
    if (threshold !== undefined && similarity < threshold) {
      return null;
    }
    return similarity;
  }

  // Generate bi-grams
  const getBigrams = (str: string): Set<string> => {
    const bigrams = new Set<string>();
    for (let i = 0; i < str.length - 1; i++) {
      bigrams.add(str.substring(i, i + 2));
    }
    return bigrams;
  };

  const bigrams1 = getBigrams(str1);
  const bigrams2 = getBigrams(str2);

  // Calculate intersection size
  let intersection = 0;
  for (const bigram of bigrams1) {
    if (bigrams2.has(bigram)) {
      intersection++;
    }
  }

  // Calculate Dice coefficient
  const totalBigrams = bigrams1.size + bigrams2.size;
  if (totalBigrams === 0) return 1; // Prevent division by zero
  const similarity = (2 * intersection) / totalBigrams;

  // Check similarity threshold
  if (threshold !== undefined && similarity < threshold) {
    return null;
  }
  return similarity;
}
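A few example calls (the import path is illustrative; the expected values follow from the Dice formula above):

  import { calculateNGramSimilarity } from "./nGramSimilarity"; // module path illustrative

  calculateNGramSimilarity("programing", "programming"); // ≈ 0.95
  calculateNGramSimilarity("hello", "world");            // 0 (no shared bigrams)
  calculateNGramSimilarity("hello", "world", 0.5);       // null (below threshold)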
Comparing N-gram Similarity with Other Fuzzy Matching Algorithms
Algorithm | Advantages | Disadvantages |
---|---|---|
N-gram Similarity | Tolerant of spelling errors and variations; language-independent; efficient; captures local patterns | Small N values can produce false positives; ignores global word order |
Edit Distance | Precise character-level measure; easy to interpret | Dynamic programming is costly for long texts |
Cosine Similarity | Works well for longer documents; accounts for term frequency | Requires vectorizing the text first; ignores word order |
Choosing the Right N Value
The choice of N value has an important impact on N-gram similarity results:
- Smaller N values (such as 2 or 3): Provide higher fault tolerance but may lead to more false positive matches
- Larger N values: Provide more precise matching but reduce tolerance for spelling errors and variations
In practice, character-level bi-grams (N=2) or tri-grams (N=3) usually strike a good balance for text similarity calculations. For word-level N-grams, 1-grams (single words) to 3-grams are commonly used.
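A quick sketch makes this trade-off concrete (nGramDice is a hypothetical generalization of the bigram code above to arbitrary N):

  // Dice similarity over character-level N-grams for an arbitrary N.
  function nGramDice(a: string, b: string, n: number): number {
    const grams = (s: string) => {
      const set = new Set<string>();
      for (let i = 0; i + n <= s.length; i++) set.add(s.substring(i, i + n));
      return set;
    };
    const ga = grams(a);
    const gb = grams(b);
    let overlap = 0;
    for (const g of ga) if (gb.has(g)) overlap++;
    return ga.size + gb.size === 0 ? 1 : (2 * overlap) / (ga.size + gb.size);
  }

  // A single-letter variation is penalized more heavily as N grows:
  nGramDice("color", "colour", 2); // ≈ 0.67
  nGramDice("color", "colour", 3); // ≈ 0.57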
Optimization Tips
1. Threshold Filtering
When processing large volumes of text comparisons, setting a similarity threshold and only returning results that exceed the threshold can improve efficiency.
2. Index Optimization
For large text collections that need frequent comparison, pre-computing and storing N-grams for each text and building an inverted index can speed up queries.
3. Hybrid Algorithms
In some applications, combining N-gram similarity with other algorithms (such as edit distance) can yield better results. For example, N-gram similarity can be used to quickly filter candidates, followed by edit distance for precise ranking.
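One possible shape for such a pipeline, sketched as an assumption rather than the tool's actual design (the index layout and the exactScore re-ranking helper are illustrative):

  // Inverted index: bigram -> ids of documents containing that bigram.
  type InvertedIndex = Map<string, Set<number>>;

  function buildIndex(docs: string[]): InvertedIndex {
    const index: InvertedIndex = new Map();
    docs.forEach((doc, id) => {
      for (let i = 0; i < doc.length - 1; i++) {
        const gram = doc.substring(i, i + 2);
        if (!index.has(gram)) index.set(gram, new Set());
        index.get(gram)!.add(id);
      }
    });
    return index;
  }

  // Stage 1: use the index to collect candidate ids sharing at least one bigram.
  function candidateIds(query: string, index: InvertedIndex): Set<number> {
    const ids = new Set<number>();
    for (let i = 0; i < query.length - 1; i++) {
      const posting = index.get(query.substring(i, i + 2));
      if (posting) for (const id of posting) ids.add(id);
    }
    return ids;
  }

  // Stage 2: re-rank only the surviving candidates with a more expensive
  // measure, e.g. an edit-distance-based score (assumed to exist elsewhere):
  // const ranked = [...candidateIds(query, index)]
  //   .map((id) => ({ id, score: exactScore(query, docs[id]) }))
  //   .sort((a, b) => b.score - a.score);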
Conclusion
N-gram similarity is a powerful and flexible technique in the field of fuzzy text matching. It captures local similarities by breaking text into smaller chunks, and it performs well on spelling errors, OCR output, and multilingual text. While it has some limitations, N-gram similarity provides a good balance between computational efficiency and matching quality in many practical applications.
In our Fuzzy Text Matching Tool, N-gram similarity is one of the core algorithms, providing users with efficient and accurate text similarity calculation capabilities. By combining it with other algorithms such as edit distance, cosine similarity, and Jaccard similarity, our tool can meet various text matching needs.
Related Articles
Understanding Levenshtein Distance in Text Matching
Learn how the Levenshtein algorithm calculates the minimum number of single-character edits required to change one word into another.
Cosine Similarity vs. Jaccard Index: Which to Choose?
A comparative analysis of two popular similarity measures and guidelines on when to use each.