N-gram Similarity: Breaking Down Text Comparison

Tags: algorithm, ngram, text-similarity

In the world of fuzzy text matching, N-gram similarity is a powerful and flexible technique that measures how alike two texts are by breaking them into smaller chunks. This approach is particularly effective for spelling errors, OCR output, and multilingual text.

What are N-grams?

N-grams are contiguous sequences of N items (characters, words, etc.) extracted from text. Depending on the value of N, we have different types of N-grams:

  • Unigram (1-gram): Single items, such as individual characters or words
  • Bigram (2-gram): Two consecutive items
  • Trigram (3-gram): Three consecutive items
  • And so on...

For example, for the string "hello", the character-level bigrams (2-grams) are: he, el, ll, lo
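To make this concrete, here is a minimal sketch of character-level N-gram extraction. The getNGrams helper is our own name for illustration, not part of the tool's API:

// Extract the set of character-level N-grams from a string
function getNGrams(str: string, n: number): Set<string> {
  const ngrams = new Set<string>();
  for (let i = 0; i <= str.length - n; i++) {
    ngrams.add(str.substring(i, i + n));
  }
  return ngrams;
}

console.log([...getNGrams("hello", 2)]); // ["he", "el", "ll", "lo"]
console.log([...getNGrams("hello", 3)]); // ["hel", "ell", "llo"]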

How N-gram Similarity Works

N-gram similarity is computed from the overlap of N-grams between two texts. The calculation typically involves these steps:

  1. Extract N-grams from both texts
  2. Calculate the size of the intersection between the two N-gram sets
  3. Calculate the final similarity using a similarity coefficient (such as Dice coefficient, Jaccard coefficient, etc.)

Using the Dice coefficient as an example, the N-gram similarity formula is:

Similarity = 2 × |A ∩ B| / (|A| + |B|)

Where A and B are the N-gram sets of the two texts, |A ∩ B| is the size of the intersection, and |A| and |B| are the sizes of the two sets respectively.
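For example, take A = the bigrams of "night" = {ni, ig, gh, ht} and B = the bigrams of "nacht" = {na, ac, ch, ht}. The only shared bigram is ht, so |A ∩ B| = 1 and the similarity is 2 × 1 / (4 + 4) = 0.25.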

Advantages of N-gram Similarity

1. Strong Fault Tolerance

The N-gram method has strong fault tolerance for spelling errors and text variations. Even if there are errors in the text, many N-grams can still match, resulting in reasonable similarity scores.

2. Language Independence

The N-gram method does not rely on language-specific rules or dictionaries, so it can be applied to text in any language, including non-space-delimited languages like Chinese and Japanese.

3. Computational Efficiency

Compared to some algorithms that require complex dynamic programming (such as edit distance), N-gram similarity calculation is usually more efficient, especially for long texts.

4. Capturing Local Similarities

The N-gram method excels at capturing local similar patterns in text, even if the overall text structure differs significantly.

Practical Application Examples

Spell Correction

If a user inputs "programing" (missing an 'm'), N-gram similarity can identify that "programming" is the most similar correct spelling.

Document Deduplication

In large document repositories, N-gram similarity can be used to identify similar or duplicate content, even if the format or parts of the content differ.

Language Identification

By comparing the N-gram distribution of unknown text with samples of different languages, the most likely language of the text can be determined.

Plagiarism Detection

N-gram similarity can be used to detect similar sections between documents, helping to identify potential plagiarism.

Our Implementation

In our Fuzzy Text Matching Tool, we've implemented a character-level bi-gram (2-gram) similarity algorithm. Here's the core implementation code:

// Note: calculateEditDistance is assumed to come from the tool's
// edit-distance module; the exact import path here is illustrative.
import { calculateEditDistance } from './editDistance';

/**
 * Calculate the N-gram similarity between two strings
 * Using character-level bi-grams (n=2)
 * @param str1 The first string
 * @param str2 The second string
 * @param threshold Optional threshold, if set, returns null when similarity is less than threshold
 * @returns Similarity (between 0 and 1), or null when below threshold
 */
export function calculateNGramSimilarity(str1: string, str2: string, threshold?: number): number | null {
  // If either string is too short to form a bigram, fall back to a
  // normalized edit-distance similarity
  if (str1.length < 2 || str2.length < 2) {
    const maxLength = Math.max(str1.length, str2.length);
    if (maxLength === 0) return 1; // Two empty strings are considered identical
    
    const editDistance = calculateEditDistance(str1, str2);
    // Defensive: propagate null if the edit-distance helper bails out early
    if (editDistance === null) return null;
    
    const similarity = 1 - editDistance / maxLength;
    
    // Check similarity threshold
    if (threshold !== undefined && similarity < threshold) {
      return null;
    }
    return similarity;
  }
  
  // Generate bi-grams
  const getBigrams = (str: string): Set<string> => {
    const bigrams = new Set<string>();
    for (let i = 0; i < str.length - 1; i++) {
      bigrams.add(str.substring(i, i + 2));
    }
    return bigrams;
  };
  
  const bigrams1 = getBigrams(str1);
  const bigrams2 = getBigrams(str2);
  
  // Calculate intersection size
  let intersection = 0;
  for (const bigram of bigrams1) {
    if (bigrams2.has(bigram)) {
      intersection++;
    }
  }
  
  // Calculate Dice coefficient
  const totalBigrams = bigrams1.size + bigrams2.size;
  if (totalBigrams === 0) return 1; // Prevent division by zero
  
  const similarity = (2 * intersection) / totalBigrams;
  
  // Check similarity threshold
  if (threshold !== undefined && similarity < threshold) {
    return null;
  }
  
  return similarity;
}
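As a quick usage sketch, here is how the function could rank spelling-correction candidates; the candidate list is illustrative:

// Rank candidate corrections for the misspelled input "programing",
// discarding anything below a 0.5 similarity threshold
const candidates = ["programming", "programs", "diagramming"];
const input = "programing";

const ranked = candidates
  .map(word => ({ word, score: calculateNGramSimilarity(input, word, 0.5) }))
  .filter((r): r is { word: string; score: number } => r.score !== null)
  .sort((a, b) => b.score - a.score);

console.log(ranked[0].word); // "programming" (score 2 × 9 / (9 + 10) ≈ 0.95)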

Comparing N-gram Similarity with Other Fuzzy Matching Algorithms

| Algorithm | Advantages | Disadvantages |
| --- | --- | --- |
| N-gram Similarity | Strong fault tolerance; language independence; high computational efficiency | Not sensitive to character order; choice of N affects results |
| Edit Distance | Considers character position; intuitive, easy-to-interpret results | High computational cost; not suitable for long texts |
| Cosine Similarity | Considers word frequency; unaffected by text length | Ignores word order; requires vectorization |

Choosing the Right N Value

The choice of N value has an important impact on N-gram similarity results:

  • Smaller N values (such as 2 or 3): Provide higher fault tolerance but may lead to more false positive matches
  • Larger N values: Provide more precise matching but reduce tolerance for spelling errors and variations

In practice, character-level bi-grams (N=2) or tri-grams (N=3) usually strike a good balance for text similarity calculations; for word-level N-grams, 1-grams (single words) through 3-grams are common. The sketch below illustrates how the score tightens as N grows.
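A small sketch of this trade-off; diceForN is an illustrative helper, not part of the tool's API:

// Dice similarity over character-level N-grams, for a configurable N
function diceForN(a: string, b: string, n: number): number {
  const grams = (s: string): Set<string> => {
    const out = new Set<string>();
    for (let i = 0; i <= s.length - n; i++) out.add(s.substring(i, i + n));
    return out;
  };
  const ga = grams(a);
  const gb = grams(b);
  let shared = 0;
  for (const g of ga) if (gb.has(g)) shared++;
  return (2 * shared) / (ga.size + gb.size);
}

// Larger N penalizes the single-character difference more heavily:
console.log(diceForN("programing", "programming", 2).toFixed(2)); // "0.95"
console.log(diceForN("programing", "programming", 3).toFixed(2)); // "0.82"
console.log(diceForN("programing", "programming", 4).toFixed(2)); // "0.67"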

Optimization Tips

1. Threshold Filtering

When processing large volumes of text comparisons, setting a similarity threshold and only returning results that exceed the threshold can improve efficiency.

2. Index Optimization

For large text collections that need frequent comparison, pre-computing and storing N-grams for each text and building an inverted index can speed up queries.
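A minimal sketch of such an inverted index, assuming bigram keys and numeric document ids (the function names are ours for illustration):

// Map each bigram to the set of document ids containing it
function buildBigramIndex(docs: string[]): Map<string, Set<number>> {
  const index = new Map<string, Set<number>>();
  docs.forEach((doc, id) => {
    for (let i = 0; i < doc.length - 1; i++) {
      const bigram = doc.substring(i, i + 2);
      if (!index.has(bigram)) index.set(bigram, new Set());
      index.get(bigram)!.add(id);
    }
  });
  return index;
}

// Only documents sharing at least one bigram with the query need a
// full similarity computation
function candidateIds(index: Map<string, Set<number>>, query: string): Set<number> {
  const ids = new Set<number>();
  for (let i = 0; i < query.length - 1; i++) {
    for (const id of index.get(query.substring(i, i + 2)) ?? []) ids.add(id);
  }
  return ids;
}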

3. Hybrid Algorithms

In some applications, combining N-gram similarity with other algorithms (such as edit distance) can yield better results. For example, N-gram similarity can be used to quickly filter candidates, followed by edit distance for precise ranking.
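A two-stage match might look like the following sketch, which reuses calculateNGramSimilarity from above and the assumed calculateEditDistance helper; the 0.4 filter threshold is illustrative:

// Stage 1: cheap N-gram filter; Stage 2: precise edit-distance ranking
function bestMatch(query: string, candidates: string[]): string | null {
  const shortlist = candidates.filter(
    c => calculateNGramSimilarity(query, c, 0.4) !== null
  );
  let best: string | null = null;
  let bestDistance = Infinity;
  for (const c of shortlist) {
    const d = calculateEditDistance(query, c); // costlier, but position-aware
    if (d !== null && d < bestDistance) {
      bestDistance = d;
      best = c;
    }
  }
  return best;
}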

Conclusion

N-gram similarity is a powerful and flexible technique in the field of fuzzy text matching. It captures local similarities by breaking text into smaller chunks, and it performs well on spelling errors, OCR output, and multilingual text. While it has limitations, N-gram similarity offers a good balance between computational efficiency and matching quality in many practical applications.

In our Fuzzy Text Matching Tool, N-gram similarity is one of the core algorithms, providing users with efficient and accurate text similarity calculation capabilities. By combining it with other algorithms such as edit distance, cosine similarity, and Jaccard similarity, our tool can meet various text matching needs.
