N-gram Similarity is a powerful and flexible algorithm in the field of fuzzy text matching that calculates text similarity by breaking text into smaller chunks (N-grams). This documentation details the principles, implementation, and application scenarios of this algorithm.
N-grams are contiguous sequences of N items (characters, words, etc.) extracted from text. N-gram similarity calculates similarity by comparing the overlap of N-grams between two texts.
Using the Dice coefficient, the formula for calculating N-gram similarity is:
Similarity = 2 × |A ∩ B| / (|A| + |B|)

Where:
- A and B are the N-gram sets of the two texts
- |A ∩ B| is the size of their intersection
- |A| and |B| are the sizes of the two sets respectively
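As a quick worked example of the formula (plain TypeScript; the strings and helper name are illustrative), consider the bigram sets of "night" and "nacht":

```typescript
// Character-level bigrams: "night" -> {ni, ig, gh, ht}, "nacht" -> {na, ac, ch, ht}
const bigrams = (s: string): Set<string> => {
  const out = new Set<string>();
  for (let i = 0; i < s.length - 1; i++) out.add(s.substring(i, i + 2));
  return out;
};

const a = bigrams("night"); // 4 bigrams
const b = bigrams("nacht"); // 4 bigrams
let shared = 0;
for (const g of a) if (b.has(g)) shared++; // only "ht" is shared

const dice = (2 * shared) / (a.size + b.size);
console.log(dice); // 2 × 1 / (4 + 4) = 0.25
```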
Below is the TypeScript implementation of the N-gram similarity algorithm used in our Fuzzy Text Matching Tool, using character-level bi-grams (N=2):
```typescript
/**
 * Calculate the N-gram similarity between two strings
 * using character-level bi-grams (n = 2).
 * @param str1 The first string
 * @param str2 The second string
 * @param threshold Optional threshold; if set, returns null when the similarity is below it
 * @returns Similarity in [0, 1], or null when below the threshold
 */
export function calculateNGramSimilarity(str1: string, str2: string, threshold?: number): number | null {
  // Strings shorter than 2 characters have no bi-grams,
  // so fall back to edit-distance similarity instead
  if (str1.length < 2 || str2.length < 2) {
    const maxLength = Math.max(str1.length, str2.length);
    if (maxLength === 0) return 1; // Two empty strings are considered identical
    // calculateEditDistance is implemented elsewhere in the tool;
    // it may return null when invoked with a threshold, so guard for that
    const editDistance = calculateEditDistance(str1, str2);
    if (editDistance === null) return null;
    const similarity = 1 - editDistance / maxLength;
    // Check similarity threshold
    if (threshold !== undefined && similarity < threshold) {
      return null;
    }
    return similarity;
  }

  // Generate the set of bi-grams for a string
  const getBigrams = (str: string): Set<string> => {
    const bigrams = new Set<string>();
    for (let i = 0; i < str.length - 1; i++) {
      bigrams.add(str.substring(i, i + 2));
    }
    return bigrams;
  };

  const bigrams1 = getBigrams(str1);
  const bigrams2 = getBigrams(str2);

  // Calculate intersection size
  let intersection = 0;
  for (const bigram of bigrams1) {
    if (bigrams2.has(bigram)) {
      intersection++;
    }
  }

  // Calculate the Dice coefficient
  const totalBigrams = bigrams1.size + bigrams2.size;
  if (totalBigrams === 0) return 1; // Prevent division by zero
  const similarity = (2 * intersection) / totalBigrams;

  // Check similarity threshold
  if (threshold !== undefined && similarity < threshold) {
    return null;
  }
  return similarity;
}
```
```typescript
// Calculate the N-gram similarity between two strings
const similarity = calculateNGramSimilarity("hello", "hallo");
console.log(similarity); // Output: 0.5 (they share the "ll" and "lo" bi-grams: 2 × 2 / (4 + 4) = 0.5)

// With a threshold of 0.6, similarities below 0.6 are returned as null
const thresholded = calculateNGramSimilarity("hello", "hallo", 0.6);
console.log(thresholded); // Output: null (the actual similarity, 0.5, is below the threshold of 0.6)
```
The choice of N has an important impact on N-gram similarity results: smaller values of N tolerate more variation between texts but lose precision, while larger values demand longer exact runs of matching characters. In our implementation, we chose N=2 (bigram) because it provides a good balance between fault tolerance and precision in most application scenarios.
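To illustrate how N changes the chunks being compared, here is a generalized extractor (a sketch; `getNGrams` with an `n` parameter is a hypothetical helper, not part of the tool's API):

```typescript
// Hypothetical helper: extract character-level N-grams for any N >= 1
function getNGrams(str: string, n: number): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i <= str.length - n; i++) {
    grams.add(str.substring(i, i + n));
  }
  return grams;
}

// Smaller N produces more, shorter chunks (higher fault tolerance);
// larger N produces fewer, longer chunks (higher precision).
console.log([...getNGrams("hello", 2)]); // ["he", "el", "ll", "lo"]
console.log([...getNGrams("hello", 3)]); // ["hel", "ell", "llo"]
```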
N-gram similarity can be used to identify and correct spelling errors by comparing the similarity between misspelled words and dictionary words to find the most likely correct spelling.
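A minimal sketch of this idea, assuming an illustrative dictionary (`bigramDice` below is a compact stand-in for the `calculateNGramSimilarity` function above, inlined so the snippet is self-contained):

```typescript
// Compact bigram Dice similarity, equivalent to the core of calculateNGramSimilarity
function bigramDice(a: string, b: string): number {
  const grams = (s: string): Set<string> => {
    const g = new Set<string>();
    for (let i = 0; i < s.length - 1; i++) g.add(s.substring(i, i + 2));
    return g;
  };
  const ga = grams(a);
  const gb = grams(b);
  let shared = 0;
  for (const g of ga) if (gb.has(g)) shared++;
  const total = ga.size + gb.size;
  return total === 0 ? 1 : (2 * shared) / total;
}

// Suggest the dictionary word most similar to a (possibly misspelled) input
function suggestSpelling(word: string, dictionary: string[]): string {
  let best = dictionary[0];
  let bestScore = -1;
  for (const candidate of dictionary) {
    const score = bigramDice(word.toLowerCase(), candidate.toLowerCase());
    if (score > bestScore) {
      bestScore = score;
      best = candidate;
    }
  }
  return best;
}

console.log(suggestSpelling("helo", ["hello", "help", "halo"])); // "hello"
```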
In large document repositories, N-gram similarity can be used to identify similar or duplicate content, even if the format or parts of the content differ.
By comparing the N-gram distribution of unknown text with samples of different languages, the most likely language of the text can be determined.
N-gram similarity can be used to detect similar sections between documents, helping to identify potential plagiarism.
In our implementation, setting a similarity threshold lets matches below the threshold be reported as null, so downstream processing of weak matches can be skipped, thereby improving overall efficiency.
For strings with lengths less than 2, we use edit distance similarity as a substitute, ensuring the algorithm works properly in all cases.
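The `calculateEditDistance` helper used by this fallback is defined elsewhere in the tool and not reproduced here; a minimal Levenshtein-distance sketch (omitting the tool's optional threshold handling) might look like:

```typescript
// Minimal Levenshtein distance using dynamic programming over a single row.
// Sketch only: the tool's calculateEditDistance also supports an optional
// threshold for early termination, which is omitted here.
function levenshtein(a: string, b: string): number {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// For one-character strings, e.g. "a" vs "b": distance 1, so similarity = 1 - 1/1 = 0
console.log(levenshtein("a", "b")); // 1
```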
We use JavaScript's Set data structure to store N-grams, which automatically removes duplicates and provides efficient lookup operations.
| Algorithm | Advantages | Disadvantages | Best Application Scenarios |
|---|---|---|---|
| N-gram Similarity | Strong fault tolerance, high computational efficiency | Not sensitive to character order | Spell correction, language identification |
| Edit Distance | Considers character position, intuitive results | High computational cost | Short text comparison, exact matching |
| Cosine Similarity | Considers word frequency, not affected by length | Ignores word order | Document classification, information retrieval |
| Jaccard Similarity | Simple and intuitive, suitable for set comparison | Ignores element frequency | Tag comparison, set similarity |
N-gram similarity is a powerful and flexible algorithm in the field of fuzzy text matching, particularly suitable for applications requiring high fault tolerance and computational efficiency. By breaking text into smaller chunks, it effectively captures local similarities and works well even in the presence of spelling errors or text variations.
In our Fuzzy Text Matching Tool, N-gram similarity is one of the core algorithms, providing users with efficient and accurate text similarity calculation capabilities. By combining it with other algorithms such as edit distance, cosine similarity, and Jaccard similarity, our tool can meet various text matching needs.
© 2023 Fuzzy Text Matching Tool. All rights reserved.