Edit Distance (also known as Levenshtein Distance) is one of the most fundamental and widely used algorithms in the field of fuzzy text matching. This documentation details the principles, implementation, and application scenarios of this algorithm.
Edit Distance is defined as the minimum number of single-character edit operations (insertions, deletions, or substitutions) required to transform one string into another.
Given two strings a and b, their edit distance d(a,b) is defined as:
d(a,b) = min { d(a[1...i-1], b[1...j-1]) + cost(a[i], b[j]), // Substitution d(a[1...i], b[1...j-1]) + 1, // Insertion d(a[1...i-1], b[1...j]) + 1 // Deletion }
where cost(a[i], b[j]) is 0 if a[i] = b[j], or 1 if a[i] ≠ b[j].
Below is the TypeScript implementation of the Edit Distance algorithm used in our Fuzzy Text Matching Tool:
/**
* Calculate the edit distance (Levenshtein distance) between two strings
* The smaller the edit distance, the more similar the strings
* @param str1 The first string
* @param str2 The second string
* @param threshold Optional threshold, if set, returns null when the edit distance exceeds the threshold
* @returns The edit distance, or null when exceeding the threshold
*/
export function calculateEditDistance(str1: string, str2: string, threshold?: number): number | null {
const m = str1.length;
const n = str2.length;
// If the difference in string lengths already exceeds the threshold, we can return null early
if (threshold !== undefined && Math.abs(m - n) > threshold) {
return null;
}
// Create the distance matrix
const dp: number[][] = Array(m + 1).fill(null).map(() => Array(n + 1).fill(0));
// Initialize the first row and column
for (let i = 0; i <= m; i++) {
dp[i][0] = i;
}
for (let j = 0; j <= n; j++) {
dp[0][j] = j;
}
// Fill the distance matrix
for (let i = 1; i <= m; i++) {
// Track the minimum value in the current row
let rowMin = Infinity;
for (let j = 1; j <= n; j++) {
if (str1[i - 1] === str2[j - 1]) {
dp[i][j] = dp[i - 1][j - 1];
} else {
dp[i][j] = Math.min(
dp[i - 1][j] + 1, // Deletion
dp[i][j - 1] + 1, // Insertion
dp[i - 1][j - 1] + 1 // Substitution
);
}
// Update the minimum value in the current row
rowMin = Math.min(rowMin, dp[i][j]);
}
// If the minimum value in the current row already exceeds the threshold, return null early
if (threshold !== undefined && rowMin > threshold) {
return null;
}
}
return dp[m][n];
}
Our implementation includes several key performance optimizations:
By setting a threshold parameter, we can terminate calculations early for distances that exceed the threshold, greatly improving processing efficiency. This is particularly useful when comparing large volumes of text.
Before starting the calculation, we first check the difference in length between the two strings. If the length difference already exceeds the threshold, we can return null directly, avoiding unnecessary calculations.
During the matrix filling process, we track the minimum value in each row. If the minimum value in an entire row exceeds the threshold, we can terminate the calculation early.
// Calculate the edit distance between two strings
const distance = calculateEditDistance("kitten", "sitting");
console.log(distance); // Output: 3
// Set a threshold of 2, if the edit distance is greater than 2, return null
const distance = calculateEditDistance("kitten", "sitting", 2);
console.log(distance); // Output: null (because the actual distance is 3, exceeding the threshold of 2)
The Edit Distance algorithm can be used to identify and correct spelling errors. When a user inputs a potentially misspelled word, the system can calculate the edit distance between that word and all words in a dictionary, recommending those with the smallest distances as possible correct spellings.
In search functionality, edit distance can be used to provide "Did you mean..." suggestions, even when the user's query contains spelling errors or variations.
In data cleaning processes, edit distance can be used to identify and merge similar but not identical records, such as names or addresses with slight variations.
In DNA sequence analysis, edit distance is used to compare the similarity of gene sequences, helping to study gene mutations and evolutionary relationships.
Algorithm | Advantages | Disadvantages | Best Application Scenarios |
---|---|---|---|
Edit Distance | Intuitive, precise, considers character position | High computational cost, doesn't consider semantics | Spell correction, short text comparison |
N-gram Similarity | Captures local patterns, computationally efficient | Not sensitive to character order | Document similarity, language identification |
Cosine Similarity | Suitable for high-dimensional vectors, not affected by text length | Ignores word order, requires vectorization | Document classification, information retrieval |
Jaccard Similarity | Simple, intuitive, suitable for set comparison | Ignores element frequency and position | Tag comparison, set similarity |
Edit Distance is a foundational algorithm in the field of fuzzy text matching, providing an intuitive and effective method to quantify the difference between strings. Despite its limitations, it remains a core component of many text processing applications, especially in scenarios involving spelling errors, text comparison, and fuzzy search.
In our Fuzzy Text Matching Tool, the Edit Distance algorithm is one of the core functionalities, providing users with precise text similarity calculation capabilities. By combining it with other algorithms such as N-gram Similarity, Cosine Similarity, and Jaccard Similarity, our tool can meet various text matching needs.
© 2023 Fuzzy Text Matching Tool. All rights reserved.