KMP / Z Algorithm¶
The Knuth-Morris-Pratt (KMP) and Z algorithms are efficient pattern matching algorithms that find all occurrences of a pattern in a text string. Both algorithms achieve linear time complexity O(n + m) where n is the text length and m is the pattern length.
KMP Algorithm¶
The KMP algorithm uses a failure function (also called LPS - Longest Proper Prefix which is also a Suffix) to avoid unnecessary character comparisons.
How It Works¶
- Preprocessing: Build a failure function that determines the longest proper prefix of the pattern that is also a suffix
- Matching: Use the failure function to skip characters that are guaranteed to match
KMP Implementation¶
#include <vector>
#include <string>
using namespace std;
// Build failure function (LPS array)
vector<int> buildFailureFunction(string pattern) {
int m = pattern.length();
vector<int> lps(m, 0);
int len = 0; // Length of the previous longest prefix suffix
for (int i = 1; i < m; i++) {
if (pattern[i] == pattern[len]) {
len++;
lps[i] = len;
} else {
if (len != 0) {
len = lps[len - 1];
i--; // Don't increment i
} else {
lps[i] = 0;
}
}
}
return lps;
}
// Find all occurrences of pattern in text
vector<int> kmpSearch(string text, string pattern) {
vector<int> lps = buildFailureFunction(pattern);
vector<int> matches;
int n = text.length();
int m = pattern.length();
int i = 0; // Index for text
int j = 0; // Index for pattern
while (i < n) {
if (pattern[j] == text[i]) {
i++;
j++;
}
if (j == m) {
matches.push_back(i - j); // Pattern found at position i-j
j = lps[j - 1];
} else if (i < n && pattern[j] != text[i]) {
if (j != 0) {
j = lps[j - 1];
} else {
i++;
}
}
}
return matches;
}
Z Algorithm¶
The Z algorithm builds a Z-array where Z[i] represents the length of the longest substring starting from position i that is also a prefix of the string.
Z Algorithm Implementation¶
#include <vector>
#include <string>
using namespace std;
// Build Z-array for a string
vector<int> buildZArray(string s) {
int n = s.length();
vector<int> z(n, 0);
int l = 0, r = 0; // Left and right boundaries of the Z-box
for (int i = 1; i < n; i++) {
if (i <= r) {
z[i] = min(r - i + 1, z[i - l]);
}
while (i + z[i] < n && s[z[i]] == s[i + z[i]]) {
z[i]++;
}
if (i + z[i] - 1 > r) {
l = i;
r = i + z[i] - 1;
}
}
return z;
}
// Find all occurrences of pattern in text using Z algorithm
vector<int> zSearch(string text, string pattern) {
string combined = pattern + "$" + text;
vector<int> z = buildZArray(combined);
vector<int> matches;
int m = pattern.length();
for (int i = m + 1; i < z.size(); i++) {
if (z[i] == m) {
matches.push_back(i - m - 1);
}
}
return matches;
}
Example: Pattern Matching¶
Problem: Find all occurrences of a pattern string in a text string.
Sample Input:
- text = "ABABDABACDABABCABAB"
- pattern = "ABABCABAB"
Sample Output:
- KMP matches at positions: 10
- Z algorithm matches at positions: 10
#include <iostream>
#include <vector>
using namespace std;
int main() {
string text = "ABABDABACDABABCABAB";
string pattern = "ABABCABAB";
// Using KMP
vector<int> kmpMatches = kmpSearch(text, pattern);
cout << "KMP matches at positions: ";
for (int pos : kmpMatches) {
cout << pos << " ";
}
cout << endl;
// Using Z algorithm
vector<int> zMatches = zSearch(text, pattern);
cout << "Z algorithm matches at positions: ";
for (int pos : zMatches) {
cout << pos << " ";
}
cout << endl;
return 0;
}
Advanced Applications¶
Multiple Pattern Matching¶
Problem: Find all occurrences of multiple patterns in a text string.
Sample Input:
- text = "ushers"
- patterns = ["he", "she", "his", "hers"]
Sample Output:
#include <vector>
#include <string>
using namespace std;
// Find all patterns in text using KMP
vector<pair<string, vector<int>>> findMultiplePatterns(string text, vector<string> patterns) {
vector<pair<string, vector<int>>> results;
for (string pattern : patterns) {
vector<int> matches = kmpSearch(text, pattern);
results.push_back({pattern, matches});
}
return results;
}
Longest Common Substring¶
Problem: Find the longest common substring between two strings.
Sample Input:
- s1 = "ABABC"
- s2 = "BABCA"
Sample Output: "BABC"
#include <string>
#include <vector>
using namespace std;
string longestCommonSubstring(string s1, string s2) {
string combined = s1 + "$" + s2;
vector<int> z = buildZArray(combined);
int maxLen = 0;
int startPos = 0;
for (int i = s1.length() + 1; i < z.size(); i++) {
if (z[i] > maxLen) {
maxLen = z[i];
startPos = i - s1.length() - 1;
}
}
return s2.substr(startPos, maxLen);
}
Use Cases¶
- Text Search: Finding all occurrences of a pattern in text
- DNA Sequence Analysis: Pattern matching in biological sequences
- Plagiarism Detection: Finding similar text patterns
- String Compression: LZ77 and similar algorithms
- Regular Expression Matching: Building regex engines
Complexity Analysis¶
KMP Algorithm¶
- Time Complexity: O(n + m) for preprocessing and searching
- Space Complexity: O(m) for the failure function
Z Algorithm¶
- Time Complexity: O(n + m) for building Z-array and searching
- Space Complexity: O(n + m) for the Z-array
When to Use Which Algorithm¶
Use KMP when:¶
- You need to find all occurrences of a single pattern
- Memory usage is a concern (KMP uses less space)
- You need the failure function for other applications
Use Z Algorithm when:¶
- You need to find the longest common prefix at each position
- You're working with multiple pattern matching
- You need to find palindromes or other string properties
Optimization Tips¶
- Early Termination: Stop searching if only the first occurrence is needed
- Memory Optimization: Use Z algorithm for single-pass applications
- Pattern Preprocessing: Cache failure functions for repeated pattern searches
- String Concatenation: Minimize string operations in the Z algorithm