# 文本匹配算法

## Rabin-Karp 算法

### overview

hash函数的作用是将每一个字符串都转化为一个数字 —— 称之为哈希值（hash value）。例如，hash(“hello”)=5。如果两个字符串相等，他们的哈希值也是相等的。对于一个设计的很好的hash函数而言，在近似意义上来说，情况恰好相反：不同的字符串，及不可能会有相同的值。RK 算法通过在文本的每个位置计算从该位置开始与模式长度相等的字符串的哈希值来推进。如果当前哈希值等于模式的hash值，它就会从此位置开始做一个全面的比较。

### algorithm

function RabinKarp(string s[1..n], string pattern[1..m])
hpattern := hash(pattern[1..m]);
for i from 1 to n-m+1
hs := hash(s[i..i+m-1])
if hs = hpattern
if s[i..i+m-1] = pattern[1..m]
return i


s[i+1..i+m] = s[i..i+m-1] - s[i] + s[i+m]

### Hash function

hash函数的选取是RK算法性能提高的关键所在。

Rabin fingerprint

[(104 × 256 ) % 101  + 105] % 101  =  65
(ASCII of 'h' is 104 and of 'i' is 105)


### longest-duplicate-substring

1044. Longest Duplicate Substring

description

degree of difficulty: $\color{#e91e63}{Hard}$

Given a string S, consider all duplicated substrings: (contiguous) substrings of S that occur 2 or more times. (The occurrences may overlap.)

Return any duplicated substring that has the longest possible length. (If S does not have a duplicated substring, the answer is “”.)

Example 1:

Input: "banana"
Output: "ana"
Example 2:

Input: "abcd"
Output: ""


Note:

• 2 <= S.length <= 10^5
• S consists of lowercase English letters.

C++ solution using Rabin Karp and binary search with detailed explaination

int low = 0, high = S.length();
while (low <= high) {
int mid = low + (high - low) / 2;
string tmp = validate(mid, S);
if (tmp.length() == 0) {
high = mid - 1;
} else {
if (tmp.length() > ans.length()) {
ans = tmp;
}
low = mid + 1;
}
}


int prime = 19260817;
vector<int> power;
power = vector<int>(S.length(), 1);
for (i = 1 ; i < S.length(); i++) {
power[i] = (power[i - 1] * 26) % prime;
}


for (i = 0 ; i < desire; i++) {
current = ((current * 26) % prime + (str[i] - 'a')) % prime;
}


hash[current] = vector<int>(1, 0);
for (i = desire ; i < str.length(); i++) {
// sliding window to maintain the current substr's hash value
// be aware of overflow
current = ((current - (long long) power[desire - 1] * (str[i - desire] - 'a')) % prime + prime) % prime;
current = (current * 26 + (str[i] - 'a')) % prime;
// if that hash value is not in our set we do nothing and add the value to our map
if (hash.find(current) == hash.end()) {
hash[current] = vector<int>(1, i - desire + 1);
} else {
// otherwise, start a string by string comparason and see if there's a match
for (auto it : hash[current]) {

if (strcmp((str.substr(it, desire)).data(), str.substr(i - desire + 1, desire).data()) == 0) {
return str.substr(it, desire);
}
}

hash[current].push_back(i - desire + 1);
}
}


This is a tricky one on two sides:

1. how to find the length of the lonest string
2. how to compare the string of the same length

For the first point, we can use binary search for answer since if a string of length n is invalid then for all k > n,
there's definetly no solution because length n strings would become a substring of the length k string.
Similarly if a string of length n is valid, we have no use of checking strings with length less than n.
Due to these properties we can use binary search for final answer.
For the second point, we are actually trying to compare a sliding window of string, and Rabin Karp algorithm is perfect for doing so.
The algorithm basically computes the hash value of all the string and start a character by character comparison only if the two strings have the same hash value.
In order to avoid collision we can use a large prime number.
Such as 1e9 + 7, 19260817, 99999989, etc.

The implementation looks as follows:


class Solution {
public:
string longestDupSubstring(string S) {
ans = "";
power = vector<int>(S.length(), 1);
int i;
// precompute all the pow(26, k) 0 < k < S.length() modulus prime
for (i = 1 ; i < S.length(); i++) {
power[i] = (power[i - 1] * 26) % prime;
}
int low = 0, high = S.length();
// code for the binary search, very trivial
while (low <= high) {
int mid = low + (high - low) / 2;
string tmp = validate(mid, S);
if (tmp.length() == 0) {
high = mid - 1;
} else {
if (tmp.length() > ans.length()) {
ans = tmp;
}
low = mid + 1;
}
}

return ans;
}

private:
// large prime number
int prime = 19260817;
string ans;
vector<int> power;
string validate(int desire, string &str) {
// if the desire length is 0, return the empty string
if (desire == 0) return "";
unordered_map<int, vector<int>> hash = unordered_map<int, vector<int>>();
long long current = 0;
int i;
// compute the hash value of the first "length" characters
for (i = 0 ; i < desire; i++) {
current = ((current * 26) % prime + (str[i] - 'a')) % prime;
}
// store the result in a hashmap that maps from hashvalue to starting index
hash[current] = vector<int>(1, 0);
for (i = desire ; i < str.length(); i++) {
// sliding window to maintain the current substr's hash value
// be aware of overflow
current = ((current - (long long) power[desire - 1] * (str[i - desire] - 'a')) % prime + prime) % prime;
current = (current * 26 + (str[i] - 'a')) % prime;
// if that hash value is not in our set we do nothing and add the value to our map
if (hash.find(current) == hash.end()) {
hash[current] = vector<int>(1, i - desire + 1);
} else {
// otherwise, start a string by string comparason and see if there's a match
for (auto it : hash[current]) {

if (strcmp((str.substr(it, desire)).data(), str.substr(i - desire + 1, desire).data()) == 0) {
return str.substr(it, desire);
}
}

hash[current].push_back(i - desire + 1);
}
}

return "";
}
};


## BM算法

Boyer–Moore string-search algorithm