-
Notifications
You must be signed in to change notification settings - Fork 115
Expand file tree
/
Copy pathaho_corasick.qbk
More file actions
143 lines (88 loc) · 6.85 KB
/
aho_corasick.qbk
File metadata and controls
143 lines (88 loc) · 6.85 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
[/ QuickBook Document version 1.5 ]
[section:AhoCorasick Aho-Corasick Search]
[/license
Copyright (c) 2016 Alexander Zaitsev
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt)
]
[heading Overview]
The header file 'aho_corasick.hpp' contains an implementation of the Aho-Corasick algorithm for searching sequences of values. It is primarily used to search for multiple patterns within a corpus.
The Aho-Corasick algorithm works by building a trie (a tree with each node corresponding to an object) of the patterns sequences and traversing the trie to search for the pattern in a given corpus sequence. Additionally, the Aho-Corasick introduced the concept of "failure pointer/failure node" which is the node to be traversed when there is a mismatch.
The algorithm was conceived in 1975 by Alfred V. Aho and Margaret J. Corasick. Their paper "Efficient string matching: An aid to bibliographic search" was published in the Communications of the ACM.
Nomenclature: The nomenclature is similar to that of the Knuth Morris Pratt implementation in Boost.Algorithm. The sequence being searched for is referred to as the "pattern", and the sequence being searched in is referred to as the "corpus".
See more in "Set Matching and Aho–Corasick Algorithm", lecture slides by Pekka Kilpeläinen(http://www.cs.uku.fi/~kilpelai/BSA05/lectures/slides04.pdf).
[heading Interface]
For flexibility, the Aho-Corasick algorithm has two interfaces; an object-based interface and a procedural one. The object-based interface builds the trie in the constructor, and uses 'find()' to make suffix links and perform the search. The procedural interface builds the trie(with building suffix links) and does the search all in one step. If you are going to be searching for the same pattern in multiple corpora, then you should use the object interface, and only build the tries once.
The header file 'aho_corasick.hpp' contains two versions of Aho-Corasick: based on std::map and std::unordered_map. Also there is class AhoCorasick, which you can customize. For every version this header file provide functional and object-based interfaces.
Procedural interfaces:
Procedural interfaces provide interfaces based on iterators.
For Aho-Corasick based on std::map:
``
template <typename T, typename Predicate = std::less<T>, typename RAIterator,
typename ForwardIterator, typename Callback>
bool aho_corasick_map ( RAIterator corpus_first, RAIterator corpus_last,
ForwardIterator pat_first, ForwardIterator pat_last,
Callback cb);
``
For Aho-Corasick based on std::unordered_map:
``
template <typename T, typename Hash = std::hash<T>, typename Comp = std::equal_to<T>, typename RAIterator,
typename ForwardIterator, typename Callback>
bool aho_corasick_hashmap ( RAIterator corpus_first, RAIterator corpus_last,
ForwardIterator pat_first, ForwardIterator pat_last,
Callback cb);
``
Object interface (typedefs):
``
template <typename T, typename Pred = std::less<T>>
using Aho_Corasick_Map = AhoCorasick<T, std::map, Pred>;
template <typename T, typename Hash = std::hash<T>, typename Comp = std::equal_to<T>>
using Aho_Corasick_HashMap = AhoCorasick<T, std::unordered_map, Hash, Comp>;
``
Interface (constructors, etc.) are equal for Aho_Corasick_Map, Aho_Corasick_HashMap and basical AhoCorasick:
``
AhoCorasick();
template<typename ForwardIterator>
explicit AhoCorasick(ForwardIterator patBegin, ForwardIterator patEnd);
template<typename Range>
explicit AhoCorasick(const Range& range);
template<typename ForwardIterator>
void insert(ForwardIterator begin, ForwardIterator end);
template<typename Range>
void insert(const Range& range);
template <typename RAIterator, typename Callback>
bool find(RAIterator begin, RAIterator end, Callback cb);
``
[heading Return value]
The 'find' method returns true, if all Callback callings return true, otherwise returns false.
[heading Requirements]
C++11-compatible compiler required.
For Aho_Corasick_HashMap and aho_corasick_hashmap: by default use std::hash<ValueType> for Hash and std::equal_to<ValueType> as Comparator. If you type doesn't support it, you must use your own functions for this. Without Hash and Comparator algorithm doesn't work.
For Aho_Corasick_Map and aho_corasick_map: by default use std::less<ValueType> as Predicate. If you type doesn't support it, you must use your own functions for this. Without Predicate algorithm doesn't work.
[heading Performance]
Performance of Aho_Corasick_Map and Aho_Corasick_HashMap is similar on small alphabets. On large alphabets Aho_Corasick_HashMap is faster than Aho_Corasick_Map. Remember, that getting hash of element is slow operation. Also if you use Aho_Corasick_HashMap, std::unordered_map can sometimes do rehash with O(Alphabet).
[heading Memory Use]
Every node of trie consist of container of std::shared_ptr to trie nodes, which you choose(std::map, std::unordered_map or maybe something else), two std::shared_ptr to trie nodes and std:vector<size_t> of length of patterns, which that ends in this node. Count of nodes is linear in the sum of the length of the patterns.
[heading Complexity]
Nomenclature: M - sum of the patterns length, N - length of the corpus, K - alphabet size, T - number of coincidences
std::unordered_map-based version:
Time: O(M + N + T), Memory: O(M)
std::map-based version:
Time: O((M + N)log(K) + T), Memory: O(M).
[heading Exception Safety]
Both the object-oriented and procedural versions of the Aho-Corasick algorithm take all their parameters by value(exclude output container, taked by non-const reference). Therefore, both interfaces provide the strong exception guarantee.
[heading Notes]
* When using the object-based interface, the pattern must remain unchanged for during the inserting.
* The Aho-Corasick algorithm requires forward iterators for patterns and random-access iterators for the corpus.
[heading Customization points]
For using Aho-Corasick algorithms you must use your own Callback(RAIterator, RAIterator) -> bool. This Callback must returns true if all is fine, otherwise false.
In Aho_Corasick_HashMap and aho_corasick_hashmap() you can customize: value type, hash and compare functions.
In Aho_Corasick_Map and aho_corasick_map() you can customize: value type and predicate.
In AhoCorasick you can customize: value type, type of container and any other template parameters. It container will be used in nodes of the trie. Defining of the container: Container<Value_type, std::shared_ptr<Node>, Args...>. So your other template parameters will be used as Args... . Also your container must support 'find' method.
[endsect]
[/ File aho_corasick.qbk
Copyright 2016 Alexander Zaitsev
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt).
]