algorithm/doc/musser_nishanov.qbk at e5977de7ef34f0d3535234c6d41234f879c2d786 · boostorg/algorithm · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
[/ QuickBook Document version 1.5 ]

[section:MusserNishanov Musser-Nishanov Search]

[/license

Copyright (c) 2017 Jeremy W. Murphy

Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at
http://www.boost.org/LICENSE_1_0.txt)
]


[heading Overview]

The header file 'musser_nishanov.hpp' contains an implementation of the Musser-Nishanov algorithm for sequence matching.

This algorithm was designed to be a generic sequence matching algorithm, extensible to any element type or character frequency by specialization of a traits class.
The algorithm and original code was written in 1997 by David Musser and Gor Nishanov.
Their paper is available from arXiv ([@https://arxiv.org/abs/0810.0264]) or from Dave Musser's homepage at Rensselaer Polytechnic Institute ([@http://www.cs.rpi.edu/~musser/gp/gensearch1.pdf]).

It is based on Knuth-Morris-Pratt (KMP) with the addition of a hash-coded form of the skip loop from Boyer-Moore. It has the same worst-case bound of 2n on the number of comparisons as KMP. The average case performance appears to be slightly better than Boyer-Moore-Horspool (which is in turn slightly better than better than Boyer-Moore).
It also features a fallback algorithm for when the corpus does not provide random-access iterators.

However, as is the case with the other accelerated search algorithms, it cannot be used with comparison predicates like `std::search`.

Nomenclature: I refer to the sequence being searched for as the "pattern", and the sequence being searched in as the "corpus".

[heading Interface]

For flexibility, the algorithm has two interfaces; an object-based interface and a procedural one. The object-based interface builds the tables in the constructor, and uses operator () to perform the search. The procedural interface builds the table and does the search all in one step. If you are going to be searching for the same pattern in multiple corpora, then you should use the object interface, and only build the tables once.

Here is the object interface:
``
template <typename patIter>
class musser_nishanov {
public:
    musser_nishanov ( patIter first, patIter last );
    ~musser_nishanov ();

    template <typename corpusIter>
    corpusIter operator () ( corpusIter corpus_first, corpusIter corpus_last );
    };
``

and here is the corresponding procedural interface:

``
template <typename patIter, typename corpusIter>
corpusIter musser_nishanov_search (
        corpusIter corpus_first, corpusIter corpus_last,
        patIter pat_first, patIter pat_last );
``

Each of the functions is passed two pairs of iterators. The first two define the corpus and the second two define the pattern. Note that the two pairs need not be of the same type, but they do need to "point" at the same type. In other words, `patIter::value_type` and `corpusIter::value_type` need to be the same type.

The return value of the function is an iterator pointing to the start of the pattern in the corpus. If the pattern is not found, it returns the end of the corpus (`corpus_last`).

[heading Performance]

The algorithm has the same 2['n] worst-case bound on the number of comparisons as
KMP. Similar to Boyer-Moore, the actual number of comparisons in practice will typically be much fewer, due to the skip loop feature, which provides the most dramatic effect on performance.

Execution time on UTF-8 string matching is comparable to Boyer-Moore-Horspool.
A traits class is provided for matching DNA sequences that optimizes the skip loop such that it searches 10x faster than it otherwise would. This traits class must be specified manually, so it is unfair to compare this result to another search algorithm without similar specialization.

Performance on types other than char has also been tested.
When the element is an unsigned short, performance with the default search traits is equivalent to std::search, whereas the Boyer-Moore family of algorithms tend to be 5-6x slower.


[heading Memory Use]

The algorithm allocates two internal tables. The first one is proportional to the length of the pattern; the second one has one entry for each member of the "alphabet" in the pattern. For (8-bit) character types, this table contains 256 entries.

[heading Complexity]

The worst-case performance to find a pattern in the corpus is ['O(N)] (linear) time; that is, proportional to the length of the corpus being searched. In general, the search is sub-linear; not every entry in the corpus need be checked.

[heading Exception Safety]

Both the object-oriented and procedural versions of the Boyer-Moore algorithm take their parameters by value and do not use any information other than what is passed in. Therefore, both interfaces provide the strong exception guarantee.

[heading Notes]

* When using the object-based interface, the pattern must remain unchanged for during the searches; i.e, from the time the object is constructed until the final call to operator () returns.

* The Boyer-Moore algorithm requires random-access iterators for both the pattern and the corpus.

[heading Customization points]

The Boyer-Moore object takes a traits template parameter which enables the caller to customize how one of the precomputed tables is stored. This table, called the skip table, contains (logically) one entry for every possible value that the pattern can contain. When searching 8-bit character data, this table contains 256 elements. The traits class defines the table to be used.

The default traits class uses a `boost::array` for small 'alphabets' and a `tr1::unordered_map` for larger ones.  The array-based skip table gives excellent performance, but could be prohibitively large when the 'alphabet' of elements to be searched grows. The unordered_map based version only grows as the number of unique elements in the pattern, but makes many more heap allocations, and gives slower lookup performance.

To use a different skip table, you should define your own skip table object and your own traits class, and use them to instantiate the Boyer-Moore object. The interface to these objects is described TBD.


[endsect]

[/ File musser_nishanov.qbk
Copyright 2017 Jeremy Murphy
Distributed under the Boost Software License, Version 1.0.
(See accompanying file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt).
]