Saturday, September 6, 2014

String matching with *, what is the complexity?

Imagine you have a long input string and a "regexp" in the following form abc*efg*etc*. Which means that you are looking for a string that starts with abc, followed by some number of whatever characters, followed by efg, followed by...
So, here is a claim: to answer whether the string matches the "regexp", the complexity is linear for the size of the string you are looking into + the size of the regexp.
Why?
We will use KMP string matching algorithm with O(n+m) complexity, but we will use a detail of implementation of algorithm: when it finds a string, it has not looked at more than offset to string + size of string characters out of n.

So how does our algorithm work?
- if we see a set of characters in the beginning of regexp, see whether regexp starts the same way as string till the *. from now on:
- if we see a set of characters in regexp, look for that.
- if we see a star, ignore that.
- when we arrive at the end of regexp, see whether the end of search string has been achieved as well (if * is end of regexp, then yes, otherwise look at offsets in search string).
- if any of the above things does not work, return failure.

What is the complexity? Whenever we search inside the string, we use KMP and it works in O(n+mi), where mi is the size of current set of characters we are looking for. Sum of all mi's is ~m - the length of regexp. If we have many mi's why isn't the complexity O(n*count mi's + m)? The reason is that when KMP finds a string, it stops. So, if a string is found, we have looked at all characters that where before the string + the string, but not more. In our algorithm it means that we never go through the string again! Or, in other words: in KMP subroutine we don't look at the same characters in two subsequent calls. So, the overall complexity - O(n+m). What if KMP fails? Well, then we don't call another KMP.

Here is KMP: http://www.ics.uci.edu/~eppstein/161/960227.html


3 comments:

  1. The problem of exact matching with wildcards can be solved in
    O(n log m) time. And this is only for single character wildcard '?'
    http://www.cs.bris.ac.uk/Publications/Papers/2000602.pdf

    The second paper below explains the issues of using KMP for this wildcard matching problem.
    General case when multi-character wild card '*' can be used http://www.cs.bris.ac.uk/Publications/Papers/2000602.pdf
    The paper outlines several algorithms from O(n log m log ) to O(n log n) to finally deterministic O(n log m) - the quickest option.
    http://arxiv.org/pdf/1407.0950.pdf

    ReplyDelete
  2. I am solving a different problem: not "whether there is a substring that matches regexp" but "does a string match the pattern in full".

    ReplyDelete
  3. You are right. This is different from the single full match we discussed.

    ReplyDelete