Tuesday 19 November 2013

C++ implementation of a Trie / Prefix Tree

A Trie (http://en.wikipedia.org/wiki/Trie) is a great way to search a stream of text for multiple keywords. It's extremely fast and is a very simple structure to understand. Here is my initial attempt in C++. It is a first cut, not yet fully tested but seems to work. It is case sensitive and will match partial words. The main disadvantage of a Trie is the memory consumption. Each letter requires a node that contains a character / bool and vector so adding words quickly chews up memory.

Here is a link to the source on github:

https://github.com/eskeels/trie

There is just the header file and a main.cpp with some tests. I have also added 2 additional methods, Compress() and ValidateState(). Compress() will recurse through the Trie calling shrink_to_fit() on each of the vectors that are used to store child nodes. ValidateState() performs validation of the nodes. It is only required for testing purposes.


Friday 8 November 2013

Generating C++ code from a Trie

If you need to search for a small amount of hard coded strings then this might be a good solution. Whilst the Trie is fast you have to contend with the overhead of initialising the Trie data structure and the memory usage. This version of the Trie has a dump() method that will generate a search() function capable of performing a parallel search of all the words in the Trie. Some simple tests have shown this to be considerably faster than using the Trie directly. The generated search() function is pretty ugly and unwieldy but so long as you don't have too many terms it could prove useful.

The prototype of the search method generated is below:

const char * search(const char * pStart,  const char * pbuff, CallbackFunction cf )

pStart is a pointer to the very start of the buff you are search. pbuff is a pointer to the position where you want to search from. Finally cf is a callback function that is invoked whenever a search term is found. A sample showing use of the search function is below:

while( p && (*p != '\0') )
{
p = search(&peterpan[0], p, &myCallback );
count++;
}

The callback is of the form:

int myCallback(const char * pStart, const char * pbuff, const char * resultString, const char * position)

pStart / pbuff are as per the search() method. resultString is a NULL terminated string containing the search term that has been found. position is a pointer to the last character of where it was found. The return type is currently ignored.

The code is below. It is the same as the other Trie code example however it has the dump() methods added and a tweak to track the size of the largest term in the Trie.

Sample code


An example callback function. It uses std::distance() to work out the offset.

int myCallback(const char * pStart, const char * pbuff, const char * resultString, const char * position)
{
std::cout << "Found word :" << resultString << std::endl;
if (pStart && position)
std::cout << "At position :" << std::distance(pStart, position) << std::endl;
return 0;
}

// This is the search() function output from the call to dump().
typedef int (CallbackFunction)(const char * pStart, const char * pbuff, const char * resultString, const char * position);

const char * search(const char * pStart,  const char * pbuff, CallbackFunction cf )
{
  const char * p = pbuff;
  const char * pRet = NULL;
  const size_t maxWordLen = 7;
  while(*p)
  {
  switch(*p){
  case 'c':
    switch(*(++p)){
    case 'a':
      switch(*(++p)){
      case 't':
      // found a word:cat Store pointer to last character of where it was found.
      pRet=p;
      cf(pStart, pbuff, "cat", p);
        switch(*(++p)){
        case 's':
...
...
...

   int main(int argc, char* argv[])
  {
      Trie<char> t;
      std::map<const void *,std::string> dictionary;
      std::vector<SearchResult<char> > searchResults;

      AddWord<char>("cat", t, dictionary);
      AddWord<char>("cats", t, dictionary);
      AddWord<char>("peter", t, dictionary);
      AddWord<char>("wendy", t, dictionary);
      AddWord<char>("hook", t, dictionary);
      AddWord<char>("hooks", t, dictionary);
      AddWord<char>("party", t, dictionary);
      AddWord<char>("pillows", t, dictionary);
      t.dump();
      return 0;
}

// Run once calling the dump() function and then capture the output.

std::ifstream myfile ("c:\\Users\\soswin\\peterpan.txt");
std::string peterpan;
std::string line;
if (myfile.is_open())
{
while ( getline (myfile,line) )
{
peterpan.append(line);
}
myfile.close();
}

while( p && (*p != '\0') )
{
p = search(&peterpan[0], p, &myCallback );
}
}