Saturday 17 March 2018

Winnowing document fingerprinting

A C++ implementation of this algorithm:

http://igm.univ-mlv.fr/~mac/ENS/DOC/sigmod03-1.pdf

is available here:

https://github.com/eskeels/fingerprintotron

To build it requires CMake and GNU C++ (min version 4.8). Once built it produces a utility "compare". You pass it the files (UTF-8 text) to compare against each other. The utility can take either directories or individual files, it will work out what to do. It will then fingerprint each document and compare the results with every other document. As each document is compared it outputs the percentage of fingerprints that are common to both documents. Finally it lists all the documents that are similar to each other (as in have a > 20% fingerprint match).

Example output (running against directory called "dir" containing 4 text files):

Processing files:
dir/test1.txt
dir/test2.txt
dir/test1_same.txt
dir/rand.txt

dir/test1.txt - dir/test2.txt %24
dir/test1.txt - dir/test1_same.txt %100
dir/test1.txt - dir/rand.txt %1
dir/test2.txt - dir/test1_same.txt %22
dir/test2.txt - dir/rand.txt %5
dir/test1_same.txt - dir/rand.txt %1
Similar documents
dir/test1.txt dir/test1_same.txt dir/test2.txt

so test1.txt has 24% of its fingerprints in common with test2.txt. rand.txt only has 1% in common with test1.txt.

test1.txt / test1_same.txt and test2.txt all have common fingerprints.



Friday 13 October 2017

Levenshtein Prefilter

This code will take a word and create a feature vector from it by simply using the letter frequency. It then finds the nearest matches (using euclidean distance dropping the square root) prior to performing a levenshtein distance calculation. The original plan was to completely replace the levenshtein distance but it did not work out, the words it matched were not a good selection. I tried various feature vectors (number of syllable, length etc..) but the letter frequency was as goods as anything else and took far less computation. The code is hardwired to a levenshtein distance of 2. In terms of performance the longer the search term the more time the prefilter saves. Doing some ad hoc testing I found that a four character term it is slightly faster (maybe 50% quicker) and for a length over 10 it starts being around 11 times. One thing to note is that the prefilter can sometimes remove too many terms. You can increase the maximum distance from 6 to 8 to reduce the filter rate (the parameter passed to IsCloseEnough()) but there is an obvious performance penalty.

I've uploaded the code to git hub, it requires CMake and GNU C++ (min version 4.8).

https://github.com/eskeels/fuzzywordsearch

Friday 1 July 2016

Building the Clang AST visitor example on Centos 7

I was struggling to get the AST visitor example (http://clang.llvm.org/docs/RAVFrontendAction.html) to work on Centos 7. These steps allow you to build with yum installed CMake and CLang. I wanted to give it a go to see how easy it would be to create a decent source browser or a tool that can produce Plant UML from existing C++ code. I've refactored the sample to output more information and modified the CMakeLists.txt to work. For each class it will output all the parent contexts (ie namespaces / classes) and all the functions.

Package dependencies


Yum install the following packages:

cmake
clang
clang-devel
libffi
libffi-devel
llvm-devel
llvm-static

I understand Clang shares headers / libs with GNU g++ so I have assumed you already have g++ installed.

CMakeLists.txt


To allow the yum installed CMake to work I had to tweak the tutorial CMakeLists.txt to look like below:

set(CMAKE_CXX_COMPILER "/usr/bin/clang++")
set(CMAKE_C_COMPILER "/usr/bin/clang")
set(CMAKE_LIBRARY_PATH ${CMAKE_LIBRARY_PATH} /usr/lib64/llvm)

add_definitions(-D__STDC_CONSTANT_MACROS -D__STDC_LIMIT_MACROS)
project (Tutorial)
add_executable(Tutorial tutorial.cxx)
target_link_libraries(Tutorial /usr/lib64/llvm/libclang.so /usr/lib64/llvm/libLLVM-3.4.so /usr/lib64/llvm/liblldb.so /usr/lib64/llvm/libclangTooling.a)

Build and run

Simply do a "cmake ." followed by "make". This will build an executable called "Tutorial".
You can then just run Tutorial against tutorial.cxx.

compile_commands.json

My example invokes runToolOnCodeWithArgs(). To get the correct arguments you can enable verbose makefile:

cmake -DCMAKE_VERBOSE_MAKEFILE=ON .

This will cause a compile_commands.json to be output. This lists all the arguments you need to pass to Clang for the compilation unit to compile correctly.

Links:




Thursday 14 January 2016

Scratch for Arduino tutorials

Some more Arduino tutorials, these use the excellent Scratch for Arduino (http://s4a.cat/). The following links are to PDFs:

First off we get a single LED to flash on / off:

https://drive.google.com/open?id=0B_YZvz83_le0Uk93Qm51MC11Tlk

A simple reaction test game:

https://drive.google.com/open?id=0B_YZvz83_le0VlV3NFBGZVR4ckE

A temperature sensor:

https://drive.google.com/open?id=0B_YZvz83_le0bjhmdTcxR3Z6aFU

How to wire up buttons part 1:

https://drive.google.com/open?id=0B_YZvz83_le0ZGpzNEs5b2FOZVk

Following on from the above this tutorial uses the buttons to move a sprite:

https://drive.google.com/open?id=0B_YZvz83_le0Y2dkcFZzdGp2Q2c

Using a joystick:

https://drive.google.com/open?id=0B_YZvz83_le0Y2dkcFZzdGp2Q2c

Wednesday 13 January 2016

Simple Arduino memory game

Here is a tutorial for a very simple memory game that flashes a sequence of LEDs. You respond via the serial console with a list of  'r' or 'g' characters to indicate the sequence. The tutorial also shows you how to respond via a console application. There are more sophisticated arduino memory games on the web but I tried to keep this one as easy as possible as it was used in our coder dojo session.

PDF:
https://drive.google.com/file/d/0B_YZvz83_le0Vy16WUg1eVNyVjQ/view?usp=sharing

Code:
https://drive.google.com/open?id=0B_YZvz83_le0a0g0bWxPSV9xeTA

Saturday 21 February 2015

Arduino - Sending key presses from attached PC


I wanted to be able to send real time key presses to the Arduino from the PC so could not use the Arduino application as it requires the return key after every send. I initially tried a simple bat file with echos to the COM ports but could not get it to work. In the end I wrote a powershell script:

$port = new-Object System.IO.Ports.SerialPort COM9,9600,None,8,one
$port.open()
do
{
  $key = $Host.UI.RawUI.ReadKey("NoEcho,IncludeKeyDown")
  Write-Host $key.Character
  $port.WriteLine($key.Character) 
}
while(($key.Character -ne 'q'))
$port.Close()

Put the above into a script (scr.ps1) and invoke it as follows:

c:\> powershell -ExecutionPolicy ByPass -File scr.ps1

Ensure you have closed the Arduino app or you will get a permission error talking to the COM port. Also ensure you amend the COM9 (first line of script) to be the appropriate COM port. The script will run until the q key is pressed. I have only tested it on Windows 7 so I don't know whether it will work with powershell 1 (XP / Vista version). The corresponding Arduino code is simply:

void setup() { serial.begin(9600); }

void loop() {
  char inChar = serial.read();
  if (inChar == 'a') {
    // 'a' key was pressed
  }
}

Friday 6 June 2014

Java Trie - Ascii only

Just uploading an Ascii only version, uses a byte instead of a char. Also uses the top bit as an end of word flag so you can not populate it with Ansi (ie 8 bit characters).

https://drive.google.com/file/d/0B_YZvz83_le0alZ1OFNMVW56cWs/edit?usp=sharing

Also probably worth mentioning that this implementation (and the other full 'char' version) will only return the longest match when there are multiple overlapping terms within the Trie. For example a Trie containing "cap", "capitulate", "capitulated" and a search string of "capitulated" would only return the longest, "capitulated".

I haven't implemented the compressed version as it turns out the above version doesn't use that much memory.