q-grams for fuzzy string matching in KDB

January 31st, 2011

To see if two strings are similar we can break the strings up into qgrams and count how many match between the two. Padding at the front and back is done so that start/end letters have a chance to cause a match without the whole start/end needing to match:


"Elvis" -> "##E" "#El" "Elv" "lvi" "vis" "is%" "s%%"
"Jarvis" -> "##J" "#Ja" "Jar" "arv" "rvi" "vis" "is%" "s%%"
i.e. Three would match here.

What's particularly good about this algorithm is that it easily adapted for use in databases. Here is a version in q/SQL . q is a language created by KX and is tightly tied to a database called KDB+, a trial version is available. The code is based on this paper .

Using .qgrams.cache[] function in q we can see how it works:

q)a:("Donald Knuth";"Leslie Lamport";"Ken Thompson";"Rob Pike";"Linus Torvalds";"Richard Stallman";"Tim Berners-Lee";"Alan
 Turing";"John Von Neumann";"Brian Kernighan";"Grace Hopper";"Ada Lovelace";"Edsger Dijkstra";"Jon Von Neumann";"Herb Simo
n";"Kenneth Iverson");
q).qgrams.cache[3;a]
qstr| pid pn
----| --------
##D | ,0  ,0
#DO | ,0  ,1
DON | ,0  ,2
ONA | ,0  ,3
NAL | ,0  ,4
ALD | 0 4 5 12
LD  | ,0  ,6
D K | ,0  ,7
KN  | ,0  ,8
KNU | ,0  ,9
NUT | ,0  ,10
UTH | ,0  ,11
TH% | ,0  ,12
H%% | ,0  ,13
##L | 1 4 0 0
#LE | ,1  ,1
LES | ,1  ,2
ESL | ,1  ,3
SLI | ,1  ,4
LIE | ,1  ,5
..

A list of strings where the first one was “Donald Knuth” has been broken up.
qstr - is the string broken up
pid - is which string it originally belonged to
pn - is the position of the qgram within the original string

As you can see if we had two tables with qgrams, we could simply join on the qstr column and count the number of overlaps for given strings (pid’s). The pn column can be used if we want to use a “window”to ensure that strings that match occur near each other.

Functions are provided to do this for you, the results are sorted by matchR a crude similarity scoring system (number of matches / length of shortest string):

C:tokyom>q
KDB+ 2.7 2010.11.30 Copyright (C) 1993-2010 Kx Systems
w32/ 2()core 3036MB Admin anonymous 192.168.178.64 PLAY 2011.02.28
 
q)l qgrams.q
q)a:("Donald Knuth";"Leslie Lamport";"Ken Thompson";"Rob Pike";"Linus Torvalds";"Richard Stallman";"Tim Berners-Lee";"Alan
 Turing";"John Von Neumann";"Brian Kernighan";"Grace Hopper";"Ada Lovelace";"Edsger Dijkstra";"Jon Von Neumann";"Herb Simo
n";"Kenneth Iverson");
q) / 3 = qgram size     5 = window size
q).qgrams.fuzzyMatch[3; 5; ("Ken Iverson";"Jonny Newmann"); a]
matchR    qid pid matches strA            strB
------------------------------------------------------------
1         0   15  11      "Ken Iverson"   "Kenneth Iverson"
0.6363636 0   2   7       "Ken Iverson"   "Ken Thompson"
0.6153846 1   13  8       "Jonny Newmann" "Jon Von Neumann"
0.5384615 1   8   7       "Jonny Newmann" "John Von Neumann"
0.2       0   14  2       "Ken Iverson"   "Herb Simon"
0.1538462 1   5   2       "Jonny Newmann" "Richard Stallman"
q)

Assuming that one table will be extremely large and that you’ll be checking a small subset against that, the functions also allow for caching of one of the datasets. Here I have used the person data set from DBpedia. The functions to parse this file are included in the qgrams.q. In fact all example code is included at the bottom of the qgrams.q file commented out.

e.g.

q)a: distinct value loadDBpeople[`:persondata_en.nt]
q)count a
48089
q)\t .qgrams.fuzzyMatch[3; 2; ("isac newton";"albert einstine"); a];
2140
q)ca:.qgrams.cache[3;a]
q)\t .qgrams.fuzzyMatchUsingCache[ 2; ("isac newton";"albert einstine"); a; ca]
421
q)7#.qgrams.fuzzyMatchUsingCache[ 200; ("richie feynmann";"albert einstine"); a; ca]
matchR    qid pid   matches strA              strB                           ..
-----------------------------------------------------------------------------..
2.4       1   14225 36      "albert einstine" "Nyron nilbert gilbert filbert ..
0.8666667 1   8     13      "albert einstine" "Albert Einstein"              ..
0.8181818 1   35952 9       "albert einstine" "Albert Lane"                  ..
0.7777778 1   4010  7       "albert einstine" "Albert II"                    ..
0.75      1   30825 3       "albert einstine" "Wine"                         ..
0.7333333 0   860   11      "richie feynmann" "Richard Phillips Feynman"     ..
0.7272727 1   81    8       "albert einstine" "Albert Pike"                  ..

I’d say there are a number of major speed improvements could be made to the code and the matchR scoring system could definitely be improved. If you do have any good code changes or suggestions feel free to contact me and I’ll post any improved versions here.

KDB , , , , ,

fast java prime number generation

June 7th, 2009

So I found some old prime number generation code and decided to spruce it up. I also wanted to benchmark the BitSet vs a boolean array as I had previously had issues with a boolean array. Six methods tried:

Prime Generators - Source Download

  1. Simp - Simply check for all x’s 0 to maxPrime if x is prime. ie check x isn’t divisible by any number between 2 to x-1.
  2. SimpOdd - same as above, but add 2 to list of known primes, then check all x’s between 0 and maxPrime but stepping in 2’s. And only check for divisibles between 2 to x/2.
  3. S - Sieve - Sieve of Eratosthenes
  4. SS - SkippingSievePG - improve speed by taking advantage that even numbers are never prime except 2. Therefore when sieving using the prime 3, instead of sieving 6,9,12,15,18,21 we can actually sieve 9,15,21 ie jump double our prime each time.
  5. OSS - OddSkippingSievePG - Same as above BUT improve speed/memory making the boolean array represent odd numbers instead of all numbers.
  6. BSOSS - BitSetOddSkippingSievePG - Same as above but use BitSet instead of boolean array


Seconds taken to generate primes between 0 to 10,000,000

speed

Speed of simp’s were so slow as to be unusable for finding large numbers of primes, ie 1000’s of times slower than sieves.


Memory required to generate primes between 0 to 10,000,000

memory

Notice 1,120,000 Bytes needed to store arraylist of integers.

Size of boolean arrays / BitSets in Java

boolean arrays in java use 1 byte per true/false value. BitSets use 1 bit per true/false value (may depend on OS/VM). However BitSets are slower to access. This explains why BitSetOddSkippingSievePG was slower than OddSkippingSievePG but required less memory.

Use a sieve - to generate primes fast

Further Improvements

The idea of reducing the array size by letting the array represent odd numbers could be generalized. Further speed increases are also possible. If you want to know more see wheel factorization or segmented sieves. If you code a quicker prime generator please let me know.

Uncategorized

Setup java speech jsapi using FreeTTS

April 13th, 2009

Getting JSAPI setup to work using freetts seems to be quite difficult as seen by here, here, here…..

Turns out the install instructions were incomplete. The instructions were:

1. Go to the FreeTTS/lib directory
2. Type .\jsapi.exe
3. If the binary license agreement is acceptable, accept
it by clicking “I Agree”. The jsapi.jar file will be unpacked
and deposited into the lib directory.

What you actually want to do is:

  1. Download FreeTTS
  2. Unzip the freeTTS binary package and check inside the \lib directory for jsapi.exe
  3. Run Jsapi.exe, say yes, to unpack jsapi.jar
  4. Find your JRE directory, mine was C:\Program Files\Java\jdk1.6.0_03\jre\lib\ext
  5. Copy all the Jars (jsapi.jar, freetts.jar, cmu_time_awb.jar, cmu_us_kal.jar, etc.) to that directory
  6. Check in netbeans your projects properties ie right click on project->properties->libraries->manage platforms and the jars should be listed there. If they are not, restart and hopefully they should now be there.
    Netbeans Freetts Setup

    Netbeans Freetts Setup

  7. Copy speech.properties to the relevant folder. To find out where this is run the HelloWorld example that comes with FreeTTS. In my case the file was located in C:\Documents and Settings\Username\java.home\lib and contained the following
    1
    2
    3
    4
    5
    
    # Modify this accordingly...
    #
    #TextSynthEngineCentral=com.sun.speech.engine.synthesis.text.TextEngineCentral
    #
    FreeTTSSynthEngineCentral=com.sun.speech.freetts.jsapi.FreeTTSEngineCentral

    Or you can set the property using

    1
    
    System.setProperty("freetts.voices", "com.sun.speech.freetts.en.us.cmu_us_kal.KevinVoiceDirectory");

Any problems, post here and I will try to help you out.

Uncategorized , , ,

Symfony sfGuard - Setting up users,groups,permissions

January 11th, 2009

sfGuard is a Symfony plugin that implements a user management and login system for an application. It supports both groups and individual users… and it saves you from having to ‘roll your own’ user administration system. This guide assumes you have followed the steps given in the readme and that you now want to begin setting up users/permissions etc.

1. Create links in the backend menu to the user/group/permissions tables.

Edit apps/backend/templates/layout.php and add these items to the menu.

1
2
3
<li><?php echo link_to('Users', '@sf_guard_user') ?></li>
<li><?php echo link_to('Groups', '@sf_guard_group') ?></li>
<li><?php echo link_to('Permissions', '@sf_guard_permission') ?></li>

2. Create a login/logout link on the frontend.

Edit apps/frontend/templates/layout.php and add these items to the menu. (Notice the use of $sf_user in the templates.)

1
2
3
4
5
<?php if($sf_user->isAuthenticated()): ?>
   <ul><li><?php echo link_to('Logout', '/logout') ?></li></ul>
<?php else: ?>
   <ul><li><?php echo link_to('Login', '/login') ?></li></ul>
<?php endif; ?>

3. Create some users, groups, permissions

for us to play with using the backend. Create user->basicUser, group->basicGroup, permission->basicPermission. I will be using a basic setup where users always belong to a group and the group has permissions. I will not assigning permissions to individual users. therefore give basicGroup the basicPermission. and you will have something similar to this:

sfGuard user

sfGuard user

4. Restricting access to certain modules/actions

Similar to how I never set individual permissions for one user I make it standard that I only ever set permissions using credentials. ie. In the application I never restrict security dependent on user or group id only on permission/credentials. This allows greater flexibility in the future. Note sfGuard gets confusing to some people because many documents talk about credentials, well basically credentials are what is called in sfGuard permissions.

If we have a module called “question”, inside of apps/frontend/modules/question we create a config folder and a new security.yml. Inside of apps/frontend/modules/question/config/security.yml we would have

1
2
3
all:
    is_secure: on
    credentials: basicPermission

To set permissions on an action level we would have something similar to the following:

1
2
3
4
5
6
7
8
9
10
all:
    is_secure: on
    credentials: basicPermission
 
index:
    is_secure: off
 
new:
    is_secure: on
    credentials: basicPermission

Part 2 will detail setting up user registration

web development ,

How to setup a complete PHP development environment on windows

January 8th, 2009

Why to use an IDE

Code Completion

When you start to type a variable name or reference an object the IDE will present options that would autocomplete what your typing. It also presents the documentation saying which each parameter of a method call is. This is easiest to explain by showing you a picture:

Netbeans - Code Completion

Netbeans - Code Completion

Easy browsing

As you can see in the image, I have one pane for browsing files, one pane for browsing methods. The central pane showing the code is coloured to tell you which words are variables, functions, constants etc. Notice also that you have multiple files open in tabs like firefox. So I can now browse easily through directories/methods and recently opened files. very Handy.

Netbeans PHP Support

Netbeans PHP Support

No more echo debug_var

Why waste time echo’ing some variables, XDEBUG allows you to select a line that your interested in, the program will then stop running at that point and you can step through the code line by line and watch as each variable changes to see where the problem is.

The Setup

Download and install Netbeans 6.5 for PHP

Take advantage of syntactic and semantic code highlighting, pop-up documentation, code formating and folding, instant rename, code templates, and automatic code completion (including bracket completion) for PHP. The Editor recognizes PHP code including heredoc notation in PHP projects and in PHTML and PHP files.

Download and install MySQL GUI Tools

Never type “Select * from …” again. With simple clicking you can create/edit/drop/select tables. Far easier and quicker. You might also want to check out mysql workbench on the same site.

mysql-gui-tools

mysql-gui-tools

OPTIONAL install Subversion

If only working indiviually the versioning provided internally by netbeans may be sufficient otherwise checkout installing subversion and tortoise SVN. This allows you to save your project at different points. Later if you find out something has broken, you can revert to any earlier date. But thats the simplest method it offers. You can look back at one particular file and see which lines changed on what date. In a team environment this is essential for seeing who changed what files.

Download and install WAMP

WampServer is a Windows web development environment. It allows you to create web applications with Apache, PHP and the MySQL database. It also comes with PHPMyAdmin and SQLiteManager to easily manage your databases.

Environment setting

At times we will need to call PHP or MySQL from the command line to do this we have to setup the PATH environment. Right-click on My Computer, than Properties. Switch to Advanced tab and click the Environment Variables button. At the end of variable PATH add “;C:\wamp\bin\php\php5.2.6;C:\wamp\bin\mysql\mysql5.0.45\bin” (note separated by a semicolon). These paths may be slightly different depending on your version, Have a browse.

XDebug

Download the .dll from XDebug website (last time I checked it was on right hand menu, under windows modules). Save the file “php_xdebug-2.0.3-5.1.7.dll” to the “C:\wamp\bin\php\php5.2.6\ext” folder. Or the similar folder on your install as your PHP version may be different. Browse to “C:\wamp\bin\apache\apache2.2.8\bin” and edit “php.ini”. At the bottom of the PHP.ini also add

zend_extension_ts=”C:/wamp/bin/php/php5.2.6/ext/php_xdebug-2.0.2-5.2.5.dll”
xdebug.remote_enable=1

Apache Rewrite

Apache URL Rewrite Module is needed to allow nicer looking URL’s, by default its off in WAMP. We need to turn it on - left click on WAMP’s tray icon , then in Apache >> Apache Modules menu select rewrite_module.

php_xsl extension

Again left click on WAMP’s tray icon. Then PHP >> PHP Extension menu, look for php_xsl and click it. But there is one more php.ini file, which WAMP won’t change (no clue why) - we need to do it by hand, let’s open: C:\wamp\bin\php\php5.2.5\php.ini and remove “;” from the line “;extension=php_xsl.dll”. This uncomments it. At the bottom of the PHP.ini also add

zend_extension_ts=”C:/wamp/bin/php/php5.2.6/ext/php_xdebug-2.0.2-5.2.5.dll”
xdebug.remote_enable=1

Restart WAMP

Again left click on WAMP’s tray icon then select restart all services. Note: mySQL default username:”root” password:”" , thats right BLANK. All documents for the webserver are now located in c:\wamp\www\

web development , , ,