Archive

Posts Tagged ‘KDB’

Kdb Training

February 25th, 2013

We’ve been working really hard on new video guides to Kdb which covers briefly some of the topics available in the kdb training courses. Some of the articles to help learning kdb include:

Feedback or questions on the articles is more than welcome.

KDB , , , ,

q-grams for fuzzy string matching in KDB

January 31st, 2011

To see if two strings are similar we can break the strings up into qgrams and count how many match between the two. Padding at the front and back is done so that start/end letters have a chance to cause a match without the whole start/end needing to match:


"Elvis" -> "##E" "#El" "Elv" "lvi" "vis" "is%" "s%%"
"Jarvis" -> "##J" "#Ja" "Jar" "arv" "rvi" "vis" "is%" "s%%"
i.e. Three would match here.

What's particularly good about this algorithm is that it easily adapted for use in databases. Here is a version in q/SQL . q is a language created by KX and is tightly tied to a database called KDB+, a trial version is available. The code is based on this paper .

Using .qgrams.cache[] function in q we can see how it works:

q)a:("Donald Knuth";"Leslie Lamport";"Ken Thompson";"Rob Pike";"Linus Torvalds";"Richard Stallman";"Tim Berners-Lee";"Alan
 Turing";"John Von Neumann";"Brian Kernighan";"Grace Hopper";"Ada Lovelace";"Edsger Dijkstra";"Jon Von Neumann";"Herb Simo
n";"Kenneth Iverson");
q).qgrams.cache[3;a]
qstr| pid pn
----| --------
##D | ,0  ,0
#DO | ,0  ,1
DON | ,0  ,2
ONA | ,0  ,3
NAL | ,0  ,4
ALD | 0 4 5 12
LD  | ,0  ,6
D K | ,0  ,7
KN  | ,0  ,8
KNU | ,0  ,9
NUT | ,0  ,10
UTH | ,0  ,11
TH% | ,0  ,12
H%% | ,0  ,13
##L | 1 4 0 0
#LE | ,1  ,1
LES | ,1  ,2
ESL | ,1  ,3
SLI | ,1  ,4
LIE | ,1  ,5
..

A list of strings where the first one was “Donald Knuth” has been broken up.
qstr - is the string broken up
pid - is which string it originally belonged to
pn - is the position of the qgram within the original string

As you can see if we had two tables with qgrams, we could simply join on the qstr column and count the number of overlaps for given strings (pid’s). The pn column can be used if we want to use a “window”to ensure that strings that match occur near each other.

Functions are provided to do this for you, the results are sorted by matchR a crude similarity scoring system (number of matches / length of shortest string):

C:tokyom>q
KDB+ 2.7 2010.11.30 Copyright (C) 1993-2010 Kx Systems
w32/ 2()core 3036MB Admin anonymous 192.168.178.64 PLAY 2011.02.28
 
q)l qgrams.q
q)a:("Donald Knuth";"Leslie Lamport";"Ken Thompson";"Rob Pike";"Linus Torvalds";"Richard Stallman";"Tim Berners-Lee";"Alan
 Turing";"John Von Neumann";"Brian Kernighan";"Grace Hopper";"Ada Lovelace";"Edsger Dijkstra";"Jon Von Neumann";"Herb Simo
n";"Kenneth Iverson");
q) / 3 = qgram size     5 = window size
q).qgrams.fuzzyMatch[3; 5; ("Ken Iverson";"Jonny Newmann"); a]
matchR    qid pid matches strA            strB
------------------------------------------------------------
1         0   15  11      "Ken Iverson"   "Kenneth Iverson"
0.6363636 0   2   7       "Ken Iverson"   "Ken Thompson"
0.6153846 1   13  8       "Jonny Newmann" "Jon Von Neumann"
0.5384615 1   8   7       "Jonny Newmann" "John Von Neumann"
0.2       0   14  2       "Ken Iverson"   "Herb Simon"
0.1538462 1   5   2       "Jonny Newmann" "Richard Stallman"
q)

Assuming that one table will be extremely large and that you’ll be checking a small subset against that, the functions also allow for caching of one of the datasets. Here I have used the person data set from DBpedia. The functions to parse this file are included in the qgrams.q. In fact all example code is included at the bottom of the qgrams.q file commented out.

e.g.

q)a: distinct value loadDBpeople[`:persondata_en.nt]
q)count a
48089
q)\t .qgrams.fuzzyMatch[3; 2; ("isac newton";"albert einstine"); a];
2140
q)ca:.qgrams.cache[3;a]
q)\t .qgrams.fuzzyMatchUsingCache[ 2; ("isac newton";"albert einstine"); a; ca]
421
q)7#.qgrams.fuzzyMatchUsingCache[ 200; ("richie feynmann";"albert einstine"); a; ca]
matchR    qid pid   matches strA              strB                           ..
-----------------------------------------------------------------------------..
2.4       1   14225 36      "albert einstine" "Nyron nilbert gilbert filbert ..
0.8666667 1   8     13      "albert einstine" "Albert Einstein"              ..
0.8181818 1   35952 9       "albert einstine" "Albert Lane"                  ..
0.7777778 1   4010  7       "albert einstine" "Albert II"                    ..
0.75      1   30825 3       "albert einstine" "Wine"                         ..
0.7333333 0   860   11      "richie feynmann" "Richard Phillips Feynman"     ..
0.7272727 1   81    8       "albert einstine" "Albert Pike"                  ..

I’d say there are a number of major speed improvements could be made to the code and the matchR scoring system could definitely be improved. If you do have any good code changes or suggestions feel free to contact me and I’ll post any improved versions here.

KDB , , , , ,