CDLI transliteration search
Here adopt Peter's page (http://cdli.mpiwg-berlin.mpg.de/info/systeminfo.html ?) and expand
Really technical background
The transliteration search currently operates on a per-document (i.e. per-tablet) basis. If you search for two words, it returns all tablets on which both words occur somewhere, not necessarily near each other.
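The tablet-level matching described above can be illustrated with a toy model. This is only a sketch: it does not use ZCatalog, and the tablet names, word lists and the `search_tablets` helper are made up for illustration.

```python
# Hypothetical mini-corpus: tablet id -> words found anywhere on the tablet.
tablets = {
    "tablet_a": ["udu", "lugal", "mu"],
    "tablet_b": ["udu", "mu"],
    "tablet_c": ["lugal"],
}

def search_tablets(tablets, query):
    """Return the ids of tablets that contain every word of the query."""
    terms = query.split()
    return sorted(
        tablet_id
        for tablet_id, words in tablets.items()
        if all(term in words for term in terms)
    )

# Both words must occur on the tablet, but their position does not matter.
print(search_tablets(tablets, "udu lugal"))
```

Only `tablet_a` is returned for "udu lugal", because it is the only tablet carrying both words; where on the tablet they occur plays no role.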
The search currently uses only the ZCatalog index search, on either the word index or the grapheme index.
The ZCatalog search syntax can be found here: http://www.zope.org/Documentation/Books/ZopeBook/2_6Edition/SearchingZCatalog.stx#2-16
Simple search mode
The simple search mode puts double quotes around every "word", where words are separated by blanks:
if search_mode == 'simple':
    # escape everything between blanks
    searchword = ' '.join(['"%s"'%w for w in word.split()])
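To see what the quoting step produces, the expression can be wrapped in a small function and run on its own. The function name `quote_simple` and the sample input are made up for this sketch; only the join expression comes from the snippet above.

```python
def quote_simple(word):
    """Wrap every whitespace-separated piece of the input in double quotes."""
    # Same expression as in the search code: split on blanks, quote, re-join.
    return ' '.join(['"%s"' % w for w in word.split()])

print(quote_simple("lugal e2-gal"))  # prints: "lugal" "e2-gal"
```

Because `split()` is called without arguments, runs of blanks and leading or trailing blanks are dropped before quoting. Presumably the quotes keep characters inside a word from being read as ZCatalog query syntax, but that motive is an assumption here.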
So if you were to put in
Word and grapheme indexes
Lines beginning with
ignoreLines=['$','@','#','&','>']
are excluded from the index.
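A minimal sketch of such a line filter follows. It assumes the check is made on the first character of each line without stripping leading whitespace, which the page does not state; the helper `indexable_lines` and the sample text are hypothetical.

```python
ignoreLines = ['$', '@', '#', '&', '>']

def indexable_lines(text):
    """Return the lines that do not begin with one of the ignored prefixes."""
    prefixes = tuple(ignoreLines)  # str.startswith accepts a tuple of prefixes
    return [line for line in text.splitlines() if not line.startswith(prefixes)]

# Made-up sample in the style of a transliteration file.
sample = "&P000001 = hypothetical\n@obverse\n1. 1(disz) udu\n$ broken\n2. lugal"
print(indexable_lines(sample))
```

Only the two numbered text lines remain; the header, structure and comment lines are skipped.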
Commas are deleted unless they directly follow s, t or h (in either case):
komma_exception="([^sStThH]),"
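The effect of this pattern can be tried out in isolation. The sketch below assumes the pattern is compiled with Python's `re` module and applied with the replacement `r"\1"`, as in the splitting code on this page; the helper name `strip_commas` and the sample strings are made up.

```python
import re

komma_exception = "([^sStThH]),"
komma_exceptionex = re.compile(komma_exception)

def strip_commas(txt):
    """Drop each comma whose preceding character is not s, t or h."""
    # The captured character is kept via \1; only the comma itself is removed.
    return komma_exceptionex.sub(r"\1", txt)

print(strip_commas("a, b"))      # comma after "a" is removed
print(strip_commas("s,a t,u"))   # commas after "s" and "t" are kept
```

Since the pattern needs a character in front of the comma, a comma at the very start of the string is left in place by this step.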
The word index is defined by:
wordBounds="_|,|\""
wordIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"
The grapheme index is defined by:
graphemeBounds="\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|\""
graphemeIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"
The splitting code is:
# delete commas except commas relevant for graphemes
txt = komma_exceptionex.sub(r"\1",txt)
# replace word boundaries by spaces
txt = self.boundsex.sub(' ',txt)
# replace letters to be ignored
txt = self.ignorex.sub('',txt)
# split words
words = txt.split(" ")
for w in words:
    w=w.strip()
    if not (w==''):
        result.append(w)
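The splitting code above can be turned into a self-contained sketch that shows how the two indexes differ. It makes several assumptions: the patterns are compiled with `re.compile`, the compiled bounds and ignore patterns are passed in as arguments instead of being read from `self`, and the function name `split_text` and the sample line are made up.

```python
import re

komma_exceptionex = re.compile("([^sStThH]),")

# Patterns as listed on this page, written as raw strings.
wordBounds = r'_|,|"'
wordIgnore = r'<|>|\#|\||\]|\[|\!|\?\*|;'
graphemeBounds = r'\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|"'
graphemeIgnore = r'<|>|\#|\||\]|\[|\!|\?\*|;'

def split_text(txt, boundsex, ignorex):
    """Split one transliteration line into index terms."""
    result = []
    # delete commas except those directly after s, t or h
    txt = komma_exceptionex.sub(r"\1", txt)
    # replace boundary characters by spaces
    txt = boundsex.sub(' ', txt)
    # drop characters that are ignored in the index
    txt = ignorex.sub('', txt)
    # split on spaces and skip empty pieces
    for w in txt.split(" "):
        w = w.strip()
        if w != '':
            result.append(w)
    return result

line = "{d}en-lil2 _lugal_ [mu]-na-du3#"  # hypothetical sample line
words = split_text(line, re.compile(wordBounds), re.compile(wordIgnore))
graphemes = split_text(line, re.compile(graphemeBounds), re.compile(graphemeIgnore))
print(words)      # ['{d}en-lil2', 'lugal', 'mu-na-du3']
print(graphemes)  # ['d', 'en', 'lil2', 'lugal', 'mu', 'na', 'du3']
```

With these patterns the word index keeps hyphenated sign sequences and determinative braces together, while the grapheme index breaks the same line into individual signs.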