CDLI transliteration search
To do: adopt Peter's page (http://cdli.mpiwg-berlin.mpg.de/info/systeminfo.html ?) here and expand it.
Really technical background
The transliteration search currently operates per document, i.e. per tablet. A search for two words returns every tablet on which both words occur somewhere, not necessarily near each other.
At present the search uses only the ZCatalog index search, on either the word index or the grapheme index.
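To illustrate what "per tablet" means, here is a minimal sketch of a document-level lookup. It is not the ZCatalog implementation; the function names, the tablet ids and the word lists are all made up for this example.

```python
from collections import defaultdict

def build_index(tablets):
    """Map each word to the set of tablet ids it occurs on (hypothetical helper)."""
    index = defaultdict(set)
    for tablet_id, words in tablets.items():
        for w in words:
            index[w].add(tablet_id)
    return index

def search_all(index, query_words):
    """Return the tablets that contain every query word, anywhere on the tablet."""
    sets = [index.get(w, set()) for w in query_words]
    if not sets:
        return set()
    return set.intersection(*sets)

# Made-up sample data: three tablets with their indexed words
tablets = {
    "tablet_a": ["lugal", "e2", "gal"],
    "tablet_b": ["lugal", "dumu"],
    "tablet_c": ["e2", "dumu", "lugal"],
}
index = build_index(tablets)
print(sorted(search_all(index, ["lugal", "e2"])))  # ['tablet_a', 'tablet_c']
```

Position and line are not stored in this sketch, which is why a hit says nothing about how close the two words are to each other.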
The ZCatalog search syntax can be found here: http://www.zope.org/Documentation/Books/ZopeBook/2_6Edition/SearchingZCatalog.stx#2-16
Simple search mode
Simple search mode puts double quotes around every "word", where words are separated by blanks:
if search_mode == 'simple':
    # escape everything between blanks
    searchword = ' '.join(['"%s"'%w for w in word.split()])
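As a quick illustration of the quoting, the same expression can be wrapped in a function and tried on a sample input. The function name `simple_search_query` and the input strings are assumptions made for this example only.

```python
def simple_search_query(word):
    # Same expression as in the snippet above:
    # wrap every blank-separated token in double quotes
    return ' '.join(['"%s"' % w for w in word.split()])

print(simple_search_query('lugal e2-gal'))  # "lugal" "e2-gal"
```

Because `split()` without arguments splits on any run of whitespace, repeated blanks in the input do not produce empty quoted tokens.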
So if you were to put in
Word and grapheme indexes
Lines beginning with
ignoreLines=['$','@','#','&','>']
are excluded from the index.
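A minimal sketch of such a filter, assuming that the test is made on the first character of each raw line (the indexing code itself is not shown here, so this is a guess). The function name `indexable_lines` and the sample text are made up.

```python
ignoreLines = ['$', '@', '#', '&', '>']

def indexable_lines(text):
    """Keep only the lines that do not start with one of the ignoreLines characters."""
    prefixes = tuple(ignoreLines)
    return [line for line in text.splitlines() if not line.startswith(prefixes)]

# Made-up sample text with one transliteration line and four ignored lines
sample = "\n".join([
    "&header line",
    "@tablet",
    "1. lugal e2-gal",
    "$ rest broken",
    "#comment line",
])
print(indexable_lines(sample))  # ['1. lugal e2-gal']
```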
Commas are deleted unless they directly follow s, t or h (upper or lower case):
komma_exception="([^sStThH]),"
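The effect of this pattern can be checked in isolation. The sketch below compiles the pattern above and applies the same substitution as the splitting code; the function name `strip_commas` and the sample strings are invented for illustration.

```python
import re

komma_exception = r"([^sStThH]),"
komma_exceptionex = re.compile(komma_exception)

def strip_commas(txt):
    # Replace "<character>," by "<character>" unless the character is s, t or h
    return komma_exceptionex.sub(r"\1", txt)

print(strip_commas("a, b"))            # a b
print(strip_commas("s,i-il2, t,up"))   # s,i-il2 t,up
```

Since the pattern needs a character in front of the comma, a comma at the very start of the text is left in place.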
The word index is defined by:
wordBounds="_|,|\""
wordIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"
The grapheme index is defined by:
graphemeBounds="\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|\""
graphemeIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"
The splitting code is:
# delete commas, except commas that are relevant for graphemes
txt = komma_exceptionex.sub(r"\1",txt)
# replace word boundaries by spaces
txt = self.boundsex.sub(' ',txt)
# remove characters that are to be ignored
txt = self.ignorex.sub('',txt)
# split words
words = txt.split(" ")
for w in words:
    w=w.strip()
    if not (w==''):
        result.append(w)
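The splitting steps can be put together into a runnable sketch for both indexes. The patterns are copied from the definitions above (written as raw strings). The class name `Splitter`, its constructor, and the sample line are assumptions for this example and not the actual indexer code.

```python
import re

komma_exception = r"([^sStThH]),"
wordBounds = r'_|,|"'
wordIgnore = r"<|>|\#|\||\]|\[|\!|\?\*|;"
graphemeBounds = r'\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|"'
graphemeIgnore = r"<|>|\#|\||\]|\[|\!|\?\*|;"

komma_exceptionex = re.compile(komma_exception)

class Splitter:
    """Hypothetical wrapper around the splitting code shown above."""

    def __init__(self, bounds, ignore):
        self.boundsex = re.compile(bounds)
        self.ignorex = re.compile(ignore)

    def split(self, txt):
        result = []
        # delete commas, except commas that are relevant for graphemes
        txt = komma_exceptionex.sub(r"\1", txt)
        # replace boundaries by spaces
        txt = self.boundsex.sub(' ', txt)
        # remove characters that are to be ignored
        txt = self.ignorex.sub('', txt)
        # split on single spaces and drop empty pieces
        for w in txt.split(" "):
            w = w.strip()
            if w != '':
                result.append(w)
        return result

word_splitter = Splitter(wordBounds, wordIgnore)
grapheme_splitter = Splitter(graphemeBounds, graphemeIgnore)

line = "[lugal]-e e2-gal# mu-du3!"  # made-up sample line
print(word_splitter.split(line))      # ['lugal-e', 'e2-gal', 'mu-du3']
print(grapheme_splitter.split(line))  # ['lugal', 'e', 'e2', 'gal', 'mu', 'du3']
```

Two details follow from the patterns exactly as listed. The alternative `\?\*` in the ignore patterns matches only the two-character sequence `?*`, not a single `?` or `*`. In this sketch a comma that survives the comma exception is still treated as a boundary, because `,` is also listed in both bounds patterns.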
