CDLI transliteration search
TODO: adopt Peter's page (http://cdli.mpiwg-berlin.mpg.de/info/systeminfo.html ?) here and expand it.
Really technical background
The transliteration search currently operates per document, i.e. per tablet. If you search for two words, it returns all tablets on which both words occur somewhere, not necessarily near each other.
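The per-tablet behaviour can be pictured with a small sketch. This is not the ZCatalog implementation: the toy index, the tablet ids, and the function name search_all are made up for illustration, and only the "all words somewhere on the same tablet" rule comes from the description above.

```python
# Hypothetical toy index: tablet id -> set of words found on that tablet.
# The ids and most of the words are invented for this example.
toy_index = {
    "tablet_a": {"udu", "nita", "gu4"},
    "tablet_b": {"udu", "gu4"},
    "tablet_c": {"nita"},
}

def search_all(words, index):
    """Return the ids of tablets that contain every search word somewhere."""
    wanted = set(words)
    # A tablet matches if the wanted words are a subset of its words;
    # the position of the words on the tablet plays no role.
    return sorted(tid for tid, found in index.items() if wanted <= found)

print(search_all(["udu", "nita"], toy_index))  # ['tablet_a']
```

Only tablet_a is returned because it is the only entry holding both words; tablet_b and tablet_c each hold just one of them.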
The search currently uses only the ZCatalog index search, on either the word index or the grapheme index.
The ZCatalog search syntax can be found here: http://www.zope.org/Documentation/Books/ZopeBook/2_6Edition/SearchingZCatalog.stx#2-16
Simple search mode
The default simple search mode simply puts quotes around every "word" separated by blanks:
if search_mode == 'simple':
    # escape everything between blanks
    searchword = ' '.join(['"%s"'%w for w in word.split()])
So if you enter "udu nita" with quotes, it ends up as ""udu" "nita"", which is illegal.
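The effect is easy to reproduce outside Zope. The following sketch wraps the quoting expression from the simple search mode in a stand-alone function; the function name quote_simple is an assumption made for this example.

```python
def quote_simple(word):
    """Put double quotes around every blank-separated token."""
    return ' '.join(['"%s"' % w for w in word.split()])

# Unquoted input: each token gets its own pair of quotes.
print(quote_simple('udu nita'))    # "udu" "nita"

# Input that already carries quotes: the quotes are doubled.
print(quote_simple('"udu nita"'))  # ""udu" "nita""
```

In simple mode the search words should therefore be entered without quotes of their own.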
In advanced search syntax mode, you have to do the quoting yourself whenever you use ATF characters that are also meaningful in ZCatalog search syntax, e.g. (, ), ?, *.
Word and grapheme indexes
Lines beginning with
ignoreLines=['$','@','#','&','>']
are excluded from the index.
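A minimal sketch of this line filter follows. The function name indexable_lines and the sample text are hypothetical, and the sketch assumes the check is made on the very first character of each line, without stripping leading whitespace.

```python
ignoreLines = ['$', '@', '#', '&', '>']

def indexable_lines(atf_text):
    """Return the lines that do not begin with one of the ignoreLines characters."""
    prefixes = tuple(ignoreLines)
    return [line for line in atf_text.splitlines()
            if not line.startswith(prefixes)]

# Invented sample text, only meant to exercise the filter.
sample = "&header line\n@obverse\n1. udu nita\n$ rest broken\n#a comment"
print(indexable_lines(sample))  # ['1. udu nita']
```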
Commas are deleted unless they directly follow s, t or h (upper or lower case):
komma_exception="([^sStThH]),"
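The pattern captures the character before the comma, and the substitution keeps only that character. A short sketch of the behaviour, where the helper name strip_commas is an assumption made for this example:

```python
import re

komma_exception = "([^sStThH]),"
komma_exceptionex = re.compile(komma_exception)

def strip_commas(txt):
    """Delete every comma whose preceding character is not s, t or h."""
    return komma_exceptionex.sub(r"\1", txt)

print(strip_commas("a,b"))       # ab
print(strip_commas("s,a t,u"))   # s,a t,u
print(strip_commas("x, s,a"))    # x s,a
```

Because the pattern needs a preceding character, a comma at the very start of the text is left in place.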
The word index is defined by:
wordBounds="_|,|\""
wordIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"
The grapheme index is defined by:
graphemeBounds="\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|\""
graphemeIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"
The splitting code is:
# delete commas except commas relevant for graphemes
txt = komma_exceptionex.sub(r"\1",txt)
# replace word boundaries by spaces
txt = self.boundsex.sub(' ',txt)
# replace letters to be ignored
txt = self.ignorex.sub('',txt)
# split words
words = txt.split(" ")
for w in words:
    w=w.strip()
    if not (w==''):
        result.append(w)
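To try the splitting outside Zope, the steps above can be combined with the bounds and ignore patterns into a stand-alone sketch. The factory make_splitter and the names split_words and split_graphemes are assumptions made for this example. The patterns are the ones quoted above, written as raw strings so that Python passes the backslashes to the regex engine unchanged.

```python
import re

komma_exceptionex = re.compile(r"([^sStThH]),")

def make_splitter(bounds, ignore):
    """Build a split function from a bounds pattern and an ignore pattern."""
    boundsex = re.compile(bounds)
    ignorex = re.compile(ignore)

    def split(txt):
        txt = komma_exceptionex.sub(r"\1", txt)  # delete commas not after s, t, h
        txt = boundsex.sub(' ', txt)             # boundaries become spaces
        txt = ignorex.sub('', txt)               # ignored characters vanish
        # split on spaces and drop empty pieces
        return [w.strip() for w in txt.split(" ") if w.strip()]

    return split

split_words = make_splitter(r'_|,|"', r'<|>|\#|\||\]|\[|\!|\?\*|;')
split_graphemes = make_splitter(r'\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|"',
                                r'<|>|\#|\||\]|\[|\!|\?\*|;')

print(split_words("udu-nita [gu4]"))      # ['udu-nita', 'gu4']
print(split_graphemes("udu-nita [gu4]"))  # ['udu', 'nita', 'gu4']
```

The word splitter keeps hyphenated words whole and only drops the brackets, while the grapheme splitter also breaks at hyphens and braces. As written, the `\?\*` alternative in the ignore patterns matches only the two-character sequence ?*, so a lone ? is not removed from words.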