CDLI transliteration search
To do: adopt Peter's page here (http://cdli.mpiwg-berlin.mpg.de/info/systeminfo.html ?) and expand it.
Really technical background
The transliteration search currently operates on a per-document (i.e. per-tablet) basis. If you search for two words, it returns all tablets on which both words occur somewhere, not necessarily near each other.
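The tablet-level behaviour can be illustrated with a minimal sketch. This is not the ZCatalog implementation; the corpus, the tablet ids and the function name `tablets_with_all` are made up for illustration.

```python
# Hypothetical mini-corpus: tablet id -> list of transliteration lines.
tablets = {
    "T1": ["1(disz) udu", "2(disz) masz2 nita"],
    "T2": ["1(disz) udu nita"],
    "T3": ["3(disz) gu4"],
}

def tablets_with_all(words, corpus):
    """Return the ids of tablets that contain every word somewhere in the text."""
    hits = []
    for tablet_id, lines in corpus.items():
        # Collect all words of the tablet, regardless of the line they are on.
        tablet_words = set()
        for line in lines:
            tablet_words.update(line.split())
        if all(w in tablet_words for w in words):
            hits.append(tablet_id)
    return sorted(hits)

result = tablets_with_all(["udu", "nita"], tablets)
```

Here T1 is part of the result although udu and nita stand on different lines, which is exactly the "not necessarily near each other" case.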
The search currently uses the ZCatalog index search on the word index ("word") or grapheme index ("part of word").
The ZCatalog search syntax can be found here: http://www.zope.org/Documentation/Books/ZopeBook/2_6Edition/SearchingZCatalog.stx#2-16
Because the ZCatalog search only returns documents, i.e. tablets, a second search is needed to find the lines that matched inside each document. This is used for the result list with lines and for the highlighting in the result list with full texts.
This second search currently only understands full words and ignores their order in the search expression. As a result, a text can show up in the result list with no line from the text, with too many lines, or with errors in the highlighting. For example, a search for "udu nita" with quotes using the advanced search syntax will list the correct texts, but it will show all lines that contain either udu or nita.
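The mismatch between the two searches can be sketched as follows. Both functions are hypothetical and only mimic the described behaviour: `lines_matching_any` stands for the current order-insensitive line search, while `lines_matching_phrase` shows what a phrase-aware line search would return instead.

```python
def lines_matching_any(words, lines):
    """Return lines that contain at least one of the words, ignoring order."""
    wanted = set(words)
    return [ln for ln in lines if wanted & set(ln.split())]

def lines_matching_phrase(words, lines):
    """Return lines in which the words occur consecutively and in order."""
    n = len(words)
    out = []
    for ln in lines:
        tokens = ln.split()
        # Slide a window of n tokens over the line and compare it to the phrase.
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            out.append(ln)
    return out

# Made-up lines of a single tablet.
lines = ["1(disz) udu nita", "2(disz) udu", "1(disz) masz2 nita"]
any_hits = lines_matching_any(["udu", "nita"], lines)
phrase_hits = lines_matching_phrase(["udu", "nita"], lines)
```

With these sample lines, the order-insensitive search returns all three lines, whereas only the first line actually contains the phrase.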
Simple search mode
The default simple search mode simply puts quotes around every "word" separated by blanks:
    if search_mode == 'simple':
        # escape everything between blanks
        searchword = ' '.join(['"%s"'%w for w in word.split()])
So if you enter "udu nita" with quotes, it ends up as ""udu" "nita"", which is not a legal search expression.
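The effect can be reproduced by wrapping the quoting expression in a small function. The name `simple_mode_query` is an assumption for this sketch; only the join expression comes from the code above.

```python
def simple_mode_query(word):
    """Quote every blank-separated token, as the simple search mode does."""
    return ' '.join(['"%s"' % w for w in word.split()])

plain = simple_mode_query('udu nita')      # input without quotes
quoted = simple_mode_query('"udu nita"')   # input already quoted by the user

print(plain)   # "udu" "nita"
print(quoted)  # ""udu" "nita""
```

The second result shows the doubled quotes that make the expression illegal, so quoted phrases should be entered in the advanced mode rather than in the simple mode.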
Advanced search syntax
If you use the advanced search syntax mode, you have to do the quoting yourself whenever you use characters from ATF that are also meaningful in ZCatalog search syntax, e.g. (, ), ?, *.
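A minimal sketch of such manual quoting, assuming that wrapping a term in double quotes (as the simple mode does) is enough to protect it. The helper names `quote_if_needed` and `build_query` are hypothetical, and the character set contains only the examples listed above.

```python
# Characters named above that occur in ATF and are also ZCatalog syntax.
SPECIAL_CHARS = set('()?*')

def quote_if_needed(term):
    """Wrap a term in double quotes if it contains a special character."""
    already_quoted = term.startswith('"') and term.endswith('"')
    if any(ch in SPECIAL_CHARS for ch in term) and not already_quoted:
        return '"%s"' % term
    return term

def build_query(terms, operator='AND'):
    """Join the (possibly quoted) terms with a boolean operator."""
    return (' %s ' % operator).join(quote_if_needed(t) for t in terms)

query = build_query(['1(disz)', 'udu'])
print(query)  # "1(disz)" AND udu
```

Terms that are meant as wildcards, for example a trailing *, must of course be left unquoted, so a helper like this is only suitable for literal ATF terms.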
Word and grapheme indexes
Lines beginning with
are excluded from the index.
Commas are deleted unless they follow s, t or h:
The word index is defined by:
The grapheme index is defined by:
The splitting code is:
    # delete commas except commas relevant for graphemes
    txt = komma_exceptionex.sub(r"\1",txt)
    # replace word boundaries by spaces
    txt = self.boundsex.sub(' ',txt)
    # replace letters to be ignored
    txt = self.ignorex.sub('',txt)
    # split words
    words = txt.split(" ")
    for w in words:
        w=w.strip()
        if not (w==''):
            result.append(w)
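The splitting steps can be tried out in a self-contained sketch. The three regular expressions below are placeholders invented for this example, not the actual index definitions, and the function name `split_words` is also an assumption. Only the order of the steps follows the code above.

```python
import re

# PLACEHOLDER patterns, not the real CDLI definitions.
# Drop a comma unless the character before it is s, t or h (kept via group 1).
komma_exceptionex = re.compile(r"([^sStThH]),")
# Treat runs of whitespace as word boundaries (assumption).
boundsex = re.compile(r"\s+")
# Characters ignored for indexing (assumption).
ignorex = re.compile(r"[\[\]#!?]")

def split_words(txt):
    """Split a transliteration line into index words."""
    result = []
    # delete commas except commas relevant for graphemes
    txt = komma_exceptionex.sub(r"\1", txt)
    # replace word boundaries by spaces
    txt = boundsex.sub(' ', txt)
    # replace letters to be ignored
    txt = ignorex.sub('', txt)
    # split words and drop empty entries
    for w in txt.split(" "):
        w = w.strip()
        if w != '':
            result.append(w)
    return result

words = split_words("udu, nita  s,a [gu4]#")
print(words)  # ['udu', 'nita', 's,a', 'gu4']
```

With these placeholder patterns the comma after udu is removed, while the comma in s,a is kept because it follows an s.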