
CDLI transliteration search

To do: adopt Peter's page (http://cdli.mpiwg-berlin.mpg.de/info/systeminfo.html ?) here and expand it.

Really technical background

The transliteration search currently operates per document, i.e. per tablet. If you search for two words, it returns every tablet on which both words occur anywhere, not necessarily near each other.
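The per-tablet behaviour can be illustrated with a small sketch. This is not the ZCatalog implementation; the toy index, the tablet ids and the function name `search_all` are hypothetical and only show that a hit depends on the words being present on the tablet, not on their position.

```python
# Hypothetical toy index: tablet id -> set of words found on that tablet.
tablets = {
    "T1": {"lugal", "e2", "gal"},
    "T2": {"lugal", "dumu"},
    "T3": {"e2", "lugal"},
}

def search_all(index, words):
    """Return the ids of all tablets that contain every search word."""
    wanted = set(words)
    # A tablet matches if the wanted words are a subset of its words;
    # where on the tablet they occur plays no role.
    return sorted(tid for tid, found in index.items() if wanted <= found)

print(search_all(tablets, ["lugal", "e2"]))  # ['T1', 'T3']
```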

The search currently uses only the ZCatalog index search, on either the word index or the grapheme index.

The ZCatalog search syntax can be found here: http://www.zope.org/Documentation/Books/ZopeBook/2_6Edition/SearchingZCatalog.stx#2-16

Simple search mode

The simple search mode puts double quotes around every "word", where words are separated by blanks:

    if search_mode == 'simple':
        # escape everything between blanks
        searchword = ' '.join(['"%s"'%w for w in word.split()])
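To see what this does to a query, the same expression can be wrapped in a function and run on its own. The function name `quote_simple` and the sample query are only for illustration.

```python
def quote_simple(word):
    """Quote every blank-separated part of the query, as in simple mode."""
    return ' '.join(['"%s"' % w for w in word.split()])

print(quote_simple("lugal e2-gal"))  # "lugal" "e2-gal"
```

Because `split()` is called without arguments, runs of blanks and leading or trailing blanks do not produce empty quoted strings.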

So if you were to put in

Word and grapheme indexes

Lines beginning with

    ignoreLines=['$','@','#','&','>']

are excluded from the index.
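A minimal sketch of such a filter, assuming the check is a plain prefix test on each line. The function name `indexable_lines` and the sample text (including its P-number) are hypothetical.

```python
ignoreLines = ['$', '@', '#', '&', '>']

def indexable_lines(text):
    """Return the lines that do not start with one of the ignoreLines characters."""
    prefixes = tuple(ignoreLines)
    return [line for line in text.splitlines() if not line.startswith(prefixes)]

sample = "&P000001 = Example\n@obverse\n1. 1(u) udu\n$ rest broken\n2. lugal"
print(indexable_lines(sample))  # ['1. 1(u) udu', '2. lugal']
```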

Commas are deleted unless they directly follow s, t or h (upper or lower case):

    komma_exception="([^sStThH]),"
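The effect of this pattern can be checked in isolation. The sketch below compiles it and substitutes the captured character for each match, which is how the splitting code uses it; the function name `strip_commas` and the sample string are made up for illustration.

```python
import re

komma_exception = "([^sStThH]),"
komma_exceptionex = re.compile(komma_exception)

def strip_commas(txt):
    """Delete every comma whose preceding character is not s, t or h."""
    # The preceding character is captured and put back; only the comma goes.
    return komma_exceptionex.sub(r"\1", txt)

print(strip_commas("s,a-t,u a,b"))  # s,a-t,u ab
```

Since the pattern needs a preceding character, a comma at the very start of the text is left in place by this step.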

The word index is defined by:

    wordBounds="_|,|\""
    wordIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"

The grapheme index is defined by:

    graphemeBounds="\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|\""
    graphemeIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"

The splitting code is:

    # delete commas, except those that are relevant for graphemes
    txt = komma_exceptionex.sub(r"\1", txt)
    # replace word boundaries with spaces
    txt = self.boundsex.sub(' ', txt)
    # remove characters to be ignored
    txt = self.ignorex.sub('', txt)
    # split into words, dropping empty strings
    words = txt.split(" ")
    for w in words:
        w = w.strip()
        if w != '':
            result.append(w)
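The fragment above depends on attributes of the surrounding class. As a rough, self-contained sketch, the same three steps can be written as a function that takes the bounds and ignore patterns as arguments. The patterns are copied from the definitions above (written as raw strings); the function name `split_text` and the sample line are assumptions for illustration, not part of the original code.

```python
import re

komma_exceptionex = re.compile(r'([^sStThH]),')

wordBounds = r'_|,|"'
wordIgnore = r'<|>|\#|\||\]|\[|\!|\?\*|;'
graphemeBounds = r'\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|"'
graphemeIgnore = r'<|>|\#|\||\]|\[|\!|\?\*|;'

def split_text(txt, bounds, ignore):
    """Split one transliteration line into index entries."""
    # 1. delete commas, except after s, t, h
    txt = komma_exceptionex.sub(r"\1", txt)
    # 2. replace boundary characters with spaces
    txt = re.sub(bounds, ' ', txt)
    # 3. remove characters to be ignored
    txt = re.sub(ignore, '', txt)
    # 4. split on spaces and drop empty strings
    return [w.strip() for w in txt.split(" ") if w.strip()]

line = "{d}en-lil2 [lugal]_kur-kur#"
print(split_text(line, wordBounds, wordIgnore))
# ['{d}en-lil2', 'lugal', 'kur-kur']
print(split_text(line, graphemeBounds, graphemeIgnore))
# ['d', 'en', 'lil2', 'lugal', 'kur', 'kur']
```

With the patterns exactly as listed, the comma is also part of both bounds patterns, so in this sketch a comma that survives the first step is turned into a space by the second: `split_text("s,a", wordBounds, wordIgnore)` gives `['s', 'a']`.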
