CDLI transliteration search

TODO: adopt Peter's page here (http://cdli.mpiwg-berlin.mpg.de/info/systeminfo.html ?) and expand it.

Really technical background

The transliteration search currently operates on a per-document, i.e. per-tablet, basis. If you search for two words, it returns all tablets on which both words occur somewhere, not necessarily near each other.
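
As a minimal sketch of this document-level behaviour (not the actual ZCatalog code; the tablet ids, the sample lines and the function name are invented for illustration), a two-word search amounts to an AND over the set of words found on each tablet:

```python
# Toy data: tablet id -> transliteration lines (invented for illustration).
tablets = {
    "T1": ["1(disz) udu", "2(disz) masz2 nita"],
    "T2": ["1(disz) udu nita"],
    "T3": ["3(disz) gu4"],
}

def tablets_matching_all(words, tablets):
    """Return ids of tablets that contain every search word somewhere."""
    hits = []
    for tablet_id, lines in tablets.items():
        # Collect all words of the tablet, regardless of the line they are on.
        tablet_words = set()
        for line in lines:
            tablet_words.update(line.split())
        if all(w in tablet_words for w in words):
            hits.append(tablet_id)
    return sorted(hits)

result = tablets_matching_all(["udu", "nita"], tablets)
print(result)
```

T1 is returned although udu and nita stand on different lines, which is exactly the "somewhere on the tablet" behaviour.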

The search currently uses the ZCatalog index search on the word index ("word") or grapheme index ("part of word").

The ZCatalog search syntax can be found here: http://www.zope.org/Documentation/Books/ZopeBook/2_6Edition/SearchingZCatalog.stx#2-16

Because the ZCatalog search only returns documents, i.e. tablets, a second search is needed to find the matching lines inside each document. This second search supplies the lines for the result list with lines and the highlighting for the result list with full texts.

This second search currently only understands full words and ignores their order in the search expression. As a result, a text can show up in the result list with no lines from the text, with too many lines, or with wrong highlighting. For example, a search for "udu nita" (with quotes) using the advanced search syntax lists the correct texts, but it shows all lines that contain either udu or nita.
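
The effect can be illustrated with a small sketch. It is not the real line search; the function name and sample lines are assumptions, and it only mimics the described behaviour of reducing the expression to a bag of full words:

```python
import re

def matching_lines(expression, lines):
    """Return the lines that contain at least one full word of the expression."""
    # Quotes and word order are dropped; only the bare words survive.
    words = set(re.findall(r'[^\s"]+', expression))
    return [line for line in lines if words & set(line.split())]

# Invented sample lines of one tablet.
lines = [
    "1(disz) udu nita",
    "2(disz) udu",
    "1(disz) masz2 nita",
    "3(disz) gu4",
]
shown = matching_lines('"udu nita"', lines)
print(shown)
```

Only the first line contains the phrase, but the first three lines are shown because each contains udu or nita.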

Simple search mode

The default simple search mode puts quotes around every blank-separated "word":

    if search_mode == 'simple':
        # escape everything between blanks
        searchword = ' '.join(['"%s"'%w for w in word.split()])

So if you enter "udu nita" with quotes, it ends up as ""udu" "nita"", which is not a valid search expression.
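
To see this concretely, the quoting expression from the snippet above can be wrapped in a small function (the function name is hypothetical) and called with and without quotes in the input:

```python
def simple_mode_query(word):
    """Apply the simple-mode quoting to a raw search string."""
    # Same expression as in the snippet above: quote every blank-separated part.
    return ' '.join(['"%s"' % w for w in word.split()])

plain = simple_mode_query('udu nita')
quoted = simple_mode_query('"udu nita"')
print(plain)   # "udu" "nita"
print(quoted)  # ""udu" "nita""
```

The input is split on blanks before quoting, so the user's own quotes end up inside the added ones.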

Advanced search syntax

If you use the advanced search syntax mode, you have to do the quoting yourself whenever you use ATF characters that are also meaningful in ZCatalog search syntax, e.g. (, ), ?, *.
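
If queries are built by a script, the quoting can be done by a small helper. The following is only a sketch: the helper name is hypothetical, and the set of special characters is limited to the four examples named above, so it may be incomplete.

```python
# Characters named above as meaningful in both ATF and ZCatalog syntax.
# Assumption: this list is not exhaustive.
SPECIAL = set('()?*')

def quote_if_needed(term):
    """Wrap a search term in double quotes if it contains a special character."""
    if len(term) >= 2 and term.startswith('"') and term.endswith('"'):
        return term  # already quoted by the user
    if any(ch in SPECIAL for ch in term):
        return '"%s"' % term
    return term

terms = [quote_if_needed(t) for t in ['1(disz)', 'udu', 'sze?']]
print(terms)
```

Terms without special characters are passed through unchanged, so wildcards can still be used deliberately by bypassing the helper.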

Word and grapheme indexes

Lines beginning with

ignoreLines=['$','@','#','&','>']

are excluded from the index.
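
A minimal sketch of such a filter is shown below. The function name and the sample ATF lines are invented, and whether the real indexer strips leading whitespace before testing the first character is not known; this version tests the line as it is.

```python
ignoreLines = ['$', '@', '#', '&', '>']

def indexable_lines(atf_text):
    """Return the lines of an ATF text that would go into the index."""
    keep = []
    for line in atf_text.splitlines():
        # Skip lines whose first character is one of the ignore markers.
        if line.startswith(tuple(ignoreLines)):
            continue
        keep.append(line)
    return keep

# Invented sample text.
sample = "\n".join([
    "&P000001 = Example",
    "#atf: lang sux",
    "@tablet",
    "@obverse",
    "1. 1(disz) udu",
    "$ rest broken",
    "2. 1(disz) nita",
])
kept = indexable_lines(sample)
print(kept)
```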

Commas are deleted unless they follow s, t or h (upper or lower case):

komma_exception="([^sStThH]),"
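
The effect of this pattern can be checked in isolation. The sample strings are invented; the substitution with `\1` is the one used in the splitting code further down.

```python
import re

komma_exception = "([^sStThH]),"
komma_exceptionex = re.compile(komma_exception)

# A comma after any other character is removed, the character itself is kept.
deleted = komma_exceptionex.sub(r"\1", "udu, nita")

# Commas directly after s, t or h are left in place at this stage.
kept = komma_exceptionex.sub(r"\1", "s,a-t,u")

print(deleted)  # udu nita
print(kept)     # s,a-t,u
```

Because the pattern requires a preceding character, a comma at the very start of a string is also left untouched.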

The word index is defined by:

wordBounds="_|,|\""
wordIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"

The grapheme index is defined by:

graphemeBounds="\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|\""
graphemeIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"

The splitting code is:

   # delete commas, except those relevant for graphemes (after s, t, h)
   txt = komma_exceptionex.sub(r"\1",txt)
   # replace word boundaries by spaces
   txt = self.boundsex.sub(' ',txt)
   # delete the characters to be ignored
   txt = self.ignorex.sub('',txt)
   # split into words and drop empty entries
   words = txt.split(" ")
   for w in words:
       w=w.strip()
       if not (w==''):
           result.append(w)
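
To compare the two indexes, the steps above can be put into a stand-alone function. This is a sketch under assumptions: `split_line` is a hypothetical name, the patterns are copied from the definitions above but written as raw strings, and the sample line is invented.

```python
import re

# Patterns copied from the definitions above, written as raw strings.
komma_exception = r'([^sStThH]),'
wordBounds = r'_|,|"'
wordIgnore = r'<|>|\#|\||\]|\[|\!|\?\*|;'
graphemeBounds = r'\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|"'
graphemeIgnore = r'<|>|\#|\||\]|\[|\!|\?\*|;'

komma_exceptionex = re.compile(komma_exception)

def split_line(txt, bounds, ignore):
    """Hypothetical stand-alone version of the splitting steps above."""
    boundsex = re.compile(bounds)
    ignorex = re.compile(ignore)
    txt = komma_exceptionex.sub(r'\1', txt)  # delete commas not after s, t, h
    txt = boundsex.sub(' ', txt)             # boundaries become spaces
    txt = ignorex.sub('', txt)               # ignored characters are deleted
    return [w.strip() for w in txt.split(' ') if w.strip() != '']

line = '{d}en-lil2# [udu]-nita2'
words = split_line(line, wordBounds, wordIgnore)
graphemes = split_line(line, graphemeBounds, graphemeIgnore)
print(words)      # ['{d}en-lil2', 'udu-nita2']
print(graphemes)  # ['d', 'en', 'lil2', 'udu', 'nita2']
```

Note that the alternative `\?\*` in the ignore patterns, as written, only matches the two-character sequence `?*`. A lone `?` or `*` therefore stays in the word index, while in the grapheme index `?` has already been turned into a space by the bounds pattern.
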
cdli_transliteration_search.txt · Last modified: 2009/05/27 17:57 (external edit)
CC Attribution-Noncommercial-Share Alike 4.0 International