CDLI transliteration search

To do: adopt Peter's page (http://cdli.mpiwg-berlin.mpg.de/info/systeminfo.html ?) here and expand it.

Really technical background

The transliteration search currently operates on a per-document (i.e. per-tablet) basis. If you search for two words, it returns all tablets on which both words occur somewhere, not necessarily near each other.
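
As a rough illustration of what tablet-level matching means, the sketch below intersects per-word sets of tablet IDs. The index contents, the tablet IDs and the function name `tablets_with_all` are made up for this example; the real lookup is done by ZCatalog, not by code like this.

```python
# Hypothetical miniature index: word -> set of tablet IDs (IDs are made up).
index = {
    "udu": {"T1", "T2"},
    "nita": {"T2", "T3"},
}

def tablets_with_all(words, index):
    """Return the tablets on which every search word occurs somewhere."""
    sets = [index.get(w, set()) for w in words]
    return set.intersection(*sets) if sets else set()

# T2 is the only tablet that contains both words; where on the tablet
# the words stand plays no role.
print(tablets_with_all(["udu", "nita"], index))
```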

The search currently uses only the ZCatalog index search, on either the word index or the grapheme index.

The ZCatalog search syntax can be found here: http://www.zope.org/Documentation/Books/ZopeBook/2_6Edition/SearchingZCatalog.stx#2-16

Simple search mode

The default simple search mode puts double quotes around every blank-separated "word":

    if search_mode == 'simple':
        # escape everything between blanks
        searchword = ' '.join(['"%s"'%w for w in word.split()])

So if you enter "udu nita" including the quotes, it ends up as ""udu" "nita"", which is illegal search syntax.
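
The effect of the simple mode can be reproduced outside Zope. In this sketch the wrapper function `build_searchword` is a hypothetical name; only the quoting expression is taken from the code above.

```python
def build_searchword(word, search_mode="simple"):
    """Quote every blank-separated token when in simple mode."""
    if search_mode == "simple":
        return " ".join('"%s"' % w for w in word.split())
    # any other mode: pass the query through unchanged (assumption)
    return word

print(build_searchword('udu nita'))    # "udu" "nita"
print(build_searchword('"udu nita"'))  # ""udu" "nita""
```

The second call shows why quotes typed by the user clash with the quotes added automatically.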

If you use the advanced search syntax mode, you have to do the quoting yourself whenever you use ATF characters that are also meaningful in ZCatalog search syntax, e.g. (, ), ?, *.
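
A minimal sketch of doing that quoting by hand, assuming that wrapping a term in double quotes is enough to make ZCatalog treat it literally. The helper name `quote_if_special` and the choice to quote only terms that contain one of the characters listed above are assumptions for illustration, not part of the CDLI code.

```python
# Characters named above that are meaningful in ZCatalog search syntax.
SPECIAL_CHARS = set('()?*')

def quote_if_special(term):
    """Wrap a term in double quotes if it contains a special character."""
    if any(ch in SPECIAL_CHARS for ch in term):
        return '"%s"' % term
    return term

print(quote_if_special('1(disz)'))  # "1(disz)"
print(quote_if_special('udu'))      # udu
```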

Word and grapheme indexes

Lines beginning with any of the characters in

    ignoreLines=['$','@','#','&','>']

are excluded from the index.
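
The line filter can be sketched as follows. The function name `indexable_lines` and the sample text (including its tablet ID) are invented for illustration, and the sketch assumes that only the very first character of a line is checked and that blank lines are dropped as well.

```python
ignoreLines = ['$', '@', '#', '&', '>']

def indexable_lines(atf_text):
    """Return the lines that are not excluded by ignoreLines."""
    prefixes = tuple(ignoreLines)
    return [line for line in atf_text.splitlines()
            if line.strip() and not line.startswith(prefixes)]

# Made-up ATF-like sample, not a real tablet.
sample = """&P000000 = made-up example
@tablet
@obverse
1. 1(disz) udu nita
$ rest broken
#tr.en: 1 ram"""

print(indexable_lines(sample))  # ['1. 1(disz) udu nita']
```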

Commas are deleted unless they directly follow s, t or h (upper or lower case):

    komma_exception="([^sStThH]),"
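
The pattern can be tried in isolation. In this sketch the helper name `drop_commas` is hypothetical; the substitution itself mirrors the one used in the splitting code on this page.

```python
import re

komma_exception = "([^sStThH]),"
komma_exceptionex = re.compile(komma_exception)

def drop_commas(txt):
    """Delete every comma whose preceding character is not s, t or h."""
    return komma_exceptionex.sub(r"\1", txt)

print(drop_commas("udu, nita"))    # udu nita
print(drop_commas("s,a t,u h,a"))  # s,a t,u h,a
```

Because the pattern requires a preceding character, a comma at the very start of the text is left in place.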

The word index is defined by:

    wordBounds="_|,|\""
    wordIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"

The grapheme index is defined by:

    graphemeBounds="\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|\""
    graphemeIgnore="<|>|\#|\||\]|\[|\!|\?\*|;"
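
One detail of the two ignore patterns is easy to overlook: the alternative `\?\*` matches only the two-character sequence `?*`, so a lone `?` or `*` is not removed by them. Whether that is intended cannot be told from this page. The sketch below checks the behaviour with Python's `re` module; the pattern is written as a raw string, which yields the same regular expression as the definition above.

```python
import re

wordIgnore = r'<|>|\#|\||\]|\[|\!|\?\*|;'
ignorex = re.compile(wordIgnore)

print(ignorex.sub('', 'udu#'))   # udu
print(ignorex.sub('', 'udu?*'))  # udu
print(ignorex.sub('', 'udu?'))   # udu?
print(ignorex.sub('', 'udu*'))   # udu*
```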

The splitting code is:

    # delete commas, except those that are part of a grapheme (after s, t, h)
    txt = komma_exceptionex.sub(r"\1", txt)
    # replace word boundaries with spaces
    txt = self.boundsex.sub(' ', txt)
    # remove the characters to be ignored
    txt = self.ignorex.sub('', txt)
    # split into words and drop empty entries
    words = txt.split(" ")
    for w in words:
        w = w.strip()
        if w:
            result.append(w)
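
To see how the two indexes differ, the steps above can be put into a standalone function. This is a sketch, not the CDLI implementation: the function name `split_text` and its parameters are hypothetical, the patterns are copied from the definitions above as raw strings, and the sample line is invented.

```python
import re

komma_exceptionex = re.compile(r'([^sStThH]),')

# Patterns from the definitions above, written as raw strings.
wordBounds = r'_|,|"'
wordIgnore = r'<|>|\#|\||\]|\[|\!|\?\*|;'
graphemeBounds = r'\{|\}|<|>|-|_|\#|,|\]|\[|\!|\?|"'
graphemeIgnore = r'<|>|\#|\||\]|\[|\!|\?\*|;'

def split_text(txt, bounds, ignore):
    """Apply the comma, boundary and ignore steps, then split on spaces."""
    boundsex = re.compile(bounds)
    ignorex = re.compile(ignore)
    txt = komma_exceptionex.sub(r'\1', txt)  # delete non-grapheme commas
    txt = boundsex.sub(' ', txt)             # boundaries become spaces
    txt = ignorex.sub('', txt)               # drop ignored characters
    return [w.strip() for w in txt.split(' ') if w.strip()]

line = '{d}en-lil2 lugal#'
print(split_text(line, wordBounds, wordIgnore))
# ['{d}en-lil2', 'lugal']
print(split_text(line, graphemeBounds, graphemeIgnore))
# ['d', 'en', 'lil2', 'lugal']
```

With the word patterns, braces and hyphens stay inside the word; with the grapheme patterns they act as boundaries, so every sign becomes its own index entry.
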
cdli_transliteration_search.1243441562.txt.gz · Last modified: 2009/05/27 16:26 by casties