[developers] pet-input-chart punctuation-characters

Eric Nichols eric-n at is.naist.jp
Thu Feb 9 23:25:53 CET 2006


Greetings,

I have written a patch that solves this problem. It can be applied to a
vanilla version of PET 0.99.7 and should not conflict with any other
patches. It is a bit of a hack -- I screen the results of the XML
parsing, use punctuationp() to check whether the surface form of each
item is punctuation, and if it is, set the token type to
SKIP_TOKEN_CLASS. I had to add a function in item.h to change the
private _class of tInputItem. This check should probably be done in the
pichandler code, but I was not up to the task of learning the SAX
library to do that ;-)

In writing this, I noticed that using stem() instead of surface() to
access the orthography of the current item does not work -- stem() often
returns a null string even when the word itself is not null. I suspect
this is the reason that the yy tokenizer has the same problem, but I
haven't tested this theory yet.
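
For readers without the attachment, here is a rough sketch of the idea.
It is not the patch itself: InputItem, set_class(), is_punctuation(),
skip_punctuation() and the constant values are hypothetical stand-ins
for tInputItem, the setter added in item.h, and punctuationp(), whose
real signatures in PET may differ. The check is byte-wise, so it only
behaves sensibly for single-byte punctuation; the multi-byte characters
in the JACY setting would need proper Unicode handling.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Assumed values for illustration only; PET defines its own constants.
constexpr int WORD_TOKEN_CLASS = 0;
constexpr int SKIP_TOKEN_CLASS = -1;

// Hypothetical stand-in for PET's tInputItem.
class InputItem {
public:
    explicit InputItem(std::string surface, int token_class = WORD_TOKEN_CLASS)
        : _surface(std::move(surface)), _class(token_class) {}

    const std::string &surface() const { return _surface; }
    int token_class() const { return _class; }

    // Analogue of the setter the patch adds for the private _class member.
    void set_class(int token_class) { _class = token_class; }

private:
    std::string _surface;
    int _class;
};

// Simplified stand-in for punctuationp(): true if s is non-empty and
// every byte of s occurs in punct_chars.
bool is_punctuation(const std::string &s, const std::string &punct_chars) {
    if (s.empty()) return false;
    return s.find_first_not_of(punct_chars) == std::string::npos;
}

// Screen the items produced by the XML parser and mark punctuation-only
// items so that the parser skips them. Returns the number of items marked.
std::size_t skip_punctuation(std::vector<InputItem> &items,
                             const std::string &punct_chars) {
    std::size_t skipped = 0;
    for (auto &item : items) {
        // Use surface(), not stem(): stem() may be empty for a real word.
        if (is_punctuation(item.surface(), punct_chars)) {
            item.set_class(SKIP_TOKEN_CLASS);
            ++skipped;
        }
    }
    assert(skipped <= items.size());
    return skipped;
}
```

Doing this as a post-processing pass keeps the SAX handler untouched,
at the cost of walking the item list a second time.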

Eric Nichols <eric-n at is.naist.jp >

Francis Bond wrote:
> G'day,
>
> currently cheap does not check whether pic input items are members of
> punctuation-characters.  This means that we currently can't parse any
> sentence with a full stop (^_^).  Of course, we can take it out of the
> pic, which is what we are doing, but it would be nice if the new
> preprocessor code handled this within cheap, particularly for MAF
> where we may use the same input for multiple systems.
>
> Our current settings are:
>
> JACY punctuation-characters, found in pet/japanese.set:
> punctuation-characters := "\"!&'()*+,-−./;<=>?@[\]^_`{|}~。?…., ○●◎*".
>
> Note that punctuation-characters are defined separately for the LKB
> (in lkb/globals.lsp):
> (defparameter *punctuation-characters*
>   (append
>    '(#\space #\! #\" #\& #\' #\(
>      #\) #\* #\+ #\, #\- #\. #\/ #\;
>      #\< #\= #\> #\? #\@ #\[ #\\ #\] #\^
>      #\_ #\` #\{ #\| #\} #\~)
>    #+:ics
>    '(#\ideographic_full_stop #\fullwidth_question_mark
>      #\horizontal_ellipsis #\fullwidth_full_stop
>      #\fullwidth_exclamation_mark #\black_circle
>      #\fullwidth_comma #\ideographic_space
>      #\katakana_middle_dot #\white_circle)))
>
> We occasionally get them out of sync (^_^).
>
> --
> Francis Bond  <www.kecl.ntt.co.jp/icl/mtg/members/bond/>
> NTT Communication Science Laboratories | Machine Translation Research Group
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 07_xml_punct.diff
Type: text/x-patch
Size: 1982 bytes
Desc: not available
URL: <http://lists.delph-in.net/archives/developers/attachments/20060210/c43c66be/attachment.bin>