[developers] mocking a POS-tagger to handle unk words

Olga Zamaraeva olzama at uw.edu
Fri Mar 23 21:55:47 CET 2018

OK, I can run ACE in yy mode and I've attempted to enable token mapping
and to map tags to generic entries, but apparently I am missing something.

*On an existing word, it works:*

$ cat ../../yy.txt | ace -g abz.dat -y

SENT: (yy mode)

[ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT: aspect
E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1:
x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number PNG.GEND:
gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person PNG.NUM:
number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2 non-focus x4
e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0 1 (9
basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token [
+FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST
list ] *+TNT
tnt [ +TAGS cons [ FIRST \"VB\" REST null ]* +PRBS cons [ FIRST
\"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
+CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
+TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))

NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
actives used, 2 passives used) RAM: 41k

*But on an unknown word it still does not:*

 ace Murka$ cat ../../yy.txt | ace -g abz.dat -y

NOTE: lexemes do not span position 0 `baabb'!

NOTE: post reduction gap

SKIP: (yy mode)
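The contents of yy.txt are not reproduced in this message. As a rough sketch of how such input could be generated, the Python below "mocks" a POS tagger by giving every whitespace-separated word the same tag. The field order of the YY token follows my reading of the PetInput wiki page and should be treated as an assumption; the function names `yy_token` and `mock_tag` are made up for illustration, and the tag "VB" is simply the one visible in the +TAGS value above.

```python
def yy_token(token_id, start, end, char_from, char_to, form, tag, prob=1.0):
    """Format one token in YY notation.

    Assumed field order: id, start vertex, end vertex, <from:to> character
    span, path, form, ipos, lexical rules, then POS tag and probability.
    """
    # Escape backslashes and double quotes so the form stays a valid string.
    safe_form = form.replace("\\", "\\\\").replace('"', '\\"')
    return '({}, {}, {}, <{}:{}>, 1, "{}", 0, "null", "{}" {:.4f})'.format(
        token_id, start, end, char_from, char_to, safe_form, tag, prob)


def mock_tag(sentence, tag="VB"):
    """Pretend to be a POS tagger: every word receives the same tag."""
    tokens = []
    offset = 0
    for i, word in enumerate(sentence.split()):
        # Character span of the word in the original string (end exclusive).
        char_from = sentence.index(word, offset)
        char_to = char_from + len(word)
        tokens.append(yy_token(i + 1, i, i + 1, char_from, char_to, word, tag))
        offset = char_to
    return " ".join(tokens)


print(mock_tag("baabb"))
# prints: (1, 0, 1, <0:5>, 1, "baabb", 0, "null", "VB" 1.0000)
```

If this were saved as a (hypothetical) mock_yy.py, its output could be piped into ACE the same way as yy.txt, i.e. `python mock_yy.py | ace -g abz.dat -y`.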

*Does anyone have an idea what I have likely failed to define/enable?*

*I've defined token paths like in the ERG, because that's where I copied
other types from:*

token-mapping := enabled.

lexicon-tokens-path := TOKENS +LIST.

lexicon-last-token-path := TOKENS +LAST.

token-type := token.

token-form-path     := +FORM.       ; [required] string for lexical lookup

token-id-path       := +ID.         ; [optional] list of external ids

token-from-path     := +FROM.       ; [optional] surface start position

token-to-path       := +TO.         ; [optional] surface end position

token-postags-path  := +TNT +TAGS.  ; [optional] list of POS tags

token-posprobs-path := +TNT +PRBS.  ; [optional] list of POS probabilities
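To spell out my understanding of what these settings are supposed to do, here is a small Python sketch of where each piece of a YY token would end up under the paths above. This is only an illustration and not how ACE is implemented; the helper names are made up, and the values are the ones visible in the token structure of the successful parse (+FORM "baab", +FROM "0", +TO "4", tag "VB" with probability "1.000000").

```python
# Paths copied from the settings above; +ID is left out because the
# parse output above does not show a concrete value for it.
TOKEN_PATHS = {
    "form": ["+FORM"],
    "from": ["+FROM"],
    "to": ["+TO"],
    "postags": ["+TNT", "+TAGS"],
    "posprobs": ["+TNT", "+PRBS"],
}


def set_path(structure, path, value):
    """Store value in a nested dict, creating intermediate levels as needed."""
    node = structure
    for feature in path[:-1]:
        node = node.setdefault(feature, {})
    node[path[-1]] = value


def token_structure(form, char_from, char_to, tags, probs):
    """Lay out one token's fields under the configured feature paths."""
    values = {
        "form": form,
        "from": str(char_from),
        "to": str(char_to),
        "postags": list(tags),
        "posprobs": ["{:.6f}".format(p) for p in probs],
    }
    token = {}
    for name, path in TOKEN_PATHS.items():
        set_path(token, path, values[name])
    return token


print(token_structure("baab", 0, 4, ["VB"], [1.0]))
```

If the POS tags of an unknown word do not arrive under +TNT +TAGS in this way, I would expect the mapping from tags to generic entries to have nothing to work with.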
>> Following Woodley's suggestion, for YY-mode I can point you to a few
>> things.
>> In Jacy, we use POS tags from an external morphological analyzer
>> (previously Chasen; recently MeCab). We have a script that takes the output
>> of MeCab and transforms it into the YY format. Note the definition of the
>> pos_info variable---it holds POS data that is slightly more complex than a
>> simple, e.g., NNS or VBG tag.
>>     https://github.com/delph-in/jacy/blob/develop/utils/jpn2yy
>> Then see gle.tdl in Jacy, which maps the POS "tags" to generic lexical
>> entries:
>>     https://github.com/delph-in/jacy/blob/develop/gle.tdl
>> For ACE (and presumably other processors) you might also need to define
>> paths to the token info:
>> https://github.com/delph-in/jacy/blob/develop/ace/config.tdl#L143-L151
>> When you call ACE you'll need to tell it to expect YY input. I think it's
>> the -y option. There might be some other pieces to this that Woodley or
>> Francis can probably fill in for you. In my experiments, YY mode did help a
>> bit for getting parses where the standard machinery for unknowns failed.
>> If you're working in Python, then PyDelphin's 'tokens' module can help
>> with constructing YY input. This section of the relevant unit tests might
>> be informative:
>> https://github.com/delph-in/pydelphin/blob/develop/tests/tokens_test.py#L40-L59
>> On Thu, Mar 22, 2018 at 3:40 PM, Woodley Packard <sweaglesw at sweaglesw.org> wrote:
>>> Hi Olga,
>>> Since you are interested primarily in a demonstration rather than a real
>>> world system from what I understand, why not specify the POS tags as part
>>> of the input, using YY mode?
>>> Woodley
>>> On Mar 22, 2018, at 11:42 AM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>> Dear developers!
>>> I am looking into the problem of handling unknown roots with LKB and ACE
>>> in a situation where we want to first be able to analyze the word
>>> morphologically (apply lexical rules).
>>> I had already sent an email about that a year ago, and Francis and I
>>> actually sat down and went through the process of constructing a minimal
>>> example which showed that there was a problem of some sort preventing us
>>> from analyzing the word morphologically and using the unknown word handling
>>> machinery at the same time.
>>> Alas, I cannot recover any record of this. It is possible that we did
>>> that on Francis's computer,...
>>> Anyway, I want to reconstruct this minimal example one more time, this
>>> time hopefully understanding more and producing some actual documentation.
>>> I would like to start from recreating what e.g. the ERG does: treating
>>> the words as full-form, relying on a POS tag which maps the word to a
>>> specific unknown_type.
>>> I have a small grammar to which I added what I was able to detect as
>>> relevant in the ERG (generic lexical entries, unknown onset, etc.). I also
>>> added mtr.tdl and included it in the script.
>>> The next thing I need to understand (I think) is what it actually means
>>> to "mock the POS tagger". How do I make the system aware of that
>>> information?
>>> I can see that the tags can be mapped to the generic lexical entries as
>>> described in http://moin.delph-in.net/PetInput. But how do I get the
>>> tags in the first place? Suppose I just want to consider everything the
>>> same POS, for starters.
>>> Thank you!
>>> Olga
