[developers] mocking a POS-tagger to handle unk words

Fri Mar 30 20:36:11 CEST 2018

I'm picking this up for Olga. I've followed the same steps and am
encountering the same issue: I can parse a known word in YY mode, but
not an unknown word. I've attached the toy grammar we are using. If anyone
has insight into what we are missing, we'd appreciate it. Here is an example,
where "baab" is a known word and "baac" is not.

[kphowell at patas ace-0.9.26]$ ./ace -g ../aggregation/analyses/unknown-roots-morphology/data/abz-modified/ace/abz.dat -y
(42, 0, 1, <0:4>, 1, "baab", 0, "null", "VB" 1.0)
SENT: (yy mode)
[ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT: aspect
E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1:
x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number PNG.GEND:
gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person PNG.NUM:
number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2 non-focus x4
e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0 1 (9
basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token [
+FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST list ]
+TNT tnt [ +TAGS cons [ FIRST \"VB\" REST null ] +PRBS cons [ FIRST
\"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
+CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
+TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))
NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
actives used, 2 passives used)    RAM: 41k

(42, 0, 1, <0:4>, 1, "baac", 0, "null", "VB" 1.0)
NOTE: lexemes do not span position 0 `baac'!
NOTE: post reduction gap
SKIP: (yy mode)
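
In case it helps anyone reproduce this, here is a minimal sketch of how YY
input lines like the two above could be generated, with every word given the
same POS tag. The function names (yy_token, yy_sentence) are made up for this
example and are not part of ACE or PyDelphin. The field order simply copies
the input lines shown above; the comments on what each field means follow my
reading of the PetInput wiki page and should be treated as assumptions. Forms
containing double quotes are not handled.

```python
def yy_token(token_id, start, end, char_from, char_to, form, tags):
    """Format one YY token with the same field layout as the lines above.

    `tags` is a list of (tag, probability) pairs, e.g. [("VB", 1.0)].
    """
    fields = [
        str(token_id),               # token id
        str(start),                  # start vertex in the token lattice
        str(end),                    # end vertex in the token lattice
        f"<{char_from}:{char_to}>",  # character span in the surface string
        "1",                         # lattice path (assumed: always 1 here)
        f'"{form}"',                 # form used for lexical lookup
        "0",                         # ipos (assumed: always 0 here)
        '"null"',                    # lexical rule field, as in the example
    ]
    pos = " ".join(f'"{tag}" {prob}' for tag, prob in tags)
    if pos:
        fields.append(pos)           # POS tags with their probabilities
    return "(" + ", ".join(fields) + ")"


def yy_sentence(words, default_tag="VB", first_id=42):
    """Build one line of YY input, tagging every word with `default_tag`."""
    tokens = []
    offset = 0  # running character offset, assuming single-space separators
    for i, word in enumerate(words):
        tokens.append(
            yy_token(first_id + i, i, i + 1, offset, offset + len(word),
                     word, [(default_tag, 1.0)])
        )
        offset += len(word) + 1
    return " ".join(tokens)


if __name__ == "__main__":
    print(yy_sentence(["baab"]))
    print(yy_sentence(["baac"]))
```

The output of a script like this can be saved to a file and piped to ace with
-y, as in the commands in this thread. It only produces the input; it does not
address why the tag fails to select a generic entry.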

Best,
Kristen

On Fri, Mar 23, 2018 at 1:55 PM, Olga Zamaraeva <olzama at uw.edu> wrote:

> OK, I can run ACE in yy mode and I've attempted to enable token mapping
> and to map tags to generic entries, but apparently I am missing some
> step(s).
>
> *On an existing word, it works:*
>
> $cat ../../yy.txt | ace -g abz.dat -y
>
> SENT: (yy mode)
>
> [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT: aspect
> E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1:
> x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number PNG.GEND:
> gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person PNG.NUM:
> number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2 non-focus x4
> e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0 1 (9
> basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token [
> +FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST list ] *+TNT
> tnt [ +TAGS cons [ FIRST \"VB\" REST null ]* +PRBS cons [ FIRST
> \"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
> +CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
> bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
> +TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))
>
> NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
> actives used, 2 passives used) RAM: 41k
>
> *But on an unknown word it still does not:*
>
>  ace Murka$ cat ../../yy.txt | ace -g abz.dat -y
>
> NOTE: lexemes do not span position 0 `baabb'!
>
> NOTE: post reduction gap
>
> SKIP: (yy mode)
>
> *Does anyone have an idea what I have likely failed to define/enable?*
>
> *I've defined token paths like in the ERG, because that's where I copied
> other types from:*
>
> token-mapping := enabled.
>
> lexicon-tokens-path := TOKENS +LIST.
>
> lexicon-last-token-path := TOKENS +LAST.
>
> token-type := token.
>
> token-form-path     := +FORM.       ; [required] string for lexical lookup
>
> token-id-path       := +ID.         ; [optional] list of external ids
>
> token-from-path     := +FROM.       ; [optional] surface start position
>
> token-to-path       := +TO.         ; [optional] surface end position
>
> token-postags-path  := +TNT +TAGS.  ; [optional] list of POS tags
>
> token-posprobs-path := +TNT +PRBS.  ; [optional] list of POS probabilities
>
> *Thank you,*
> *Olga*
>
> On Fri, Mar 23, 2018 at 10:38 AM Olga Zamaraeva <olzama at uw.edu> wrote:
>
>> Thanks very much, Paul, Woodley, and Michael. Michael, thanks especially
>> for the detailed explanation!
>>
>> I did not notice that YY mode has a field for a POS tag. I will try that
>> then.
>>
>> Best,
>> Olga
>>
>> On Thu, Mar 22, 2018 at 4:11 PM Michael Wayne Goodman <goodmami at uw.edu>
>> wrote:
>>
>>> Following Woodley's suggestion, for YY-mode I can point you to a few
>>> things.
>>>
>>> In Jacy, we use POS tags from an external morphological analyzer
>>> (previously Chasen; recently MeCab). We have a script that takes the output
>>> of MeCab and transforms it into the YY format. Note the definition of the
>>> pos_info variable---it holds POS data that is slightly more complex than a
>>> simple, e.g., NNS or VBG tag.
>>>
>>>     https://github.com/delph-in/jacy/blob/develop/utils/jpn2yy
>>>
>>> Then see gle.tdl in Jacy, which maps the POS "tags" to generic lexical
>>> entries:
>>>
>>>     https://github.com/delph-in/jacy/blob/develop/gle.tdl
>>>
>>> For ACE (and presumably other processors) you might also need to define
>>> paths to the token info:
>>>
>>>     https://github.com/delph-in/jacy/blob/develop/ace/config.tdl#L143-L151
>>>
>>> When you call ACE you'll need to tell it to expect YY input. I think
>>> it's the -y option. There might be some other pieces to this that Woodley
>>> or Francis can probably fill in for you. In my experiments, YY mode did
>>> help a bit for getting parses where the standard machinery for unknowns
>>> failed.
>>>
>>> If you're working in Python, then PyDelphin's 'tokens' module can help
>>> with constructing YY input. This section of the relevant unit tests might
>>> be informative:
>>>
>>>     https://github.com/delph-in/pydelphin/blob/develop/tests/tokens_test.py#L40-L59
>>>
>>> On Thu, Mar 22, 2018 at 3:40 PM, Woodley Packard <
>>> sweaglesw at sweaglesw.org> wrote:
>>>
>>>> Hi Olga,
>>>>
>>>> Since you are interested primarily in a demonstration rather than a
>>>> real world system from what I understand, why not specify the POS tags as
>>>> part of the input, using YY mode?
>>>>
>>>> Woodley
>>>>
>>>> On Mar 22, 2018, at 11:42 AM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>>>
>>>> Dear developers!
>>>>
>>>> I am looking into the problem of handling unknown roots with LKB and
>>>> ACE in a situation where we want to first be able to analyze the word
>>>> morphologically (apply lexical rules).
>>>>
>>>> I had already sent an email about that a year ago, and Francis and I
>>>> actually sat down and went through the process of constructing a minimal
>>>> example which showed that there was a problem of some sort preventing us
>>>> from analyzing the word morphologically and using the unknown word handling
>>>> machinery at the same time.
>>>>
>>>> Alas, I cannot recover any record of this. It is possible that we did
>>>> that on Francis's computer,...
>>>>
>>>> Anyway, I want to reconstruct this minimal example one more time, this
>>>> time hopefully understanding more and producing some actual documentation.
>>>>
>>>> I would like to start by recreating what e.g. the ERG does: treating
>>>> the words as full-form, relying on a POS tag which maps the word to a
>>>> specific unknown_type.
>>>>
>>>> I have a small grammar to which I added what I was able to detect as
>>>> relevant in the ERG (generic lexical entries, unknown onset, etc.). I also
>>>> added mtr.tdl and included it in the script.
>>>>
>>>> The next thing I need to understand (I think) is what it actually means
>>>> to "mock the POS tagger". How do I make the system aware of that
>>>> information?
>>>>
>>>> I can see that the tags can be mapped to the generic lexical entries as
>>>> described in http://moin.delph-in.net/PetInput. But how do I get the
>>>> tags in the first place? Suppose I just want to consider everything the
>>>> same POS, for starters.
>>>>
>>>> Thank you!
>>>> Olga
>>>>
>>>>
>>>
>>>
>>> --
>>> Michael Wayne Goodman
>>> Ph.D. Candidate, UW Linguistics
>>>
>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: abz_toy.tar
Type: application/x-tar
Size: 989843 bytes
Desc: not available
URL: <http://lists.delph-in.net/archives/developers/attachments/20180330/ecf6787d/attachment-0001.tar>