[developers] mocking a POS-tagger to handle unk words

Kristen Howell kphowell at uw.edu
Mon Apr 2 19:51:00 CEST 2018


Thank you Francis, Olga and Woodley! With these additions I was able to
parse the unknown word "baac" and the inflected form of the unknown root
"ha-baac".

On Fri, Mar 30, 2018 at 11:01 PM, Woodley Packard <sweaglesw at sweaglesw.org>
wrote:

> Hi Olga and Kristen,
>
> You were close.  As Francis mentioned, you need to define some generic
> lexical entries.  You managed to declare types for generic lexical entries,
> but not the entries themselves.  Add the following to abz-pet.tdl, near the
> main lexicon section:
>
> :begin :instance :status generic-lex-entry.
> :include "generic-lexicon".
> :end :instance.
>
> and then create the generic-lexicon.tdl file containing a single statement:
>
> generic-verb := generic_verb_lex_etry & [ STEM < string > ].
>
> With those changes, I was able to successfully parse using the YY lattice
> Kristen gave for "baac".  I noticed a message about a loopy optional
> complement rule, so the generic verb lexical type may be a bit too
> underspecified, e.g. regarding its opinion about its valence (in addition
> to being shy one ’N’).
>
> I apologize for not getting back to you about this quicker; somehow I
> missed Olga’s March 23rd email (though I see it in my mailbox now when I go
> back and look).
>
> Good luck, and let me know if you run into more trouble!
> Woodley
>
> On Mar 30, 2018, at 7:35 PM, Olga Zamaraeva <olzama at uw.edu> wrote:
>
> Here are some relevant types. When I started working on it, I mostly
> copied over from the ERG whatever seemed relevant. (I see now that the
> #pred identity is present twice.)
>
> generic_verb_lex_etry := unknown_word & basic-verb-lex &
>   [ SYNSEM.LKEYS.KEYREL.PRED #pred,
>     ORTH < "_generic_vb_" >,
>     TOKENS.+LIST < [ +TNT.+TAGS.FIRST "VB", +PRED #pred ] > ].
>
> unknown_word := norm_unknown_word.
>
> norm_unknown_word := basic_unknown_word &
>   [ SYNSEM [ LOCAL.CONT.HOOK.LTOP #ltop,
>              LKEYS.KEYREL [ LBL #ltop,
>                             PRED #pred ] ],
>     TOKENS.+LIST.FIRST.+PRED #pred ].
>
> basic_unknown_word := basic_generic_lex_entry.
>
> generic_lex_entry := basic_generic_lex_entry &
>   [ TOKENS.+LIST < [ +TNT null_tnt ] > ].
>
> basic_generic_lex_entry := word &
>   [ SYNSEM.PHON.ONSET unk_onset ].
>
>
>
> On Fri, Mar 30, 2018 at 7:18 PM Francis Bond <bond at ieee.org> wrote:
>
>> G'day,
>>
>> if you run ace in a more verbose mode (I think -vv should be enough) it
>> tells you a bit more about what it is doing with the tokens.
>>
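For anyone scripting this step, here is a minimal Python sketch of calling ACE in YY mode with the extra verbosity suggested above. The `-g` and `-y` options are the ones used in the transcripts in this thread; `-vv` is the setting guessed at above and is not verified here. The helper names `ace_command` and `parse_yy` are hypothetical, and the sketch assumes an `ace` binary on the PATH.

```python
import shutil
import subprocess
from typing import List, Optional


def ace_command(grammar: str, verbose: bool = False, ace: str = "ace") -> List[str]:
    """Build an ACE command line for YY-mode parsing (-g and -y as used in this thread)."""
    cmd = [ace, "-g", grammar, "-y"]
    if verbose:
        # Verbosity level suggested in the thread; treat it as an assumption.
        cmd.append("-vv")
    return cmd


def parse_yy(yy_line: str, grammar: str,
             verbose: bool = False) -> Optional[subprocess.CompletedProcess]:
    """Feed one line of YY input to ACE on stdin; return None if ace is not on PATH."""
    cmd = ace_command(grammar, verbose)
    if shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, input=yy_line + "\n",
                          capture_output=True, text=True)
```

With a compiled grammar image, `parse_yy(line, "abz.dat", verbose=True)` would return the completed process, whose `stdout` and `stderr` can then be inspected for token-handling messages.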
>> In addition to yy-mode, you must also have some generic lexical entries
>> for unknown words.
>>
>> You can find some nice examples by Sanghoun in:
>> https://github.com/delph-in/zhong/blob/master/cmn/gle.tdl
>> (I think easier to follow than Jacy).
>>
>> Can you show the lexical type you want to instantiate?
>>
>>
>>
>>
>> On Sat, Mar 31, 2018 at 2:36 AM, Kristen Howell <kphowell at uw.edu> wrote:
>>
>>> I'm picking this up for Olga. I've followed the same steps and am
>>> encountering the same issue, where I can parse a known word in YY mode, but
>>> not an unknown word. I've attached the toy grammar we are using. If anyone
>>> has insight on what we are missing, we'd appreciate it. Here is an example,
>>> where "baab" is a known word and "baac" is not.
>>>
>>> [kphowell at patas ace-0.9.26]$ ./ace -g ../aggregation/analyses/unknown-roots-morphology/data/abz-modified/ace/abz.dat -y
>>> (42, 0, 1, <0:4>, 1, "baab", 0, "null", "VB" 1.0)
>>> SENT: (yy mode)
>>> [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT:
>>> aspect E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2
>>> ARG1: x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number
>>> PNG.GEND: gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person
>>> PNG.NUM: number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2
>>> non-focus x4 e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0 1 (9
>>> basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token [
>>> +FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST list ]
>>> +TNT tnt [ +TAGS cons [ FIRST \"VB\" REST null ] +PRBS cons [ FIRST
>>> \"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
>>> +CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
>>> bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
>>> +TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))
>>> NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
>>> actives used, 2 passives used)    RAM: 41k
>>>
>>>
>>> (42, 0, 1, <0:4>, 1, "baac", 0, "null", "VB" 1.0)
>>> NOTE: lexemes do not span position 0 `baac'!
>>> NOTE: post reduction gap
>>> SKIP: (yy mode)
>>>
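When running many test items, it can help to pull the uncovered tokens out of ACE's messages automatically. The sketch below is only an illustration: the regular expression is modelled on the exact wording of the NOTE line in the transcript above, which may differ in other ACE versions, and the name `uncovered_tokens` is hypothetical.

```python
import re
from typing import List, Tuple

# Modelled on the message in this thread:
#   NOTE: lexemes do not span position 0 `baac'!
GAP_NOTE = re.compile(r"NOTE: lexemes do not span position (\d+) `(.*)'!")


def uncovered_tokens(ace_output: str) -> List[Tuple[int, str]]:
    """Return (position, form) for each token ACE reported as lacking a lexeme."""
    return [(int(m.group(1)), m.group(2)) for m in GAP_NOTE.finditer(ace_output)]


if __name__ == "__main__":
    log = (
        "NOTE: lexemes do not span position 0 `baac'!\n"
        "NOTE: post reduction gap\n"
        "SKIP: (yy mode)\n"
    )
    print(uncovered_tokens(log))
```

An empty result for an item that was still skipped would suggest the failure lies elsewhere than lexical coverage.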
>>> Best,
>>> Kristen
>>>
>>> On Fri, Mar 23, 2018 at 1:55 PM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>>
>>>> OK, I can run ACE in yy mode and I've attempted to enable token
>>>> mapping and to map tags to generic entries, but apparently I am missing
>>>> some step(s).
>>>>
>>>> *On an existing word, it works:*
>>>>
>>>> $ cat ../../yy.txt | ace -g abz.dat -y
>>>>
>>>> SENT: (yy mode)
>>>>
>>>> [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT:
>>>> aspect E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2
>>>> ARG1: x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number
>>>> PNG.GEND: gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person
>>>> PNG.NUM: number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2
>>>> non-focus x4 e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0
>>>> 1 (9 basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token
>>>> [ +FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST list
>>>> ] *+TNT tnt [ +TAGS cons [ FIRST \"VB\" REST null ]* +PRBS cons [
>>>> FIRST \"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
>>>> +CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
>>>> bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
>>>> +TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))
>>>>
>>>> NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
>>>> actives used, 2 passives used) RAM: 41k
>>>> *But on an unknown word it still does not:*
>>>>
>>>>  ace Murka$ cat ../../yy.txt | ace -g abz.dat -y
>>>>
>>>> NOTE: lexemes do not span position 0 `baabb'!
>>>>
>>>> NOTE: post reduction gap
>>>>
>>>> SKIP: (yy mode)
>>>>
>>>> *Does anyone have an idea what I have likely failed to define/enable?*
>>>>
>>>> *I've defined token paths like in the ERG, because that's where I
>>>> copied other types from:*
>>>>
>>>> token-mapping := enabled.
>>>>
>>>> lexicon-tokens-path := TOKENS +LIST.
>>>>
>>>> lexicon-last-token-path := TOKENS +LAST.
>>>>
>>>> token-type := token.
>>>>
>>>> token-form-path     := +FORM.       ; [required] string for lexical lookup
>>>>
>>>> token-id-path       := +ID.         ; [optional] list of external ids
>>>>
>>>> token-from-path     := +FROM.       ; [optional] surface start position
>>>>
>>>> token-to-path       := +TO.         ; [optional] surface end position
>>>>
>>>> token-postags-path  := +TNT +TAGS.  ; [optional] list of POS tags
>>>>
>>>> token-posprobs-path := +TNT +PRBS.  ; [optional] list of POS probabilities
>>>> *Thank you,*
>>>> *Olga*
>>>>
>>>> On Fri, Mar 23, 2018 at 10:38 AM Olga Zamaraeva <olzama at uw.edu> wrote:
>>>>
>>>>> Thanks very much, Paul, Woodley, and Michael. Michael, thanks
>>>>> especially for the detailed explanation!
>>>>>
>>>>> I did not notice that YY mode has a field for a POS tag. I will try
>>>>> that then.
>>>>>
>>>>> Best,
>>>>> Olga
>>>>>
>>>>> On Thu, Mar 22, 2018 at 4:11 PM Michael Wayne Goodman <goodmami at uw.edu>
>>>>> wrote:
>>>>>
>>>>>> Following Woodley's suggestion, for YY-mode I can point you to a few
>>>>>> things.
>>>>>>
>>>>>> In Jacy, we use POS tags from an external morphological analyzer
>>>>>> (previously Chasen; recently MeCab). We have a script that takes the output
>>>>>> of MeCab and transforms it into the YY format. Note the definition of the
>>>>>> pos_info variable---it holds POS data that is slightly more complex than a
>>>>>> simple, e.g., NNS or VBG tag.
>>>>>>
>>>>>>     https://github.com/delph-in/jacy/blob/develop/utils/jpn2yy
>>>>>>
>>>>>> Then see gle.tdl in Jacy, which maps the POS "tags" to generic
>>>>>> lexical entries:
>>>>>>
>>>>>>     https://github.com/delph-in/jacy/blob/develop/gle.tdl
>>>>>>
>>>>>> For ACE (and presumably other processors) you might also need to
>>>>>> define paths to the token info:
>>>>>>
>>>>>>     https://github.com/delph-in/jacy/blob/develop/ace/config.tdl#L143-L151
>>>>>>
>>>>>> When you call ACE you'll need to tell it to expect YY input. I think
>>>>>> it's the -y option. There might be some other pieces to this that Woodley
>>>>>> or Francis can probably fill in for you. In my experiments, YY mode did
>>>>>> help a bit for getting parses where the standard machinery for unknowns
>>>>>> failed.
>>>>>>
>>>>>> If you're working in Python, then PyDelphin's 'tokens' module can
>>>>>> help with constructing YY input. This section of the relevant unit tests
>>>>>> might be informative:
>>>>>>
>>>>>>     https://github.com/delph-in/pydelphin/blob/develop/tests/tokens_test.py#L40-L59
>>>>>>
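As a library-free illustration of what "mocking the POS tagger" can look like, the Python sketch below splits a sentence on whitespace, gives every token the same tag, and prints it in the YY token layout used elsewhere in this thread. It deliberately does not use PyDelphin, so none of the names here (`YYToken`, `mock_tag`, `to_yy`) are that library's API. The fixed path `1`, ipos `0`, lexical rule `"null"` and the starting id `42` are simply copied from the example token in this thread and should be treated as assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class YYToken:
    """One YY token; field order follows the example token in this thread."""
    id: int
    start: int       # start vertex in the token lattice
    end: int         # end vertex in the token lattice
    char_from: int   # character offset where the form starts
    char_to: int     # character offset where the form ends
    form: str
    tag: str
    prob: float = 1.0

    def __str__(self) -> str:
        # Path 1, ipos 0 and lexical rule "null" are fixed, as in the thread's example.
        return (f'({self.id}, {self.start}, {self.end}, '
                f'<{self.char_from}:{self.char_to}>, 1, "{self.form}", 0, '
                f'"null", "{self.tag}" {self.prob})')


def mock_tag(sentence: str, tag: str = "VB", first_id: int = 42) -> List[YYToken]:
    """Split on whitespace and assign every token the same POS tag."""
    tokens = []
    offset = 0
    for i, form in enumerate(sentence.split()):
        char_from = sentence.index(form, offset)
        char_to = char_from + len(form)
        tokens.append(YYToken(first_id + i, i, i + 1, char_from, char_to, form, tag))
        offset = char_to
    return tokens


def to_yy(sentence: str, tag: str = "VB") -> str:
    """Render a whole sentence as one line of space-separated YY tokens."""
    return " ".join(str(t) for t in mock_tag(sentence, tag))


if __name__ == "__main__":
    print(to_yy("baac"))
```

The resulting line can be written to a file and piped to `ace -g abz.dat -y`. Whether a given tag actually licenses a parse still depends on the grammar mapping that tag to a generic lexical entry.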
>>>>>> On Thu, Mar 22, 2018 at 3:40 PM, Woodley Packard <
>>>>>> sweaglesw at sweaglesw.org> wrote:
>>>>>>
>>>>>>> Hi Olga,
>>>>>>>
>>>>>>> Since you are interested primarily in a demonstration rather than a
>>>>>>> real world system from what I understand, why not specify the POS tags as
>>>>>>> part of the input, using YY mode?
>>>>>>>
>>>>>>> Woodley
>>>>>>>
>>>>>>> On Mar 22, 2018, at 11:42 AM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>>>>>>
>>>>>>> Dear developers!
>>>>>>>
>>>>>>> I am looking into the problem of handling unknown roots with LKB and
>>>>>>> ACE in a situation where we want to first be able to analyze the word
>>>>>>> morphologically (apply lexical rules).
>>>>>>>
>>>>>>> I had already sent an email about that a year ago, and Francis and I
>>>>>>> actually sat down and went through the process of constructing a minimal
>>>>>>> example which showed that there was a problem of some sort preventing us
>>>>>>> from analyzing the word morphologically and using the unknown word handling
>>>>>>> machinery at the same time.
>>>>>>>
>>>>>>> Alas, I cannot recover any record of this. It is possible that we
>>>>>>> did that on Francis's computer...
>>>>>>>
>>>>>>> Anyway, I want to reconstruct this minimal example one more time,
>>>>>>> this time hopefully understanding more and producing some actual
>>>>>>> documentation.
>>>>>>>
>>>>>>> I would like to start from recreating what e.g. the ERG does:
>>>>>>> treating the words as full-form, relying on a POS tag which maps the word
>>>>>>> to a specific unknown_type.
>>>>>>>
>>>>>>> I have a small grammar to which I added what I was able to detect as
>>>>>>> relevant in the ERG (generic lexical entries, unknown onset etc). I also
>>>>>>> added mtr.tdl and included it in the script.
>>>>>>>
>>>>>>> The next thing I need to understand (I think) is what it actually
>>>>>>> means to "mock the POS tagger". How do I make the system aware of that
>>>>>>> information?
>>>>>>>
>>>>>>> I can see that the tags can be mapped to the generic lexical entries
>>>>>>> as described in http://moin.delph-in.net/PetInput. But how do I get
>>>>>>> the tags in the first place? Suppose I just want to consider everything the
>>>>>>> same POS, for starters.
>>>>>>>
>>>>>>> Thank you!
>>>>>>> Olga
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Michael Wayne Goodman
>>>>>> Ph.D. Candidate, UW Linguistics
>>>>>>
>>>>>
>>>
>>
>>
>> --
>> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
>> Division of Linguistics and Multilingual Studies
>> Nanyang Technological University
>>
>
>