[developers] mocking a POS-tagger to handle unk words

Francis Bond bond at ieee.org
Tue Apr 3 01:59:39 CEST 2018


Glad to hear it!

On Tue, 3 Apr 2018, 01:51 Kristen Howell, <kphowell at uw.edu> wrote:

> Thank you Francis, Olga and Woodley! With these additions I was able to
> parse the unknown word "baac" and the inflected form of the unknown root
> "ha-baac".
>
> On Fri, Mar 30, 2018 at 11:01 PM, Woodley Packard <sweaglesw at sweaglesw.org
> > wrote:
>
>> Hi Olga and Kristen,
>>
>> You were close.  As Francis mentioned, you need to define some generic
>> lexical entries.  You managed to declare types for generic lexical entries,
>> but not the entries themselves.  Add the following to abz-pet.tdl, near the
>> main lexicon section:
>>
>> :begin :instance :status generic-lex-entry.
>> :include "generic-lexicon".
>> :end :instance.
>>
>> and then create the generic-lexicon.tdl file containing a single
>> statement:
>>
>> generic-verb := generic_verb_lex_etry & [ STEM < string > ].
>>
>> With those changes, I was able to successfully parse using the YY lattice
>> Kristen gave for "baac".  I noticed a message about a loopy optional
>> complement rule, so the generic verb lexical type may be a bit too
>> underspecified, e.g. regarding its opinion about its valence (in addition
>> to being shy one ’N’).
>>
>> I apologize for not getting back to you about this more quickly; somehow I
>> missed Olga’s March 23rd email (though I see it in my mailbox now when I go
>> back and look).
>>
>> Good luck, and let me know if you run into more trouble!
>> Woodley
>>
>> On Mar 30, 2018, at 7:35 PM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>
>> Here are some relevant types. When I started working on it, I mostly
>> copied things which seemed relevant, over from ERG. (I see now that the
>> #pred identity is present twice).
>>
>> generic_verb_lex_etry := unknown_word & basic-verb-lex &
>>   [ SYNSEM.LKEYS.KEYREL.PRED #pred,
>>     ORTH < "_generic_vb_" >,
>>     TOKENS.+LIST < [ +TNT.+TAGS.FIRST "VB", +PRED #pred ] > ].
>>
>> unknown_word := norm_unknown_word.
>>
>> norm_unknown_word := basic_unknown_word &
>>   [ SYNSEM [ LOCAL.CONT.HOOK.LTOP #ltop,
>>              LKEYS.KEYREL [ LBL #ltop,
>>                             PRED #pred ] ],
>>     TOKENS.+LIST.FIRST.+PRED #pred ].
>>
>> basic_unknown_word := basic_generic_lex_entry.
>>
>> generic_lex_entry := basic_generic_lex_entry &
>>   [ TOKENS.+LIST < [ +TNT null_tnt ] > ].
>>
>> basic_generic_lex_entry := word &
>>   [ SYNSEM.PHON.ONSET unk_onset ].
>>
>>
>>
>> On Fri, Mar 30, 2018 at 7:18 PM Francis Bond <bond at ieee.org> wrote:
>>
>>> G'day,
>>>
>>> if you run ace in a more verbose mode (I think -vv should be enough) it
>>> tells you a bit more about what it is doing with the tokens.
>>>
>>> In addition to yy-mode, you must also have some generic lexical entries
>>> for unknown words.
>>>
>>> You can find some nice examples by Sanghoun in:
>>> https://github.com/delph-in/zhong/blob/master/cmn/gle.tdl
>>> (I think easier to follow than Jacy).
>>>
>>> Can you show the lexical type you want to instantiate?
>>>
>>>
>>>
>>>
>>> On Sat, Mar 31, 2018 at 2:36 AM, Kristen Howell <kphowell at uw.edu> wrote:
>>>
>>>> I'm picking this up for Olga. I've followed the same steps and am
>>>> encountering the same issue, where I can parse a known word in YY mode, but
>>>> not an unknown word. I've attached the toy grammar we are using. If anyone
>>>> has insight on what we are missing, we'd appreciate it. Here is an example,
>>>> where "baab" is a known word and "baac" is not.
>>>>
>>>> [kphowell at patas ace-0.9.26]$ ./ace -g ../aggregation/analyses/unknown-roots-morphology/data/abz-modified/ace/abz.dat -y
>>>> (42, 0, 1, <0:4>, 1, "baab", 0, "null", "VB" 1.0)
>>>> SENT: (yy mode)
>>>> [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT:
>>>> aspect E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2
>>>> ARG1: x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number
>>>> PNG.GEND: gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person
>>>> PNG.NUM: number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2
>>>> non-focus x4 e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0 1 (9
>>>> basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token [
>>>> +FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST list ]
>>>> +TNT tnt [ +TAGS cons [ FIRST \"VB\" REST null ] +PRBS cons [ FIRST
>>>> \"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
>>>> +CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
>>>> bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
>>>> +TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))
>>>> NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
>>>> actives used, 2 passives used)    RAM: 41k
>>>>
>>>>
>>>> (42, 0, 1, <0:4>, 1, "baac", 0, "null", "VB" 1.0)
>>>> NOTE: lexemes do not span position 0 `baac'!
>>>> NOTE: post reduction gap
>>>> SKIP: (yy mode)
>>>>
>>>> Best,
>>>> Kristen
>>>>
>>>> On Fri, Mar 23, 2018 at 1:55 PM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>>>
>>>>> OK, I can run ACE in yy mode and I've attempted to enable token
>>>>> mapping and to map tags to generic entries, but apparently I am missing
>>>>> some step(s).
>>>>>
>>>>> *On an existing word, it works:*
>>>>>
>>>>> $cat ../../yy.txt | ace -g abz.dat -y
>>>>>
>>>>> SENT: (yy mode)
>>>>>
>>>>> [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT:
>>>>> aspect E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2
>>>>> ARG1: x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number
>>>>> PNG.GEND: gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person
>>>>> PNG.NUM: number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2
>>>>> non-focus x4 e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0
>>>>> 1 (9 basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token
>>>>> [ +FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST list
>>>>> ] *+TNT tnt [ +TAGS cons [ FIRST \"VB\" REST null ]* +PRBS cons [
>>>>> FIRST \"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
>>>>> +CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
>>>>> bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
>>>>> +TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))
>>>>>
>>>>> NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
>>>>> actives used, 2 passives used) RAM: 41k
>>>>> *But on an unknown word it still does not:*
>>>>>
>>>>>  ace Murka$ cat ../../yy.txt | ace -g abz.dat -y
>>>>>
>>>>> NOTE: lexemes do not span position 0 `baabb'!
>>>>>
>>>>> NOTE: post reduction gap
>>>>>
>>>>> SKIP: (yy mode)
>>>>>
>>>>> *Does anyone have an idea what I have likely failed to define/enable?*
>>>>>
>>>>> *I've defined token paths like in the ERG, because that's where I
>>>>> copied other types from:*
>>>>>
>>>>> token-mapping := enabled.
>>>>>
>>>>> lexicon-tokens-path := TOKENS +LIST.
>>>>>
>>>>> lexicon-last-token-path := TOKENS +LAST.
>>>>>
>>>>> token-type := token.
>>>>>
>>>>> token-form-path     := +FORM.       ; [required] string for lexical lookup
>>>>>
>>>>> token-id-path       := +ID.         ; [optional] list of external ids
>>>>>
>>>>> token-from-path     := +FROM.       ; [optional] surface start position
>>>>>
>>>>> token-to-path       := +TO.         ; [optional] surface end position
>>>>>
>>>>> token-postags-path  := +TNT +TAGS.  ; [optional] list of POS tags
>>>>>
>>>>> token-posprobs-path := +TNT +PRBS.  ; [optional] list of POS probabilities
>>>>> *Thank you,*
>>>>> *Olga*
>>>>>
>>>>> On Fri, Mar 23, 2018 at 10:38 AM Olga Zamaraeva <olzama at uw.edu> wrote:
>>>>>
>>>>>> Thanks very much, Paul, Woodley, and Michael. Michael, thanks
>>>>>> especially for the detailed explanation!
>>>>>>
>>>>>> I did not notice that YY mode has a field for a POS tag. I will try
>>>>>> that then.
>>>>>>
>>>>>> Best,
>>>>>> Olga
>>>>>>
>>>>>> On Thu, Mar 22, 2018 at 4:11 PM Michael Wayne Goodman <
>>>>>> goodmami at uw.edu> wrote:
>>>>>>
>>>>>>> Following Woodley's suggestion, for YY-mode I can point you to a few
>>>>>>> things.
>>>>>>>
>>>>>>> In Jacy, we use POS tags from an external morphological analyzer
>>>>>>> (previously Chasen; recently MeCab). We have a script that takes the output
>>>>>>> of MeCab and transforms it into the YY format. Note the definition of the
>>>>>>> pos_info variable---it holds POS data that is slightly more complex than a
>>>>>>> simple, e.g., NNS or VBG tag.
>>>>>>>
>>>>>>>     https://github.com/delph-in/jacy/blob/develop/utils/jpn2yy
>>>>>>>
>>>>>>> Then see gle.tdl in Jacy, which maps the POS "tags" to generic
>>>>>>> lexical entries:
>>>>>>>
>>>>>>>     https://github.com/delph-in/jacy/blob/develop/gle.tdl
>>>>>>>
>>>>>>> For ACE (and presumably other processors) you might also need to
>>>>>>> define paths to the token info:
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/delph-in/jacy/blob/develop/ace/config.tdl#L143-L151
>>>>>>>
>>>>>>> When you call ACE you'll need to tell it to expect YY input. I think
>>>>>>> it's the -y option. There might be some other pieces to this that Woodley
>>>>>>> or Francis can probably fill in for you. In my experiments, YY mode did
>>>>>>> help a bit for getting parses where the standard machinery for unknowns
>>>>>>> failed.
>>>>>>>
>>>>>>> If you're working in Python, then PyDelphin's 'tokens' module can
>>>>>>> help with constructing YY input. This section of the relevant unit tests
>>>>>>> might be informative:
>>>>>>>
>>>>>>>
>>>>>>> https://github.com/delph-in/pydelphin/blob/develop/tests/tokens_test.py#L40-L59
>>>>>>>
>>>>>>> On Thu, Mar 22, 2018 at 3:40 PM, Woodley Packard <
>>>>>>> sweaglesw at sweaglesw.org> wrote:
>>>>>>>
>>>>>>>> Hi Olga,
>>>>>>>>
>>>>>>>> Since you are interested primarily in a demonstration rather than a
>>>>>>>> real world system from what I understand, why not specify the POS tags as
>>>>>>>> part of the input, using YY mode?
>>>>>>>>
>>>>>>>> Woodley
>>>>>>>>
>>>>>>>> On Mar 22, 2018, at 11:42 AM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>>>>>>>
>>>>>>>> Dear developers!
>>>>>>>>
>>>>>>>> I am looking into the problem of handling unknown roots with LKB
>>>>>>>> and ACE in a situation where we want to first be able to analyze the word
>>>>>>>> morphologically (apply lexical rules).
>>>>>>>>
>>>>>>>> I had already sent an email about that a year ago, and Francis and
>>>>>>>> I actually sat down and went through the process of constructing a minimal
>>>>>>>> example which showed that there was a problem of some sort preventing us
>>>>>>>> from analyzing the word morphologically and using the unknown word handling
>>>>>>>> machinery at the same time.
>>>>>>>>
>>>>>>>> Alas, I cannot recover any record of this. It is possible that we
>>>>>>>> did that on Francis's computer,...
>>>>>>>>
>>>>>>>> Anyway, I want to reconstruct this minimal example one more time,
>>>>>>>> this time hopefully understanding more and producing some actual
>>>>>>>> documentation.
>>>>>>>>
>>>>>>>> I would like to start from recreating what e.g. the ERG does:
>>>>>>>> treating the words as full-form, relying on a POS tag which maps the word
>>>>>>>> to a specific unknown_type.
>>>>>>>>
>>>>>>>> I have a small grammar to which I added what I was able to detect
>>>>>>>> as relevant in the ERG (generic lexical entries, unknown onset, etc.). I also
>>>>>>>> added mtr.tdl and included it in the script.
>>>>>>>>
>>>>>>>> The next thing I need to understand (I think) is what it actually
>>>>>>>> means to "mock the POS tagger". How do I make the system aware of that
>>>>>>>> information?
>>>>>>>>
>>>>>>>> I can see that the tags can be mapped to the generic lexical
>>>>>>>> entries as described in http://moin.delph-in.net/PetInput. But how
>>>>>>>> do I get the tags in the first place? Suppose I just want to consider
>>>>>>>> everything the same POS, for starters.
>>>>>>>>
>>>>>>>> Thank you!
>>>>>>>> Olga
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Michael Wayne Goodman
>>>>>>> Ph.D. Candidate, UW Linguistics
>>>>>>>
>>>>>>
>>>>
>>>
>>>
>>> --
>>> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
>>> Division of Linguistics and Multilingual Studies
>>> Nanyang Technological University
>>>
>>
>>
>
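
For anyone retracing this thread: the "mock POS tagger" discussed above can
be as small as a script that gives every word the same tag and prints the
result as YY tokens for `ace -y`. The sketch below is only an illustration,
not code from any of the grammars mentioned. The function names, the default
tag "VB", and the starting id 42 are assumptions chosen to reproduce the
token Kristen quoted; the field layout (id, start vertex, end vertex,
character span, path, form, ipos, lexical rule, then tag and probability) is
read off that one example and should be checked against the PetInput page
before relying on it.

```python
def mock_pos_tag(words, tag="VB"):
    """Pretend to be a POS tagger: give every word the same tag, probability 1.0."""
    return [(word, tag, 1.0) for word in words]


def to_yy(tagged, first_id=42):
    """Render (form, tag, prob) triples as a single-path YY token lattice.

    Assumes the words were separated by exactly one space in the input,
    so character offsets advance by len(form) + 1 per token.
    """
    tokens = []
    char_pos = 0
    for i, (form, tag, prob) in enumerate(tagged):
        char_end = char_pos + len(form)
        # Escape backslashes and double quotes so the form stays a valid string.
        safe_form = form.replace("\\", "\\\\").replace('"', '\\"')
        tokens.append(
            '({id}, {start}, {end}, <{cfrom}:{cto}>, 1, "{form}", 0, "null", '
            '"{tag}" {prob:.1f})'.format(
                id=first_id + i,
                start=i,          # lattice vertex before the token
                end=i + 1,        # lattice vertex after the token
                cfrom=char_pos,
                cto=char_end,
                form=safe_form,
                tag=tag,
                prob=prob,
            )
        )
        char_pos = char_end + 1
    return " ".join(tokens)


if __name__ == "__main__":
    # One line of YY input per sentence; pipe this into: ace -g abz.dat -y
    print(to_yy(mock_pos_tag("baac".split())))
```

Tagging everything as "VB" only helps if the grammar has a generic lexical
entry keyed to that tag, as in Woodley's generic-lexicon.tdl suggestion; the
script itself does nothing to the grammar.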