[developers] mocking a POS-tagger to handle unk words

Olga Zamaraeva olzama at uw.edu
Sat Mar 31 04:35:02 CEST 2018


Here are some relevant types. When I started working on this, I mostly
copied over from the ERG whatever seemed relevant. (I see now that the
#pred identity is stated twice.)

generic_verb_lex_etry := unknown_word & basic-verb-lex &
  [ SYNSEM.LKEYS.KEYREL.PRED #pred,
    ORTH < "_generic_vb_" >,
    TOKENS.+LIST < [ +TNT.+TAGS.FIRST "VB", +PRED #pred ] > ].

unknown_word := norm_unknown_word.

norm_unknown_word := basic_unknown_word &
  [ SYNSEM [ LOCAL.CONT.HOOK.LTOP #ltop,
             LKEYS.KEYREL [ LBL #ltop,
                            PRED #pred ] ],
    TOKENS.+LIST.FIRST.+PRED #pred ].

basic_unknown_word := basic_generic_lex_entry.

generic_lex_entry := basic_generic_lex_entry &
  [ TOKENS.+LIST < [ +TNT null_tnt ] > ].

basic_generic_lex_entry := word &
  [ SYNSEM.PHON.ONSET unk_onset ].
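
For anyone reproducing this, here is a minimal Python sketch of what
mocking the tagger might look like: it splits a sentence on whitespace and
gives every token the same tag, producing YY input in the shape of the
examples quoted below. The function name mock_yy_tokens and its parameters
are made up for illustration. It assumes forms contain no double quotes or
backslashes, and it always emits 1 for the path field and "null" for the
lexical rule field, as in those examples.

```python
def mock_yy_tokens(sentence, tag="VB", prob=1.0, first_id=42):
    """Return one line of YY-mode input for a whitespace-tokenized sentence.

    Every token receives the same POS tag, i.e. the tagger is mocked.
    Hypothetical helper; assumes forms need no escaping.
    """
    tokens = []
    search_from = 0
    for i, form in enumerate(sentence.split()):
        # Character span of the token in the original string.
        cfrom = sentence.index(form, search_from)
        cto = cfrom + len(form)
        search_from = cto
        # Fields: id, start vertex, end vertex, <from:to>, path,
        # form, ipos, lexical rule, then tag/probability pairs.
        tokens.append(
            '({}, {}, {}, <{}:{}>, 1, "{}", 0, "null", "{}" {})'.format(
                first_id + i, i, i + 1, cfrom, cto, form, tag, prob
            )
        )
    return " ".join(tokens)


if __name__ == "__main__":
    print(mock_yy_tokens("baac"))
```

The printed line can be piped to ace with the -y option, as in the
commands quoted below. The tag has to match what the generic entry
expects in +TNT.+TAGS.FIRST ("VB" for the verb entry above).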



On Fri, Mar 30, 2018 at 7:18 PM Francis Bond <bond at ieee.org> wrote:

> G'day,
>
> if you run ace in a more verbose mode (I think -vv should be enough) it
> tells you a bit more about what it is doing with the tokens.
>
> In addition to yy-mode, you must also have some generic lexical entries
> for unknown words.
>
> You can find some nice examples by Sanghoun in:
> https://github.com/delph-in/zhong/blob/master/cmn/gle.tdl
> (I think easier to follow than Jacy).
>
> Can you show the lexical type you want to instantiate?
>
>
>
>
> On Sat, Mar 31, 2018 at 2:36 AM, Kristen Howell <kphowell at uw.edu> wrote:
>
>> I'm picking this up for Olga. I've followed the same steps and am
>> encountering the same issue, where I can parse a known word in YY mode, but
>> not an unknown word. I've attached the toy grammar we are using. If anyone
>> has insight on what we are missing, we'd appreciate it. Here is an example,
>> where "baab" is a known word and "baac" is not.
>>
>> [kphowell at patas ace-0.9.26]$ ./ace -g ../aggregation/analyses/unknown-roots-morphology/data/abz-modified/ace/abz.dat -y
>> (42, 0, 1, <0:4>, 1, "baab", 0, "null", "VB" 1.0)
>> SENT: (yy mode)
>> [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT: aspect
>> E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1:
>> x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number PNG.GEND:
>> gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person PNG.NUM:
>> number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2 non-focus x4
>> e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0 1 (9
>> basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token [
>> +FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST list ]
>> +TNT tnt [ +TAGS cons [ FIRST \"VB\" REST null ] +PRBS cons [ FIRST
>> \"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
>> +CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
>> bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
>> +TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))
>> NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
>> actives used, 2 passives used)    RAM: 41k
>>
>>
>> (42, 0, 1, <0:4>, 1, "baac", 0, "null", "VB" 1.0)
>> NOTE: lexemes do not span position 0 `baac'!
>> NOTE: post reduction gap
>> SKIP: (yy mode)
>>
>> Best,
>> Kristen
>>
>> On Fri, Mar 23, 2018 at 1:55 PM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>
>>> OK, I can run ACE in yy mode and I've attempted to enable token mapping
>>> and to map tags to generic entries, but apparently I am missing some
>>> step(s).
>>>
>>> *On an existing word, it works:*
>>>
>>> $cat ../../yy.txt | ace -g abz.dat -y
>>>
>>> SENT: (yy mode)
>>>
>>> [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT:
>>> aspect E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2
>>> ARG1: x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number
>>> PNG.GEND: gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person
>>> PNG.NUM: number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2
>>> non-focus x4 e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0 1
>>> (9 basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token [
>>> +FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST list ] *+TNT
>>> tnt [ +TAGS cons [ FIRST \"VB\" REST null ]* +PRBS cons [ FIRST
>>> \"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
>>> +CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
>>> bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
>>> +TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))
>>>
>>> NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
>>> actives used, 2 passives used) RAM: 41k
>>> *But on an unknown word it still does not:*
>>>
>>>  ace Murka$ cat ../../yy.txt | ace -g abz.dat -y
>>>
>>> NOTE: lexemes do not span position 0 `baabb'!
>>>
>>> NOTE: post reduction gap
>>>
>>> SKIP: (yy mode)
>>>
>>> *Does anyone have an idea what I have likely failed to define/enable?*
>>>
>>> *I've defined token paths like in the ERG, because that's where I copied
>>> other types from:*
>>>
>>> token-mapping := enabled.
>>>
>>> lexicon-tokens-path := TOKENS +LIST.
>>>
>>> lexicon-last-token-path := TOKENS +LAST.
>>>
>>> token-type := token.
>>>
>>> token-form-path     := +FORM.       ; [required] string for lexical lookup
>>>
>>> token-id-path       := +ID.         ; [optional] list of external ids
>>>
>>> token-from-path     := +FROM.       ; [optional] surface start position
>>>
>>> token-to-path       := +TO.         ; [optional] surface end position
>>>
>>> token-postags-path  := +TNT +TAGS.  ; [optional] list of POS tags
>>>
>>> token-posprobs-path := +TNT +PRBS.  ; [optional] list of POS probabilities
>>> *Thank you,*
>>> *Olga*
>>>
>>> On Fri, Mar 23, 2018 at 10:38 AM Olga Zamaraeva <olzama at uw.edu> wrote:
>>>
>>>> Thanks very much, Paul, Woodley, and Michael. Michael, thanks
>>>> especially for the detailed explanation!
>>>>
>>>> I did not notice that YY mode has a field for a POS tag. I will try
>>>> that then.
>>>>
>>>> Best,
>>>> Olga
>>>>
>>>> On Thu, Mar 22, 2018 at 4:11 PM Michael Wayne Goodman <goodmami at uw.edu>
>>>> wrote:
>>>>
>>>>> Following Woodley's suggestion, for YY-mode I can point you to a few
>>>>> things.
>>>>>
>>>>> In Jacy, we use POS tags from an external morphological analyzer
>>>>> (previously Chasen; recently MeCab). We have a script that takes the output
>>>>> of MeCab and transforms it into the YY format. Note the definition of the
>>>>> pos_info variable---it holds POS data that is slightly more complex than a
>>>>> simple, e.g., NNS or VBG tag.
>>>>>
>>>>>     https://github.com/delph-in/jacy/blob/develop/utils/jpn2yy
>>>>>
>>>>> Then see gle.tdl in Jacy, which maps the POS "tags" to generic lexical
>>>>> entries:
>>>>>
>>>>>     https://github.com/delph-in/jacy/blob/develop/gle.tdl
>>>>>
>>>>> For ACE (and presumably other processors) you might also need to
>>>>> define paths to the token info:
>>>>>
>>>>>
>>>>> https://github.com/delph-in/jacy/blob/develop/ace/config.tdl#L143-L151
>>>>>
>>>>> When you call ACE you'll need to tell it to expect YY input. I think
>>>>> it's the -y option. There might be some other pieces to this that Woodley
>>>>> or Francis can probably fill in for you. In my experiments, YY mode did
>>>>> help a bit for getting parses where the standard machinery for unknowns
>>>>> failed.
>>>>>
>>>>> If you're working in Python, then PyDelphin's 'tokens' module can help
>>>>> with constructing YY input. This section of the relevant unit tests might
>>>>> be informative:
>>>>>
>>>>>
>>>>> https://github.com/delph-in/pydelphin/blob/develop/tests/tokens_test.py#L40-L59
>>>>>
>>>>> On Thu, Mar 22, 2018 at 3:40 PM, Woodley Packard <
>>>>> sweaglesw at sweaglesw.org> wrote:
>>>>>
>>>>>> Hi Olga,
>>>>>>
>>>>>> Since you are interested primarily in a demonstration rather than a
>>>>>> real world system from what I understand, why not specify the POS tags as
>>>>>> part of the input, using YY mode?
>>>>>>
>>>>>> Woodley
>>>>>>
>>>>>> On Mar 22, 2018, at 11:42 AM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>>>>>
>>>>>> Dear developers!
>>>>>>
>>>>>> I am looking into the problem of handling unknown roots with LKB and
>>>>>> ACE in a situation where we want to first be able to analyze the word
>>>>>> morphologically (apply lexical rules).
>>>>>>
>>>>>> I had already sent an email about that a year ago, and Francis and I
>>>>>> actually sat down and went through the process of constructing a minimal
>>>>>> example which showed that there was a problem of some sort preventing us
>>>>>> from analyzing the word morphologically and using the unknown word handling
>>>>>> machinery at the same time.
>>>>>>
>>>>>> Alas, I cannot recover any record of this. It is possible that we did
>>>>>> that on Francis's computer,...
>>>>>>
>>>>>> Anyway, I want to reconstruct this minimal example one more time,
>>>>>> this time hopefully understanding more and producing some actual
>>>>>> documentation.
>>>>>>
>>>>>> I would like to start from recreating what e.g. the ERG does:
>>>>>> treating the words as full-form, relying on a POS tag which maps the word
>>>>>> to a specific unknown_type.
>>>>>>
>>>>>> I have a small grammar to which I added what I was able to identify as
>>>>>> relevant in the ERG (generic lexical entries, unknown onset, etc.). I
>>>>>> also added mtr.tdl and included it in the script.
>>>>>>
>>>>>> The next thing I need to understand (I think) is what it actually
>>>>>> means to "mock the POS tagger". How do I make the system aware of that
>>>>>> information?
>>>>>>
>>>>>> I can see that the tags can be mapped to the generic lexical entries
>>>>>> as described in http://moin.delph-in.net/PetInput. But how do I get
>>>>>> the tags in the first place? Suppose I just want to consider everything the
>>>>>> same POS, for starters.
>>>>>>
>>>>>> Thank you!
>>>>>> Olga
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Michael Wayne Goodman
>>>>> Ph.D. Candidate, UW Linguistics
>>>>>
>>>>
>>
>
>
> --
> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
>