[developers] mocking a POS-tagger to handle unk words

Sat Mar 31 04:17:07 CEST 2018

G'day,

If you run ace in a more verbose mode (I think -vv should be enough), it
tells you a bit more about what it is doing with the tokens.
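
The lines to look for in that output are the NOTE lines, like the
"lexemes do not span position" one in Kristen's log below. A small Python
sketch for pulling them out of captured output, assuming the message keeps
exactly the wording shown in that log (the function name lexical_gaps is
just made up for illustration):

```python
import re

# Pattern copied from the NOTE line in the log quoted below; the exact
# wording is an assumption and may differ between ACE versions.
GAP_NOTE = re.compile(r"^NOTE: lexemes do not span position (\d+) `(.*)'!$")


def lexical_gaps(ace_output):
    """Return (position, form) pairs for each lexical-gap note found."""
    gaps = []
    for line in ace_output.splitlines():
        match = GAP_NOTE.match(line.strip())
        if match:
            gaps.append((int(match.group(1)), match.group(2)))
    return gaps


log = """\
NOTE: lexemes do not span position 0 `baac'!
NOTE: post reduction gap
SKIP: (yy mode)
"""
print(lexical_gaps(log))  # [(0, 'baac')]
```

Every form it reports is a token for which no lexical entry, native or
generic, was found.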

In addition to yy-mode, you must also have some generic lexical entries for
unknown words.

You can find some nice examples by Sanghoun in:
https://github.com/delph-in/zhong/blob/master/cmn/gle.tdl
(I think easier to follow than Jacy).

Can you show the lexical type you want to instantiate?
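
If it helps while experimenting, here is a rough Python sketch (standard
library only) for generating YY input shaped like the token lines in
Kristen's log below, giving every word the same tag, as Olga originally
wanted. The helper names yy_token and yy_sentence are made up for this
example, and I am assuming whitespace tokenization, ipos 0, lrule "null",
and forms that contain no double quotes:

```python
def yy_token(tok_id, start, end, cfrom, cto, form, tags,
             path=1, ipos=0, lrules=("null",)):
    """Format one YY token; `tags` is a list of (tag, probability) pairs."""
    fields = [
        str(tok_id),
        str(start),                       # start vertex
        str(end),                         # end vertex
        f"<{cfrom}:{cto}>",               # character span
        str(path),
        f'"{form}"',                      # assumes no '"' inside the form
        str(ipos),
        " ".join(f'"{rule}"' for rule in lrules),
    ]
    pos = " ".join(f'"{tag}" {float(prob)}' for tag, prob in tags)
    if pos:
        fields.append(pos)
    return "(" + ", ".join(fields) + ")"


def yy_sentence(text, tag="VB"):
    """Whitespace-tokenize `text` and give every token the same POS tag."""
    tokens = []
    search_from = 0
    for i, word in enumerate(text.split()):
        cfrom = text.index(word, search_from)
        cto = cfrom + len(word)
        tokens.append(yy_token(i + 1, i, i + 1, cfrom, cto, word, [(tag, 1.0)]))
        search_from = cto
    return " ".join(tokens)


print(yy_token(42, 0, 1, 0, 4, "baac", [("VB", 1.0)]))
# (42, 0, 1, <0:4>, 1, "baac", 0, "null", "VB" 1.0)
```

The output can be written to a file and piped into ace with -y, the same
way yy.txt is used in Olga's message below.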

On Sat, Mar 31, 2018 at 2:36 AM, Kristen Howell <kphowell at uw.edu> wrote:

> I'm picking this up for Olga. I've followed the same steps and am
> encountering the same issue, where I can parse a known word in YY mode, but
> not an unknown word. I've attached the toy grammar we are using. If anyone
> has insight on what we are missing, we'd appreciate it. Here is an example,
> where "baab" is a known word and "baac" is not.
>
> [kphowell at patas ace-0.9.26]$ ./ace -g ../aggregation/analyses/
> unknown-roots-morphology/data/abz-modified/ace/abz.dat -y
> (42, 0, 1, <0:4>, 1, "baab", 0, "null", "VB" 1.0)
> SENT: (yy mode)
> [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT: aspect
> E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1:
> x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number PNG.GEND:
> gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person PNG.NUM:
> number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2 non-focus x4
> e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0 1 (9
> basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token [
> +FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST list ]
> +TNT tnt [ +TAGS cons [ FIRST \"VB\" REST null ] +PRBS cons [ FIRST
> \"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
> +CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
> bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
> +TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))
> NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
> actives used, 2 passives used)    RAM: 41k
>
>
> (42, 0, 1, <0:4>, 1, "baac", 0, "null", "VB" 1.0)
> NOTE: lexemes do not span position 0 `baac'!
> NOTE: post reduction gap
> SKIP: (yy mode)
>
> Best,
> Kristen
>
> On Fri, Mar 23, 2018 at 1:55 PM, Olga Zamaraeva <olzama at uw.edu> wrote:
>
>> OK, I can run ACE in yy mode and I've attempted to enable token mapping
>> and to map tags to generic entries, but apparently I am missing some
>> step(s).
>>
>> *On an existing word, it works:*
>>
>> $cat ../../yy.txt | ace -g abz.dat -y
>>
>> SENT: (yy mode)
>>
>> [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques E.TENSE: tense E.ASPECT: aspect
>> E.MOOD: mood ] RELS: < [ "_strike.pfv_v_rel"<-1:-1> LBL: h1 ARG0: e2 ARG1:
>> x3 [ x SPECI: bool COG-ST: in-foc PNG.PER: person PNG.NUM: number PNG.GEND:
>> gender ] ARG2: x4 [ x SPECI: bool COG-ST: cog-st PNG.PER: person PNG.NUM:
>> number PNG.GEND: gender ] ] > HCONS: < h0 qeq h1 > ICONS: < e2 non-focus x4
>> e2 non-focus x3 > ] ;  (10 decl-head-opt-subj 0.000000 0 1 (9
>> basic-head-opt-comp 0.000000 0 1 (2 baab 0.000000 0 1 ("baab" 1 "token [
>> +FORM \"baab\" +FROM \"0\" +TO \"4\" +ID diff-list [ LIST list LAST list ] *+TNT
>> tnt [ +TAGS cons [ FIRST \"VB\" REST null ]* +PRBS cons [ FIRST
>> \"1.000000\" REST null ] +MAIN tnt_main [ +TAG string +PRB string ] ]
>> +CLASS token_class +TRAIT token_trait [ +UW bool +IT italics +LB
>> bracket_list +RB bracket_list +HD token_head [ +LL ctype [ -CTYPE- string ]
>> +TG string +TI string ] ] +PRED predsort +CARG string +TICK bool ]"))))
>>
>> NOTE: 1 readings, added 6 / 2 edges to chart (3 fully instantiated, 2
>> actives used, 2 passives used) RAM: 41k
>> *But on an unknown word it still does not:*
>>
>>  ace Murka$ cat ../../yy.txt | ace -g abz.dat -y
>>
>> NOTE: lexemes do not span position 0 `baabb'!
>>
>> NOTE: post reduction gap
>>
>> SKIP: (yy mode)
>>
>> *Does anyone have an idea what I have likely failed to define/enable?*
>>
>> *I've defined token paths like in the ERG, because that's where I copied
>> other types from:*
>>
>> token-mapping := enabled.
>>
>> lexicon-tokens-path := TOKENS +LIST.
>>
>> lexicon-last-token-path := TOKENS +LAST.
>>
>> token-type := token.
>>
>> token-form-path     := +FORM.       ; [required] string for lexical lookup
>>
>> token-id-path       := +ID.         ; [optional] list of external ids
>>
>> token-from-path     := +FROM.       ; [optional] surface start position
>>
>> token-to-path       := +TO.         ; [optional] surface end position
>>
>> token-postags-path  := +TNT +TAGS.  ; [optional] list of POS tags
>>
>> token-posprobs-path := +TNT +PRBS.  ; [optional] list of POS probabilities
>> *Thank you,*
>> *Olga*
>>
>> On Fri, Mar 23, 2018 at 10:38 AM Olga Zamaraeva <olzama at uw.edu> wrote:
>>
>>> Thanks very much, Paul, Woodley, and Michael. Michael, thanks especially
>>> for the detailed explanation!
>>>
>>> I did not notice that YY mode has a field for a POS tag. I will try that
>>> then.
>>>
>>> Best,
>>> Olga
>>>
>>> On Thu, Mar 22, 2018 at 4:11 PM Michael Wayne Goodman <goodmami at uw.edu>
>>> wrote:
>>>
>>>> Following Woodley's suggestion, for YY-mode I can point you to a few
>>>> things.
>>>>
>>>> In Jacy, we use POS tags from an external morphological analyzer
>>>> (previously Chasen; recently MeCab). We have a script that takes the output
>>>> of MeCab and transforms it into the YY format. Note the definition of the
>>>> pos_info variable---it holds POS data that is slightly more complex than a
>>>> simple, e.g., NNS or VBG tag.
>>>>
>>>>     https://github.com/delph-in/jacy/blob/develop/utils/jpn2yy
>>>>
>>>> Then see gle.tdl in Jacy, which maps the POS "tags" to generic lexical
>>>> entries:
>>>>
>>>>     https://github.com/delph-in/jacy/blob/develop/gle.tdl
>>>>
>>>> For ACE (and presumably other processors) you might also need to define
>>>> paths to the token info:
>>>>
>>>>     https://github.com/delph-in/jacy/blob/develop/ace/config.tdl#L143-L151
>>>>
>>>> When you call ACE you'll need to tell it to expect YY input. I think
>>>> it's the -y option. There might be some other pieces to this that Woodley
>>>> or Francis can probably fill in for you. In my experiments, YY mode did
>>>> help a bit for getting parses where the standard machinery for unknowns
>>>> failed.
>>>>
>>>> If you're working in Python, then PyDelphin's 'tokens' module can help
>>>> with constructing YY input. This section of the relevant unit tests might
>>>> be informative:
>>>>
>>>>     https://github.com/delph-in/pydelphin/blob/develop/tests/tokens_test.py#L40-L59
>>>>
>>>> On Thu, Mar 22, 2018 at 3:40 PM, Woodley Packard <
>>>> sweaglesw at sweaglesw.org> wrote:
>>>>
>>>>> Hi Olga,
>>>>>
>>>>> Since you are interested primarily in a demonstration rather than a
>>>>> real world system from what I understand, why not specify the POS tags as
>>>>> part of the input, using YY mode?
>>>>>
>>>>> Woodley
>>>>>
>>>>> On Mar 22, 2018, at 11:42 AM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>>>>
>>>>> Dear developers!
>>>>>
>>>>> I am looking into the problem of handling unknown roots with LKB and
>>>>> ACE in a situation where we want to first be able to analyze the word
>>>>> morphologically (apply lexical rules).
>>>>>
>>>>> I had already sent an email about that a year ago, and Francis and I
>>>>> actually sat down and went through the process of constructing a minimal
>>>>> example which showed that there was a problem of some sort preventing us
>>>>> from analyzing the word morphologically and using the unknown word handling
>>>>> machinery at the same time.
>>>>>
>>>>> Alas, I cannot recover any record of this. It is possible that we did
>>>>> that on Francis's computer,...
>>>>>
>>>>> Anyway, I want to reconstruct this minimal example one more time, this
>>>>> time hopefully understanding more and producing some actual documentation.
>>>>>
>>>>> I would like to start from recreating what e.g. the ERG does: treating
>>>>> the words as full-form, relying on a POS tag which maps the word to a
>>>>> specific unknown_type.
>>>>>
>>>>> I have a small grammar to which I added what I was able to detect as
>>>>> relevant in the ERG (generic lexical entries, unknown onset, etc.). I also
>>>>> added mtr.tdl and included it in the script.
>>>>>
>>>>> The next thing I need to understand (I think) is what it actually
>>>>> means to "mock the POS tagger". How do I make the system aware of that
>>>>> information?
>>>>>
>>>>> I can see that the tags can be mapped to the generic lexical entries
>>>>> as described in http://moin.delph-in.net/PetInput. But how do I get
>>>>> the tags in the first place? Suppose I just want to consider everything the
>>>>> same POS, for starters.
>>>>>
>>>>> Thank you!
>>>>> Olga
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Michael Wayne Goodman
>>>> Ph.D. Candidate, UW Linguistics
>>>>
>>>
>

-- 
Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University