[developers] mocking a POS-tagger to handle unk words

Fri Mar 23 18:37:59 CET 2018

Thanks very much, Paul, Woodley, and Michael. Michael, thanks especially
for the detailed explanation!

I had not noticed that YY mode has a field for a POS tag. I will try
that.
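
For the "everything is the same POS" case, a minimal sketch of what mocking the tagger could look like is below. It writes YY tokens by hand in plain Python and does not use PyDelphin. The field order (id, start, end, <from:to>, path, form, ipos, lrule, then tag and probability pairs) is my reading of the YY format on the PetInput page, and the helper names and the default "NN" tag are purely illustrative. The tag would have to match whatever the grammar's generic lexical entries expect.

```python
def yy_token(tok_id, start, end, cfrom, cto, form, pos_tag, prob=1.0):
    """Format one token in (assumed) YY syntax with a single POS tag."""
    # Escape backslashes and double quotes so the form stays a valid string.
    escaped = form.replace("\\", "\\\\").replace('"', '\\"')
    return '({}, {}, {}, <{}:{}>, 1, "{}", 0, "null", "{}" {:.4f})'.format(
        tok_id, start, end, cfrom, cto, escaped, pos_tag, prob
    )


def mock_pos_tag(sentence, tag="NN"):
    """Split on whitespace and give every token the same (mock) POS tag."""
    tokens = []
    offset = 0
    for i, word in enumerate(sentence.split()):
        # Character span of the word in the original string.
        cfrom = sentence.index(word, offset)
        cto = cfrom + len(word)
        offset = cto
        # Token ids start at 1; lattice vertices run from i to i + 1.
        tokens.append(yy_token(i + 1, i, i + 1, cfrom, cto, word, tag))
    return " ".join(tokens)


print(mock_pos_tag("the dog barks"))
```

The printed line is what would be handed to the processor in YY mode in place of the raw sentence.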

Best,
Olga

On Thu, Mar 22, 2018 at 4:11 PM Michael Wayne Goodman <goodmami at uw.edu>
wrote:

> Following Woodley's suggestion, for YY-mode I can point you to a few
> things.
>
> In Jacy, we use POS tags from an external morphological analyzer
> (previously Chasen; recently MeCab). We have a script that takes the output
> of MeCab and transforms it into the YY format. Note the definition of the
> pos_info variable---it holds POS data that is slightly more complex than a
> simple, e.g., NNS or VBG tag.
>
>     https://github.com/delph-in/jacy/blob/develop/utils/jpn2yy
>
> Then see gle.tdl in Jacy, which maps the POS "tags" to generic lexical
> entries:
>
>     https://github.com/delph-in/jacy/blob/develop/gle.tdl
>
> For ACE (and presumably other processors) you might also need to define
> paths to the token info:
>
>     https://github.com/delph-in/jacy/blob/develop/ace/config.tdl#L143-L151
>
> When you call ACE you'll need to tell it to expect YY input. I think it's
> the -y option. There might be some other pieces to this that Woodley or
> Francis can probably fill in for you. In my experiments, YY mode did help a
> bit for getting parses where the standard machinery for unknowns failed.
>
> If you're working in Python, then PyDelphin's 'tokens' module can help
> with constructing YY input. This section of the relevant unit tests might
> be informative:
>
>     https://github.com/delph-in/pydelphin/blob/develop/tests/tokens_test.py#L40-L59
>
> On Thu, Mar 22, 2018 at 3:40 PM, Woodley Packard <sweaglesw at sweaglesw.org>
> wrote:
>
>> Hi Olga,
>>
>> Since, from what I understand, you are primarily interested in a
>> demonstration rather than a real-world system, why not specify the POS
>> tags as part of the input, using YY mode?
>>
>> Woodley
>>
>> On Mar 22, 2018, at 11:42 AM, Olga Zamaraeva <olzama at uw.edu> wrote:
>>
>> Dear developers!
>>
>> I am looking into the problem of handling unknown roots with LKB and ACE
>> in a situation where we want to first be able to analyze the word
>> morphologically (apply lexical rules).
>>
>> I sent an email about this a year ago, and Francis and I
>> actually sat down and went through the process of constructing a minimal
>> example which showed that there was a problem of some sort preventing us
>> from analyzing the word morphologically and using the unknown word handling
>> machinery at the same time.
>>
>> Alas, I cannot recover any record of this. It is possible that we did
>> that on Francis's computer,...
>>
>> Anyway, I want to reconstruct this minimal example one more time, this
>> time hopefully understanding more and producing some actual documentation.
>>
>> I would like to start by recreating what, e.g., the ERG does: treating
>> the words as full-form, relying on a POS tag which maps the word to a
>> specific unknown_type.
>>
>> I have a small grammar to which I added what I was able to identify as
>> relevant in the ERG (generic lexical entries, unknown onset, etc.). I also
>> added mtr.tdl and included it in the script.
>>
>> The next thing I need to understand (I think) is what it actually means
>> to "mock the POS tagger". How do I make the system aware of that
>> information?
>>
>> I can see that the tags can be mapped to the generic lexical entries as
>> described in http://moin.delph-in.net/PetInput. But how do I get the
>> tags in the first place? Suppose I just want to consider everything the
>> same POS, for starters.
>>
>> Thank you!
>> Olga
>>
>>
>
>
> --
> Michael Wayne Goodman
> Ph.D. Candidate, UW Linguistics
>