[developers] mocking a POS-tagger to handle unk words

Fri Mar 23 00:10:49 CET 2018

Following up on Woodley's suggestion, I can point you to a few things for
YY mode.

In Jacy, we use POS tags from an external morphological analyzer
(previously ChaSen; more recently MeCab). We have a script that takes the
output of MeCab and transforms it into the YY format. Note the definition
of the pos_info variable: it holds POS data that is slightly more complex
than a simple tag such as NNS or VBG.

    https://github.com/delph-in/jacy/blob/develop/utils/jpn2yy
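
For the "everything gets the same POS" case you asked about, a mock tagger
can be very small. The following is only a sketch and is not what jpn2yy
does: the helper names (yy_token, mock_tag) are made up for illustration,
and the field layout (id, start, end, <from:to>, path, form, ipos, lrule,
then tag/probability pairs) is my reading of the YY format described on the
PetInput wiki page, so check it against that page before relying on it.

```python
def yy_token(tok_id, start, end, form, pos_tags, cfrom=None, cto=None):
    """Format one token in YY notation (hypothetical helper).

    pos_tags is a list of (tag, probability) pairs; start and end are
    lattice vertices; cfrom and cto are optional character offsets.
    """
    fields = [str(tok_id), str(start), str(end)]
    if cfrom is not None and cto is not None:
        fields.append(f"<{cfrom}:{cto}>")
    fields.append("1")  # path membership: assume a single path
    escaped = form.replace("\\", "\\\\").replace('"', '\\"')
    fields.append(f'"{escaped}"')
    fields.append("0")       # ipos: assumed unused here
    fields.append('"null"')  # no externally supplied lexical rules
    if pos_tags:
        fields.append(" ".join(f'"{tag}" {prob:.4f}' for tag, prob in pos_tags))
    return "(" + ", ".join(fields) + ")"


def mock_tag(words, tag="NN"):
    """Give every word the same POS tag with probability 1.0.

    Assumes the words were separated by single spaces in the input string.
    """
    tokens = []
    offset = 0
    for i, word in enumerate(words):
        cto = offset + len(word)
        tokens.append(yy_token(i + 1, i, i + 1, word, [(tag, 1.0)], offset, cto))
        offset = cto + 1  # skip the separating space
    return " ".join(tokens)


print(mock_tag(["dogs", "bark"]))
```

The tag string ("NN" here) is arbitrary; what matters is that it matches
whatever the grammar maps to a generic lexical entry.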

Then see gle.tdl in Jacy, which maps the POS "tags" to generic lexical
entries:

    https://github.com/delph-in/jacy/blob/develop/gle.tdl

For ACE (and presumably other processors) you might also need to define
paths to the token info:

    https://github.com/delph-in/jacy/blob/develop/ace/config.tdl#L143-L151

When you call ACE you'll need to tell it to expect YY input. I think it's
the -y option. There might be some other pieces to this that Woodley or
Francis can probably fill in for you. In my experiments, YY mode did help a
bit for getting parses where the standard machinery for unknowns failed.
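
If you drive ACE from a script, the call might look roughly like the
sketch below. The function names and the grammar image path are
placeholders, and the -y flag is the one I guessed at above, so please
confirm it against the ACE documentation. The sketch assumes ACE reads one
YY token list per line on standard input.

```python
import subprocess


def ace_yy_command(grammar_image, ace_binary="ace"):
    """Build an ACE command line for YY input (hypothetical helper).

    -g names the compiled grammar image; -y is assumed to switch ACE
    to YY input mode.
    """
    return [ace_binary, "-g", grammar_image, "-y"]


def parse_yy(grammar_image, yy_input):
    """Send one YY token list to ACE and return its raw standard output."""
    result = subprocess.run(
        ace_yy_command(grammar_image),
        input=yy_input + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if ACE exits with an error
    )
    return result.stdout
```

parse_yy needs an ACE binary on the PATH and a compiled grammar image, for
example parse_yy("grammar.dat", yy_line), where both names are placeholders.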

If you're working in Python, then PyDelphin's 'tokens' module can help with
constructing YY input. This section of the relevant unit tests might be
informative:

    https://github.com/delph-in/pydelphin/blob/develop/tests/tokens_test.py#L40-L59

On Thu, Mar 22, 2018 at 3:40 PM, Woodley Packard <sweaglesw at sweaglesw.org>
wrote:

> Hi Olga,
>
> Since you are interested primarily in a demonstration rather than a real
> world system from what I understand, why not specify the POS tags as part
> of the input, using YY mode?
>
> Woodley
>
> On Mar 22, 2018, at 11:42 AM, Olga Zamaraeva <olzama at uw.edu> wrote:
>
> Dear developers!
>
> I am looking into the problem of handling unknown roots with LKB and ACE
> in a situation where we want to first be able to analyze the word
> morphologically (apply lexical rules).
>
> I had already sent an email about that a year ago, and Francis and I
> actually sat down and went through the process of constructing a minimal
> example which showed that there was a problem of some sort preventing us
> from analyzing the word morphologically and using the unknown word handling
> machinery at the same time.
>
> Alas, I cannot recover any record of this. It is possible that we did that
> on Francis's computer,...
>
> Anyway, I want to reconstruct this minimal example one more time, this
> time hopefully understanding more and producing some actual documentation.
>
> I would like to start from recreating what e.g. the ERG does: treating the
> words as full-form, relying on a POS tag which maps the word to a specific
> unknown_type.
>
> I have a small grammar to which I added what I was able to detect as
> relevant in the ERG (generic lexical entries, unknown onset etc). I also
> included mtr.tdl and I included it into the script.
>
> Next thing I need to understand (I think) is what does it actually mean to
> "mock the POS tagger". How do I make the system aware of that information?
>
> I can see that the tags can be mapped to the generic lexical entries as
> described in http://moin.delph-in.net/PetInput. But how do I get the tags
> in the first place? Suppose I just want to consider everything the same
> POS, for starters.
>
> Thank you!
> Olga
>
>

-- 
Michael Wayne Goodman
Ph.D. Candidate, UW Linguistics