[developers] mocking a POS-tagger to handle unk words

Olga Zamaraeva olzama at uw.edu
Fri Mar 23 18:37:59 CET 2018

Thanks very much, Paul, Woodley, and Michael. Michael, thanks especially
for the detailed explanation!

I had not noticed that YY mode has a field for a POS tag. I will try that.
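
For the record, here is a minimal sketch of what "mocking the tagger" could
look like if every token simply gets the same tag. It assumes the YY token
layout described on the PetInput wiki page, (id, start, end, <from:to>, path,
"form", ipos, "lrule", "tag" prob), and whitespace tokenization. The function
name mock_yy_tokens and the default tag "NN" are my own placeholders, not
anything from the ERG or Jacy; the tag would have to be one that the grammar
maps to a generic lexical entry.

```python
def mock_yy_tokens(sentence, tag="NN"):
    """Build a YY-mode token string, giving every token the same POS tag.

    Assumption: tokens are whitespace-separated and contain no double
    quotes (no escaping is attempted in this sketch).
    """
    tokens = []
    search_from = 0
    for i, form in enumerate(sentence.split()):
        # Character span of this token in the original string.
        cfrom = sentence.index(form, search_from)
        cto = cfrom + len(form)
        search_from = cto
        # Fields: id, start vertex, end vertex, <from:to>, path,
        # "form", ipos, "lrule" ("null" means no rule), "tag" probability.
        tokens.append(
            '({}, {}, {}, <{}:{}>, 1, "{}", 0, "null", "{}" 1.0000)'.format(
                i + 1, i, i + 1, cfrom, cto, form, tag
            )
        )
    return " ".join(tokens)


if __name__ == "__main__":
    # One line of YY input per sentence.
    print(mock_yy_tokens("the dog barks"))
```

The printed line is what I would then feed to the parser in YY mode (the ACE
option Michael mentions below) instead of the raw sentence.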


On Thu, Mar 22, 2018 at 4:11 PM Michael Wayne Goodman <goodmami at uw.edu>
wrote:

> Following Woodley's suggestion, for YY-mode I can point you to a few
> things.
> In Jacy, we use POS tags from an external morphological analyzer
> (previously ChaSen; recently MeCab). We have a script that takes the output
> of MeCab and transforms it into the YY format. Note the definition of the
> pos_info variable---it holds POS data that is slightly more complex than a
> simple, e.g., NNS or VBG tag.
>     https://github.com/delph-in/jacy/blob/develop/utils/jpn2yy
> Then see gle.tdl in Jacy, which maps the POS "tags" to generic lexical
> entries:
>     https://github.com/delph-in/jacy/blob/develop/gle.tdl
> For ACE (and presumably other processors) you might also need to define
> paths to the token info:
>     https://github.com/delph-in/jacy/blob/develop/ace/config.tdl#L143-L151
> When you call ACE you'll need to tell it to expect YY input. I think it's
> the -y option. There might be some other pieces to this that Woodley or
> Francis can probably fill in for you. In my experiments, YY mode did help a
> bit for getting parses where the standard machinery for unknowns failed.
> If you're working in Python, then PyDelphin's 'tokens' module can help
> with constructing YY input. This section of the relevant unit tests might
> be informative:
> https://github.com/delph-in/pydelphin/blob/develop/tests/tokens_test.py#L40-L59
> On Thu, Mar 22, 2018 at 3:40 PM, Woodley Packard <sweaglesw at sweaglesw.org>
> wrote:
>> Hi Olga,
>> Since, as I understand it, you are interested primarily in a demonstration
>> rather than a real-world system, why not specify the POS tags as part
>> of the input, using YY mode?
>> Woodley
>> On Mar 22, 2018, at 11:42 AM, Olga Zamaraeva <olzama at uw.edu> wrote:
>> Dear developers!
>> I am looking into the problem of handling unknown roots with LKB and ACE
>> in a situation where we want to first be able to analyze the word
>> morphologically (apply lexical rules).
>> I sent an email about this a year ago, and Francis and I
>> actually sat down and went through the process of constructing a minimal
>> example which showed that there was a problem of some sort preventing us
>> from analyzing the word morphologically and using the unknown word handling
>> machinery at the same time.
>> Alas, I cannot recover any record of this. It is possible that we did
>> that on Francis's computer,...
>> Anyway, I want to reconstruct this minimal example one more time, this
>> time hopefully understanding more and producing some actual documentation.
>> I would like to start by recreating what e.g. the ERG does: treating
>> the words as full-form, relying on a POS tag which maps the word to a
>> specific unknown_type.
>> I have a small grammar to which I added what I was able to identify as
>> relevant in the ERG (generic lexical entries, unknown onset, etc.). I also
>> added mtr.tdl and included it in the script.
>> The next thing I need to understand (I think) is what it actually means
>> to "mock the POS tagger". How do I make the system aware of that
>> information?
>> I can see that the tags can be mapped to the generic lexical entries as
>> described in http://moin.delph-in.net/PetInput. But how do I get the
>> tags in the first place? Suppose I just want to consider everything the
>> same POS, for starters.
>> Thank you!
>> Olga
> --
> Michael Wayne Goodman
> Ph.D. Candidate, UW Linguistics