[developers] Question on using PET/ACE for parsing

Johnny Wei jwei at umass.edu
Tue Jun 12 18:35:47 CEST 2018


Thank you for your response! Sorry, but just to be clear: my language model
was trained on the WikiWoods corpus, and I used the leaf nodes of the
derivation to decide how a word would appear in the LM's training data. The
rules were as follows:

1. If the word has a native entry, the word's original form was kept.
2. If the word was recognized as a POS generic, the part of speech tag was
used to "unk" the word ("JJ_u_unknown" etc.).
3. Otherwise, the word was replaced with its class attribute (these include
"generic_proper_ne", "generic_card_ne", etc.).

To my understanding, this encompasses all of the unknown word handling in
the ERG. A sample from my language model might look like "generic_proper_ne
had VBP_u_unknown a cat ." I want to see whether these sequences can be
parsed by the ERG.
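The three rules above can be sketched as a small function; this is only an
illustration of the mapping as described, and the names `native_lexicon`,
`pos_generic`, and `class_attr` are hypothetical stand-ins for information
read from the ERG lexicon and the derivation's leaf nodes:

```python
def unk_token(word, native_lexicon, pos_generic, class_attr):
    """Map a derivation leaf to its form in the LM training data.

    native_lexicon: set of surface forms with native lexical entries
    pos_generic:    e.g. "JJ_u_unknown" if the word was a POS generic, else None
    class_attr:     fallback class attribute, e.g. "generic_proper_ne"
    """
    if word in native_lexicon:    # rule 1: native entry -> keep the original form
        return word
    if pos_generic is not None:   # rule 2: POS generic -> "unk" with the tag
        return pos_generic
    return class_attr             # rule 3: otherwise use the class attribute

# Example: "Pierre" has no native entry and no POS generic analysis
print(unk_token("cat", {"cat", "had", "a"}, None, None))
# -> cat
print(unk_token("Pierre", {"cat"}, None, "generic_proper_ne"))
# -> generic_proper_ne
```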

One way I thought to proceed would be to add a lexical rule to the grammar
that parses the surface form "generic_proper_ne" into the generic_proper_ne
entry. Would this be the easiest way? What would be the easiest way to make
such a change? I would really appreciate some pointers, thank you!

--
Johnny Wei

On Jun 11, 2018, at 6:39 PM, Michael Wayne Goodman <goodmami at uw.edu> wrote:

Hello developers,

See the forwarded message below for a question about parsing unknowns using
the ERG, asked by Johnny Wei at the University of Massachusetts, Amherst.

Johnny: others on the list are more qualified to talk about parsing unknown
tokens using ACE or PET with the ERG, but I'll attempt a response:

Parsing unknowns with DELPH-IN grammars is generally the task of matching
tokens that couldn't be analyzed to a defined lexical entry (i.e., "lexical
gaps") to some generic lexical entry. To avoid the explosion of ambiguity
caused by attempting every generic lexical entry for every gap, filters are
used to block some generic entries. One such filter, which it seems you are
aware of, uses the TNT POS tags assigned to the unknown token. These tags
can be assigned by a trained POS tagger that PET or ACE runs during
parsing, or they can be passed in via structured input to the parser (e.g.,
"yy-tokens"). In both of these cases, the POS tag is
paired with the input token. What your language model is outputting looks
like predicate symbols, and I'm not sure how to use those to directly
influence the parser, but others on this list might. Also see this wiki
page for more information: http://moin.delph-in.net/PetInput
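To make the "structured input" option concrete, here is a rough sketch of
building YY-format tokens, where each surface token is paired with one or
more POS tags and probabilities. The field layout follows the PetInput wiki
page linked above, but treat the exact format as something to verify there;
the helper name and example sentence are purely illustrative:

```python
def yy_token(tid, start, end, cfrom, cto, form, tags):
    """Render one YY-format token.

    tid:          token id
    start, end:   vertex positions in the token lattice
    cfrom, cto:   character span in the original input
    form:         surface form of the token
    tags:         list of (pos_tag, probability) pairs
    """
    tag_str = " ".join('"%s" %.4f' % (tag, prob) for tag, prob in tags)
    return '(%d, %d, %d, <%d:%d>, 1, "%s", 0, "null", %s)' % (
        tid, start, end, cfrom, cto, form, tag_str)

# An unknown proper name tagged NNP with full confidence:
print(yy_token(1, 0, 1, 0, 6, "Pierre", [("NNP", 1.0)]))
# -> (1, 0, 1, <0:6>, 1, "Pierre", 0, "null", "NNP" 1.0000)
```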

There are also other methods of robust parsing, such as a PCFG backoff
("csaw"), but maybe these are not what you're looking for right now.

Also note that, in addition to PET and ACE, the LKB system can parse with
DELPH-IN grammars, and it has somewhat more robust support for unknown
tokens (e.g., regarding morphological inflection of unknowns), although its
Lisp-based implementation can make it tricky to interface with external
programs, and it tends to run a bit slower than the so-called "efficient
implementations" (though work is under way to improve the Lisp code's
performance).

I hope this helps!

---------- Forwarded message ----------
From: Michael Wayne Goodman <goodman.m.w at gmail.com>
Date: Mon, Jun 11, 2018 at 1:39 PM
Subject: Fwd: Question on using PET/ACE for parsing
To: goodmami at uw.edu

---------- Forwarded message ----------
From: Johnny Wei <jwei at umass.edu>
Subject: Question on using PET/ACE for parsing
Date: Mon, 11 Jun 2018 14:46:45 -0400
To: goodman.m.w at gmail.com

Dear Michael,

My name is Johnny Wei, an undergraduate from the University of
Massachusetts, Amherst. Deep grammars are very interesting to me, and I am
looking to use the ERG with PET/ACE for parsing language model output. I
have a few questions on parsing I was wondering whether you could answer.
The questions are below and I really would appreciate your help!

To my understanding, there are two ways the ERG handles unknown words:
using TNT POS tags and certain regex matching for classes. Is this correct?
The way I have my language model set up, it can generate a "JJ_u_unknown"
or "_generic_proper_ne" for each of the unknown word classes. To parse,
what would be the easiest way to proceed? For some of the generic classes,
I have been able to replace them with some word, such as card_ne -> 9, but
I do not know of an easy way to incorporate the part-of-speech unknown
words.

Again, I really appreciate your help. If anything is not clear please let
me know, thanks!

-- 
Johnny Wei



-- 
Michael Wayne Goodman