[developers] New ERG with improved tokenization/preprocessing for PET

Stephan Oepen stephan.oepen at gmail.com
Sat Jul 18 12:23:28 CEST 2009


howdy,

i was expecting the topic to (also) come up naturally on wednesday  
morning.  maybe we can see which specific questions we want to address  
related to paraphrasing, and which ones we leave to chart mapping and  
pre-processing.  and maybe i'll even manage to summarize before then  
what we concluded for this forthcoming ERG release from the earlier  
discussion ...

i don't suppose you noticed that generation involving unknown words  
from parsing now works (in the original paraphrase setup, i.e. /not/ EnEn)?

see you all next week, oe



On Jul 18, 2009, at 11:10 AM, Francis Bond <bond at ieee.org> wrote:

> G'day,
>
> I want to discuss this further at Barcelona, but am not quite sure
> where it should be.  If no one else lays claim to it, I will discuss
> this during the paraphrasing session.
>
> Just to avoid overlap, is any one else planning to talk about the MRS
> relation name for unknown predicates?
>
> Francis
>
> ---------- Forwarded message ----------
> From: Francis Bond <fcbond at gmail.com>
> Date: 2009/5/22
> Subject: Re: [developers] New ERG with improved
> tokenization/preprocessing for PET
> To: oe at ifi.uio.no
> Cc: danf at stanford.edu, erg at delph-in.net, gisley at ifi.uio.no,
> developers at delph-in.net, bec.dridan at gmail.com
>
>
> G'day,
>
> jumping in late, sorry if there is some overlap.
>
> [snip]
>
>> in my view, it is an error to expect the (current) parsing outputs to
>> always be valid generator inputs.  `named_unk_rel' should not be used
>> in inputs to generation (a provider of an input semantics should need
>> no knowledge of which names exist in the ERG lexicon).
>
> In my view, the goal should be for the parser to produce valid
> semantics for as wide a range of input as possible, and for the
> generator to generate from as wide a range of semantics as possible.
> Of course, I realize that this isn't always achievable :-).
>
>> with the revised treatment of NEs and unknown words in parsing, there
>> are many more inputs that parse, but the semantics assigned to unknown
>> words is often `incomplete' (or `internal', or `not quite right'), and
>> hence i think one would have to add an MRS post-processing step before
>> trying to feed these MRSs back into the generator.
>
> The semantics assigned to unknown words is potentially input for MT,
> Sciborg, QA, and ontology extraction, as well as generation.
>
>> names are probably not the most interesting example, as one might ask
>> why `Frodo' should end up with a different predicate than `Abrams' in
>> parsing.  at present, the `named_unk_rel' is an attempt at marking the
>> fact that `Frodo' was parsed as an unknown name in the semantics.  for
>> all i see, the underlying generic lexical entry could just as well use
>> `named_rel' instead.
>
> I do so ask:  does anyone have any special motivation for using a
> different predicate in the semantics here?
>
>> more interesting, however, are unknown nouns, verbs, etc.  looking at
>> the example towards the bottom of
>>
>>  http://wiki.delph-in.net/moin/PetInput
>>
>> the current goal is for an unknown verb like `bazed' (as recognized by
>> the parser by virtue of its PoS tag: VBD) to introduce a predicate like
>> "_bazed_vbd_rel", i.e. just the concatenation of the token surface form
>> and the PoS tag.  the reasoning is that the grammar has no knowledge of
>> the actual stem (which could be `baz' or `baze' in this example; with a
>> doubled consonant, there would be three alternatives), and therefore we
>> `short-circuit' morphology: most PoS-based generic lexical entries are
>> already inflected, i.e. words rather than lexemes.  this is really all
>> the parser can do at this point (without introducing silly ambiguity).
>>
>> using the predicate "_bazed_vbd_rel" preserves all information provided
>> by the tagger and grammar for downstream processing.  my expectation is
>> that this `incomplete' MRS should be post-processed after parsing, such
>> that the predicate can be rewritten to "_baze_v_?_rel", or whatever one
>> deems appropriate in terms of the external interface.
>>
>> for the paraphrasing setup, such rewriting can be part of the transfer
>> grammar that is invoked after parsing and prior to generation.  i will
>> aim to at least provide a first shot at predicate normalization in the
>> forthcoming 0907 ERG release.
>
> Adding an extra transfer grammar is feasible for paraphrasing, but may
> be a considerable burden for other tasks.  It also has the potential
> to separate part of the unknown word handling from the grammar
> itself, which would then make it prone to boundary friction as one
> component changes but the other doesn't.
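>
> For concreteness, the rewrite itself looks almost trivial to state in
> a few lines of Python (a rough sketch on my part; the tag table and
> the stemming guess are illustrative assumptions, not the proposed
> 0907 rules):
>
>     import re
>
>     # coarse mapping from (lowercased) PTB tags to predicate classes;
>     # illustrative, not exhaustive
>     TAG_CLASS = {"vb": "v", "vbd": "v", "vbg": "v", "vbn": "v",
>                  "vbp": "v", "vbz": "v",
>                  "nn": "n", "nns": "n", "jj": "a"}
>
>     def normalize(pred):
>         """rewrite e.g. '_bazed_vbd_rel' to '_baz_v_?_rel'; note the
>         stem guess silently picks `baz' over `baze', which is exactly
>         the ambiguity under discussion."""
>         m = re.match(r"^_([^_]+)_([a-z]+)_rel$", pred)
>         if m is None or m.group(2) not in TAG_CLASS:
>             return pred                  # not a synthesized generic
>         form, tag = m.groups()
>         if tag.startswith("vb") and form.endswith("ed"):
>             form = form[:-2]             # crude: bazed -> baz
>         return "_%s_%s_?_rel" % (form, TAG_CLASS[tag])
>
>     print(normalize("_bazed_vbd_rel"))   # -> _baz_v_?_rel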
>
>>> Did unknown word generation not make it into the mainstream?  If so,
>>> is there a branch that has it?
>>
>> i believe your joint experimentation with dan on generating `unknown'
>> nouns and verbs was after 0902 but is reflected in the current `trunk'.
>> but before making that functionality part of the forthcoming release, i
>> was hoping to have a little more discussion about what we actually want
>> in terms of generation with generic lexical entries.
>
> And your wish is being granted.
>
>> i guess there is consensus on the various NE classes listed above,
>> i.e. where there is a CARG corresponding to surface form we will
>> continue to support that (names, numbers, dates, and such).
>
> Good.  The new chart mapping stuff also makes some classes work that
> had been problematic, e.g. some of the date ersatz (like 1967).
>
>> as for generating nouns or verbs that are not in the lexicon, i see no
>> point in trying to support "_bazed_vbd_rel" as a valid generator input.
>
> I agree.  I think that "_bazed_vbd_rel" should not be the
> representation for unknown words.
>
>> we might be able, however, to support "_baze_v_?_rel", but then i think
>> we need to say a little more about how we can map unknown predicates to
>> generic lexical entries; and on how to relate the predicate and surface
>> form to be generated.  in generation, i think, inflection is determined
>> by variable properties; hence, generic entries for generation will need
>> to be different from those used in parsing.
>
> I don't understand the last sentence.
>
>> further, if we assume that the grammar can provide generics (in
>> generation) for a limited subset of argument frames for verbs
>> (intransitive and simple transitive, say), nouns (mass or count,
>> non-relational), and adjectives (intersective, non-relational), then
>> the generator should check the complete EP giving rise to a generic
>> for compatibility.  for example, an input MRS with an instantiated
>> ARG1 on an unknown noun should be rejected, in my view.
>
> I think it is reasonable to expect that the unknown word handling will
> not correctly handle all cases, and may reject some. For Jacy, and I
> think for the ERG as well, our policy is to add idiosyncratic words
> (in inflection, syntax or semantics) to the grammar/lexicon as far as
> we can.  I think we should do this even for relatively rare words.
> The expectation can then be that the remaining unknown words are
> fairly regular in their behavior.
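>
> Your ARG1 example also seems cheap to state as a check (a minimal
> sketch; the EP encoding here is a made-up stand-in, not an actual
> MRS API):
>
>     # roles a generic, non-relational noun entry could plausibly
>     # realize; any instantiated role beyond ARG0 blocks the generic
>     GENERIC_NOUN_ROLES = {"ARG0"}
>
>     def licensed_by_generic_noun(ep):
>         """ep: role-to-value dict, e.g. {'ARG0': 'x5', 'ARG1': 'x8'}."""
>         return all(role in GENERIC_NOUN_ROLES for role in ep)
>
>     print(licensed_by_generic_noun({"ARG0": "x5"}))                # True
>     print(licensed_by_generic_noun({"ARG0": "x5", "ARG1": "x8"}))  # False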
>
>> an extra layer of `generics' trigger rules would likely be adequate to
>> capture the correspondence relation between `unknown' EPs and generics
>> available to the generator.  it would still have to be combined with a
>> convention of how to determine the surface form: /^_([^_]+).*_rel$/ -->
>> \1 would seem like a plausible start, i think.
>
> Indeed.
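>
> Spelled out in Python, say (note that on a predicate like
> "_bazed_vbd_rel" the first underscore-delimited field is the surface
> form):
>
>     import re
>
>     def surface_form(pred):
>         m = re.match(r"^_([^_]+).*_rel$", pred)
>         return m.group(1) if m else None
>
>     print(surface_form("_bazed_vbd_rel"))   # -> bazed
>     print(surface_form("_baze_v_?_rel"))    # -> baze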
>
>> --- i would be grateful for comments on these thoughts, especially from
>> ann, dan, and francis.  i believe i could implement a first shot at the
>> setup sketched here for the 0907 ERG release.  however, to decide where
>> to use my time, it would also be helpful to know who actually makes use
>> of the ERG paraphrasing setup in current projects?
>
> I am experimenting with it, aided by an intern from Georgia Tech
> (Darren).  The MT group at NICT has asked me to make the paraphraser
> available for them, as they think it will help in the SMT
> competitions.  We have also had a request for the paraphraser from a
> group in India.
>
> Eric and I, and more recently Sanghoun, also want to use the unknown
> word handling in our MT systems.  Better parsing and generation of the
> most common classes of unknown words will make an immediate difference
> to the quality of our systems.
>
> Now I will try and cover some of the later discussion:
>
> I agree completely with Ann that we should keep to the agreed-upon
> RMRS standard for predicate names.  I can also think of no potential
> user of the grammar/parser who would want the surface form rather
> than the base form here.  Any link to an external resource
> (dictionary, ontology, transfer, generator) will need the base form.
>
>> well, in our setup input to parsing is PoS-tagged but /not/ lemmatized.
>> so `bazed' and `VBD' really are the only pieces of information that we
>> have available when synthesizing the initial PRED values for generics.
>> even classifying the various tags into _v_, _n_, and _a_ would require
>> multiplying out token mapping rules, i.e. come at a small extra cost.
>
> I realize that determining the base form is non-trivial.
>
> My preference would be to get it from the morphological analyzer.  In
> the current setup tnt does not provide the lemmatized form, but other
> setups are possible.  We could (a) add a wrapper with morpha or
> equivalent or (b) switch to a tagger that does provide the base form.
> The taggers for Japanese and Spanish do provide the base form.  In
> the actual parse, each entry is associated with a lexical type --- I
> assumed that this gives enough information to decide _v_, _n_, or _a_.
> I would be very impressed to find that the ERG has super-types for
> these classes used in parsing.  But perhaps I am missing something?
> Could you (or Dan) give some examples where classifying the various
> tags into _v_, _n_, and _a_ would require multiplying out token
> mapping rules?  If it is just a few rules, then I think this cost
> would be dwarfed by the cost of a separate post-processing transfer
> grammar.
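>
> Option (a) could be a very thin layer, something like this (a sketch;
> `lemmatize' stands in for morpha or an equivalent tool, and the
> interface is my assumption):
>
>     def lemmatize(form, tag):
>         # placeholder for morpha or equivalent; see the rule-based
>         # sketch further down for what a cheap version might do
>         return form
>
>     def wrap_tagger_output(tagged):
>         """tagged: (surface, PTB tag) pairs from e.g. tnt; hand the
>         parser a base form alongside each token."""
>         return [(form, tag, lemmatize(form, tag)) for form, tag in tagged]
>
>     print(wrap_tagger_output([("Kim", "NNP"), ("bazed", "VBD")]))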
>
>> i think an actual solution would require a tool like morpha (which is
>> part of pre-processing in RASP, i believe), adapted for PTB tags and
>> american english.  one could argue this /should/ be part of our input
>> pre-processing prior to parsing, but that is not an option right now.
>
> Why is it not an option?  I thought Tim has something like this
> already.  If not, we could make one, or maybe see if the RASP project
> has one squirreled away somewhere.  If we can't find anything better,
> I volunteer to make one: under the assumption that irregular cases
> should go in the ERG proper, I believe it would be reasonably cheap to
> build.
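>
> Something along these lines, say (illustrative regular rules only,
> under the stated assumption that irregular forms live in the lexicon):
>
>     import re
>
>     def guess_lemma(form, tag):
>         """best-guess base form for regular English inflection."""
>         form = form.lower()
>         if tag in ("VBD", "VBN"):
>             if form.endswith("ied"):
>                 return form[:-3] + "y"     # tried -> try
>             if re.search(r"([bdfglmnprt])\1ed$", form):
>                 return form[:-3]           # banned -> ban
>             if form.endswith("ed"):
>                 return form[:-2]           # bazed -> baz (or baze!)
>         elif tag == "VBZ":
>             if form.endswith("ies"):
>                 return form[:-3] + "y"     # tries -> try
>             if form.endswith("s"):
>                 return form[:-1]           # bazes -> baze
>         elif tag == "NNS" and form.endswith("s"):
>             return form[:-1]               # dogs -> dog
>         return form
>
>     for form, tag in [("bazed", "VBD"), ("banned", "VBD"),
>                       ("tries", "VBZ")]:
>         print(form, tag, "->", guess_lemma(form, tag))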
>
>> and in principle there can be lemmatization ambiguity, which (for the
>> cases discussed here) has no bearing on parsing; thus it is desirable
>> to defer such ambiguity until as late as possible (late commitment),
>> much like i would expect to do (the bulk of) WSD /after/ parsing.
>
> There is a cost to adding lemmatization ambiguity.  Leaving it
> unspecified is a possibility, but I think we should keep in mind that
> almost any potential user of the grammar is going to have to
> disambiguate at some stage; deferring the ambiguity just moves the
> cost elsewhere, it does not eliminate it.
>
> I think there are other reasonably principled alternatives: one is to
> say that unknown words are relatively rare, and we can afford to add
> in some ambiguity.  Another would be to say that unknown words are
> likely to be regular and just choose a best guess.  If it turns out
> that we are wrong and a word is important, then it gets added to the
> lexicon.  Neither of these is a perfect solution, but I consider both
> of them better than doing nothing.
>
>> did you try yourself?  you underestimate the complexity of this issue:
>> the ERG orthographemic rules hypothesize three candidate stems: `ban',
>> `bann', and `banne'.  without a lexicon, i believe all are justified.
>
> My solution would be to add as many of the verbs that double their
> final consonant to the ERG as we can, and then only run the regular
> rule :-).
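>
> For what it is worth, the three candidates fall out mechanically (a
> sketch of the ambiguity itself, not of the actual ERG orthographemic
> machinery):
>
>     import re
>
>     def candidate_stems(past):
>         """candidate stems for a regular '-ed' past tense form."""
>         stems = []
>         if past.endswith("ed"):
>             stems.append(past[:-2])        # bann + ed
>             stems.append(past[:-1])        # banne + d  (cf. bake/baked)
>             if re.search(r"([b-df-hj-np-tv-z])\1ed$", past):
>                 stems.append(past[:-3])    # ban + consonant doubling
>         return stems
>
>     print(candidate_stems("banned"))   # -> ['bann', 'banne', 'ban']
>     print(candidate_stems("bazed"))    # -> ['baz', 'baze']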
>
>> we do not want this (silly) ambiguity in parsing, nor would i be keen
>> on putting an approximate (procedural) solution (restricted to english)
>> into the parser (PET, in our case).  thus i still think post-processing
>> is the best we can do (as long as parser inputs are not lemmatized).
>
> I see the attraction of keeping a language-specific approximate
> solution out of the parser.  I think it should therefore go into the
> pre-processor: i.e. we should lemmatize the parser inputs.  If this
> isn't really feasible, then an approach that reuses the grammar's
> existing morphological rules (like Ann outlined) would go some way to
> solving the problem of language-specificity.  The grammar writer then
> becomes responsible for either (a) mapping the tags to inflectional
> rules and providing a word list or (b) providing a tagger that
> lemmatizes.  Whatever is decided for English, I would really like
> (b) to be available for Japanese, as I would prefer not to have to
> redo what ChaSen (or Juman) give me already.
>
> Anyway, I am really glad that you brought this up so we could discuss
> it before things get set in stone.  Can I encourage other people with
> a potential interest to chip in?  Berthold?  Montse?  Antonio?
> Ulrich --- I would think unknown word predicate naming is important
> for the HoG too :-)
>
> Yours,
>
> --
> Francis Bond <http://www2.nict.go.jp/x/x161/en/member/bond/>
> NICT Language Infrastructure Group
>
>
>
> -- 
> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University



