[developers] developers Digest, Vol 66, Issue 3

António Branco Antonio.Branco at di.fc.ul.pt
Thu Jul 8 15:33:56 CEST 2010





Hi Stephan,

Stephan Oepen wrote:
> hi antonio et al.,
> 
> there are really two aspects to be considered here: (a) the surface 
> format used in handing PET an annotated token lattice, and (b) the 
> details of how PET interprets various parts of its input.  in terms of 
> current code consolidation plans, (b) is probably more important.  there 
> is a fair amount of special-purpose, procedural machinery in this area 
> that we would like to eliminate.  the ideal outcome, in my view, is to 
> reduce the set of properties on input tokens that PET needs to know 
> about to three: a string for lexical lookup, a binary flag indicating 
> internal vs. external morphological analysis, and a list of identifiers 
> (of orthographemic rules), in the case of external morphology.  
> everything else (e.g. characterization, use of annotation for unknown 
> word handling, custom CARGs or synthesized PREDs) can be done in the 
> grammar, i believe.  this worked out well for the ERG.
> 
> so the main goal really should be interface simplification, and we 
> should ask: how far is the portuguese grammar removed from the above 
> ideal?  

Most likely as far removed as a grammar can possibly be, since
everything that is "below/before" configurational syntax (and even
bits of it, if one includes NER in this realm) is obtained
outside/before the grammar, and this is what goes into our input
(PIC) representation:
- surface form
- lemma
- inflection features
- ner

Also, a key reason for using PIC is the ability it provides
to constrain the POS tag of input words that are nevertheless
(ambiguously) known to the grammar/lexicon.
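Roughly, a PIC token entry carrying these four kinds of annotation looks
something like the sketch below. (This is illustrative and from memory:
element and attribute names are not guaranteed to match the actual
pic.dtd, and the Portuguese token is invented.)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only; names approximate, not a verbatim pic.dtd excerpt. -->
<pet-input-chart>
  <!-- one token: surface form, POS constraint, lemma, inflection features -->
  <w id="W1" cstart="0" cend="6">
    <surface>cantou</surface>
    <pos tag="V" prio="1.0"/>          <!-- constrains lexical lookup to verbal entries -->
    <typeinfo id="T1">
      <stem>cantar</stem>              <!-- lemma from the morphological analyzer -->
      <infl name="ind-pret-3sg"/>      <!-- inflection features -->
    </typeinfo>
  </w>
  <!-- a multi-token named entity supplied by the NER component -->
  <ne id="N1">
    <ref dtr="W2"/>
    <ref dtr="W3"/>
    <pos tag="PROPER-NAME" prio="1.0"/>
  </ne>
</pet-input-chart>
```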


> and how much effort would be involved in making the transition?  

Given our funding conditions: unaffordable.

Given our medium- to long-term goal, i.e. to get as fast as possible
to a set of materials for Portuguese with the size and maturity of
those existing for English: redoing what took us several years
is also unaffordable.


> i understand you are about to experiment with migrating to token mapping 
> for independent reasons, 

not necessarily: we are living (for almost one year now) with our
patch for the HCONS problem in MRSs and waiting for a principled
solution that permits just removing that patch

best,

--Ant.



> and i wholeheartedly believe you stand to gain 
> from these revisions.  in doing so, are there interface aspects that you 
> believe cannot be accommodated within the assumptions of my ‘pure’ vision 
> above?
> 
> best wishes, oe
> 
> 
> 
> 
> 
> On 8. juli 2010, at 12.17, António Branco <Antonio.Branco at di.fc.ul.pt> 
> wrote:
> 
>>
>>
>>
>> Dear Uli,
>>
>> Please note that in the case of the Portuguese grammar,
>> the interfacing of the deep grammar with our pre-processing
>> tools (POS tagger, lemmatizer, morphological analyzer,
>> NER) is done exclusively via PIC, so a PET without
>> PIC would rule out running our grammar
>> on it.
>>
>> All the best,
>>
>>
>> --Ant.
>>
>>
>> P.S.: Francisco has already sent to Bernd a detailed description
>> of the problem with PET we reported in Paris, together with
>> our grammar so that you will be able to reproduce it on your side.
>> For the record, he'll be submitting a ticket
>> in the PET bug tracker as well.
>>
>>
>>
>>
>> developers-request at emmtee.net wrote:
>> ----------
>>> Message: 1
>>> Date: Wed, 07 Jul 2010 18:34:40 +0200
>>> From: Ulrich Schaefer <ulrich.schaefer at dfki.de>
>>> Subject: Re: [developers] [pet] [delph-in] Poll to identify actively
>>>    used functionality in PET
>>> To: pet at delph-in.net, developers at delph-in.net
>>> Message-ID: <4C34ACA0.4050304 at dfki.de>
>>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>> From my / Heart of Gold's point of view, it would be OK to have a PET 
>>> without PIC and SMAF. REPP isn't used in any of the hybrid workflows 
>>> AFAIR.
>>> It implies that we will probably lose integrations for Italian and 
>>> Norwegian (no harm as the grammars are no longer actively developed I 
>>> guess) and maybe also Greek, but we would gain cleaner and uniform 
>>> configuration settings, hopefully.
>>> I hope we will also be able to integrate Spanish with FreeLing via 
>>> FSC soon.
>>> I would also like to replace the current stdin/stderr communication 
>>> in PetModule by XML-RPC as soon as possible.
>>> -Uli
>>> Am 07.07.2010 12:56, schrieb Stephan Oepen:
>>>> i guess you're asking about FSR (aka FSPP) and REPP?  the latter now 
>>>> supersedes FSPP in the LKB, and for all i know the existing FSPP 
>>>> support in PET (based on ECL) is not Unicode-enabled and builds on 
>>>> the deprecated SMAF.  hence, no practical loss purging that from PET 
>>>> now, i'd think?  REPP, on the other hand, should be natively 
>>>> supported in PET, in my view.  i seem to recall that you had a C++ 
>>>> implementation of REPP?  woodley has a C implementation.  maybe 
>>>> sometime this fall we could jointly look at the choices (and 
>>>> remaining limitations: i believe none of the existing 
>>>> implementations is perfect in terms of characterization corner 
>>>> cases), and then add native REPP support to PET?
>>>>
>>>> as for FSC, there is pretty good documentation on the wiki now, and 
>>>> it seems the format is reasonably stable.  i am inclined to preserve 
>>>> YY format, as the non-XML alternative to inputting a PoS-annotated 
>>>> token lattice.
>>>>
>>>> finally, i see your point about efficiency losses in 
>>>> -default-les=all mode when combined with a very large number of 
>>>> generics (i.e. one per LE type); personally, i'd think lexical 
>>>> instantiation can be optimized to alleviate these concerns.  i 
>>>> personally find the limitations in the old generics mode so severe 
>>>> that i can't imagine going back to that mode.  but if there were 
>>>> active users who'd be badly affected by its removal prior to 
>>>> optimizing -default-les=all further, i have no opinion on when best 
>>>> to ditch the old mode.
>>>>
>>>> best, oe
>>>>
>>>>
>>>>
>>>> On 7. juli 2010, at 02.03, Rebecca Dridan <bec.dridan at gmail.com> wrote:
>>>>
>>>>> I couldn't attend the PetRoadMap discussion - is there any summary 
>>>>> of the discussion, or at least what decisions were made on the wiki?
>>>>>
>>>>>> Input formats we'd like to discard:
>>>>>>
>>>>>> - pic / pic_counts
>>>>>> - yy_counts
>>>>>> - smaf
>>>>>> - fsr
>>>>>>
>>>>> Particularly, what is the plan for inputs? FSC seemed to do 
>>>>> everything I had needed from PIC, but at the time it was 
>>>>> undocumented, experimental code. Will FSC be the default input 
>>>>> format when annotation beyond POS tags is needed?
>>>>>
>>>>>> -default-les=traditional  determine default les by posmapping for all
>>>>>>                         lexical gaps
>>>>> Does this mean that we can either hypothesise every generic entry 
>>>>> for every token (and then filter them), or not use generic entries 
>>>>> at all? I found this to be a major efficiency issue when large 
>>>>> numbers of generic entries were available. I don't have a problem 
>>>>> with defaulting to the current "all" setting, but I think there are 
>>>>> still possible configurations where one would like to react only 
>>>>> when lexical gaps were found.
>>>>>
>>>>>> Because these are the only modules that require the inclusion of ECL,
>>>>>> support for ECL in PET will also be removed.
>>>>> I celebrate the removal of ECL, but will there be any way of doing 
>>>>> more than white space tokenisation natively in PET, or was the 
>>>>> decision made that PET will always be run in conjunction with an 
>>>>> LKB pre-processing step?
>>>>>
>>>>> Rebecca
>>>>>





More information about the developers mailing list