[developers] developers Digest, Vol 66, Issue 3
Antonio Branco
Antonio.Branco at di.fc.ul.pt
Thu Jul 8 15:33:56 CEST 2010
Hi Stephan,
Stephan Oepen wrote:
> hi antonio et al.,
>
> there are really two aspects to be considered here: (a) the surface
> format used in handing PET an annotated token lattice, and (b) the
> details of how PET interprets various parts of its input. in terms of
> current code consolidation plans, (b) is probably more important. there
> is a fair amount of special-purpose, procedural machinery in this area
> that we would like to eliminate. the ideal outcome, in my view, is to
> reduce the set of properties on input tokens that PET needs to know
> about to three: a string for lexical lookup, a binary flag indicating
> internal vs. external morphological analysis, and a list of identifiers
> (of orthographemic rules), in the case of external morphology.
> everything else (e.g. characterization, use of annotation for unknown
> word handling, custom CARGs or synthesized PREDs) can be done in the
> grammar, i believe. this worked out well for the ERG.
>
> so the main goal really should be interface simplification, and we
> should ask: how far is the portuguese grammar removed from the above
> ideal?
Most likely as far removed as a grammar can possibly be, since
everything that is "below/before" configurational syntax (and even
bits of it, if one includes NER in this realm) is obtained
outside/before the grammar, and this is what goes into our input
(PIC) representation:
- surface form
- lemma
- inflection features
- NER
A key reason for using PIC is also the ability it provides to
constrain the POS tags of input words that are nevertheless
(ambiguously) known to the grammar/lexicon.
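To make the contrast concrete, here is a minimal sketch (in Python, with purely illustrative names — this is not the actual PIC or FSC schema, nor PET code) of a lattice token carrying the annotations our pipeline supplies, alongside the three properties of the "pure" interface above, and of the POS-based filtering of ambiguous lexical entries:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical token of an annotated input lattice; field names are
# illustrative only. It bundles the PIC-style annotations (lemma,
# inflection, NER, POS constraint) with the three properties of the
# "pure" interface: surface string, internal-vs-external morphology
# flag, and orthographemic rule identifiers.
@dataclass
class InputToken:
    surface: str                          # surface form, used for lexical lookup
    lemma: Optional[str] = None           # lemma from the external lemmatizer
    inflection: List[str] = field(default_factory=list)  # inflection features
    ner: Optional[str] = None             # named-entity class, if any
    pos: List[str] = field(default_factory=list)  # POS tags licensed by the tagger
    external_morphology: bool = False     # external vs. internal morph analysis
    rule_ids: List[str] = field(default_factory=list)  # orthographemic rule ids

def pos_compatible(token: InputToken, lexical_pos: str) -> bool:
    """Keep a lexical entry only if the tagger's POS annotation licenses it
    (an empty annotation constrains nothing)."""
    return not token.pos or lexical_pos in token.pos

# Example: a form ambiguous between noun and verb in the lexicon,
# which the tagger has constrained to the noun reading.
tok = InputToken(surface="canto", lemma="canto", pos=["CN"],
                 external_morphology=True, rule_ids=["pl-rule"])
assert pos_compatible(tok, "CN")        # noun entry survives
assert not pos_compatible(tok, "V")     # verb entry is filtered out
```

Under the "pure" vision, only `surface`, `external_morphology`, and `rule_ids` would be interpreted by the parser itself; the remaining annotations would be consumed by token-mapping rules in the grammar.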
> and how much effort would be involved in making the transition?
Given our funding conditions: unaffordable.
Given our medium- to long-term goal, i.e. to get as fast as possible
to a set of materials for Portuguese with the size and maturity of
those existing for English: redoing what took us several years
is also unaffordable.
> i understand you are about to experiment with migrating to token mapping
> for independent reasons,
not necessarily: we have been living (for almost one year now) with our
patch for the HCONS problem in MRSs, waiting for a principled
solution that permits just removing that patch.
best,
--Ant.
> and i wholeheartedly believe you stand to gain
> from these revisions. in doing so, are there interface aspects that you
> believe cannot be accommodated within the assumptions of my ‘pure’ vision
> above?
>
> best wishes, oe
>
>
>
>
>
> On 8. juli 2010, at 12.17, Antonio Branco <Antonio.Branco at di.fc.ul.pt>
> wrote:
>
>>
>>
>>
>> Dear Uli,
>>
>> Please note that in the case of the Portuguese grammar,
>> interfacing the deep grammar with our pre-processing
>> tools (POS tagger, lemmatizer, morphological analyzer,
>> NER) is done exclusively via PIC, so having PET
>> without PIC would knock out our grammar from running
>> on it.
>>
>> All the best,
>>
>>
>> --Ant.
>>
>>
>> P.S.: Francisco has already sent to Bernd a detailed description
>> of the problem with PET we reported in Paris, together with
>> our grammar so that you will be able to reproduce it on your side.
>> For the sake of the record, he'll be submitting a ticket
>> in the PET bug tracker as well.
>>
>>
>>
>>
>> developers-request at emmtee.net wrote:
>> ----------
>>> Message: 1
>>> Date: Wed, 07 Jul 2010 18:34:40 +0200
>>> From: Ulrich Schaefer <ulrich.schaefer at dfki.de>
>>> Subject: Re: [developers] [pet] [delph-in] Poll to identify actively
>>> used functionality in PET
>>> To: pet at delph-in.net, developers at delph-in.net
>>> Message-ID: <4C34ACA0.4050304 at dfki.de>
>>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>>> From my / Heart of Gold's point of view, it would be OK to have a PET
>>> without PIC and SMAF. REPP isn't used in any of the hybrid workflows
>>> AFAIR.
>>> It implies that we will probably lose integrations for Italian and
>>> Norwegian (no harm as the grammars are no longer actively developed I
>>> guess) and maybe also Greek, but we would gain cleaner and uniform
>>> configuration settings, hopefully.
>>> I hope we will also be able to integrate Spanish with FreeLing via
>>> FSC soon.
>>> I would also like to replace the current stdin/stderr communication
>>> in PetModule by XML-RPC as soon as possible.
>>> -Uli
>>> Am 07.07.2010 12:56, schrieb Stephan Oepen:
>>>> i guess you're asking about FSR (aka FSPP) and REPP? the latter now
>>>> supersedes FSPP in the LKB, and for all i know the existing FSPP
>>>> support in PET (based on ECL) is not Unicode-enabled and builds on
>>>> the deprecated SMAF. hence, no practical loss purging that from PET
>>>> now, i'd think? REPP, on the other hand, should be natively
>>>> supported in PET, in my view. i seem to recall that you had a C++
>>>> implementation of REPP? woodley has a C implementation. maybe
>>>> sometime this fall we could jointly look at the choices (and
>>>> remaining limitations: i believe none of the existing
>>>> implementations is perfect in terms of characterization corner
>>>> cases), and then add native REPP support to PET?
>>>>
>>>> as for FSC, there is pretty good documentation on the wiki now, and
>>>> it seems the format is reasonably stable. i am inclined to preserve
>>>> YY format, as the non-XML alternative to inputting a PoS-annotated
>>>> token lattice.
>>>>
>>>> finally, i see your point about efficiency losses in
>>>> -default-les=all mode when combined with a very large number of
>>>> generics (i.e. one per LE type); personally, i'd think lexical
>>>> instantiation can be optimized to alleviate these concerns. i
>>>> personally find the limitations in the old generics mode so severe
>>>> that i can't imagine going back to that mode. but if there were
>>>> active users who'd be badly affected by its removal prior to
>>>> optimizing -default-les=all further, i have no opinion on when best
>>>> to ditch the old mode.
>>>>
>>>> best, oe
>>>>
>>>>
>>>>
>>>> On 7. juli 2010, at 02.03, Rebecca Dridan <bec.dridan at gmail.com> wrote:
>>>>
>>>>> I couldn't attend the PetRoadMap discussion - is there any summary
>>>>> of the discussion, or at least what decisions were made on the wiki?
>>>>>
>>>>>> Input formats we'd like to discard:
>>>>>>
>>>>>> - pic / pic_counts
>>>>>> - yy_counts
>>>>>> - smaf
>>>>>> - fsr
>>>>>>
>>>>> Particularly, what is the plan for inputs? FSC seemed to do
>>>>> everything I had needed from PIC, but at the time it was
>>>>> undocumented, experimental code. Will FSC be the default input
>>>>> format when annotation beyond POS tags is needed?
>>>>>
>>>>>> -default-les=traditional determine default les by posmapping for all
>>>>>> lexical gaps
>>>>> Does this mean that we can either hypothesise every generic entry
>>>>> for every token (and then filter them), or not use generic entries
>>>>> at all? I found this to be a major efficiency issue when large
>>>>> numbers of generic entries were available. I don't have a problem
>>>>> with defaulting to the current "all" setting, but I think there are
>>>>> still possible configurations where one would like to react only
>>>>> when lexical gaps were found.
>>>>>
>>>>>> Because these are the only modules that require the inclusion of ECL,
>>>>>> support for ECL in PET will also be removed.
>>>>> I celebrate the removal of ECL, but will there be any way of doing
>>>>> more than white space tokenisation natively in PET, or was the
>>>>> decision made that PET will always be run in conjunction with an
>>>>> LKB pre-processing step?
>>>>>
>>>>> Rebecca
>>>>>