[developers] developers Digest, Vol 66, Issue 3
oe at ifi.uio.no
Thu Jul 8 12:59:44 CEST 2010
hi antonio et al.,
there are really two aspects to be considered here: (a) the surface
format used in handing PET an annotated token lattice, and (b) the
details of how PET interprets various parts of its input. in terms of
current code consolidation plans, (b) is probably more important.
there is a fair amount of special-purpose, procedural machinery in
this area that we would like to eliminate. the ideal outcome, in my
view, is to reduce the set of properties on input tokens that PET
needs to know about to three: a string for lexical lookup, a binary
flag indicating internal vs. external morphological analysis, and a
list of identifiers (of orthographemic rules), in the case of external
morphology. everything else (e.g. characterization, use of annotation
for unknown word handling, custom CARGs or synthesized PREDs) can be
done in the grammar, i believe. this worked out well for the ERG.
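
to make the three-property interface above concrete, here is a minimal
sketch in Python. it is only an illustration: the class name, the field
names, and the rule identifier are hypothetical and do not correspond to
PET's actual data structures or to any particular grammar.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InputToken:
    """One input token, reduced to the three properties named above.

    All names here are hypothetical; this is not PET's API.
    """

    form: str  # string used for lexical lookup
    external_morphology: bool = False  # True: analysis was done outside the parser
    rule_ids: Tuple[str, ...] = ()  # orthographemic rule identifiers (external case only)

    def __post_init__(self) -> None:
        # Rule identifiers are only meaningful with external morphology.
        if self.rule_ids and not self.external_morphology:
            raise ValueError("rule_ids requires external_morphology=True")


# Internal morphology: the parser analyses the surface form itself.
internal = InputToken(form="dogs")

# External morphology: a pre-processor supplies the stem plus the rules to apply.
# "plur_noun_orule" is a made-up identifier for illustration.
external = InputToken(
    form="dog",
    external_morphology=True,
    rule_ids=("plur_noun_orule",),
)
```

in this reading, anything else a pre-processor knows about a token (PoS
tags, NE classes, character offsets) would travel as annotation for the
grammar to interpret, rather than as fields the parser itself inspects.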
so the main goal really should be interface simplification, and we
should ask: how far is the portuguese grammar removed from the above
ideal? and how much effort would be involved in making the
transition? i understand you are about to experiment with migrating
to token mapping for independent reasons, and i wholeheartedly believe
you stand to gain from these revisions. in doing so, are there
interface aspects that you believe cannot be accommodated within the
assumptions of my ‘pure’ vision above?
best wishes, oe
On 8 July 2010, at 12.17, Antonio Branco
<Antonio.Branco at di.fc.ul.pt> wrote:
> Dear Uli,
> Please note that in the case of the Portuguese grammar,
> interfacing the deep grammar with our pre-processing
> tools (POS tagger, lemmatizer, morphological analyzer,
> NER) is done exclusively via PIC, so having PET
> without PIC would knock out our grammar from running
> on it.
> All the best,
> P.S.: Francisco has already sent to Bernd a detailed description
> of the problem with PET we reported in Paris, together with
> our grammar so that you will be able to reproduce it on your side.
> For the record, he'll be submitting a ticket
> in the PET bug tracker as well.
> developers-request at emmtee.net wrote:
>> Message: 1
>> Date: Wed, 07 Jul 2010 18:34:40 +0200
>> From: Ulrich Schaefer <ulrich.schaefer at dfki.de>
>> Subject: Re: [developers] [pet] [delph-in] Poll to identify actively
>> used functionality in PET
>> To: pet at delph-in.net, developers at delph-in.net
>> Message-ID: <4C34ACA0.4050304 at dfki.de>
>> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>> From my / Heart of Gold's point of view, it would be OK to have a
>> PET without PIC and SMAF. REPP isn't used in any of the hybrid
>> workflows AFAIR.
>> It implies that we will probably lose integrations for Italian and
>> Norwegian (no harm as the grammars are no longer actively developed
>> I guess) and maybe also Greek, but we would gain cleaner and
>> uniform configuration settings, hopefully.
>> I hope we will also be able to integrate Spanish with FreeLing via
>> FSC soon.
>> I would also like to replace the current stdin/stderr communication
>> in PetModule by XML-RPC as soon as possible.
>> On 07.07.2010 12:56, Stephan Oepen wrote:
>>> i guess you're asking about FSR (aka FSPP) and REPP? the latter
>>> now supersedes FSPP in the LKB, and for all i know the existing
>>> FSPP support in PET (based on ECL) is not Unicode-enabled and
>>> builds on the deprecated SMAF. hence, no practical loss purging
>>> that from PET now, i'd think? REPP, on the other hand, should be
>>> natively supported in PET, in my view. i seem to recall that you
>>> had a C++ implementation of REPP? woodley has a C
>>> implementation. maybe sometime this fall we could jointly look at
>>> the choices (and remaining limitations: i believe none of the
>>> existing implementations is perfect in terms of characterization
>>> corner cases), and then add native REPP support to PET?
>>> as for FSC, there is pretty good documentation on the wiki now,
>>> and it seems the format is reasonably stable. i am inclined to
>>> preserve YY format, as the non-XML alternative to inputting a PoS-
>>> annotated token lattice.
>>> finally, i see your point about efficiency losses in
>>> -default-les=all mode when combined with a very large number of
>>> generics (i.e. one per LE type); personally, i'd think lexical
>>> instantiation can be optimized to alleviate these concerns. i
>>> personally find the limitations in the old generics mode so severe
>>> that i can't imagine going back to that mode. but if there were
>>> active users who'd be badly affected by its removal prior to
>>> optimizing -default-les=all further, i have no opinion on when
>>> best to ditch the old mode.
>>> best, oe
>>> On 7 July 2010, at 02.03, Rebecca Dridan <bec.dridan at gmail.com> wrote:
>>>> I couldn't attend the PetRoadMap discussion - is there any
>>>> summary of the discussion, or at least what decisions were made
>>>> on the wiki?
>>>>> Input formats we'd like to discard:
>>>>> - pic / pic_counts
>>>>> - yy_counts
>>>>> - smaf
>>>>> - fsr
>>>> Particularly, what is the plan for inputs? FSC seemed to do
>>>> everything I had needed from PIC, but at the time it was
>>>> undocumented, experimental code. Will FSC be the default input
>>>> format when annotation beyond POS tags is needed?
>>>>> -default-les=traditional determine default les by posmapping
>>>>> for all lexical gaps
>>>> Does this mean that we can either hypothesise every generic entry
>>>> for every token (and then filter them), or not use generic
>>>> entries at all? I found this to be a major efficiency issue when
>>>> large numbers of generic entries were available. I don't have a
>>>> problem with defaulting to the current "all" setting, but I think
>>>> there are still possible configurations where one would like to
>>>> react only when lexical gaps were found.
>>>>> Because these are the only modules that require ECL, support for
>>>>> ECL in PET will also be removed.
>>>> I celebrate the removal of ECL, but will there be any way of
>>>> doing more than white space tokenisation natively in PET, or was
>>>> the decision made that PET will always be run in conjunction with
>>>> an LKB pre-processing step?