[developers] Using Chart Parsing to integrate FreeLing

Lluís Padró padro at lsi.upc.edu
Fri Jul 15 22:04:43 CEST 2011

I guess you mean

   0, 1, "sin"
   1, 2, "embargo"
   0, 2, "sin embargo"

  ... Then it looks like we have been working much harder than was 
necessary...  :}
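For concreteness, the three-edge lattice above could be expressed in FSC roughly as in the sketch below, which generates it with a few lines of Python.  The element and feature names (fsc, chart, lattice, edge, fs, +FORM) follow my reading of the PetInput wiki page and may need adjusting:

```python
# Sketch: emit an FSC lattice encoding the ambiguous tokenization of
# "sin embargo" (one multiword token, or two separate tokens).
# Element/feature names follow the FSC description on
# http://wiki.delph-in.net/moin/PetInput; adjust as needed.
import xml.etree.ElementTree as ET

def fsc_lattice(text, edges):
    """edges: list of (source, target, form) triples over chart vertices."""
    fsc = ET.Element("fsc", version="1.0")
    chart = ET.SubElement(fsc, "chart", id="c1")
    ET.SubElement(chart, "text").text = text
    final = max(target for _, target, _ in edges)
    lattice = ET.SubElement(chart, "lattice", init="v0", final="v%d" % final)
    for source, target, form in edges:
        edge = ET.SubElement(lattice, "edge",
                             source="v%d" % source, target="v%d" % target)
        fs = ET.SubElement(edge, "fs", type="token")
        f = ET.SubElement(fs, "f", name="+FORM")
        ET.SubElement(f, "str").text = form
    return ET.tostring(fsc, encoding="unicode")

# both readings of "sin embargo" in one lattice:
doc = fsc_lattice("sin embargo",
                  [(0, 1, "sin"), (1, 2, "embargo"), (0, 2, "sin embargo")])
print(doc)
```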

   Could you provide more information on the right XML syntax to do 
that in SPPP?

On 15/07/11 12:59, Stephan Oepen wrote:
> hmm, both YY and SPPP support actual lattices, so why should it not 
> work to have an input like the following?
>   0, 1, "sin"
>   1, 2, "embargo"
>   0, 1, "sin embargo"
> in fact, even the current LKB should support this; does it not?
> cheers, oe
> On 15. juli 2011, at 21.26, Lluís Padró <padro at lsi.upc.edu> wrote:
>> Hi Stephan
>>   Our motivation is not to improve integration with PET, but to be 
>> able to feed ambiguous tokenization into the SRG (e.g. expressing 
>> that the multiword expression "sin_embargo" may be either an actual 
>> multiword (one single token) or two separate words (hence two tokens 
>> "sin" + "embargo").
>>   As far as we understand, SPPP (or YY) is not capable of 
>> representing this kind of ambiguity, while FSC is.
>>   We do not use token mapping rules.  We tried, but realized they are 
>> not what we need.
>>   We use the chart mapping machinery only because it supports FSC 
>> input.  The FreeLing interface (chartMap.cc) will take care of all 
>> token management and produce the final lattice that has to be loaded 
>> into the grammar, so there is no need for chart mapping rules.
>>   For the integration of the morphological information coming from 
>> FreeLing, we wrote some lexical rules that do the same work as the 
>> orthographemic rules in SPPP, and load the morphological information 
>> from the PoS tag into the FS.
>>   So far, it seems to work with cheap.  The next step will be testing 
>> it with [incr tsdb()], which we assume will also work.
>>   In summary, if the LKB accepted the FSC input format (even with no 
>> mapping rules), we could forget about SPPP.
>>   Meanwhile, we will keep both interfaces.
>>     Thank you
>>            Montse & Lluis
>> On 15/07/11 12:02, Stephan Oepen wrote:
>>> hi montse,
>>> i'm not sure i understand exactly what you're planning to do here, but
>>> i see that dan sent you a link to the general chart mapping machinery,
>>> and i noticed the new ChartMap.cc in FreeLing, which appears to
>>> output (more or less, i would guess) the same information as
>>> LKBAnalyzer.cc, but in the FSC format rather than SPPP.
>>> for FL integration with the LKB, SPPP currently is your only
>>> (supported) option, i.e. one not requiring you to provide your own
>>> Lisp code to call out to the tagger and interpret its results (this
>>> is what Jacy still does for ChaSen, but mostly because that interface
>>> was built prior to SPPP).
>>> so i am assuming you want to improve integration with PET here?  as
>>> it is currently, you can only use PET in connection with [incr
>>> tsdb()], which will then invoke FL through the SPPP interface and
>>> reformat its result in a form suitable for input to PET.  there are
>>> currently two such formats (that are officially supported): YY and
>>> FSC.  YY is equivalent to SPPP in what it can express, but using a
>>> more compact, non-XML syntax.  for more details, please see:
>>>   http://wiki.delph-in.net/moin/PetInput
>>> FSC is a more recent invention (by peter adolphs) that seeks to
>>> further generalize what can be provided as input to PET, going all
>>> the way to a lattice of (arbitrary) token feature structures.
>>> however, for all i recall, in FSC input mode there is currently no
>>> support for 'annotating' tokens with information about mandatory
>>> orthographemic rules (i.e. setting what in PET internally is known as
>>> the inflr_todo list on lexical items).  i recall peter and i
>>> discussed the necessity of this feature (which is available in YY
>>> mode) several times and concluded it was maybe unneeded.  one could
>>> 'mimic' the intended effect in the feature structures of the rules,
>>> i.e. have a list (+RULES or so) on each token feature structure,
>>> where members in this list could be strings naming orthographemic
>>> rules.  to enforce the application of a specific chain of
>>> orthographemic rules, the grammar would have to (a) percolate the
>>> +RULES value on all lexical signs (lexical entries and lexical
>>> rules); (b) make each orthographemic rule require its own name to be
>>> the 'next' rule to be called for (e.g. the value of a path like
>>> ARGS.FIRST.+RULES.FIRST); (c) 'pop' the +RULES list upon application
>>> of an orthographemic rule, i.e. percolate up to the mother
>>> ARGS.FIRST.+RULES.REST; and (d) require an empty +RULES value on all
>>> arguments to syntactic rules.
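As a rough illustration, steps (a)-(d) above might look something like the following in TDL (untested; +RULES and all type names here are purely illustrative):

```tdl
;; (a) tokens carry a list of orthographemic rule names to be consumed;
;;     lexical entries and lexical rules percolate +RULES unchanged
lexical-sign := sign &
  [ +RULES #rules,
    ARGS.FIRST.+RULES #rules ].

;; (b) + (c) each orthographemic rule requires its own name to be the
;;     'next' element on the daughter's +RULES list, and pops it on the
;;     way up to the mother
3sg-orule := orthographemic-rule &
  [ +RULES #rest,
    ARGS.FIRST.+RULES [ FIRST "3sg-orule",
                        REST #rest ] ].

;; (d) syntactic rules require the list to be exhausted
syntactic-rule := rule &
  [ ARGS.FIRST.+RULES *null* ].
```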
>>> i am not quite sure i would actually recommend the above approach to
>>> anyone.  one issue i see with it just now reflects recent discussion
>>> i had with dan and others about extending our notion of a derivation,
>>> to actually record additional information about the string-level
>>> effects upon application of each orthographemic rule (so as to be
>>> able to recover the corresponding surface form at the rule daughter
>>> and mother, e.g. if one were to accommodate tokenization conventions
>>> that split off punctuation marks).  in the approach sketched above,
>>> this information would not be available---whereas it could be when
>>> PET (like the LKB) remains in full control of the application of
>>> orthographemic rules.  however, when working with an external
>>> morphological analyzer (as is the case for the SRG), one would still
>>> need more information than is currently supported in YY.  for
>>> example, one could imagine extending the FSC 'edge' element with an
>>> 'analysis' element quite similar to the one in SPPP.  this is an area
>>> for revision that i would like to discuss with peter over the
>>> holidays.
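Such an extension might look roughly like the fragment below (entirely hypothetical: the element and attribute names are invented for illustration, not part of any implemented format):

```xml
<!-- hypothetical: an SPPP-style analysis annotation on an FSC edge,
     recording stem, PoS tag, and mandatory orthographemic rule -->
<edge source="v0" target="v1">
  <fs type="token">
    <f name="+FORM"><str>canta</str></f>
  </fs>
  <analysis stem="cantar" tag="VMIP3S0">
    <rule id="3sg-verb-orule"/>
  </analysis>
</edge>
```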
>>> --- in summary, your (assumed) desire to improve integration of PET
>>> and FL (without requiring the assistance of [incr tsdb()]) has
>>> prompted me to recall some remaining open questions in the token
>>> lattice input design for PET, particularly in connection with an
>>> external morphological analyzer.  i plan on returning to these
>>> jointly with peter before too long.
>>> in the meantime, i suspect you might actually be better served using
>>> the YY input format for PET (which, after all, is what [incr tsdb()]
>>> converts SPPP inputs to).  fortunately, the specific input format
>>> used (YY or FSC) is independent of the use of chart mapping.  that
>>> is, using either format, as soon as the initial lattice of token
>>> feature structures is created in PET, everything else remains the
>>> same.  thus, if you were looking to utilize chart mapping to improve
>>> your treatment of numbers, dates, or other named entities that can be
>>> recognized in terms of regular expressions, you could do so by adding
>>> a set of token mapping rules to the SRG (much like we have in the
>>> ERG).  that would remain valid no matter what revisions to FSC (and
>>> possibly YY) might come down the road :-).
>>> i hope there may be some useful information in this partly
>>> self-serving message to you!
>>> best, oe
>>> On Thu, Jul 7, 2011 at 18:46, Montserrat Marimon
>>> <montserrat.marimon at ub.edu>  wrote:
>>>> Hi everybody,
>>>> Since the SRG is the only grammar which integrates a tagger using
>>>> SPPP, we've decided to use chart parsing to integrate it.
>>>> Is there any document we could read?
>>>> Thanks,
>>>> -- 
>>>> Montserrat Marimon
>>>> Departament de Lingüística General
>>>> Facultat de Filologia - Universitat de Barcelona
>>>> Edifici Josep Carner, 5a planta
>>>> Gran Via de les Corts Catalanes, 585
>>>> 08007 BARCELONA
>>>> tel.: + 34 93 4034695
>>>> http://stel.ub.edu/linguistica-ub/
