[developers] Using Chart Parsing to integrate FreeLing

Lluís Padró padro at lsi.upc.edu
Fri Jul 15 22:04:43 CEST 2011


I guess you mean

   0, 1, "sin"
   1, 2, "embargo"
   0, 2, "sin embargo"

  ... Then it looks like we have been working much harder than was 
necessary...  :}

   Could you provide more information on the right XML syntax to do that 
in SPPP?


On 15/07/11 12:59, Stephan Oepen wrote:
> hmm, both YY and SPPP support actual lattices, so why should it not 
> work to have an input like the following?
>
>   0, 1, "sin"
>   1, 2, "embargo"
>   0, 1, "sin embargo"
>
> in fact, even the current LKB should support this; does it not?
>
> cheers, oe
>
>
> On 15. juli 2011, at 21.26, Lluís Padró <padro at lsi.upc.edu> wrote:
>
>> Hi Stephan
>>
>>   Our motivation is not to improve integration with PET, but to be 
>> able to feed ambiguous tokenization into the SRG (e.g. expressing 
>> that the multiword expression "sin_embargo" may be either an actual 
>> multiword (one single token) or two separate words (hence two tokens 
>> "sin"+"embargo").
>>
>>   As far as we understand, SPPP (or YY) is not capable of 
>> representing this kind of ambiguity, while FSC is.
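>>
>>   For illustration, that "sin embargo" lattice in FSC would look 
>> roughly like the sketch below (element names as in the FSC examples 
>> on the PetInput wiki; the token type and the +FORM/+FROM/+TO features 
>> follow ERG-style conventions and may differ in the SRG):
>>
>>   <fsc version="1.0">
>>    <chart id="sin-embargo">
>>     <text>sin embargo</text>
>>     <lattice init="v0" final="v2">
>>      <edge source="v0" target="v1">
>>       <fs type="token">
>>        <f name="+FORM"><str>sin</str></f>
>>        <f name="+FROM"><str>0</str></f>
>>        <f name="+TO"><str>3</str></f>
>>       </fs>
>>      </edge>
>>      <edge source="v1" target="v2">
>>       <fs type="token">
>>        <f name="+FORM"><str>embargo</str></f>
>>        <f name="+FROM"><str>4</str></f>
>>        <f name="+TO"><str>11</str></f>
>>       </fs>
>>      </edge>
>>      <edge source="v0" target="v2">
>>       <fs type="token">
>>        <f name="+FORM"><str>sin_embargo</str></f>
>>        <f name="+FROM"><str>0</str></f>
>>        <f name="+TO"><str>11</str></f>
>>       </fs>
>>      </edge>
>>     </lattice>
>>    </chart>
>>   </fsc>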
>>
>>   We do not use token mapping rules.  We tried, but realized they are 
>> not what we need.
>>   We use the chart mapping machinery only because it supports FSC 
>> input.  The FreeLing interface (chartMap.cc) will take care of all token 
>> management and produce the final lattice that has to be loaded into 
>> the grammar, so no need for chart mapping rules.
>>
>>   For the integration of the morphological information coming from 
>> FreeLing, we wrote some lexical rules that do the same work as the 
>> orthographemic rules in SPPP, and load the morphological information 
>> from the PoS tag into the FS.
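>>
>>   Purely as an illustration (type and feature names below are 
>> invented, not the actual SRG ones), such a rule pairs an EAGLES-style 
>> tag delivered by FreeLing with the agreement values it encodes:
>>
>>   ;; illustrative only: instantiate agreement from the tagger's tag
>>   tag-ncms-lrule := lex-rule &
>>     [ SYNSEM.LOCAL.AGR [ GEN masc, NUM sg ],
>>       ARGS < [ TAG "ncms000" ] > ].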
>>
>>   So far, it seems to work with cheap.  The next step will be testing 
>> it with [incr tsdb()], which we assume will also work.
>>
>>   In summary, if the LKB accepted the FSC input format (even with no 
>> mapping rules), we could forget about SPPP.
>>   Meanwhile, we will keep both interfaces.
>>
>>     Thank you
>>
>>            Montse & Lluis
>>
>>
>> On 15/07/11 12:02, Stephan Oepen wrote:
>>> hi montse,
>>>
>>> i'm not sure i understand exactly what you're planning to do here, but
>>> i see that dan sent you a link to the general chart mapping machinery,
>>> and i noticed the new ChartMap.cc in FreeLing, which appears to
>>> output (more or less, i would guess) the same information as
>>> LKBAnalyzer.cc, but in the FSC format rather than SPPP.
>>>
>>> for FL integration with the LKB, SPPP currently is your only 
>>> (supported)
>>> option, i.e. not requiring you to provide your own Lisp code to call 
>>> out to
>>> the tagger and interpret its results (this is what Jacy still does 
>>> for ChaSen,
>>> but mostly because that interface was built prior to SPPP).
>>>
>>> so i am assuming you want to improve integration with PET here?  as it
>>> is currently, you can only use PET in connection with [incr tsdb()], 
>>> which
>>> will then invoke FL through the SPPP interface and reformat its 
>>> result in
>>> a form suitable for input to PET.  there are currently two such formats
>>> (that are officially supported): YY and FSC.  YY is equivalent to SPPP
>>> in what it can express, but using a more compact, non-XML syntax.  for
>>> more details, please see:
>>>
>>>   http://wiki.delph-in.net/moin/PetInput
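>>>
>>> for a flavour of the syntax, a single YY token carrying two PoS 
>>> hypotheses from a tagger could look like the following (ids, spans, 
>>> and probabilities made up for the example):
>>>
>>>   (42, 0, 1, <0:12>, 1, "tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)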
>>>
>>> FSC is a more recent invention (by peter adolphs) that seeks to further
>>> generalize what can be provided as input to PET, going all the way to a
>>> lattice of (arbitrary) token feature structures.  however, for all i 
>>> recall, in
>>> FSC input mode there is currently no support for 'annotating' tokens 
>>> with
>>> information about mandatory orthographemic rules (i.e. setting what in
>>> PET internally is known as the inflr_todo list on lexical items).  i 
>>> recall
>>> peter and i discussed the necessity of this feature (which is 
>>> available in
>>> YY mode) several times and concluded it was maybe unneeded.  one
>>> could 'mimic' the intended effect in the feature structures of the 
>>> rules,
>>> i.e. have a list (+RULES or so) on each token feature structure, where
>>> members in this list could be strings naming orthographemic rules.  to
>>> enforce the application of a specific chain of orthographemic rules, 
>>> the
>>> grammar would have to (a) percolate the +RULES value on all lexical
>>> signs (lexical entries and lexical rules); (b) make each 
>>> orthographemic
>>> rule require its own name to be the 'next' rule to be called for 
>>> (e.g. the
>>> value of a path like ARGS.FIRST.+RULES.FIRST); (c) 'pop' the +RULES
>>> list upon application of an orthographemic rule, i.e. percolate up 
>>> to the
>>> mother ARGS.FIRST.+RULES.REST; and (d) require an empty +RULES
>>> value on all arguments to syntactic rules.
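>>>
>>> purely as a sketch (all type and feature names here are invented),
>>> (a) through (d) might come out in TDL along these lines:
>>>
>>>   ;; (a) lexical signs percolate +RULES up unchanged by default
>>>   lex-sign := sign &
>>>     [ +RULES #rules,
>>>       ARGS.FIRST.+RULES #rules ].
>>>
>>>   ;; (b) an orthographemic rule requires its own name as the 'next'
>>>   ;; rule, and (c) pops it off the list upon application
>>>   foo-orth-rule := lex-rule &
>>>     [ +RULES #rest,
>>>       ARGS.FIRST.+RULES [ FIRST "foo-orth-rule",
>>>                           REST #rest ] ].
>>>
>>>   ;; (d) syntactic rules require an exhausted list on each argument
>>>   syn-rule := phrase &
>>>     [ ARGS.FIRST.+RULES *null* ].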
>>>
>>> i am not quite sure i would actually recommend the above approach to
>>> anyone.  one issue i see with it just now reflects recent discussion i
>>> had with dan and others about extending our notion of a derivation, to
>>> actually record additional information about the string-level 
>>> effects upon
>>> application of each orthographemic rule (so as to be able to recover 
>>> the
>>> corresponding surface form at the rule daughter and mother, e.g. if one
>>> were to accommodate tokenization conventions that split off punctuation
>>> marks).  in the approach sketched above, this information would not
>>> be available---whereas it could be when PET (like the LKB) remains in
>>> full control of the application of orthographemic rules.  however, when
>>> working with an external morphological analyzer (as is the case for
>>> the SRG), one would still need more information than is currently
>>> supported in YY.  for example, one could imagine extending the FSC
>>> 'edge' element with an 'analysis' element quite similar to the one in
>>> SPPP.  this is an area for revision that i would like to discuss with
>>> peter over the holidays.
>>>
>>> --- in summary, your (assumed) desire to improve integration of PET
>>> and FL (without requiring the assistance of [incr tsdb()]) has prompted
>>> me to recall some remaining open questions in the token lattice input
>>> design for PET, particularly in connection with an external 
>>> morphological
>>> analyzer.  i plan on returning to these jointly with peter before 
>>> too long.
>>>
>>> in the meantime, i suspect you might actually be better served using YY
>>> input format for PET (which, after all, is what [incr tsdb()] 
>>> converts to from
>>> SPPP inputs).  fortunately, the specific input format used (YY or 
>>> FSC) is
>>> independent of the use of chart mapping.  that is, using either 
>>> format, as
>>> soon as the initial lattice of token feature structures is created 
>>> in PET,
>>> everything else remains the same.  thus, if you were looking to utilize
>>> chart mapping to improve your treatment of numbers, dates, or other
>>> named entities that can be recognized in terms of regular expressions,
>>> you could do so by adding a set of token mapping rules to the SRG
>>> (much like we have in the ERG).  that would remain valid no matter
>>> what revisions to FSC (and possibly YY) might be down the road :-).
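>>>
>>> to give the rough flavour (the rule below only mimics the style of 
>>> the ERG token mapping rules; its name and output features are made 
>>> up), such a rule matches a token by regular expression and rewrites 
>>> it in place:
>>>
>>>   ;; indicative sketch: mark four-digit tokens as year candidates
>>>   four_digit_year_tmr := token_mapping_rule &
>>>     [ +INPUT < [ +FORM "^[0-9]{4}$" ] >,
>>>       +OUTPUT < [ +CLASS card_ne ] >,
>>>       +POSITION "O1@I1" ].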
>>>
>>> i hope there may be some useful information in this partly self-serving
>>> message to you!
>>>
>>> best, oe
>>>
>>>
>>>
>>> On Thu, Jul 7, 2011 at 18:46, Montserrat Marimon
>>> <montserrat.marimon at ub.edu>  wrote:
>>>> Hi everybody,
>>>>
>>>> Since the SRG is the only grammar which integrates a tagger using 
>>>> SPPP,
>>>> we've decided to use chart parsing to integrate it.
>>>> Is there any document we could read?
>>>>
>>>> Thanks,
>>>>
>>>> -- 
>>>> Montserrat Marimon
>>>> Departament de Lingüística General
>>>> Facultat de Filologia - Universitat de Barcelona
>>>> Edifici Josep Carner, 5a planta
>>>> Gran Via de les Corts Catalanes, 585
>>>> 08007 BARCELONA
>>>> tel.: + 34 93 4034695
>>>>
>>>> http://stel.ub.edu/linguistica-ub/
>>>>
>>>>
>>



