[developers] Using Chart Parsing to integrate FreeLing

Mon Jul 18 21:25:58 CEST 2011

    We tried that and it works.  It builds the right tokens based on the 
character-wise positions.

    So, I guess we'll continue using SPPP, and wait for FSC support in 
LKB...

         thank you very much

             Lluis


On 15/07/11 13:21, Stephan Oepen wrote:
> yes, sorry: that's what i meant.  well, SPPP only 'positions' its
> tokens in terms of character start and end indices (whereas in
> my example below i was assuming inter-token vertices, as are
> used in YY).  SPPP will create the actual lattice by setting up
> vertices 'as appropriate' and arranging tokens relative to these
> (see sppp-serialize-tokens() for the current implementation).
>
> so it *should* work to just have three 'token' elements, where "sin"
> and "sin embargo" have the same 'from' value, and "embargo" and
> "sin embargo" have the same 'to' value.  could you give this a
> quick shot (and use inspect the LKB SPPP internals and token
> chart to see what the results are; e.g. print-tchart())?
> nb: for the record, i consider the lack of explicit vertices in SPPP
> a design flaw nowadays, seeing that we consider the input to
> parsing a lattice of tokens.  hence, a moderately extended FSC
> supported in both PET and the LKB seems like the right way
> forward, in principle.
> On Fri, Jul 15, 2011 at 22:04, Lluís Padró<padro at lsi.upc.edu>  wrote:
>> I guess you mean
>>
>>   0, 1, "sin"
>>   1, 2, "embargo"
>>   0, 2, "sin embargo"
>>
>>   ... Then it looks we have been working much more that it was necessary...
>>   :}
>>
>>   Could you provide more information on the right XML syntax to do that in
>> SPPP ?
>>
>>
>> On 15/07/11 12:59, Stephan Oepen wrote:
>>> hmm, both YY and SPPP support actual lattices, so why should it not work
>>> to have an input like the following?
>>>
>>>   0, 1, "sin"
>>>   1, 2, "embargo"
>>>   0, 1, "sin embargo"
>>>
>>> in fact, even the current LKB should support this; does it not?
>>>
>>> cheers, oe
>>>
>>>
>>> On 15. juli 2011, at 21.26, Lluís Padró<padro at lsi.upc.edu>  wrote:
>>>
>>>> Hi Stephan
>>>>
>>>>   Out motivation is not improving integration with PET, but to be able to
>>>> feed ambiguous tokenization into the SRG (e.g. expressing that the multiword
>>>> expression "sin_embargo" may be either an actual multiword (one single
>>>> token) or two separate words (hence two tokens "sin"+"embargo").
>>>>
>>>>   As far as we understand, SPPP (or YY) is not capable of representing
>>>> this kind of ambiguity, while FSC is.
>>>>
>>>>   We do not use token mapping rules.  We tried, but realized they are not
>>>> what we need.
>>>>   We use chart mapping machinery only because it supports FSC input .The
>>>> FreeLing interface (chartMap.cc) will take care of all token management and
>>>> produce the final lattice that has to be loaded into the grammar, so no need
>>>> for chart mapping rules.
>>>>
>>>>   For the integration of the morphological information coming from
>>>> FreeLing, we wrote some lexical rules that do the same work than the
>>>> ortographemic rules in SPPP, and load the morpohlogical information form the
>>>> PoS tag into the FS.
>>>>
>>>>   So far, it seems to work with cheap.  Next step will be testing it with
>>>> [incr tsdb()], which we assume should work also.
>>>>
>>>>   In summary, if LKB accepted FSC input format (even with no mapping
>>>> rules), we could forget about SPPP.
>>>>   Meanwhile, we will keep both interfaces.
>>>>
>>>>     Thank you
>>>>
>>>>            Montse&  Lluis
>>>>
>>>>
>>>> On 15/07/11 12:02, Stephan Oepen wrote:
>>>>> hi montse,
>>>>>
>>>>> i'm not sure i understand exactly what you're planning to do here, but
>>>>> i see that dan sent you a link to the general chart mapping machinery,
>>>>> and i noticed the new ChartMap.cc in FreeLing, which appears to
>>>>> output (more or less, i would guess) the same information as
>>>>> LKBAnalyzer.cc, but in the FSC format rather than SPPP.
>>>>>
>>>>> for FL integration with the LKB, SPPP currently is your only (supported)
>>>>> option, i.e. not requiring you to provide your own Lisp code to call out
>>>>> to
>>>>> the tagger and interpret its results (this is what Jacy still does for
>>>>> ChaSen,
>>>>> but mostly because that interface was built prior to SPPP).
>>>>>
>>>>> so i am assuming you want to improve integration with PET here?  as it
>>>>> is currently, you can only use PET in connection with [incr tsdb()],
>>>>> which
>>>>> will then invoke FL through the SPPP interface and reformat its result
>>>>> in
>>>>> a form suitable for input to PET.  there are currently two such formats
>>>>> (that are officially supported): YY and FSC.  YY is equivalent to SPPP
>>>>> in what it can express, but using a more compact, non-XML syntax.  for
>>>>> more details, please see:
>>>>>
>>>>>   http://wiki.delph-in.net/moin/PetInput
>>>>>
>>>>> FSC is a more recent invention (by peter adolphs) that seeks to further
>>>>> generalize what can be provided as input to PET, going all the way to a
>>>>> lattice of (arbitrary) token feature structures.  however, for all i
>>>>> recall, in
>>>>> FSC input mode there is currently no support for 'annotating' tokens
>>>>> with
>>>>> information about mandatory orthographemic rules (i.e. setting what in
>>>>> PET internally is known as the inflr_todo list on lexical items).  i
>>>>> recall
>>>>> peter and i discussed the necessity of this feature (which is available
>>>>> in
>>>>> YY mode) several times and concluded it was maybe unneeded.  one
>>>>> could 'mimic' the intended effect in the feature structures of the
>>>>> rules,
>>>>> i.e. have a list (+RULES or so) on each token feature structure, where
>>>>> members in this list could be strings naming orthographemic rules.  to
>>>>> enforce the application of a specific chain of orthographemic rules, the
>>>>> grammar would have to (a) percolate the +RULES value on all lexical
>>>>> signs (lexical entries and lexical rules); (b) make each orthorgraphemic
>>>>> rule require its own name to be the 'next' rule to be called for (e.g.
>>>>> the
>>>>> value of a path like ARGS.FIRST.+RULES.FIRST); (c) 'pop' the +RULES
>>>>> list upon application of an orthographemic rule, i.e. percolate up to
>>>>> the
>>>>> mother ARGS.FIRST.+RULES.REST; and (d) require an empty +RULES
>>>>> value on all arguments to syntactic rules.
>>>>>
>>>>> i am not quite sure i would actually recommend the above approach to
>>>>> anyone.  one issue i see with it just now reflects recent discussion i
>>>>> had with dan and others about extending our notion of a derivation, to
>>>>> actually record additional information about the string-level effects
>>>>> upon
>>>>> application of each orthographemic rule (so as to be able to recover the
>>>>> corresponding surface form at the rule daughter and mother, e.g. if one
>>>>> were to accomodate tokenization conventions that split off punctuation
>>>>> marks).  in the approach sketched above, this information would not
>>>>> be available---whereas it could be when PET (like the LKB) remains in
>>>>> full control of the application of orthographemic rules.  however, when
>>>>> working with an external morphological analyzer (as is the case for
>>>>> the SRG), one would still need more information than is currently
>>>>> supported in YY.  for example, one could imagine extending the FSC
>>>>> 'edge' element with an 'analysis' element quite similar to the one in
>>>>> SPPP.  this is an area for revision that i would like to discuss with
>>>>> peter aver the holidays.
>>>>>
>>>>> --- in summary, your (assumed) desire to improve integration of PET
>>>>> and FL (without requiring the assistance of [incr tsdb()]) has prompted
>>>>> me to recall some remaining open questions in the token lattice input
>>>>> design for PET, particularly in connection with an external
>>>>> morphological
>>>>> analyzer.  i plan on returning to these jointly with peter before too
>>>>> long.
>>>>>
>>>>> in the meantime, i suspect you might actually be better served using YY
>>>>> input format for PET (which, after all, is what [incr tsdb()] converts
>>>>> to from
>>>>> SPPP inputs).  fortunately, the specific input format used (YY or FSC)
>>>>> is
>>>>> independent of the use of chart mapping.  that is, using either format,
>>>>> as
>>>>> soon as the initial lattice of token feature structures is created in
>>>>> PET,
>>>>> everything else remains the same.  thus, if you were looking to utilize
>>>>> chart mapping to improve your treatment of numbers, dates, or other
>>>>> named entities that can be recognized in terms of regular expressions,
>>>>> you could do so by adding a set of token mapping rules to the SRG
>>>>> (much like we have in the ERG).  that would remain valid no matter
>>>>> what revisions to FSC (and possibly YY) might be down the road :-).
>>>>>
>>>>> i hope there may be some useful information in this partly self-serving
>>>>> message to you!
>>>>>
>>>>> best, oe
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 7, 2011 at 18:46, Montserrat Marimon
>>>>> <montserrat.marimon at ub.edu>    wrote:
>>>>>> Hi everybody,
>>>>>>
>>>>>> Since the SRG is the only grammar which integrates a tagger using SPPP,
>>>>>> we've decided to use chart parsing to integrate it.
>>>>>> Is there any document we could read?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --
>>>>>> Montserrat Marimon
>>>>>> Departament de Lingüística General
>>>>>> Facultat de Filologia - Universitat de Barcelona
>>>>>> Edifici Josep Carner, 5a planta
>>>>>> Gran Via de les Corts Catalanes, 585
>>>>>> 08007 BARCELONA
>>>>>> tel.: + 34 93 4034695
>>>>>>
>>>>>> http://stel.ub.edu/linguistica-ub/
>>>>>>
>>>>>>
>>