[developers] Using Chart Parsing to integrate FreeLing

Tue Jul 19 04:18:21 CEST 2011

just for the record, the current russian resource grammar also relies
on SPPP for the integration with the ``mystem'' morphological analyzer.
but we are flexible with the choice of input format/protocol.

yi

On Tue, Jul 19, 2011 at 4:10 AM, Montserrat Marimon
<montserrat.marimon at ub.edu> wrote:
> Yes it works, but should we move to FSC anyway? (now that we know how to do
> that) I thought you wanted to get rid of SPPP (are we the only ones using
> SPPP?)... We could use SPPP with LKB and FSC with PET until there's support
> for FSC in LKB.
>
>
> On Mon, Jul 18, 2011 at 12:25 PM, Lluís Padró <padro at lsi.upc.edu> wrote:
>>
>>   We tried that and it works.  It builds the right tokens based on the
>> character-wise positions.
>>
>>   So, I guess we'll continue using SPPP, and wait for FSC support in
>> LKB...
>>
>>        thank you very much
>>
>>            Lluis
>>
>>
>> On 15/07/11 13:21, Stephan Oepen wrote:
>>>
>>> yes, sorry: that's what i meant.  well, SPPP only 'positions' its
>>> tokens in terms of character start and end indices (whereas in
>>> my example below i was assuming inter-token vertices, as are
>>> used in YY).  SPPP will create the actual lattice by setting up
>>> vertices 'as appropriate' and arranging tokens relative to these
>>> (see sppp-serialize-tokens() for the current implementation).
>>>
>>> so it *should* work to just have three 'token' elements, where "sin"
>>> and "sin embargo" have the same 'from' value, and "embargo" and
>>> "sin embargo" have the same 'to' value.  could you give this a
>>> quick shot (and inspect the LKB SPPP internals and token
>>> chart to see what the results are; e.g. print-tchart())?
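
A minimal sketch of the idea described above: tokens carry only character 'from' and 'to' values, and lattice vertices are set up afterwards. This is not the logic of sppp-serialize-tokens(), which is not shown in this thread. The function name, the end-exclusive character offsets, and the rule "a token ends at the vertex of the next token start" are all assumptions made for illustration.

```python
from bisect import bisect_left


def spans_to_lattice(tokens):
    """Turn (char_from, char_to, form) tokens into (vertex, vertex, form) edges.

    Assumption: one vertex per distinct token start offset, plus a final
    vertex. A token ends at the vertex of the first start offset at or
    after its end offset, so the blank between two words is skipped.
    """
    starts = sorted({start for start, _end, _form in tokens})
    edges = []
    for start, end, form in tokens:
        source = bisect_left(starts, start)  # exact index, start is in starts
        target = bisect_left(starts, end)    # len(starts) means the final vertex
        edges.append((source, target, form))
    return edges


# Assumed offsets for the string "sin embargo" (end-exclusive).
tokens = [
    (0, 3, "sin"),
    (4, 11, "embargo"),
    (0, 11, "sin embargo"),
]
edges = spans_to_lattice(tokens)
print(edges)
```

Under these assumptions, "sin" and "sin embargo" share a start vertex and "embargo" and "sin embargo" share an end vertex, which is the arrangement the message asks for.
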
>>>
>>> nb: for the record, i consider the lack of explicit vertices in SPPP
>>> a design flaw nowadays, seeing that we consider the input to
>>> parsing a lattice of tokens.  hence, a moderately extended FSC
>>> supported in both PET and the LKB seems like the right way
>>> forward, in principle.
>>>
>>> On Fri, Jul 15, 2011 at 22:04, Lluís Padró <padro at lsi.upc.edu> wrote:
>>>>
>>>> I guess you mean
>>>>
>>>>  0, 1, "sin"
>>>>  1, 2, "embargo"
>>>>  0, 2, "sin embargo"
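
To make the ambiguity concrete, here is a small sketch that reads the three lines above as (from vertex, to vertex, form) edges and lists every token sequence spanning the lattice. The function name is hypothetical, and the code assumes an acyclic lattice in which every edge moves to a higher vertex.

```python
def tokenizations(edges, start, end):
    """Return every sequence of token forms leading from start to end."""
    if start == end:
        return [[]]
    paths = []
    for source, target, form in edges:
        if source == start:
            # Extend this edge with every way of covering the rest.
            for rest in tokenizations(edges, target, end):
                paths.append([form] + rest)
    return paths


edges = [(0, 1, "sin"), (1, 2, "embargo"), (0, 2, "sin embargo")]
for path in tokenizations(edges, 0, 2):
    print(path)
```

With the multiword edge running from 0 to 2, the lattice has exactly two readings: two tokens, or one multiword token. If that edge ran from 0 to 1 instead, the second reading would be "sin embargo" followed by "embargo", which is presumably not what was intended.
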
>>>>
>>>>  ... Then it looks like we have been working much more than was
>>>> necessary...
>>>>  :}
>>>>
>>>>  Could you provide more information on the right XML syntax to do
>>>> that in SPPP?
>>>>
>>>>
>>>> On 15/07/11 12:59, Stephan Oepen wrote:
>>>>>
>>>>> hmm, both YY and SPPP support actual lattices, so why should it not
>>>>> work to have an input like the following?
>>>>>
>>>>>  0, 1, "sin"
>>>>>  1, 2, "embargo"
>>>>>  0, 1, "sin embargo"
>>>>>
>>>>> in fact, even the current LKB should support this; does it not?
>>>>>
>>>>> cheers, oe
>>>>>
>>>>>
>>>>> On 15. juli 2011, at 21.26, Lluís Padró <padro at lsi.upc.edu> wrote:
>>>>>
>>>>>> Hi Stephan
>>>>>>
>>>>>>  Our motivation is not to improve integration with PET, but to be
>>>>>> able to feed ambiguous tokenization into the SRG (e.g. expressing
>>>>>> that the multiword expression "sin_embargo" may be either an actual
>>>>>> multiword (one single token) or two separate words (hence two
>>>>>> tokens, "sin"+"embargo")).
>>>>>>
>>>>>>  As far as we understand, SPPP (or YY) is not capable of representing
>>>>>> this kind of ambiguity, while FSC is.
>>>>>>
>>>>>>  We do not use token mapping rules.  We tried, but realized they
>>>>>> are not what we need.
>>>>>>  We use the chart mapping machinery only because it supports FSC
>>>>>> input.  The FreeLing interface (chartMap.cc) will take care of all
>>>>>> token management and produce the final lattice that has to be
>>>>>> loaded into the grammar, so there is no need for chart mapping
>>>>>> rules.
>>>>>>
>>>>>>  For the integration of the morphological information coming from
>>>>>> FreeLing, we wrote some lexical rules that do the same work as the
>>>>>> orthographemic rules in SPPP, and load the morphological information
>>>>>> from the PoS tag into the FS.
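
As a rough illustration of what "loading the morphological information from the PoS tag" involves, the sketch below decodes a positional verb tag into attribute-value pairs. It assumes EAGLES-style tags of the kind FreeLing produces for Spanish (e.g. VMIP3S0). The code tables are one reading of that tagset and should be checked against the FreeLing tagset documentation. The feature names are hypothetical and are not those of the SRG, whose lexical rules are not shown in this thread.

```python
# Positions 2-7 of a verb tag; all attribute names are illustrative only.
VERB_FIELDS = [
    ("type", {"M": "main", "A": "auxiliary", "S": "semiauxiliary"}),
    ("mood", {"I": "indicative", "S": "subjunctive", "M": "imperative",
              "N": "infinitive", "G": "gerund", "P": "participle"}),
    ("tense", {"P": "present", "I": "imperfect", "F": "future",
               "S": "past", "C": "conditional"}),
    ("person", {"1": "1", "2": "2", "3": "3"}),
    ("number", {"S": "singular", "P": "plural"}),
    ("gender", {"M": "masculine", "F": "feminine"}),
]


def decode_verb_tag(tag):
    """Map a positional verb tag to a flat dictionary of features."""
    if not tag.startswith("V"):
        raise ValueError("this sketch only handles verb tags: " + tag)
    features = {"category": "verb"}
    for code, (name, values) in zip(tag[1:], VERB_FIELDS):
        if code != "0":  # "0" marks an attribute that does not apply
            features[name] = values.get(code, code)
    return features


print(decode_verb_tag("VMIP3S0"))
```

In the grammar itself this mapping would be expressed as constraints in lexical rules rather than as a lookup table; the table only shows which pieces of information the tag carries.
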
>>>>>>
>>>>>>  So far, it seems to work with cheap.  The next step will be testing
>>>>>> it with [incr tsdb()], which we assume should also work.
>>>>>>
>>>>>>  In summary, if LKB accepted FSC input format (even with no mapping
>>>>>> rules), we could forget about SPPP.
>>>>>>  Meanwhile, we will keep both interfaces.
>>>>>>
>>>>>>    Thank you
>>>>>>
>>>>>>           Montse & Lluis
>>>>>>
>>>>>>
>>>>>> On 15/07/11 12:02, Stephan Oepen wrote:
>>>>>>>
>>>>>>> hi montse,
>>>>>>>
>>>>>>> i'm not sure i understand exactly what you're planning to do here,
>>>>>>> but i see that dan sent you a link to the general chart mapping
>>>>>>> machinery, and i noticed the new ChartMap.cc in FreeLing, which
>>>>>>> appears to output (more or less, i would guess) the same
>>>>>>> information as LKBAnalyzer.cc, but in the FSC format rather than
>>>>>>> SPPP.
>>>>>>>
>>>>>>> for FL integration with the LKB, SPPP currently is your only
>>>>>>> (supported) option, i.e. not requiring you to provide your own
>>>>>>> Lisp code to call out to the tagger and interpret its results
>>>>>>> (this is what Jacy still does for ChaSen, but mostly because that
>>>>>>> interface was built prior to SPPP).
>>>>>>>
>>>>>>> so i am assuming you want to improve integration with PET here?
>>>>>>> as it is currently, you can only use PET in connection with
>>>>>>> [incr tsdb()], which will then invoke FL through the SPPP
>>>>>>> interface and reformat its result in a form suitable for input to
>>>>>>> PET.  there are currently two such formats (that are officially
>>>>>>> supported): YY and FSC.  YY is equivalent to SPPP in what it can
>>>>>>> express, but using a more compact, non-XML syntax.  for more
>>>>>>> details, please see:
>>>>>>>
>>>>>>>  http://wiki.delph-in.net/moin/PetInput
>>>>>>>
>>>>>>> FSC is a more recent invention (by peter adolphs) that seeks to
>>>>>>> further generalize what can be provided as input to PET, going all
>>>>>>> the way to a lattice of (arbitrary) token feature structures.
>>>>>>> however, for all i recall, in FSC input mode there is currently no
>>>>>>> support for 'annotating' tokens with information about mandatory
>>>>>>> orthographemic rules (i.e. setting what in PET internally is known
>>>>>>> as the inflr_todo list on lexical items).  i recall peter and i
>>>>>>> discussed the necessity of this feature (which is available in YY
>>>>>>> mode) several times and concluded it was maybe unneeded.  one
>>>>>>> could 'mimic' the intended effect in the feature structures of the
>>>>>>> rules, i.e. have a list (+RULES or so) on each token feature
>>>>>>> structure, where members in this list could be strings naming
>>>>>>> orthographemic rules.  to enforce the application of a specific
>>>>>>> chain of orthographemic rules, the grammar would have to
>>>>>>> (a) percolate the +RULES value on all lexical signs (lexical
>>>>>>> entries and lexical rules); (b) make each orthographemic rule
>>>>>>> require its own name to be the 'next' rule to be called for (e.g.
>>>>>>> the value of a path like ARGS.FIRST.+RULES.FIRST); (c) 'pop' the
>>>>>>> +RULES list upon application of an orthographemic rule, i.e.
>>>>>>> percolate up to the mother ARGS.FIRST.+RULES.REST; and (d) require
>>>>>>> an empty +RULES value on all arguments to syntactic rules.
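
The bookkeeping in steps (b) to (d) can be simulated outside the grammar. The sketch below is only a model of the list manipulation and makes no claim about how PET or any grammar implements it. The rule names are invented, and it assumes the first member of +RULES is the first rule to apply. Step (a) corresponds to every other lexical sign passing the list up unchanged.

```python
def apply_orthographemic_rule(rule_name, daughter_rules):
    """Steps (b) and (c): check the head of the list, then pop it.

    Returns the mother's +RULES value (the daughter's REST), or None if
    the rule is not the next one called for.
    """
    if not daughter_rules or daughter_rules[0] != rule_name:
        return None
    return daughter_rules[1:]


def licensed_for_syntax(rules):
    """Step (d): arguments to syntactic rules need an empty +RULES list."""
    return len(rules) == 0


def chain_is_licensed(token_rules, attempted_chain):
    """Try a chain of orthographemic rules on a token's +RULES list."""
    rules = list(token_rules)
    for rule_name in attempted_chain:
        rules = apply_orthographemic_rule(rule_name, rules)
        if rules is None:
            return False
    return licensed_for_syntax(rules)


# Hypothetical rule names supplied by an external morphological analyzer.
token_rules = ["stem_change_orule", "third_plural_orule"]
print(chain_is_licensed(token_rules, ["stem_change_orule", "third_plural_orule"]))
print(chain_is_licensed(token_rules, ["third_plural_orule", "stem_change_orule"]))
print(chain_is_licensed(token_rules, ["stem_change_orule"]))
```

Only the first call succeeds: the second applies the rules in the wrong order, and the third leaves a non-empty list, so the sign could not become an argument of a syntactic rule.
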
>>>>>>>
>>>>>>> i am not quite sure i would actually recommend the above approach
>>>>>>> to anyone.  one issue i see with it just now reflects recent
>>>>>>> discussion i had with dan and others about extending our notion of
>>>>>>> a derivation, to actually record additional information about the
>>>>>>> string-level effects upon application of each orthographemic rule
>>>>>>> (so as to be able to recover the corresponding surface form at the
>>>>>>> rule daughter and mother, e.g. if one were to accommodate
>>>>>>> tokenization conventions that split off punctuation marks).  in
>>>>>>> the approach sketched above, this information would not be
>>>>>>> available---whereas it could be when PET (like the LKB) remains in
>>>>>>> full control of the application of orthographemic rules.  however,
>>>>>>> when working with an external morphological analyzer (as is the
>>>>>>> case for the SRG), one would still need more information than is
>>>>>>> currently supported in YY.  for example, one could imagine
>>>>>>> extending the FSC 'edge' element with an 'analysis' element quite
>>>>>>> similar to the one in SPPP.  this is an area for revision that i
>>>>>>> would like to discuss with peter after the holidays.
>>>>>>>
>>>>>>> --- in summary, your (assumed) desire to improve integration of
>>>>>>> PET and FL (without requiring the assistance of [incr tsdb()]) has
>>>>>>> prompted me to recall some remaining open questions in the token
>>>>>>> lattice input design for PET, particularly in connection with an
>>>>>>> external morphological analyzer.  i plan on returning to these
>>>>>>> jointly with peter before too long.
>>>>>>>
>>>>>>> in the meantime, i suspect you might actually be better served
>>>>>>> using YY input format for PET (which, after all, is what
>>>>>>> [incr tsdb()] converts to from SPPP inputs).  fortunately, the
>>>>>>> specific input format used (YY or FSC) is independent of the use
>>>>>>> of chart mapping.  that is, using either format, as soon as the
>>>>>>> initial lattice of token feature structures is created in PET,
>>>>>>> everything else remains the same.  thus, if you were looking to
>>>>>>> utilize chart mapping to improve your treatment of numbers, dates,
>>>>>>> or other named entities that can be recognized in terms of regular
>>>>>>> expressions, you could do so by adding a set of token mapping
>>>>>>> rules to the SRG (much like we have in the ERG).  that would
>>>>>>> remain valid no matter what revisions to FSC (and possibly YY)
>>>>>>> might be down the road :-).
>>>>>>>
>>>>>>> i hope there may be some useful information in this partly
>>>>>>> self-serving message to you!
>>>>>>>
>>>>>>> best, oe
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 7, 2011 at 18:46, Montserrat Marimon
>>>>>>> <montserrat.marimon at ub.edu>    wrote:
>>>>>>>>
>>>>>>>> Hi everybody,
>>>>>>>>
>>>>>>>> Since the SRG is the only grammar which integrates a tagger using
>>>>>>>> SPPP,
>>>>>>>> we've decided to use chart parsing to integrate it.
>>>>>>>> Is there any document we could read?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Montserrat Marimon
>>>>>>>> Departament de Lingüística General
>>>>>>>> Facultat de Filologia - Universitat de Barcelona
>>>>>>>> Edifici Josep Carner, 5a planta
>>>>>>>> Gran Via de les Corts Catalanes, 585
>>>>>>>> 08007 BARCELONA
>>>>>>>> tel.: + 34 93 4034695
>>>>>>>>
>>>>>>>> http://stel.ub.edu/linguistica-ub/
>>>>>>>>
>>>>>>>>