[developers] Using Chart Parsing to integrate FreeLing

Tue Jul 19 04:10:32 CEST 2011

Yes, it works, but should we move to FSC anyway, now that we know how to do
that? I thought you wanted to get rid of SPPP (are we the only ones using
SPPP?). We could use SPPP with the LKB and FSC with PET until there's support
for FSC in the LKB.

On Mon, Jul 18, 2011 at 12:25 PM, Lluís Padró <padro at lsi.upc.edu> wrote:

>
>   We tried that and it works.  It builds the right tokens based on the
> character-wise positions.
>
>   So, I guess we'll continue using SPPP, and wait for FSC support in LKB...
>
>        thank you very much
>
>            Lluis
>
>
>
> On 15/07/11 13:21, Stephan Oepen wrote:
>
>> yes, sorry: that's what i meant.  well, SPPP only 'positions' its
>> tokens in terms of character start and end indices (whereas in
>> my example below i was assuming inter-token vertices, as are
>> used in YY).  SPPP will create the actual lattice by setting up
>> vertices 'as appropriate' and arranging tokens relative to these
>> (see sppp-serialize-tokens() for the current implementation).
>>
>> so it *should* work to just have three 'token' elements, where "sin"
>> and "sin embargo" have the same 'from' value, and "embargo" and
>> "sin embargo" have the same 'to' value.  could you give this a
>> quick shot (and inspect the LKB SPPP internals and token
>> chart to see what the results are; e.g. print-tchart())?
>> nb: for the record, i consider the lack of explicit vertices in SPPP
>> a design flaw nowadays, seeing that we consider the input to
>> parsing a lattice of tokens.  hence, a moderately extended FSC
>> supported in both PET and the LKB seems like the right way
>> forward, in principle.
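
To make the vertex construction described above concrete, here is a minimal
Python sketch of how character spans could be arranged into a lattice. It is
only an illustration and not the actual sppp-serialize-tokens() code: the
function name spans_to_lattice, the tuple layout, and the end-exclusive
character offsets are all assumptions made for this example.

```python
from bisect import bisect_left


def spans_to_lattice(tokens):
    """Turn (start, end, form) character spans into (from, to, form) edges.

    Assumption for this sketch: one vertex per distinct token start offset,
    plus one final vertex. A token ends at the vertex of the first token
    start at or after its end offset, which skips intervening whitespace.
    """
    starts = sorted({start for start, _, _ in tokens})
    edges = []
    for start, end, form in tokens:
        source = bisect_left(starts, start)
        # bisect_left returns len(starts), the final vertex, when no token
        # starts at or after this token's end offset.
        target = bisect_left(starts, end)
        edges.append((source, target, form))
    return edges


# Character offsets in the string "sin embargo" (end offsets are exclusive).
tokens = [(0, 3, "sin"), (4, 11, "embargo"), (0, 11, "sin embargo")]
for edge in spans_to_lattice(tokens):
    print(edge)
```

Under these assumptions, "sin" and "sin embargo" share a 'from' vertex and
"embargo" and "sin embargo" share a 'to' vertex, which is the arrangement
the three 'token' elements are meant to produce.
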
>> On Fri, Jul 15, 2011 at 22:04, Lluís Padró<padro at lsi.upc.edu>  wrote:
>>
>>> I guess you mean
>>>
>>>  0, 1, "sin"
>>>  1, 2, "embargo"
>>>  0, 2, "sin embargo"
>>>
>>>  ... Then it looks like we have been working much more than was
>>> necessary...
>>>  :}
>>>
>>>  Could you provide more information on the right XML syntax to do that in
>>> SPPP?
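
As a quick check of what the three edges above express, the following Python
sketch enumerates every path through such a lattice. The function name
lattice_paths and the (from, to, form) tuple layout are assumptions made for
this illustration only, and the sketch assumes an acyclic lattice in which
every edge moves forward.

```python
def lattice_paths(edges, start, goal):
    """Yield each sequence of token forms leading from start to goal."""
    if start == goal:
        yield []
        return
    for source, target, form in edges:
        if source == start:
            for rest in lattice_paths(edges, target, goal):
                yield [form] + rest


edges = [(0, 1, "sin"), (1, 2, "embargo"), (0, 2, "sin embargo")]
for path in lattice_paths(edges, 0, 2):
    print(path)
```

The two paths correspond to the two tokenizations: the two-token reading
"sin" + "embargo" and the single multiword token "sin embargo".
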
>>>
>>>
>>> On 15/07/11 12:59, Stephan Oepen wrote:
>>>
>>>> hmm, both YY and SPPP support actual lattices, so why should it not work
>>>> to have an input like the following?
>>>>
>>>>  0, 1, "sin"
>>>>  1, 2, "embargo"
>>>>  0, 1, "sin embargo"
>>>>
>>>> in fact, even the current LKB should support this; does it not?
>>>>
>>>> cheers, oe
>>>>
>>>>
>>>> On 15. juli 2011, at 21.26, Lluís Padró<padro at lsi.upc.edu>  wrote:
>>>>
>>>>>  Hi Stephan
>>>>>
>>>>>  Our motivation is not to improve integration with PET, but to be able
>>>>> to feed ambiguous tokenization into the SRG, e.g. expressing that the
>>>>> multiword expression "sin_embargo" may be either an actual multiword
>>>>> (one single token) or two separate words (hence two tokens,
>>>>> "sin" + "embargo").
>>>>>
>>>>>  As far as we understand, SPPP (or YY) is not capable of representing
>>>>> this kind of ambiguity, while FSC is.
>>>>>
>>>>>  We do not use token mapping rules.  We tried, but realized they are
>>>>> not what we need.
>>>>>  We use the chart mapping machinery only because it supports FSC input.
>>>>> The FreeLing interface (chartMap.cc) will take care of all token
>>>>> management and produce the final lattice that has to be loaded into the
>>>>> grammar, so there is no need for chart mapping rules.
>>>>>
>>>>>  For the integration of the morphological information coming from
>>>>> FreeLing, we wrote some lexical rules that do the same work as the
>>>>> orthographemic rules in SPPP, and load the morphological information
>>>>> from the PoS tag into the FS.
>>>>>
>>>>>  So far, it seems to work with cheap.  The next step will be testing it
>>>>> with [incr tsdb()], which we assume should also work.
>>>>>
>>>>>  In summary, if LKB accepted FSC input format (even with no mapping
>>>>> rules), we could forget about SPPP.
>>>>>  Meanwhile, we will keep both interfaces.
>>>>>
>>>>>    Thank you
>>>>>
>>>>>           Montse & Lluis
>>>>>
>>>>>
>>>>> On 15/07/11 12:02, Stephan Oepen wrote:
>>>>>
>>>>>> hi montse,
>>>>>>
>>>>>> i'm not sure i understand exactly what you're planning to do here, but
>>>>>> i see that dan sent you a link to the general chart mapping machinery,
>>>>>> and i noticed the new ChartMap.cc in FreeLing, which appears to
>>>>>> output (more or less, i would guess) the same information as
>>>>>> LKBAnalyzer.cc, but in the FSC format rather than SPPP.
>>>>>>
>>>>>> for FL integration with the LKB, SPPP currently is your only (supported)
>>>>>> option, i.e. not requiring you to provide your own Lisp code to call out
>>>>>> to the tagger and interpret its results (this is what Jacy still does for
>>>>>> ChaSen, but mostly because that interface was built prior to SPPP).
>>>>>>
>>>>>> so i am assuming you want to improve integration with PET here?  as it
>>>>>> is currently, you can only use PET in connection with [incr tsdb()],
>>>>>> which will then invoke FL through the SPPP interface and reformat its
>>>>>> result in a form suitable for input to PET.  there are currently two such
>>>>>> formats (that are officially supported): YY and FSC.  YY is equivalent to
>>>>>> SPPP in what it can express, but using a more compact, non-XML syntax.
>>>>>> for more details, please see:
>>>>>>
>>>>>>  http://wiki.delph-in.net/moin/PetInput
>>>>>>
>>>>>> FSC is a more recent invention (by peter adolphs) that seeks to further
>>>>>> generalize what can be provided as input to PET, going all the way to a
>>>>>> lattice of (arbitrary) token feature structures.  however, for all i
>>>>>> recall, in FSC input mode there is currently no support for 'annotating'
>>>>>> tokens with information about mandatory orthographemic rules (i.e.
>>>>>> setting what in PET internally is known as the inflr_todo list on lexical
>>>>>> items).  i recall peter and i discussed the necessity of this feature
>>>>>> (which is available in YY mode) several times and concluded it was maybe
>>>>>> unneeded.  one could 'mimic' the intended effect in the feature
>>>>>> structures of the rules, i.e. have a list (+RULES or so) on each token
>>>>>> feature structure, where members in this list could be strings naming
>>>>>> orthographemic rules.  to enforce the application of a specific chain of
>>>>>> orthographemic rules, the grammar would have to (a) percolate the +RULES
>>>>>> value on all lexical signs (lexical entries and lexical rules); (b) make
>>>>>> each orthographemic rule require its own name to be the 'next' rule to be
>>>>>> called for (e.g. the value of a path like ARGS.FIRST.+RULES.FIRST);
>>>>>> (c) 'pop' the +RULES list upon application of an orthographemic rule,
>>>>>> i.e. percolate up to the mother ARGS.FIRST.+RULES.REST; and (d) require
>>>>>> an empty +RULES value on all arguments to syntactic rules.
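
The list-popping idea in steps (a) to (d) above can be simulated outside any
grammar formalism. The Python sketch below is a toy model only, not TDL and
not PET or LKB code: the function names and the rule names "rule_a" and
"rule_b" are hypothetical, and a tuple of strings stands in for the +RULES
list.

```python
def apply_orthographemic_rule(rule_name, daughter_rules):
    """Return the mother's +RULES value, or None if the rule is blocked.

    (b) the rule only applies when its own name is first on the daughter's
        list; (c) the mother then carries the rest of the list.
    """
    if not daughter_rules or daughter_rules[0] != rule_name:
        return None
    return daughter_rules[1:]


def acceptable_to_syntax(rules):
    """(d) syntactic rules require an empty +RULES value on their arguments."""
    return len(rules) == 0


# A token annotated with a mandatory chain of two (hypothetical) rules.
pending = ("rule_a", "rule_b")

blocked = apply_orthographemic_rule("rule_b", pending)   # wrong order
after_a = apply_orthographemic_rule("rule_a", pending)
after_b = apply_orthographemic_rule("rule_b", after_a)

print(blocked, after_a, after_b)
print(acceptable_to_syntax(after_a), acceptable_to_syntax(after_b))
```

In this toy model a sign only becomes usable by syntactic rules once the
whole chain has been consumed in the stated order, which is the effect the
+RULES list is meant to enforce.
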
>>>>>>
>>>>>> i am not quite sure i would actually recommend the above approach to
>>>>>> anyone.  one issue i see with it just now reflects recent discussion i
>>>>>> had with dan and others about extending our notion of a derivation, to
>>>>>> actually record additional information about the string-level effects
>>>>>> upon application of each orthographemic rule (so as to be able to recover
>>>>>> the corresponding surface form at the rule daughter and mother, e.g. if
>>>>>> one were to accommodate tokenization conventions that split off
>>>>>> punctuation marks).  in the approach sketched above, this information
>>>>>> would not be available, whereas it could be when PET (like the LKB)
>>>>>> remains in full control of the application of orthographemic rules.
>>>>>> however, when working with an external morphological analyzer (as is the
>>>>>> case for the SRG), one would still need more information than is
>>>>>> currently supported in YY.  for example, one could imagine extending the
>>>>>> FSC 'edge' element with an 'analysis' element quite similar to the one in
>>>>>> SPPP.  this is an area for revision that i would like to discuss with
>>>>>> peter over the holidays.
>>>>>>
>>>>>> --- in summary, your (assumed) desire to improve integration of PET
>>>>>> and FL (without requiring the assistance of [incr tsdb()]) has prompted
>>>>>> me to recall some remaining open questions in the token lattice input
>>>>>> design for PET, particularly in connection with an external morphological
>>>>>> analyzer.  i plan on returning to these jointly with peter before too
>>>>>> long.
>>>>>>
>>>>>> in the meantime, i suspect you might actually be better served using YY
>>>>>> input format for PET (which, after all, is what [incr tsdb()] converts to
>>>>>> from SPPP inputs).  fortunately, the specific input format used (YY or
>>>>>> FSC) is independent of the use of chart mapping.  that is, using either
>>>>>> format, as soon as the initial lattice of token feature structures is
>>>>>> created in PET, everything else remains the same.  thus, if you were
>>>>>> looking to utilize chart mapping to improve your treatment of numbers,
>>>>>> dates, or other named entities that can be recognized in terms of regular
>>>>>> expressions, you could do so by adding a set of token mapping rules to
>>>>>> the SRG (much like we have in the ERG).  that would remain valid no
>>>>>> matter what revisions to FSC (and possibly YY) might be down the road :-).
>>>>>>
>>>>>> i hope there may be some useful information in this partly self-serving
>>>>>> message to you!
>>>>>>
>>>>>> best, oe
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 7, 2011 at 18:46, Montserrat Marimon
>>>>>> <montserrat.marimon at ub.edu>    wrote:
>>>>>>
>>>>>>> Hi everybody,
>>>>>>>
>>>>>>> Since the SRG is the only grammar which integrates a tagger using SPPP,
>>>>>>> we've decided to use chart parsing to integrate it.
>>>>>>> Is there any document we could read?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> --
>>>>>>> Montserrat Marimon
>>>>>>> Departament de Lingüística General
>>>>>>> Facultat de Filologia - Universitat de Barcelona
>>>>>>> Edifici Josep Carner, 5a planta
>>>>>>> Gran Via de les Corts Catalanes, 585
>>>>>>> 08007 BARCELONA
>>>>>>> tel.: + 34 93 4034695
>>>>>>>
>>>>>>> http://stel.ub.edu/linguistica-ub/
>>>>>>>
>>>>>>>
>>>>>>>
>>>
>

-- 
Montserrat Marimon
Departament de Lingüística General
Facultat de Filologia - Universitat de Barcelona
Edifici Josep Carner, 5a planta
Gran Via de les Corts Catalanes, 585
08007 BARCELONA
tel.: + 34 93 4034695

http://stel.ub.edu/linguistica-ub/