[developers] Using Chart Parsing to integrate FreeLing

Stephan Oepen oe at ifi.uio.no
Fri Jul 15 21:02:38 CEST 2011


hi montse,

i'm not sure i understand exactly what you're planning to do here, but
i see that dan sent you a link to the general chart mapping machinery,
and i noticed the new ChartMap.cc in FreeLing, which appears to
output (more or less, i would guess) the same information as
LKBAnalyzer.cc, but in the FSC format rather than SPPP.
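
for concreteness, a minimal FSC document might look something like the
following (a sketch only: the spanish tokens and character offsets are
made up, and the feature names +FORM, +FROM, and +TO as well as the
'token' type follow ERG-style conventions; they would have to match
whatever token type the SRG declares):

  <?xml version="1.0" encoding="UTF-8"?>
  <fsc version="1.0">
    <chart id="s1">
      <text>El gato duerme</text>
      <lattice init="v0" final="v3">
        <edge source="v0" target="v1">
          <fs type="token">
            <f name="+FORM"><str>El</str></f>
            <f name="+FROM"><str>0</str></f>
            <f name="+TO"><str>2</str></f>
          </fs>
        </edge>
        <edge source="v1" target="v2">
          <fs type="token">
            <f name="+FORM"><str>gato</str></f>
            <f name="+FROM"><str>3</str></f>
            <f name="+TO"><str>7</str></f>
          </fs>
        </edge>
        <edge source="v2" target="v3">
          <fs type="token">
            <f name="+FORM"><str>duerme</str></f>
            <f name="+FROM"><str>8</str></f>
            <f name="+TO"><str>14</str></f>
          </fs>
        </edge>
      </lattice>
    </chart>
  </fsc>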

for FL integration with the LKB, SPPP is currently your only (supported)
option, i.e. the only one that does not require you to provide your own
Lisp code to call out to the tagger and interpret its results (this is
what Jacy still does for ChaSen, but mostly because that interface was
built prior to SPPP).

so i am assuming you want to improve integration with PET here?  as
things currently stand, you can only use FL with PET in connection with
[incr tsdb()], which will then invoke FL through the SPPP interface and
reformat its results into a form suitable for input to PET.  there are
currently two such formats
(that are officially supported): YY and FSC.  YY is equivalent to SPPP
in what it can express, but using a more compact, non-XML syntax.  for
more details, please see:

  http://wiki.delph-in.net/moin/PetInput
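
to give a flavour of the YY syntax, a single token (modeled on the
example on that page; the values here are purely illustrative) could
look like:

  (42, 0, 1, <0:11>, 1, "Tokenization", 0, "null", "NNP" 0.7677 "NN" 0.2323)

i.e. a token identifier, start and end chart vertices, the character
span, lattice path membership, the surface form, the inflection
position, the list of orthographemic rules still to be applied (here
the dummy value "null"), and optional POS tags with confidence values.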

FSC is a more recent invention (by peter adolphs) that seeks to further
generalize what can be provided as input to PET, going all the way to a
lattice of (arbitrary) token feature structures.  however, as far as i
recall, in
FSC input mode there is currently no support for 'annotating' tokens with
information about mandatory orthographemic rules (i.e. setting what in
PET internally is known as the inflr_todo list on lexical items).  i recall
peter and i discussed the necessity of this feature (which is available in
YY mode) several times and concluded it might not be needed.  one
could 'mimic' the intended effect in the feature structures of the rules,
i.e. have a list (+RULES or so) on each token feature structure, where
members in this list could be strings naming orthographemic rules.  to
enforce the application of a specific chain of orthographemic rules, the
grammar would have to (a) percolate the +RULES value on all lexical
signs (lexical entries and lexical rules); (b) make each orthographemic
rule require its own name as the 'next' rule to be applied (e.g. the
value of a path like ARGS.FIRST.+RULES.FIRST); (c) 'pop' the +RULES
list upon application of an orthographemic rule, i.e. percolate up to the
mother ARGS.FIRST.+RULES.REST; and (d) require an empty +RULES
value on all arguments to syntactic rules.
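
purely as an illustration of (a) through (d), a TDL sketch might look
like the following; note that every type and feature name here is
invented, and a real implementation would have to be fitted into the
SRG's type hierarchy:

  ;;; hypothetical sketch only: all type and feature names are invented.
  ;;; (a) lexical rules percolate the +RULES value unchanged.
  lex-rule := lexical-sign &
    [ +RULES #rules,
      ARGS < [ +RULES #rules ] > ].

  ;;; (b) an orthographemic rule requires its own name as the 'next'
  ;;; element on its daughter's +RULES list, and (c) 'pops' that element,
  ;;; percolating only the REST of the list up to the mother.
  plur-orth-rule := orth-rule &
    [ +RULES #rest,
      ARGS < [ +RULES [ FIRST "plur-orth-rule",
                        REST #rest ] ] > ].

  ;;; (d) syntactic rules require an empty +RULES list on all arguments.
  head-complement-rule := syn-rule &
    [ ARGS < [ +RULES *null* ], [ +RULES *null* ] > ].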

i am not quite sure i would actually recommend the above approach to
anyone.  one issue i see with it just now reflects a recent discussion
i had with dan and others about extending our notion of a derivation, to
actually record additional information about the string-level effects upon
application of each orthographemic rule (so as to be able to recover the
corresponding surface form at the rule daughter and mother, e.g. if one
were to accommodate tokenization conventions that split off punctuation
marks).  in the approach sketched above, this information would not
be available---whereas it could be when PET (like the LKB) remains in
full control of the application of orthographemic rules.  however, when
working with an external morphological analyzer (as is the case for
the SRG), one would still need more information than is currently
supported in YY.  for example, one could imagine extending the FSC
'edge' element with an 'analysis' element quite similar to the one in
SPPP.  this is an area for revision that i would like to discuss with
peter after the holidays.
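
just to make that thought a little more concrete: such an extension
might embed the morphological analysis directly in the FSC 'edge', say
along the following lines.  to be clear, no such element exists in FSC
today, and the attribute names below are pure invention:

  <edge source="v1" target="v2">
    <fs type="token">
      <f name="+FORM"><str>gatos</str></f>
    </fs>
    <!-- hypothetical extension, loosely modeled on the SPPP element -->
    <analysis stem="gato" rule="plur_noun_orule"/>
  </edge>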

--- in summary, your (assumed) desire to improve integration of PET
and FL (without requiring the assistance of [incr tsdb()]) has prompted
me to recall some remaining open questions in the token lattice input
design for PET, particularly in connection with an external morphological
analyzer.  i plan on returning to these jointly with peter before too long.

in the meantime, i suspect you might actually be better served using the
YY input format for PET (which, after all, is what [incr tsdb()] converts
the SPPP results into).  fortunately, the specific input format used (YY
or FSC) is
independent of the use of chart mapping.  that is, using either format, as
soon as the initial lattice of token feature structures is created in PET,
everything else remains the same.  thus, if you were looking to utilize
chart mapping to improve your treatment of numbers, dates, or other
named entities that can be recognized in terms of regular expressions,
you could do so by adding a set of token mapping rules to the SRG
(much like we have in the ERG).  that would remain valid no matter
what revisions to FSC (and possibly YY) might come down the road :-).
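
to give a rough idea of the shape of such a rule, here is a loose
sketch modeled on the ERG's token mapping rules; the type, feature, and
class names are illustrative and would need to be adapted to the SRG:

  ;;; hypothetical token mapping rule for decimal numbers (spanish uses
  ;;; a comma as the decimal separator); all names are illustrative.
  spanish_decimal_tmr := token_mapping_rule &
    [ +CONTEXT < >,
      +INPUT < [ +FORM "^[0-9]+(,[0-9]+)?$" ] >,
      +OUTPUT < [ +CLASS card_ne ] > ].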

i hope there may be some useful information for you in this partly
self-serving message!

best, oe



On Thu, Jul 7, 2011 at 18:46, Montserrat Marimon
<montserrat.marimon at ub.edu> wrote:
> Hi everybody,
>
> Since the SRG is the only grammar which integrates a tagger using SPPP,
> we've decided to use chart parsing to integrate it.
> Is there any document we could read?
>
> Thanks,
>
> --
> Montserrat Marimon
> Departament de Lingüística General
> Facultat de Filologia - Universitat de Barcelona
> Edifici Josep Carner, 5a planta
> Gran Via de les Corts Catalanes, 585
> 08007 BARCELONA
> tel.: + 34 93 4034695
>
> http://stel.ub.edu/linguistica-ub/
>
>



