[developers] Malformed RMRS XML output from an ugly but valid PIC:
oe at ifi.uio.no
Tue Nov 17 17:31:19 CET 2009
it looks as if the input layer, grammar, and parser all succeed in
preserving the PRED (of the abstract form "‘foo_bar’"). so my
guess is that the MRS-to-RMRS conversion (arguably wrongly)
interprets the underscore as the field separator for structured
‘real’ predicates, and then the XML output layer fails to escape
the remaining single quote. both would most likely be easiest for ann
more high-level, this should in principle be recognized as a file name
NE (surrounded by quotes), and then i think both problems would go
away, as the underscore would be part of the CARG parameter of a
‘named’ relation, and the quotes would not be part of the name
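
A minimal sketch of the two suspected failure steps, for illustration only. It does not reproduce the actual MRS-to-RMRS converter or the XML output layer: the predicate string `PRED`, the helper names, and the "split on every underscore" rule are all assumptions chosen so that the result matches the lemma, POS, and sense seen in the reported output. The last function shows one way an output layer could quote attribute values safely, using Python's standard `xml.sax.saxutils.quoteattr`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical predicate string; the real PRED is not shown in the thread.
PRED = "_`/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2'_jj_rel"


def naive_split(pred):
    """Split a '_lemma_pos_sense_rel' predicate on every underscore.

    This is the assumed (buggy) behaviour: an underscore inside the
    lemma is treated as a field separator.
    """
    body = pred[1:] if pred.startswith("_") else pred
    if body.endswith("_rel"):
        body = body[: -len("_rel")]
    fields = body.split("_")
    lemma = fields[0]
    pos = fields[1] if len(fields) > 1 else ""
    sense = fields[2] if len(fields) > 2 else ""
    return lemma, pos, sense


def unescaped_realpred(lemma, pos, sense):
    """Build the element by pasting values between single quotes (unsafe)."""
    return "<realpred lemma='%s' pos='%s' sense='%s'/>" % (lemma, pos, sense)


def escaped_realpred(lemma, pos, sense):
    """Build the element with quoteattr, which picks safe quoting itself."""
    return "<realpred lemma=%s pos=%s sense=%s/>" % (
        quoteattr(lemma), quoteattr(pos), quoteattr(sense))


def is_well_formed(xml_text):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

With these definitions, `naive_split(PRED)` yields the same three fields as in the report, `unescaped_realpred(...)` produces a string that an XML parser rejects, and `escaped_realpred(...)` produces one that parses and round-trips the trailing quote in the POS value.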
i would recommend you convert from using PIC to FSC (or YY, if your
input annotations are limited to tags) and try again. i'm unsure
about the interactions of PIC and token mapping (which would have to
do the NE matching, unless you do that in preprocessing), and i
believe most people (using the ERG) have now completed the transition
to YY or FSC, so it will be easier getting help.
quite generally, i'm expecting to actively maintain the
(tokenization and) token mapping component in the ERG, hence i'm
interested in examples and challenges from additional data sets.
On 17. nov. 2009, at 16.15, Andrew MacKinlay <admackin at gmail.com> wrote:
> Using the chart mapping Pet, I've got a particular PIC which has a
> whole lot of problems with it but is nonetheless valid input AFAICT.
> In the output RMRS, I get the following:
> <ep cfrom='83' cto='130'><realpred lemma='`/usr/portage/distfiles/
> cdemu-0.6' pos='beta.tar.bz2'' sense='jj'/><label vid='7'/><anchor
> vid='10040'/><var sort='e' vid='89' sf='prop'/></ep>
> As you can see, the POS and sense are incorrect and the POS includes
> an unescaped quote that will break XML parsers.
> The corresponding 'w' element in the PIC is:
> <w cend="130" constant="no" cstart="83" id="W013">
> <pos prio="1.0" tag="JJ"/>
> Any thoughts as to why this is occurring? Looks like some
> combination of the '_' and the quotes in the surface string to me.
> I've attached the full RMRS and PIC XML if anyone is interested.