[developers] Malformed RMRS XML output from an ugly but valid PIC:

Stephan Oepen oe at ifi.uio.no
Tue Nov 17 17:31:19 CET 2009


hi andrew,

it looks as if the input layer, grammar, and parser all succeed in
preserving the PRED (of the abstract form "‘foo_bar’").  so my
guess is that the MRS-to-RMRS conversion (arguably wrongly)
interprets the underscore as the field separator for structured
‘real’ predicates, and then the XML output layer fails to escape
the remaining single quote.  both would most likely be easiest for ann
to fix.
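
to make the two suspected failure points concrete, here is a minimal python sketch.  it is not the actual MRS-to-RMRS converter or XML writer: the predicate shape ("_" + surface + "_jj"), `naive_split`, and `realpred` are assumptions made up for illustration.  only `quoteattr` (from the python standard library) is a real API.

```python
from xml.sax.saxutils import quoteattr

# surface string taken from the PIC <w> element in the quoted message
SURFACE = "`/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2'"

def naive_split(pred):
    """Assumed converter behaviour: every '_' separates lemma, pos, sense."""
    fields = pred.lstrip("_").split("_")
    lemma = fields[0]
    pos = fields[1] if len(fields) > 1 else None
    sense = fields[2] if len(fields) > 2 else None
    return lemma, pos, sense

def realpred(lemma, pos, sense, escape):
    """Serialize a <realpred/> element, with or without attribute escaping."""
    fmt = quoteattr if escape else (lambda v: "'" + v + "'")
    return "<realpred lemma=%s pos=%s sense=%s/>" % (fmt(lemma), fmt(pos), fmt(sense))

pred = "_" + SURFACE + "_jj"  # assumed predicate shape, not taken from the ERG
lemma, pos, sense = naive_split(pred)
broken = realpred(lemma, pos, sense, escape=False)  # quote left unescaped
fixed = realpred(lemma, pos, sense, escape=True)    # quoteattr picks safe quoting
print(broken)
print(fixed)
```

under these assumptions the split reproduces the lemma/pos/sense values from the reported output, and the unescaped variant is not well-formed XML, while the escaped one parses.  escaping alone would still leave the wrong field split in place.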

more high-level, this should in principle be recognized as a file name  
NE (surrounded by quotes), and then i think both problems would go  
away, as the underscore would be part of the CARG parameter of a  
‘named’ relation, and the quotes would not be part of the name  
proper.
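
as a rough illustration of the NE idea (in preprocessing, say), the following python sketch recognizes a quoted absolute path and keeps only the name proper.  the pattern `FILE_NE` and the helper `file_name_carg` are hypothetical; in the ERG this matching would be done by token mapping rules, which are not shown here.

```python
import re

# hypothetical pattern: optional opening quote (` or '), an absolute
# unix-style path without whitespace or quotes, optional closing quote
FILE_NE = re.compile(r"^[`']?(?P<name>/[^\s`']+)'?$")

def file_name_carg(surface):
    """Return the path without surrounding quotes, or None if no match."""
    match = FILE_NE.match(surface)
    return match.group("name") if match else None

carg = file_name_carg("`/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2'")
print(carg)
```

the underscore stays inside the returned string, which is the value that would end up as the CARG of a ‘named’ relation, and the quotes are dropped.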

i would recommend you convert from using PIC to FSC (or YY, if your  
input annotations are limited to tags) and try again.  i'm unsure  
about the interactions of PIC and token mapping (which would have to  
do the NE matching, unless you do that in preprocessing), and i  
believe most people (using the ERG) have now completed the transition  
to YY or FSC, so it will be easier getting help.

quite generally, i'm expecting to actively maintain the
(tokenization and) token mapping component in the ERG, hence i'm
interested in examples and challenges from additional data sets.

best, oe




On 17. nov. 2009, at 16.15, Andrew MacKinlay <admackin at gmail.com> wrote:

> Using the chart mapping Pet, I've got a particular PIC which has a  
> whole lot of problems with it but is nonetheless valid input AFAICT.
>
> In the output RMRS, I get the following:
>
> <ep cfrom='83' cto='130'><realpred lemma='`/usr/portage/distfiles/cdemu-0.6'
>   pos='beta.tar.bz2'' sense='jj'/><label vid='7'/><anchor vid='10040'/>
>   <var sort='e' vid='89' sf='prop'/></ep>
>
> As you can see, the POS and sense are incorrect and the POS includes  
> an unescaped quote that will break XML parsers.
>
> The corresponding 'w' element in the PIC is:
>    <w cend="130" constant="no" cstart="83" id="W013">
>        <surface>`/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2'</surface>
>        <pos prio="1.0" tag="JJ"/>
>    </w>
>
> Any thoughts as to why this is occurring? Looks like some  
> combination of the '_' and the quotes in the surface string to me.
>
> I've attached the full RMRS and PIC XML if anyone is interested.
>
> Thanks,
> Andy.
>
> <broken-rmrs-pic.xml>
> <broken-rmrs.xml>



