[developers] Malformed RMRS XML output from an ugly but valid PIC:
Andrew MacKinlay
admackin at gmail.com
Tue Nov 17 17:50:14 CET 2009
On 17/11/2009, at 4:31 PM, Stephan Oepen wrote:
>
> it looks as if the input layer, grammar, and parser all succeed in
> preserving the PRED (of the abstract form "‘foo_bar’"). so my guess
> is that the MRS-to-RMRS conversion (arguably wrongly) interprets
> the underscore as the field separator for structured ‘real’
> predicates, and then the XML output layer fails to escape the
> remaining single quote. both would most likely be easiest for ann
> to fix.
>
> more high-level, this should in principle be recognized as a file
> name NE (surrounded by quotes), and then i think both problems would
> go away, as the underscore would be part of the CARG parameter of a
> ‘named’ relation, and the quotes would not be part of the name proper.
Ah, I didn't realise it was supposed to recognise file name NEs -
that's certainly something that might be handy, so I'll definitely
look into taking more advantage of some of those features.
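
To make the diagnosis above concrete, here is a minimal Python sketch of
how a naive split on every underscore would produce exactly the fields I
am seeing, and how escaping attribute values avoids the broken XML. The
predicate string and both function names are my own assumptions for
illustration; this is not the actual MRS-to-RMRS conversion code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def naive_split(pred):
    """Split a '_lemma_pos_sense_rel'-style predicate on every underscore."""
    body = pred[1:] if pred.startswith("_") else pred
    if body.endswith("_rel"):
        body = body[: -len("_rel")]
    fields = body.split("_")
    sense = fields[2] if len(fields) > 2 else None
    return fields[0], fields[1], sense

def realpred_xml(lemma, pos, sense=None):
    """Serialise a realpred element, letting quoteattr pick safe quoting."""
    attrs = [("lemma", lemma), ("pos", pos)]
    if sense is not None:
        attrs.append(("sense", sense))
    return "<realpred " + " ".join(f"{k}={quoteattr(v)}" for k, v in attrs) + "/>"

# Assumed shape of the predicate the grammar produced for the quoted path.
PRED = "_`/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2'_jj_rel"

lemma, pos, sense = naive_split(PRED)  # the underscore inside the path is split too
escaped = realpred_xml(lemma, pos, sense)
ET.fromstring(escaped)  # parses, because the single quote is now safely wrapped
```

With the file name recognised as an NE instead, neither step would see
the underscore or the quotes in the first place.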
>
> i would recommend you convert from using PIC to FSC (or YY, if your
> input annotations are limited to tags) and try again. i'm unsure
> about the interactions of PIC and token mapping (which would have to
> do the NE matching, unless you do that in preprocessing), and i
> believe most people (using the ERG) have now completed the
> transition to YY or FSC, so it will be easier getting help.
Yes, I've had the suggestion to move to YY before, and I avoided it
because PIC seemed to be doing the job, and I was under time pressure
etc. Additionally, I think if I'm changing formats I'd like to move to
the more general FSC format (although right now I'm only using POS
tags), but I can't seem to locate any documentation on that. In any
case, I guess the YY format is simple enough that it should be easy
enough to convert to it, so I'll probably give that a go if FSC is
still a little too bleeding edge to have documentation online.
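
On the option of doing the NE matching in preprocessing, something along
these lines is what I have in mind: a rough Python sketch that reads the
'w' element from my PIC and checks whether the surface form is a quoted
absolute path. The regular expression and the function name are
hypothetical, and this is not how the ERG token mapping rules do it.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: an absolute Unix path wrapped in `...', '...' or "...".
QUOTED_PATH = re.compile(r"""^[`'"](/[^\s`'"]+)['"]$""")

def file_name_ne(surface):
    """Return the bare path if the surface is a quoted file name, else None."""
    match = QUOTED_PATH.match(surface.strip())
    return match.group(1) if match else None

PIC_W = """<w cend="130" constant="no" cstart="83" id="W013">
  <surface>`/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2'</surface>
  <pos prio="1.0" tag="JJ"/>
</w>"""

w = ET.fromstring(PIC_W)
surface = w.findtext("surface")
name = file_name_ne(surface)  # the path without the surrounding quotes
```

The bare path could then be passed on as the name of an NE token, with
the quotes kept out of it.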
>
> quite generally, i'm expecting to actively maintain the
> (tokenization and) token mapping component in the ERG, hence i'm
> interested in examples and challenges from additional data sets.
>
>
Great, I'll keep that in mind and let you know what else I run up
against.
Thanks for the tips.
cheers,
Andy
>
> On 17. nov. 2009, at 16.15, Andrew MacKinlay <admackin at gmail.com>
> wrote:
>
>> Using the chart mapping Pet, I've got a particular PIC which has a
>> whole lot of problems with it but is nonetheless valid input AFAICT.
>>
>> In the output RMRS, I get the following:
>>
>> <ep cfrom='83' cto='130'><realpred lemma='`/usr/portage/distfiles/
>> cdemu-0.6' pos='beta.tar.bz2'' sense='jj'/><label vid='7'/><anchor
>> vid='10040'/><var sort='e' vid='89' sf='prop'/></ep>
>>
>> As you can see, the POS and sense are incorrect and the POS
>> includes an unescaped quote that will break XML parsers.
>>
>> The corresponding 'w' element in the PIC is:
>> <w cend="130" constant="no" cstart="83" id="W013">
>> <surface>`/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2'</surface>
>> <pos prio="1.0" tag="JJ"/>
>> </w>
>>
>> Any thoughts as to why this is occurring? Looks like some
>> combination of the '_' and the quotes in the surface string to me.
>>
>> I've attached the full RMRS and PIC XML if anyone is interested.
>>
>> Thanks,
>> Andy.
>>
>> <broken-rmrs-pic.xml>
>> <broken-rmrs.xml>