[developers] Malformed RMRS XML output from an ugly but valid PIC:

Andrew MacKinlay admackin at gmail.com
Tue Nov 17 17:50:14 CET 2009


On 17/11/2009, at 4:31 PM, Stephan Oepen wrote:

>
> it looks as if the input layer, grammar, and parser all succeed in  
> preserving the PRED (of the abstract form "‘foo_bar’").  so my guess  
> is that the MRS-to-RMRS conversion (arguably wrongly) interprets  
> the underscore as the field separator for structured ‘real’  
> predicates, and then the XML output layer fails to escape the  
> remaining single quote.  both would most likely be easiest for ann  
> to fix.
>
> more high-level, this should in principle be recognized as a file  
> name NE (surrounded by quotes), and then i think both problems would  
> go away, as the underscore would be part of the CARG parameter of a  
> ‘named’ relation, and the quotes would not be part of the name proper.

Ah, I didn't realise it was supposed to recognise file name NEs.  
That's certainly something that might be handy, so I'll definitely  
look into taking more advantage of some of those features.
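
To illustrate the diagnosis quoted above, here is a minimal Python sketch. It is not the actual MRS-to-RMRS converter code. It assumes a hypothetical predicate string of the form "_lemma_pos" built from the surface token, and a hypothetical helper naive_split that splits on every underscore. Under those assumptions the extra underscore in the file name shifts the fields exactly as in the reported output, and the trailing quote lands in an attribute value. The second half shows one way an output layer could quote attribute values safely, using quoteattr from the Python standard library.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical predicate string: the surface form wrapped as _lemma_pos.
# The real internal representation may differ; this is an assumption.
SURFACE = "`/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2'"
PRED = "_" + SURFACE + "_jj"


def naive_split(pred):
    """Split on every underscore, as the converter is suspected to do."""
    fields = pred[1:].split("_")
    lemma, pos = fields[0], fields[1]
    sense = fields[2] if len(fields) > 2 else None
    return lemma, pos, sense


def is_well_formed(xml_text):
    """Return True if xml_text parses as a single XML element."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


lemma, pos, sense = naive_split(PRED)

# Attribute values dropped into single quotes without escaping,
# which reproduces the shape of the reported <realpred> element.
broken = f"<realpred lemma='{lemma}' pos='{pos}' sense='{sense}'/>"

# quoteattr chooses a safe quote character and escapes &, < and >.
fixed = (
    f"<realpred lemma={quoteattr(lemma)} "
    f"pos={quoteattr(pos)} sense={quoteattr(sense)}/>"
)

print(naive_split(PRED))
print(is_well_formed(broken), is_well_formed(fixed))
```

Note that escaping only repairs the XML; the lemma, POS and sense are still wrong until the underscore handling itself is fixed (or the token is recognised as a file name NE, as suggested above).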


>
> i would recommend you convert from using PIC to FSC (or YY, if your  
> input annotations are limited to tags) and try again.  i'm unsure  
> about the interactions of PIC and token mapping (which would have to  
> do the NE matching, unless you do that in preprocessing), and i  
> believe most people (using the ERG) have now completed the  
> transition to YY or FSC, so it will be easier getting help.

Yes, I've had the suggestion to move to YY before, and I avoided it  
because PIC seemed to be doing the job and I was under time pressure.  
Additionally, if I'm changing formats I'd like to move to the more  
general FSC format (although right now I'm only using POS tags), but  
I can't seem to locate any documentation on that. In any case, I  
guess the YY format is simple enough that it should be easy to  
convert to, so I'll probably give that a go if FSC is still a little  
too bleeding edge to have documentation online.
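
For reference, a rough Python sketch of what converting a single PIC 'w' element to a YY token might look like. The YY field layout used here (id, start, end, <cfrom:cto>, path, "form", ipos, "lrule", then "tag" probability pairs) is an assumption based on my recollection of the PET input documentation and should be checked against it before use. The function names pic_w_to_yy and yy_quote, the backslash escaping of double quotes, and the vertex numbers passed in are all hypothetical.

```python
import xml.etree.ElementTree as ET

# The 'w' element from the PIC input quoted further down in this thread.
PIC_W = """<w cend="130" constant="no" cstart="83" id="W013">
  <surface>`/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2'</surface>
  <pos prio="1.0" tag="JJ"/>
</w>"""


def yy_quote(text):
    """Wrap text in double quotes, backslash-escaping \\ and " (assumed convention)."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def pic_w_to_yy(w, token_id, start, end):
    """Render one PIC <w> element as a YY token (assumed field layout)."""
    fields = [
        str(token_id),                             # token id
        str(start),                                # start vertex
        str(end),                                  # end vertex
        f"<{w.get('cstart')}:{w.get('cend')}>",    # character span
        "1",                                       # lattice path
        yy_quote(w.findtext("surface")),           # token form
        "0",                                       # ipos (unused here)
        '"null"',                                  # no lexical rules
    ]
    tags = [
        f"{yy_quote(p.get('tag'))} {float(p.get('prio')):.4f}"
        for p in w.findall("pos")
    ]
    if tags:
        fields.append(" ".join(tags))              # POS tags with probabilities
    return "(" + ", ".join(fields) + ")"


w = ET.fromstring(PIC_W)
# Vertex numbers 12 and 13 are made up for the example.
yy_token = pic_w_to_yy(w, token_id=13, start=12, end=13)
print(yy_token)
```

Since the underscore and quotes stay inside the quoted form, the token itself passes through unchanged; whether it is then treated as a file name NE depends on token mapping, not on this conversion.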


>
> quite generally, i'm expecting to actively maintain the  
> (tokenization and) token mapping component in the ERG, hence i'm  
> interested in examples and challenges from additional data sets.
>
>

Great, I'll keep that in mind and let you know what else I run up  
against.

Thanks for the tips.

cheers,
Andy


>
>
>
>
> On 17. nov. 2009, at 16.15, Andrew MacKinlay <admackin at gmail.com>  
> wrote:
>
>> Using the chart mapping Pet, I've got a particular PIC which has a  
>> whole lot of problems with it but is nonetheless valid input AFAICT.
>>
>> In the output RMRS, I get the following:
>>
>> <ep cfrom='83' cto='130'><realpred lemma='`/usr/portage/distfiles/ 
>> cdemu-0.6' pos='beta.tar.bz2'' sense='jj'/><label vid='7'/><anchor  
>> vid='10040'/><var sort='e' vid='89' sf='prop'/></ep>
>>
>> As you can see, the POS and sense are incorrect and the POS  
>> includes an unescaped quote that will break XML parsers.
>>
>> The corresponding 'w' element in the PIC is:
>>  <w cend="130" constant="no" cstart="83" id="W013">
>>      <surface>`/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2'</ 
>> surface>
>>      <pos prio="1.0" tag="JJ"/>
>>  </w>
>>
>> Any thoughts as to why this is occurring? Looks like some  
>> combination of the '_' and the quotes in the surface string to me.
>>
>> I've attached the full RMRS and PIC XML if anyone is interested.
>>
>> Thanks,
>> Andy.
>>
>> <broken-rmrs-pic.xml>
>> <broken-rmrs.xml>




