[developers] Malformed RMRS XML output from an ugly but valid PIC:

Ann Copestake Ann.Copestake at cl.cam.ac.uk
Fri Nov 20 17:11:52 CET 2009


I have just checked in code which should make the pred conversion a bit more 
robust to Andy's bug, since I now call the code to escape ' etc on the sense 
and pos fields when outputting XML.  I won't make the change to check the pos 
field for conformance to the .dtd right now, because it'll require more work.  
It'd be helpful if people could check this works.
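
As a rough illustration of the escaping step described above, here is a minimal Python sketch. It is not the checked-in code. The function name realpred_xml and the element and attribute names (realpred, lemma, pos, sense) are assumptions made for the example and should be checked against the .dtd. It does not validate the pos field, in line with the note above that the conformance check is left for later.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr


def realpred_xml(lemma, pos, sense=None):
    """Render one predicate as an XML element with escaped attribute values.

    The element and attribute names are assumptions for illustration only.
    """
    attrs = [("lemma", lemma), ("pos", pos)]
    if sense is not None:
        attrs.append(("sense", sense))
    # quoteattr escapes &, < and >, and picks a quote character that is
    # safe for the value (escaping embedded quotes where it has to).
    rendered = " ".join(f"{name}={quoteattr(value)}" for name, value in attrs)
    return f"<realpred {rendered}/>"


# Round trip: a field containing ' and & still yields well-formed XML.
xml_text = realpred_xml("foo", "u", "it's & <odd>")
parsed = ET.fromstring(xml_text)
```

Parsing the output back, as in the last two lines, is a cheap way to confirm that a field containing ' no longer produces malformed XML.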

I believe that we should reserve underscores for pred splitting.  That 
is, no hand-written predicate symbol should contain underscores except to 
delimit the lemma, pos and sense fields.  An automatically constructed 
predicate symbol should escape all underscores with a \ - i.e., oe's 
_foo\_bar_v_rel proposal. (After thinking about this, I think this is safer 
than trying to count underscores.)

I have changed the pred splitting code to take account of the \ character.

oe at ifi.uio.no said:
> speaking of underscores in unknown words, the current ERG would output the
> following predicate
>
>   _/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2/jj_u_unknown_rel

This isn't split (the underscores are treated as splitting but the code checks 
that the remainder is _rel and doesn't split if it isn't).  If the predicate 
were:

_/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2/jj_u_rel

it would be incorrectly split.

Ideally the predicate above would be

 _/usr/portage/distfiles/cdemu-0.6\_beta.tar.bz2/jj_u_unknown_rel

which will give a split predicate with POS=u and sense=unknown, as desired.
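
The three cases above can be reproduced with a small Python sketch. This is an approximation of the behaviour described in this message, not the actual splitting code. The function name split_pred and the rule that a predicate has exactly a lemma, a pos and an optional sense before the final rel are assumptions made for the example.

```python
import re


def split_pred(pred):
    """Split '_lemma_pos[_sense]_rel' on underscores not preceded by a \\.

    Returns (lemma, pos, sense), or None if the predicate is left unsplit.
    """
    if not pred.startswith("_"):
        return None
    # Negative lookbehind: an underscore right after a backslash is
    # treated as part of the field rather than as a delimiter.
    fields = re.split(r"(?<!\\)_", pred[1:])
    if fields[-1] != "rel":
        return None
    body = fields[:-1]
    if len(body) == 2:
        lemma, pos = body
        sense = None
    elif len(body) == 3:
        lemma, pos, sense = body
    else:
        # Too many or too few fields: the remainder is not just _rel.
        return None
    return (lemma.replace("\\_", "_"), pos, sense)


unescaped = r"_/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2/jj_u_unknown_rel"
shorter = r"_/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2/jj_u_rel"
escaped = r"_/usr/portage/distfiles/cdemu-0.6\_beta.tar.bz2/jj_u_unknown_rel"

print(split_pred(unescaped))  # None: left unsplit
print(split_pred(shorter))    # wrongly split, with pos 'beta.tar.bz2/jj'
print(split_pred(escaped))    # pos 'u', sense 'unknown'
```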

Escaping underscores is not completely watertight, since the code will now 
break if there's a string which ends with a \ and is then passed to the unknown 
word machinery, which then adds an underscore. E.g., if the unknown word is "foo\", 
the machinery converts this to "_foo\_u_rel" and then the second underscore is 
interpreted as escaped.  With the current version, this would presumably be 
converted to "_foo\/nn1_u_rel" (if the tag is nn1) so we'll be alright unless 
someone is using a tagset with underscores at the end of tags.
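
To make the trailing-\ case concrete, here is a small Python sketch. The functions escape_underscores, make_unknown_pred and count_delimiters are hypothetical stand-ins for the unknown word machinery and the splitter, and the "word/tag" layout is assumed from the examples in this message.

```python
import re


def escape_underscores(text):
    """Prefix every underscore in text with a backslash."""
    return text.replace("_", "\\_")


def make_unknown_pred(word, tag=None):
    """Build an unknown-word predicate (hypothetical construction)."""
    body = word if tag is None else word + "/" + tag
    return "_" + escape_underscores(body) + "_u_rel"


def count_delimiters(pred):
    """Count underscores that a \\-aware splitter would treat as delimiters."""
    return len(re.findall(r"(?<!\\)_", pred))


plain = make_unknown_pred("foo")             # _foo_u_rel
trailing = make_unknown_pred("foo\\")        # _foo\_u_rel
tagged = make_unknown_pred("foo\\", "nn1")   # _foo\/nn1_u_rel

# The trailing backslash swallows one delimiter unless a tag intervenes.
print(count_delimiters(plain), count_delimiters(trailing), count_delimiters(tagged))
```

A predicate built from "foo\" without a tag loses one delimiter, so the splitter sees too few fields. With the tag appended, the backslash is followed by / and the delimiter count is back to normal.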

I hope this is clear(!) All this just illustrates the folly of trying to use 
strings to package structured data ...

Ann




