[developers] Malformed RMRS XML output from an ugly but valid PIC:
Ann Copestake
Ann.Copestake at cl.cam.ac.uk
Fri Nov 20 17:11:52 CET 2009
I have just checked in code which should make the pred conversion a bit more
robust to Andy's bug, since I now call the code to escape ' etc on the sense
and pos fields when outputting XML. I won't make the change to check the pos
field for conformance to the .dtd right now, because it'll require more work.
It'd be helpful if people could check this works.
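
For anyone checking this, here is a rough Python sketch of the kind of escaping involved. It is not the checked-in code: the function name is hypothetical, and the `realpred` element with `lemma`, `pos` and `sense` attributes is an assumption about the RMRS XML layout.

```python
from xml.sax.saxutils import escape

# Extra entities so quotes are safe inside attribute values.
_ATTR_ENTITIES = {"'": "&apos;", '"': "&quot;"}


def realpred_xml(lemma, pos, sense=None):
    """Render a (hypothetical) realpred element with every field escaped."""
    attrs = [("lemma", lemma), ("pos", pos)]
    if sense is not None:
        attrs.append(("sense", sense))
    rendered = " ".join(
        f"{name}='{escape(value, _ATTR_ENTITIES)}'" for name, value in attrs
    )
    return f"<realpred {rendered}/>"


# A pos or sense field containing ' or & no longer breaks the output.
print(realpred_xml("foo", "n'", "a&b"))
```

Note that escaping only keeps the XML well-formed; it does not check that the pos value is one the .dtd allows.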
I believe that we should reserve underscores for pred splitting. That
is, no hand-written predicate symbol should contain underscores except to
delimit the lemma, pos and sense fields. An automatically constructed
predicate symbol should escape all underscores with a \ - i.e., oe's
_foo\_bar_v_rel proposal. (After thinking about this, I think this is safer
than trying to count underscores.)
I have changed the pred splitting code to take account of the \ character.
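
To make the convention concrete, here is a minimal Python sketch of escaping underscores when a predicate is built automatically, and of splitting only on unescaped ones. Both function names are hypothetical and this is not the actual splitting code.

```python
import re


def escape_underscores(lemma):
    """Escape every underscore in an automatically constructed lemma."""
    return lemma.replace("_", "\\_")


def split_unescaped(pred):
    """Split on underscores that are not preceded by a backslash."""
    return re.split(r"(?<!\\)_", pred)


pred = "_" + escape_underscores("foo_bar") + "_v_rel"
print(pred)                   # _foo\_bar_v_rel
print(split_unescaped(pred))  # ['', 'foo\\_bar', 'v', 'rel']
```

The leading underscore produces an empty first field, which the caller can use as a check that the predicate has the expected shape.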
oe at ifi.uio.no said:
> speaking of underscores in unknown words, the current ERG would output the
> following predicate
>
> _/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2/jj_u_unknown_rel
This isn't split (the underscores are treated as splitting but the code checks
that the remainder is _rel and doesn't split if it isn't). If the predicate
were:
_/usr/portage/distfiles/cdemu-0.6_beta.tar.bz2/jj_u_rel
it would be incorrectly split.
Ideally the predicate above would be
_/usr/portage/distfiles/cdemu-0.6\_beta.tar.bz2/jj_u_unknown_rel
which will give a split predicate with POS=u sense=unknown, as desired.
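
The three cases above can be reproduced with a small Python sketch. It assumes the splitter expects lemma, pos, an optional sense and a final rel, and leaves the predicate unsplit otherwise; that is my reading of the behaviour described here, and `split_pred` is a hypothetical name.

```python
import re


def split_pred(pred):
    """Return (lemma, pos, sense), or None if the pred is left unsplit."""
    fields = re.split(r"(?<!\\)_", pred)
    # Expected shape: "", lemma, pos, [sense], "rel"
    if len(fields) not in (4, 5) or fields[0] != "" or fields[-1] != "rel":
        return None
    lemma = fields[1].replace("\\_", "_")  # undo the escaping in the lemma
    pos = fields[2]
    sense = fields[3] if len(fields) == 5 else None
    return lemma, pos, sense


base = "_/usr/portage/distfiles/cdemu-0.6"

# Unescaped, ending in _u_unknown_rel: too many fields, so not split.
print(split_pred(base + "_beta.tar.bz2/jj_u_unknown_rel"))

# Unescaped, ending in _u_rel: split, but wrongly (pos is 'beta.tar.bz2/jj').
print(split_pred(base + "_beta.tar.bz2/jj_u_rel"))

# Escaped: split as desired, with pos 'u' and sense 'unknown'.
print(split_pred(base + "\\_beta.tar.bz2/jj_u_unknown_rel"))
```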
Escaping underscores is not completely watertight, since the code will now
break if there's a string which ends with a \ and is then passed to the unknown word
machinery, which then adds an underscore. E.g., if the unknown word is "foo\",
the machinery converts this to "_foo\_u_rel" and the second underscore is then
interpreted as escaped. With the current version, this would presumably be
converted to "_foo\/nn1_u_rel" (if the tag is nn1) so we'll be alright unless
someone is using a tagset with underscores at the end of tags.
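
The trailing-backslash case can be sketched in Python as follows. `make_unknown_pred` is a hypothetical stand-in for the unknown word machinery, with and without a tag appended to the word; it is not the real code.

```python
import re


def make_unknown_pred(word, tag=None):
    """Hypothetical stand-in for the unknown word machinery."""
    lemma = word.replace("_", "\\_")  # escape underscores in the word
    if tag is not None:
        lemma = f"{lemma}/{tag}"
    return f"_{lemma}_u_rel"


def split_unescaped(pred):
    """Split on underscores that are not preceded by a backslash."""
    return re.split(r"(?<!\\)_", pred)


word = "foo\\"  # the four characters f, o, o, \

# No tag: the word's trailing \ swallows the underscore added after it.
print(split_unescaped(make_unknown_pred(word)))
# ['', 'foo\\_u', 'rel']

# Tag appended: the \ is followed by /, so the splitting underscore survives.
print(split_unescaped(make_unknown_pred(word, "nn1")))
# ['', 'foo\\/nn1', 'u', 'rel']
```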
I hope this is clear(!) All this just illustrates the folly of trying to use
strings to package structured data ...
Ann