[developers] Malformed RMRS XML output from an ugly but valid PIC:
oe at ifi.uio.no
Wed Nov 18 16:37:56 CET 2009
> the unknown word code is adding the leading _ and the final _rel
> presumably? Really it should be stripping all underscores in the
> string it's turning into a pred. The code is based on the assumption
> we have _lemma_postag[_sense]_rel where lemma, postag and sense
> contain no underscores.
right, so input processing actually is the problem in the example that
andy sent. or maybe the problem rather is using an outdated version of
the ERG, which is why i did not recognize the predicate associated with
the unknown word.
speaking of underscores in unknown words, the current ERG would output
the following predicate
i.e. nowadays the `lemma' houses the actual surface form, paired with
the PoS tag (for reasons discussed in earlier email and in barcelona);
to be lemmatized in post-processing.
i would like it to be possible for arbitrary characters to be present
in the lemma field, rather than actually stripping characters. i can
see either of the following PRED values being decomposed unambiguously
for the first case, only the /last/ three underscores are interpreted
(hence, underscores in the lemma are only possible where both the PoS
and sense fields are present). alternatively, the latter case puts a
standard escape convention for the lemma field in place, parallel to
double quotes in Lisp strings, for example.
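
to make the two decompositions concrete, here is a minimal sketch in
Python (not the actual rmrs-convert-pred() code). the function names, the
sample predicates, and the choice of backslash as the escape character are
all assumptions for illustration; the escape convention itself is not
pinned down above.

```python
def split_last_three(pred):
    """First case: only the last three underscores separate fields.

    Assumes the form _lemma_pos_sense_rel with both PoS and sense present;
    any further underscores are taken to belong to the lemma.
    """
    if not pred.startswith("_"):
        raise ValueError("expected a leading underscore: %r" % pred)
    parts = pred[1:].rsplit("_", 3)
    if len(parts) != 4 or parts[3] != "rel":
        raise ValueError("cannot decompose: %r" % pred)
    lemma, pos, sense, _ = parts
    return lemma, pos, sense


def split_escaped(pred, escape="\\"):
    """Second case: underscores inside the lemma are escaped.

    The backslash is a hypothetical choice of escape character, by analogy
    with escaping double quotes inside Lisp strings.
    """
    fields, current, i = [], [], 0
    while i < len(pred):
        ch = pred[i]
        if ch == escape and i + 1 < len(pred):
            # escaped character: copy it literally, never split on it
            current.append(pred[i + 1])
            i += 2
            continue
        if ch == "_":
            # unescaped underscore: close the current field
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    # expected layout: ["", lemma, pos, "rel"] or ["", lemma, pos, sense, "rel"]
    if len(fields) not in (4, 5) or fields[0] != "" or fields[-1] != "rel":
        raise ValueError("cannot decompose: %r" % pred)
    sense = fields[3] if len(fields) == 5 else None
    return fields[1], fields[2], sense
```

with the made-up predicate `_foo_bar_n_1_rel`, split_last_three() yields
the lemma `foo_bar`, PoS `n`, and sense `1`. split_escaped() recovers the
same lemma from `_foo\_bar_n_rel` even though the sense field is absent,
which is the case the first scheme cannot handle.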
would you have a preference for either of the two? i briefly glanced
at rmrs-convert-pred() and it seems either of the two (or both, even)
would be easy to support.
cheers - oe
+++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
+++ --- oe at ifi.uio.no; stephan at oepen.net; http://www.emmtee.net/oe/ ---