[developers] More ERG/Redwoods issues

Stephan Oepen oe at ifi.uio.no
Tue Jul 21 15:48:13 CEST 2020


hi again, mike,

> Regarding my second point about unexpected characters in SimpleMRS strings, I tried making PyDelphin more robust to these situations even though I think they should be deemed invalid, but there are some that are simply irredeemable:
>
>     _+-]\?[/NN_u_unknown_rel"<12:18>  (wlb03)
>
> The ] initially threw me off, but even worse is the " after _rel (I included the <12:18> here just for context; note that there is no " at the start of this predicate so this is not a string predicate). I'm not sure how it got there. Maybe an ACE/LKB serialization error?

> In addition, I found a problem with a CARG in ws213:
>
>     [ named<37:41> LBL: h16 CARG: "NP\S"" ARG0: x12 ]
>
> Note that there are two quotation marks at the end of the CARG value. The item it comes from is 1000008400480, which does not have " following NP\S. (The i-input is: This complex category is notated as (NP\\S) instead of V.)

i am copying woodley, because the MRSs you are reading most likely
come from FFTB (i am also adding the 'developers' list, as surely most
folks care about these corner cases).

token mapping will allow the grammar to put virtually any character
into its predicates, and by and large i would say rightly so (even if
not all of the predicate and CARG examples in the above may ultimately
be desirable :-).  thus, MRS serialization may need to be sensitive to
different escaping conventions we have (or may yet have to establish),
as i have tried to summarize in our related M$ GitHub issue:

https://github.com/delph-in/pydelphin/issues/302
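
to make the point about escaping conventions concrete, here is a minimal sketch of one possible convention for double-quoted values such as CARGs: backslash-escape both the backslash and the double quote on output, and reject unescaped quotes on input. the function names quote_string and unquote_string are hypothetical (not PyDelphin, ACE, or LKB API), and the convention itself is an assumption, not something the platforms have agreed on.

```python
def quote_string(value):
    """Serialize a value as a double-quoted string (assumed convention)."""
    # Escape backslashes first so the backslashes added for quotes
    # are not themselves doubled.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def unquote_string(text):
    """Read a double-quoted string back, rejecting malformed input."""
    if len(text) < 2 or text[0] != '"' or text[-1] != '"':
        raise ValueError("not a double-quoted string: %r" % text)
    out = []
    i, end = 1, len(text) - 1  # scan only between the outer quotes
    while i < end:
        c = text[i]
        if c == "\\":
            if i + 1 >= end:
                raise ValueError("dangling backslash in %r" % text)
            out.append(text[i + 1])  # keep the escaped character literally
            i += 2
        elif c == '"':
            raise ValueError("unescaped quote inside %r" % text)
        else:
            out.append(c)
            i += 1
    return "".join(out)
```

under this convention, a CARG value of NP\S would be written as "NP\\S", and the ws213 form "NP\S"" would be rejected on reading rather than silently accepted.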

>     _output_string(”hello/JJ_u_unknown  (ws202)
>     _employee_name/NN_u_unknown  (ws203)
>
> There are _ characters inside the lemma portion of the predicates, which is not allowed. I don't recall if we came up with a scheme for encoding literal underscores in lemmas.

yes, i agree token mapping should not construct these predicates!  the
immediate solution that comes to my mind would be to backslash-escape
underscores in the lemma (and sense) fields, which i believe would
then also require escaping literal backslashes.  in your first example,
that would give: _output\_string(”hello/JJ_u_unknown.
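
the backslash-escaping idea above could be sketched roughly as follows. this is only an illustration of the proposal, not an agreed convention: the helper names (escape_field, unescape_field, build_predicate, parse_predicate) are hypothetical and do not correspond to existing PyDelphin functions, and the sketch assumes surface predicates of the form _lemma_pos or _lemma_pos_sense.

```python
def escape_field(field):
    """Backslash-escape literal backslashes, then literal underscores."""
    return field.replace("\\", "\\\\").replace("_", "\\_")


def unescape_field(field):
    """Undo escape_field: a backslash makes the next character literal."""
    out, i = [], 0
    while i < len(field):
        if field[i] == "\\" and i + 1 < len(field):
            out.append(field[i + 1])
            i += 2
        else:
            out.append(field[i])
            i += 1
    return "".join(out)


def split_unescaped(pred):
    """Split on underscores that are not preceded by an escaping backslash."""
    parts, cur, i = [], [], 0
    while i < len(pred):
        c = pred[i]
        if c == "\\" and i + 1 < len(pred):
            cur.append(pred[i:i + 2])  # keep the escape pair intact
            i += 2
        elif c == "_":
            parts.append("".join(cur))
            cur = []
            i += 1
        else:
            cur.append(c)
            i += 1
    parts.append("".join(cur))
    return parts


def build_predicate(lemma, pos, sense=None):
    """Assemble a surface predicate with escaped lemma and sense fields."""
    fields = [escape_field(lemma), pos]
    if sense is not None:
        fields.append(escape_field(sense))
    return "_" + "_".join(fields)


def parse_predicate(pred):
    """Return (lemma, pos, sense); sense is None if absent."""
    fields = split_unescaped(pred)
    if fields[0] != "" or len(fields) not in (3, 4):
        raise ValueError("cannot split predicate into fields: %r" % pred)
    lemma, pos = unescape_field(fields[1]), fields[2]
    sense = unescape_field(fields[3]) if len(fields) == 4 else None
    return lemma, pos, sense
```

with these helpers, build_predicate("output_string(”hello/JJ", "u", "unknown") yields the escaped form shown above, while parsing the unescaped _employee_name/NN_u_unknown raises an error because it splits into too many fields.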

but before guarding against these invalid predicates in token mapping,
it would be good to push a little further in terms of cross-platform
agreement on these fine points of (simple) MRS serialization.

best wishes, oe


