[developers] More ERG/Redwoods issues

goodman.m.w at gmail.com
Tue Jul 21 17:50:07 CEST 2020


Thanks, Stephan,

On Tue, Jul 21, 2020 at 9:48 PM Stephan Oepen <oe at ifi.uio.no> wrote:

> [...]
> token mapping will allow the grammar to put virtually any character
> into its predicates, and by and large i would say rightly so (even if
> not all of the predicate and CARG examples in the above may ultimately
> be desirable :-).  thus, MRS serialization may need to be sensitive to
> different escaping conventions we have (or may yet have to establish),
> as i have tried to summarize in our related M$ GitHub issue:
>
> https://github.com/delph-in/pydelphin/issues/302
>
>
I'm merging some of the more general discussion from the linked GitHub
issue to this thread.

Regarding the PredicateRfc wiki, I don't think we should read it too
literally, as it was not written with the same level of rigor that we put
into, e.g., the [TdlRfc](http://moin.delph-in.net/TdlRfc) page, and I'd call
it more descriptive than prescriptive. But we certainly could improve it
into a reference document of that kind.

Regarding the shape of predicates, we need to separate our design
considerations for the predicate symbols themselves from any constraints of
a particular serialization format, as they may be used, unquoted, in other
formats beyond SimpleMRS (e.g. EDS 'native' format, PENMAN, Indexed MRS,
etc.) which may have different sets of valid and invalid characters. In an
earlier thread we established that predicates of some different forms are
equivalent if they differ only along these dimensions:

* upper/lower case distinctions (_predicate_n_1 == _PREDICATE_n_1)
* surrounding quotes (_predicate_n_1 == "_predicate_n_1")
* presence of _rel suffix (_predicate_n_1 == _predicate_n_1_rel)

(Aside: I'm not fond of the last one because of the ambiguity with _rel as
a sense field (place_n == place_n_rel?); I'd argue for *requiring* that any
_rel suffix (that isn't a sense) be removed from grammar-external
("exported") MRSs.)
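To make the three equivalences concrete, here is a minimal sketch of a
normalization step. The function name `normalize_predicate` is hypothetical
(it is not PyDelphin's API), and the unconditional removal of a trailing _rel
is an assumption that bakes in exactly the ambiguity noted in the aside.

```python
def normalize_predicate(pred: str) -> str:
    """Map equivalent predicate forms to one form (illustrative only)."""
    # Surrounding double quotes are treated as an encoding detail.
    if len(pred) >= 2 and pred[0] == '"' and pred[-1] == '"':
        pred = pred[1:-1]
    # Upper/lower case distinctions are not significant.
    pred = pred.lower()
    # Assumption: a trailing _rel is always the suffix, never a sense field.
    if pred.endswith('_rel'):
        pred = pred[:-len('_rel')]
    return pred


forms = ['_predicate_n_1', '_PREDICATE_n_1',
         '"_predicate_n_1"', '_predicate_n_1_rel']
print({normalize_predicate(f) for f in forms})
```

Under this sketch, place_n_rel also normalizes to place_n, so a predicate
whose sense really is "rel" cannot be told apart from one carrying the suffix.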

I think we can go further and say that quoted predicates are not even part
of the spec for predicates; rather, they are an encoding scheme used by
several serialization formats for predicates that cannot legally be encoded
otherwise. At least, this could be true for exported MRSs. I recognize the
historical purpose of quoted predicates for those that don't have a type
defined in the grammar.

Other serialization formats may use other schemes. In JSON, for instance,
predicates are always quoted and they follow JSON escaping conventions. The
XML formats allow for "real predicates" that separate the lemma, pos, and
sense fields, but they are still bound by XML's encoding conventions.
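As a small illustration of the JSON case, Python's standard json module
handles the quoting and escaping on its own, so the predicate symbol itself
needs no quoting convention. The predicate below and the "predicate" key are
made up for the example and are not taken from any particular MRS-JSON schema.

```python
import json

# Hypothetical predicate containing characters that need escaping in JSON.
pred = '_say"hi"_v_1'

encoded = json.dumps({'predicate': pred})
decoded = json.loads(encoded)['predicate']

print(encoded)
print(decoded == pred)
```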

> >     _output_string(”hello/JJ_u_unknown  (ws202)
> >     _employee_name/NN_u_unknown  (ws203)
> >
> > There are _ characters inside the lemma portion of the predicates, which
> > is not allowed. I don't recall if we came up with a scheme for encoding
> > literal underscores in lemmas.
>
> yes, i agree token mapping should not construct these predicates!  the
> immediate solution that comes to my mind would be to backslash-escape
> underscores in the lemma (and sense) fields, which i believe would
> then bring along escaping of literal backslashes, i.e. in your first
> example: _output\_string(”hello/JJ_u_unknown.
>

I have a slight dispreference for backslash-escaping literal underscores,
because it complicates parsing. We could no longer simply split on _
characters to get the <lemma, pos, sense> components; instead we would have
to parse the predicate character by character to determine whether the \
that precedes a _ is itself escaped, etc. A strategy like TSDB's might work,
using \s or similar in place of the literal underscore. We'd still need to
decode each field to get the original form, but we could just split on _ to
get the individual components.
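A rough sketch of the difference between the two schemes follows. Both
function names are hypothetical, and using \s for a literal underscore (with
\\ for a literal backslash) is only an assumption about what a TSDB-like
convention for predicates might look like; nothing here is an agreed spec.

```python
import re


def split_backslash_escaped(pred: str) -> list:
    """Split on unescaped '_' when '_' itself is written as '\\_'."""
    fields, current, i = [], [], 0
    while i < len(pred):
        ch = pred[i]
        if ch == '\\' and i + 1 < len(pred):
            # An escaped character is taken literally, even '_' or '\\'.
            current.append(pred[i + 1])
            i += 2
        elif ch == '_':
            fields.append(''.join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    fields.append(''.join(current))
    return fields


def split_placeholder_escaped(pred: str) -> list:
    """Split on every '_', then decode '\\s' and '\\\\' inside each field."""
    table = {'s': '_', '\\': '\\'}

    def decode(field: str) -> str:
        # One left-to-right pass, so '\\\\s' decodes to a backslash plus 's'.
        return re.sub(r'\\(.)',
                      lambda m: table.get(m.group(1), m.group(0)),
                      field)

    return [decode(f) for f in pred.split('_')]


print(split_backslash_escaped(r'_employee\_name/NN_u_unknown'))
print(split_placeholder_escaped(r'_employee\sname/NN_u_unknown'))
```

Both calls yield the same fields (the leading empty string comes from the
initial underscore), but only the second lets the splitting itself remain a
plain split on _.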


> but before guarding against these invalid predicates in token mapping,
> it would be good to push a little further in terms of cross-platform
> agreement on these fine points of (simple) MRS serialization.
>
> best wishes, oe
>


-- 
-Michael Wayne Goodman