[developers] RESTful ERG parsing

Sun Apr 3 02:52:58 CEST 2016

On Sat, Apr 2, 2016 at 2:09 PM Stephan Oepen <oe at ifi.uio.no> wrote:

> i also by and large followed your example for MRS serialization in
> JSON, with the complication that i want to allow variable properties
> on arbitrary argument positions, i.e. make these independent of a
> particular EP.  for now, i ended up with the following:
>
> MRS(30): (mrs-output-json (extract-mrs (first *parse-record*)) :stream
> t :columns 79)
> {"top": {"id": "h1", "type": "h"},
>  "index": {"id": "e3", "type": "e", "properties":
>  {"SF": "prop", "TENSE": "past", "MOOD": "indicative", "PROG": "-",
> "PERF": "-"}},
>  "rels":
>  [{"label": {"id": "h4", "type": "h"}, "predicate": "proper_q", "lnk":
> {"from": 0, "to": 3}, "roles":
>    {"ARG0": {"id": "x6", "type": "x", "properties": {"PERS": "3",
> "NUM": "sg", "IND": "+"}}, "RSTR": {"id": "h5", "type": "h"}, "BODY":
> {"id": "h7", "type": "h"}}},
>   {"label": {"id": "h8", "type": "h"}, "predicate": "named", "lnk":
> {"from": 0, "to": 3}, "roles": {"ARG0": {"id": "x6"}, "CARG": "Kim"}},
>   {"label": {"id": "h2", "type": "h"}, "predicate": "_arrive_v_1",
> "lnk": {"from": 4, "to": 12}, "roles": {"ARG0": {"id": "e3"}, "ARG1":
> {"id": "x6"}}}],
>  "hcons":
>  [{"relation": "qeq", "high": {"id": "h1"}, "low": {"id": "h2"}},
>   {"relation": "qeq", "high": {"id": "h5"}, "low": {"id": "h8"}}]}
>
> this ends up less compact than the simple format, in part because all
> variables are objects.  however, the full object content is only
> printed once (upon the first variable occurrence).
>

(Aside: I've found that JSON is often not compact, sometimes even less so
than XML, especially when printed with line breaks and indentation. If HTTP
responses are compressed, though, there's hardly any difference from
compressed XML. And JSON is more readable, and more convenient to consume
from JavaScript (or even other languages like Python).)

Regarding properties not just on EPs: I agree; I was forgetting that MRS
can have variables for dropped arguments which aren't ARG0s of some EP but
may have properties (e.g. through agreement or something).

Regarding variables-as-objects; I recognize they have some internal
structure (variable-sort, variable-id, and assigned properties), but it's a
headache for serialization. When writing, do you always write the full
object, or use some reduced form (e.g., just the "id") for all but the
first occurrence? When reading, if two objects with the same ID both have
properties, do you merge them, or use the first/second/etc., or throw an
error? For these reasons, partly, I would prefer having simple strings for
variables and a separate hash of variable-to-properties. E.g.:

{ "top": "h0", "index": "e2", ..., "properties": { "e2": { "TENSE": "past",
...}}}

For the variable-sort, I'd use a simple regex-based function. E.g.
(assuming here, and below, that your client language is Javascript):
    var variable_re = /(.*?)(\d+)$/;
    function varsort(v) { return v.replace(variable_re, "$1"); }
    varsort("x12");  // returns "x"

Also, if we don't put the variable properties in every variable object,
then you'll have to do some post-processing (after deserializing JSON) in
order to resolve those objects (and see below about re-entrancy). E.g. you
might ask for:
    mrs.rels[2].roles.ARG0.properties
but that information is at:
    mrs.index.properties
Even if we did have the properties objects on every variable occurrence,
the two above expressions don't return the same object; just ones with the
same contents (hopefully).

With the separate properties list, you can do:
    mrs.properties[mrs.index]
or:
    mrs.properties[mrs.rels[2].roles.ARG0]
and get the same object back.
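
To make the comparison concrete, here is a minimal sketch of converting the
variables-as-objects form into the flat form with a separate properties
hash. The name flattenMrs() is my own invention (not part of any existing
API), and it assumes the key names from your example ("id", "properties",
"roles", "high", "low"). It drops the "type" key, since varsort() can
recover it, and it throws when two occurrences of the same ID carry
different properties, which is only one of the possible policies:

```javascript
// Hypothetical helper: convert an MRS whose variables are objects
// ({"id": ..., "type": ..., "properties": ...}) into one whose variables
// are plain strings, with all properties collected in a single map.
function flattenMrs(mrs) {
  var properties = {};

  // Record any properties on a variable object and return its id.
  function flattenVar(v) {
    if (v === null || typeof v !== "object") return v;  // e.g. a CARG string
    if (v.properties) {
      var seen = properties[v.id];
      if (!seen) {
        properties[v.id] = v.properties;
      } else if (JSON.stringify(seen) !== JSON.stringify(v.properties)) {
        // Crude comparison (sensitive to key order); enough for a sketch.
        throw new Error("conflicting properties for " + v.id);
      }
    }
    return v.id;
  }

  return {
    top: flattenVar(mrs.top),
    index: flattenVar(mrs.index),
    rels: mrs.rels.map(function (ep) {
      var roles = {};
      Object.keys(ep.roles).forEach(function (role) {
        roles[role] = flattenVar(ep.roles[role]);
      });
      return { label: flattenVar(ep.label), predicate: ep.predicate,
               lnk: ep.lnk, roles: roles };
    }),
    hcons: (mrs.hcons || []).map(function (hc) {
      return { relation: hc.relation, high: flattenVar(hc.high),
               low: flattenVar(hc.low) };
    }),
    properties: properties
  };
}
```

After `var flat = flattenMrs(mrs)`, both `flat.properties[flat.index]` and
`flat.properties[flat.rels[2].roles.ARG0]` look up the same key, so they
return the very same object.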

> really, what one might want is an explicit re-entrancy, e.g. something like
>
>   { "index": #1={ "id": "x1", "type": "x", ...},
>     ... { "ARG0": #1# ... } ... }
>
> but for all i can tell there is no facility like that available in JSON,
> right?
>

It's not part of the JSON spec. But even XML's ID and IDREF don't always
result in actual re-entrancies. You'd need an XML reader that honors that
information, AND probably the DTD (or other schema) that says which
attributes are IDREFs. If you want this in JSON, you'd have to do it
yourself---i.e., write your own post-processing transforms after
deserializing the JSON into Javascript objects---or use a library that does
this for you.
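
As a rough illustration of the do-it-yourself route, the sketch below walks
a deserialized MRS and makes every variable occurrence with the same "id"
point at one shared object. The name linkVariables() is hypothetical, the
key names are taken from your example, and letting later occurrences
overwrite earlier property values is just an assumption about the merge
policy:

```javascript
// Hypothetical post-processing step: after JSON.parse(), replace each
// variable occurrence by a single shared object per "id".
function linkVariables(mrs) {
  var byId = {};

  // Return the canonical object for a variable, merging in whatever
  // this particular occurrence carries.
  function canonical(v) {
    if (v === null || typeof v !== "object" || !("id" in v)) return v;
    var shared = byId[v.id] || (byId[v.id] = { id: v.id });
    if (v.type) shared.type = v.type;
    if (v.properties) {
      shared.properties = shared.properties || {};
      Object.keys(v.properties).forEach(function (k) {
        shared.properties[k] = v.properties[k];  // later occurrences win
      });
    }
    return shared;
  }

  mrs.top = canonical(mrs.top);
  mrs.index = canonical(mrs.index);
  (mrs.rels || []).forEach(function (ep) {
    ep.label = canonical(ep.label);
    Object.keys(ep.roles).forEach(function (role) {
      ep.roles[role] = canonical(ep.roles[role]);
    });
  });
  (mrs.hcons || []).forEach(function (hc) {
    hc.high = canonical(hc.high);
    hc.low = canonical(hc.low);
  });
  return mrs;  // modified in place and returned for convenience
}
```

With that applied, `mrs.rels[2].roles.ARG0 === mrs.index` holds for your
example, so the properties are reachable from either path.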

Also see these:
* https://en.wikipedia.org/wiki/JSON#Object_references
* http://www.jspon.org/

> —do you have any suggestions for refining the above further?  i
> debated making the surface vs. abstract predicate distinction explicit
> in the JSON serialization, but i currently look at JSON as an
> alternative to the simple serialization, specifically for the RESTful
> interface, hence i ended up with the above.
>

I usually base my naming off the MRS DTD, e.g., "pred" instead of
"predicate", "hi" and "lo" in HCONS, etc., but I now see the fuller forms
exist in the lisp code, which you may be more used to seeing. As long as
it's documented, I don't think it matters either way.

Regarding surface vs abstract predicates: I don't have strong opinions
here. I think the convention (rule?) that surface predicates begin with an
underscore seems sufficient. For convenience, we could define a simple
function (like the varsort() one above) to return something based on its
presence/absence (e.g. `predicateType("_arrive_v_1") == "surface"`). But
similar to my thoughts on variables, I think making the value of the
"predicate" key an object (e.g. `"predicate": {"value": "_arrive_v_1",
"type": "surface"}`) would cause more problems than it would help.

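Such a convenience function could be as small as the following sketch. The
name predicateType() and the return values "surface" and "abstract" are only
suggestions, and it relies on nothing but the leading-underscore convention:

```javascript
// Hypothetical helper: classify a predicate string by the convention that
// surface predicates begin with an underscore.
function predicateType(pred) {
  return pred.charAt(0) === "_" ? "surface" : "abstract";
}

predicateType("_arrive_v_1");  // "surface"
predicateType("proper_q");     // "abstract"
```
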
Oh and BTW, Emily said (offline) that I shouldn't offer to take technical
discussions offline. What I was withholding was that I tried requesting XML
and got JSON instead of an error 406:

$ curl -v -H "Accept: application/xml" \
    "http://erg.delph-in.net/rest/0.9/parse?input=Abrams%20arrived"
[...]
< HTTP/1.1 200  OK
[...]
{"input": "Abrams arrived",
[...]

There are differing opinions on how to treat bad requests
(http://archive.oreilly.com/pub/post/restful_error_handling.html), but I
think that returning descriptive status codes is a good way to help the
client know what to present to the user.
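
For what it's worth, the check I have in mind is roughly the following. It
is written in Javascript only for consistency with the snippets above (I
don't know how the server is implemented), statusForAccept() is a made-up
name, and it simplifies the real negotiation rules: it ignores the
precedence of specific media ranges over wildcards and only treats q=0 as a
refusal.

```javascript
// Hypothetical check: choose a status code from the Accept request header.
// `supported` lists the media types the service can actually produce.
function statusForAccept(acceptHeader, supported) {
  if (!acceptHeader) return 200;  // no Accept header: anything is fine
  var acceptable = acceptHeader.split(",").some(function (item) {
    var parts = item.split(";").map(function (s) { return s.trim(); });
    var range = parts[0].toLowerCase();
    // A media range with q=0 is explicitly not acceptable.
    var refused = parts.slice(1).some(function (p) {
      var m = /^q=([0-9.]+)$/i.exec(p);
      return m !== null && parseFloat(m[1]) === 0;
    });
    if (refused) return false;
    return supported.some(function (type) {
      return range === "*/*" ||
             range === type ||
             range === type.split("/")[0] + "/*";
    });
  });
  return acceptable ? 200 : 406;
}

statusForAccept("application/xml", ["application/json"]);             // 406
statusForAccept("application/xml, */*;q=0.1", ["application/json"]);  // 200
```

Under that logic my request above would have received a 406 rather than a
200 with a JSON body.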

> cheers, oe
>
>
> On Tue, Mar 29, 2016 at 1:00 AM, Michael Wayne Goodman
> <goodmami at u.washington.edu> wrote:
> > Hi Stephan,
> >
> > On Mon, Mar 28, 2016 at 1:53 PM Stephan Oepen <oe at ifi.uio.no> wrote:
> >>
> >> dear colleagues,
> >>
> >> i used part of the easter break to teach myself about modern
> >> technologies and am currently in the process of providing a RESTful
> >> (programmatic) interface to the on-line ERG demonstrator.  i know of
> >> at least one colleague who has been waiting impatiently for this
> >> functionality :-).
> >>
> >> in a nutshell, client software can now obtain parses using the HTTP
> >> protocol and URIs providing the input string (and a handful of
> >> optional parameters).  for example:
> >>
> >>   http://erg.delph-in.net/rest/0.9/parse?input=Abrams%20arrived.
> >>
> >> parsing results will be returned in machine-readable format,
> >> serialized as a JSON document.  for a little more background on how to
> >> use this new service (including an example client in Python, believe
> >> it or not), please see:
> >>
> >>   http://moin.delph-in.net/ErgApi
> >
> >
> > What a beautiful bike shed :)
> >
> > BTW, Demophin has an undocumented HTTP API, but it's not RESTful:
> >
> > $ curl -F 'sentence=Abrams arrived.'
> > http://chimpanzee.ling.washington.edu/demophin/erg/parse
> >
> > I had hoped to change it to follow the REST principles more closely and
> > document the API, but I'm happy to know that you've already started that
> > effort.
> >
> >>
> >>
> >> there is some more work to be done on the interface (see the page
> >> above), but i would like to ask for help already at this point:
> >>
> >> (0) in case you notice anything surprising in the interactive ERG
> >> demonstrator, please do not hesitate to let me know!
> >
> >
> > If you're defining a new JSON schema for EDS, then maybe we can do
> > something more convenient for, e.g., lnk values. Currently the indices
> > are encoded in a string:
> >
> >     "lnk": "<0:6>"
> >
> > If we make it a JSON object, then users of the results wouldn't have to
> > parse the string later:
> >
> > "lnk": {"type": "charspan", "cfrom": 0, "cto": 6}
> >
> > (The "type" could be optional if we define "charspan" as the default,
> > or if we pretend that the other types don't exist)
> >
> >>
> >> (1) i still need to provide a serialization of MRSs in JSON; in case
> >> anyone has previously tackled this (design) problem, please do get in
> >> touch!
> >
> >
> > Not yet, sorry.  One thing that comes to mind is that JSON doesn't have
> > an unordered collection aside from objects (hashes), which require keys.
> > So we could treat the RELS, HCONS, and ICONS bags as arrays (lists) (but
> > we often do this anyway, so I think it's fine to use arrays). Here's a
> > rather direct conversion:
> >
> > { "top": "h0", "index": "e2", "rels": [ {"pred": "proper_q", "lbl": "h3",
> > "arg0": "x4"...}...]...}
> >
> > One thing that isn't obvious is variable properties. They could follow
> > the EDS example and put them in the EP object (and similarly be
> > controlled by the "properties" parameter in the URL):
> >
> > { ..., "rels": [ { "pred": "named", ..., "properties": { "PERS": "3"}},
> > ... ], ...}
> >
> >>
> >> (2) i think it might be nice to incorporate RESTful parsing as an
> >> option in pyDelphin; mike, could you be interested in collaborating on
> >> this?
> >
> >
> > Yes. Whatever API we settle on, I'd like to incorporate that into
> > pyDelphin and use it as the basis for Demophin as well.
> >
> >>
> >> finally, i would be curious to hear comments or suggestions for how to
> >> use and extend this service (though cannot promise i will have a lot
> >> of time to develop this further until another holiday break); please
> >> see towards the bottom of the above wiki page for some candidate
> >> directions.
> >
> >
> > I've done a couple of REST APIs so far, so I have some suggestions (some
> > are rather technical so I'm happy to save those for an off-list
> > discussion).
> >
> > One thing that might be relevant to others is how we can request other
> > formats. I see you have parameters "eds=...", "derivation=...",
> > "mrs=...", so presumably we could expand it with others ("rmrs=...",
> > "dmrs=...", etc.)?
> >
> > Other ideas:
> > * what about generation?
> > * can we set a request header for preprocessing? (e.g. morphological
> > segmentation for Jacy or Zhong)
> > * If we have already preprocessed, can we specify the Content-Type (e.g.
> > Content-Type: application/yy)?
> >
> >> best wishes; god påske!  oe
> >>
> >
>