[developers] RESTful ERG parsing

Wed Apr 6 21:54:12 CEST 2016

On Wed, Apr 6, 2016 at 12:15 PM Stephan Oepen <oe at ifi.uio.no> wrote:

> as an afterthought, one final candidate revision: given our reasoning
> about lower- vs. upper-case ‘namespaces’, one could apply the same
> condensing as i suggested for ‘properties’ at the EP level and drop
> the extra ‘arguments’ embedding.  that way, the EP structure would
> become an object that is a little more parallel again to the TFS-like
> rendering in the ‘simple’ serialization.  would you support making
> this change, before we finalize this part of the protocol for now?
>

I actually rather like the current state as it closely mirrors the data
structure I have in pyDelphin (which makes parsing easy), but it wouldn't
be hard to implement the more condensed form. And on a more subjective
note, the condensed form feels like "label" should be "LBL" again, since
it's structurally closer to the SimpleMRS format, even though it would then
be in the arguments' "namespace".
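
To make the comparison concrete, here is a rough sketch of the two EP shapes as Python dicts. The "label" and "arguments" keys come from this thread; the "predicate" key, the sample values, and the condense_ep() helper are made up for illustration and are not part of the protocol.

```python
def condense_ep(ep):
    """Return a copy of an EP with the 'arguments' embedding flattened away.

    Hypothetical helper: copies every top-level key except 'arguments',
    then merges the role/value pairs from 'arguments' into the same dict.
    """
    condensed = {key: value for key, value in ep.items() if key != 'arguments'}
    condensed.update(ep.get('arguments', {}))
    return condensed


# Current form: roles live in their own 'arguments' object (sample values).
nested_ep = {
    'label': 'h1',
    'predicate': '_rain_v_1',
    'arguments': {'ARG0': 'e2'},
}

# Condensed form: 'label' now sits next to the upper-case role names.
flat_ep = condense_ep(nested_ep)
print(flat_ep)
```

In the flattened dict "label" is a sibling of "ARG0", which is why it starts to look like it should be "LBL".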

> i had been looking at this bug report:
>
>   https://bugs.python.org/issue1712522
>
> my understanding is that urllib.quote() in 2.7 does not support
> unicode strings, whereas the revised version in 3.x does.  i have not
> tried the work-around of encoding to an UTF-8 byte sequence first;
> strictly speaking, i would think percent escaping should happen at the
> string level (and arguably should support arbitrary unicode strings,
> effectively making urllib an irilib), and the conversion to a byte
> sequence for HTTP transport should be effected by urllib.urlopen().
>

Thanks for the pointer. I tinkered with the urllib/urllib2 modules and did
notice this problem. Encoding to UTF-8 first does seem to solve it in
Python 2: quote() there expects a byte string, so unicode strings have to
be encoded before they are passed in. In Python 3, quote() accepts either
bytes or unicode strings.
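
A minimal sketch of the encode-first workaround, written for Python 3 so that both call styles can be compared side by side. On Python 2 the function is urllib.quote() and only the byte-string form applies; the variable names are just for illustration.

```python
from urllib.parse import quote  # Python 3 location of quote()

text = 'あ is a Japanese character.'

# Python 3 encodes str input as UTF-8 by default before escaping.
quoted_from_str = quote(text)

# Encoding to UTF-8 bytes first is the form that also works on Python 2.
quoted_from_bytes = quote(text.encode('utf-8'))

print(quoted_from_bytes)  # %E3%81%82%20is%20a%20Japanese%20character.
```

Both calls produce the same percent-escaped string, so encoding first should be harmless on Python 3.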

The ideal world you described is probably closest to the third-party
Requests package:

>>> import requests
>>> resp = requests.get('http://erg.delph-in.net/rest/0.9/parse?input=あ is a Japanese character.')
>>> resp.json()['input']
'あ is a Japanese character.'

(notice the あ is returned in the response; i.e. it was encoded in the
request AND decoded in the response; furthermore, this works unmodified for
both Python 2 and 3)

But I share your desire for a simple solution that has no dependencies
outside of the standard libraries, so I'll see if I can make it work.

> once you have had a chance to look at RESTful client implementation
> yourself, i will be curious to see which solution you adopt!
>

Python has a nice ImportError that you can catch, and since Python 2 doesn't
have the urllib.request or urllib.parse sub-packages, I exploit this to
write custom pre-encoding code for Python 2. It sounds a little hacky, but
it's a pretty common pattern for code meant to work with both versions. But
given that Python 3's quote function takes either bytes or unicode strings,
I might not need to do this. More soon.

Btw, in the current version of the MRS-JSON format, I noticed that handles
had no "type", where I expected {"type": "h"}.

Thanks