[developers] Serializing EDS without a top

goodman.m.w at gmail.com goodman.m.w at gmail.com
Fri Dec 18 08:10:56 UTC 2020


Thanks for the response, Stephan,

On Thu, Dec 17, 2020 at 6:39 PM Stephan Oepen <oe at ifi.uio.no> wrote:

> [...]
> in a nutshell, EDS native serialization is indeed line-oriented, and i
> am inclined to hold fast on the one-node-per-line convention.  i would
> not want to muddy these waters, since the format has been around since
> 2002, and there has been some EDS activity beyond DELPH-IN.  i know of
> at least two EDS readers that rely on the presence of line breaks.
>

Ok, sounds good. Then my previous message may be useful if the
maintainer(s) of those two readers ever decide to embrace the convenience
of single-line EDS. Other than determining the top of the graph, adapting
the readers should be trivial: just treat \n as any other whitespace.

> i do see the benefits of a more compact serialization, however, but
> would recommend you call that something else (say EDSLines), if you
> decide to implement it in pyDelphin.


It's been implemented for some time now. In fact, all codecs have a -lines
variant (simplemrs -> simplemrs-lines, dmrx -> dmrx-lines, etc.). E.g., in
the case of XML formats, the -lines variant outputs each item (<mrs> or
<dmrs>) on its own line and suppresses the root nodes (<mrs-list>, <dmrs-list>).
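
To illustrate what "each item on a line, no root node" means for the XML
formats, here is a rough sketch using only the Python standard library. It
is not PyDelphin's code: the function name to_lines and the sample input
are made up for illustration.

```python
import xml.etree.ElementTree as ET


def to_lines(xml_text):
    """Serialize each child of the root element on its own line.

    The root (e.g. <mrs-list>) is dropped, and whitespace-only text
    between elements is removed so no line breaks survive.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for item in root:
        # item.iter() yields the item itself and all of its descendants
        for el in item.iter():
            if el.text is not None and not el.text.strip():
                el.text = None
            if el.tail is not None and not el.tail.strip():
                el.tail = None
        lines.append(ET.tostring(item, encoding='unicode'))
    return lines


sample = '<mrs-list>\n  <mrs cfrom="0" cto="1">\n    <label vid="0"/>\n  </mrs>\n  <mrs/>\n</mrs-list>'
for line in to_lines(sample):
    print(line)
```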


> [...]
> {_: e2:_rain_v_1<3:9>[] e3:_heavy_a_1<10:42>[ARG1 e2] }
> {\n e2:_rain_v_1<3:9>[]\n e3:_heavy_a_1<10:42>[ARG1 e2]\n }
> {: e2:_rain_v_1<3:9>[] e3:_heavy_a_1<10:42>[ARG1 e2] }
>
> the above order reflects what i believe would be my personal ranking
> just now :-).  i frequently use underscores for ‘anonymous’ MRS
> variables, and the first variant feels maybe most natural: there
> should be a top identifier, but in this case it is missing.


The 'anonymous' node identifier for a fake top is fine and, conveniently,
PyDelphin can already read in this variant. The difference is that '_' is a
valid identifier in EDS, so it's not actually missing, just unlinked. I
think logically an unlinked top is the same as a null top, but this means
that PyDelphin may write an EDS that is different (in terms of Python data
structures, viz., upon re-reading the serialization) from the source EDS.



>  The
> second variant also would seem to maintain compatibility with the
> native EDS serialization, only introducing an inline encoding of line
> breaks.


Inserting a literal '\' and 'n' is awkward and changes the format, and I
don't see how it's compatible at all besides having '\' and 'n' in the same
location as your preferred newline characters.


> variant #3, on the other hand, i believe would depart from
> how native serialization deals with missing tops; thus, if you were to
> opt for this format, it would be even more important to maintain a
> clear distinction between EDS native serialization and the pyDelphin
> EDSLines format.
>

If the thing between the first '{' and the first ':' is the top identifier,
then if nothing is there the top is null. This is easy to parse and (I
thought) easy to understand. As EDS native serialization from PyDelphin has
done this for some time, I will continue to read it in, but going forward I
will not write it out. As of the latest commit, I just omit the top
entirely, which is what your newline-ful variant would do if it were simply
newline-less (see the last EDS of my first message). I have written, but
have not yet pushed to GitHub, a change that inserts an anonymous '_' top
if the top is null (if '_' is already used by some node, I try '_0', then
'_1', etc. until I get an unused one).
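
A minimal sketch of that fallback, in case it helps anyone doing the same
in another reader or writer. The function name anonymous_top is
hypothetical, and this is not the actual (unpushed) PyDelphin change; it
only assumes the node identifiers are available as strings.

```python
from itertools import count


def anonymous_top(used_ids):
    """Return an identifier for a fake top that no node already uses.

    Tries '_' first, then '_0', '_1', ... until an unused one is found.
    """
    used = set(used_ids)
    if '_' not in used:
        return '_'
    for i in count():
        candidate = f'_{i}'
        if candidate not in used:
            return candidate


print(anonymous_top(['e2', 'e3']))
print(anonymous_top(['_', '_0', 'e2']))
```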

I have also made the following changes (which I think you'll be happy with):
- The default serialization is now indented with newlines (and this is true
of all codecs); use eds-lines to get the single-line variant
- Conversion from MRS now uses predicate modification by default
- Blank lines are inserted between indented EDSs (not sure if your readers
actually require this)



>
> i hope the above makes sense to you?  oe
>
>
> On Wed, Dec 16, 2020 at 10:41 AM goodman.m.w at gmail.com
> <goodman.m.w at gmail.com> wrote:
> >
> > Hello developers,
> >
> > It's been a while but I'm returning to a discussion we were having about
> > serializing EDS in the native format when there is no TOP and when there's
> > no INDEX to backoff to. Stephan suggested that EDS is a line-based format
> > (i.e., line breaks are required), while I would like to continue to support
> > single-line EDS in PyDelphin. I think the last word on the subject from
> > Stephan, at least on this list, was mid-September (
> > http://lists.delph-in.net/archives/developers/2020/003140.html), where he
> > said he'd continue discussion on another thread, which presumably meant the
> > thread from late August (
> > http://lists.delph-in.net/archives/developers/2020/003127.html). I don't
> > think the discussion did continue, so I'm starting this thread in case
> > anyone is interested.
> >
> > As an example, here's an EDS (without properties) for "It rained."
> >
> >     {e2:
> >      e2:_rain_v_1<3:9>[]
> >     }
> >
> > In PyDelphin, when an EDS has no TOP, I was outputting the first colon
> > anyway, intentionally:
> >
> >     {:
> >      e2:_rain_v_1<3:9>[]
> >     }
> >
> > It's a bit ugly, but it allows me to detect, with 1 token of lookahead,
> > if there's a top or not. If the colon is omitted then it's not clear if
> > "e2:" is the top or the start of the first node. If line breaks are
> > required, we just assume the first line is for the top, whether or not it's
> > there. But for single-line EDS, we need 4 tokens of lookahead to determine
> > if there's a top (assuming the parser treats variables and predicates as
> > the same kinds of tokens):
> >
> >     {e2: e2:_rain_v_1<3:9>[]}
> >     {e2:_rain_v_1<3:9>[]}
> >
> > Here is the parsing algorithm, once we've consumed the first '{':
> >
> > 1. If the 1st lookahead token is ':', '(fragmented)' (or another graph
> >    status), '}', or '|' (node status), then we know that TOP is missing (the
> >    ':' is for PyDelphin's current output)
> > 2. Otherwise the 1st and 2nd tokens must be a symbol and a colon, and if
> >    the 3rd token is a graph or node status, OR if the 4th token is ':', then
> >    the 1st token is the TOP
> > 3. Otherwise TOP must be missing
> >
> > I think this covers all the cases but let me know if I've missed
> > anything.
> >
> > --
> > -Michael Wayne Goodman
>
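
For anyone adapting a reader, here is a rough sketch of the lookahead
procedure quoted above. It is not PyDelphin's actual parser: the tokenizer
regex, the name find_top, and the simplification that any parenthesized
token counts as a graph status are assumptions made for illustration. It
follows the three steps as stated and treats \n like any other whitespace.

```python
import re
from itertools import islice

# Simplified tokenizer (an assumption, not the official EDS grammar).
TOKEN = re.compile(
    r'\([^)"]*\)'                 # graph status such as (fragmented)
    r'|[{}\[\]:|,]'               # single-character punctuation
    r'|[^\s{}\[\]:|,<(]+'         # symbol: identifier, predicate, or role
    r'(?:<[^>]*>)?'               # optional lnk such as <3:9>
    r'(?:\("[^"]*"\))?'           # optional constant argument
)


def is_status(token):
    """True for a node status ('|') or a parenthesized graph status."""
    return token == '|' or token.startswith('(')


def find_top(serialized):
    """Return the top identifier of one serialized EDS, or None."""
    tokens = TOKEN.finditer(serialized)
    first = next(tokens, None)
    if first is None or first.group() != '{':
        raise ValueError("expected '{' at the start of an EDS")
    # Up to 4 tokens of lookahead after the opening brace, padded with ''
    lookahead = [m.group() for m in islice(tokens, 4)]
    lookahead += [''] * (4 - len(lookahead))
    t1, t2, t3, t4 = lookahead
    # Step 1: the first token already shows that TOP is missing
    if t1 in (':', '}') or is_status(t1):
        return None
    # Step 2: a symbol and a colon, then a status or a second 'symbol :'
    if t2 != ':':
        raise ValueError("expected ':' after the first symbol")
    if is_status(t3) or t4 == ':':
        return t1
    # Step 3: otherwise the first symbol starts the first node
    return None


print(find_top('{e2: e2:_rain_v_1<3:9>[]}'))
print(find_top('{e2:_rain_v_1<3:9>[]}'))
```

Because whitespace of any kind is skipped between tokens, the same function
accepts the indented, one-node-per-line form and the single-line form.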


-- 
-Michael Wayne Goodman