[developers] Serializing EDS without a top

goodman.m.w at gmail.com goodman.m.w at gmail.com
Wed Dec 30 03:19:10 UTC 2020


Hello, just a brief update about EDS serialization with PyDelphin.

I took out, for now, the code that creates an unlinked TOP as described in
the previous message. The other changes (indentation, predicate
modification, blank lines) remain. Since PyDelphin now indents EDS with
newlines by default, those without any TOP or INDEX will only have '{' on
the first line, which I think is how the LKB behaves, based on our
discussions. Inserting an unlinked top seems like it might be useful as a
more general (not just EDS) future extension, if there's a need.
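
To make the output shape concrete, here is a minimal stand-alone sketch of the two layouts. It is not PyDelphin's codec: the `serialize` function, the tuple layout for nodes, and the exact spacing of the single-line form are assumptions made for illustration only.

```python
def serialize(top, nodes, indent=True):
    """Serialize a toy EDS.

    `nodes` is a list of (node_id, predicate_with_lnk, edges) tuples, where
    `edges` is a list of (role, target_id) pairs. This layout is hypothetical
    and only serves the illustration.
    """
    # With no top, the opening brace stands alone.
    head = '{' if top is None else '{' + top + ':'
    body = []
    for node_id, pred, edges in nodes:
        args = ', '.join(f'{role} {target}' for role, target in edges)
        body.append(f'{node_id}:{pred}[{args}]')
    if indent:
        # One node per line, each indented by a single space.
        return '\n'.join([head] + [' ' + line for line in body] + ['}'])
    # Single-line variant: the same pieces joined with spaces.
    return ' '.join([head] + body + ['}'])


nodes = [('e2', '_rain_v_1<3:9>', [])]
print(serialize(None, nodes))
print(serialize(None, nodes, indent=False))
```

With `top=None` and `indent=True`, the first line of the result is just '{', as described above.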

I've just released PyDelphin 1.5.0 with these changes.

On Fri, Dec 18, 2020 at 4:10 PM goodman.m.w at gmail.com <goodman.m.w at gmail.com>
wrote:

> Thanks for the response, Stephan,
>
> On Thu, Dec 17, 2020 at 6:39 PM Stephan Oepen <oe at ifi.uio.no> wrote:
>
>> [...]
>> in a nutshell, EDS native serialization is indeed line-oriented, and i
>> am inclined to hold fast on the one-node-per-line convention.  i would
>> not want to muddy these waters, since the format has been around since
>> 2002, and there has been some EDS activity beyond DELPH-IN.  i know of
>> at least two EDS readers that rely on the presence of line breaks.
>>
>
> Ok, sounds good. Then perhaps my previous message may be informative if
> the maintainer(s) of those two readers ever decide to embrace the
> convenience of single-line EDS. Other than determining the top of the
> graph, adapting the readers should be trivial: just treat \n as any other
> whitespace.
>
>> i do see the benefits of a more compact serialization, however, but
>> would recommend you call that something else (say EDSLines), if you
>> decide to implement it in pyDelphin.
>
>
> It's been implemented for some time now. In fact all codecs have a -lines
> variant (simplemrs -> simplemrs-lines, dmrx -> dmrx-lines, etc.). E.g., in
> the case of XML formats, it outputs each item (<mrs> or <dmrs>) on a line
> and suppresses the root nodes (<mrs-list>, <dmrs-list>).
>
>
>> [...]
>> {_: e2:_rain_v_1<3:9>[] e3:_heavy_a_1<10:42>[ARG1 e2] }
>> {\n e2:_rain_v_1<3:9>[]\n e3:_heavy_a_1<10:42>[ARG1 e2]\n }
>> {: e2:_rain_v_1<3:9>[] e3:_heavy_a_1<10:42>[ARG1 e2] }
>>
>> the above order reflects what i believe would be my personal ranking
>> just now :-).  i frequently use underscores for ‘anonymous’ MRS
>> variables, and the first variant feels maybe most natural: there
>> should be a top identifier, but in this case it is missing.
>
>
> The 'anonymous' node identifier for a fake top is fine and, conveniently,
> PyDelphin can already read in this variant. The difference is that '_' is a
> valid identifier in EDS, so it's not actually missing, just unlinked. I
> think logically an unlinked top is the same as a null top, but this means
> that PyDelphin may write an EDS that, upon re-reading the serialization,
> is different (in terms of Python data structures) from the source EDS.
>
>
>
>> The second variant also would seem to maintain compatibility with the
>> native EDS serialization, only introducing an inline encoding of line
>> breaks.
>
>
> Inserting a literal '\' and 'n' is awkward and changes the format, and I
> don't see how it's compatible at all besides having '\' and 'n' in the same
> location as your preferred newline characters.
>
>
>> variant #3, on the other hand, i believe would depart from
>> how native serialization deals with missing tops; thus, if you were to
>> opt for this format, it would be even more important to maintain a
>> clear distinction between EDS native serialization and the pyDelphin
>> EDSLines format.
>>
>
> If the thing between the first '{' and the first ':' is the top
> identifier, then if nothing is there the top is null. This is easy to parse
> and (I thought) easy to understand. As EDS native serialization from
> PyDelphin has done this for some time, I will continue to read it in, but
> going forward I will not write it out. As of the latest commit, I just omit
> the top entirely, which is what your newline-ful variant would do if it
> were simply newline-less (see the last EDS of my first message). I have
> written, but have not yet pushed to GitHub, a change that inserts an
> anonymous '_' top if the top is null (if '_' is already used by some node,
> I try '_0', then '_1', etc. until I get an unused one).
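
A minimal sketch of that fallback, not the unpushed change itself. The function name `anonymous_top` and the idea of passing the node identifiers as a plain iterable of strings are assumptions for illustration:

```python
from itertools import count


def anonymous_top(used_ids):
    """Return an identifier for a fake top that no node already uses.

    Tries '_' first, then '_0', '_1', and so on.
    """
    used = set(used_ids)
    if '_' not in used:
        return '_'
    for i in count():
        candidate = f'_{i}'
        if candidate not in used:
            return candidate


print(anonymous_top(['e2', 'e3']))
print(anonymous_top(['_', '_0', 'e2']))
```

Because the set of node identifiers is finite, the loop always terminates.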
>
> I have also made the following changes (which I think you'll be happy
> with):
> - The default serialization is now indented with newlines (and this is
> true of all codecs); use eds-lines to get the single-line variant
> - Conversion from MRS now uses predicate modification by default
> - Blank lines are inserted between indented EDSs (not sure if your readers
> actually require this)
>
>
>
>>
>> i hope the above makes sense to you?  oe
>>
>>
>> On Wed, Dec 16, 2020 at 10:41 AM goodman.m.w at gmail.com
>> <goodman.m.w at gmail.com> wrote:
>> >
>> > Hello developers,
>> >
>> > It's been a while but I'm returning to a discussion we were having about
>> > serializing EDS in the native format when there is no TOP and when there's
>> > no INDEX to back off to. Stephan suggested that EDS is a line-based format
>> > (i.e., line breaks are required), while I would like to continue to support
>> > single-line EDS in PyDelphin. I think the last word on the subject from
>> > Stephan, at least on this list, was mid-September
>> > (http://lists.delph-in.net/archives/developers/2020/003140.html), where he
>> > said he'd continue discussion on another thread, which presumably meant the
>> > thread from late August
>> > (http://lists.delph-in.net/archives/developers/2020/003127.html). I don't
>> > think the discussion did continue, so I'm starting this thread in case
>> > anyone is interested.
>> >
>> > As an example, here's an EDS (without properties) for "It rained."
>> >
>> >     {e2:
>> >      e2:_rain_v_1<3:9>[]
>> >     }
>> >
>> > In PyDelphin, when an EDS has no TOP, I was outputting the first colon
>> > anyway, intentionally:
>> >
>> >     {:
>> >      e2:_rain_v_1<3:9>[]
>> >     }
>> >
>> > It's a bit ugly, but it allows me to detect, with 1 token of lookahead,
>> > if there's a top or not. If the colon is omitted then it's not clear if
>> > "e2:" is the top or the start of the first node. If line breaks are
>> > required, we just assume the first line is for the top, whether or not
>> > it's there. But for single-line EDS, we need 4 tokens of lookahead to
>> > determine if there's a top (assuming the parser treats variables and
>> > predicates as the same kinds of tokens):
>> >
>> >     {e2: e2:_rain_v_1<3:9>[]}
>> >     {e2:_rain_v_1<3:9>[]}
>> >
>> > Here is the parsing algorithm, once we've consumed the first '{':
>> >
>> > 1. If the 1st lookahead token is ':', '(fragmented)' (or another graph
>> >    status), '}', or '|' (node status), then we know that TOP is missing
>> >    (the ':' is for PyDelphin's current output)
>> > 2. Otherwise the 1st and 2nd tokens must be a symbol and a colon, and if
>> >    the 3rd token is a graph or node status, OR if the 4th token is ':',
>> >    then the 1st token is the TOP
>> > 3. Otherwise TOP must be missing
>> >
>> > I think this covers all the cases but let me know if I've missed
>> > anything.
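
For readers who want to try the three steps quoted above, here is a rough stand-alone sketch. It is not PyDelphin's actual reader: the token regular expression, the `find_top` name, and the assumption that graph statuses look like a lowercase word in parentheses are all mine, chosen only to make the lookahead logic runnable.

```python
import re

# Token patterns, tried in order. Whitespace (including newlines) matches
# none of them, so findall() skips it and '\n' is treated like any space.
_TOKEN = re.compile(
    r'\([a-z-]+\)'          # graph status such as (fragmented)
    r'|<[^>]*>'             # lnk span such as <3:9>
    r'|[{}:\[\]|,]'         # structural punctuation
    r'|[^\s{}:\[\]|<>,]+'   # symbols: identifiers, predicates, roles
)
_STATUS = re.compile(r'\([a-z-]+\)')


def _is_status(tok):
    """True for a graph status like '(fragmented)' or the node status '|'."""
    return tok is not None and (tok == '|' or bool(_STATUS.fullmatch(tok)))


def find_top(text):
    """Return the top identifier of a serialized EDS, or None if missing."""
    tokens = _TOKEN.findall(text)
    if not tokens or tokens[0] != '{':
        raise ValueError("EDS must start with '{'")
    lookahead = tokens[1:5]
    lookahead += [None] * (4 - len(lookahead))
    t1, t2, t3, t4 = lookahead
    # Step 1: one token is enough to see that the top is missing.
    if t1 in (':', '}') or _is_status(t1):
        return None
    # Step 2: a symbol and a colon, then a status or a second colon.
    if t2 != ':':
        raise ValueError('expected a symbol followed by a colon')
    if _is_status(t3) or t4 == ':':
        return t1
    # Step 3: the first symbol was a node identifier, not a top.
    return None


print(find_top('{e2: e2:_rain_v_1<3:9>[]}'))
print(find_top('{e2:_rain_v_1<3:9>[]}'))
```

Since the tokenizer discards all whitespace, the same function handles both the indented and the single-line forms.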
>> >
>> > --
>> > -Michael Wayne Goodman
>>
>
>
> --
> -Michael Wayne Goodman
>


-- 
-Michael Wayne Goodman