[developers] More TDL cobwebs

goodman.m.w at gmail.com
Tue Oct 9 22:38:06 CEST 2018


Hi John, replies are inline below...

On Tue, Oct 9, 2018 at 6:24 AM John Carroll <J.A.Carroll at sussex.ac.uk>
wrote:

> Hi Mike,
>
> I've now got an LKB implementation of http://moin.delph-in.net/TdlRfc .
>

That's great to hear. Thanks for your work on this. Since you didn't reply
to my latest message on this thread but to the one prior, I assume you did
not see the updated TdlRfc with :begin ... :end blocks (though I noted that
they are not currently supported by the LKB; also I fear I introduced
ambiguity there as well with the InstanceDef rule), and also the note about
the nesting of #| ... |# comments. These don't have much bearing on your
questions, so I'll ignore these points for now.


> Perhaps I'm missing some subtleties in the BNF, but it seems to me that
> there is an unfortunate ambiguity between Coreference and BlockComment: the
> character | is legal in the identifier part of a coreference, which means
> that differentiating between Coreference and BlockComment inside the body
> of a definition (or more precisely, inside a DocConj) requires
> either unbounded lookahead or non-deterministic parsing. E.g. if we have
> read the following at the start of a line
>
> [ SYNSEM #|aaaaaaaaaaaaaa
>
> then we don't know whether we're meant to be reading a coreference or a
> block comment, and we won't know until we encounter the next whitespace
> character or the characters |# (whichever comes first). For the moment my
> code assumes that # is an attempt to start a block comment if we are at the
> top level of a definition, otherwise the intention is to start
> a coreference.
>

Good observation. This ambiguity is indeed unfortunate. In both the current
LOGON-provided LKB and ACE (which I use to test), the 2-character pattern
#| only begins a block comment (both give errors otherwise). The |
character is not included in the "break-characters" in the Lisp code, but
the LKB seems to rely on Lisp's 'read' function (via lkb-read) to parse
identifiers (based on Lisp syntax I think) rather than these
break-characters. I tried creating some type names with | at the initial,
medial, and final positions, and ACE had no trouble with any of these,
while the LKB wouldn't parse the first one:

    |abc := *top*.
    d|ef := *top*.
    ghi| := *top*.

Now as for a resolution, we could:

  (A1) disallow | in identifiers
  (A2) disallow | at the start of identifiers
  (A3) disallow | at the start of coreference identifiers
  (A4) mandate a parsing order; if | is encountered after # it must start a
comment, otherwise a coreference

(A4) is what I currently use, but I believe it is equivalent to (A3). I'm
happy to go with any of these, even (A1), as I surveyed some grammars (ERG,
GG, Jacy, SRG, HaG, and INDRA) and found no instances of | in type names
(`grep -P '[ \w]\|[ \w]'`).
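
To make (A4) concrete, here is a minimal Python sketch of that parsing
order. The function name lex_hash and the token labels are made up for
illustration and are not taken from the LKB, ACE, or any other processor;
the set of identifier delimiters is a simplifying assumption, and block
comments are assumed not to nest.

```python
import re

# Simplified assumption: an identifier runs up to whitespace or one of a
# few TDL delimiters. Note that | is *not* a delimiter here, so it may
# still appear in medial or final position.
IDENT = re.compile(r'[^\s\[\]<>(),.&#]+')


def lex_hash(text, pos):
    """Lex the token starting at text[pos], which must be '#'.

    Returns (kind, value, next_pos), where kind is 'COMMENT' or 'COREF'.
    """
    if text[pos] != '#':
        raise ValueError('expected # at position %d' % pos)
    # (A4): test for #| first, so that it can only ever begin a comment.
    if text.startswith('#|', pos):
        end = text.find('|#', pos + 2)
        if end == -1:
            raise ValueError('unterminated block comment')
        return ('COMMENT', text[pos + 2:end], end + 2)
    # Anything else after # must be a coreference identifier.
    match = IDENT.match(text, pos + 1)
    if match is None:
        raise ValueError('expected an identifier after #')
    return ('COREF', match.group(), match.end())
```

With this ordering, "#|aaa" is never tried as a coreference, which is why
I think (A4) and (A3) accept the same inputs; a coreference like "#a|b"
is unaffected.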

You also drew my attention to another issue. My definition of TypedConj
currently allows coreferences after the mandatory TypeName, but not before
(because they are not allowed on FeatureConj). Should we:

  (B1) Allow coreferences on FeatureConj (e.g., in FeatureTerm)
  (B2) Disallow coreferences outside of AVMs

I can't imagine when you'd want a top-level coreference, so (B2) seems like
it would make a tighter syntax, but maybe there's a use case I'm not
thinking of.


>
> There is also an ambiguity between DocString and DQString: at the top
> level of a definition (DocConj again), the character sequence "" is
> ambiguous between an empty DQString or the start of a DocString. Dealing
> with this case correctly requires either 3 characters of lookahead or
> non-deterministic parsing. For the moment my code assumes that " is an
> attempt to start a DocString if we are at the top level of a definition,
> otherwise the intention is to start a DQString. (This is related to your
> previous observation that regular strings don't really appear in top-level
> conjunctions).
>

I think this is a reasonable assumption. I'm using regular expressions for
lexing before I parse, so 3 characters of lookahead isn't a big deal. Also,
with the appropriate structure of the parsing functions I think you only
need 1 character of lookahead: after consuming "" you only need to peek one
character to decide if it's a docstring or an empty string. This
"appropriate structure" might be too drastic a change from what we already
have, unless perhaps docstrings were attributes of the Term instead of the
TypeDef.
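
Roughly what I mean by the 1-character peek, as a Python sketch. The
function name lex_quoted and the token labels are invented for
illustration, escape sequences inside strings are ignored, and this is not
how the LKB or any existing processor is actually structured.

```python
def lex_quoted(text, pos):
    """Lex a DQString or DocString starting at text[pos] == '"'.

    Returns (kind, value, next_pos), where kind is 'STRING' or 'DOCSTRING'.
    """
    if text[pos] != '"':
        raise ValueError('expected " at position %d' % pos)
    if text.startswith('""', pos):
        # Two quotes have been read; peek at exactly one more character.
        if text[pos + 2:pos + 3] == '"':
            end = text.find('"""', pos + 3)
            if end == -1:
                raise ValueError('unterminated docstring')
            return ('DOCSTRING', text[pos + 3:end], end + 3)
        # No third quote, so this was an empty regular string.
        return ('STRING', '', pos + 2)
    # A single opening quote: read up to the next quote character.
    end = text.find('"', pos + 1)
    if end == -1:
        raise ValueError('unterminated string')
    return ('STRING', text[pos + 1:end], end + 1)
```

The point is only that one function handles both kinds of quoted token, so
the decision can be deferred until after the second quote has been read.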



> I hope these assumptions are OK? I've tested my new code with a few
> grammars, and the only grammar that fails to load is JACY, due to just one
> single-quoted docstring (on the type adv_adj_head-avm).
>

For Jacy there is in fact already a ticket for this issue:
https://github.com/delph-in/jacy/issues/47

I've just now pushed a change to Jacy that moves the docstring to a comment
(to be moved to a triple-quoted docstring in the future).



> John
>
>
> On 8 Sep 2018, at 19:18, goodman.m.w at gmail.com wrote:
>
> Thanks John and Stephan,
>
> John, thanks for offering to clean up the LKB's TDL reading, and I'll
> gladly leave the Lisping to the experts. If you're very concerned about
> backwards compatibility, then it should be possible to accommodate both the
> double-quoted and the triple-double-quoted variants. I don't think there's
> any meaningful overlap between double-quoted docstrings and regular strings
> because regular strings don't really appear in top-level conjunctions, and
> even if they did the only case it would be ambiguous is if the string was
> the only term in a type-addendum. But allowing for both double-quoted and
> triple-double-quoted docstrings to accommodate the few, if any, grammars
> that made use of them might be more trouble than it's worth.
>
> Rather, I think that Stephan's point about having a grammar's LKB script
> require a certain version of the LKB makes more sense.
>
> With all these improvements and shared efforts, 2018 (or 2019) will
> finally be the year of DELPH-IN on the desktop! ;)
>
> On Sat, Sep 8, 2018 at 7:11 AM Stephan Oepen <oe at ifi.uio.no> wrote:
>
>> colleagues,
>>
>> we put a mechanism into the LKB at some point to allow a grammar to
>> require a minimum revision of the software: see near the top of
>> 'lkb/script' in the ERG.
>>
>> i would suggest making the forthcoming release of the ERG require a
>> modern version of the LKB, i.e. getting the TDL reader code adapted to
>> support the new triple-quoted documentation strings, rebuilding the
>> binaries in LOGON (my job) and the LinGO distribution (UW), and encouraging
>> other grammar writers to also add a test of lkb-version-after-p() to their
>> 'script' files.
>>
>> come to think of it, in preparing for a new ERG release, dan and i would
>> often go through his accumulated patches to LKB code and consider
>> opportunities for consolidation.  likewise for revisions or additions of
>> [incr tsdb()] skeletons.  as a guiding principle, i would suggest it should
>> be possible to exactly re-create the treebanks in each release using
>> checked-in revisions of all the component pieces (data and software) at the
>> time.
>>
>> best wishes, oe
>>
>>
>> On Sat, 8 Sep 2018 at 13:58 John Carroll <J.A.Carroll at sussex.ac.uk>
>> wrote:
>>
>>> Hi,
>>>
>>> Thanks for trying to fix the LKB.
>>>
>>> I think your TDL clean-ups are a very good idea. The new version of
>>> read-tdl-type-comment in patches.lsp will indeed eventually make it into
>>> the LKB proper. But I was concerned about not being able to patch existing
>>> LKB binaries effectively. When I referred to backward compatibility, I was
>>> thinking about LKB binaries in distributions that may never get updated,
>>> e.g. http://www.cs.upc.edu/~padro/docker-logon.tgz and Knoppix+LKB .
>>> This might not be too much of a problem in practice except that some LKB
>>> error messages are poor or misleading.
>>>
>>> I'll have a go at making a minimal set of changes that could be put in a
>>> patch file, and add a more considered reimplementation of TDL reading to my
>>> todo list.
>>>
>>> John
>>>
>>> On 8 Sep 2018, at 00:09, goodman.m.w at gmail.com wrote:
>>>
>>> Hi again,
>>>
>>> I spent an hour or two editing patches.lsp to try and make it work, but
>>> my lisp writing and debugging knowledge is too limited to figure it out
>>> right now. Here's what I tried to do:
>>>
>>> * read-tdl-top-conjunction:
>>>   - a copy of read-tdl-conjunction, except for the following...
>>>   - call read-tdl-type-comment if peek-with-comments returns " before
>>> calling read-tdl-defterm
>>>   - append the pair (docstring . term) to the "constraint" variable
>>> instead of just term
>>> * read-tdl-avm-def:
>>>   - remove the part about reading parents
>>>   - expect a pair (docstring . term) from read-tdl-top-conjunction
>>>   - append the docstring to the "comment" variable
>>>   - extract the term as "unif" and proceed as before
>>> * read-tdl-type-comment:
>>>   - if it doesn't encounter """, it calls unread-char to put those
>>> quotes back on the stream, because it may be a regular "string" or empty ""
>>> string
>>>   - don't print an error if the string doesn't start with """
>>>
>>> I only created read-tdl-top-conjunction so that I didn't have to
>>> redefine all the other places where read-tdl-conjunction was used. Trying
>>> to load the ERG with these changes gives me an "Unexpected unif" error when
>>> it tries to load fundamentals.tdl.
>>>
>>> On Fri, Sep 7, 2018 at 11:59 AM goodman.m.w at gmail.com <
>>> goodman.m.w at gmail.com> wrote:
>>>
>>>> Thanks for the feedback, John,
>>>>
>>>> While I appreciate your arguments and code, I am reluctant to agree
>>>> with any changes now. The LKB has been a pioneer in allowing docstrings,
>>>> but I don't think we should revert the work other developers have put into
>>>> their processors in the last month, not to mention the hard-earned
>>>> consensus over the color of this bike shed. Here are my reasons:
>>>>
>>>> 1. The agreed-upon syntax does not break backward compatibility (except
>>>> regarding the number of quote characters), it only opens up new places
>>>> where docstrings may occur (see (3))
>>>>
>>>> 2. The lack of support for docstrings outside of the LKB hindered their
>>>> adoption, so backward compatibility isn't much of an issue given that
>>>> grammar developers avoided using them (given this, maybe I should have
>>>> pushed harder for docstrings immediately after := or :+... oh well).
>>>>
>>>> 3. The LKB's implementation that parses supertypes (or "parents" as
>>>> used in the lisp code) before other terms is only half-baked. It first
>>>> reads some type names, then looks for a docstring, then reads other terms,
>>>> which may include more type names. I proposed making a change to the syntax
>>>> so that type names must appear before other terms in a top-level
>>>> conjunction, but the only replies I got addressing this point (from Stephan
>>>> and Dan) opposed such a change. Thus, we agreed that type names have no
>>>> special position in conjunctions. Because of this, saying that the
>>>> docstring must occur before the AVM means little, because (a) the AVM may
>>>> appear before a type name, and (b) there may be more than one AVM. For
>>>> instance, the LKB (with the ERG's triple-quoted patch) currently accepts
>>>> these:
>>>>
>>>>     a := b & c """doc""".
>>>>     a := b & """doc""" c.
>>>>     a := b & c & """doc""" [ Q r ].
>>>>     a := b & """doc""" c & [ Q r ].
>>>>     a := b & """doc""" [ Q r ] & c.
>>>>
>>>> but not these:
>>>>
>>>>     a := """doc""" b & c.
>>>>     a := """doc""" b & c & [ Q r ].
>>>>     a := b & c & [ Q r ] """doc""".
>>>>
>>>> Furthermore, it accepts:
>>>>
>>>>     a := b & c & [ Q r ].
>>>>     a := b & [ Q r ] & c.
>>>>
>>>> but not:
>>>>
>>>>     a := [ Q r ] & b & c.
>>>>
>>>> I imagine a grammar developer (who doesn't browse the lisp code) would
>>>> not find these facts consistent. It should either enforce that all
>>>> supertypes appear before other terms, or allow them to mix freely.
>>>>
>>>> So, on the one hand, I think that the LKB is currently deficient WRT
>>>> the above patterns (which are all allowed, according to current consensus).
>>>> I may take a look at fixing the Lisp code, but it would take me a while. On
>>>> the other hand, the LKB merely enforces the conventional layout of TDL
>>>> definitions, so it is unlikely to cause problems for now.
>>>>
>>>> Finally, docstrings are desired for more than just the ERG, so the
>>>> temporary solution in patches.lsp should eventually make it into the LKB
>>>> proper. For instance, the read-tdl-avm-def and read-tdl-conjunction
>>>> functions would need some changes and the read-tdl-type-parents function
>>>> should probably just be removed.
>>>>
>>>> On Fri, Sep 7, 2018 at 4:58 AM John Carroll <J.A.Carroll at sussex.ac.uk>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I've been looking at TDL reading in the LKB, and (partly for pragmatic
>>>>> reasons) I suggest restricting docstrings to occur only in the position
>>>>> immediately preceding the AVM - or just before the final . terminator if
>>>>> there is no AVM. Here are my reasons:
>>>>>
>>>>> 1. The LKB currently only allows docstrings in that position, and
>>>>> changing this while retaining backward compatibility would require an
>>>>> unreasonable amount of patching in a grammar lkb/patches.lsp file
>>>>> 2. This position is analogous to where docstrings are allowed in
>>>>> programming languages / docstring packages
>>>>>
>>>>> In the hope that this is acceptable, at least for the time being, I've
>>>>> sent Dan a new version of his patch to change docstrings from double-quoted
>>>>> to triple double-quoted in the LKB. The patch is attached in case other
>>>>> grammar developers want to pick it up.
>>>>>
>>>>> John
>>>>>
>>>>> On 7 Sep 2018, at 00:29, goodman.m.w at gmail.com wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> There are some remaining issues with TDL that I'd like to clean up.
>>>>> First I will summarize some decisions made (or at least not rejected) in
>>>>> previous email threads:
>>>>>
>>>>> 1. Supertypes appear before other terms in a conjunction only by
>>>>> convention (not enforced in the syntax)
>>>>> 2. Docstrings are triple-quoted and may appear before any top-level
>>>>> term or before the final . terminator
>>>>> 3. Comments may appear in definitions anywhere that spaces can, except
>>>>> within strings/regexes/affixing-patterns
>>>>>
>>>>> The following changes are things I think people agree with, so I'd
>>>>> like to consider them as decided:
>>>>>
>>>>> 4. Removal of the :< operator (if accepted as a variant of :=, throw a
>>>>> warning)
>>>>> 5. Removal of 'single-quoted-symbols
>>>>> 6. Removal of double-quoted "docstrings"
>>>>> 7. Removal of non-regex uses of ^ (otherwise any BNF of TDL is
>>>>> necessarily incomplete because the "extended-syntax" use of ^ is open-ended)
>>>>>
>>>>> And there's at least one point I don't think we reached a decision on:
>>>>>
>>>>> 8. Instances must have exactly 1 "supertype" (which is really just a
>>>>> type and not a supertype, i.e., it doesn't change the type hierarchy)
>>>>>
>>>>> Also:
>>>>>
>>>>> 9. Does anyone know how wild-cards differ from letter-sets? I see HaG
>>>>> has a wild-card and suffix pattern like these:
>>>>>
>>>>>     %(wild-card (?g ui))
>>>>>     ...
>>>>>     %suffix (!c!v !c!vn) (!v?g !vn)
>>>>>
>>>>> My guess is that wild-cards match but are not used in the replacement,
>>>>> which I can imagine is useful if you want the replacement to use the second
>>>>> of two matches but not the first. It makes me wonder why we don't just use
>>>>> regex substitutions for these things.
>>>>>
>>>>> If nobody responds about (1)--(7), I'll make sure the syntax
>>>>> description on the TdlRfc wiki reflects those decisions.
>>>>>
>>>>> --
>>>>> -Michael Wayne Goodman
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> -Michael Wayne Goodman
>>>>
>>>
>>>
>>> --
>>> -Michael Wayne Goodman
>>>
>>>
>>>
>
> --
> -Michael Wayne Goodman
>
>
>

-- 
-Michael Wayne Goodman