[developers] More TDL cobwebs

Mon Oct 15 21:42:51 CEST 2018

Hi John,

I'm glad we're in agreement on nearly every point, and I'm also interested
to hear the perspectives of grammar writers.

Regarding the DQStrings / DocStrings, the context can disambiguate some,
but not all, of these ambiguities. For the first one ( """"" a docstring
""" ), this should only be valid if the docstring is one that precedes the
final dot (otherwise there must be a & between "" and """):

    a-type := """"" a docstring """.

So I don't think this one is in fact ambiguous. The latest PyDelphin,
however, does not parse this correctly because it sees the first """ and
commits to parsing a docstring. Parsing it correctly would require more
lookahead or backtracking. It may be worth noting that it is much easier to
parse if the DQString is non-empty (i.e., """" (4 "s) is not ambiguous).

I'm not certain why the second example ( """ a docstring """"" ) is
ambiguous. It should be a docstring preceding a term which is an empty
string. Inside a docstring,
the first sequence of 3 unescaped double-quotes always terminates the
docstring.
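
To make that termination rule concrete, here is a minimal Python sketch
(not PyDelphin's actual code). The function name scan_docstring is
hypothetical, and it assumes a backslash is what escapes a double-quote
inside a docstring:

```python
def scan_docstring(text, start=0):
    """Return (body, end) for the DocString that opens at text[start].

    Assumption: a backslash escapes the character that follows it.
    """
    if not text.startswith('"""', start):
        raise ValueError("no docstring at position %d" % start)
    i = start + 3
    while i < len(text):
        if text[i] == "\\":
            i += 2  # skip the backslash and the escaped character
        elif text.startswith('"""', i):
            # the first unescaped run of three quotes closes the docstring
            return text[start + 3:i], i + 3
        else:
            i += 1
    raise ValueError("unterminated docstring")


example = '""" a docstring """""'
body, end = scan_docstring(example)
remainder = example[end:]  # whatever follows the closing quotes
```

Under this rule the body is ' a docstring ' and the remainder is the two
leftover quotes, i.e. an empty DQString as the following term.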

I agree that the final example ( """""""" ) is indeed ambiguous, but the
context helps sometimes:

    a-type := b & """""""" .  ; DQString followed by a DocString
    a-type := """""""" & b.  ; DocString followed by a DQString
    a-type := """""""" .  ; truly ambiguous

While these cases should never come up in practice, they represent an
annoying gap in the TDL syntax description. Maybe we can just add a note
saying that """ always starts/ends DocStrings, and if a grammar writer
wants an empty DQString followed by a DocString as the final term in a
top-level conjunction, they must insert a space between the two ( "" """a
docstring""" ). I think this could be enforced in the description by
modifying DQString to ensure an empty string is not followed by a
double-quote character:

    DQString := ( /""(?!")/ | /"([^"\\]|\\.)+"/ ) Spacing
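
As a quick check of the proposed rule, here is a small Python sketch that
transcribes the two alternatives into the re module's syntax (Spacing is
left out). The names DQSTRING and match_dqstring are only for
illustration and are not part of any existing tool:

```python
import re

# The two alternatives from the proposed DQString rule (Spacing omitted).
EMPTY = r'""(?!")'                 # empty string, not followed by another "
NONEMPTY = r'"(?:[^"\\]|\\.)+"'    # one or more plain or escaped characters
DQSTRING = re.compile(EMPTY + "|" + NONEMPTY)


def match_dqstring(text, pos=0):
    """Return the DQString starting at text[pos], or None if there is none."""
    m = DQSTRING.match(text, pos)
    return m.group(0) if m else None


spaced = match_dqstring('"" """a docstring"""')     # space after the ""
unspaced = match_dqstring('""""" a docstring """')  # no space after the ""
regular = match_dqstring('"abc"')
```

With the space, the empty string is matched as a DQString. Without it,
neither alternative matches at that position, so a parser would have to
read the leading """ as the start of a DocString.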

On Mon, Oct 15, 2018 at 8:05 AM John Carroll <J.A.Carroll at sussex.ac.uk>
wrote:

> Hi Mike,
>
> I didn't reply regarding :begin ... :end blocks because I haven't (yet)
> had a chance to think about them.
>
> Regarding identifiers, I'm happy with your option (A1) disallow | in
> identifiers. Could a grammar writer give an opinion on whether this could
> in future be a problem?
>
> As you say, the LKB currently reads identifiers via the Lisp 'read'
> function. This leads to incorrect behaviour since it accepts such horrors
> as the following (which are all regarded as being valid and equivalent
> coreferences):
>
>   #a\b\c
>   #a|bc|
>   # |Abc|
>
> To fix this I'm re-implementing the LKB's reading of identifiers.
>
> I also agree with your option (B2) Disallow coreferences outside of AVMs.
> I think the only use for a top-level coreference would be to produce a
> feature structure containing a cycle, which is not a valid FS.
>
> We're in agreement about the DQString / DocString issue: an implementation
> could restructure the BNF non-terminals in order to reduce the lookahead.
> However, here's a related oddity:
>
> """"" a docstring """
> """ a docstring """""
>
> are valid character sequences inside a top level conjunction (they consist
> of an empty DQString and a DocString, with the intervening Space being
> empty). So after consuming "", peeking a further " still doesn't
> disambiguate. Even more weirdly,
>
> """"""""
>
> cannot be disambiguated at all - although it's irrelevant which string is
> which. But anyhow, none of this matters in practice as long as there's no
> role for a DQString at the top level.
>
> John
>
> On 9 Oct 2018, at 21:38, goodman.m.w at gmail.com wrote:
>
> Hi John, replies are inline below...
>
> On Tue, Oct 9, 2018 at 6:24 AM John Carroll <J.A.Carroll at sussex.ac.uk>
> wrote:
>
>> Hi Mike,
>>
>> I've now got an LKB implementation of http://moin.delph-in.net/TdlRfc .
>>
>
> That's great to hear. Thanks for your work on this. Since you didn't reply
> to my latest message on this thread but to the one prior, I assume you did
> not see the updated TdlRfc with :begin ... :end blocks (though I noted that
> they are not currently supported by the LKB; also I fear I introduced
> ambiguity there as well with the InstanceDef rule), and also the note about
> the nesting of #| ... |# comments. These don't have much bearing on your
> questions, so I'll ignore these points for now.
>
>
>> Perhaps I'm missing some subtleties in the BNF, but it seems to me that
>> there is an unfortunate ambiguity between Coreference and BlockComment: the
>> character | is legal in the identifier part of a coreference, which means
>> that differentiating between Coreference and BlockComment inside the body
>> of a definition (or more precisely, inside a DocConj) requires
>> either unbounded lookahead or non-deterministic parsing. E.g. if we have
>> read the following at the start of a line
>>
>> [ SYNSEM #|aaaaaaaaaaaaaa
>>
>> then we don't know whether we're meant to be reading a coreference or a
>> block comment, and we won't know until we encounter the next whitespace
>> character or the characters |# (whichever comes first). For the moment my
>> code assumes that # is an attempt to start a block comment if we are at the
>> top level of a definition, otherwise the intention is to start
>> a coreference.
>>
> Good observation. This ambiguity is indeed unfortunate. In both the
> current LOGON-provided LKB and ACE (which I use to test), the 2-character
> pattern #| only begins a block comment (both give errors otherwise). The |
> character is not included in the "break-characters" in the Lisp code, but
> the LKB seems to rely on Lisp's 'read' function (via lkb-read) to parse
> identifiers (based on Lisp syntax I think) rather than these
> break-characters. I tried creating some type names with | at the initial,
> medial, and final positions, and ACE had no trouble with any of these,
> while the LKB wouldn't parse the first one:
>
>     |abc := *top*.
>     d|ef := *top*.
>     ghi| := *top*.
>
> Now as for a resolution, we could:
>
>   (A1) disallow | in identifiers
>   (A2) disallow | at the start of identifiers
>   (A3) disallow | at the start of coreference identifiers
>   (A4) mandate a parsing order; if | is encountered after # it must start
> a comment, otherwise a coreference
>
> (A4) is what I currently use, but I believe it is equivalent to (A3).
> I'm happy to go with any of these, even (A1), as I surveyed some grammars
> (ERG, GG, Jacy, SRG, HaG, and INDRA) and found no instances of | in type
> names (`grep -P '[ \w]\|[ \w]'`).
>
> You also drew my attention to another issue. My definition of TypedConj
> currently allows coreferences after the mandatory TypeName, but not before
> (because they are not allowed on FeatureConj). Should we:
>
>   (B1) Allow coreferences on FeatureConj (e.g., in FeatureTerm)
>   (B2) Disallow coreferences outside of AVMs
>
> I can't imagine when you'd want a top-level coreference, so (B2) seems
> like it would make a tighter syntax, but maybe there's a use case I'm not
> thinking of.
>
>
>>
>> There is also an ambiguity between DocString and DQString: at the top
>> level of a definition (DocConj again), the character sequence "" is
>> ambiguous between an empty DQString or the start of a DocString. Dealing
>> with this case correctly requires either 3 characters of lookahead or
>> non-deterministic parsing. For the moment my code assumes that " is an
>> attempt to start a DocString if we are at the top level of a definition,
>> otherwise the intention is to start a DQString. (This is related to your
>> previous observation that regular strings don't really appear in top-level
>> conjunctions).
>>
>
> I think this is a reasonable assumption. I'm using regular expressions for
> lexing before I parse, so 3 characters of lookahead isn't a big deal. Also,
> with the appropriate structure of the parsing functions I think you only
> need 1 character of lookahead: after consuming "" you only need to peek one
> character to decide if it's a docstring or an empty string. This
> "appropriate structure" might be too drastic a change from what we already
> have, unless perhaps docstrings were attributes of the Term instead of the
> TypeDef.
>
>
>
>> I hope these assumptions are OK? I've tested my new code with a few
>> grammars, and the only grammar that fails to load is JACY, due to just one
>> single-quoted docstring (on the type adv_adj_head-avm).
>>
>
> For Jacy there is in fact already a ticket for this issue:
> https://github.com/delph-in/jacy/issues/47
>
> I've just now pushed a change to Jacy that moves the docstring to a
> comment (to be moved to a triple-quoted docstring in the future).
>
>
>
>> John
>>
>>
>> On 8 Sep 2018, at 19:18, goodman.m.w at gmail.com wrote:
>>
>> Thanks John and Stephan,
>>
>> John, thanks for offering to clean up the LKB's TDL reading, and I'll
>> gladly leave the Lisping to the experts. If you're very concerned about
>> backwards compatibility, then it should be possible to accommodate both the
>> double-quoted and the triple-double-quoted variants. I don't think there's
>> any meaningful overlap between double-quoted docstrings and regular strings
>> because regular strings don't really appear in top-level conjunctions, and
>> even if they did the only case it would be ambiguous is if the string was
>> the only term in a type-addendum. But allowing for both double-quoted and
>> triple-double-quoted docstrings to accommodate the few, if any, grammars
>> that made use of them might be more trouble than it's worth.
>>
>> Rather, I think that Stephan's point about having a grammar's LKB script
>> require a certain version of the LKB makes more sense.
>>
>> With all these improvements and shared efforts, 2018 (or 2019) will
>> finally be the year of DELPH-IN on the desktop! ;)
>>
>> On Sat, Sep 8, 2018 at 7:11 AM Stephan Oepen <oe at ifi.uio.no> wrote:
>>
>>> colleagues,
>>>
>>> we put a mechanism into the LKB at some point to allow a grammar to
>>> require a minimum revision of the software: see near the top of
>>> 'lkb/script' in the ERG.
>>>
>>> i would suggest making the forthcoming release of the ERG require a
>>> modern version of the LKB, i.e. getting the TDL reader code adapted to
>>> support the new triple-quoted documentation strings, rebuilding the
>>> binaries in LOGON (my job) and the LinGO distribution (UW), and encouraging
>>> other grammar writers to also add a test of lkb-version-after-p() to their
>>> 'script' files.
>>>
>>> come to think of it, in preparing for a new ERG release, dan and i would
>>> often go through his accumulated patches to LKB code and consider
>>> opportunities for consolidation.  likewise for revisions or additions of
>>> [incr tsdb()] skeletons.  as a guiding principle, i would suggest it should
>>> be possible to exactly re-create the treebanks in each release using
>>> checked-in revisions of all the component pieces (data and software) at the
>>> time.
>>>
>>> best wishes, oe
>>>
>>>
>>> On Sat, 8 Sep 2018 at 13:58 John Carroll <J.A.Carroll at sussex.ac.uk>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Thanks for trying to fix the LKB.
>>>>
>>>> I think your TDL clean-ups are a very good idea. The new version of
>>>> read-tdl-type-comment in patches.lsp will indeed eventually make it into
>>>> the LKB proper. But I was concerned about not being able to patch existing
>>>> LKB binaries effectively. When I referred to backward compatibility, I was
>>>> thinking about LKB binaries in distributions that may never get updated,
>>>> e.g. http://www.cs.upc.edu/~padro/docker-logon.tgz and Knoppix+LKB .
>>>> This might not be too much of a problem in practice except that some LKB
>>>> error messages are poor or misleading.
>>>>
>>>> I'll have a go at making a minimal set of changes that could be put in
>>>> a patch file, and add a more considered reimplementation of TDL reading to
>>>> my todo list.
>>>>
>>>> John
>>>>
>>>> On 8 Sep 2018, at 00:09, goodman.m.w at gmail.com wrote:
>>>>
>>>> Hi again,
>>>>
>>>> I spent an hour or two editing patches.lsp to try and make it work, but
>>>> my lisp writing and debugging knowledge is too limited to figure it out
>>>> right now. Here's what I tried to do:
>>>>
>>>> * read-tdl-top-conjunction:
>>>>   - a copy of read-tdl-conjunction, except for the following...
>>>>   - call read-tdl-type-comment if peek-with-comments returns " before
>>>> calling read-tdl-defterm
>>>>   - append the pair (docstring . term) to the "constraint" variable
>>>> instead of just term
>>>> * read-tdl-avm-def:
>>>>   - remove the part about reading parents
>>>>   - expect a pair (docstring . term) from read-tdl-top-conjunction
>>>>   - append the docstring to the "comment" variable
>>>>   - extract the term as "unif" and proceed as before
>>>> * read-tdl-type-comment:
>>>>   - if it doesn't encounter """, it calls unread-char to put those
>>>> quotes back on the stream, because it may be a regular "string" or empty ""
>>>> string
>>>>   - don't print an error if the string doesn't start with """
>>>>
>>>> I only created read-tdl-top-conjunction so that I didn't have to
>>>> redefine all the other places where read-tdl-conjunction was used. Trying
>>>> to load the ERG with these changes gives me an "Unexpected unif" error when
>>>> it tries to load fundamentals.tdl.
>>>>
>>>> On Fri, Sep 7, 2018 at 11:59 AM goodman.m.w at gmail.com <
>>>> goodman.m.w at gmail.com> wrote:
>>>>
>>>>> Thanks for the feedback, John,
>>>>>
>>>>> While I appreciate your arguments and code, I am reluctant to agree
>>>>> with any changes now. The LKB has been a pioneer in allowing docstrings,
>>>>> but I don't think we should revert the work other developers have put into
>>>>> their processors in the last month, not to mention the hard-earned
>>>>> consensus over the color of this bike shed. Here are my reasons:
>>>>>
>>>>> 1. The agreed-upon syntax does not break backward compatibility
>>>>> (except regarding the number of quote characters), it only opens up new
>>>>> places where docstrings may occur (see (3))
>>>>>
>>>>> 2. The lack of support for docstrings outside of the LKB hindered
>>>>> their adoption, so backward compatibility isn't much of an issue given that
>>>>> grammar developers avoided using them (given this, maybe I should have
>>>>> pushed harder for docstrings immediately after := or :+... oh well).
>>>>>
>>>>> 3. The LKB's implementation that parses supertypes (or "parents" as
>>>>> used in the lisp code) before other terms is only half-baked. It first
>>>>> reads some type names, then looks for a docstring, then reads other terms,
>>>>> which may include more type names. I proposed making a change to the syntax
>>>>> so that type names must appear before other terms in a top-level
>>>>> conjunction, but the only replies I got addressing this point (from Stephan
>>>>> and Dan) opposed such a change. Thus, we agreed that type names have no
>>>>> special position in conjunctions. Because of this, saying that the
>>>>> docstring must occur before the AVM means little, because (a) the AVM may
>>>>> appear before a type name, and (b) there may be more than one AVM. For
>>>>> instance, the LKB (with the ERG's triple-quoted patch) currently accepts
>>>>> these:
>>>>>
>>>>>     a := b & c """doc""".
>>>>>     a := b & """doc""" c.
>>>>>     a := b & c & """doc""" [ Q r ].
>>>>>     a := b & """doc""" c & [ Q r ].
>>>>>     a := b & """doc""" [ Q r ] & c.
>>>>>
>>>>> but not these:
>>>>>
>>>>>     a := """doc""" b & c.
>>>>>     a := """doc""" b & c & [ Q r ].
>>>>>     a := b & c & [ Q r ] """doc""".
>>>>>
>>>>> Furthermore, it accepts:
>>>>>
>>>>>     a := b & c & [ Q r ].
>>>>>     a := b & [ Q r ] & c.
>>>>>
>>>>> but not:
>>>>>
>>>>>     a := [ Q r ] & b & c.
>>>>>
>>>>> I imagine a grammar developer (who doesn't browse the lisp code) would
>>>>> not find these facts consistent. It should either enforce that all
>>>>> supertypes appear before other terms, or allow them to mix freely.
>>>>>
>>>>> So, on the one hand, I think that the LKB is currently deficient WRT
>>>>> the above patterns (which are all allowed, according to current consensus).
>>>>> I may take a look at fixing the Lisp code, but it would take me a while. On
>>>>> the other hand, the LKB merely enforces the conventional layout of TDL
>>>>> definitions, so it is unlikely to cause problems for now.
>>>>>
>>>>> Finally, docstrings are desired for more than just the ERG, so the
>>>>> temporary solution in patches.lsp should eventually make it into the LKB
>>>>> proper. For instance, the read-tdl-avm-def and read-tdl-conjunction
>>>>> functions would need some changes and the read-tdl-type-parents function
>>>>> should probably just be removed.
>>>>>
>>>>> On Fri, Sep 7, 2018 at 4:58 AM John Carroll <J.A.Carroll at sussex.ac.uk>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I've been looking at TDL reading in the LKB, and (partly for
>>>>>> pragmatic reasons) I suggest restricting docstrings to occur only in the
>>>>>> position immediately preceding the AVM - or just before the final .
>>>>>> terminator if there is no AVM. Here are my reasons:
>>>>>>
>>>>>> 1. The LKB currently only allows docstrings in that position, and
>>>>>> changing this while retaining backward compatibility would require an
>>>>>> unreasonable amount of patching in a grammar lkb/patches.lsp file
>>>>>> 2. This position is analogous to where docstrings are allowed in
>>>>>> programming languages / docstring packages
>>>>>>
>>>>>> In the hope that this is acceptable, at least for the time being,
>>>>>> I've sent Dan a new version of his patch to change docstrings from
>>>>>> double-quoted to triple double-quoted in the LKB. The patch is attached in
>>>>>> case other grammar developers want to pick it up.
>>>>>>
>>>>>> John
>>>>>>
>>>>>> On 7 Sep 2018, at 00:29, goodman.m.w at gmail.com wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> There are some remaining issues with TDL that I'd like to clean up.
>>>>>> First I will summarize some decisions made (or at least not rejected) in
>>>>>> previous email threads:
>>>>>>
>>>>>> 1. Supertypes appear before other terms in a conjunction only by
>>>>>> convention (not enforced in the syntax)
>>>>>> 2. Docstrings are triple-quoted and may appear before any top-level
>>>>>> term or before the final . terminator
>>>>>> 3. Comments may appear in definitions anywhere that spaces can,
>>>>>> except within strings/regexes/affixing-patterns
>>>>>>
>>>>>> The following changes are things I think people agree with, so I'd
>>>>>> like to consider them as decided:
>>>>>>
>>>>>> 4. Removal of the :< operator (if accepted as a variant of :=, throw
>>>>>> a warning)
>>>>>> 5. Removal of 'single-quoted-symbols
>>>>>> 6. Removal of double-quoted "docstrings"
>>>>>> 7. Removal of non-regex uses of ^ (otherwise any BNF of TDL is
>>>>>> necessarily incomplete because the "extended-syntax" use of ^ is open-ended)
>>>>>>
>>>>>> And there's at least one point I don't think we reached a decision on:
>>>>>>
>>>>>> 8. Instances must have exactly 1 "supertype" (which is really just a
>>>>>> type and not a supertype, i.e., it doesn't change the type hierarchy)
>>>>>>
>>>>>> Also:
>>>>>>
>>>>>> 9. Does anyone know how wild-cards differ from letter-sets? I see HaG
>>>>>> has a wild-card and suffix pattern like these:
>>>>>>
>>>>>>     %(wild-card (?g ui))
>>>>>>     ...
>>>>>>     %suffix (!c!v !c!vn) (!v?g !vn)
>>>>>>
>>>>>> My guess is that wild-cards match but are not used in the
>>>>>> replacement, which I can imagine is useful if you want the replacement to
>>>>>> use the second of two matches but not the first. It makes me wonder why we
>>>>>> don't just use regex substitutions for these things.
>>>>>>
>>>>>> If nobody responds about (1)--(7), I'll make sure the syntax
>>>>>> description on the TdlRfc wiki reflects those decisions.
>>>>>>
>>>>>> --
>>>>>> -Michael Wayne Goodman

-- 
-Michael Wayne Goodman