[developers] TDL identifiers

Tue Oct 23 19:39:12 CEST 2018

Hi Francis, Stephan, and all,

Sorry to leave you in the dark. John and I were continuing the discussion
off-list to avoid spamming the list with our back-and-forth about some very
minor details (e.g., bugs in the syntax description at the TdlRfc wiki).
Our discussion eventually led to identifiers, which we should have brought
back to the list. I'll try to summarize.

We found ourselves playing whack-a-mole with various punctuation characters
in identifiers as we turned up ambiguities with other parts of the syntax
(which led to five exclusions from the previous set: ; ' " % and |), and
recognized that we only had a handful of printable punctuation characters
in the lower 128 ASCII set that weren't excluded: @ \ ` { } and ~. So we
thought we'd go with a whitelist instead of a blacklist, in the spirit of the
Krieger & Schäfer 1994 description but allowing for Unicode alphanumerics
(\w) instead of just [a-zA-Z0-9_] (plus the limited punctuation set
[*+?-]). John is testing his new TDL parser for the LKB, and I just
released PyDelphin v0.9.0 (perhaps too hastily) which used this whitelist
pattern. It seems we may have been too aggressive in restricting identifier
patterns, so I appreciate the feedback.

Yesterday John turned up issues with the middle dot in the KRG as well as
numeric characters like Ⅲ. The middle dot is considered punctuation by
unicode, so I was in favor of excluding it (considering our strategy for
dealing with spaces in identifiers, e.g., ad_hoc_a1, a+bit_det, etc.). The
numeric character was allowed by \w in Python's regular expressions with
the re.UNICODE flag, but the Lisp regex engine (cl-ppcre?) does not allow
it (as I understand it). Stephan makes a good point about \w being potentially
locale-dependent. In Python, \w matches alphanumeric characters in any
locale when it's used to match strings of unicode codepoints, but it *is*
locale-dependent if it is used with encoded byte strings. We also don't
mandate unicode normalization, meaning that Ⅰ (U+2160) and I (U+0049) could
be different identifiers (or Ω (U+03A9) and Ω (U+2126), Å (U+00C5) and Å
(U+0041 + U+030A), etc.).
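
The two points above (str versus bytes matching, and normalization) can
be checked with a short Python sketch. The helper same_after and the
variable names are invented for this illustration; only the standard re
and unicodedata modules are used.

```python
import re
import unicodedata

ROMAN_THREE = '\u2162'  # Ⅲ, a numeric character outside ASCII

# On str patterns, \w is Unicode-aware regardless of the locale.
str_match = re.fullmatch(r'\w+', ROMAN_THREE) is not None

# On bytes patterns, \w is ASCII-only unless re.LOCALE is given.
bytes_match = re.fullmatch(rb'\w+', ROMAN_THREE.encode('utf-8')) is not None

print('str:', str_match, 'bytes:', bytes_match)


def same_after(form: str, a: str, b: str) -> bool:
    """Compare two strings after applying the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


pairs = {
    'omega / ohm sign': ('\u03a9', '\u2126'),
    'A-ring / A + combining ring': ('\u00c5', 'A\u030a'),
    'roman numeral one / I': ('\u2160', 'I'),
}
for label, (a, b) in pairs.items():
    # Columns: raw equality, equality under NFC, equality under NFKC.
    print(label, a == b, same_after('NFC', a, b), same_after('NFKC', a, b))
```

None of the pairs is equal as raw strings. NFC merges the first two, but
Ⅰ and I only fall together under NFKC, because that mapping is a
compatibility decomposition rather than a canonical one.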

And all this work to produce a complete and definitive syntax of
DELPH-IN-style TDL is precisely to help ensure our tools are consistent.
The current LKB, for instance, does not really have a notion of TDL
identifiers because it parses them as Lisp identifiers (see the lkb-read
function called by read-tdl-type). John had some nice examples (on-list) of
identifiers that the LKB treated as equivalent despite differences in form.
I gave an example identifier that ACE parsed but the LKB did not. We are
trying to prevent these inconsistencies. Once we've all agreed on something
final, I'd like to see all processors updated to reflect our decisions.

If the whitelist approach is too restrictive, then the following is the
most recent blacklist we had, which I believe avoids ambiguity with other
parts of the TDL syntax:

    [^\s!"#$%&'(),.\/:;<=>[\]^|]+
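
As a quick check of how this pattern behaves in Python, the sketch below
compiles it verbatim and tries a few identifiers from this thread. The
helper name is_allowed is hypothetical, and other regex engines may treat
\s differently from Python's re module.

```python
import re

# The blacklist pattern above, copied verbatim into a raw string.
BLACKLIST = re.compile(r'''[^\s!"#$%&'(),.\/:;<=>[\]^|]+''')


def is_allowed(identifier: str) -> bool:
    """Return True if no blacklisted character occurs in the string."""
    return BLACKLIST.fullmatch(identifier) is not None


examples = [
    '和_c_⚠',              # warning sign
    'ザ・ベスト_n_1-tc',    # katakana middle dot
    '\uff0a-marker',       # full-width asterisk
    '\uff0c_c_1',          # full-width comma
    '\u00a0_n_1',          # leading non-breaking space
    'a.b',                 # ASCII dot is excluded
    'foo|bar',             # | is excluded
]
for ident in examples:
    print(repr(ident), is_allowed(ident))
```

In Python, \s on str patterns covers Unicode whitespace, so the
identifier with the non-breaking space is rejected while the full-width
punctuation, the middle dot, and the warning sign are all accepted.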

Francis: the difference in PyDelphin between regular identifiers and
coreferences (and now-deprecated single-quoted 'symbols) was unintentional.
I'll make a v0.9.1 patch release to fix this, once the issue regarding
identifier syntax is resolved.

On Tue, Oct 23, 2018 at 4:45 AM Francis Bond <bond at ieee.org> wrote:

> Thanks for the feedback!
>
> I tried to list the characters I thought should be excluded.   Did I miss
> any?
>
> On Tue, 23 Oct 2018, 19:37 Stephan Oepen, <oe at ifi.uio.no> wrote:
>
>> personally, i like all of your identifiers, francis, maybe even
>> including the one with a non-breaking space :-).
>>
>> i would think whether or not they are accepted by ‘\w’ depends on the
>> specific interpretation of ‘word characters’, which in turn may well
>> depend on your local set-up, i.e. the current locale.  from the
>> perlre(1) man page:
>>
>>        [...]Thus, under
>>        this modifier, the ASCII platform effectively becomes a Unicode
>>        platform; and hence, for example, "\w" will match any of the more
>>        than 100,000 word characters in Unicode.
>>
>>        Unlike most locales, which are specific to a language and country
>>        pair, Unicode classifies all the characters that are letters
>>        somewhere in the world as "\w".
>>
>> to avoid such dependencies on locale context, it might indeed be
>> simpler to define the syntax in terms of everything except a small
>> list of characters (that have operator-like status in TDL).  this is
>> more or less how the current lexers in the LKB and PET work, so might
>> also be easier to make consistent across platforms.
>>
>> cheers, oe
>>
>> On Tue, Oct 23, 2018 at 12:30 PM Francis Bond <bond at ieee.org> wrote:
>> >
>> > G'day,
>> >
>> > currently Zhong has several identifiers which Mike's TDL code
>> > considers invalid, but which the LKB and ACE are fine with:
>> >
>> > ＊-marker := symbol &
>> >  ，_c_1 := conj_-_e_le &
>> >  _n_1 := n_-_pn_le &
>> > 和_c_⚠ := conj_-_e_le &
>> > 格里姆斯比•罗伊洛特_n_1 := n_-_h_pn_le &
>> >
>> > full width *
>> > full width ,
>> > nonbreakspace  [our bad, I will remove]
>> > warning sign (which I like to use for mal-rules).
>> > dot (often used in foreign names)
>> >
>> > And in Jacy:
>> > ザ・ベスト_n_1-tc := ordinary-nohon-n-lex &
>> > full width dot (often used in foreign names)
>> >
>> > PyDelphin defines identifiers to be: ([\w_+*?-]+),
>> > and coreference to be  \#([^\s!"#$&'(),./:;<=>[\]^]+)
>> >
>> > It would be nice to at least include: ・•⚠, in identifiers, but maybe
>> > better to have a list of disallowed things (like coreference, now I
>> > guess with |):
>> >
>> > ([^\s!"#$&'(),./:;<=>[\]^|]+)
>> >
>> > and even better if the LKB, PET, ACE, AGREE and PyDelphin are
>> > consistent.
>> >
>> > What do people think?
>> >
>> > --
>> > Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
>> > Division of Linguistics and Multilingual Studies
>> > Nanyang Technological University
>> >
>>
>

-- 
-Michael Wayne Goodman