[developers] TDL identifiers

Stephan Oepen oe at ifi.uio.no
Tue Oct 23 13:36:53 CEST 2018


personally, i like all of your identifiers, francis, maybe even
including the one with a non-breaking space :-).

i would think whether or not they are accepted by ‘\w’ depends on the
specific interpretation of ‘word characters’, which in turn may well
depend on your local set-up, i.e. the current locale.  from the
perlre(1) man page:

       [...]Thus, under
       this modifier, the ASCII platform effectively becomes a Unicode
       platform; and hence, for example, "\w" will match any of the more than
       100,000 word characters in Unicode.

       Unlike most locales, which are specific to a language and country pair,
       Unicode classifies all the characters that are letters somewhere in the
       world as "\w".
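
as a rough illustration in python rather than perl (my choice for the example only; python's re module uses the flag re.ASCII where perl has its own modifiers and locale handling), the same ‘\w’ accepts or rejects a CJK character depending on the matching mode:

```python
import re

# python analogue only, not perl: for str patterns, \w is unicode-aware
# by default, while re.ASCII narrows it to [a-zA-Z0-9_].
UNICODE_WORD = re.compile(r"\w+")
ASCII_WORD = re.compile(r"\w+", re.ASCII)


def is_word(pattern, text):
    """Return True if all of text consists of word characters for pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["n_1", "和_c_1"]:
        print(text, is_word(UNICODE_WORD, text), is_word(ASCII_WORD, text))
```

here is_word, UNICODE_WORD and ASCII_WORD are names made up for the sketch; the point is only that one and the same pattern gives different answers under different settings.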

to avoid such dependencies on locale context, it might indeed be
simpler to define the syntax in terms of everything except a small
list of characters (that have operator-like status in TDL).  this is
more or less how the current lexers in the LKB and PET work, so it might
also be easier to make consistent across platforms.
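
a minimal sketch of the difference, again in python and under python's default unicode matching: the two patterns are the ones given in francis's message, while the helper name accepted and the sample list are assumptions made for the example, and the full-width asterisk and no-break space are spelled out as escapes based on his descriptions.

```python
import re

# inclusion-style pattern, as quoted for PyDelphin in francis's message
INCLUDE = re.compile(r"[\w_+*?-]+")

# exclusion-style pattern, as proposed in francis's message
# ('[' is escaped here; the set of excluded characters is the same)
EXCLUDE = re.compile(r'''[^\s!"#$&'(),./:;<=>\[\]^|]+''')


def accepted(pattern, identifier):
    """Return True if the pattern matches the identifier in full."""
    return pattern.fullmatch(identifier) is not None


# identifiers from the thread; the two escapes are U+FF0A (full-width
# asterisk) and U+00A0 (no-break space), reconstructed from the descriptions
SAMPLES = [
    "n_-_pn_le",
    "和_c_⚠",
    "格里姆斯比•罗伊洛特_n_1",
    "ザ・ベスト_n_1-tc",
    "\uff0a-marker",
    "\u00a0_n_1",
]

if __name__ == "__main__":
    for s in SAMPLES:
        print(repr(s), accepted(INCLUDE, s), accepted(EXCLUDE, s))
```

in this sketch the exclusion-style pattern accepts the identifiers containing ⚠, • and ・ that the inclusion-style one rejects, and it still rejects the one starting with a no-break space, since python's ‘\s’ covers that character for str patterns.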

cheers, oe

On Tue, Oct 23, 2018 at 12:30 PM Francis Bond <bond at ieee.org> wrote:
>
> G'day,
>
> currently Zhong has several identifiers which Mike's TDL code
> considers invalid, but which the LKB and ACE are fine with:
>
> *-marker := symbol &
>  ,_c_1 := conj_-_e_le &
>  _n_1 := n_-_pn_le &
> 和_c_⚠ := conj_-_e_le &
> 格里姆斯比•罗伊洛特_n_1 := n_-_h_pn_le &
>
> full width *
> full width ,
> nonbreakspace  [our bad, I will remove]
> warning sign (which I like to use for mal-rules).
> dot (often used in foreign names)
>
> And in Jacy:
> ザ・ベスト_n_1-tc := ordinary-nohon-n-lex &
> full width dot (often used in foreign names)
>
> PyDelphin defines identifiers to be: ([\w_+*?-]+),
> and coreference to be  \#([^\s!"#$&'(),./:;<=>[\]^]+)
>
> It would be nice to at least include: ・•⚠, in identifiers, but maybe
> better to have a list of disallowed things (like coreference, now I
> guess with |):
>
> ([^\s!"#$&'(),./:;<=>[\]^|]+)
>
> and even better if the LKB, PET, ACE, AGREE and PyDelphin are consistent.
>
> What do people think?
>
> --
> Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
>


