[developers] TDL identifiers

Francis Bond bond at ieee.org
Tue Oct 23 13:45:10 CEST 2018


Thanks for the feedback!

I tried to list the characters I thought should be excluded. Did I miss
any?

On Tue, 23 Oct 2018, 19:37 Stephan Oepen, <oe at ifi.uio.no> wrote:

> personally, i like all of your identifiers, francis, maybe even
> including the one with a non-breaking space :-).
>
> i would think whether or not they are accepted by ‘\w’ depends on the
> specific interpretation of ‘word characters’, which in turn may well
> depend on your local set-up, i.e. the current locale.  from the
> perlre(1) man page:
>
>        [...]Thus, under
>        this modifier, the ASCII platform effectively becomes a Unicode
>        platform; and hence, for example, "\w" will match any of the
>        more than 100,000 word characters in Unicode.
>
>        Unlike most locales, which are specific to a language and
>        country pair, Unicode classifies all the characters that are
>        letters somewhere in the world as "\w".
>
> to avoid such dependencies on locale context, it might indeed be
> simpler to define the syntax in terms of everything except a small
> list of characters (that have operator-like status in TDL).  this is
> more or less how the current lexers in the LKB and PET work, so might
> also be easier to make consistent across platforms.
>
> cheers, oe
>
> On Tue, Oct 23, 2018 at 12:30 PM Francis Bond <bond at ieee.org> wrote:
> >
> > G'day,
> >
> > currently Zhong has several identifiers which Mike's TDL code
> > considers invalid, but which the LKB and ACE are fine with:
> >
> > *-marker := symbol &
> >  ,_c_1 := conj_-_e_le &
> >  _n_1 := n_-_pn_le &
> > 和_c_⚠ := conj_-_e_le &
> > 格里姆斯比•罗伊洛特_n_1 := n_-_h_pn_le &
> >
> > full width *
> > full width ,
> > nonbreakspace  [our bad, I will remove]
> > warning sign (which I like to use for mal-rules).
> > dot (often used in foreign names)
> >
> > And in Jacy:
> > ザ・ベスト_n_1-tc := ordinary-nohon-n-lex &
> > full width dot (often used in foreign names)
> >
> > PyDelphin defines identifiers to be: ([\w_+*?-]+),
> > and coreference to be  \#([^\s!"#$&'(),./:;<=>[\]^]+)
> >
> > It would be nice to at least include: ・•⚠, in identifiers, but maybe
> > better to have a list of disallowed things (like coreference, now I
> > guess with |):
> >
> > ([^\s!"#$&'(),./:;<=>[\]^|]+)
> >
> > and even better if the LKB, PET, ACE, AGREE and PyDelphin are consistent.
> >
> > What do people think?
> >
> > --
> > Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
> > Division of Linguistics and Multilingual Studies
> > Nanyang Technological University
> >
>
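
As a rough illustration of the two approaches discussed in the quoted
messages, here is a minimal Python sketch that runs the patterns from the
thread over a few of the example identifiers. The names ALLOW_LIST,
ALLOW_LIST_ASCII, DENY_LIST and accepts() are made up for this example and
are not PyDelphin's actual API; the ASCII variant is only an assumption
meant to mimic a set-up where \w is not Unicode-aware.

```python
import re

# Identifier pattern quoted in the thread as PyDelphin's definition.
# For str patterns, Python 3 treats \w as Unicode-aware by default.
ALLOW_LIST = re.compile(r'[\w_+*?-]+')

# Same pattern with re.ASCII, so \w only covers [a-zA-Z0-9_].
# (Hypothetical variant, to show how the meaning of \w can shift.)
ALLOW_LIST_ASCII = re.compile(r'[\w_+*?-]+', re.ASCII)

# Deny-list pattern proposed in the thread: anything except whitespace
# and a fixed set of operator-like characters (now including "|").
DENY_LIST = re.compile(r'''[^\s!"#$&'(),./:;<=>[\]^|]+''')


def accepts(pattern, identifier):
    """Return True if the whole identifier matches the pattern."""
    return pattern.fullmatch(identifier) is not None


# Example identifiers taken from the thread.
SAMPLES = [
    'n_-_pn_le',
    '和_c_⚠',
    '格里姆斯比•罗伊洛特_n_1',
    'ザ・ベスト_n_1-tc',
]

for ident in SAMPLES:
    print(ident,
          accepts(ALLOW_LIST, ident),
          accepts(ALLOW_LIST_ASCII, ident),
          accepts(DENY_LIST, ident))
```

With Python 3's default Unicode matching, the \w-based pattern should
reject the three identifiers containing ⚠, • or ・, since those are
symbols or punctuation rather than word characters, while the deny-list
pattern accepts all four. The deny-list still rejects a non-breaking
space, because \s covers it in Python.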