<div dir="auto">Thanks for the feedback! <div dir="auto"><br></div><div dir="auto">I tried to list the characters I thought should be excluded. Did I miss any?</div></div><br><div class="gmail_quote"><div dir="ltr">On Tue, 23 Oct 2018, 19:37 Stephan Oepen, <<a href="mailto:oe@ifi.uio.no">oe@ifi.uio.no</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">personally, i like all of your identifiers, francis, maybe even<br>
including the one with a non-breaking space :-).<br>
<br>
i would think whether or not they are accepted by ‘\w’ depends on the<br>
specific interpretation of ‘word characters’, which in turn may well<br>
depend on your local set-up, i.e. the current locale. from the<br>
perlre(1) man page:<br>
<br>
[...]Thus, under<br>
this modifier, the ASCII platform effectively becomes a Unicode<br>
platform; and hence, for example, "\w" will match any of the more than<br>
100,000 word characters in Unicode.<br>
<br>
Unlike most locales, which are specific to a language and country pair,<br>
Unicode classifies all the characters that are letters somewhere in the<br>
world as "\w".<br>
<br>
to avoid such dependencies on locale context, it might indeed be<br>
simpler to define the syntax in terms of everything except a small<br>
list of characters (that have operator-like status in TDL). this is<br>
more or less how the current lexers in the LKB and PET work, so might<br>
also be easier to make consistent across platforms.<br>
<br>
cheers, oe<br>
<br>
On Tue, Oct 23, 2018 at 12:30 PM Francis Bond <<a href="mailto:bond@ieee.org" target="_blank" rel="noreferrer">bond@ieee.org</a>> wrote:<br>
><br>
> G'day,<br>
><br>
> currently Zhong has several identifiers which Mike's TDL code<br>
> considers invalid, but which the LKB and ACE are fine with:<br>
><br>
> ＊-marker := symbol &<br>
> ，_c_1 := conj_-_e_le &<br>
> &nbsp;_n_1 := n_-_pn_le &<br>
> 和_c_⚠ := conj_-_e_le &<br>
> 格里姆斯比•罗伊洛特_n_1 := n_-_h_pn_le &<br>
><br>
> full width *<br>
> full width ,<br>
> nonbreakspace [our bad, I will remove]<br>
> warning sign (which I like to use for mal-rules).<br>
> dot (often used in foreign names)<br>
><br>
> And in Jacy:<br>
> ザ・ベスト_n_1-tc := ordinary-nohon-n-lex &<br>
> full width dot (often used in foreign names)<br>
><br>
> PyDelphin defines identifiers to be: ([\w_+*?-]+),<br>
> and coreference to be \#([^\s!"#$&'(),./:;<=>[\]^]+)<br>
><br>
> It would be nice to at least include: ・•⚠, in identifiers, but maybe<br>
> better to have a list of disallowed things (like coreference, now I<br>
> guess with |):<br>
><br>
> ([^\s!"#$&'(),./:;<=>[\]^|]+)<br>
><br>
> and even better if the LKB, PET, ACE, AGREE and PyDelphin are consistent.<br>
><br>
> What do people think?<br>
><br>
> --<br>
> Francis Bond <<a href="http://www3.ntu.edu.sg/home/fcbond/" rel="noreferrer noreferrer" target="_blank">http://www3.ntu.edu.sg/home/fcbond/</a>><br>
> Division of Linguistics and Multilingual Studies<br>
> Nanyang Technological University<br>
><br>
</blockquote></div>
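<div dir="auto"><br></div><div dir="auto">For reference, here is a rough sketch of how the two patterns quoted above behave on the identifiers from the thread, assuming Python 3's standard re module and full-string matching. The names ALLOW, DENY and accepted are just for illustration (this is not PyDelphin's actual code), and the full-width and non-breaking-space characters are written as escapes, reconstructed from the descriptions in the quoted message.</div>

```python
import re

# Allow-list pattern quoted in the thread as the identifier definition.
ALLOW = re.compile(r'[\w_+*?-]+')

# Deny-list pattern proposed in the thread (coreference exclusions plus "|").
DENY = re.compile(r'''[^\s!"#$&'(),./:;<=>[\]^|]+''')

# Identifiers from the thread. Escapes are an assumption based on the
# descriptions given there ("full width *", "full width ,", "nonbreakspace").
IDENTIFIERS = [
    "\uff0a-marker",            # full-width asterisk
    "\uff0c_c_1",               # full-width comma
    "\u00a0_n_1",               # non-breaking space
    "和_c_⚠",                   # warning sign
    "格里姆斯比•罗伊洛特_n_1",  # bullet dot
    "ザ・ベスト_n_1-tc",        # katakana middle dot
    "n_-_pn_le",                # plain ASCII type name, as a control
]


def accepted(pattern, identifier):
    """Return True if the whole identifier matches the pattern."""
    return pattern.fullmatch(identifier) is not None


if __name__ == "__main__":
    for ident in IDENTIFIERS:
        print(repr(ident), accepted(ALLOW, ident), accepted(DENY, ident))
```

<div dir="auto">One thing this suggests: with Python's Unicode-aware \s, the deny-list version would still reject the identifier starting with a non-breaking space, while letting the other five through.</div>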