<div dir="auto">Thanks for the feedback! <div dir="auto"> </div><div dir="auto">I tried to list the characters I thought should be excluded. Did I miss any?</div></div> <div class="gmail_quote"><div dir="ltr">On Tue, 23 Oct 2018, 19:37 Stephan Oepen, &lt;<a href="mailto:oe@ifi.uio.no">oe@ifi.uio.no</a>&gt; wrote: </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">personally, i like all of your identifiers, francis, maybe even including the one with a non-breaking space :-). i would think whether or not they are accepted by ‘\w’ depends on the specific interpretation of ‘word characters’, which in turn may well depend on your local set-up, i.e. the current locale. from the perlre(1) man page:<br>
<br>
[...]Thus, under this modifier, the ASCII platform effectively becomes a Unicode platform; and hence, for example, "\w" will match any of the more than 100,000 word characters in Unicode. Unlike most locales, which are specific to a language and country pair, Unicode classifies all the characters that are letters somewhere in the world as "\w".<br>
<br>
to avoid such dependencies on locale context, it might indeed be simpler to define the syntax in terms of everything except a small list of characters (that have operator-like status in TDL). this is more or less how the current lexers in the LKB and PET work, so might also be easier to make consistent across platforms.<br>
<br>
cheers, oe<br>
<br>
On Tue, Oct 23, 2018 at 12:30 PM Francis Bond &lt;<a href="mailto:bond@ieee.org" target="_blank" rel="noreferrer">bond@ieee.org</a>&gt; wrote:<br>
><br>
> G'day,<br>
><br>
> currently Zhong has several identifiers which Mike's TDL code<br>
> considers invalid, but which the LKB and ACE are fine with:<br>
><br>
> ＊-marker := symbol &<br>
> ，_c_1 := conj_-_e_le &<br>
> _n_1 := n_-_pn_le &<br>
> 和_c_⚠ := conj_-_e_le &<br>
> 格里姆斯比•罗伊洛特_n_1 := n_-_h_pn_le &<br>
><br>
> full width *<br>
> full width ,<br>
> nonbreakspace [our bad, I will remove]<br>
> warning sign (which I like to use for mal-rules).<br>
> dot (often used in foreign names)<br>
><br>
> And in Jacy:<br>
> ザ・ベスト_n_1-tc := ordinary-nohon-n-lex &<br>
> full width dot (often used in foreign names)<br>
><br>
> PyDelphin defines identifiers to be: ([\w_+*?-]+),<br>
> and coreference to be \#([^\s!"#$&'(),./:;&lt;=>[\]^]+)<br>
><br>
> It would be nice to at least include: ・•⚠, in identifiers, but maybe<br>
> better to have a list of disallowed things (like coreference, now I<br>
> guess with |):<br>
><br>
> ([^\s!"#$&'(),./:;&lt;=>[\]^|]+)<br>
><br>
> and even better if the LKB, PET, ACE, AGREE and PyDelphin are consistent.<br>
><br>
> What do people think?<br>
><br>
> --<br>
> Francis Bond &lt;<a href="http://www3.ntu.edu.sg/home/fcbond/" rel="noreferrer noreferrer" target="_blank">http://www3.ntu.edu.sg/home/fcbond/</a>&gt;<br>
> Division of Linguistics and Multilingual Studies<br>
> Nanyang Technological University<br>
></blockquote></div>
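<div dir="auto">For anyone who wants to try the two patterns from the thread side by side, here is a rough sketch using Python's re module (where \w is Unicode-aware for str patterns). The helper name check, the use of fullmatch, and the sample list are assumptions made for illustration and are not PyDelphin's actual code; the position of the non-breaking space in the third sample is also a guess.</div>

```python
import re

# Allow-list pattern quoted in the thread as the current identifier definition.
ALLOW_LIST = re.compile(r'[\w_+*?-]+')

# Deny-list pattern proposed in the thread: anything except whitespace and
# a small set of operator-like characters (including '|').
DENY_LIST = re.compile(r'''[^\s!"#$&'(),./:;<=>[\]^|]+''')

# Sample identifiers from the thread. The leading U+00A0 in the third entry
# is an assumption about where the non-breaking space sits.
SAMPLES = [
    '＊-marker',
    '，_c_1',
    '\u00a0_n_1',
    '和_c_⚠',
    '格里姆斯比•罗伊洛特_n_1',
    'ザ・ベスト_n_1-tc',
    'conj_-_e_le',
]


def check(identifier):
    """Return (accepted by allow-list, accepted by deny-list).

    Hypothetical helper: the whole string must match, which is this
    sketch's choice rather than something stated in the thread.
    """
    return (
        ALLOW_LIST.fullmatch(identifier) is not None,
        DENY_LIST.fullmatch(identifier) is not None,
    )


if __name__ == '__main__':
    for ident in SAMPLES:
        allow, deny = check(ident)
        print(f'{ident!r}: allow-list={allow} deny-list={deny}')
```

<div dir="auto">In this sketch the deny-list accepts the full-width and symbol identifiers while still rejecting the one with the non-breaking space, since Python's \s covers U+00A0. Other regex engines or locale settings may behave differently.</div>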