[developers] chart mapping missing applicable lexical filtering rule?

paul at haleyai.com
Sun Jul 8 13:55:01 CEST 2018

Hi Stephan,


You pointed me in good directions, especially in looking beyond ORTH.FIRST.  I found that the following works for both cases (“bank” and “bank.”):


      veto_capitalized_native_uncapitalized_lfr := lexical_filtering_rule &
      [ +CONTEXT <>,
        +INPUT < [ SYNSEM [ PHON.ONSET con_or_voc & [ --TL.FIRST.+CARG ^[[:lower:]].*$ ],
                            LKEYS.KEYREL [ PRED named_np_rel,
                                           CARG ^[[:upper:]].*$ ] ] ] >,
        +OUTPUT <> ].
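
The two string constraints in this rule can be approximated outside the grammar as follows. This is only a minimal sketch in Python: the function name `would_veto` is hypothetical, the type constraints (`con_or_voc`, `named_np_rel`) are not modeled, and `[a-z]`/`[A-Z]` stand in for the POSIX classes `[[:lower:]]`/`[[:upper:]]`, which Python's `re` module does not support.

```python
import re

# ASCII stand-ins for the POSIX classes used in the rule (an assumption;
# Python's re module has no [[:upper:]] / [[:lower:]]).
TOKEN_CARG_LOWER = re.compile(r"^[a-z].*$")
KEYREL_CARG_UPPER = re.compile(r"^[A-Z].*$")


def would_veto(token_carg: str, keyrel_carg: str) -> bool:
    """Hypothetical helper: True when both string constraints of the rule hold."""
    return bool(TOKEN_CARG_LOWER.match(token_carg)) and bool(
        KEYREL_CARG_UPPER.match(keyrel_carg)
    )


# Values taken from the AVM quoted later in the thread:
# token +CARG "bank", KEYREL.CARG "Bank".
print(would_veto("bank", "Bank"))  # True
print(would_veto("Bank", "Bank"))  # False
```

The sketch only checks the strings; whether the rule fires in PET also depends on the typed parts of the feature structure unifying.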


This does indeed significantly reduce ambiguity for carefully edited text, so things now work a little better.


Thank you,



P.S. There is still something fishy going on, though, since for “it is the.” (which I do not expect to parse) a lexical item for the_pn_np1_no remains in the chart with an AVM that matches the above lexical filtering rule but the rule is not applied/logged.  (The ‘period’ lexical rule does not apply in this case.) 


If interested, the chart will be here for a while:  https://1drv.ms/u/s!Am2TXpQ1-kjsjQRUXeFXFK8QJ7mD


From: Stephan Oepen <oe at ifi.uio.no> 
Sent: Saturday, July 7, 2018 5:19 PM
To: paul at haleyai.com
Cc: developers <developers at delph-in.net>
Subject: Re: [developers] chart mapping missing applicable lexical filtering rule?


hi paul,


lexical filtering applies after lexical parsing, i.e. you need to make sure your rule matches the complete lexical item; in the case where there is a trailing period, that will be an instance of the ‘period’ lexical rule with the ‘bank’ lexical entry as its daughter.


not quite sure what the orthographemic machinery does about ORTH values, but i suspect that after the application of the ‘period’ rule the ORTH value may be either unset or (more likely) normalized to all lower case.  upon the application of orthographemic (aka spelling-changing) rules, the ORTH value of the mother cannot just be determined by unification, e.g. a re-entrancy into the daughter (as is common for lexical rules that do not affect spelling).


so, to make your current approach work, i think you would have to let the trigger rule detect proper names by a property other than ORTH.


alternatively, you could try making ORTH.FIRST re-entrant with TOKENS.+LIST.FIRST.+FORM, so that lexical instantiation will fail against an incoming token feature structure that does not match in case.  i have long been thinking this latter technique (as a type addendum on n_-_pn_le) could make a nice stepping stone towards a case-sensitive configuration of the ERG (which might give non-trivial efficiency gains on carefully edited text :-).


best wishes, oe



On Sat, 7 Jul 2018 at 21:21 <paul at haleyai.com> wrote:

Dear Developers,

In one use case, it would be nice to limit the use of capitalized proper nouns to cases in which the input is capitalized.  I have been successful in doing so, with some exceptions such as the one shown below.

I am surprised by the following behavior: either I have something to learn, or perhaps there is a bug in PET's chart mapping.


Given a capitalized lexical entry such as:

      Bank_NNP := n_-_pn_le & [ORTH <"Bank">,SYNSEM [LKEYS.KEYREL.CARG "Bank",PHON.ONSET con]].

the following lexical filtering rule (simplified for the purposes of this email):

      veto_capitalized_native_uncapitalized_lfr := lexical_filtering_rule & [+CONTEXT <>,+INPUT <[ORTH.FIRST ^[[:upper:]].*$]>,+OUTPUT <>].

will 'correctly' remove Bank_NNP from the chart when the input is "it is the bank" but fails to do so when a period is appended.

PET's chart mapping log shows the following for the first case:

      [cm] veto_capitalized_native_uncapitalized_lfr fired: I1:85 
      L [85 2-3 the_pn_np1_no (1) -0.1123 {} { : } {}] < blk: 2 dtrs: 50  parents: >
      [cm] veto_capitalized_native_uncapitalized_lfr fired: I1:92 
      L [92 3-4 Bank_NNP (1) 0 {} { : } {}] < blk: 2 dtrs: 51  parents: 98 >
      [cm] veto_capitalized_native_uncapitalized_lfr fired: I1:98 
      P [98 3-4 n_sg_ilr (1) 0 {} { : } {}] < blk: 2 dtrs: 92  parents: >

Surprisingly, only the first of these three rule applications occurs in the second case.

I don't think it matters, but in our case input is via FSC, in which the period is a separate token.  Thus, the following token mapping rule applies in the second case only:

    [cm] suffix_punctuation_tmr fired: C1:50 I1:48 O1:51 
    I [50 () -1--1 <14:15> "" "." { : } {}] < blk: 0 >
    I [48 () -1--1 <10:14> "" "bank" { : } {}] < blk: 2 >
    I [51 () -1--1 <10:15> "" "bank." { : } {}] < blk: 0 >
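
As far as the log shows, the effect of this rule is that the word token and the punctuation token that follows it yield a new token covering both spans. A minimal Python sketch of that merge follows; the `Token` class and the `attach_suffix_punctuation` function are hypothetical names for illustration and do not correspond to PET's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str
    start: int  # character offset where the token begins
    end: int    # character offset where the token ends


def attach_suffix_punctuation(word: Token, punct: Token) -> Token:
    """Hypothetical helper: fuse a word with the punctuation mark right after it."""
    if word.end != punct.start:
        raise ValueError("tokens are not adjacent")
    return Token(word.form + punct.form, word.start, punct.end)


# Spans and forms copied from the log: "bank" <10:14> and "." <14:15>.
merged = attach_suffix_punctuation(Token("bank", 10, 14), Token(".", 14, 15))
print(merged)  # Token(form='bank.', start=10, end=15)
```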

A redacted AVM for the surviving lexical item follows. As far as I can tell, it matches the lexical filtering rule above and thus should not remain in the chart.

L [103 3-4 Bank_NNP (1) 0 {} { : w_period_plr} {}] < blk: 0 dtrs: 63  parents: 110 >
[ ...
  SYNSEM   ...
             PHON   phon
                    [ ONSET con
                            [ --TL #16:native_token_cons
                                       [ FIRST token
                                               [ +CLASS #17:alphabetic
                                                            [ +CASE    non_capitalized+lower,
                                                              +INITIAL - ],
                                                 +FROM  #3,
                                                 +FORM  #18:"bank.",
                                                 +TO    "15",
                                                 +CARG  "bank",
                                         REST  native_token_null ] ] ],
             LKEYS  lexkeys_norm
                    [ KEYREL    named_nom_relation
                                [ CFROM #3,
                                  CTO   #29:"15",
                                  PRED  named_rel,
                                  LBL   #15,
                                  LNK   *list*,
                                  ARG0  #14,
                                  CARG  "Bank" ],
  ORTH     orthography
           [ FIRST "Bank",
             REST  *null*,
             FROM  #3,
             CLASS #17,
  TOKENS   tokens
           [ +LIST #16,
             +LAST token
                   [ +CLASS #17,
                     +FROM  "10",
                     +FORM  "bank.",
                     +TO    #29,
                     +CARG  "bank",

