[developers] Parsing with ACE

Thu Jun 25 00:19:15 CEST 2015

Hi Petter,

I notice "format error: unknown type `+'." in the grammar loading log.  Nothing in the log says where it comes from, but it refers to line 53 of rpp/lkb.rpp, where a rule starts with '+' and ACE, rather ungenerously, insists that it start with '!'.
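
A quick way to spot such lines is sketched below.  It is only a sketch: the sample file contents are made up, and the set of line-initial operators it accepts (; @ < > # : !) is an assumption about the REPP format rather than something taken from the ACE sources.

```shell
# Hypothetical stand-in for rpp/lkb.rpp; the rules themselves are invented.
rpp=$(mktemp)
printf '; a comment line\n!a\tb\n+x\ty\n:[[:space:]]+\n' > "$rpp"

# Assumed legal line-initial operators: ; @ < > # : !
# Print every non-blank line that starts with anything else, with its line number.
bad_lines=$(grep -nEv '^([;@<>#:!]|[[:space:]]*$)' "$rpp")
echo "$bad_lines"

rm -f "$rpp"
```

Run against the real rpp/lkb.rpp instead of the temporary file, this should list line 53 if the '+' rule is the only offender.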

The next problem I found is that lexemes have no TOKENS feature.  This feature is introduced on the type `word' by a type addendum in tmt.tdl, but lexemes do not inherit from `word'.  In a token-aware workflow, the output of the token mapping phase is unified into the TOKENS feature of lexemes; when that feature is missing or not appropriate, ACE finds itself in a situation it does not expect.

Additionally, the token mapping rule "generic_name_tmr" is defaulting all tokens to [ +TRAIT: generic_trait ], which means they are incompatible with native lexical entries.  Since there are no POS tags, the generic lexical entries are also incompatible, so you get no lexemes and no parse.

Finally, the tiny-lex.tdl lexicon has a start-of-string lexeme whose orthography is "START" rather than "^", which makes it unable to match the "^" introduced by the REPP rules.
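
The rewrite itself is a one-line substitution.  The sketch below runs it on a made-up lexical entry; the entry name, its type and the use of STEM for the orthography are hypothetical, so check what tiny-lex.tdl actually contains before applying anything like this.

```shell
# Hypothetical stand-in for the start-of-string entry in tiny-lex.tdl.
lex=$(mktemp)
printf 'start-marker := start-marker-lex &\n  [ STEM < "START" > ].\n' > "$lex"

# Show where the old orthography occurs.
grep -n '"START"' "$lex"

# Replace it with the "^" that the REPP rules put at the front of the input.
# In a sed replacement string, ^ is an ordinary character.
fixed=$(sed 's/"START"/"^"/' "$lex")
echo "$fixed"

rm -f "$lex"
```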

I took the liberty of changing tmt.tdl to introduce TOKENS and the accompanying constraints on word-or-lexrule instead of word, commenting out generic_name_tmr, and rewriting START to ^ in tiny-lex.tdl.  With these changes I can parse "Jon sover" and get a plausible-looking MRS out.
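
To check the changes, the grammar image has to be recompiled before parsing again.  The sketch below reuses the two commands and the paths from the original report quoted below; feeding the sentence on standard input and the guard around the calls are additions for illustration.

```shell
# Paths as in the original report; adjust them to the local checkout.
ACE=./ace
CONFIG=../../logon/petter/norsyg/ace/config.tdl

if [ -x "$ACE" ] && [ -f "$CONFIG" ]; then
  # Recompile the grammar image from the edited sources.
  "$ACE" -G norsyg.dat -g "$CONFIG"
  # Parse one sentence, with the same options as in the report.
  echo "Jon sover" | "$ACE" -g norsyg.dat -Tf1
  status=ran
else
  status=skipped
  echo "ace or config.tdl not found; nothing to do" >&2
fi
```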

I hope that is helpful advice,
-Woodley

> On Jun 24, 2015, at 5:36 AM, Petter Haugereid <petterha at gmail.com> wrote:
> 
> Hi,
> 
> I am trying to load my Norwegian grammar into ACE, but I run into some issues when I try to parse a sentence.
> 
> Loading the grammar seems to go fine (the config file is based on that of Jacy):
> 
> petter@tor:~/tools/ace-0.9.21$ ./ace -G norsyg.dat -g ../../logon/petter/norsyg/ace/config.tdl
> reading configuration       from `../../logon/petter/norsyg/ace/config.tdl'
> reading instance            from `../../logon/petter/norsyg/ace/../pet/qc.tdl'
> reading types               from `../../logon/petter/norsyg/ace/../mtr.tdl'
> grammar version             Norsyg (1206)
> format error: unknown type `+'.
> reading grammar             from `../../logon/petter/norsyg/ace/../norwegian.tdl'
> reading lexical-filtering-rulefrom `../../logon/petter/norsyg/ace/../lfr.tdl'
> reading types               from `../../logon/petter/norsyg/ace/../matrix.tdl'
> reading types               from `../../logon/petter/norsyg/ace/../nor.tdl'
> reading types               from `../../logon/petter/norsyg/ace/../infl-codes.tdl'
> reading types               from `../../logon/petter/norsyg/ace/../tmt.tdl'
> reading types               from `../../logon/petter/norsyg/ace/../unknown.tdl'
> reading lexical entries     from `../../logon/petter/norsyg/ace/../tiny-lex.tdl'
> reading token-mapping-rule  from `../../logon/petter/norsyg/ace/../tmr/prelude.tdl'
> reading token-mapping-rule  from `../../logon/petter/norsyg/ace/../tmr/pos.tdl'
> reading token-mapping-rule  from `../../logon/petter/norsyg/ace/../tmr/pos-ipa.tdl'
> reading token-mapping-rule  from `../../logon/petter/norsyg/ace/../tmr/finis.tdl'
> reading generic-lex-entry   from `../../logon/petter/norsyg/ace/../gle.tdl'
> reading rules               from `../../logon/petter/norsyg/ace/../rules.tdl'
> reading lexical rules       from `../../logon/petter/norsyg/ace/../tiny-irules.tdl'
> reading instance            from `../../logon/petter/norsyg/ace/../labels.tdl'
> reading instance            from `../../logon/petter/norsyg/ace/../roots.tdl'
> checking for glbs...        0.53 sec
> processing constraints...   0.67 sec
> processing rules            35 ms
> processing lex-rules        0 ms
> reading irregular forms     from ../irregs.tab
> processing lexicon...       1 ms
> simple lexemes              0 / 3 = 0.00%
> 3336 types (1501 glb), 3 lexemes, 77 rules, 1 orules, 983 instances, 722 strings, 234 features
> loading maxent model        0 ms
> reading tree labels         from `../../logon/petter/norsyg/ace/../labels.tdl'
> loading tree-node-labels
> rule filter...              83.3% blocked (39.1% ss)
> rule filter...              83.3% blocked (39.0% ss)
> rule filter...              83.3% blocked (39.0% ss)
> rf-transitive closure...    1 ms
> loaded grammar in 2.41391s
>  types: 33.9M rules: 8.4M lex-info: 500 
>  miscellaneous: 62K lex-dgs: 71K miscellaneous: 13.7M sem-index: 85K stochastic-model: 0 latmap rules: 18K
>  ... freezing 55.8M to file map 0x6000000000
> 
> 
> But when I try to parse the sentence "Jon sover", I get an error message:
> 
> petter@tor:~/tools/ace-0.9.21$ ./ace -g norsyg.dat -Tf1
> Jon sover
> ERROR: toklist or toklast missing on a token
> NOTE: lexemes do not span position 0 `^'!
> NOTE: post reduction gap
> SKIP: Jon sover
> NOTE: ignoring `Jon sover'
> 
> It should be noted that I use REPP to add "^ " at the beginning of every input string, so the string the grammar attempts to parse is "^ Jon sover". ("^" has a lexical entry.) 
> I don't quite understand the meaning of the ERROR message. I have tried to find out if there are any TOKENS features that are missing in the grammar, but I don't know what is expected of the grammar. I am attaching a stripped down version of the grammar in case anyone would like to try to find out what goes wrong. (The config file is in ace/.) 
> 
> Best regards,
> 
> Petter
> <norsyg_2015-06-24.tgz>