[developers] Bug report for ERG

Dan Flickinger danf at stanford.edu
Thu Nov 12 03:40:44 CET 2020


One of the unfortunate consequences of the change in tokenization for the trunk ERG (treating punctuation marks as separate tokens) is that we no longer correctly handle web addresses in text, because the tokenizer now splits at slashes and periods, `exploding' URLs into many separate tokens.  This is obviously not the desired behavior, and Stephan has been leading an effort to get a uniform preprocessing mechanism into the various platforms so we can cope with URLs and the like, by ensuring that they are single tokens by the time the parser sees them.

In the meantime, Alexandre, perhaps you can write a little temporary script that replaces URLs with a single simple token before presenting a sentence to ACE for parsing.
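
A minimal sketch of such a stopgap, assuming Python and a deliberately simple URL pattern: the names `mask_urls`, `URL_RE`, and the `URL` placeholder are hypothetical, and the pattern is not a full URL grammar.

```python
import re

# Hypothetical placeholder token; any single token the grammar handles would do.
URL_PLACEHOLDER = "URL"

# Assumption: a URL is a scheme, "://", and a run of characters that are not
# whitespace, brackets, angle brackets, or quotes. Trailing sentence
# punctuation is left outside the match by the negative lookbehind.
URL_RE = re.compile(r"""[a-zA-Z][a-zA-Z0-9+.-]*://[^\s\[\]<>"')]+(?<![.,;:!?])""")


def mask_urls(sentence, placeholder=URL_PLACEHOLDER):
    """Return (masked_sentence, urls), where urls lists what was replaced."""
    urls = []

    def _replace(match):
        urls.append(match.group(0))  # keep the original so it can be restored
        return placeholder

    return URL_RE.sub(_replace, sentence), urls


masked, found = mask_urls(
    "[http://www.usatoday.com/tech/science/space/"
    "2005-03-09-nasa-search_x.htm?csp=34]"
)
print(masked)  # the bracketed URL collapses to a single placeholder token
```

The masked string is what would be passed to ACE; the returned list lets the original URLs be restored afterwards if needed.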

 Dan

________________________________
From: developers-bounces at emmtee.net <developers-bounces at emmtee.net> on behalf of goodman.m.w at gmail.com <goodman.m.w at gmail.com>
Sent: Wednesday, November 11, 2020 6:19 PM
To: Alexandre Rademaker <arademaker at gmail.com>
Cc: developers <developers at delph-in.net>
Subject: Re: [developers] Bug report for ERG

Hi Alexandre,

I was able to reproduce the issue using the ERG 2018 (which creates a named EP with the URL as its CARG) and a ~3-month-old trunk version of the ERG (which tokenized the URL). I'll leave the question of the ERG's behavior to the pros, and I'll address the MRS syntax problem.

PyDelphin reported the syntax error at the '.' character because that's the point at which the SimpleMRS parser was unable to proceed, but the problem is in fact the '_' in the lemma portion of the predicate symbol. Currently there is no agreed-upon way to have a lemma containing '_', as '_' is the delimiter between the lemma and pos fields. The so-called "TypePred" production in the SimpleMRS BNF at http://moin.delph-in.net/MrsRfc#Simple is overly permissive (note: I wrote it, adapting Bec's original). Stephan and I had some discussion about the mini-format of predicate symbols on GitHub (https://github.com/delph-in/pydelphin/issues/302) but unfortunately little of that conversation made it to this list.
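
To illustrate the ambiguity, here is a rough sketch that splits a surface predicate on every '_'. The function `naive_split` is hypothetical and is not how PyDelphin or ACE is implemented; it only shows why a '_' inside the lemma cannot be told apart from a field delimiter.

```python
def naive_split(pred):
    """Split a surface predicate such as '_dog_n_1' on every '_'."""
    if not pred.startswith("_"):
        raise ValueError("not a surface predicate: " + pred)
    return pred[1:].split("_")


# Lemma without '_': three fields (lemma, pos, sense), as intended.
print(naive_split("_com/NN_u_unknown"))
# ['com/NN', 'u', 'unknown']

# Lemma containing '_': four fields, so the lemma boundary is lost.
print(naive_split("_search_x.htm?csp=34/NN_u_unknown"))
# ['search', 'x.htm?csp=34/NN', 'u', 'unknown']
```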

In short, I propose a character-escaping solution for use in predicate symbols across all serialization formats (SimpleMRS, MRX, EDS native, etc.). For this, we could recycle TSDB's three escapes (\s, \n, and \\), where in this case the separator escaped by \s is '_' instead of '@'. Any other characters that might cause issues in parsing (such as a space or '<' in SimpleMRS, or '[', '{', or '(' in EDS) would be handled by those formats individually. For SimpleMRS, I suggest quoting any predicate that contains a space or '<' (the quotes are part of SimpleMRS's syntax only, not of the predicate format) and then escaping quotes (\") inside predicates. This means that abstract predicates (compound, udef_q, etc.) would also be quoted if they contained a space or '<'. In MRX, a predicate containing '<' would need to replace it with &lt;, and so on.
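
As a rough sketch of the format-independent part of this proposal (the helper names are hypothetical, nothing like this exists in PyDelphin or ACE as far as this message states, and format-specific quoting for SimpleMRS or MRX is left out), escaping each field before joining on '_' makes the split unambiguous again:

```python
# Assumed mapping, following the three TSDB-style escapes named above,
# with \s standing for '_' rather than '@'.
_ESCAPES = {"\\": "\\\\", "_": "\\s", "\n": "\\n"}
_UNESCAPES = {"\\": "\\", "s": "_", "n": "\n"}


def escape_field(text):
    """Escape backslash, '_' and newline in one field of a predicate."""
    return "".join(_ESCAPES.get(ch, ch) for ch in text)


def unescape_field(text):
    """Invert escape_field; reject unknown or dangling escapes."""
    out = []
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, None)
            if nxt not in _UNESCAPES:
                raise ValueError("bad escape in " + repr(text))
            out.append(_UNESCAPES[nxt])
        else:
            out.append(ch)
    return "".join(out)


def encode_predicate(lemma, pos, sense=None):
    """Build a surface predicate string from already-separated fields."""
    fields = [lemma, pos] + ([sense] if sense is not None else [])
    return "_" + "_".join(escape_field(f) for f in fields)


def decode_predicate(pred):
    """Split on raw '_' (escaped ones never appear raw), then unescape."""
    if not pred.startswith("_"):
        raise ValueError("not a surface predicate: " + pred)
    return [unescape_field(f) for f in pred[1:].split("_")]


encoded = encode_predicate("search_x.htm?csp=34/NN", "u", "unknown")
print(encoded)
print(decode_predicate(encoded))
```

Under these assumptions the problematic lemma round-trips as a single field, because the only raw '_' characters left in the encoded string are the delimiters.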

If we agree on such a change, then both PyDelphin and ACE (and other processors) would need to be modified to get around the issue you're experiencing. Of course, this specific issue could be sidestepped by getting the ERG to put URLs back into CARGs instead of being tokenized and parsed into generic predicate symbols.

On Thu, Nov 12, 2020 at 12:54 AM Alexandre Rademaker <arademaker at gmail.com> wrote:

BTW, regardless of the tokenisation issue, an invalid MRS should not be produced, right?

Best,
Alexandre

> On 10 Nov 2020, at 18:39, Alexandre Rademaker <arademaker at gmail.com> wrote:
>
> Hi,
>
> I am trying to parse the sentences from the EWT corpus (https://github.com/universaldependencies/UD_English-EWT), but the DEV set contains a nonsense sentence consisting only of a URL between brackets:
>
> [http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34]
>
> ACE reports an invalid MRS. The error is at character 2666, so the problem is probably the predicate:
>
> _search_x.htm?csp=34/NN_u_unknown
>
> But the regex for predicates seems to allow a dot in the predicate name:
>
> http://moin.delph-in.net/MrsRfc#SerializationFormats
>
> Anyway, the pre-processing of the sentence seems wrong to me in the ERG trunk version: the tokenisation broke the URL into many tokens and consumed the `http://` protocol prefix:
>
> % ace -g ~/hpsg/wn/terg-mac.dat -E
> [http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34]
> www.usatoday. com / tech/ science / space/ 2005 – 03 – 09 - nasa - search_x.htm?csp=34
>
> ERG (2018) produced what I was expecting:
>
> % ace -g erg-mac.dat -E
> [http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34]
> www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34
>
> ERG (1214) produced what I was expecting:
>
> % ace -g erg-lingo-mac.dat -E
> [http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34]
> [ http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34 ]
>
>
> >>> response = ace.parse(grm, '[http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34]')
> NOTE: hit RAM limit while unpacking
> NOTE: parsed 1 / 1 sentences, avg 1536033k, time 51.15306s
>
> >>> response.result(0).mrs()
> Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>  File "/Users/ar/.venv/lib/python3.9/site-packages/delphin/interface.py", line 146, in mrs
>    mrs = simplemrs.decode(mrs)
>  File "/Users/ar/.venv/lib/python3.9/site-packages/delphin/codecs/simplemrs.py", line 112, in decode
>    return _decode_mrs(lexer)
>  File "/Users/ar/.venv/lib/python3.9/site-packages/delphin/codecs/simplemrs.py", line 200, in _decode_mrs
>    rels.append(_decode_rel(lexer, variables))
>  File "/Users/ar/.venv/lib/python3.9/site-packages/delphin/codecs/simplemrs.py", line 252, in _decode_rel
>    _, label = lexer.expect((FEATURE, 'LBL'), (SYMBOL, None))
>  File "/Users/ar/.venv/lib/python3.9/site-packages/delphin/util.py", line 473, in expect
>    raise self._errcls('expected: ' + err,
> delphin.mrs._exceptions.MRSSyntaxError:
>  line 1, character 2666
>    [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques TENSE: tensed MOOD: indicative ] RELS: < [ implicit_conj<8:79> LBL: h1 ARG0: e2 ARG1: e4 [ e SF: prop TENSE: tensed MOOD: indicative ] ARG2: e5 [ e SF: prop-or-ques TENSE: tensed MOOD: indicative ] ]  [ unknown<8:21> LBL: h1 ARG0: e4 ARG: u6 ]  [ _www.usatoday./JJ_u_unknown<8:21> LBL: h1 ARG0: e7 [ e SF: prop ] ARG1: u6 ]  [ implicit_conj<21:79> LBL: h1 ARG0: e5 ARG1: e8 [ e SF: prop-or-ques TENSE: tensed MOOD: indicative ] ARG2: e9 [ e SF: prop-or-ques TENSE: tensed MOOD: indicative ] ]  [ unknown<21:49> LBL: h1 ARG0: e8 ARG: x10 ]  [ udef_q<21:49> LBL: h11 ARG0: x10 RSTR: h12 BODY: h13 ]  [ udef_q<21:24> LBL: h14 ARG0: x15 [ x PERS: 3 NUM: sg ] RSTR: h16 BODY: h17 ]  [ _com/NN_u_unknown<21:24> LBL: h18 ARG0: x15 ]  [ _and_c<24:25> LBL: h19 ARG0: x10 ARG1: x15 ARG2: x20 ]  [ udef_q<25:49> LBL: h21 ARG0: x20 RSTR: h22 BODY: h23 ]  [ udef_q<25:37> LBL: h24 ARG0: x25 [ x PERS: 3 NUM: sg ] RSTR: h26 BODY: h27 ]  [ _tech//JJ_u_unknown<25:30> LBL: h28 ARG0: e29 [ e SF: prop TENSE: untensed MOOD: indicative PROG: bool PERF: - ] ARG1: x25 ]  [ _science_n_1<30:37> LBL: h28 ARG0: x25 ]  [ _and_c<37:38> LBL: h30 ARG0: x20 ARG1: x25 ARG2: x31 [ x PERS: 3 NUM: sg IND: + ] ]  [ proper_q<38:49> LBL: h32 ARG0: x31 RSTR: h33 BODY: h34 ]  [ compound<38:49> LBL: h35 ARG0: e36 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x31 ARG2: x37 [ x PT: pt ] ]  [ udef_q<38:44> LBL: h38 ARG0: x37 RSTR: h39 BODY: h40 ]  [ _space//NN_u_unknown<38:44> LBL: h41 ARG0: x37 ]  [ yofc<44:48> LBL: h35 CARG: "2005" ARG0: x31 ]  [ implicit_conj<49:79> LBL: h1 ARG0: e9 ARG1: e43 [ e SF: prop-or-ques TENSE: tensed MOOD: indicative ] ARG2: e44 [ e SF: prop-or-ques TENSE: tensed MOOD: indicative ] ]  [ unknown<49:52> LBL: h1 ARG0: e43 ARG: x45 [ x PERS: 3 NUM: sg IND: + ] ]  [ proper_q<49:52> LBL: h46 ARG0: x45 RSTR: h47 BODY: h48 ]  [ yofc<49:51> LBL: h49 CARG: "03" ARG0: x45 ]  [ implicit_conj<52:79> LBL: h1 ARG0: e44 ARG1: e51 [ e SF: prop-or-ques TENSE: tensed MOOD: indicative ] ARG2: e52 [ e SF: prop-or-ques ] ]  [ unknown<52:55> LBL: h1 ARG0: e51 ARG: x53 [ x PERS: 3 NUM: sg IND: + ] ]  [ proper_q<52:55> LBL: h54 ARG0: x53 RSTR: h55 BODY: h56 ]  [ yofc<52:54> LBL: h57 CARG: "09" ARG0: x53 ]  [ unknown<55:79> LBL: h1 ARG0: e52 ARG: x59 [ x PERS: 3 NUM: sg ] ]  [ udef_q<55:79> LBL: h60 ARG0: x59 RSTR: h61 BODY: h62 ]  [ compound<55:79> LBL: h63 ARG0: e64 [ e SF: prop TENSE: untensed MOOD: indicative PROG: - PERF: - ] ARG1: x59 ARG2: x65 [ x PERS: 3 NUM: sg IND: + PT: pt ] ]  [ proper_q<55:60> LBL: h66 ARG0: x65 RSTR: h67 BODY: h68 ]  [ named<55:59> LBL: h69 CARG: "NASA" ARG0: x65 ]  [ _search_x.htm?csp=34/NN_u_unknown<60:79> LBL: h63 ARG0: x59 ] > HCONS: < h0 qeq h1 h12 qeq h19 h16 qeq h18 h22 qeq h30 h26 qeq h28 h33 qeq h35 h39 qeq h41 h47 qeq h49 h55 qeq h57 h61 qeq h63 h67 qeq h69 > ICONS: < > ]
> MRSSyntaxError: expected: a feature
>
>
> Best,
> Alexandre
>




--
-Michael Wayne Goodman