[developers] Bug report for ERG

goodman.m.w at gmail.com
Thu Nov 12 03:19:02 CET 2020


Hi Alexandre,

I was able to reproduce the issue using the ERG 2018 (which creates a named
EP with the URL as its CARG) and a ~3-month old trunk version of the ERG
(which tokenized the URL). I'll leave the question of the ERG's behavior to
the pros, and I'll address the MRS syntax problem.

PyDelphin reported the syntax error at the '.' character because that's the
point at which the SimpleMRS parser was unable to proceed, but the problem
is in fact the '_' in the lemma portion of the predicate symbol. Currently
there is no agreed-upon way to have a lemma containing '_', as '_' is the
delimiter between the lemma and pos fields. The so-called "TypePred"
production in the SimpleMRS BNF at http://moin.delph-in.net/MrsRfc#Simple
is overly permissive (note: I wrote it, adapting Bec's original). Stephan
and I had some discussion about the mini-format of predicate symbols on
GitHub (https://github.com/delph-in/pydelphin/issues/302) but unfortunately
little of that conversation made it to this list.
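
To illustrate the ambiguity, here is a minimal Python sketch. Note that
naive_split is a hypothetical helper written for this message, not
PyDelphin's actual parsing code; it only assumes that '_' separates the
lemma, pos, and sense fields of a surface predicate.

```python
def naive_split(pred):
    """Split a surface predicate on '_' with no escaping (illustration only)."""
    # Drop the leading '_' that marks a surface predicate, then split.
    return pred.lstrip("_").split("_")


# A well-behaved predicate yields the expected lemma, pos, and sense fields.
print(naive_split("_science_n_1"))
# ['science', 'n', '1']

# The '_' inside the lemma yields four fields, so it is no longer clear
# where the lemma ends and the pos field begins.
print(naive_split("_search_x.htm?csp=34/NN_u_unknown"))
# ['search', 'x.htm?csp=34/NN', 'u', 'unknown']
```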

In short, I propose a character-escaping solution for use in predicate
symbols for all serialization formats. For this, we could recycle TSDB's
three escapes (\s, \n, and \\), where in this case the separator \s is '_'
instead of '@'. Any other characters that might cause issues in parsing
(such as a space or '<' in SimpleMRS, or '[', '{', or '(' in EDS) would be
handled individually by each serialization format (SimpleMRS, MRX, EDS
native, etc.). For SimpleMRS, I suggest quoting any
predicate that contains a space or '<' (and quotes are not part of the
predicate format, only part of SimpleMRS's), and then escaping quotes (\")
inside predicates. This means that abstract predicates (compound, udef_q,
etc.) would also be quoted if they contained a space or '<'. In MRX, a predicate
with '<' would need to replace it with &lt;, and so on.
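
As a rough sketch of how this proposal might look in Python (nothing here
is implemented in PyDelphin or ACE; all function names are hypothetical,
and the escape table is my assumption based on the three TSDB escapes with
'_' as the separator):

```python
import re

# Assumed escape table: backslash, the '_' field separator, and newline.
_ESCAPE = {"\\": "\\\\", "_": "\\s", "\n": "\\n"}
_UNESCAPE = {esc: raw for raw, esc in _ESCAPE.items()}


def escape_field(text):
    """Escape one predicate field (e.g. the lemma) so it contains no raw '_'."""
    return re.sub(r"[\\_\n]", lambda m: _ESCAPE[m.group(0)], text)


def unescape_field(text):
    """Reverse escape_field, scanning left to right so '\\\\' is handled first."""
    return re.sub(r"\\[\\sn]", lambda m: _UNESCAPE[m.group(0)], text)


def make_predicate(lemma, pos, sense=None):
    """Build a surface predicate symbol from its (unescaped) fields."""
    fields = [escape_field(lemma), pos]
    if sense is not None:
        fields.append(escape_field(sense))
    return "_" + "_".join(fields)


def split_predicate(pred):
    """Split a surface predicate on unescaped '_' and unescape each field."""
    # After escaping, every remaining raw '_' is a field delimiter.
    return [unescape_field(field) for field in pred[1:].split("_")]


def quote_for_simplemrs(pred):
    """SimpleMRS-specific step: quote predicates containing a space or '<'."""
    if " " in pred or "<" in pred:
        return '"' + pred.replace('"', '\\"') + '"'
    return pred


pred = make_predicate("search_x.htm?csp=34/NN", "u", "unknown")
print(pred)
# _search\sx.htm?csp=34/NN_u_unknown
print(split_predicate(pred))
# ['search_x.htm?csp=34/NN', 'u', 'unknown']
```

The quoting step is kept separate from the escaping step on purpose: the
escapes belong to the predicate symbol itself, while the quotes belong only
to SimpleMRS.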

If we agree on such a change, then both PyDelphin and ACE (and other
processors) would need to be modified to get around the issue you're
experiencing. Of course, this specific issue could be sidestepped by
getting the ERG to put URLs back into CARGs instead of tokenizing them and
parsing them into generic predicate symbols.

On Thu, Nov 12, 2020 at 12:54 AM Alexandre Rademaker <arademaker at gmail.com>
wrote:

>
> BTW, regardless the tokenisation issue, an invalid MRS should not be
> produced, right?
>
> Best,
> Alexandre
>
> > On 10 Nov 2020, at 18:39, Alexandre Rademaker <arademaker at gmail.com>
> wrote:
> >
> > Hi,
> >
> > I am trying to parse the sentences from EWT corpus (
> https://github.com/universaldependencies/UD_English-EWT) but in the DEV
> set I have a non-sense sentence with only an url between brackets:
> >
> > [
> http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34
> ]
> >
> > ACE reports an invalid MRS. The error is in the character 2666, so
> probably the error is the predicate:
> >
> > _search_x.htm?csp=34/NN_u_unknown
> >
> > But the regex for predicates seems to support dot in the name of the
> predicate:
> >
> > http://moin.delph-in.net/MrsRfc#SerializationFormats
> >
> > Anyway, the pre-processing of the sentence seems wrong to me in ERG
> trunk version, the tokenisation broke the url into many tokens and consumed
> the protocol `http://` prefix:
> >
> > % ace -g ~/hpsg/wn/terg-mac.dat -E
> > [
> http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34
> ]
> > www.usatoday. com / tech/ science / space/ 2005 – 03 – 09 - nasa -
> search_x.htm?csp=34
> >
> > ERG (2018) produced what I was expecting:
> >
> > % ace -g erg-mac.dat -E
> > [
> http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34
> ]
> > www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34
> >
> > ERG (1214) produced what I was expecting:
> >
> > % ace -g erg-lingo-mac.dat -E
> > [
> http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34
> ]
> > [
> http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34
> ]
> >
> >
> >>>> response = ace.parse(grm, '[
> http://www.usatoday.com/tech/science/space/2005-03-09-nasa-search_x.htm?csp=34]
> ')
> > NOTE: hit RAM limit while unpacking
> > NOTE: parsed 1 / 1 sentences, avg 1536033k, time 51.15306s
> >
> >>>> response.result(0).mrs()
> > Traceback (most recent call last):
> >  File "<stdin>", line 1, in <module>
> >  File
> "/Users/ar/.venv/lib/python3.9/site-packages/delphin/interface.py", line
> 146, in mrs
> >    mrs = simplemrs.decode(mrs)
> >  File
> "/Users/ar/.venv/lib/python3.9/site-packages/delphin/codecs/simplemrs.py",
> line 112, in decode
> >    return _decode_mrs(lexer)
> >  File
> "/Users/ar/.venv/lib/python3.9/site-packages/delphin/codecs/simplemrs.py",
> line 200, in _decode_mrs
> >    rels.append(_decode_rel(lexer, variables))
> >  File
> "/Users/ar/.venv/lib/python3.9/site-packages/delphin/codecs/simplemrs.py",
> line 252, in _decode_rel
> >    _, label = lexer.expect((FEATURE, 'LBL'), (SYMBOL, None))
> >  File "/Users/ar/.venv/lib/python3.9/site-packages/delphin/util.py",
> line 473, in expect
> >    raise self._errcls('expected: ' + err,
> > delphin.mrs._exceptions.MRSSyntaxError:
> >  line 1, character 2666
> >    [ LTOP: h0 INDEX: e2 [ e SF: prop-or-ques TENSE: tensed MOOD:
> indicative ] RELS: < [ implicit_conj<8:79> LBL: h1 ARG0: e2 ARG1: e4 [ e
> SF: prop TENSE: tensed MOOD: indicative ] ARG2: e5 [ e SF: prop-or-ques
> TENSE: tensed MOOD: indicative ] ]  [ unknown<8:21> LBL: h1 ARG0: e4 ARG:
> u6 ]  [ _www.usatoday./JJ_u_unknown<8:21> LBL: h1 ARG0: e7 [ e SF: prop ]
> ARG1: u6 ]  [ implicit_conj<21:79> LBL: h1 ARG0: e5 ARG1: e8 [ e SF:
> prop-or-ques TENSE: tensed MOOD: indicative ] ARG2: e9 [ e SF: prop-or-ques
> TENSE: tensed MOOD: indicative ] ]  [ unknown<21:49> LBL: h1 ARG0: e8 ARG:
> x10 ]  [ udef_q<21:49> LBL: h11 ARG0: x10 RSTR: h12 BODY: h13 ]  [
> udef_q<21:24> LBL: h14 ARG0: x15 [ x PERS: 3 NUM: sg ] RSTR: h16 BODY: h17
> ]  [ _com/NN_u_unknown<21:24> LBL: h18 ARG0: x15 ]  [ _and_c<24:25> LBL:
> h19 ARG0: x10 ARG1: x15 ARG2: x20 ]  [ udef_q<25:49> LBL: h21 ARG0: x20
> RSTR: h22 BODY: h23 ]  [ udef_q<25:37> LBL: h24 ARG0: x25 [ x PERS: 3 NUM:
> sg ] RSTR: h26 BODY: h27 ]  [ _tech//JJ_u_unknown<25:30> LBL: h28 ARG0: e29
> [ e SF: prop TENSE: untensed MOOD: indicative PROG: bool PERF: - ] ARG1:
> x25 ]  [ _science_n_1<30:37> LBL: h28 ARG0: x25 ]  [ _and_c<37:38> LBL: h30
> ARG0: x20 ARG1: x25 ARG2: x31 [ x PERS: 3 NUM: sg IND: + ] ]  [
> proper_q<38:49> LBL: h32 ARG0: x31 RSTR: h33 BODY: h34 ]  [ compound<38:49>
> LBL: h35 ARG0: e36 [ e SF: prop TENSE: untensed MOOD: indicative PROG: -
> PERF: - ] ARG1: x31 ARG2: x37 [ x PT: pt ] ]  [ udef_q<38:44> LBL: h38
> ARG0: x37 RSTR: h39 BODY: h40 ]  [ _space//NN_u_unknown<38:44> LBL: h41
> ARG0: x37 ]  [ yofc<44:48> LBL: h35 CARG: "2005" ARG0: x31 ]  [
> implicit_conj<49:79> LBL: h1 ARG0: e9 ARG1: e43 [ e SF: prop-or-ques TENSE:
> tensed MOOD: indicative ] ARG2: e44 [ e SF: prop-or-ques TENSE: tensed
> MOOD: indicative ] ]  [ unknown<49:52> LBL: h1 ARG0: e43 ARG: x45 [ x PERS:
> 3 NUM: sg IND: + ] ]  [ proper_q<49:52> LBL: h46 ARG0: x45 RSTR: h47 BODY:
> h48 ]  [ yofc<49:51> LBL: h49 CARG: "03" ARG0: x45 ]  [
> implicit_conj<52:79> LBL: h1 ARG0: e44 ARG1: e51 [ e SF: prop-or-ques
> TENSE: tensed MOOD: indicative ] ARG2: e52 [ e SF: prop-or-ques ] ]  [
> unknown<52:55> LBL: h1 ARG0:
> e51 ARG: x53 [ x PERS: 3 NUM: sg IND: + ] ]  [ proper_q<52:55> LBL: h54
> ARG0: x53 RSTR: h55 BODY: h56 ]  [ yofc<52:54> LBL: h57 CARG: "09" ARG0:
> x53 ]  [ unknown<55:79> LBL: h1 ARG0: e52 ARG: x59 [ x PERS: 3 NUM: sg ] ]
> [ udef_q<55:79> LBL: h60 ARG0: x59 RSTR: h61 BODY: h62 ]  [ compound<55:79>
> LBL: h63 ARG0: e64 [ e SF: prop TENSE: untensed MOOD: indicative PROG: -
> PERF: - ] ARG1: x59 ARG2: x65 [ x PERS: 3 NUM: sg IND: + PT: pt ] ]  [
> proper_q<55:60> LBL: h66 ARG0: x65 RSTR: h67 BODY: h68 ]  [ named<55:59>
> LBL: h69 CARG: "NASA" ARG0: x65 ]  [
> _search_x.htm?csp=34/NN_u_unknown<60:79> LBL: h63 ARG0: x59 ] > HCONS: < h0
> qeq h1 h12 qeq h19 h16 qeq h18 h22 qeq h30 h26 qeq h28 h33 qeq h35 h39 qeq
> h41 h47 qeq h49 h55 qeq h57 h61 qeq h63 h67 qeq h69 > ICONS: < > ]
> >
> > MRSSyntaxError: expected: a feature
> >
> >
> > Best,
> > Alexandre
> >
>
>
>

-- 
-Michael Wayne Goodman