[developers] character-based discriminants

Dan Flickinger danf at stanford.edu
Mon Mar 23 17:55:10 CET 2020


Hi Woodley and Stephan,
[and with apologies to everyone else for the cryptic flavor of this note, which has to do with a conversion of the ERG to treat punctuation marks as separate tokens, for better interoperability with the rest of the universe]

I was able to use the converted `decision' files that you constructed during my visit in February, Woodley, with some non-zero amount of additional manual disambiguation, and this morning I completed the update of the full set of 2018 gold trees into the makeover universe, including wsj00-04.  I would now be grateful if you could also provide converted decision files for the wsj05-12 profiles, which had likewise been updated with the 2018 grammar after it was released.  Since the 2018mo grammar doesn't really have a natural home in SVN, I have put a full copy of it here, including in its tsdb/gold directory both the recently updated profiles and the 2018 ones for wsj05-wsj12 that I hope you'll convert:
http://lingo.stanford.edu/danf/2018mo.tgz

My intention is now to update these gold profiles from that time-warped 2018mo grammar to the SVN `mo' grammar (which we branched from `trunk' during my visit to Oslo in November).  If all goes well, we should then be in a position to anoint `mo' as the official new `trunk' version, and use it as the basis for the next stable ERG release, ideally this summer.

I would also be interested to know whether these now manually updated profiles allow you to train a better disambiguation model than the one you trained in February on just the automatically updated items.

Thanks for the help so far!

 Dan


________________________________
From: developers-bounces at emmtee.net <developers-bounces at emmtee.net> on behalf of Woodley Packard <sweaglesw at sweaglesw.org>
Sent: Tuesday, February 4, 2020 4:35 PM
To: Stephan Oepen <oe at ifi.uio.no>
Cc: developers at delph-in.net <developers at delph-in.net>
Subject: Re: [developers] character-based discriminants

Stephan and Dan, and other interested parties,

Happy new year to you all.  In the course of taking a closer look at how
the proposed character-based discriminant system might work, I've run
across a few cases that perhaps would benefit from a bit of discussion.
First, my attempt to distill the proposed action plan for an automatic
update (downdate?) of the ERG treebanks to the venerable PTB punctuation
convention is as follows:

1. Modify ACE and other engines to use input character positions as
token vertex identifiers, so that data coming out -- particularly the
full forest record in the "edge" relation -- uses these to identify
constituent boundaries instead of the existing identifiers
(corresponding roughly to whitespace areas).

2. Mechanically revise a copy of the "decisions" relation from the old
gold treebank so that the vertex identifiers in it are also
character-based, in hopes of matching those used in the new full forest
profiles.  Destroy any discriminants that are judged unlikely to match
correctly (a toy version of this rewrite is sketched after this list).

3. Run an automatic treebank update to achieve a high coverage gold
treebank under the new punctuation convention; manually fix any items
that didn't quite make it.
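
To make step 2 concrete, here is a toy sketch in Python; the real
"decisions" relation carries more fields than the (tag, from, to)
triples assumed here, and vertex_to_char stands for an assumed per-item
mapping from old vertex identifiers to character positions (obtained,
for instance, along the lines discussed below):

    # Toy illustration of step 2: keep a discriminant only if both of
    # its vertices can be mapped to character positions; destroy the
    # rest rather than risk a bad match.
    def rewrite_discriminants(discriminants, vertex_to_char):
        kept = []
        for tag, dfrom, dto in discriminants:
            if dfrom in vertex_to_char and dto in vertex_to_char:
                kept.append((tag, vertex_to_char[dfrom], vertex_to_char[dto]))
        return kept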

Stephan pointed out that the +FROM/+TO values on token AVMs are a way to
convert existing vertices to character positions.  Thinking a bit more
closely about this, I see at least one obvious problem: adjacent tokens
T1,T2 do not generally have the property that T1.+TO = T2.+FROM, because
there is usually whitespace between them.  Under the revised scheme,
whitespace adjacent to a constituent will therefore in some cases
effectively be considered part of the constituent.  I consider that
slightly weird, but perhaps not too big a deal.  The main thing is that
we need to pick a convention as to which position in the whitespace is
to be considered the label of the vertex.  One candidate convention
would be: for any given vertex, its character-based label is the
smallest +FROM value of any token starting from it, if any; and if no
token starts at it, then the largest +TO value of any token ending at
it.  I would expect that at least in ordinary cases, and possibly in all
cases, all the incident +FROM values would be identical, and likewise
all the +TO values, with only the +FROMs differing from the +TOs.
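
As a minimal sketch of that convention in Python (the token tuples here
are hypothetical stand-ins for ERG token feature structures, not real
ones):

    # Tokens as (form, start_vertex, end_vertex, +FROM, +TO) tuples,
    # here for "A zebra arose." under the old scheme, with the period
    # pseudo-affixed to the final token.
    def vertex_labels(tokens):
        starts, ends = {}, {}
        for _, sv, ev, cfrom, cto in tokens:
            starts[sv] = min(starts.get(sv, cfrom), cfrom)
            ends[ev] = max(ends.get(ev, cto), cto)
        labels = dict(ends)
        labels.update(starts)  # a +FROM wins wherever both are available
        return labels

    tokens = [("a", 0, 1, 0, 1), ("zebra", 1, 2, 2, 7), ("arose.", 2, 3, 8, 14)]
    print(sorted(vertex_labels(tokens).items()))
    # [(0, 0), (1, 2), (2, 8), (3, 14)]

A check for two vertices receiving the same label would flag exactly the
hyphenation problem discussed next.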

A somewhat more troubling problem is that multiple tokens in the ERG can
share the same +FROM and +TO values, so that distinct vertices can end
up with the same character-based label.  This happens quite productively
with hyphenation, e.g.:

A four-footed zebra arose.

The historical ERG assigns [ +FROM "2" +TO "13" ] to both "four" and
"footed" even though the token lattice is split in the middle, i.e.
there are two tokens and there is a vertex "in between" them, but there
is no sensible character offset available to assign to it.  In the
existing vertex-labeling scheme, the vertex labels are generated from a
topological sort of the lattice, so we get:
a(0,1)
four(1,2)
footed(2,3)
zebra(3,4)
arose(4,5)

Using the convention proposed above, this would translate into:
a(0,2)
four(2,2)
footed(2,14)
zebra(14,20)
arose(20,26)

As you can see, there is a problem: two distinct vertices got smushed
into character position 2.  The situation is detectable automatically,
of course, and ACE already has a built-in hack to adjust token +FROM and
+TO in this case (making it possible to use the mouse to select parts of
a hyphenated group like that in FFTB), but relying on that hack means
hoping that ACE made the same decisions as the new punctuation rules in
this case and in any others that I haven't thought of.

I am tempted to look at an alternative way of achieving the primary goal
(i.e. synchronizing the ERG treebanks with the revised punctuation
scheme).  It would, I believe, be possible, maybe even straightforward,
to build a tool that takes as input two token lattices (the old one and
the new one for the same sentence) and computes an alignment between
them that minimizes some notion of edit distance.  With that in hand,
the vertex identifiers of the old discriminants could be rewritten
without resorting to character positions or having to solve the above
snafu.  It would also require no changes to the parsing engines or the
treebanking tool, and would likely be at least partially reusable for
future tokenization changes.
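
As a rough sketch of that idea for the common case where both lattices
are in fact linear token sequences (Python again; difflib stands in for
a proper lattice edit-distance, and the token forms are hypothetical):

    from difflib import SequenceMatcher

    def map_vertices(old_tokens, new_tokens):
        # Align two tokenizations of the same sentence and derive a
        # mapping from old vertex numbers to new vertex numbers;
        # unmatched regions would need manual attention in practice.
        mapping = {0: 0}
        matcher = SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)
        for i, j, size in matcher.get_matching_blocks():
            for k in range(size + 1):
                mapping[i + k] = j + k
        return mapping

    old = ["a", "four", "footed", "zebra", "arose."]
    new = ["a", "four", "footed", "zebra", "arose", "."]
    print(map_vertices(old, new))
    # {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 6}

Note how the old vertex after "arose." maps to the new vertex after the
separated period, which is just what the clitic analysis would want.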

Any suggestions?
Woodley

On 11/24/2019 03:43 PM, Stephan Oepen wrote:
> many thanks for the quick follow-up, woodley!
>
> in general, character-based discriminants feel attractive because the idea
> promises increased robustness to variation over time in tokenization.  and
> i am not yet sure i understand the difference in expressivity that you
> suggest?  an input to parsing is segmented into a sequence of vertices (or
> breaking points); whether to number these continuously (0, 1, 2, …) or
> discontinuously according to e.g. corresponding character positions or time
> stamps (into a speech signal)—i would think i can encode the same broad
> range of lattices either way?
>
> closer to home, i was in fact thinking that the conversion from an existing
> set of discriminants to a character-based regime could be more mechanical
> than the retooling you sketch.  each current vertex should be uniquely
> identified with a left and right character position, viz. the +FROM and
> +TO values, respectively, on the underlying token feature structures (i am
> assuming that all tokens in one cell share the same values).  for the vast
> majority of discriminants, would it not just work to replace their start
> and end vertices with these character positions?
>
> i am prepared to lose some discriminants, e.g. any choices on the
> punctuation lexical rules that are being removed, but possibly also some
> lexical choices that in the old universe end up anchored to a sub-string
> including one or more punctuation marks.  in the 500-best treebanks, it
> used to be the case that pervasive redundancy of discriminants meant one
> could afford to lose a non-trivial number of discriminants during an update
> and still arrive at a unique solution.  but maybe that works differently in
> the full-forest universe?
>
> finally, i had not yet considered the 'twigs' (as they are an FFTB-specific
> innovation).  yes, it would seem unfortunate to just lose all twigs that
> included one or more of the old punctuation rules!  so your candidate
> strategy of cutting twigs into two parts (of which one might often come out
> empty) at occurrences of these rules strikes me as a promising (and still
> quite mechanical) way of working around this problem.  formally, breaking
> up twigs risks losing some information, but i doubt much would actually be
> lost in this case.
>
> thanks for tossing around this idea!  oe
>
>
> On Sat, 23 Nov 2019 at 20:30 Woodley Packard <sweaglesw at sweaglesw.org>
> wrote:
>
>> Hi Stephan,
>>
>> My initial reaction to the notion of character-based discriminants is (1)
>> it will not solve your immediate problem without a certain amount of custom
>> tooling to convert old discriminants to new ones in a way that is sensitive
>> to how the current punctuation rules work, i.e. a given chart vertex will
>> have to be able to map to several different character positions depending
>> on how much punctuation has been cliticized so far.  The twig-shaped
>> discriminants used by FFTB will in some cases have to be bifurcated into
>> two or more discriminants, as well. Also, (2) this approach loses the
>> (theoretical, if perhaps not recently used) ability to treebank a nonlinear,
>> lattice-shaped input, e.g. from an ASR system.  I could imagine treebanking
>> lattices from other sources as well — perhaps an image caption generator.
>>
>> Given the custom tooling required for updating the discriminants, I’m not
>> sure switching to character-based anchoring would be less painful than
>> having that tool compute the new chart vertex anchoring instead — though I
>> could be wrong.  What other arguments can be made in favor of
>> character-based discriminants?
>>
>> In terms of support from FFTB, I think there are relatively few places in
>> the code that assume the discriminants’ from/to are interpretable beyond
>> matching the from/to values of the `edge’ relation.  I think I would
>> implement this by (optionally, I suppose, since presumably other grammars
>> won’t want to do this, at least for now) replacing the from/to on edges read
>> from the profile with character positions and more or less pretending that
>> there is a chart vertex for every character position.  Barring unforeseen
>> complications, that wouldn’t be too hard.
>>
>> Woodley
>>
>>> On Nov 23, 2019, at 5:58 AM, Stephan Oepen <oe at ifi.uio.no> wrote:
>>>
>>> hi again, woodley,
>>>
>>> dan and i are currently exploring a 'makeover' of ERG input
>>> processing, with the overall goal of increased compatibility with
>>> mainstream assumptions about tokenization.
>>>
>>> among other things, we would like to move to the revised (i.e.
>>> non-venerable) PTB (and OntoNotes and UD) tokenization conventions and
>>> avoid subsequent re-arranging of segmentation in token mapping.  this
>>> means we would have to move away from the pseudo-affixation treatment
>>> of punctuation marks to a 'pseudo-cliticization' approach, meaning that
>>> punctuation marks are lexical entries in their own right and attach
>>> via binary constructions (rather than as lexical rules).  the 'clitic'
>>> metaphor, here, is intended to suggest that these lexical entries can
>>> only attach at the bottom of the derivation, i.e. to non-clitic
>>> lexical items immediately to their left (e.g. in the case of a comma)
>>> or to their right (in the case of, say, an opening quote or
>>> parenthesis).
>>>
>>> dan is currently visiting oslo, and we would like to use the
>>> opportunity to estimate the cost of moving to such a revised universe.
>>> treebank maintenance is a major concern here, as such a radical change
>>> in the yields of virtually all derivations would render discriminants
>>> invalid when updating to the new forests.  a cute idea has emerged
>>> that, we optimistically believe, might eliminate much of that concern:
>>> character-based discriminant positions, instead of our venerable way
>>> of counting chart vertices.
>>>
>>> for the ERG at least, we believe that leaf nodes in all derivations
>>> are reliably annotated with character start and end positions (+FROM
>>> and +TO, as well as the +ID lists on token feature structures).  these
>>> sub-string indices will hardly be affected by the above change to
>>> tokenization (except for cases where our current approach to splitting
>>> at hyphens and slashes first in token mapping leads to overlapping
>>> ranges).  hence if discriminants were anchored over character ranges
>>> instead of chart cells ... i expect the vast majority of them might
>>> just carry over?
>>>
>>> we would be grateful if you (and others too, of course) could give the
>>> above idea some critical thought and look for possible obstacles that
>>> dan and i may just be overlooking?  technically, i imagine one would
>>> have to extend FFTB to (optionally) extract discriminant start and end
>>> positions from the sub-string 'coverage' of each constituent, possibly
>>> convert existing treebanks once to character-based indexing, and then
>>> update into the new universe using character-based matching.  does
>>> such an approach seem feasible to you in principle?
>>>
>>> cheers, oe
>>


