<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<style type="text/css" style="display:none;"> P {margin-top:0;margin-bottom:0;} </style>
</head>
<body dir="ltr">
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Hi Woodley and Stephan,</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
[and with apologies to everyone else for the cryptic flavor of this note, which has to do with a conversion of the ERG to treat punctuation marks as separate tokens, for better interoperability with the rest of the universe]</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I was able to use the converted `decision' files that you constructed during my visit in February, Woodley, with some non-zero amount of additional manual disambiguation, and this morning I completed the update of the full set of 2018 gold trees into the makeover universe,
including wsj00-04. I would now be grateful if you could also provide converted decision files for the wsj05-12 profiles that had also been updated with the 2018 grammar after it was released. Since the 2018mo grammar doesn't really have a natural home in
SVN, I have put a full copy of it here, and included in its tsdb/gold directory both the recently updated profiles and the 2018 ones for wsj05-wsj12 that I hope you'll convert:</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<a href="http://lingo.stanford.edu/danf/2018mo.tgz" id="LPlnk632953">http://lingo.stanford.edu/danf/2018mo.tgz</a></div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
My intention is to now update these gold profiles from that time-warped 2018mo grammar to the SVN `mo' grammar (which we branched from `trunk' during my visit to Oslo in November). If all goes well, we should then be in position to anoint `mo' as the official
new `trunk' version, and use this as the basis for the next stable ERG release, ideally this summer.</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
I would also be interested to know whether these now-manually-updated profiles allow you to train a better disambiguation model than the one you trained in February on just the automatically updated items.<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Thanks for the help so far!</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
Dan</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div style="font-family: Calibri, Arial, Helvetica, sans-serif; font-size: 12pt; color: rgb(0, 0, 0);">
<br>
</div>
<div id="appendonsend"></div>
<hr style="display:inline-block;width:98%" tabindex="-1">
<div id="divRplyFwdMsg" dir="ltr"><font face="Calibri, sans-serif" style="font-size:11pt" color="#000000"><b>From:</b> developers-bounces@emmtee.net <developers-bounces@emmtee.net> on behalf of Woodley Packard <sweaglesw@sweaglesw.org><br>
<b>Sent:</b> Tuesday, February 4, 2020 4:35 PM<br>
<b>To:</b> Stephan Oepen <oe@ifi.uio.no><br>
<b>Cc:</b> developers@delph-in.net <developers@delph-in.net><br>
<b>Subject:</b> Re: [developers] character-based discriminants</font>
<div> </div>
</div>
<div class="BodyFragment"><font size="2"><span style="font-size:11pt;">
<div class="PlainText">Stephan and Dan, and other interested parties,<br>
<br>
Happy new year to you all. In the course of taking a closer look at how <br>
the proposed character-based discriminant system might work, I've run <br>
across a few cases that perhaps would benefit from a bit of discussion. <br>
First, my attempt to distill the proposed action plan for an automatic <br>
update (downdate?) of the ERG treebanks to the venerable PTB punctuation <br>
convention is as follows:<br>
<br>
1. Modify ACE and other engines to use input character positions as <br>
token vertex identifiers, so that data coming out -- particularly the <br>
full forest record in the "edge" relation -- uses these to identify <br>
constituent boundaries instead of the existing identifiers <br>
(corresponding roughly to whitespace areas).<br>
<br>
2. Mechanically revise a copy of the "decisions" relation from the old <br>
gold treebank so that the vertex identifiers in it are also <br>
character-based, in hopes of matching those used in the new full forest <br>
profiles. Destroy any discriminants that are judged unlikely to match <br>
correctly.<br>
<br>
3. Run an automatic treebank update to achieve a high coverage gold <br>
treebank under the new punctuation convention; manually fix any items <br>
that didn't quite make it.<br>
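The mechanical rewrite in step 2 could be sketched roughly as follows; note that the record layout (tab-separated fields, with the from/to vertices in columns 4 and 5) is an illustrative assumption, not the actual [incr tsdb()] "decision" schema:

```python
# Hypothetical sketch of step 2: rewrite the vertex identifiers in a
# copy of the "decision" relation, destroying any discriminant whose
# vertices have no character-based equivalent.  The field layout is an
# assumption for illustration only.

def convert_decisions(records, vertex_to_char, from_col=4, to_col=5):
    kept, destroyed = [], 0
    for rec in records:
        fields = rec.rstrip("\n").split("\t")
        v_from, v_to = fields[from_col], fields[to_col]
        if v_from in vertex_to_char and v_to in vertex_to_char:
            fields[from_col] = vertex_to_char[v_from]
            fields[to_col] = vertex_to_char[v_to]
            kept.append("\t".join(fields))
        else:
            destroyed += 1  # judged unlikely to match; drop it
    return kept, destroyed

# Two discriminants; vertex 2 has no character position in the map
# (the hyphenation problem discussed below), so its record is dropped:
recs = ["10\thd_cmp\t-\t+\t1\t2", "11\tsp_hd\t-\t+\t0\t3"]
v2c = {"0": "0", "1": "2", "3": "14"}
```

The per-record independence here is the point: every discriminant either survives the rewrite intact or is destroyed outright, exactly as step 2 prescribes.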
<br>
Stephan pointed out that the +FROM/+TO values on token AVMs are a way to <br>
convert existing vertices to character positions. Thinking a bit more <br>
closely about this, there is at least one obvious problem: adjacent <br>
tokens T1,T2 do not generally have the property that T1.+TO = T2.+FROM, <br>
because there is usually whitespace between them. Therefore the revised <br>
scheme will have the property that whitespace adjacent to a constituent <br>
will in a sense be considered part of the constituent in some cases. I <br>
consider that slightly weird, but perhaps not too big a deal. The main <br>
thing is we need to pick a convention as to which position in the <br>
whitespace is to be considered the label of the vertex. One candidate <br>
convention would be that for any given vertex, its character-based label <br>
is the smallest +FROM value of any token starting from it, if any, and <br>
if no token starts at it, then the largest +TO value of any token ending <br>
at it. I would expect that at least in ordinary cases, possibly all <br>
cases, all the incident +FROMs would be identical and all the +TOs would <br>
be identical also, just with a difference between the +FROMs and +TOs.<br>
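As a concrete sketch of that candidate convention (with tokens simplified to plain tuples rather than real token AVMs, and +FROM/+TO as integers):

```python
# Minimal sketch of the candidate convention: label a vertex with the
# smallest +FROM of any token starting at it; if no token starts at it,
# use the largest +TO of any token ending at it.  Tokens are simplified
# to (form, start_vertex, end_vertex, from_char, to_char) tuples.

def vertex_label(vertex, tokens):
    starts = [frm for _, sv, _, frm, _ in tokens if sv == vertex]
    if starts:
        return min(starts)
    return max(to for _, _, ev, _, to in tokens if ev == vertex)

# "A four-footed zebra arose." with the hyphenated pair sharing
# +FROM 2 / +TO 13 (and the period pseudo-affixed to "arose"):
tokens = [("a", 0, 1, 0, 1),
          ("four", 1, 2, 2, 13),
          ("footed", 2, 3, 2, 13),
          ("zebra", 3, 4, 14, 19),
          ("arose.", 4, 5, 20, 26)]
labels = {v: vertex_label(v, tokens) for v in range(6)}
# vertices 1 and 2 both come out labeled 2 -- the smushing problem
# discussed below
```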
<br>
A somewhat more troubling problem is that multiple token vertices in the <br>
ERG can share the same +FROM and +TO. This happens quite productively <br>
with hyphenation, e.g.:<br>
<br>
A four-footed zebra arose.<br>
<br>
The historical ERG assigns [ +FROM "2" +TO "13" ] to both "four" and <br>
"footed" even while the token lattice is split in the middle, i.e. there <br>
are two tokens and there is a vertex "in between" them, but there is no <br>
sensible character offset available to assign to it. In the existing <br>
vertex labeling scheme, the vertex labels are generated based on a <br>
topological sort of the lattice, so we get:<br>
a(0,1)<br>
four(1,2)<br>
footed(2,3)<br>
zebra(3,4)<br>
arose(4,5)<br>
<br>
Using the convention proposed above (and given the shared [ +FROM "2" +TO "13" ]), this would translate into:<br>
a(0,2)<br>
four(2,2)<br>
footed(2,14)<br>
zebra(14,20)<br>
arose(20,26)<br>
<br>
As you can see, there is a problem: two distinct vertices got smushed <br>
into character position 2. The situation is detectable automatically, <br>
of course, and ACE actually already has a built-in hack to adjust token <br>
+FROM and +TO in this case (making it possible to use the mouse to <br>
select parts of a hyphenated group like that in FFTB), but relying on <br>
that hack means hoping that ACE made the same decisions as the new <br>
punctuation rules in this case and any others that I haven't thought of.<br>
<br>
I am tempted to look at an alternative way of achieving the primary goal <br>
(i.e. synchronizing the ERG treebanks to the revised punctuation <br>
scheme). It would I believe be possible, maybe even straightforward, to <br>
make a tool that takes as input two token lattices (the old one and the <br>
new one for the same sentence) and computes an alignment between them <br>
that minimizes some notion of edit distance. With that in hand, the <br>
vertex identifiers of the old discriminants could be rewritten without <br>
resorting to character positions or having to solve the above snafu. It <br>
also would require no changes to the parsing engines or the treebanking <br>
tool, and would likely be at least partially reusable for future <br>
tokenization changes.<br>
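For the linear (non-lattice) case, such a tool might look roughly like the following; using difflib's longest-matching-block alignment as a stand-in for a proper edit-distance computation, and aligning on surface forms only, are simplifying assumptions:

```python
# Sketch of the alternative: align the old and new token sequences for
# a sentence and derive an old-vertex -> new-vertex map, so that the
# vertex identifiers of old discriminants can be rewritten directly,
# without resorting to character positions.
import difflib

def vertex_map(old_tokens, new_tokens):
    sm = difflib.SequenceMatcher(a=old_tokens, b=new_tokens,
                                 autojunk=False)
    mapping = {}
    for a, b, size in sm.get_matching_blocks():
        # matched spans share all internal and boundary vertices; the
        # terminating zero-length block aligns the final vertices
        for k in range(size + 1):
            mapping[a + k] = b + k
    return mapping

# Old pseudo-affixed tokenization vs. new PTB-style tokenization:
old = ["A", "four-footed", "zebra", "arose."]
new = ["A", "four-footed", "zebra", "arose", "."]
```

Here old vertex 4 (the end of "arose.") maps to new vertex 5 (the end of "."), while the unchanged prefix maps through identically; a real implementation would of course need lattice-aware alignment and a tuned cost function.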
<br>
Any suggestions?<br>
Woodley<br>
<br>
On 11/24/2019 03:43 PM, Stephan Oepen wrote:<br>
> many thanks for the quick follow-up, woodley!<br>
><br>
> in general, character-based discriminants feel attractive because the idea<br>
> promises increased robustness to variation over time in tokenization. and<br>
> i am not sure yet i understand the difference in expressivity that you<br>
> suggest? an input to parsing is segmented into a sequence of vertices (or<br>
> breaking points); whether to number these continuously (0, 1, 2, …) or<br>
> discontinuously according to e.g. corresponding character positions or time<br>
> stamps (into a speech signal)—i would think i can encode the same broad<br>
> range of lattices either way?<br>
><br>
> closer to home, i was in fact thinking that the conversion from an existing<br>
> set of discriminants to a character-based regime could in fact be more<br>
> mechanic than the retooling you sketch. each current vertex should be<br>
> uniquely identified with a left and right character position, viz. the<br>
> +FROM and +TO values, respectively, on the underlying token feature<br>
> structures (i am assuming that all tokens in one cell share the same<br>
> values). for the vast majority of discriminants, would it not just work to<br>
> replace their start and end vertices with these character positions?<br>
><br>
> i am prepared to lose some discriminants, e.g. any choices on the<br>
> punctuation lexical rules that are being removed, but possibly also some<br>
> lexical choices that in the old universe end up anchored to a sub-string<br>
> including one or more punctuation marks. in the 500-best treebanks, it<br>
> used to be the case that pervasive redundancy of discriminants meant one<br>
> could afford to lose a non-trivial number of discriminants during an update<br>
> and still arrive at a unique solution. but maybe that works differently in<br>
> the full-forest universe?<br>
><br>
> finally, i had not yet considered the ‘twigs’ (as they are an FFTB-specific<br>
> innovation). yes, it would seem unfortunate to just lose all twigs that<br>
> included one or more of the old punctuation rules! so your candidate<br>
> strategy of cutting twigs into two parts (of which one might often come out<br>
> empty) at occurrences of these rules strikes me as a promising (still quite<br>
> mechanic) way of working around this problem. formally, breaking up twigs<br>
> risks losing some information, but in this case i doubt this would be the<br>
> case in actuality.<br>
><br>
> thanks for tossing around this idea! oe<br>
><br>
><br>
> On Sat, 23 Nov 2019 at 20:30 Woodley Packard <sweaglesw@sweaglesw.org><br>
> wrote:<br>
><br>
>> Hi Stephan,<br>
>><br>
>> My initial reaction to the notion of character-based discriminants is (1)<br>
>> it will not solve your immediate problem without a certain amount of custom<br>
>> tooling to convert old discriminants to new ones in a way that is sensitive<br>
>> to how the current punctuation rules work, i.e. a given chart vertex will<br>
>> have to be able to map to several different character positions depending<br>
>> on how much punctuation has been cliticized so far. The twig-shaped<br>
>> discriminants used by FFTB will in some cases have to be bifurcated into<br>
>> two or more discriminants, as well. Also, (2) this approach loses the<br>
>> (theoretical if perhaps not recently used) ability to treebank a nonlinear<br>
>> lattice shaped input, e.g. from an ASR system. I could imagine treebanking<br>
>> lattices from other sources as well — perhaps an image caption generator.<br>
>><br>
>> Given the custom tooling required for updating the discriminants, I’m not<br>
>> sure switching to character-based anchoring would be less painful than<br>
>> having that tool compute the new chart vertex anchoring instead — though I<br>
>> could be wrong. What other arguments can be made in favor of<br>
>> character-based discriminants?<br>
>><br>
>> In terms of support from FFTB, I think there are relatively few places in<br>
>> the code that assume the discriminants’ from/to are interpretable beyond<br>
>> matching the from/to values of the `edge’ relation. I think I would<br>
>> implement this by (optionally, I suppose, since presumably other grammars<br>
>> won’t want to do this at least for now) replacing the from/to on edges read<br>
>> from the profile with character positions and more or less pretend that<br>
>> there is a chart vertex for every character position. Barring unforeseen<br>
>> complications, that wouldn’t be too hard.<br>
>><br>
>> Woodley<br>
>><br>
>>> On Nov 23, 2019, at 5:58 AM, Stephan Oepen <oe@ifi.uio.no> wrote:<br>
>>><br>
>>> hi again, woodley,<br>
>>><br>
>>> dan and i are currently exploring a 'makeover' of ERG input<br>
>>> processing, with the overall goal of increased compatibility with<br>
>>> mainstream assumptions about tokenization.<br>
>>><br>
>>> among other things, we would like to move to the revised (i.e.<br>
>>> non-venerable) PTB (and OntoNotes and UD) tokenization conventions and<br>
>>> avoid subsequent re-arranging of segmentation in token mapping. this<br>
>>> means we would have to move away from the pseudo-affixation treatment<br>
>>> of punctuation marks to a 'pseudo-clitization' approach, meaning that<br>
>>> punctuation marks are lexical entries in their own right and attach<br>
>>> via binary constructions (rather than as lexical rules). the 'clitic'<br>
>>> metaphor, here, is intended to suggest that these lexical entries can<br>
>>> only attach at the bottom of the derivation, i.e. to non-clitic<br>
>>> lexical items immediately to their left (e.g. in the case of a comma)<br>
>>> or to their right (in the case of, say, an opening quote or<br>
>>> parenthesis).<br>
>>><br>
>>> dan is currently visiting oslo, and we would like to use the<br>
>>> opportunity to estimate the cost of moving to such a revised universe.<br>
>>> treebank maintenance is a major concern here, as such a radical change<br>
>>> in the yields of virtually all derivations would render discriminants<br>
>>> invalid when updating to the new forests. i believe a cute idea has<br>
>>> emerged that, we optimistically believe, might eliminate much of that<br>
>>> concern: character-based discriminant positions, instead of our<br>
>>> venerable way of counting chart vertices.<br>
>>><br>
>>> for the ERG at least, we believe that leaf nodes in all derivations<br>
>>> are reliably annotated with character start and end positions (+FROM<br>
>>> and +TO, as well as the +ID lists on token feature structures). these<br>
>>> sub-string indices will hardly be affected by the above change to<br>
>>> tokenization (except for cases where our current approach to splitting<br>
>>> at hyphens and slashes first in token mapping leads to overlapping<br>
>>> ranges). hence if discriminants were anchored over character ranges<br>
>>> instead of chart cells ... i expect the vast majority of them might<br>
>>> just carry over?<br>
>>><br>
>>> we would be grateful if you (and others too, of course) could give the<br>
>>> above idea some critical thought and look for possible obstacles that<br>
>>> dan and i may just be overlooking? technically, i imagine one would<br>
>>> have to extend FFTB to (optionally) extract discriminant start and end<br>
>>> positions from the sub-string 'coverage' of each constituent, possibly<br>
>>> once convert existing treebanks to character-based indexing, and then<br>
>>> update into the new universe using character-based matching. does<br>
>>> such an approach seem feasible to you in principle?<br>
>>><br>
>>> cheers, oe<br>
>><br>
<br>
</div>
</span></font></div>
</body>
</html>