[developers] character-based discriminants

goodman.m.w at gmail.com goodman.m.w at gmail.com
Mon Nov 25 12:09:05 CET 2019


Hi Woodley, Stephan, and all,

I'm happy to hear of an effort to improve compatibility with "mainstream"
tokenization. I have a few questions/concerns, but for now I'll just voice
one:

Last I heard, ACE and the LKB are using characterization procedures that
predate Bec's work and can output different character positions than the
standalone REPP tool (or PET, which uses the same implementation). If we
are moving to use character positions more generally, then this is a good
time to ensure that all our tools produce consistent characterization. I
performed a quick comparison of PyDelphin's output with that of ACE and the
standalone REPP binary and I'll report my findings in a separate thread. On
a related note, the default ERG configs for ACE and the LKB (and PET, if
pet/repp.set is used) activate a different set of REPP modules. For
example, ACE loads the html set while the LKB does not, so ACE parses
something like "Abrams will <s>not</s> go." as "Abrams will go." while the
LKB parses it as "Abrams will not go.". If we don't update our tools and
grammars to be more consistent, we should at least clearly document which
tool/config is used, for reproducibility.
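To make the characterization issue concrete, here is a toy sketch (not the real REPP machinery, and not PyDelphin's API) of how loading an html-like module changes both the token yield and which character spans survive, while characterization keeps offsets anchored in the original string:

```python
import re

def tokenize_with_offsets(text, strip_html=False):
    """Toy characterization sketch (NOT the real REPP implementation):
    tokenize on whitespace while recording (from, to) offsets into the
    ORIGINAL string.  strip_html loosely mimics the effect of loading
    an html module that removes markup such as <s>...</s> wholesale."""
    if strip_html:
        # drop whole <s>...</s> spans, padding with spaces so that the
        # offsets of the surviving material stay unchanged
        masked = re.sub(r'<s>.*?</s>', lambda m: ' ' * len(m.group(0)), text)
    else:
        # remove only the tags themselves, keeping their content
        masked = re.sub(r'</?s>', lambda m: ' ' * len(m.group(0)), text)
    return [(m.group(0), m.start(), m.end())
            for m in re.finditer(r'\S+', masked)]

text = "Abrams will <s>not</s> go."
# with an html-like module active: "not" disappears (cf. ACE's default)
assert [t for t, _, _ in tokenize_with_offsets(text, strip_html=True)] == \
    ["Abrams", "will", "go."]
# without it: "not" survives at characters 15-18 (cf. the LKB's default)
assert tokenize_with_offsets(text)[2] == ("not", 15, 18)
```

The point of the padding trick is that downstream character positions remain comparable across configurations, even when the two configurations disagree about which tokens exist.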


On Mon, Nov 25, 2019 at 2:08 PM Woodley Packard <sweaglesw at sweaglesw.org>
wrote:

> Hi again,
>
> Certainly there is no change in the expressiveness of the system if all
> you propose is that the vertices in the token lattice be, for convenience,
> labelled whenever possible by character position.  Of course one can come
> up with theoretical examples of a lattice involving two vertices that would
> naturally be identified with the same character position but with different
> incident edges, but I have no reason to suspect such lattices would be
> useful in real scenarios, and in any case the ability to express them is
> not lost so long as the convention of labelling vertices by character
> positions can be broken by the advanced user who has a good reason to do
> so.  If something about the redesign or reimplementation of the system were
> to demand strictly that all vertex labels be interpretable as character
> offsets, then I believe my hesitation is valid -- namely that some
> conceivably useful input lattices may have no natural notion of character
> position.
>
> In terms of the conversion process, I believe the mechanism you outline
> will work correctly for phrase structure-level discriminants, but fail for
> discriminants that involve lexical rules applying before punctuation rules,
> of which I was prepared to guess there are a significant number.  I don’t
> yet have a sense as to how robust the full forest update process can be to
> losing a few discriminants.  In the top-500 universe, the system used to
> store all of the discriminants that were inferred to be negative.  For
> better or for worse, FFTB doesn’t do that, since, unless I am recalling
> incorrectly, there can be a pretty nontrivial number of them.  Whether by
> that mechanism or another, I certainly would agree that a nearly-fully
> automatic update to the revised punctuation scheme could be hoped for.
>
> Woodley
>
> On Nov 24, 2019, at 3:43 PM, Stephan Oepen <oe at ifi.uio.no> wrote:
>
> many thanks for the quick follow-up, woodley!
>
> in general, character-based discriminants feel attractive because the idea
> promises increased robustness to variation over time in tokenization.  and
> i am not sure yet i understand the difference in expressivity that you
> suggest?  an input to parsing is segmented into a sequence of vertices (or
> breaking points); whether to number these continuously (0, 1, 2, …) or
> discontinuously according to e.g. corresponding character positions or time
> stamps (into a speech signal)—i would think i can encode the same broad
> range of lattices either way?
>
> closer to home, i was in fact thinking that the conversion from an
> existing set of discriminants to a character-based regime could be more
> mechanical than the retooling you sketch.  each current vertex should be
> uniquely identified with a left and right character position, viz. the
> +FROM and +TO values, respectively, on the underlying token feature
> structures (i am assuming that all tokens in one cell share the same
> values).  for the vast majority of discriminants, would it not just work to
> replace their start and end vertices with these character positions?
>
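[the rewrite stephan describes can be sketched roughly as follows; the `vertex_from`/`vertex_to` maps are hypothetical structures built from the +FROM/+TO values on token feature structures, not an existing [incr tsdb()] or FFTB API:]

```python
def to_char_discriminant(disc, vertex_from, vertex_to):
    """Rewrite a vertex-anchored discriminant (start, end) to character
    anchoring.  Assumes every chart vertex is uniquely identified with
    a left and a right character position, per the +FROM/+TO values on
    the underlying token feature structures."""
    start, end = disc
    return (vertex_from[start], vertex_to[end])

# toy chart for "Abrams will not go.": five tokens over vertices 0..5
vertex_from = {0: 0, 1: 7, 2: 12, 3: 16, 4: 18}  # +FROM per start vertex
vertex_to = {1: 6, 2: 11, 3: 15, 4: 18, 5: 19}   # +TO per end vertex

# a discriminant over "will not" (vertices 1-3) becomes characters 7-15
assert to_char_discriminant((1, 3), vertex_from, vertex_to) == (7, 15)
```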
> i am prepared to lose some discriminants, e.g. any choices on the
> punctuation lexical rules that are being removed, but possibly also some
> lexical choices that in the old universe end up anchored to a sub-string
> including one or more punctuation marks.  in the 500-best treebanks, it
> used to be the case that pervasive redundancy of discriminants meant one
> could afford to lose a non-trivial number of discriminants during an update
> and still arrive at a unique solution.  but maybe that works differently in
> the full-forest universe?
>
> finally, i had not yet considered the ‘twigs’ (as they are an
> FFTB-specific innovation).  yes, it would seem unfortunate to just lose all
> twigs that included one or more of the old punctuation rules!  so your
> candidate strategy of cutting twigs into two parts (of which one might
> often come out empty) at occurrences of these rules strikes me as a
> promising (still quite mechanical) way of working around this problem.
>  formally, breaking up twigs risks losing some information, but i doubt
> that would matter much in actuality.
>
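[the cutting strategy is simple enough to sketch; the twig representation (a plain list of unary rule names) and the rule names below are illustrative guesses, not FFTB's actual data structures:]

```python
def split_twig(twig, punct_rules):
    """Cut a twig (here just a chain of unary rule names) into sub-twigs
    at occurrences of the old punctuation lexical rules; empty parts are
    dropped, as they often will be when the punctuation rule sits at an
    edge of the twig."""
    parts, current = [], []
    for rule in twig:
        if rule in punct_rules:
            if current:          # close off the part accumulated so far
                parts.append(current)
            current = []
        else:
            current.append(rule)
    if current:
        parts.append(current)
    return parts

# punctuation rule at the edge: one part survives, the other is empty
assert split_twig(["w_period_plr", "n_pl_olr"], {"w_period_plr"}) \
    == [["n_pl_olr"]]
# punctuation rule in the middle: the twig is bifurcated
assert split_twig(["a_lr", "w_comma_plr", "b_lr"], {"w_comma_plr"}) \
    == [["a_lr"], ["b_lr"]]
```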
> thanks for tossing around this idea!  oe
>
>
> On Sat, 23 Nov 2019 at 20:30 Woodley Packard <sweaglesw at sweaglesw.org>
> wrote:
>
>> Hi Stephan,
>>
>> My initial reaction to the notion of character-based discriminants is (1)
>> it will not solve your immediate problem without a certain amount of custom
>> tooling to convert old discriminants to new ones in a way that is sensitive
>> to how the current punctuation rules work, i.e. a given chart vertex will
>> have to be able to map to several different character positions depending
>> on how much punctuation has been cliticized so far.  The twig-shaped
>> discriminants used by FFTB will in some cases have to be bifurcated into
>> two or more discriminants, as well. Also, (2) this approach loses the
>> (theoretical if perhaps not recently used) ability to treebank a nonlinear
>> lattice shaped input, e.g. from an ASR system.  I could imagine treebanking
>> lattices from other sources as well — perhaps an image caption generator.
>>
>> Given the custom tooling required for updating the discriminants, I’m not
>> sure switching to character-based anchoring would be less painful than
>> having that tool compute the new chart vertex anchoring instead — though I
>> could be wrong.  What other arguments can be made in favor of
>> character-based discriminants?
>>
>> In terms of support from FFTB, I think there are relatively few places in
>> the code that assume the discriminants’ from/to are interpretable beyond
>> matching the from/to values of the `edge’ relation.  I think I would
>> implement this by (optionally, I suppose, since presumably other grammars
>> won’t want to do this at least for now) replacing the from/to on edges read
>> from the profile with character positions and more or less pretend that
>> there is a chart vertex for every character position.  Barring unforeseen
>> complications, that wouldn’t be too hard.
>>
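[Woodley's sketch — rewriting the from/to on edges read from the profile and pretending there is a chart vertex for every character position — might look roughly like this; the field names are illustrative, not the real tsdb relations:]

```python
def char_anchor_edges(edges, tokens):
    """Sketch of the optional FFTB behavior described above: replace the
    from/to vertex labels on edge records with character positions taken
    from the +FROM/+TO of the leaf tokens (cfrom/cto below)."""
    starts, ends = {}, {}
    for t in tokens:
        # all tokens sharing a vertex are assumed to agree on its
        # character position; the asserts make that assumption explicit
        assert starts.setdefault(t["from"], t["cfrom"]) == t["cfrom"]
        assert ends.setdefault(t["to"], t["cto"]) == t["cto"]
    return [dict(e, **{"from": starts[e["from"]], "to": ends[e["to"]]})
            for e in edges]

tokens = [
    {"from": 0, "to": 1, "cfrom": 0, "cto": 6},   # "Abrams"
    {"from": 1, "to": 2, "cfrom": 7, "cto": 11},  # "will"
]
edges = [{"id": 1, "from": 0, "to": 2}]
# the edge spanning both tokens now runs over characters 0-11
assert char_anchor_edges(edges, tokens) == [{"id": 1, "from": 0, "to": 11}]
```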
>> Woodley
>>
>> > On Nov 23, 2019, at 5:58 AM, Stephan Oepen <oe at ifi.uio.no> wrote:
>> >
>> > hi again, woodley,
>> >
>> > dan and i are currently exploring a 'makeover' of ERG input
>> > processing, with the overall goal of increased compatibility with
>> > mainstream assumptions about tokenization.
>> >
>> > among other things, we would like to move to the revised (i.e.
>> > non-venerable) PTB (and OntoNotes and UD) tokenization conventions and
>> > avoid subsequent re-arranging of segmentation in token mapping.  this
>> > means we would have to move away from the pseudo-affixation treatment
>> > of punctuation marks to a 'pseudo-cliticization' approach, meaning that
>> > punctuation marks are lexical entries in their own right and attach
>> > via binary constructions (rather than as lexical rules).  the 'clitic'
>> > metaphor, here, is intended to suggest that these lexical entries can
>> > only attach at the bottom of the derivation, i.e. to non-clitic
>> > lexical items immediately to their left (e.g. in the case of a comma)
>> > or to their right (in the case of, say, an opening quote or
>> > parenthesis).
>> >
>> > dan is currently visiting oslo, and we would like to use the
>> > opportunity to estimate the cost of moving to such a revised universe.
>> > treebank maintenance is a major concern here, as such a radical change
>> > in the yields of virtually all derivations would render discriminants
>> > invalid when updating to the new forests.  a cute idea has emerged
>> > that, we optimistically believe, might eliminate much of that
>> > concern: character-based discriminant positions, instead of our
>> > venerable way of counting chart vertices.
>> >
>> > for the ERG at least, we believe that leaf nodes in all derivations
>> > are reliably annotated with character start and end positions (+FROM
>> > and +TO, as well as the +ID lists on token feature structures).  these
>> > sub-string indices will hardly be affected by the above change to
>> > tokenization (except for cases where our current approach to splitting
>> > at hyphens and slashes first in token mapping leads to overlapping
>> > ranges).  hence if discriminants were anchored over character ranges
>> > instead of chart cells ... i expect the vast majority of them might
>> > just carry over?
>> >
>> > we would be grateful if you (and others too, of course) could give the
>> > above idea some critical thought and look for possible obstacles that
>> > dan and i may just be overlooking?  technically, i imagine one would
>> > have to extend FFTB to (optionally) extract discriminant start and end
>> > positions from the sub-string 'coverage' of each constituent, convert
>> > existing treebanks once to character-based indexing, and then
>> > update into the new universe using character-based matching.  does
>> > such an approach seem feasible to you in principle?
>> >
>> > cheers, oe
>>
>>
>

-- 
-Michael Wayne Goodman

