Date: Wed, 11 Apr 2007 13:28:20 +0200
From: Ben Waldron <ben.waldron@hf.ntnu.no>
To: oe@csli.Stanford.EDU
CC: ann.copestake@cl.cam.ac.uk, developers@delph-in.net
Subject: Re: [developers] boring though important: generalizing characterization

the (smaf) xml input is designed to support any pointers which can be encoded as strings (xml attribute values). specifically, we've experimented with integer offset pointers (e.g. character positions, which could also encode time offsets for speech input) and xpoint-based pointers (an xpath expression to specify an xml node, plus an integer offset to specify e.g. a point in the text string inside a text node), but "encoding pointers as strings" is sufficiently general to express any notion of pointer.

the only operation required on these pointers inside the deep parser (when last i looked, at least) is a simple < comparison: pointers (currently cfrom/cto integers) are copied from the preprocessed input (smaf xml, pic xml, yy text, whatever) and passed on from daughter to mother edges in the chart (cfrom of mother = min cfrom of daughters, cto of mother = max cto of daughters) in such a way (quite a fiddly way...) that the eps in the mrs store the corresponding cfrom/cto, which then appear in the output (r)mrs as cfrom/cto values on the eps. the min/max above use the < relation (although note that if the ordering of the pointers is compatible with the ordering of the edges -- e.g. the preprocessor hasn't shuffled its input around -- then the ordering of edges relative to nodes in the chart suffices for this < relation).
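a quick sketch of the propagation step i mean (not LKB or PET code, just illustrative python with made-up edge records), with the < relation passed in as a function so that any string-encodable pointer scheme would work:

```python
# Sketch only: cfrom/cto propagation from daughter to mother edges,
# with a pluggable < relation so pointers need not be integers.

def propagate(daughters, less_than):
    """Mother edge span: min cfrom / max cto over the daughter edges,
    where min and max are defined via the supplied < relation."""
    cfroms = [d["cfrom"] for d in daughters]
    ctos = [d["cto"] for d in daughters]
    cfrom = cfroms[0]
    for p in cfroms[1:]:
        if less_than(p, cfrom):   # p < current minimum
            cfrom = p
    cto = ctos[0]
    for p in ctos[1:]:
        if less_than(cto, p):     # current maximum < p
            cto = p
    return {"cfrom": cfrom, "cto": cto}

# with plain integer character offsets, < is just numeric comparison:
edges = [{"cfrom": 0, "cto": 5}, {"cfrom": 6, "cto": 11}]
mother = propagate(edges, lambda a, b: a < b)
# mother == {"cfrom": 0, "cto": 11}
```

the same `propagate` works unchanged for xpoint-style pointers, provided the caller supplies a `less_than` that knows how to order them.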
so my request: support encoding of from/to pointers as strings, where a custom < relation (a lisp function) can be provided for whatever pointer scheme is used in a particular instance -- both for the lkb and for pet, obviously.

stephan, i think all the pointer/LNK mechanisms you listed can be useful given the right context. token ids can be used if you know what they refer to -- e.g. when the user provides a token lattice as input -- but if the user inputs a string, they will clearly want pointers back into that string.

sorry, i've injured my arm, so i can only compose emails very slowly. converting the current cfrom/cto mechanism inside the deep parsers into something more general will be a very welcome development.

ben

Stephan Oepen wrote:
> hi again, thanks for the quick feedback,
>
>> this proposal appears to be missing the generalised notion of
>> character ranges that we require for XML input - i.e., the
>> combination of xpath with character position.
>
> good, so actually more reason to generalize than i knew about. how
> could i learn more about that scheme? more centralized LNK handling
> should make it easy, in principle, to add another scheme. however,
> more is not always better, of course. i would hope to avoid legacy
> support for a wider range of schemes than is really needed.
>
> but i take it there is no LKB code yet for the scheme you suggest? as
> far as i could see, existing code assumes CFROM and CTO are integers.
>
>> The token lattice notion is _not_ adequate for our purposes, given
>> that we require a general notion of standoff annotation that can work
>> with multiple tokenisations (and indeed, for processing that is not
>> token based).
>
> i would have thought the token-based view was the most general.
> if we defined the deep parser as a process mapping from a token
> lattice into analyses (trees, MRSs, et al.), then the creation of the
> input lattice (tokenization, tagging, spell correction, NE
> recognition, you name it) is external to the parser. with multiple
> tokenizations, each will use unique token identifiers, and someone
> prior to parsing has to decide which tokens are provided as input to
> the parser. all parser analyses, ultimately, are composed of those
> input elements, and as long as there is a non-ambiguous reference
> from (parts of) analyses to such elements, then all available
> information is preserved. on this view, there could either be
> indirect reference (from an EP to its input tokens, and from each
> input token to its position in the preprocessing pipeline), or one
> could re-map (or augment) analyses as a post-processing step outside
> of the parser, i.e. as part of embedding it into a larger context.
>
> i imagine, if only for practical reasons, various pipelines (contexts)
> will have varying indexing schemes (we have seen quite a few already).
> ideally, the interface into and out of the parser should be unaffected
> by such external variation. well, maybe a topic for the Summit ...
>
> best - oe
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> +++ Universitetet i Oslo (IFI); Boks 1080 Blindern; 0316 Oslo; (+47) 2284 0125
> +++ CSLI Stanford; Ventura Hall; Stanford, CA 94305; (+1 650) 723 0515
> +++ --- oe@csli.stanford.edu; oe@ifi.uio.no; oepen@idi.ntnu.no ---
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
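ps: the indirect-reference option stephan describes above could look something like this (a sketch with hypothetical data shapes, not an actual DELPH-IN API): each EP records the identifiers of the input tokens it was built from, and each token records its position in the preprocessing pipeline, so spans can be recovered outside the parser:

```python
# Sketch only: re-mapping an EP's token references to from/to pointers
# as a post-processing step outside the parser.  Token table and EP
# record shapes are invented for illustration.

tokens = {
    "t1": {"from": 0, "to": 3},   # pointers from the pipeline, e.g.
    "t2": {"from": 4, "to": 9},   # character offsets (could be xpoints)
}

ep = {"pred": "_example_n_rel", "tokens": ["t1", "t2"]}

def ep_span(ep, tokens):
    """Resolve an EP's token ids to an overall (from, to) span."""
    froms = [tokens[t]["from"] for t in ep["tokens"]]
    tos = [tokens[t]["to"] for t in ep["tokens"]]
    return min(froms), max(tos)

# ep_span(ep, tokens) == (0, 9)
```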