ERG to PTB-style tokenization alignment and unaligned strings

Matic Horvat matic.horvat at cl.cam.ac.uk
Thu Apr 16 12:54:41 CEST 2015


I was wondering whether there are existing approaches to aligning elementary
predications to a PTB-style tokenized string.

The output of the ACE parser aligns the MRS to the original, untokenized
string via character spans. However, I would like to work with tokens
further down the pipeline.
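
To make the first question concrete, here is a minimal sketch of the kind of
mapping I have in mind: given character offsets for each token, collect the
tokens that overlap an EP's character span. The whitespace tokenizer is only a
stand-in (a real PTB-style tokenizer rewrites some tokens, e.g. brackets and
quotes, so the offsets would have to come from the tokenizer itself), and the
predicate names and spans are illustrative, not actual ACE output.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def whitespace_token_spans(text: str) -> List[Span]:
    """Stand-in tokenizer: split on whitespace and record character offsets."""
    spans = []
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)  # search from the end of the previous token
        end = start + len(tok)
        spans.append((start, end))
        pos = end
    return spans


def tokens_for_span(token_spans: List[Span], span: Span) -> List[int]:
    """Return indices of all tokens whose offsets overlap the given span."""
    start, end = span
    return [i for i, (ts, te) in enumerate(token_spans)
            if ts < end and start < te]


text = "definition of algorithm"
token_spans = whitespace_token_spans(text)

# Illustrative EPs with character spans (assumed, not real parser output).
ep_spans: Dict[str, Span] = {
    "_definition_n_of": (0, 10),
    "_algorithm_n_1": (14, 23),
}

ep_tokens = {pred: tokens_for_span(token_spans, span)
             for pred, span in ep_spans.items()}
print(ep_tokens)
```

Using overlap rather than exact boundary equality means that an EP whose span
covers several tokens, or only part of one token, still receives a token
alignment.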

Also, has anyone worked on an approach to aligning 'semantically empty' words,
e.g. 'of', 'for', etc., which are not covered by any character span in the
MRS representation, with their 'parents', i.e. the elementary predications
they logically belong to?

For example, in 'definition of algorithm', the substring 'of' is not aligned
to anything, but it belongs to 'definition'. The same holds for auxiliary
verbs etc.
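
As a naive baseline for this second question, one could attach every unaligned
token to the nearest aligned token: to the left by default (which gives
'of' -> 'definition'), and to the right for auxiliary-like words. This is only
a heuristic sketch. The word list, the function names, and the predicate names
are all hypothetical, and the direction rule ignores the actual structure of
the MRS.

```python
from typing import Dict, List, Optional

# Hypothetical word list: tokens assumed to belong to the next aligned token.
RIGHT_ATTACHING = {"is", "are", "was", "were", "be", "been",
                   "has", "have", "had", "will", "to"}


def nearest_aligned(i: int, aligned: Dict[int, str],
                    step: int, n: int) -> Optional[int]:
    """Walk from token i in direction `step` until an aligned token is found."""
    j = i + step
    while 0 <= j < n:
        if j in aligned:
            return j
        j += step
    return None


def attach_unaligned(tokens: List[str],
                     aligned: Dict[int, str]) -> Dict[int, str]:
    """Copy `aligned` and give each unaligned token a neighbour's predicate."""
    n = len(tokens)
    result = dict(aligned)
    for i, tok in enumerate(tokens):
        if i in aligned:
            continue
        first = 1 if tok.lower() in RIGHT_ATTACHING else -1
        j = nearest_aligned(i, aligned, first, n)
        if j is None:  # nothing in the preferred direction, try the other one
            j = nearest_aligned(i, aligned, -first, n)
        if j is not None:
            result[i] = aligned[j]
    return result


# Token index -> predicate; names are illustrative, not real parser output.
noun_phrase = attach_unaligned(
    ["definition", "of", "algorithm"],
    {0: "_definition_n_of", 2: "_algorithm_n_1"},
)
auxiliary = attach_unaligned(
    ["dog", "has", "barked"],
    {0: "_dog_n_1", 2: "_bark_v_1"},
)
print(noun_phrase[1], auxiliary[1])
```

A surface heuristic like this will obviously misattach in many cases, which is
why I am asking whether a more principled approach already exists.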

