[developers] New ERG with improved tokenization/preprocessing for PET

Francis Bond bond at ieee.org
Sat Jul 18 12:54:50 CEST 2009


2009/7/18 Stephan Oepen <stephan.oepen at gmail.com>:
> howdy,
> i was expecting the topic to (also) come up naturally on wednesday morning.
>  maybe we can see which specific questions we want to address related to
> paraphrasing, and which ones we leave to chart mapping and pre-processing.
>  and maybe i'll even manage to summarize before then what we concluded for
> this forthcoming ERG release from the earlier discussion ...

That would be great.  I was planning to talk a little about (i) what
we want to be able to do with unknown words in paraphrasing and
(ii) one possible approach.  I would be happy to stop there and
leave the full discussion for Wednesday.

> i don't suppose you noticed that generating involving unknown words to
> parsing now works (in the original paraphrase setup, i.e. /not/ EnEn)?

You mean passing something like "frodo_n_unk_rel" and hoping it would
generate?  Yes, we noticed that, thank you.  We rely on it in JaEn.

I also noticed, with sorrow, that the predicate produced when parsing
an unknown word is still rejected as generator input:

 (mt::parse-interactively "The frodo barks.")
TSNLP(11): [18:48:47] translate(): read 1 MRS as generator input.
[18:48:47] translate(): processing MRS # 0 (6 EPs).
[18:48:47] translate(): error `invalid predicates: |"_frodo/NN_u_unknown_rel"|'.
[18:48:49] gc-after-hook(): {L#89 N=2.4m O=0 E=80%} [S=1009m R=822m].

So I hope to discuss it a little in paraphrasing, and then more on
Wednesday morning.

Francis Bond <http://www3.ntu.edu.sg/home/fcbond/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University

