[pet] preprocessing and fsc

Fri Jun 25 15:53:26 CEST 2010

Hello,

After several private conversations with Rebecca and trying out her code, 
I've found a solution to my problem, but not yet the reason for it.

My problem started with a parsing failure on a simple sentence: "I don't 
know." My OpenNLP tokenizer tokenizes it into "I do n't know .", which 
should be completely fine for PET, but after I convert it to FSC format 
PET fails to parse it:

(1) `I do 't know.' [0] --- 0 (-0.00|0.04s) <26:141> (5872.3K) [0.0s]

It turned out that PET seems unable to handle escaped characters in XML.

My code wrote the <str> element for n't like this:
<str>n&apos;t</str>

But Rebecca's C++ code outputs it like this:

<str><![CDATA[n't]]></str>

After modifying my code accordingly, PET parses the sentence. (Thanks, 
Rebecca!) I'll stick with the OpenNLP tokenizer for now since it also 
provides PTB-style tokenization. For reference, though, both Rebecca's 
C++ code and the LKB provide an interface for calling external functions 
or programs.
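
For anyone hitting the same problem, the CDATA change mentioned above can be sketched in Java roughly as follows. This is only an illustration, not the code I actually use: the class name FscStr and both method names are hypothetical, and it builds the <str> element as a plain string rather than through xerces-j.

```java
// Hypothetical helper for illustration; not part of PET or any DELPH-IN tool.
final class FscStr {
    private FscStr() {}

    /**
     * Wraps text in a CDATA section. A literal "]]>" inside the text would
     * end the section early, so it is split across two adjacent sections.
     */
    static String cdata(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** Builds a <str> element whose content is CDATA instead of entity-escaped text. */
    static String strElement(String token) {
        return "<str>" + cdata(token) + "</str>";
    }
}
```

Calling FscStr.strElement("n't") gives <str><![CDATA[n't]]></str>, the same form Rebecca's code produces. With a DOM-based writer, Document.createCDATASection plays the role of the hand-written cdata method here.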

This is very strange: I use xerces-j to write the XML and PET uses 
xerces-c to parse it, so escaped characters should be handled 
seamlessly. I consider it a bug in PET unless someone argues otherwise 
or until someone fixes it. ;-)
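
To back up the claim that the two forms should be equivalent, here is a small check that uses the JDK's built-in DOM parser. It only shows what a conforming XML parser returns for the two <str> variants; it says nothing about what PET does internally. The class and method names are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical check for illustration; unrelated to PET's own XML handling.
final class StrContentCheck {
    private StrContentCheck() {}

    /** Parses a one-element XML document and returns the root element's text content. */
    static String textOf(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            // Wrap checked parser exceptions so callers need no throws clause.
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }
}
```

Both StrContentCheck.textOf("<str>n&apos;t</str>") and StrContentCheck.textOf("<str><![CDATA[n't]]></str>") return the string n't, which is why I would expect PET to treat the two inputs identically.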

This bug is not mentioned on the wiki page 
(http://wiki.delph-in.net/moin/PetInputFsc). If it will not be fixed in 
PET, it would be nice if someone could document it there, or give me 
edit rights so that I can, to prevent future confusion.

Xuchen

On 06/23/2010 03:07 PM, Rebecca Dridan wrote:
> Hi Xuchen,
>
> If you are thinking of the REPP code 
> (http://wiki.delph-in.net/moin/ReppTop), it's in Lisp 
> ($LOGONROOT/lingo/lkb/src/glue/repp.lsp), not C++. The older stuff was 
> FSPP, but that was also Lisp code in the LKB source. As far as I know 
> (maybe others can correct me), there's never been native 
> pre-processing code in PET, it's generally been a pre-processing step, 
> or Lisp code compiled into PET with ECL.
>
> I ran into similar issues in the last year or so, wanting to use FSC, 
> and I actually have my own C++ version of REPP. I can't guarantee it 
> is exactly in sync with the current Lisp code, since I haven't been 
> watching the changes in the last year too closely, but it still seems 
> to work with the current rpp files in the grammar and can produce FSC 
> output. I'm happy to share the code if you'd like to use it, modify 
> it, or port it.
>
> The other option I found was to get the REPP output from Lisp, and 
> merge it with my own FSC stuff. That can get a little fuzzy if you 
> have different tokenisation, but I have merging code that generally 
> works.
>
> Let me know if any of my code will be useful to you.
>
> Rebecca
>
>
>
> On 23/06/10 22:39, Xuchen Yao wrote:
>> Hi,
>>
>> I was told that there's a pre-processing module in PET to re-format 
>> the input a little bit (such as dealing with punctuation, numbers, 
>> etc., e.g. $14,000) so a better job can be done for parsing. But if 
>> the input mode is the FSC format (Chart Mapping as in the cm branch), 
>> this pre-processing stage is bypassed (correct me if it's not the case).
>>
>> Currently I'm using FSC input and also want to gain some advantages 
>> from this pre-processing stage. I'm writing in Java, and I think that 
>> if I can use the C++ code in PET as a reference, I can easily re-write 
>> the pre-processing step in my code (hopefully this isn't too much work) 
>> and finally feed some better-formatted input to PET. So could someone 
>> kindly point me to the C++ code in PET where pre-processing happens? 
>> Thanks a lot!
>>
>> Xuchen
>>
>