[developers] PET scripting (REPP)

Woodley Packard sweaglesw at sweaglesw.org
Mon May 27 08:17:10 CEST 2013


You could `cat' a file with one sentence per line and pipe it into that same command, so the grammar is loaded only once for the whole batch, e.g.:

$ cat test.txt
"Squeak!" said the mouse.
The dog said, "Woof."
$ cat test.txt | ./logon/bin/cheap -t -repp -preprocess-only=yy logon/lingo/erg/english
[.....]
(1, 0, 1, <0:1>, 1, "“", 0, "null")
(2, 1, 2, <1:7>, 1, "Squeak", 0, "null")
(3, 2, 3, <7:8>, 1, "!", 0, "null")
(4, 3, 4, <8:9>, 1, "”", 0, "null")
(5, 4, 5, <10:14>, 1, "said", 0, "null")
(6, 5, 6, <15:18>, 1, "the", 0, "null")
(7, 6, 7, <19:24>, 1, "mouse", 0, "null")
(8, 7, 8, <24:25>, 1, ".", 0, "null")
(9, 0, 1, <0:3>, 1, "The", 0, "null")
(10, 1, 2, <4:7>, 1, "dog", 0, "null")
(11, 2, 3, <8:12>, 1, "said", 0, "null")
(12, 3, 4, <12:13>, 1, ",", 0, "null")
(13, 4, 5, <14:15>, 1, "“", 0, "null")
(14, 5, 6, <15:19>, 1, "Woof", 0, "null")
(15, 6, 7, <19:20>, 1, ".", 0, "null")
(16, 7, 8, <20:21>, 1, "”", 0, "null")

I guess you can separate the sentences by watching for the point where the "from" vertex identifier (the second field of each tuple) resets to 0.
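
A rough sketch of that splitting step in Python, in case it helps. It assumes each token is printed on its own line in exactly the shape shown above, and that the tokens of a sentence form a simple chain, so that a "from" vertex of 0 only ever appears on the first token of a sentence. The names YY_TOKEN and split_sentences are made up for this illustration and are not part of PET.

```python
import re

# Matches one token line as printed above, e.g.
#   (2, 1, 2, <1:7>, 1, "Squeak", 0, "null")
# Assumed field order: id, from vertex, to vertex, <start:end> character span,
# one more number, the quoted form, then the remaining fields (ignored here).
YY_TOKEN = re.compile(r'^\((\d+), (\d+), (\d+), <(\d+):(\d+)>, \d+, "(.*?)", ')

def split_sentences(lines):
    """Group token lines into sentences, starting a new one at from vertex 0."""
    sentences = []
    for line in lines:
        m = YY_TOKEN.match(line.strip())
        if m is None:
            continue  # skip anything that is not a token line (e.g. startup noise)
        from_vertex = int(m.group(2))
        token = (m.group(6), int(m.group(4)), int(m.group(5)))
        if from_vertex == 0 or not sentences:
            sentences.append([])
        sentences[-1].append(token)
    return sentences
```

Each sentence comes back as a list of (form, start, end) tuples. If the tokenizer ever produced more than one token leaving vertex 0 in the same sentence, this would split too early, so treat it as a starting point only.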

For an entirely different approach, you could try the -Ev options with ACE.  The output contains the same data, but it is printed in a different format:

$ cat test.txt | ~/cdev/ace/ace -g ~/cdev/ace/erg.dat -Ev 2>/dev/null | grep -v '^NOTE'
“<0:1> Squeak<1:7> !<7:8> ”<8:9> said<10:14> the<15:18> mouse<19:24> .<24:25>


The<0:3> dog<4:7> said<8:12> ,<12:13> “<14:15> Woof<15:19> .<19:20> ”<20:21>
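
If you want the ACE output as structured data, something along these lines might do. This is only a sketch, assuming every token is printed as form<start:end>, tokens are separated by whitespace, and no form contains whitespace. ACE_TOKEN, parse_ace_line and parse_ace_output are names invented for this example.

```python
import re

# One token as printed by the command above: the form immediately
# followed by its character span, e.g.  Squeak<1:7>
ACE_TOKEN = re.compile(r'(\S+?)<(\d+):(\d+)>')

def parse_ace_line(line):
    """Turn one output line into a list of (form, start, end) tuples."""
    return [(form, int(start), int(end))
            for form, start, end in ACE_TOKEN.findall(line)]

def parse_ace_output(lines):
    """Parse all lines, dropping blank lines and lines without tokens."""
    parsed = (parse_ace_line(line) for line in lines)
    return [tokens for tokens in parsed if tokens]
```

Here each non-empty line is one sentence, so there is no need to look at vertex numbers at all.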


Good luck,
Woodley

On May 26, 2013, at 11:04 PM, Megan Schneider wrote:

> Does anyone know of a good way to get bulk REPP tokenization for a set of sentences? The one-by-one method appears to be:
> 
> echo <sentence> | ./logon/bin/cheap -t -repp -preprocess-only=yy ./logon/lingo/erg/english
> 
> Is there a good way to do this without needing to reload the rules/types every sentence? Not looking for a functional difference, just an efficiency difference.
> 
> 
> Thanks!
> Megan



