[lkb] Script LKB to tokenize a document

Stephan Oepen oe at ifi.uio.no
Wed Mar 6 23:12:15 CET 2013


hi noa,

tokenization is not the primary purpose of the LKB software, even
though the LKB includes several options for tokenizing text.  i am
guessing you might be interested in the REPP machinery (Regular
Expression Pre-Processor), discussed by Dridan & Oepen (2012)
in ACL?  REPP itself is just a framework for tokenization, and the
LKB does include an implementation of REPP (although i would
suggest you rather use the REPP implementation bundled with
the PET parser).  to apply REPP for tokenization, one will need a
set of rules, in the format described at

  http://moin.delph-in.net/ReppTop

for english at least, there is a set of REPP rules aiming to comply
with the tokenization standards of the venerable Penn Treebank.
i imagine these rules might work reasonably well for other languages
with similar orthographic conventions.  also,
one of the goals in REPP is of course easy customization, so a
user should not find it too difficult to fine-tune tokenization rules.
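
just to give a flavour of the rule format (an illustrative sketch from
memory, so please treat the ReppTop page above as the authoritative
reference): a REPP module is a plain-text file in which lines starting
with ‘!’ are regular-expression rewrite rules (pattern and replacement
separated by a tab character), and a line starting with ‘:’ gives the
pattern at which the string is finally split into tokens, e.g.

  ; separate a comma from a preceding non-space character
  !([^ ])(,)		\1 \2
  ; then break the string into tokens at whitespace
  :[ \t]+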

if this is in the general direction you wanted to go, i recommend
you obtain PET and the ERG (which includes tokenization rules
for english) as part of the LOGON tree (which should run ‘out of
the box’ on any reasonably recent Linux machine), see:

  http://moin.delph-in.net/ErgProcessing

however, rather than full parsing, you would probably opt to just
run the tokenization phase of PET.  there is no script support in
the LOGON tree for that, but it should work to do something like
the following:

  cd $LOGONROOT
  echo "Kim didn't arrive in Berlin." \
  | cheap -t -repp -preprocess-only=yy ./lingo/erg/english

this way, the PET parser (‘cheap’) will read input from stdin, one
line (i.e. one sentence-like unit) at a time, tokenize it, and print
the resulting token sequence to stdout.  the syntax used for the
output is what we call the YY token format, e.g.

(1, 0, 1, <0:3>, 1, "Kim", 0, "null")
(2, 1, 2, <4:7>, 1, "did", 0, "null")
(3, 2, 3, <7:10>, 1, "n't", 0, "null")
(4, 3, 4, <11:17>, 1, "arrive", 0, "null")
(5, 4, 5, <18:20>, 1, "in", 0, "null")
(6, 5, 6, <21:27>, 1, "Berlin", 0, "null")
(7, 6, 7, <27:28>, 1, ".", 0, "null")
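
and seeing as your original question was about batch processing: cheap
will keep reading from stdin until end of file, one sentence-like unit
per line, so (assuming your documents are already segmented into one
sentence per line; the file names below are of course placeholders) a
plain shell loop should do, e.g.

  cd $LOGONROOT
  for f in /path/to/documents/*.txt; do
    cheap -t -repp -preprocess-only=yy ./lingo/erg/english \
      < "$f" > "${f%.txt}.yy"
  done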

for further background, please see:

  http://moin.delph-in.net/PetInput
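
also, if all you need are the surface token strings, note that in the
YY format the form is the second ‘"’-delimited field, so a quick (and
admittedly fragile, e.g. in the face of embedded quotes) way of
flattening the output would be:

  cd $LOGONROOT
  echo "Kim didn't arrive in Berlin." \
  | cheap -t -repp -preprocess-only=yy ./lingo/erg/english \
  | awk -F'"' '{print $2}'

which should print one token per line (Kim, did, n't, and so on).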

i must admit, the above seems like a relatively heavy-handed
solution to me, if all you want to do is tokenization.  i expect
there are much more lightweight, PTB-compliant tokenizers
readily available, even though Dridan & Oepen (2012) might
suggest that the combination of REPP with the ERG rules is
more PTB-compliant than some of the more standard tools.

nb: all of the above presumes sentence-segmented text, i.e.
sentence boundary detection prior to tokenization.  there is
a survey article by Read et al. (2012) in COLING reviewing
common tools for sentence segmentation, and in fact some
of them (including the best-performing one, simply called ‘tokenizer’)
can also perform tokenization.  maybe try that first, rather
than digging into the full DELPH-IN toolchain?
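
should you start from raw, unsegmented text but still want to go the
cheap route, you would first need to break it into one sentence per
line.  purely for illustration (this naive GNU sed rule splits after
sentence-final punctuation and will of course mishandle abbreviations,
among other things), something like:

  cd $LOGONROOT
  sed 's/\([.!?]\)  */\1\n/g' /path/to/document.txt \
  | cheap -t -repp -preprocess-only=yy ./lingo/erg/english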

best wishes, oe


On Tue, Mar 5, 2013 at 10:00 AM, Noa Patricia Cruz Diaz
<noa.cruz at dti.uhu.es> wrote:
> Dear All,
>
> I am new to using LKB, so sorry for my inexperience. I would like to
> tokenize a set of documents, but I don't know how to do it in batch mode.
>
> Would anyone have a script for doing it? Or maybe someone knows the steps to
> follow. I would be grateful!
>
> Thank you.


