[developers] Convenient ways to parse a corpus into DM
oe at ifi.uio.no
Mon May 6 22:14:47 CEST 2019
dear zhaofeng (if i may),
i am of course glad you are interested in the DM bi-lexical
representations. regrettably, there is no _very_ easy way of
producing these from running text yet; the process is kind of
involved, at least under the hood.
but the situation is not quite hopeless either. there are two ways of
invoking the full conversion to DM (using the same steps as applied in
the preparation of the SDP release), either (a) going through the
RESTful interface, or (b) parsing into an [incr tsdb()] profile and
then exporting from it.
a lot will depend on how much data you want to process :-). i expect
the most scalable solution will be option (b), as outlined on this page:

on a suitable linux server, the parsing step should work out of the
box. if not, i will be happy to try and assist! for scalability, we
have found it useful to limit each profile to, say, at most 10,000
sentences (and you will, of course, have to do your own sentence
segmentation before parsing with the ERG).
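to make the 10,000-sentence limit concrete, here is a minimal python sketch of the splitting step. it assumes the corpus is already sentence-segmented with one sentence per line; the function names, the output file naming scheme, and the plain-text chunk format are illustrative assumptions and not part of LOGON or [incr tsdb()]. each resulting file would then be parsed into its own profile.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List

# Upper bound per profile, following the "at most 10,000 sentences" advice.
MAX_SENTENCES = 10_000


def chunk_sentences(
    sentences: Iterable[str], size: int = MAX_SENTENCES
) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` non-empty sentences."""
    stripped = (s.strip() for s in sentences)
    non_empty = (s for s in stripped if s)
    while True:
        chunk = list(islice(non_empty, size))
        if not chunk:
            return
        yield chunk


def write_chunks(
    corpus: Path, out_dir: Path, size: int = MAX_SENTENCES
) -> List[Path]:
    """Split a one-sentence-per-line file into numbered chunk files.

    The naming scheme (<stem>.NNNN.txt) is a hypothetical convention.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    paths: List[Path] = []
    with corpus.open(encoding="utf-8") as handle:
        for index, chunk in enumerate(chunk_sentences(handle, size)):
            path = out_dir / f"{corpus.stem}.{index:04d}.txt"
            path.write_text("\n".join(chunk) + "\n", encoding="utf-8")
            paths.append(path)
    return paths
```

calling `write_chunks(Path("corpus.txt"), Path("chunks"))` on a hypothetical `corpus.txt` would leave one file per future profile in `chunks/`; blank lines are dropped so that they do not count towards the limit.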
assuming you can get the parsing to work, in current LOGON versions
(as of a few weeks ago), you can use a variant of the 'redwoods'
export step, giving it '--export dm' instead of the comma-separated
option value from the web page above (or just add 'dm' to the list, of
course).
please let me know in case you run into any obstacles on this path.
best wishes, oe
ps: for background, there was a related thread earlier this year:
On Mon, May 6, 2019 at 10:03 PM Zhaofeng Wu <zfw7 at cs.uw.edu> wrote:
> What would be a convenient way to parse a corpus into the DM
> representation using an ERG-based parser? The ERG API does the job if
> I pass `dm: sdp`, but it is not suitable for parsing a large corpus. I
> read that I can use `$LOGONROOT/www`, but I’m encountering some errors
> running that. Before I ask about those errors, I want to first make
> sure that this is indeed the easiest way to go.