[developers] [sdp-organizers] From EDS/RMS to DM

Stephan Oepen oe at ifi.uio.no
Mon Feb 25 22:16:40 CET 2019


hi again, alexandre,

> Good idea. If you prefer, I can also write my questions in the http://delphinqa.ling.washington.edu forum; are you using it?

truth be told, i am not monitoring the discourse site.  unless we
collectively decide to retire the older mailing lists, i suspect you
still reach a larger share of DELPH-IN contributors on this list.

> Unfortunately, it looks like the dtm script is not working with profiles produced with ACE (http://moin.delph-in.net/LogonAnswer):

the converter is overly specific in looking for a derivation tree in
the export file; it actually expects to recognize one of the ERG start
symbols (‘root_strict’ and the like).  ACE by default deviates from
the [incr tsdb()] conventions for recording derivation trees and does
not write out the start symbol at the root of the tree.  i believe as
of a year or two ago there is an ACE command-line option to request
output of the start symbol that licensed each derivation; please try
with ‘--rooted-derivations’.
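a rough sketch of both invocations (the grammar image and file names
below are placeholders, and the art command line is from memory, so
please double-check against its documentation):

  # stand-alone ACE, reading sentences from standard input
  ace -g erg-2018.dat --rooted-derivations < sentences.txt > parses.out

  # or, when batch parsing into an [incr tsdb()] profile via art
  art -a 'ace -g erg-2018.dat --rooted-derivations' my-profile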

> I have also tried to convert data exported with the redwoods script from a profile created with ACE+ART, using a different ‘home’ directory for TSDB:

[...]

> But all the files exported by the command above are empty. For example:

i must confess i know very little about ART; i believe it provides an
alternative to the parallel batch parsing support in [incr tsdb()].  if
exporting from those profiles fails, i suspect the ART output deviates
from some [incr tsdb()] convention(s) that the export code
presupposes.  to debug this problem, you could either compare a tiny
[incr tsdb()]-generated profile to its ART equivalent, or look at (and
instrument) the function tsdb::export-tree() (in ‘redwoods.lisp’).
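for the first route, an entirely generic comparison of the low-level
relation files from two tiny profiles, one produced by [incr tsdb()]
and one by ART, might already be revealing (directory names here are
hypothetical; use zdiff instead of diff if the files are compressed):

  # the standard relations most relevant to export
  for relation in item parse result; do
    echo "=== ${relation} ==="
    diff itsdb-profile/${relation} art-profile/${relation} | head
  done

any systematic difference in field layout, or fields missing outright,
would be a prime suspect.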

in general, i suspect you are venturing into unexplored territory
here: in some sense, there are two versions of the DELPH-IN toolchain,
(a) the ‘classic’ ensemble of the LKB, [incr tsdb()], and PET; and (b)
the modern tools from the pacific northwest (ACE and friends, and
pyDelphin).  i have to admit, i am mostly stuck in the past still,
hence have not been experimenting much with ACE myself.  that means
that the non-trivial degrees of interoperability that we already enjoy
are primarily owed to woodley trying to identify and reverse-engineer
explicit and implicit assumptions in the classic tools, including the
[incr tsdb()] data format.

if your current primary goal is to obtain ERG semantics in the DM
format, i would almost recommend you start a tad conservatively, for
example using the 2018 ERG with PET, or running ACE as an [incr
tsdb()] client (which means the profile contents are written by my
code, which sanitizes and fills in defaults the way i thought they
should be at the time).
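as a smoke test for the conservative route, something along these
lines should work (i am writing the option names from memory, so
please double-check against the usage summary of the ‘parse’ script):

  # parse the small ‘mrs’ test suite with PET and the 2014 ERG,
  # using four parallel clients
  $LOGONROOT/parse --erg --count 4 mrs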

> The parameters, and how they change the behaviour of the parse and redwoods scripts, are not clear to me. It seems that I could run the code directly from the Lisp REPL inside Emacs, right? These scripts call many different tools, right? PET, ACE, [incr tsdb()], etc.

yes, correct.  many of the LOGON scripts of convenience (e.g. ‘parse’
and ‘redwoods’) accept a ‘--cat’ command-line option, which has to be
given first; in this mode, they will merely write out the sequence of
s-expressions that they ordinarily evaluate in the LOGON universe.
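for example, with the (hypothetical) invocation from above:

  # print the lisp forms that would be evaluated, instead of running them
  $LOGONROOT/parse --cat --erg mrs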

> Moreover, I would like to use the trunk version of the ERG (2018), not the 1214 release available in the LOGON tree. The terg argument didn’t work for me.

please see the LogonExtras page on the wiki for instructions on how to
maintain two versions of the ERG side-by-side, where 2018 currently
would correspond to what LOGON calls ‘terg’.  the primary ERG version
in the LOGON tree, for now, remains the 2014 release (1214) because i
am still doing some active work against that older release.
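the gist of those instructions, as i remember them (please treat the
exact repository URL and target directory as assumptions and confirm
them against the wiki page):

  # check out the trunk ERG next to the 2014 release in the LOGON tree
  cd $LOGONROOT/lingo
  svn checkout http://svn.delph-in.net/erg/trunk terg

  # afterwards, select it on the command line, e.g.
  $LOGONROOT/parse --terg --count 4 mrs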

> Besides all the questions above and any feedback that you may give… I need to process 5,600 sentences. If I submit one file with 5,600 lines, the parse script stops without any error message, but in the profile I only have 8 lines in the result file. I tried to split the 5,600 lines into smaller files. If I parse a file with 2,000 lines/sentences, during the first 1-5 sentences the script seems to be parsing, but after some time it starts to produce output very fast; it looks like it is skipping the sentences, and it finishes with only 4 lines in the result file in the profile.

from past experience, we have found it convenient to break down
profiles into chunks of maybe 1000 sentences (though, when parsing the
english wikipedia, each profile has tens of thousands of inputs, which
[incr tsdb()] and PET in the past have been able to handle in one-best
mode).  to judge your report about unhappy results from the ‘parse’
script, i would need to know how exactly you called the script, and
what kind of output was printed while it was running.  having very few
entries in the ‘result’ relation indicates that not many inputs
parsed successfully, which could have any number of reasons; the
‘parse’ relation, on the other hand, should contain one record per
parsing attempt, i.e. at least one per input item?
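for the chunking itself, something as pedestrian as the following will
do (file names are placeholders):

  # break a large list of sentences into chunks of 1000 lines each
  split -l 1000 sentences.txt chunk.

  # ... and then parse each chunk into a profile of its own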

> Since I am a Mac user, I am using docker (http://moin.delph-in.net/LkbMacintosh) to run all the code in a Lisp environment; could this be related to it? Maybe something related to the PVM code?

in principle, this should not be an issue, but PVM is of course
dependent on a functional network layer (and probably has its own set
of assumptions that may or may not hold inside your container).  in
general, there should be an awful lot of logging output, both
printed to the terminal while things are running and recorded in
a file in the $LOGONLOG directory.  maybe start by scanning these
files carefully for anything that looks surprising?
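for example (assuming the log files sit directly under $LOGONLOG;
adjust the glob if things are organized differently):

  # look at the most recent log files first
  ls -lt $LOGONLOG | head

  # and scan for anything that smells like trouble
  grep -i -E 'error|warn|fail|abort' $LOGONLOG/* | less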

alternatively, you could just try on a dedicated linux server, to see
whether that makes any difference?

> Adding this export option to the redwoods script would be nice. Please!

if you ‘svn up’ your LOGON tree now, that should give you new
precompiled LOGON binaries (and updates to the ‘redwoods’ script, plus
a few more changes).  i have added ‘dm’ as a valid format to the
‘--export’ option, so the interface is now exposed all the way to the
command line.  but it will, sadly (and of course) only work once you
can actually export from your profiles and successfully invoke the DTM
converter.  the ‘dm’ export option, essentially, just calls these
steps in sequence.
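once exporting works at all, the end-to-end recipe should look roughly
like the following (the profile name and any additional selection
options are placeholders; please check the usage message of the script
for the full set of options):

  # refresh the LOGON tree, then export DM for an existing profile
  cd $LOGONROOT && svn update
  $LOGONROOT/redwoods --erg --export dm my-profile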

> But once we have the gz files and the sdp files, how is the create-index from the WeSearch code able to link all representations of each sentence? I suppose the final RDF links all representations of the same sentence using the sentence id, right?

yes, that has to be the case.  i do not know the WeSearch code
terribly well from the inside, but we have long paid premium attention
to maintaining a ‘global’ space of unique item identifiers across all
ERG treebanks.

> Nice, the rest.py script works fine. I had not read the page http://moin.delph-in.net/ErgApi before; it is definitely useful for some tests, but I need to process a large amount of non-public data. It would be really nice to be able to instantiate my own ERG endpoint. Any documentation about it?

of course there is documentation :-).  please see the LogonOnline
page.  in a nutshell, ‘$LOGONROOT/www --erg’ should give you a local
server that you can query RESTfully.
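a minimal sketch, assuming the local server follows the same REST
conventions as the public end-point documented on the ErgApi page (the
port here is an assumption; the www script will report the actual
address when it starts up):

  # start a local ERG server ...
  $LOGONROOT/www --erg &

  # ... and query it, once it is up
  curl 'http://localhost:8080/rest/0.9/parse?input=Abrams+barked.&mrs=json'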

best wishes, oe


