[erg] New release of ERG
Dan Flickinger
danf at csli.stanford.edu
Tue Aug 2 19:54:04 CEST 2005
Hi colleagues -
It's high time we put this mailing list to some good use! I'm pleased to
announce the release of a new version of the ERG which incorporates
several changes, a few of them relatively dramatic. For those of you
who have been working with the grammar, you may need to make some
adjustments for changes in MRS predicate names, or because of the changes
in tokenization that result from treating punctuation as affixation
(rather than as separate tokens). I'm attaching a copy of the README
file that comes with the grammar in CVS, and will look forward to
hearing of your experiences and suggestions. Note that the grammar
will be happiest with an up-to-date version of the LKB (using its
improved treatment of morphology), but should still work with somewhat
earlier versions of the LKB.
Cheers,
Dan
------------------------------------------------------------------------------
Release notes for version "LinGO (Jul-05)"
This release incorporates several significant changes relative to the
previous release, and at long last also includes a first step toward
documenting an external semantic interface for the grammar. The changes
will soon be described in a little more detail on the ERG Wiki, but in
summary:
1. Punctuation as affixation
Previous versions of the grammar handled punctuation by adopting the
standard but linguistically dubious strategy of using a preprocessor to
make all punctuation marks distinct tokens, adding spaces around each
one. This version implements an analysis which
leaves the input string unchanged with respect to punctuation (except
for apostrophes), and treats the punctuation marks as spell-changing
affixes. This change creates backward incompatibilities with earlier
treebanks because the tokenization for each sentence is now different.
A few infelicities remain from making this change, including
- minor inconsistencies in the readers of affixation rules for the
LKB and PET (and even for previous and current versions of the LKB)
- imperfect interaction of irregular inflected forms and punctuation
- imperfect interaction of multi-words and punctuation
There are work-arounds for some of these, awaiting better resolution.
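
The difference in tokenization can be illustrated with a minimal Python
sketch. This is not the ERG preprocessor or the LKB/PET morphology code:
the function names and the punctuation inventory are assumptions made for
illustration, and the special handling of apostrophes mentioned above is
not modelled.

```python
import re

# Hypothetical punctuation inventory, for illustration only; the grammar's
# actual set of punctuation affixes is not listed in these notes.
PUNCT = r'[.,;:!?()"]'

def tokenize_old(text):
    """Earlier strategy: pad every punctuation mark with spaces, then split,
    so that each mark becomes a token of its own."""
    return re.sub(f"({PUNCT})", r" \1 ", text).split()

def tokenize_new(text):
    """Current strategy: split on whitespace only; punctuation stays attached
    to its word and is left for the grammar to analyze as an affix."""
    return text.split()

sentence = "That dog (you should see its owner!) barked."
old_tokens = tokenize_old(sentence)
new_tokens = tokenize_new(sentence)
print(old_tokens)
print(new_tokens)
```

The two token lists differ in length for the same sentence, which is why
treebanks built on the earlier tokenization do not line up with the new
one.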
2. Semantics
a. Semantically empty prepositions no longer introduce an EP (they
used to add an EP whose predicate name ended in "_sel_rel", for
lexically 'selected'). So the generator trigger rules have been
augmented to automatically introduce the necessary lexical entries
for generation, currently based on predicate-naming conventions
for the lexical entries that select empty prepositions.
b. Messages now introduce an additional attribute, ARG0, whose value
is the event of the highest-scoping verbal EP within the scope of
the message. The main motivation is to make it simpler for
applications to identify the relevant event properties of a
clause's semantics without looking 'inside' the clause's MRS.
c. All lexical predicates now have some value in the 'sense' field
of the predicate name (Background: by convention in the ERG, each
lexical predicate name has the following form: _ORTH_POS_SENSE_rel
where ORTH is the lexeme's orthography, POS is a coarse-grained
sense distinction drawing from the vocabulary [v n a p x q c], and
SENSE is an arbitrary sequence of characters (excluding |_|), and
where each of the fields is separated by an underscore. Earlier,
the sense field could have been left empty.) The default value for
the sense field is now '1'.
d. Relational nouns now specify in their sense field the orthography
of the preposition marking their oblique complement (usually 'of').
e. Tag questions previously discarded the semantics of the tag phrase,
contrary to the monotonicity assumption in the ERG. This is now
corrected, with the result that the semantics of sentences with
tag questions is now rather more baroque. The main benefit of the
reanalysis is that lexical rules now properly always preserve the
semantics of their input lexemes.
f. Sentential subjects were previously analyzed via a nominalization
rule. This simplified the syntactic analysis of "That Abrams
arrived annoyed Browne" since the "annoy" lexeme could always
unify its ARG1 value with the semantic index of its subject. But
the resulting asymmetry for the 'extraposed' and non-extraposed
variants of lexemes like 'annoy' was annoying. This version of
the grammar now provides the same MRS for both variants ('It
annoyed Browne that Abrams arrived' and the above example), via
a syntactic variant of an 'it-extraposition' lexical rule, with
thanks to Ann Copestake for the suggested implementation. One
consequence is that the earlier treatment of examples like "The
problem was that Abrams arrived" no longer works, since the
identity copula was being used, and requires its complement to
supply a referential index. So there is also yet another entry
for the verb 'be', which supplies an EP similar to the identity
'be'.
g. Verbal modifiers of nouns were being given an inconsistent
semantics, with postnominal modifiers as in 'people singing arias'
supplying a message for the modifier phrase, but with prenominal
modifiers as in 'the singing people' not contributing a message.
In this version of the grammar, verbal projections now always
supply a message, making the world a little more consistent, but
leaving a sharper contrast now between "the singing children"
and "the interesting children" where 'interesting' is analyzed
as an adjective and hence does not supply a message.
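
For applications that need to adjust to the predicate-naming changes in
(c) and (d) above, the following Python sketch splits a lexical predicate
name into its fields and fills in the new default sense. It rests on two
assumptions not stated in these notes: that ORTH never contains an
underscore, and that an empty sense field was written by omitting the
field altogether (as in _ORTH_POS_rel). The helper names and the example
predicates are hypothetical.

```python
from typing import NamedTuple

class PredName(NamedTuple):
    orth: str
    pos: str
    sense: str  # empty string if the name carries no sense field

# The POS vocabulary given in item (c).
POS_VOCAB = set("vnapxqc")

def parse_pred(name):
    """Split a lexical predicate name of the form _ORTH_POS_SENSE_rel.

    Assumptions: ORTH contains no underscore, and an empty sense field
    is written by leaving the field out (_ORTH_POS_rel).
    """
    if not (name.startswith("_") and name.endswith("_rel")):
        raise ValueError(f"not a lexical predicate name: {name!r}")
    fields = name[1:-len("_rel")].split("_")
    if len(fields) == 2:
        orth, pos, sense = fields[0], fields[1], ""
    elif len(fields) == 3:
        orth, pos, sense = fields
    else:
        raise ValueError(f"unexpected number of fields in {name!r}")
    if pos not in POS_VOCAB:
        raise ValueError(f"unknown POS {pos!r} in {name!r}")
    return PredName(orth, pos, sense)

def with_default_sense(name):
    """Return the name with an empty sense field replaced by '1'."""
    p = parse_pred(name)
    return f"_{p.orth}_{p.pos}_{p.sense or '1'}_rel"

# Hypothetical example names, chosen only to exercise the helpers.
print(with_default_sense("_annoy_v_rel"))
print(parse_pred("_picture_n_of_rel"))
```

A relational noun as described in (d) would show the preposition in the
sense position, as in the second (made-up) example.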
3. Lexicon
New lexical entries have been added, drawn from the Norwegian tourism
domain of the LOGON development corpus, bringing the current number
of lexemes to 22,750 for this release, of which about 2700 are proper
names.
4. SEM-I
A first draft of the semantic interface for the grammar is now
presented in the file erg-full.smi, including the predicate names and
semantic arguments of all predicates introduced either by lexical
entries or by the grammar (either via lexical/syntactic rules or via
abstractions over more specific predicates). Documentation of this
file is under active development.
5. Naming conventions
The feature name DIVISIBLE on referential indices has been shortened
to DIV for better readability of MRSs.
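
Code that reads index properties by name will need the same rename. A
minimal Python sketch follows, assuming (hypothetically) that an
application stores the properties of an index as a plain dictionary; the
other property names and all values shown are placeholders, not taken
from the grammar.

```python
def rename_divisible(properties):
    """Return a copy of a property dict with DIVISIBLE renamed to DIV."""
    return {("DIV" if key == "DIVISIBLE" else key): value
            for key, value in properties.items()}

# Placeholder properties for illustration only.
old_props = {"PERS": "3", "NUM": "sg", "DIVISIBLE": "-"}
new_props = rename_divisible(old_props)
print(new_props)
```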
6. LKB warnings on grammar loading
The LKB's new and improved treatment of morphology offers several
advantages, and the current version of the grammar benefits from
these, but loading the grammar still produces some warning messages.
Users can ignore these messages for now, while the developers resolve
the underlying causes. The first warning concerns the 'punct_bang_rule',
and the others warn of lexical rules that can feed themselves.
------------------------------------------------------------------------------
Release notes for version "LinGO (30-Apr-05)"
This is a minor update to the Apr-05 version, including some lexical
additions, adjustments to the semantic predicate hierarchy, and tuning
of syntactic analyses, all designed to improve end-to-end translation
for LOGON. The only substantive difference is in the analysis of
possessive constructions, where the grammar now produces nearly
identical MRSs for the two noun phrases "our book" and "a book of ours",
using a new lexical entry for "ours" distinct from the ordinary "ours"
of "ours are not ready". One consequence of this reanalysis, which
unifies the treatment of the two possessive constructions, is that
the two arguments in the old 'poss_rel' EP have been reversed: what was
the ARG1 is now ARG2, and vice versa.
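
Applications holding MRSs produced by the earlier version could adapt
with something like the following Python sketch. The dictionary
representation of an EP, the "pred" key, and the variable names are
assumptions made for illustration; only the predicate name 'poss_rel' and
the exchange of ARG1 and ARG2 come from the note above.

```python
def swap_poss_args(ep):
    """Return a copy of an EP; for poss_rel, exchange ARG1 and ARG2."""
    if ep.get("pred") != "poss_rel":
        return dict(ep)
    swapped = dict(ep)
    swapped["ARG1"], swapped["ARG2"] = ep["ARG2"], ep["ARG1"]
    return swapped

# Hypothetical EP in the old argument order.
old_ep = {"pred": "poss_rel", "ARG0": "e2", "ARG1": "x5", "ARG2": "x8"}
new_ep = swap_poss_args(old_ep)
print(new_ep)
```

EPs with any other predicate are copied unchanged, so the function can be
mapped over a whole list of EPs.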
------------------------------------------------------------------------------
Release notes for version "LinGO (Apr-05)"
Overview of changes:
- Lexicon size increased to 21000 entries
- MRS quality improved
- Unicode now used for lexicon: foreign proper names, archaic spellings
- Coverage added for fragments, locative inversion, 'free' parentheticals
- Changed analyses to allow PP-modif of PPs, APs; adverb-modif of APs
- Support for new domains: 'shanghai', 'gcide'
--Lexicon--
BNC - Based on months of hard labor by former Stanford students Hansook Lee
and Mike Orme (with help from Ara Kim), the lexicon now contains all verb
subcat entries for the 2000 most frequent verb stems in the British
National Corpus. This should enable some interesting experimentation in
automated lexical acquisition, since there are fewer lexical types that
need to be hypothesized for non-verbs.
GCIDE - The lexicon now also contains entries for all words observed in the
first 10,000 definition 'sentences' in the GNU Contemporary International
Dictionary of English (GCIDE), to enable more precise evaluation of
syntactic coverage of these definitions.
Shanghai - Based on some 1500 entries constructed by Yi Zhang at CoLi in
Saarbruecken, the lexicon now also contains entries for most of the words
found in a Web-derived corpus on tourism in Shanghai, analogous to the
Rondane corpus built by Becky Neil for the LOGON project in Norway.
--MRS quality--
Based on a substantial implementation effort by Stefan Thater and
colleagues at CoLi, Saarbruecken, to check for well-formedness of MRSs
produced by the grammar for the Redwoods and Rondane corpora, many errors
were identified, enabling improvements in MRS construction in the ERG.
Further improvements were enabled by the systematic use of existing
capabilities in the LKB for diagnosing MRS errors in ERG analyses. While
the current release still produces some flawed MRSs for these data sets,
they are largely confined to a small inventory of known and somewhat
problematic minor phenomena.
--Unicode--
Drawing on the combined expertise of Stephan Oepen and Francis Bond, the
ERG is now fully Unicode-compliant, including the PSQL database. This
enables proper representation in the lexicon for orthography of non-English
proper names such as "østerbø", and archaic English spellings such as
"coöperation". The necessary infrastructure for Unicode is admirably and
demonstrably in place in the LKB, PET, [incr tsdb()], and PostgreSQL.
--Coverage--
Fragments - Further work on the treatment of fragments has been motivated
largely by the effort to parse the definition sentences in GCIDE, and to
give them a consistent semantic representation. New fragment types now
licensed include VPs and PPs with NP gaps, as in "To devour." or
"Relying on.".
Locative inversion - The grammar now analyzes some locative inversion
phenomena, currently restricted to sentences headed by the finite copula
'be' as in "Near the park is a large dog" but not (yet) "Near the park
stood a large tree". These appear with some frequency in the Rondane data,
and have also been waiting patiently for twenty years in the CSLI test
suite.
'Free' parentheticals - Sentences containing some classes of parenthetical
material (which would not survive in situ without the parentheses) will now
be analyzed, though further work will be needed in designing the target
semantics. Example now covered: "That dog (you should see its owner!)
barked."
--Changed analyses--
Modification - Based on more systematic analysis of phenomena found in the
Rondane corpus, and corroborated in the Shanghai corpus, the ERG now
permits more interesting modification structures. Prepositional phrases,
formerly restricted to modifying only VPs and nominal phrases, can now also
modify adjective phrases and other PPs. Similarly, adverbs can now also
modify adjective phrases, as in "the wildly happy dog barked", freeing the
grammar from its former requirement that duplicate degree-specifier lexical
entries be added for many adverbs.
--New domains--
The GCIDE corpus has been taken from the GCIDE web site, and carefully
prepared by Eric Nichols at NTT in collaboration with Francis Bond,
including identification of sentence breaks, normalization, and formatting,
all of which are now automated via Perl scripts converting the original
GCIDE data into, among other things, an 'item' file format for use with the
fine system.
The Shanghai corpus is being collected by Yi Zhang in Saarbruecken as part
of his thesis work, and consists of text on tourism in Shanghai, written in
English and mostly but not entirely by native English speakers. The corpus
may still be revised, so a profile of this data is not (yet) being
distributed with the ERG.
------------------------------------------------------------------------------