[erg] New release of ERG

Tue Aug 2 19:54:04 CEST 2005

Hi colleagues -

It's high time we put this mailing list to some good use!  I'm pleased to
announce the release of a new version of the ERG which incorporates
several changes, a few of them relatively dramatic.  Those of you who
have been working with the grammar may need to make some adjustments
for changes in MRS predicate names, or because of the changes
in tokenization that result from treating punctuation as affixation
(rather than as separate tokens).  I'm attaching a copy of the README
file that comes with the grammar in CVS, and will look forward to
hearing of your experiences and suggestions.  Note that the grammar
will be happiest with an up-to-date version of the LKB (using its
improved treatment of morphology), but should still work with somewhat
earlier versions of the LKB.

Cheers,

 Dan

------------------------------------------------------------------------------
Release notes for version "LinGO (Jul-05)"

This release incorporates several significant changes to the previous
release, and at long last also includes a first step toward documenting an
external semantic interface for the grammar.  The changes will soon be
described in a little more detail on the ERG Wiki, but in summary:

1. Punctuation as affixation

   Previous versions of the grammar implemented a treatment of punctuation
   adopting a standard but linguistically dubious strategy of using a
   preprocessor to make all punctuation marks distinct tokens, adding
   spaces around each one.  This version implements an analysis which
   leaves the input string unchanged with respect to punctuation (except
   for apostrophes), and treats the punctuation marks as spell-changing
   affixes.  This change creates backward incompatibilities with earlier
   treebanks because the tokenization for each sentence is now different.
   A few infelicities remain from making this change, including
     - minor inconsistencies in the readers of affixation rules for the
       LKB and PET (and even for previous and current versions of the LKB)
     - imperfect interaction of irregular inflected forms and punctuation
     - imperfect interaction of multi-words and punctuation
   There are work-arounds for some of these, awaiting better resolution.
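
   The difference in tokenization can be illustrated with a minimal
   sketch in Python.  It is not the grammar's or the LKB's actual
   preprocessor: the function names and the punctuation inventory are
   assumptions made for illustration, and the special handling of
   apostrophes mentioned above is not modelled.

```python
import re

# Hypothetical punctuation inventory; the set the grammar actually handles
# is not listed in these notes.
PUNCT = r'[.,;:!?()"]'

def tokenize_old(text):
    """Older strategy: pad every punctuation mark with spaces, then split."""
    padded = re.sub(f"({PUNCT})", r" \1 ", text)
    return padded.split()

def tokenize_new(text):
    """Newer strategy: split on whitespace only, so punctuation stays
    attached to its word and is left for affixation rules to analyze."""
    return text.split()

sentence = "That dog (you should see its owner!) barked."
print(tokenize_old(sentence))
print(tokenize_new(sentence))
```

   Because the two functions return different numbers of tokens for
   the same sentence, token positions recorded in earlier treebanks no
   longer line up, which is the source of the incompatibility noted
   above.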

2. Semantics

   a. Semantically empty prepositions no longer introduce an EP (they
      used to add an EP whose predicate name ended in "_sel_rel", for
      lexically 'selected').  So the generator trigger rules have been
      augmented to automatically introduce the necessary lexical entries
      for generation, currently based on predicate-naming conventions
      for the lexical entries that select empty prepositions.
   b. Messages now introduce an additional attribute, ARG0, whose value
      is the event of the highest-scoping verbal EP within the scope of
      the message.  The main motivation is to make it simpler for
      applications to identify the relevant event properties of a
      clause's semantics without looking 'inside' the clause's MRS.
   c. All lexical predicates now have some value in the 'sense' field
      of the predicate name (Background: by convention in the ERG, each
      lexical predicate name has the following form: _ORTH_POS_SENSE_rel
      where ORTH is the lexeme's orthography, POS is a coarse-grained
      sense distinction drawing from the vocabulary [v n a p x q c], and
      SENSE is an arbitrary sequence of characters (excluding |_|), and
      where each of the fields is separated by an underscore.  Earlier,
      the sense field could have been left empty.)  The default value for
      the sense field is now '1'.
   d. Relational nouns now specify in their sense field the orthography 
      of the preposition marking their oblique complement (usually 'of').
   e. Tag questions previously discarded the semantics of the tag phrase,
      contrary to the monotonicity assumption in the ERG.  This is now
      corrected, with the result that the semantics of sentences with
      tag questions is now rather more baroque.  The main benefit of the
      reanalysis is that lexical rules now properly always preserve the
      semantics of their input lexemes.
   f. Sentential subjects were previously analyzed via a nominalization
      rule.  This simplified the syntactic analysis of "That Abrams
      arrived annoyed Browne" since the "annoy" lexeme could always
      unify its ARG1 value with the semantic index of its subject.  But
      the resulting asymmetry for the 'extraposed' and non-extraposed
      variants of lexemes like 'annoy' was annoying.  This version of
      the grammar now provides the same MRS for both variants ('It
      annoyed Browne that Abrams arrived' and the above example), via
      a syntactic variant of an 'it-extraposition' lexical rule, with
      thanks to Ann Copestake for the suggested implementation.  One
      consequence is that the earlier treatment of examples like "The
      problem was that Abrams arrived" no longer works, since it used
      the identity copula, which requires its complement to supply a
      referential index.  So there is also yet another entry
      for the verb 'be', which supplies an EP similar to the identity 
      'be'.
   g. Verbal modifiers of nouns were being given an inconsistent
      semantics, with postnominal modifiers as in 'people singing arias'
      supplying a message for the modifier phrase, but with prenominal
      modifiers as in 'the singing people' not contributing a message.
      In this version of the grammar, verbal projections now always
      supply a message, making the world a little more consistent, but
      leaving a sharper contrast now between "the singing children"
      and "the interesting children" where 'interesting' is analyzed
      as an adjective and hence does not supply a message.
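
   The predicate-naming convention described in item 2c can be
   sketched in Python as follows.  This is only an illustration under
   stated assumptions: the helper names are hypothetical, the example
   predicate names are made up to fit the pattern, and ORTH is assumed
   to contain no underscore.  Filling an empty sense field with '1' is
   shown only to illustrate the default; it is not a migration tool,
   since relational nouns (item 2d) receive a preposition in that
   field instead.

```python
import re

# Pattern for _ORTH_POS_SENSE_rel, with SENSE optional as in earlier releases.
# Assumption: ORTH has no underscore; SENSE excludes "_" as the notes state.
PRED_RE = re.compile(
    r"_(?P<orth>[^_]+)_(?P<pos>[vnapxqc])(?:_(?P<sense>[^_]+))?_rel"
)

def parse_pred(name):
    """Split a lexical predicate name into (orth, pos, sense).

    sense is None when the field is empty (older naming style).
    """
    m = PRED_RE.fullmatch(name)
    if m is None:
        raise ValueError(f"not a lexical predicate name: {name!r}")
    return m.group("orth"), m.group("pos"), m.group("sense")

def fill_default_sense(name, default="1"):
    """Return the name with an empty sense field filled by the default."""
    orth, pos, sense = parse_pred(name)
    return f"_{orth}_{pos}_{sense or default}_rel"

print(parse_pred("_annoy_v_1_rel"))
print(fill_default_sense("_dog_n_rel"))
print(parse_pred("_picture_n_of_rel"))
```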

3. Lexicon

   New lexical entries drawn from the Norwegian tourism domain of the
   LOGON development corpus have been added, bringing the current number
   of lexemes to 22,750 for this release, of which about 2,700 are proper
   names.

4. SEM-I

   A first draft of the semantic interface for the grammar is now 
   presented in the file erg-full.smi, including the predicate names and
   semantic arguments of all predicates introduced either by lexical
   entries or by the grammar (either via lexical/syntactic rules or via
   abstractions over more specific predicates).  Documentation of this
   file is under active development.

5. Naming conventions

   The feature name DIVISIBLE on referential indices has been shortened
   to DIV for better readability of MRSs.

6. LKB warnings on grammar loading

   The LKB's new and improved treatment of morphology offers several
   advantages, and the current version of the grammar benefits from
   these, but loading the grammar still produces some warning messages.
   Users can ignore these messages for now, while the developers resolve
   the underlying causes.  The first warning concerns the
   'punct_bang_rule', and the others warn of lexical rules that can feed
   themselves.

------------------------------------------------------------------------------
Release notes for version "LinGO (30-Apr-05)"

This is a minor update to the Apr-05 version, including some lexical 
additions, adjustments to the semantic predicate hierarchy, and tuning
of syntactic analyses, all designed to improve end-to-end translation
for LOGON.  The only substantive difference is in the analysis of 
possessive constructions, where the grammar now produces nearly
identical MRSs for the two noun phrases "our book" and "a book of ours",
using a new lexical entry for "ours" distinct from the ordinary "ours"
of "ours are not ready".  One consequence of this reanalysis, which
unifies the treatment of the two possessive constructions, is that
the two arguments in the old 'poss_rel' EP have been reversed: what was 
the ARG1 is now ARG2, and vice versa.
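
For code that consumes MRSs produced by the earlier release, the
reversal can be handled with a small adapter.  The sketch below is in
Python and assumes, purely for illustration, that an EP is held as a
dictionary with a "pred" key; neither that representation nor the
function name comes from the grammar or the LKB, and the variable
names are invented.

```python
def swap_poss_args(ep):
    """Return a copy of an EP with ARG1 and ARG2 exchanged if it is a
    poss_rel; any other EP is copied unchanged."""
    swapped = dict(ep)
    if ep.get("pred") == "poss_rel":
        swapped["ARG1"], swapped["ARG2"] = ep["ARG2"], ep["ARG1"]
    return swapped

# Hypothetical EP in the old argument order.
old_ep = {"pred": "poss_rel", "ARG1": "x5", "ARG2": "x8"}
new_ep = swap_poss_args(old_ep)
print(new_ep)
```

Applying the function twice returns the original argument order, so
the same helper converts in either direction.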

------------------------------------------------------------------------------
Release notes for version "LinGO (Apr-05)"

Overview of changes:

 - Lexicon size increased to 21000 entries
 - MRS quality improved
 - Unicode now used for lexicon: foreign proper names, archaic spellings
 - Coverage added for fragments, locative inversion, 'free' parentheticals
 - Changed analyses to allow PP-modif of PPs, APs; adverb-modif of APs
 - Support for new domains: 'shanghai', 'gcide'

--Lexicon--

BNC - Based on months of hard labor by former Stanford students Hansook Lee
and Mike Orme (with help from Ara Kim), the lexicon now contains all verb
subcat entries for the 2000 most frequent verb stems in the British
National Corpus. This should enable some interesting experimentation in
automated lexical acquisition, since there are fewer lexical types that
need to be hypothesized for non-verbs.

GCIDE - The lexicon now also contains entries for all words observed in the
first 10,000 definition 'sentences' in the GNU Contemporary International
Dictionary of English (GCIDE), to enable more precise evaluation of
syntactic coverage of these definitions.

Shanghai - Based on some 1500 entries constructed by Yi Zhang at CoLi in
Saarbruecken, the lexicon now also contains entries for most of the words
found in a Web-derived corpus on tourism in Shanghai, analogous to the
Rondane corpus built by Becky Neil for the LOGON project in Norway.

--MRS quality--

Based on a substantial implementation effort by Stefan Thater and
colleagues at CoLi, Saarbruecken, to check for well-formedness of MRSs
produced by the grammar for the Redwoods and Rondane corpora, many errors
were identified, enabling improvements in MRS construction in the ERG.
Further improvements were enabled by the systematic use of existing
capabilities in the LKB for diagnosing MRS errors in ERG analyses.  While
the current release still produces some flawed MRSs for these data sets,
they are largely confined to a small inventory of known and somewhat
problematic minor phenomena.

--Unicode--

Drawing on the combined expertise of Stephan Oepen and Francis Bond, the
ERG is now fully Unicode-compliant, including the PSQL database.  This
enables proper representation in the lexicon for orthography of non-English
proper names such as "østerbø", and archaic English spellings such as
"coöperation".  The necessary infrastructure for Unicode is admirably and
demonstrably in place in the LKB, PET, [incr tsdb()], and PostgreSQL.
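
Why every component in the chain has to agree on the encoding can be
seen in a short Python sketch.  It uses only Python's own string
handling and makes no claim about how the LKB, PET, or PostgreSQL
process text internally; the variable names are illustrative.

```python
name = "østerbø"

# In UTF-8, "ø" is stored as two bytes (0xC3 0xB8).
utf8_bytes = name.encode("utf-8")

# Reading those bytes with the wrong decoder turns each "ø" into two
# unrelated characters.
garbled = utf8_bytes.decode("latin-1")

# Reading them as UTF-8 recovers the original orthography.
restored = utf8_bytes.decode("utf-8")

print(len(name), len(utf8_bytes))
print(repr(garbled))
print(restored == name)
```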

--Coverage--

Fragments - Further work on the treatment of fragments has been motivated
largely by the effort to parse the definition sentences in GCIDE, and to
give them a consistent semantic representation.  New fragment types now
licensed include VPs and PPs with NP gaps, as in "To devour." or
"Relying on.".

Locative inversion - The grammar now analyzes some locative inversion
phenomena, currently restricted to sentences headed by the finite copula
'be' as in "Near the park is a large dog" but not (yet) "Near the park
stood a large tree".  These appear with some frequency in the Rondane data,
and have also been waiting patiently for twenty years in the CSLI test
suite.

'Free' parentheticals - Sentences containing some classes of parenthetical
material (which would not survive in situ without the parentheses) will now
be analyzed, though further work will be needed in designing the target
semantics.  Example now covered: "That dog (you should see its owner!)
barked."

--Changed analyses--

Modification - Based on more systematic analysis of phenomena found in the
Rondane corpus, and corroborated in the Shanghai corpus, the ERG now
permits more interesting modification structures.  Prepositional phrases,
formerly restricted to modifying only VPs and nominal phrases, can now also
modify adjective phrases and other PPs.  Similarly, adverbs can now also
modify adjective phrases, as in "the wildly happy dog barked", freeing the
grammar from its former requirement that duplicate degree-specifier lexical
entries be added for many adverbs.

--New domains--

The GCIDE corpus has been taken from the GCIDE web site, and carefully
prepared by Eric Nichols at NTT in collaboration with Francis Bond,
including identification of sentence breaks, normalization, and formatting,
all of which are now automated via Perl scripts converting the original
GCIDE data into, among other things, an 'item' file format for use with the
fine system.

The Shanghai corpus is being collected by Yi Zhang in Saarbruecken as part
of his thesis work, and consists of text on tourism in Shanghai, written in
English and mostly but not entirely by native English speakers.  The corpus
may still be revised, so a profile of this data is not (yet) being
distributed with the ERG.
------------------------------------------------------------------------------