[developers] [incr tsdb()] Process | Vocabulary output

Mon Aug 20 20:53:27 CEST 2012

Thanks for the reply!  The reason I'm playing around with this
is that I'm trying to get a handle on lexical coverage of an auto-generated
grammar for Chintang.  I'd like to be able to measure, for a given
profile, the number of word forms that the grammar has some analysis
of (at both the type and token level).  For my purposes, naive
tokenization is just fine, if that means breaking on whitespace.
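
A minimal sketch of the measure described above, in Python, assuming
whitespace tokenization.  The predicate has_analysis is a hypothetical
stand-in for "the grammar has some analysis of this word form"; it is not
an [incr tsdb()] or LKB function, and the toy data is made up.

```python
from collections import Counter


def lexical_coverage(items, has_analysis):
    """Return (type coverage, token coverage) as fractions in [0, 1].

    items: iterable of profile item strings, tokenized naively on whitespace.
    has_analysis: hypothetical predicate, True if the grammar has some
    analysis of the given word form.
    """
    # Count how often each word form (type) occurs across all items.
    counts = Counter(form for item in items for form in item.split())
    types = len(counts)
    tokens = sum(counts.values())
    if tokens == 0:
        return 0.0, 0.0
    covered = [form for form in counts if has_analysis(form)]
    covered_tokens = sum(counts[form] for form in covered)
    return len(covered) / types, covered_tokens / tokens


# Toy example (invented data): only "ba-ce-ko" is treated as analyzable.
items = ["ba-ce-ko foo", "ba-ce-ko bar ba-ce-ko"]
known = {"ba-ce-ko"}
type_cov, token_cov = lexical_coverage(items, known.__contains__)
```

Here 1 of 3 types and 3 of 5 tokens are covered.  In practice the
predicate would need to be backed by whatever record of per-form analyses
is available, which is the part that depends on dag-inflected-p.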

For development purposes, Process | Vocabulary has already been useful as an
error analysis tool.  I was hoping to use it for evaluation as well
(to quantify coverage as described above), but it sounds like that might
require figuring out the right definition of dag-inflected-p, so
that I get a count of the fully inflected forms.

Emily

On Mon, Aug 20, 2012 at 11:47 AM, Stephan Oepen <oe at ifi.uio.no> wrote:
> from a quick glance at the (ancient) code, the two numbers
> come from the fields 'words' and 'l-stasks', which (in the LKB)
> are counted (in the chart after parsing) as follows:
>
>   (cond
>    ((rule-p rule)
>     (incf pedges)
>     (when (lexical-rule-p rule) (incf l-stasks)))
>    ((not (rule-p rule))
>     (when (and dag (dag-inflected-p dag)) (incf words)))))
>
> the default definition of dag-inflected-p() is
>
>   (defun dag-inflected-p (dag)
>     (declare (ignore dag))
>     t)
>
> so, yes, unless a grammar changes dag-inflected-p(), it
> should come to the number of lexical entries instantiated,
> and the number of successful applications of lexical rules.
>
> i do not immediately recall the utility of dag-inflected-p(),
> but vaguely recall something in the space of compiling
> out a full-form lexicon ...
>
> do you actually find this functionality useful?  i am almost
> tempted to disable it, as it is severely flawed, for example
> performing naive tokenization on the [incr tsdb()] side.
>
> cheers, oe
>
>
> On Mon, Aug 20, 2012 at 8:11 PM, Emily M. Bender <ebender at uw.edu> wrote:
>> Following up on the thread below:
>>
>> Either I haven't understood your answer properly, or something else
>> is going on.  For my current Chintang grammar, I get results like
>> this:
>>
>> ba-ce-ko | 1 references | [1 + 2] lexical entrie(s);
>>
>> In lexicon.tdl, I have an entry with [ ORTH "ba" ], but none with
>> [ ORTH "ba-ce-ko" ].  Could this instead mean something like
>> "there is one analysis available of this word, and it involves
>> the application of two lexical rules"?  (That would be consistent
>> with what I see in the parse chart, parsing just this item.)
>>
>> In some other cases, it looks like a result of [1 + 0] lexical entrie(s)
>> means that the orthographic rules allowed the LKB to strip
>> affixes and find a stem, but then the tfs associated with those
>> lexical rules didn't unify.
>>
>> Regarding dag-inflected-p, I haven't defined it in user-fns.lsp yet,
>> so I'm getting whatever the default is in the LKB.  (And I'm not sure I
>> would know how to write something more appropriate, as we're
>> now using a complex fs as the value of INFLECTED.  Can
>> dag-inflected-p check whether the value of that feature unifies
>> with some other type, or can it only check types as strings?)
>>
>> Thanks,
>> Emily
>>
>>
>>
>>
>> On Wed, Aug 8, 2012 at 2:53 PM, Dan Flickinger <danf at stanford.edu> wrote:
>>> The test being applied is the function dag-inflected-p, which can be defined in a grammar's user-fns.lsp file.  In the ERG, this is a check for the value of the feature INFLECTD.
>>>
>>>  Dan
>>>
>>> ----- Original Message -----
>>> From: "Emily M. Bender" <ebender at uw.edu>
>>> To: "Dan Flickinger" <danf at stanford.edu>
>>> Cc: "developers" <developers at delph-in.net>
>>> Sent: Wednesday, August 8, 2012 2:16:28 PM
>>> Subject: Re: [developers] [incr tsdb()] Process | Vocabulary output
>>>
>>> Thanks, Dan. What property is being used to indicate "fully inflected"
>>> in this case?
>>>
>>> Emily
>>>
>>> On Wed, Aug 8, 2012 at 1:50 PM, Dan Flickinger <danf at stanford.edu> wrote:
>>>> Hi Emily -
>>>>
>>>> My understanding is that the first of the two numbers in the "[n + n]" report is the number of already inflected lexical entries defined in the lexicon with the given orthography, and the second number is the number of lexical rules that can apply to a lexicon-defined entry's stem to produce the given orthography.  So [basic + derived].  Thus, in the ERG, "dog" is reported as "[0 + 2]" since there is no fully inflected lexical entry defined with that spelling, but there are two derived forms with that spelling, applying an inflectional rule to each of the noun and verb entries defined in the lexicon.
>>>>
>>>>  Dan
>>>>
>>>> ----- Original Message -----
>>>> From: "Emily M. Bender" <ebender at uw.edu>
>>>> To: "developers" <developers at delph-in.net>
>>>> Sent: Thursday, July 19, 2012 4:00:58 PM
>>>> Subject: [developers] [incr tsdb()] Process | Vocabulary output
>>>>
>>>> Dear all,
>>>>
>>>> Is there any documentation on the output of the Process | Vocabulary
>>>> function in [incr tsdb()]?  I'm curious in particular what the two
>>>> numbers in "[n + n] lexical entries" mean, but couldn't turn up
>>>> anything on the wiki.
>>>>
>>>> Thanks,
>>>> Emily
>>>>
>>>> --
>>>> Emily M. Bender
>>>> Associate Professor
>>>> Department of Linguistics
>>>> Check out CLMS on facebook! http://www.facebook.com/uwclma

-- 
Emily M. Bender
Associate Professor
Department of Linguistics
Check out CLMS on facebook! http://www.facebook.com/uwclma