[developers] DOS/*nix issue with irregular morphological forms in the LKB

Woodley Packard sweaglesw at sweaglesw.org
Sat Mar 12 17:03:47 CET 2016


Thanks for the tip.  I’ve now (in svn trunk) proactively added robustness to CRLF line endings to ACE’s irregular morphology loading, and also VPM loading.  It was already robust in TDL handling.

-Woodley

> On Mar 12, 2016, at 4:53 AM, Ann Copestake <aac10 at cl.cam.ac.uk> wrote:
> 
> regardless of svn it can happen (I assume) when someone creates an irregs.tab file under Windows
> 
> and sure, I will look at it at some point.  I am tempted to make the " optional
> (that's what I meant by the horrible format - the requirement to have these at the beginning and
> end of the file) and read it in more robustly, though.
> 
> On 12/03/2016 12:33, Stephan Oepen wrote:
>> that is an interesting constellation, indeed :-).
>> 
>> when downloading via SVN to windows, the line-ending conventions are
>> helpfully updated: the file is unix-style (LF) natively, but that is
>> padded to windows-style (CRLF) by the SVN client in your set-up.  with
>> the LKB running in a un*x environment while reading data from the
>> windows filesystem, this problem arises.
>> 
>> we could mark the file as binary in SVN, to prevent the conversion.
>> but probably it would be better and more robust to add something like
>> 
>>   (string-right-trim '(#\Return) ...)
>> 
>> to the code that reads those strings from ‘irregs.tab’.  since
>> you have the testing environment readily available, i would like to
>> defer to you to actually put that into the LKB code.
>> 
>> cheers, oe
>> 
>> 
>> On Sat, Mar 12, 2016 at 1:05 PM, Ann Copestake <aac10 at cl.cam.ac.uk> wrote:
>>> bit of a blast from the past, but I thought it worth recording, since I
>>> might even get round to doing the fix one day
>>> 
>>> If one uses the extremely useful UbuntuLKB/Virtual box under Windows with an
>>> ERG (and presumably other grammars) downloaded from Windows (in my case via
>>> Tortoise svn), one should be aware that reading in of the irregs.tab file
>>> may not work properly because of the different line-ending conventions.  The
>>> effect is that a spurious ^M character gets tacked onto the end of the stem
>>> when morph analysing e.g., slept and so irregular forms are not correctly
>>> recognised. i.e., the symptom is that one can't parse sentences with a
>>> morphologically irregular form.  The work-around is to save the file in
>>> *nix format.  The solution is to check for this in the LKB when reading the
>>> irregs.tab file, which is anyway in a stupid format, but I guess there's no
>>> enthusiasm for changing that now.
>>> 
>>> Ann
>>> 
> 
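
For illustration, here is a minimal Python sketch of the failure and of the trimming fix suggested in the thread. It is not the LKB or ACE code. It assumes that each irregs.tab entry is a space-separated "form RULE stem" triple and that the file opens and closes with a line holding a lone double quote; the rule name PAST_VERB and the function name parse_irregs are made up for the example.

```python
def parse_irregs(text, trim_cr=True):
    """Collect (form, rule, stem) triples from irregs.tab-style text.

    With trim_cr=False, a CRLF-terminated line leaves a carriage return
    glued to the last field (the stem), which is the symptom reported above.
    """
    entries = []
    for line in text.split("\n"):
        if trim_cr:
            # Same idea as (string-right-trim '(#\Return) ...) in Lisp.
            line = line.rstrip("\r")
        # Skip blank lines, the lone double-quote delimiters, and comments.
        if line in ("", '"') or line.startswith(";"):
            continue
        fields = line.split(" ")
        if len(fields) == 3:
            entries.append(tuple(fields))
    return entries


# The same one-entry file with Windows (CRLF) and Unix (LF) line endings.
crlf_text = '"\r\nslept PAST_VERB sleep\r\n"\r\n'
lf_text = '"\nslept PAST_VERB sleep\n"\n'

naive = parse_irregs(crlf_text, trim_cr=False)
robust = parse_irregs(crlf_text)
unix = parse_irregs(lf_text)
```

Without trimming, the stem read from the CRLF file is "sleep" plus a trailing carriage return (the ^M), so a lookup for the stem "sleep" would not match. With trimming, the CRLF and LF files give the same entries.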



