<div dir="ltr">Thanks Francis, Ann, and Matic,<div><br></div><div>Matic, I&#39;ll look over the code you wrote. It sounds pretty close to what I was after. Thanks for sharing!</div><div><br></div><div>And Ann, to (attempt to) answer your question, I think the tokenization requirement is currently for preprocessing (e.g., he runs a named-entity recognizer over a tokenized string, then uses the results for anonymizing NEs in the DMRS graph). I think his seq-to-seq neural system also uses tokens (as opposed to characters or other sub-word units), but I don&#39;t think it&#39;s currently necessary to retokenize for training/decoding. You can see the code and links to the paper here: <a href="https://github.com/sinantie/NeuralAmr">https://github.com/sinantie/NeuralAmr</a>.</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 27, 2017 at 3:41 AM, Matic Horvat <span dir="ltr">&lt;<a href="mailto:matic.horvat@cl.cam.ac.uk" target="_blank">matic.horvat@cl.cam.ac.uk</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div>Hi,</div><div><br></div>To expand on Ann&#39;s email, the problem I needed to solve was to align the DMRS EPs with a PTB-style tokenized sentence. In addition to handling punctuation and whitespace differences, I needed to align null semantics items, which are not associated with any predicate in the ERG. The latter is done using heuristics. The code is available under the MIT license here: <a href="https://github.com/matichorvat/pydmrs" target="_blank">https://github.com/<wbr>matichorvat/pydmrs</a>.
<div><br></div><div>The relevant modules are: </div><div>General alignment (without null semantics items): <a href="https://github.com/matichorvat/pydmrs/blob/master/dmrs_preprocess/token_align.py" target="_blank">https://github.com/<wbr>matichorvat/pydmrs/blob/<wbr>master/dmrs_preprocess/token_<wbr>align.py</a></div><div>Null semantics item alignment: <a href="https://github.com/matichorvat/pydmrs/blob/master/dmrs_preprocess/unaligned_tokens_align.py" target="_blank">https://github.com/<wbr>matichorvat/pydmrs/blob/<wbr>master/dmrs_preprocess/<wbr>unaligned_tokens_align.py</a><br></div><div>Heuristics for null semantics item alignment: <a href="https://github.com/matichorvat/pydmrs/blob/master/dmrs_preprocess/unaligned_tokens_heuristics.py" target="_blank">https://github.com/<wbr>matichorvat/pydmrs/blob/<wbr>master/dmrs_preprocess/<wbr>unaligned_tokens_heuristics.py</a></div><div><br></div><div>I hope that helps!</div><div><br></div><div>Best,<br>Matic</div><div><br><div><br></div><div><br></div></div></div><div class="HOEnZb"><div class="h5"><div class="gmail_extra"><br><div class="gmail_quote">On Tue, Jun 27, 2017 at 9:26 AM, Ann Copestake <span dir="ltr">&lt;<a href="mailto:aac10@cl.cam.ac.uk" target="_blank">aac10@cl.cam.ac.uk</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
  
    
  
  <div bgcolor="#FFFFFF" text="#000000">
    <p>Matic&#39;s thesis indeed has an approach to the version of the
      problem he had to deal with (not quite the same), and he will make
      code available.  The thesis will be generally available once he&#39;s
      done some corrections.  But he&#39;s now working in a company, so he
      won&#39;t be supporting the code, and it was in any case far from perfect.</p>
    <p>Is the system you&#39;re trying to integrate with really just
      space-tokenized?  People generally use something a little more
      complex.<br>
    </p>
    <p>All best,<br>
    </p>
    <br>
    Ann<div><div class="m_8009082640403460885h5"><br>
    <br>
    <div class="m_8009082640403460885m_2364004831508039477moz-cite-prefix">On 26/06/2017 05:54, Francis Bond
      wrote:<br>
    </div>
    </div></div><blockquote type="cite"><div><div class="m_8009082640403460885h5">
      <div dir="ltr">I am pretty sure Matic has done some work on this
        problem, ...</div>
      <div class="gmail_extra"><br>
        <div class="gmail_quote">On Mon, Jun 26, 2017 at 6:50 AM,
          Michael Wayne Goodman <span dir="ltr">&lt;<a href="mailto:goodmami@uw.edu" target="_blank">goodmami@uw.edu</a>&gt;</span> wrote:<br>
          <blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
            <div dir="ltr">Thanks Woodley,
              <div class="gmail_extra"><br>
                <div class="gmail_quote"><span>On Sun, Jun 25,
                    2017 at 8:03 PM, Woodley Packard <span dir="ltr">&lt;<a href="mailto:sweaglesw@sweaglesw.org" target="_blank">sweaglesw@sweaglesw.org</a>&gt;</span>
                    wrote:<br>
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Have you
                      considered passing a pre-tokenized string
                      (produced by REPP or otherwise) into ACE? 
                      Character spans will then automatically be
                      produced relative to that string.  Or maybe I
                      misunderstood your goal?</blockquote>
                    <div><br>
                    </div>
                  </span>
                  <div>Yes, I have tried this, but (a) I still get
                    things like the final period being in the same span
                    as the final word (now with the additional space);
                    (b) I&#39;m concerned about *over*-tokenization, if the
                    REPP rules find something in the tokenized string to
                    further split up; and (c) while it was able to parse
                    &quot;The dog could n&#39;t bark .&quot;, it fails to parse things
                    like &quot;The kids &#39; toys are in the closet .&quot;.</div>
                  <div><br>
                  </div>
                  <div>As to my goal, consider again &quot;The dog couldn&#39;t
                    bark.&quot; The initial (post-REPP) tokens are:</div>
                  <div><br>
                  </div>
                  <div>
                    <div style="font-size:12.8px">    &lt;0:3&gt;      &quot;The&quot;</div>
                    <div style="font-size:12.8px">    &lt;4:7&gt;      &quot;dog&quot;</div>
                    <div style="font-size:12.8px">    &lt;8:13&gt;     &quot;could&quot;</div>
                    <div style="font-size:12.8px">    &lt;13:16&gt;    &quot;n’t&quot;</div>
                    <div style="font-size:12.8px">    &lt;17:21&gt;    &quot;bark&quot;</div>
                    <div style="font-size:12.8px">    &lt;21:22&gt;    &quot;.&quot;</div>
                  </div>
                  <div style="font-size:12.8px"><br>
                  </div>
                  <div style="font-size:12.8px">The internal tokens are:</div>
                  <div style="font-size:12.8px"><br>
                  </div>
                  <div style="font-size:12.8px">
                    <div style="font-size:12.8px">    &lt;0:3&gt;      &quot;the&quot;</div>
                    <div style="font-size:12.8px">    &lt;4:7&gt;      &quot;dog&quot;</div>
                    <div style="font-size:12.8px">    &lt;8:16&gt;     &quot;couldn’t&quot;</div>
                    <div style="font-size:12.8px">    &lt;17:22&gt;    &quot;bark.&quot;</div>
                    <div><br>
                    </div>
                  </div>
                  <div>I would like to adjust the latter values to fit
                    the string where the initial tokens are all space
                    separated. So the new string is &quot;The dog could n&#39;t
                    bark .&quot;, and the LNK values would be:</div>
                  <div><br>
                  </div>
                  <div style="font-size:12.8px">    &lt;0:3&gt;    
                     _the_q</div>
                  <div style="font-size:12.8px">    &lt;4:7&gt;    
                     _dog_n_1</div>
                  <div style="font-size:12.8px">    &lt;8:17&gt;    
                    _can_v_modal, neg  (CTO + 1 from the internal space)</div>
                  <div style="font-size:12.8px">    &lt;18:22&gt;  
                     _bark_v_1  (CFROM + 1 from previous adjustment; CTO
                    - 1 to get rid of the final period)</div>
                  <div><br>
                  </div>
                  <div>My colleague uses these to anonymize named
                    entities, numbers, etc., and for this task he says
                    he can be somewhat flexible. But he also uses them
                    for an attention layer in his neural setup, in which
                    case he&#39;d need exact alignments.</div>
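<div>The adjustment above can be sketched roughly as follows. This is only an illustration of the offset arithmetic, not code from anyone in this thread: the function name <code>remap_lnk</code>, the <code>(start, end, form)</code> token-tuple format, and the edge-punctuation heuristic are all my own assumptions.</div>

```python
import string

def remap_lnk(cfrom, cto, tokens, strip_edge_punct=True):
    """Remap a CFROM/CTO span from the original string onto the
    space-separated token string " ".join(form for _, _, form in tokens).

    tokens: list of (orig_start, orig_end, form) tuples in surface
    order, e.g. from the initial (REPP / p-input) tokenization.
    """
    # Compute each token's span in the space-joined string.
    new_spans, offset = [], 0
    for _, _, form in tokens:
        new_spans.append((offset, offset + len(form)))
        offset += len(form) + 1  # +1 for the separating space

    # Select the tokens whose original span overlaps [cfrom, cto).
    hit = [i for i, (s, e, _) in enumerate(tokens) if s < cto and e > cfrom]
    if not hit:
        return cfrom, cto  # no overlapping token; leave unchanged

    # Heuristically drop edge tokens that are pure punctuation
    # (e.g. the final "." that internal tokenization absorbs into "bark.").
    if strip_edge_punct:
        is_punct = lambda i: all(c in string.punctuation for c in tokens[i][2])
        while len(hit) > 1 and is_punct(hit[-1]):
            hit.pop()
        while len(hit) > 1 and is_punct(hit[0]):
            hit.pop(0)

    return new_spans[hit[0]][0], new_spans[hit[-1]][1]
```

<div>On the example above, remapping &lt;8:16&gt; gives &lt;8:17&gt; and &lt;17:22&gt; gives &lt;18:22&gt;, matching the listing. Note that the punctuation-stripping step is exactly the heuristic part: it would also strip the possessive apostrophe in &quot;The kids &#39; toys&quot;, the case flagged above, so a real implementation needs finer-grained rules.</div>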
                  <span>
                    <div><br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><span class="m_8009082640403460885m_2364004831508039477m_-6852063119269118278gmail-m_3482906361254902959HOEnZb"><font color="#888888"><br>
                          Woodley<br>
                        </font></span>
                      <div class="m_8009082640403460885m_2364004831508039477m_-6852063119269118278gmail-m_3482906361254902959HOEnZb">
                        <div class="m_8009082640403460885m_2364004831508039477m_-6852063119269118278gmail-m_3482906361254902959h5"><br>
                          <br>
                          <br>
                          <br>
                          &gt; On Jun 25, 2017, at 3:14 PM, Michael
                          Wayne Goodman &lt;<a href="mailto:goodmami@uw.edu" target="_blank">goodmami@uw.edu</a>&gt;
                          wrote:<br>
                          &gt;<br>
                          &gt; Hi all,<br>
                          &gt;<br>
                          &gt; A colleague of mine is attempting to use
                          ERG semantic outputs in a system originally
                          created for another representation, and his
                          system requires the semantics to be paired
                          with a tokenized string (e.g., with
                          punctuation separated from the word tokens).<br>
                          &gt;<br>
                          &gt; I can get the space-delimited tokenized
                          string, e.g., from repp or from ACE with the
                          -E option, but then the CFROM/CTO values in
                          the MRS no longer align to the string. The
                          initial tokens (&#39;p-input&#39; in the &#39;parse&#39; table
                          of a [incr tsdb()] profile) can tell me the
                          span of individual tokens in the original
                          string, which I could use to compute the
                          adjusted spans. This seems simple enough, but
                          then it gets complicated as there are
                          separated tokens that should still count as a
                          single range (e.g. &quot;could n&#39;t&quot;, where
                          &#39;_can_v_modal&#39; and &#39;neg&#39; both select the full
                          span of &quot;could n&#39;t&quot;) and also those I want
                          separated, like punctuation (but not all
                          punctuation, like &#39; in &quot;The kids&#39; toys are in
                          the closet.&quot;).<br>
                          &gt;<br>
                          &gt; Has anyone else thought about this
                          problem and can share some solutions? Or, even
                          better, code to realign EPs to the tokenized
                          string?<br>
                          &gt;<br>
                          &gt; --<br>
                          &gt; Michael Wayne Goodman<br>
                          &gt; Ph.D. Candidate, UW Linguistics<br>
                        </div>
                      </div>
                    </blockquote>
                  </span></div>
                <span><br>
                  <br clear="all">
                  <div><br>
                  </div>
                  -- <br>
                  <div class="m_8009082640403460885m_2364004831508039477m_-6852063119269118278gmail-m_3482906361254902959gmail_signature">
                    <div dir="ltr">Michael Wayne Goodman
                      <div>Ph.D. Candidate, UW Linguistics</div>
                    </div>
                  </div>
                </span></div>
            </div>
          </blockquote>
        </div>
        <br>
        <br clear="all">
        <div><br>
        </div>
        -- <br>
        <div class="m_8009082640403460885m_2364004831508039477gmail_signature" data-smartmail="gmail_signature">Francis
          Bond &lt;<a href="http://www3.ntu.edu.sg/home/fcbond/" target="_blank">http://www3.ntu.edu.sg/home/f<wbr>cbond/</a>&gt;<br>
          Division of Linguistics and Multilingual Studies<br>
          Nanyang Technological University<br>
        </div>
      </div>
      </div></div>
    </blockquote>
    <br>
  </div>

</blockquote></div><br></div>
</div></div></blockquote></div><br><br clear="all"><div><br></div>-- <br><div class="gmail_signature" data-smartmail="gmail_signature"><div dir="ltr">Michael Wayne Goodman<div>Ph.D. Candidate, UW Linguistics</div></div></div>
</div>