Friday, October 10, 2014

Word candidates

Children learn language without being given an explicit lexicon, so they have to learn words from perceived utterances.  As a machine learning solution for word segmentation, there are methods such as Nested Pitman-Yor Language Modeling.  However, if you just want to get an idea of word candidates, you can use a simpler method, perhaps similar to one used for keyboard input auto-completion.  The basic idea here is that a word may contain substrings whose frequencies are close to the frequency of the word itself.  For example, the frequency of "xample" will be close to the frequency of "example," as the two almost always occur together; such substrings are fragments of a longer string rather than words in their own right.
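
A rough sketch of this counting idea in Python follows (a minimal illustration, not the exact procedure used in the experiment below: the function names, the pruning rule that discards a substring when a one-character extension is almost as frequent, and the thresholds min_count and ratio are assumptions made for the example):

from collections import Counter, defaultdict

def substring_counts(utterances, max_len=15):
    # Count every substring of up to max_len characters in every utterance.
    counts = Counter()
    for u in utterances:
        for i in range(len(u)):
            for j in range(i + 1, min(len(u), i + max_len) + 1):
                counts[u[i:j]] += 1
    return counts

def word_candidates(counts, min_count=10, ratio=0.9):
    # Drop substrings that almost always occur inside one longer string:
    # if some one-character extension of s (e.g. "example" for "xample")
    # occurs at least ratio * count(s) times, treat s as a fragment.
    best_extension = defaultdict(int)
    for t, c in counts.items():
        if len(t) >= 2:
            for shorter in (t[:-1], t[1:]):
                best_extension[shorter] = max(best_extension[shorter], c)
    candidates = [(s, c) for s, c in counts.items()
                  if c >= min_count and best_extension[s] < ratio * c]
    return sorted(candidates, key=lambda sc: -sc[1])

# Toy usage on a handful of artificial utterances (the real corpus had 1000):
corpus = ["illoesverde", "loesblau", "illo", "ilfacefrigide", "loesrubie"]
counts = substring_counts(corpus)
for s, c in word_candidates(counts, min_count=2)[:10]:
    print(f"{s},{c}")
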
Here is the result of a simple experiment with an artificial corpus (modified from the previous labeling experiment by adding one-word utterances such as 'illo' and sentences about the ambience such as 'ilfacefrigide' ("It's cold.")):
        lo,778
        il,531*
      loes,456
      illo,301*
        le,271*
        un,268*
    illoes,262
    angulo,261*
        au,242
        ta,204
        re,188
        as,176
     verde,171*
      esun,160
      blau,157*
     rubie,156*
    ilface,153
   circulo,143*
rectangulo,132*
 loesverde,131
 triangulo,129*
  anguloes,125
  loesblau,122
 loesrubie,120
       lor,104
     tomas,104*
  illoesun,83
ilfacefrigide,80
  ....
* Intended words are marked with '*'.
The numbers are the frequencies of the strings in a corpus of 1000 utterances.
