For the past few days, I’ve been reading Kevin Markey’s extremely interesting 1994 PhD thesis, “The sensorimotor foundations of phonology: a computational model of early childhood articulatory and phonetic development.” I’d have loved to write this thesis myself, nearly 20 years later! It’s been quite inspiring to read, though, because of the assumptions and other decisions that Markey made in the course of making his model tractable. I think there are a few of these assumptions that we might be able to tackle more directly now, thanks to some of the more recent tools of graphical models and the like.
Several of these ideas come from reading Chapter 7 of the thesis, which details the workings of the reinforcement learning system that drives the motor side of the model. As I understand it, there is a perceptual system that analyzes very small segments of speech sounds and “encodes” them into vectors describing some predefined linguistic features of the sound. These sequences of feature vectors are then used as targets for the motor system to imitate.
At the top level of the motor system is the “phonological controller.” Its role is to choose a low-level “articulatory controller” that is most likely to produce the next speech sound as output. Once it makes this choice, the phonological controller goes offline, as it were, deferring actions to the chosen articulatory controller until it signals that it is finished. The phonological controller sees the linguistic features of the target sound, as well as the linguistic features of the sounds being produced by the articulatory controllers. Each articulatory controller then executes a “closed-loop” control policy that effectively produces one speech sound (or a limited distribution of speech sounds) by repeatedly choosing articulatory target poses. The lowest level of the motor system applies a spring dynamics model to the sequence of target poses, in effect interpolating smoothly between the poses. Interestingly, the articulatory controllers do not receive direct feedback about their linguistic performance; instead, they receive feedback about, basically, the state of the airflow in the vocal tract. (Markey appropriately calls these features proprioceptive.) Each articulatory controller receives a reinforcement signal from the phonological controller whenever it finishes, allowing the articulatory controller to learn which sequence of poses is correct for the region of target speech sounds that it “owns.” In turn, the phonological controller receives reinforcement proportional to the similarity of the overall speech signal to the target; this (somehow!) allows the phonological controller to learn which articulatory modules are good at reproducing different portions of the speech signal.
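To make the lowest level a bit more concrete, here is a minimal sketch of how a spring model can smooth a sequence of target poses. This is my own illustration, not Markey's actual dynamics: the single 1-D articulator, the unit mass, the critical damping, and every name and parameter value (`spring_trajectory`, `stiffness`, `dt`, and so on) are assumptions.

```python
def spring_trajectory(targets, steps_per_target=50, dt=0.01,
                      stiffness=400.0, start=0.0):
    """Follow a sequence of 1-D target poses with a damped spring.

    Assumes a unit mass and critical damping, so the pose approaches
    each target smoothly instead of jumping to it.
    """
    damping = 2.0 * stiffness ** 0.5  # critical damping for unit mass
    position, velocity = start, 0.0
    trajectory = []
    for target in targets:
        for _ in range(steps_per_target):
            # Spring pulls toward the current target; damper resists motion.
            acceleration = stiffness * (target - position) - damping * velocity
            velocity += acceleration * dt  # semi-implicit Euler step
            position += velocity * dt
            trajectory.append(position)
    return trajectory


if __name__ == "__main__":
    path = spring_trajectory([1.0, 0.0])
    print(len(path), round(path[49], 3), round(path[-1], 3))
```

In this picture, an articulatory controller only has to pick the entries of `targets`; the spring fills in everything between them.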
I really like this idea of having a bunch of “modules” that are specialized for regions of the output space. It seems that this isn’t really a new idea, actually: Markey points to several papers from the early 90s that describe different strategies for allocating responsibility in hierarchical models of reinforcement learning, but unfortunately I haven’t read them yet. Given my limited knowledge of this area, then, I wanted to speculate a bit on some parallels with more modern graphical models for this type of learning. I hope to explore some of these parallels by, in effect, replicating some of what Markey did, using audio signals as a training database.
First, assigning modules or the like to “cover” a space seems an awful lot like clustering to me. It would be really fun to see whether there would be a parallel here with nonparametric clustering models like the Dirichlet process. Perhaps you could start with just a couple of motor modules, and they would cover random portions of the output space. Then, by comparing their typical outputs with the target sounds that they are trying to produce, you could split up modules that seem to have high variance, or that consistently produce output far from a specific target. I suppose this would have a significant impact on the higher-level controller; in fact, there’s a confound in that the high-level controller might choose to avoid a motor module whose distribution is far from a specific target, instead of splitting that module. For example, if trying to produce sound \(s\), the controller might choose between \(m_0\) and \(m_1\). If \(m_0\)’s typical output is centered near \(s\), then the controller should choose it over \(m_1\). However, if neither \(m_0\) nor \(m_1\) is close to \(s\), then they might be good candidates for splitting, perhaps based on their distances to \(s\). If a new module, \(m_2\), gets split out, then its output would be evaluated to see whether it was close to \(s\); if it was, it might be retained, and otherwise it would be discarded.
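The choose-or-split decision for \(m_0\) and \(m_1\) could be sketched roughly as follows. Everything here is a simplifying assumption on my part: sounds are single numbers, each module is summarized by the mean and variance of its recent outputs, and the function name and both thresholds are made up for illustration.

```python
from statistics import mean, pvariance


def choose_or_split(target, module_outputs, max_distance=1.0, max_variance=0.5):
    """Decide whether to use the nearest module or flag it for splitting.

    module_outputs maps a module name to a list of its recent 1-D outputs.
    The thresholds are arbitrary illustrative values.
    """
    summaries = {
        name: (mean(outputs), pvariance(outputs))
        for name, outputs in module_outputs.items()
    }
    # The module whose typical output is closest to the target.
    best = min(summaries, key=lambda name: abs(summaries[name][0] - target))
    center, variance = summaries[best]
    if abs(center - target) > max_distance or variance > max_variance:
        return ("split", best)  # even the closest module is a poor fit
    return ("use", best)


if __name__ == "__main__":
    modules = {"m0": [0.9, 1.1, 1.0], "m1": [4.0, 4.2, 3.8]}
    print(choose_or_split(1.0, modules))  # target near m0's center
    print(choose_or_split(2.4, modules))  # target far from both modules
```

Note that this sketch sidesteps the confound mentioned above: it always considers the nearest module for splitting, whereas a learned high-level controller might simply stop selecting that module.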
In fact, you could look at this as a sort of proximal Dirichlet allocation. Instead of asking each cluster for the likelihood of its having produced a given observation, you’d be asking each motor module to generate a small sample of outputs, and then you could compare those with the target and choose among the modules accordingly. I suppose the comparison falls apart if you look at it more closely, though, because there’s often just one chance to imitate a target in motor learning tasks.
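Here is a rough sketch of that sampling idea, under heavy assumptions: each module is stood in for by a 1-D Gaussian (a center and a spread), and modules are scored by the average distance of a few samples from the target. The function name and the scoring rule are hypothetical.

```python
import random


def pick_module_by_sampling(target, modules, n_samples=5, rng=None):
    """Ask each module for a few sample outputs and pick the closest one.

    modules maps a name to (center, spread); each module is modeled as a
    1-D Gaussian purely for illustration.
    """
    rng = rng or random.Random(0)
    scores = {}
    for name, (center, spread) in modules.items():
        samples = [rng.gauss(center, spread) for _ in range(n_samples)]
        # Average distance from the target: lower is better.
        scores[name] = sum(abs(s - target) for s in samples) / n_samples
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    modules = {"m0": (0.0, 0.1), "m1": (5.0, 0.1)}
    best, scores = pick_module_by_sampling(0.2, modules)
    print(best, scores)
```

The samples here are imagined "rehearsals" that cost nothing, which is exactly where the analogy with a real motor task gets strained.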
Another thread I’ve been quite into recently is the idea of sparsity and optimal encoding in machine learning. In a sensory context, sparsity is great: all you have to do is throw data at a learning algorithm, and you often get beautiful “basis vectors” as a result.
In the motor context, sparse coding using basis vectors could represent a small number of motor primitives that are combined at execution time to form complex movements. The problem with sparse coding (or, heck, even dense coding, as long as you look at it as a basis vector problem) for motor movements is that there’s no learning signal – nearly everyone who’s explored this area of machine learning seems to have come across this and realized that reinforcement learning is the only solution.
But what about using sparsity as a way of making the reinforcement learning problem more tractable? Anyone who’s tried RL on any reasonably complex state space knows the pain that this supposedly good algorithm can bring to the machine learning student. Partial observability is the bane of all real tasks. So, what about using a sparse coding of the state space to infer the transition matrix in a reasonably compact space? I’ve worked on an RL model of driving for a year or so now, and I know well how only a small portion of the state space is ever visited in “real” driving scenarios. If we could code the raw, continuous world using a tiny subset of some basis, then perhaps the transitions could be calculated among basis vectors.
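As a toy version of this, the sketch below uses the crudest possible sparse code, a single active basis vector per state (which is really just vector quantization, not a learned sparse coding), and counts transitions between active basis vectors. The basis is assumed to be given, and all names are hypothetical.

```python
def nearest_basis(state, basis):
    """Index of the basis vector closest to state (a 1-sparse code)."""
    return min(
        range(len(basis)),
        key=lambda i: sum((s - b) ** 2 for s, b in zip(state, basis[i])),
    )


def estimate_transitions(states, basis):
    """Estimate a transition matrix among basis vectors from a trajectory."""
    k = len(basis)
    counts = [[0] * k for _ in range(k)]
    codes = [nearest_basis(s, basis) for s in states]
    for current, following in zip(codes, codes[1:]):
        counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows for basis vectors that were never visited stay all zero.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


if __name__ == "__main__":
    basis = [(0.0, 0.0), (1.0, 1.0)]
    states = [(0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.0, 0.2)]
    print(estimate_transitions(states, basis))
```

The matrix has one row per basis vector rather than one per raw state, so its size depends on the size of the basis and not on how finely the continuous world is discretized.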