The brain is not a single universal neural network that does everything well. It's a collection of different neural networks that specialize in different tasks, and probably use very different methods to achieve them.
It seems like the way forward would be networking together various kinds of neural networks to achieve complex goals. For example, an NTM specialized in formulating plans that has access to a CNN for image recognition, and so on.
Actually, in the architecture you described, if there is a planning net that's connected to image net and an audio net, rather than feeding audio to the image net I think synesthesia would be better modeled by feeding the output of the audio net into the image net's input on the planning net. If that makes sense.
It's how some guys defeated the first iteration of recaptcha's audio mode. Then google replaced it with something very annoying to use even for humans.
It seems like the way forward would be networking together various kinds of neural networks to achieve complex goals. For example, an NTM specialized in formulating plans that has access to a CNN for image recognition, and so on.