Friday, December 30, 2011

Embarrassingly Parallel

I remember we found this expression in one of the papers we were studying during the Large Scale Machine Learning seminar. Here is a definition, taken from Wikipedia:

In parallel computing, an embarrassingly parallel workload (or embarrassingly parallel problem) is one for which little or no effort is required to separate the problem into a number of parallel tasks. This is often the case where there exists no dependency (or communication) between those parallel tasks.
Full article: http://en.wikipedia.org/wiki/Embarrassingly_parallel
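To make the idea concrete, here is a minimal sketch in Python (the toy predict function and the samples are just assumptions for illustration): each input is processed on its own, so the work splits cleanly across processes with no communication between tasks.

```python
from multiprocessing import Pool

def predict(sample):
    # A toy "prediction": each task reads only its own input,
    # so workers never need to communicate or share state.
    return sum(sample) / len(sample) > 0.5

if __name__ == "__main__":
    samples = [[0.1, 0.9, 0.7], [0.2, 0.1, 0.3], [0.8, 0.9, 0.6]]
    with Pool() as pool:
        # map hands one independent task to each worker
        labels = pool.map(predict, samples)
    print(labels)  # [True, False, True]
```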

Tuesday, December 13, 2011

Synthetic Training Data

One of the recent breakthroughs in computer vision was the use of synthetic data to train effective recognition systems. The most remarkable example is the work of Shotton et al. from Microsoft Research Cambridge, who trained the Kinect human pose recognition system using more than a million synthetic images, generated by rendering fake humans using 3D software.

That's perfectly possible mainly because the computer graphics industry has been working on generating 3D humans with realistic appearance for cinematographic productions and games. Clothing materials, hair, body shapes and so on are easily simulated in 3D. So, generating 3D human poses is a well-established and well-understood procedure, and it makes sense to generate synthetic data for this kind of problem.

What about medical images? Can we generate synthetic images simulating various medical conditions? It sounds a bit weird, right? Well, during SIPAIM 2011 in Bucaramanga, Juan Antonio, one of the invited speakers from Spain, gave a tutorial on how medical images are captured and generated, especially using Magnetic Resonance Imaging (MRI). This is also a very well-known physical phenomenon that could be "simulated" using 3D software. Perhaps it's not a conventional 3D rendering procedure, because the intensities observed in an MRI scan are not responses to light, as in conventional photography, but to magnetic resonance, as its name suggests.
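For instance, the classical spin-echo signal equation relates the observed intensity to tissue parameters (proton density, T1, T2) and machine settings (TR, TE). A minimal sketch of such a renderer, where the tissue values are only rough literature-style assumptions for illustration:

```python
import numpy as np

def spin_echo_signal(pd, t1, t2, tr, te):
    """Approximate spin-echo MRI signal intensity.

    pd: proton density, t1/t2: tissue relaxation times (ms),
    tr/te: repetition and echo times of the scanner (ms).
    """
    return pd * (1.0 - np.exp(-tr / t1)) * np.exp(-te / t2)

# Rough tissue values at ~1.5 T (assumptions for illustration only)
tissues = {
    "white_matter": dict(pd=0.70, t1=600.0, t2=80.0),
    "gray_matter":  dict(pd=0.85, t1=950.0, t2=100.0),
    "csf":          dict(pd=1.00, t1=4000.0, t2=2000.0),
}

# A T1-weighted machine configuration: short TR, short TE
for name, p in tissues.items():
    s = spin_echo_signal(p["pd"], p["t1"], p["t2"], tr=500.0, te=15.0)
    print(name, round(s, 3))
```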

Anyway, we could simulate various tissues and their responses to different machine configurations (the intensity of the magnetic field, for instance) and render quite good MRI scans by playing with tissue parameters. Then, we could produce a really large dataset of images with quite precise labelling at the pixel level, useful for training a recognition system for medical images. It could eventually work for X-rays and other imaging modalities.
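A sketch of how that dataset could be produced, reusing the spin-echo renderer above (the tissue parameters and map sizes are again hypothetical): every rendered scan comes with exact pixel-level labels for free, because we start from the label map.

```python
import numpy as np

rng = np.random.default_rng(0)

def render_scan(label_map, params, tr, te, noise=0.02):
    """Render a synthetic MRI slice from a pixel-wise tissue label map."""
    image = np.zeros(label_map.shape)
    for label, p in params.items():
        signal = p["pd"] * (1 - np.exp(-tr / p["t1"])) * np.exp(-te / p["t2"])
        image[label_map == label] = signal
    return image + rng.normal(0.0, noise, label_map.shape)

# Hypothetical tissue parameters (same assumptions as before)
params = {
    1: dict(pd=0.70, t1=600.0, t2=80.0),    # white matter
    2: dict(pd=0.85, t1=950.0, t2=100.0),   # gray matter
    3: dict(pd=1.00, t1=4000.0, t2=2000.0), # CSF
}

# One random label map rendered under varied machine settings,
# each scan paired with exact pixel-level ground truth.
label_map = rng.integers(1, 4, size=(64, 64))
dataset = [(render_scan(label_map, params, tr, te), label_map)
           for tr in (400.0, 500.0, 2000.0)
           for te in (15.0, 30.0, 90.0)]
```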

On the other hand, I wonder: if we can simulate that data, why can't we use the simulation function directly inside a learning algorithm? In other words, instead of generating data to train a learning algorithm, the recognition system might take that sort of simulation function as prior knowledge to make more effective predictions... does it make sense?
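One way to read that idea is analysis-by-synthesis: put the simulator inside the objective function and search for the tissue parameters whose rendered signals best explain an observation. A purely hypothetical sketch (the settings and "true" tissue values are invented for the example):

```python
import numpy as np
from scipy.optimize import minimize

def simulate(pd, t1, t2, tr, te):
    # Same spin-echo approximation as before
    return pd * (1 - np.exp(-tr / t1)) * np.exp(-te / t2)

# Observed signals of one pixel under several machine settings
settings = [(400.0, 15.0), (2000.0, 30.0), (2000.0, 90.0)]
true_params = (0.85, 950.0, 100.0)  # hidden gray-matter-like tissue
observed = np.array([simulate(*true_params, tr, te) for tr, te in settings])

def loss(theta):
    # The simulator itself sits inside the objective: we look for
    # tissue parameters whose rendered signals match the observation.
    pd, t1, t2 = theta
    predicted = np.array([simulate(pd, t1, t2, tr, te)
                          for tr, te in settings])
    return np.sum((predicted - observed) ** 2)

result = minimize(loss, x0=[0.5, 500.0, 50.0], method="Nelder-Mead")
print(result.x)  # recovered (pd, t1, t2), hopefully close to true_params
```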