Tuesday, December 13, 2011

Synthetic Training Data

One of the recent breakthroughs in computer vision was the use of synthetic data to train effective recognition systems. The most remarkable example is the work of Shotton et al. from Microsoft Research Cambridge, who trained the Kinect human pose recognition system on more than a million synthetic images, generated by rendering fake humans with 3D software.

That's possible mainly because the computer graphics industry has long been working on generating realistic-looking 3D humans for film productions and games. Clothing materials, hair, body shapes and so on are easily simulated in 3D. So generating 3D human poses is a well-established and well-understood procedure, and it makes sense to generate synthetic data for this kind of problem.

What about medical images? Can we generate synthetic images simulating various medical conditions? It sounds a bit weird, right? Well, during SIPAIM 2011 in Bucaramanga, Juan Antonio, one of the invited speakers from Spain, gave a tutorial on how medical images are captured and generated, especially with Magnetic Resonance Imaging (MRI). MRI is also a very well-understood physical phenomenon that could be "simulated" using 3D software. Perhaps it's not a conventional 3D rendering procedure, because the intensities observed in an MRI scan are not responses to light, as in conventional photography, but to magnetic resonance, as the name suggests.

Anyway, we could simulate various tissues and their responses to different machine configurations (the strength of the magnetic field, for instance) and render quite good MRI scans by playing with tissue parameters. Then we could produce a really large dataset of images with precise pixel-level labels, useful for training a recognition system for medical images. It could eventually work for X-rays and other imaging modalities.
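
To make the idea concrete, here is a minimal sketch in Python. It assumes the idealized spin-echo signal equation, S = PD * (1 - exp(-TR/T1)) * exp(-TE/T2), and a made-up circular "phantom" with three tissues. The tissue values, the function names and the phantom layout are illustrative assumptions of mine, not taken from the tutorial or from any real MRI simulator.

```python
import math
import random

# Illustrative tissue parameters: (name, proton density, T1 in ms, T2 in ms).
# These are rough, assumed values for the sketch, not measured data.
TISSUES = {
    0: ("background", 0.0, 1.0, 1.0),
    1: ("white matter", 0.70, 700.0, 80.0),
    2: ("gray matter", 0.80, 1000.0, 100.0),
    3: ("csf", 1.00, 4000.0, 2000.0),
}


def spin_echo_signal(pd, t1, t2, tr, te):
    """Idealized spin-echo intensity for one tissue and one machine setting."""
    return pd * (1.0 - math.exp(-tr / t1)) * math.exp(-te / t2)


def make_label_map(size, rng):
    """Hypothetical phantom: concentric rings of tissue with a random CSF core."""
    center = (size - 1) / 2.0
    core = rng.uniform(0.15, 0.30)  # random size of the central CSF region
    labels = []
    for y in range(size):
        row = []
        for x in range(size):
            r = math.hypot(x - center, y - center) / (size / 2.0)
            if r > 0.9:
                row.append(0)  # background
            elif r > 0.7:
                row.append(2)  # gray matter ring
            elif r > core:
                row.append(1)  # white matter
            else:
                row.append(3)  # CSF core
        labels.append(row)
    return labels


def render(labels, tr, te, noise_sd, rng):
    """Turn a label map into a noisy image for repetition time TR and echo time TE."""
    image = []
    for row in labels:
        out = []
        for label in row:
            _, pd, t1, t2 = TISSUES[label]
            out.append(spin_echo_signal(pd, t1, t2, tr, te) + rng.gauss(0.0, noise_sd))
        image.append(out)
    return image


rng = random.Random(0)
labels = make_label_map(64, rng)                                 # pixel-level ground truth
t1_weighted = render(labels, tr=500.0, te=15.0, noise_sd=0.01, rng=rng)
t2_weighted = render(labels, tr=4000.0, te=100.0, noise_sd=0.01, rng=rng)
```

Each call to make_label_map followed by render gives one (image, labels) training pair, and changing TR and TE changes the contrast between tissues without touching the labels. A realistic simulator would of course need real anatomy, field inhomogeneities and proper noise models.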

On the other hand, I wonder: if we can simulate that data, why can't we use the simulation function directly inside a learning algorithm? In other words, instead of generating data to train a learning algorithm, the recognition system could take that simulation function as prior knowledge to make more effective predictions... does it make sense?
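
One possible reading of this question is "analysis by synthesis": keep the simulator inside the predictor and search for the tissue parameters that best reproduce what was observed. The toy sketch below does that for a single pixel, assuming a simple mono-exponential T2 decay, S(TE) = PD * exp(-TE/T2). The function names, the echo times and the candidate grid are hypothetical choices for illustration only.

```python
import math


def simulate(pd, t2, echo_times):
    """Forward model: signal at each echo time under mono-exponential T2 decay."""
    return [pd * math.exp(-te / t2) for te in echo_times]


def fit_t2(observed, echo_times, candidates):
    """Pick the candidate T2 whose simulated signal best matches the observation.

    For each candidate T2, the best proton density has a closed-form
    least-squares solution, so only T2 needs to be searched.
    """
    best = None
    for t2 in candidates:
        basis = [math.exp(-te / t2) for te in echo_times]
        pd = sum(o * b for o, b in zip(observed, basis)) / sum(b * b for b in basis)
        err = sum((o - pd * b) ** 2 for o, b in zip(observed, basis))
        if best is None or err < best[0]:
            best = (err, t2, pd)
    return best[1], best[2]


echo_times = [20.0, 40.0, 80.0, 160.0]            # ms, assumed acquisition settings
observed = simulate(0.8, 100.0, echo_times)       # stand-in for a measured pixel
t2_hat, pd_hat = fit_t2(observed, echo_times, range(10, 301, 5))
print(t2_hat, round(pd_hat, 3))
```

Here no training set is generated at all: the simulation function is the prior knowledge, and prediction is a search over its parameters. Whether this scales to whole images and real scanners is exactly the open question.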

1 comment:

  1. Synthetic training data is a hot topic nowadays and an important component of the large-scale ML wave. Ng's ML course has an interesting lecture (http://www.ml-class.org/course/video/preview_list, Sect XVIII) that shows an example of synthetic data generation for a photo OCR application. In this context, Juanca's question of how to extend this to medical image analysis is very valid. For radiological images (e.g. MRI) it seems feasible (this is an example: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2430606/); for histopathological images it will take more effort and, frankly, I cannot imagine how to approach it.

    The final question is a general one, and it totally makes sense. This is basically the problem of how to integrate into the learning process not only the synthetic training data but also the process that produces it. I'm not sure whether it has already been addressed, but if the trend continues it will become more relevant. This reminds me of some work I did in anomaly detection [1], where we used 'synthetically' generated anomalies to train an anomaly detection system. This heuristic was later formalized in [2], where the authors show "a support vector machine (SVM) for anomaly detection for which we can easily establish universal consistency".

    [1] F. González and D. Dasgupta. Anomaly detection using real-valued negative selection. Genetic Programming and Evolvable Machines, 4:383–403, 2003. http://www.springerlink.com/content/v6p6614j70211716/

    [2] I. Steinwart, D. Hush, and C. Scovel. A classification framework for anomaly detection. Journal of Machine Learning Research, 6:211–232, 2005. http://jmlr.csail.mit.edu/papers/volume6/steinwart05a/steinwart05a.pdf
