Friday, September 2, 2011

Literature review (part 5)

Six new summaries for you today. We've got three hair papers, two supervised multi-class segmentation papers, and one paper that converts images to sketches and vice versa.

"FREQUENTIAL AND COLOR ANALYSIS FOR HAIR MASK SEGMENTATION" by Rousset and Coulon: The task is hair segmentation (determine which pixels are hair). Images are frontal-facing, and face detection is used to find the bounding box. Images are restricted to women so they don't have to deal with facial hair. The idea is to find some individual pixels which are definitely hair, and use alpha matting to get the rest of the hair pixels.

To find the definitely-hair pixels (the seeds), they use a combination of "frequential" (texture) and color information. They model the color of hair pixels with a single Gaussian, and threshold at a particular density to determine the pixels which have the right color (color mask). It's not clear to me exactly how they use the texture information, but they come up with a texture mask. The AND of the color and texture masks gives the pixels which are definitely hair. Because the matting technique they're using (Levin's, described in the last post) requires seed pixels for the background, they select those pixels which are (NOT TEXTURE) OR (NOT HEAD), where HEAD is true for the pixels in the automatically detected bounding box around the face.
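A minimal sketch of the seed-selection step, assuming the single-Gaussian color model the paper describes and a precomputed binary texture mask; the threshold and the masks here are illustrative placeholders, not the authors' actual settings.

```python
import numpy as np

def color_mask(pixels, mean, cov, density_thresh):
    """Mark pixels whose density under the hair color Gaussian exceeds
    a threshold. pixels: (N, 3) array of colors."""
    d = pixels - mean
    cov_inv = np.linalg.inv(cov)
    # Mahalanobis term of the Gaussian density
    maha = np.einsum('ni,ij,nj->n', d, cov_inv, d)
    norm = np.sqrt((2 * np.pi) ** 3 * np.linalg.det(cov))
    density = np.exp(-0.5 * maha) / norm
    return density > density_thresh

def hair_and_background_seeds(color_m, texture_m, head_m):
    """Foreground (hair) seeds: COLOR AND TEXTURE.
    Background seeds: (NOT TEXTURE) OR (NOT HEAD)."""
    fg = color_m & texture_m
    bg = (~texture_m) | (~head_m)
    return fg, bg
```

The two boolean masks then feed Levin's matting as the known foreground/background constraints.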

They get what looks like reasonable precision and recall.

"Detection and Analysis of Hair" by Yacoob and Davis: This is probably the first computer vision paper to talk about leveraging hair recognition as part of a larger recognition system. Their idea is to enumerate a bunch of hair attributes and feed the values of those attributes for a particular image to a classifier, along with the internal face information. The attributes they measure are: 1) hair color, characterizing individual strands as cylindrical Lambertian surfaces and inferring the color-related physical parameters of the hair; 2) hair parts (as in "you part your hair with a comb"): whether they occur, and if so, where; 3) hair width (how much it sticks up off the head); 4) hair length; and 5) several more: surface area covered by hair, hair symmetry, inner and outer hairlines, and hair texture (using Gabor wavelets). They find that these attributes provide useful additional information not exploited by the face recognition system they use, and that by combining internal face information and hair information, they can boost performance.

"A novel two-tier Bayesian based method for hair segmentation" by Wang et al.: A hair segmentation paper. The idea is that hair color can be pretty consistent within an image, but vary greatly between images, due to, e.g., lighting. When a test image comes in, all pixels are scored for probability of being hair using location and color (color density is modeled with a Gaussian mixture model (GMM)). The image is then oversegmented, and if a fragment has a high total hair weight, everything inside it is labeled as hair. All the pixels from all these hair fragments are used to train a new GMM which captures hair color _in this image_. The new color GMM and the original location prior are used to classify the rest of the fragments, resulting in the final segmentation.
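A toy sketch of the two-tier structure, with a single diagonal Gaussian standing in for the paper's GMMs and a made-up location prior; the thresholds are illustrative. The point is the control flow: score with the generic model plus location, relabel confident fragments, refit on them, reclassify the rest.

```python
import numpy as np

def fit_gaussian(x):
    """Per-channel diagonal Gaussian; stand-in for the paper's GMM."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def log_density(x, mean, var):
    return (-0.5 * ((x - mean) ** 2 / var + np.log(2 * np.pi * var))).sum(axis=1)

def two_tier_segment(fragments, positions, generic_mean, generic_var,
                     location_logprior, t1_thresh, t2_thresh):
    """fragments: list of (N_i, C) pixel-color arrays from an oversegmentation.
    Tier 1: label fragments scoring high under the generic color model
    plus a location prior. Tier 2: refit the color model on tier-1 hair
    pixels (hair color *in this image*) and reclassify the rest."""
    tier1 = [log_density(f, generic_mean, generic_var).mean()
             + location_logprior(p) > t1_thresh
             for f, p in zip(fragments, positions)]
    hair_pixels = np.concatenate([f for f, h in zip(fragments, tier1) if h])
    mean, var = fit_gaussian(hair_pixels)  # image-specific color model
    return [h or log_density(f, mean, var).mean() > t2_thresh
            for f, h in zip(fragments, tier1)]
```

A fragment whose color matches the tier-1 hair but which sits outside the high-prior location (long hair over the shoulders, say) gets picked up in tier 2.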

The main idea here is to adapt the classifier at test time based on information found in the query.

"Inducing semantic segmentation from an example" by Schnitman et al.: Task is supervised multi-label segmentation. The authors assume training and test images come from the same domain. They mention an application: a photoshoot of a model, where the director wants to change the color of the worn garment across all photos post-hoc.

They oversegment their test image, and from each fragment in the test image they extract several local patches. They compute the distance of the fragment to each of the labels as the median of the Euclidean distances from each local patch to the nearest patch in that label's data; taking the median gives them some robustness to noise. The per-label costs for each fragment, along with a cost for neighboring fragments having different labels, are combined into an optimization problem which is solved by graph cuts using alpha-expansion, which allows graph cuts to handle more than two labels.
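The fragment-to-label cost can be sketched in a few lines, assuming patches have been flattened into fixed-length vectors; `label_patches` here stands for patches harvested from the labeled example image.

```python
import numpy as np

def fragment_label_cost(fragment_patches, label_patches):
    """Cost of assigning a label to a fragment: the median, over the
    fragment's patches, of the Euclidean distance to the nearest patch
    of that label. The median gives robustness to outlier patches."""
    # pairwise distances, shape (n_fragment_patches, n_label_patches)
    diffs = fragment_patches[:, None, :] - label_patches[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    nearest = dists.min(axis=1)  # each patch's nearest label patch
    return np.median(nearest)
```

These per-label costs become the unary terms of the graph-cut problem; the pairwise terms penalize label disagreement between neighboring fragments.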


"TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context" by Shotton et al.: The task is supervised segmentation. In the authors' own words, there are three contributions: 1) a novel type of feature, the texture-layout filter, 2) a new discriminative model that combines texture-layout filters with lower-level features, 3) techniques for making the training scalable to larger datasets.

A texture-layout filter is built with textons. Each pixel in each image is assigned a texton. A texture-layout filter is a pair (r, t) for a region r and a texton t. Its value is proportional to the number of pixels that are assigned texton t in the region.
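A sketch of evaluating a texture-layout filter (r, t), assuming a texton map is already given. In the paper the region r is offset relative to the pixel being classified; here I use absolute coordinates for simplicity. One integral image per texton makes each response a constant-time lookup.

```python
import numpy as np

def texton_integral_images(texton_map, n_textons):
    """One integral image per texton; entry [t, y, x] counts the pixels
    assigned texton t in the rectangle [0:y, 0:x]."""
    h, w = texton_map.shape
    ii = np.zeros((n_textons, h + 1, w + 1))
    for t in range(n_textons):
        ii[t, 1:, 1:] = (texton_map == t).cumsum(axis=0).cumsum(axis=1)
    return ii

def texture_layout_response(ii, t, y0, x0, y1, x1):
    """Fraction of pixels in the region [y0:y1, x0:x1] assigned texton t."""
    count = ii[t, y1, x1] - ii[t, y0, x1] - ii[t, y1, x0] + ii[t, y0, x0]
    return count / ((y1 - y0) * (x1 - x0))
```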


The model they use is a conditional random field that incorporates: 1) texture-layout features, 2) colors (modeling color probability using a Gaussian mixture model), 3) location information (sky tends to be at the top of the image if anywhere), 4) edge potentials which penalize changing from one class to another, except where there is an image gradient.
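The edge potential in item 4 is the standard contrast-sensitive pairwise term; a sketch, with illustrative (not the paper's) parameter values:

```python
import numpy as np

def edge_potential(label_i, label_j, grad_sq, theta=1.0, beta=0.5):
    """Pairwise cost for neighboring pixels i, j: zero if the labels
    agree, otherwise a penalty that decays with the squared image
    gradient between them, so class boundaries prefer to sit on edges."""
    if label_i == label_j:
        return 0.0
    return theta * np.exp(-beta * grad_sq)
```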


"Face photo-sketch synthesis and recognition" by Wang and Tang: The idea in this paper is to find a mapping between face photos and sketches. A potential application is law enforcement, as a sketch could be used to search a database of photos.


The mapping is found similarly to Freeman's super-resolution work, where low-res patches are identified with high-res patches. Here, instead, sketch patches are matched with image patches. But there are some differences: 1) unlike Freeman's single-scale MRF, the MRF used here is multi-scale, with the prior for a region coming not just from nearby small regions but also from far-away, larger regions. 2) they assume the faces are aligned, so when matching patches, they can spatially constrain where the matched patches come from. For example, if I know I'm over the left ear, I will only retrieve database patches that were also extracted from the left ear. 3) there is not one patch per pixel; instead the patches form a grid, with small overlaps between the cells. The MRF error comes from the error between the overlapping pixels, and the optimized variable is which of K patch candidates to use for a particular patch location (the K candidates come from the K nearest neighbors of the original image patch). Even after the MRF optimization, there may be disagreements between the pixels, so the boundaries between patches are computed using "dynamic programming" (sounds like min cut).
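The overlap term in point 3 can be sketched simply, assuming grayscale patches on a grid with a few pixels of horizontal overlap (patch size and overlap width here are illustrative):

```python
import numpy as np

def overlap_cost(left_patch, right_patch, overlap):
    """Compatibility of two horizontally neighboring patch candidates:
    the SSD between the right strip of the left patch and the left
    strip of the right patch, over their shared columns."""
    a = left_patch[:, -overlap:]
    b = right_patch[:, :overlap]
    return ((a - b) ** 2).sum()
```

The MRF then picks, per grid cell, one of the K candidate patches so that the summed overlap costs (plus the data term) are minimized.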

One experiment they ran matched sketches against face databases. They got better performance synthesizing a face photo from the sketch and matching to photos directly than converting the whole database to sketches and matching in sketch space. The former approach lets them use existing face recognition techniques.

Somehow their system inferred skin tone based on the sketch. In other words, they get a grayscale sketch, and their system somehow figures out the skin tone should be Asian / Caucasian / etc. The system must be relying on facial shape and features to infer skin tone, which is amazing.

