Wednesday, September 14, 2011

Literature review (part 13)

Today we come to the end of the third phase of my literature review (you didn't know there were phases, did you?). Hopefully we're quite near the end.

Today we have a quick citation and two works using SIFT for ear recognition.


"Statistical shape influence in geodesic active contours" by Leventon et al.: Presents a method of incorporating shape into the image segmentation process. Not sure why this paper was on my reading list.

"SIFT-based ear recognition by fusion of detected keypoints from color similarity slice regions" by Kisku et al.: The goal is to provide ear recognition despite pose variation and occlusion. They do this by detecting color "level-sets" of ear images and applying SIFT to those level sets. The level sets are in fact ranges of colors that they cluster together using a GMM.

In more detail: they fit GMMs to pixel colors of ear images in order to find bands of ear colors. For each band, they segment a new image of an ear into pixels which fall into the band (black) and everything else (white). Segmentations produced in this manner have characteristic ear-like shapes. SIFT features are extracted from these segmentations and all concatenated together to form one ear descriptor. Then they match descriptors.
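As a sketch (my reconstruction, assuming sklearn's GaussianMixture and OpenCV's SIFT; the band count and the pooling of descriptors are guesses, not details from the paper):

```python
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def band_descriptors(ear_bgr, n_bands=4):
    """Cluster pixel colors into bands with a GMM, then run SIFT on
    each band's binary segmentation and pool the descriptors."""
    pixels = ear_bgr.reshape(-1, 3).astype(np.float64)
    gmm = GaussianMixture(n_components=n_bands, random_state=0).fit(pixels)
    bands = gmm.predict(pixels).reshape(ear_bgr.shape[:2])

    sift = cv2.SIFT_create()
    descriptors = []
    for b in range(n_bands):
        # Black = pixels in this color band, white = everything else.
        seg = np.where(bands == b, 0, 255).astype(np.uint8)
        _, desc = sift.detectAndCompute(seg, None)
        if desc is not None:
            descriptors.append(desc)
    return np.vstack(descriptors)  # one pooled set of ear descriptors
```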

"Ear identification by fusion of segmented slice regions using invariant features: an experimental manifold with dual-fusion approach" by Kisku et al.: This appears to be similar to their previous paper, though they introduce another feature inspired by Dempster-Shafer theory.

Tuesday, September 13, 2011

Literature review (part 12)

Today we have an ear detection paper and a paper which does eyebrow segmentation.

"Fast and Fully Automatic Ear Detection Using Cascaded AdaBoost" by Islam et al.: Used AdaBoost with custom weak learners to detect ears. The weak learners use custom features, including a center-surround feature used to detect ear pits. To improve efficiency, they actually use a cascade of AdaBoost classifiers, where the early classifiers should be quick to evaluate.
       
They get essentially perfect detection performance.

"Facial features segmentation by model-based snakes" by Radeva: Goal is segmentation of facial features, including eyes, eyebrows, and mouth using aligned images. They find the eyes using template matching. They then project the image along the x-axis, and look for a valley in the grayscale values indicating the eyebrow location (eyebrows are dark); this determines the eyebrow y location. To determine the x location, they project along the y-axis and use the fact that eyebrows are above eyes and both are dark, again looking for a valley in the grayscale values.
       
They use the approximate eyebrow locations to initialize an active-contour-like technique called a "rubber snake". The final position of the rubber snake provides the segmentation.

Monday, September 12, 2011

Literature review (part 11)

These are some "quick citations"; I skimmed these papers to get a quick gist of what they do.

"Estimation of the chin and cheek contours for precise face model adaptation" by Kampmann: The goal is to determine the chin and cheek contours.

They assume they know where the eyes and mouth are, and use this to find the probable location of the chin. They then use local gradient information to find three points on the chin (high gradient means likely chin boundary). They then fit a curve to these three points and call it the chin boundary.
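One plausible reading of "fit a curve to these three points" is a parabola, which np.polyfit recovers exactly from three points (the paper's actual curve family may differ):

```python
import numpy as np

# Three detected chin points (x, y), found where the gradient is high
# (hypothetical coordinates).
pts = np.array([(40.0, 118.0), (64.0, 130.0), (88.0, 117.0)])

# Degree-2 polyfit through exactly three points is an exact fit.
coeffs = np.polyfit(pts[:, 0], pts[:, 1], deg=2)
chin_y = np.poly1d(coeffs)  # evaluate the chin contour at any x
```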

To determine cheek boundaries (the boundary above and to either side of the chin), they use a similar technique.

They appear to use only frontal-facing images.

"Ear Biometrics in Computer Vision" by Burge and Burger: Model ears as a graph built from a Voronoi diagram of ear segments. Use a novel graph distance function.

"Force Field Feature Extraction for Ear Biometrics" by Hurley: This is a thesis. Pixels in the the image of an ear are treated like point masses, and the gravitational field is somehow used to construct a descriptor.

Literature review (part 10)

Three papers today: head segmentation, review paper on ear biometrics, and a cute one on image-based shaving.

"Segmentation of a head into face, ears, neck and hair for knowledge-based analysis-synthesis coding of videophone sequences" by Kampmann: Task is to segment a video stream into face ears, neck, and hair regions. A motivating idea is smart videoconferencing compression, as regions like the face should be transferred at a higher fidelity than other regions.
       
They assume they start with eyes and mouth center positions as well as chin and cheek contours. In a new image, they find areas with likely skin tone using the eyes and mouth locations, and they use this tone to segment the image between skin and background. Using the chin contour, they break the skin pixels into face, neck, and ears regions.
       
To find hair, they assume they have a segmentation of the head from the background and they subtract the skin pixels.

"The ear as a biometric" by Hurley et al.: This is an ear biometrics review paper.

The paper makes the point that there is significant variability in ear shape between people, and that the ear: 1) changes little with age, 2) doesn't change with facial expression, 3) unlike fingerprints, raises no hygiene issues, 4) unlike iris or retina scanning, provokes no user fear of harm, and 5) is not affected by makeup or obscured by facial hair, though it can be changed with jewelry or obscured by head hair.

Burge and Burger demonstrated the potential for the ear as a biometric and also used computer vision to recognize ears. They segmented the ear using Canny edges and compared two such segmentations by comparing the Voronoi diagrams of the segmentations using a novel distance measure. They identify occlusion by hair as a major obstacle.

A number of authors have used principal components analysis on the raw pixels of registered images as a preprocessing step when recognizing ears.

They appear to discuss only approaches for registered ears.

"Image-based Shaving" by Nguyen et al.: This is a graphics paper discussing automatic removal of beards in images. They express a given image in terms of beard and non-beard PCA components.
       
Given a set of non-beard images, a naive approach is to find the PCA components accounting for most of the energy, and when a bearded image comes in, express it in terms of the non-beard principal components. However, as beards tend to be spatially large and significantly different from skin pixels, the reconstruction process tries to reconstruct the beard and produces poor results. As a first-pass fix, they use a robust estimator, where pixel mismatch incurs an asymptotically L1 error instead of the L2 error of regular PCA. This improves the results, but can also produce overly smooth reconstructions.
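A minimal IRLS sketch of that robust projection, with Huber-style weights standing in for whatever estimator they actually use:

```python
import numpy as np

def robust_project(x, U, mu, n_iters=10, delta=0.1):
    """Project image x onto PCA basis U (columns) about mean mu,
    downweighting badly mismatched pixels (e.g. beard) via iteratively
    reweighted least squares. With w = 1 everywhere this reduces to
    ordinary L2 projection."""
    w = np.ones_like(x)
    for _ in range(n_iters):
        # Weighted least squares: minimize sum_i w_i * (x - mu - U c)_i^2
        Uw = U * w[:, None]
        c = np.linalg.solve(U.T @ Uw, Uw.T @ (x - mu))
        r = x - mu - U @ c  # per-pixel residual
        # Huber-style weights: quadratic near zero, asymptotically L1.
        w = np.where(np.abs(r) <= delta, 1.0, delta / np.abs(r))
    return mu + U @ c  # the "shaved" reconstruction
```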
       
As a further refinement, they estimate the beard space directly. To do this, they take a dataset of bearded faces, and for each bearded face, they compute the difference between the bearded face and the automatically shaved face produced by the previous method. This set of difference vectors spans the beard space. A new bearded image can then be expressed with the standard PCA error (L2), using beard and non-beard principal components, and the beard coefficients can be reweighted to delete the beard or even make it more pronounced.
       
As yet a further refinement, they exploit the spatial locality of beards. When determining the beard subspace by considering the difference images, they use an MRF to segment beard pixels (generally pixels with a large difference) from non-beard pixels. They then zero out the entries in the difference vectors that are not beard pixels.
       
With their technique, they can also perform beard addition, though they must blend the beard and face layers as a postprocessing step to assure an even transition from beard to face.
       
They use hand-registered data.
       
They have preliminary results for automatic glasses removal, but they don't discuss them.

Wednesday, September 7, 2011

Literature review (part 9)

Took the wrong train today, so had extra time to read. Today we have ears, eyebrows, and head detection. I think the girl sitting next to me on the train was wondering why I was looking at arrays of photos of ears.


"On Model-Based Analysis of Ear Biometrics" by Arbab-Zavar et al.: The task is biometrics using the ear, claiming that ears are distinctive. Their dataset is a subset of the XM2VTS dataset, sides of heads with the ears clearly visible, but not aligned. 

They describe their model as a constellation model. They train it by taking one image from each of the 63 subjects and manually cropping to the ear. They run SIFT on all these images and cluster the resulting descriptors, using information from the descriptor space as well as the location space; they can use location space information because the ears are roughly registered.

Given a test image, they find the elliptical shape of the ear in the test image and crop to the ear. Then they extract SIFT features. The final match score is the cost of matching all of the cluster centers to the SIFT descriptors in the test image.

They do some occlusion tests by synthetically occluding images of ears (though the enrollment ears remain fully visible). Their performance drops with occlusion, but they find they do better than a competing method based on PCA.

"Robust 2D ear registration and recognition based on sift point matching" by Bustard and Nixon: Registers ears using SIFT (assuming the ear is planar). With SIFT for registration and a simple pixel-to-pixel distance, they get results equal to manual registration with PCA.

In their method, ears in the gallery are masked into ear and non-ear pixels. Then SIFT features are extracted from the ear regions in the gallery. For a query image, SIFT features are extracted and matched against the gallery via RANSAC. Based on language in the experiments section, the matching appears to be image-to-image. A score for each gallery image is computed by warping the query by the estimated homography and taking squared pixel error. The error is made robust to occlusion by capping the error any one pixel can contribute. Additionally, before comparison the images are normalized to minimize the effects of lighting.
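A sketch of the pipeline with OpenCV, assuming grayscale images; the ratio test and the cap value are my choices, and the lighting normalization is omitted:

```python
import cv2
import numpy as np

def match_score(query, gallery, cap=40.0 ** 2):
    """Register query to gallery with SIFT + RANSAC homography, then
    score by occlusion-robust (capped) per-pixel squared error."""
    sift = cv2.SIFT_create()
    kq, dq = sift.detectAndCompute(query, None)
    kg, dg = sift.detectAndCompute(gallery, None)
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(dq, dg, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    src = np.float32([kq[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kg[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    warped = cv2.warpPerspective(query, H, gallery.shape[::-1])
    err = (warped.astype(np.float64) - gallery) ** 2
    return np.minimum(err, cap).mean()  # cap limits any one pixel's vote
```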

They test with the XM2VTS dataset as well as a dataset of their own devising, showing decent results even with occlusion and viewpoint variation.

They should have given results for manual registration and their image distance, to prove that for perfect registration, their distance is better than PCA. Then they could have made two contributions.

Contains a good review section.

"A method for estimating and accurately extracting the eyebrow in human face image" by Chen and Cham: They extract eyebrow contours using a k-means technique, which appears to be clustering of image patches. They use the initial estimate of the eyebrow vs non-eyebrow pixels to initialze a snake (contour fitting) which just uses image gradients.

Once they've segmented the eyebrows, they characterize the curves of the top halves of the eyebrows and compare the characterizations. On the Olivetti face dataset, they achieve 87% accuracy.

"Active shape models-their training and application" by Cootes et al.: The desire is to capture the shapes of objects while allowing some variability in shape. Unlike active contour models, active shape models parameterize the ways the shape can change, so that only class-specific deformations are allowed.

"Feature detection and tracking with constrained local models" by Cristinacce: Like the active appearance model, uses shape and texture to to parameterize a novel image. However, this work cites active appearance models and claims to have improved localization accuracy.

"An accurate algorithm for head detection based on XYZ and HSV hair and skin color models" by Gunes and Piccardi: The goal is head segmentation from background, including even back-of-head segmentation.

They learn a Gaussian mixture model for hair color, representing color in both XYZ and HSV space. Colors with sufficient density under the GMM are considered to be hair colors. Interestingly, they seem to make an error by adding probabilities in equation (3) instead of multiplying them.

They learn a similar model for skin color, but before feeding skin pixels to their model, they first filter them with the technique of Hsu et al. A pixel is estimated as a head pixel if it is a hair or skin pixel. So no naked people. Given the detected head pixels in an image, they perform morphological closure to fill in gaps and fit an ellipse to the resulting shape.
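The last two steps are standard OpenCV operations; a sketch, with a kernel size of my choosing:

```python
import cv2
import numpy as np

def head_ellipse(head_mask):
    """Close small gaps in a binary hair-or-skin mask, then fit an
    ellipse to the largest resulting blob."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15))
    closed = cv2.morphologyEx(head_mask, cv2.MORPH_CLOSE, kernel)
    contours, _ = cv2.findContours(closed, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)
    return cv2.fitEllipse(largest)  # ((cx, cy), (w, h), angle)
```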

Unfortunately, they do not talk about recognition; this is a detection-only paper.

"Eyes and eyebrows parametric models for automatic segmentation" by Hammal and Caplier: They work with video data, and assume the face in the first frame of each video has been detected. They then track faces using block matching.

They detect irises, eyes, and eyebrows, but I focused on eyebrows. Assuming they have a rectangle in which the eyebrow lies, they detect the x locations of the eyebrow endpoints by looking at the zero crossings of the first derivative of the vertical projection of the rectangle. They detect the shared y location by looking at the maximum of the horizontal projection of the rectangle. They fit a Bezier curve through the two points as well as the point in between them. They refine the curve by adjusting it locally to maximize the flow of luminance gradient through the curve.
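The Bezier step is easy to make concrete: solving B(0.5) = m for the middle control point gives a quadratic Bezier through all three detected points. A sketch:

```python
import numpy as np

def bezier_through(p0, m, p2, n=50):
    """Quadratic Bezier passing through endpoints p0, p2 and the
    middle sample m. From B(0.5) = 0.25*p0 + 0.5*p1 + 0.25*p2 = m,
    the control point is p1 = 2*m - (p0 + p2) / 2."""
    p0, m, p2 = map(np.asarray, (p0, m, p2))
    p1 = 2 * m - (p0 + p2) / 2
    t = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2
```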


New directions

I think I've mined everything useful in recognition from hair, hair matting, matting generally, and the cogsci angle on why external features are potentially useful. I hope to look more at ears, jawline, head shape, eyebrows, and automatic shaving. The last is because hair (head, eyebrows, face) as a biometric can be spoofed, and being able to identify and eliminate hair could make recognition systems robust to that spoofing.

Monday, September 5, 2011

Literature review (part 8)

"The role of eyebrows in face recognition" by Sadr et al.: They perform a lesion study, removing either eyes or eyebrows from images of celebrity faces where the task is to identify the celebrity. They find recognition performance for humans is worse with no eyebrows than it is with no eyes.

They speculate as to why eyebrows may be important to recognition: 1) since eyebrows are used to convey emotion, humans may naturally attend to them more, giving them higher weight in recognition; 2) eyebrows are large and high-contrast, so they remain stable across lighting changes and image degradations, making them good low-noise information sources.

The paper contains some interesting references:

"Given that the brow ridge may have been an important, sexually distinctive characteristic of our early ancestors' faces, it is not surprising that recent studies have found an important role for eyebrow thickness in discriminating between male and female faces (Bruce et al 1993)" pg 286
This suggests eyebrows as useful for gender discrimination with computer vision. If we consider eyebrows external features, determining gender is half the battle.

"In fact, research in the latter field has shown the eyebrows to be, in a kinematic sense, the most expressive part of the face (Linstrom et al 2000) and, we would suggest, a facial feature whose gestures would be easily recognized at a distance." pg 292
Just reinforcing the idea that eyebrows are good for low-visibility recognition.

"New appearance models for natural image matting" by Singaraju et al.: The task is image alpha matting using sparse trimaps. They build on Levin et al.'s work, but address a failure case of that work, where the model could overfit when the foreground and background layers are locally linear. Like Levin's, their solution is closed-form. They compare to Levin, and claim to get better matting results.

"Estimation of Alpha Mattes For Multiple Image Layers" by Singaraju and Vidal: Alpha matting typically assumes two layers, a foreground and a background. This work addresses (sparsely initialized) alpha matting for 2 or more layers. They use as an example two toy trolls standing next to each other, with overlapping hair.

The technique they propose is closed-form, but it does not necessarily produce alpha values which fall in the unit interval, breaking the traditional probabilistic interpretation of matting. When they add the constraint that the alpha values fall in the unit interval, they lose the closed form solution. However, the optimum value can still be obtained by solving a quadratic program.
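A minimal sketch of that last step for a single layer, using a generic bound-constrained solver on a dense matting Laplacian; the seed constraints and the multi-layer sum-to-one constraint are omitted, and real implementations would use sparse solvers:

```python
import numpy as np
from scipy.optimize import minimize

def solve_alpha(L, alpha0):
    """Minimize the matting cost a^T L a subject to 0 <= a <= 1 for one
    layer. L is the (dense, PSD) matting Laplacian; multi-layer matting
    additionally requires the per-pixel alphas to sum to one."""
    n = L.shape[0]
    res = minimize(
        fun=lambda a: a @ L @ a,
        jac=lambda a: 2.0 * (L @ a),
        x0=alpha0,
        method='L-BFGS-B',
        bounds=[(0.0, 1.0)] * n,  # the unit-interval constraint
    )
    return res.x
```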

Sunday, September 4, 2011

Literature review (part 7)


"Perceptual Expertise Bridging Brain and Behavior Oxford Series in Visual Cognition" by Gauthier et al.: Gary suggested I check this book out, and I found some references which might be useful when talking about the cogsci aspect of external features. The following are quotations from the book:

"Adults and children alike find it easier to recognize an unfamiliar face based only on its external features than only on its internal features (Want, Pascalis, Coleman, & Blades, 2003)" pg 78
I already had a reference for this tidbit, but now I have another. Also, they might define "external feature", which would be useful.

"Thus, adults are more sensitive to featural changes than they are to spacing changes that cover most of the natural variability among faces in the real world but stay within normal limits (Farkas, 1981), a result leading to the conclusion that adults are adept at using featural differences in recognizing facial identity." pg 81
A featural change is something like swapping out the eyes on one face for the eyes on another. The result is adults are more sensitive to a "unit" of change to the features than to a unit of change to the configuration of the features. This is the same sort of thing to which a local descriptor like SIFT would be sensitive.

"Nonetheless, under some conditions features may not provide a reliable cue: when the face is seen from a new point of view, when the person poses a new facial expression, under poor lighting conditions, and after many years of aging. Under these conditions the appearance of individual features changes, and adults may need to rely on the spacing among facial features that comes from the bone structure of the face. It is not surprising then that adults are exquisitely sensitive to the spacing of facial features (Freire et al., 2000; Mondloch et al., 2002) and that limits in this sensitivity correspond to limits in their visual acuity (Ge, Luo, Nishiura, & Lee, 2003; Haig, 1984)." pg 81
The last bit here is potentially useful, as it suggests it is natural to use configuration in the case of low quality images. If this is true, it's not such a stretch to think it's useful to use larger, external features in the same case.

"Transferable Belief Model for hair mask segmentation" by Rousset et al.: Similar to previous work, in which local texture and color masks are used to determine seed pixels for an alpha matting algorithm. The difference here is the use of a transferable belief model, which is a theory of probability in which the user can explicitly encode lack of knowledge, leading to more conservative posterior beliefs. They update their texture and color detectors so they can output "uncertain" as well as "hair" or "not hair". Instead of doing an AND to get the hair seed, as they did in their previous paper, they use Dempster's rule of combination, an artifact of transferable belief models.

They run some experiments with frontal face shots of women (apparently the same dataset as before), comparing their old method with the current one, and the new one does better. This isn't surprising, as the old one had to make hard (binary) decisions for every pixel. Even a small number of mislabeled pixels could ruin the matte, so it's better to err on the conservative side. The transferable belief model makes it easy to be conservative.

Saturday, September 3, 2011

Literature review (part 6)

We have a grab-bag today. A paper on co-segmentation, which is the task of finding the same thing in two images, a quite complicated hair segmentation paper from industry, and a new graphics paper on beard transfer.

"An efficient algorithm for co-segmentation" by Hochbaum and Singh: In co-segmentation, two images are presented, where an object co-occurs in the two images. The task is to segment the co-occurring object in each of the images, and as a byproduct determine which pixels belong to the co-occurring object in each image.


This paper draws inspiration from a previous approach to the same problem [Rother 2006], in which the error function combines MRFs for each of the two images (the segmentation part of the error) along with a histogram agreement term. The histogram agreement term considers the color histograms in the foregrounds of each of the two images, and encourages them to be similar (if the foregrounds between the two images are exactly the same, the color histograms will be identical). Apparently for useful histogram similarity functions, the entire optimization problem becomes intractable, reducing the practitioner to approximate methods. In this paper, a similar objective function is presented, but which can be reduced to a min cut problem and thus solved in polynomial time.

Interestingly, the technique is robust to small changes in the color histograms of the foreground object, as well as being completely invariant to how pixels are arranged in the foreground object. This enables the authors to match pairs of real Flickr photos which have a recurring object, such as photographs of similar lawn gnomes.

They also have a medical application of co-segmentation. Image slices of two different brains are affine-aligned and co-segmented. The parts that don't match (the background) are likely to be lesions in one of the brains.

"Automatic Hair Detection in the Wild" by Julian et al.: The task is hair segmentation. The dataset is user uploaded photos from a virtual fitting room for glasses (a private dataset). Thus the photos are probably frontal and fairly high quality.

They detect the face and eyes using cascade classifiers. Then they fit a constrained local model [Cristinacce and Cootes 2006] to the detected face and eyes to refine the estimate of the locations of the eyes and to get an initial estimate of the temple location. With the temple location estimates, they initialize a hybrid active shape / active contour model [Cootes 1995, Leventon 2000] to find the hair. They call this model the Upper Hair Shape Model (UHSM) because it only tries to find the hair on the top of the head and a little on the sides (crew cut). The model is agnostic to the color of the hair, only seeking color uniformity. They use the fitted UHSM and the eye locations to find background, hair, and face regions. They use these regions to initialize yet another model, an image adapted appearance model which is presumably not agnostic to color. The details of this further model are given in a previous paper, which I cannot find online: "P. Julian, V. Charvillat, C. Dehais, and F. Lauze. On the interest of texture for face segmentation. In Orasis, 2009."

They present no empirical measurements (probably trade secrets). Based on the photos (qualitative performance), I'd guess all that complexity didn't buy them much.

"Toward image-based facial hair modeling" by Herrera et al.: This is a graphics paper, where the task is to transfer facial hair from one registered image to another. What's interesting is they don't assume they know where is the hair in the source image, so they must detect hair pixels. Being a graphics paper, I hoped it would offer some fresh ideas on the subject.

They extract 4 features from each pixel: 1) the response of the orientation filter along the dominant angle (hair tends to all grow in the same direction at a point), 2) the absolute value of the local image gradient (hair tends to have high image gradients, smooth skin low image gradients), 3) the sum of the color channels R + G + B, and 4) R - G and R - B (two values), because red dominates skin color and looking at differences reduces the effect of specularity.
The texture information they extract (1 and 2) can, morally speaking, be calculated from a SIFT descriptor. However, a SIFT descriptor has 128 dimensions, many of which may be useless, and calculating (1) from a SIFT descriptor would require a nonlinear function (max). It would be interesting to see how well SIFT would perform as a drop-in replacement. The color features are simply linear functions of the raw colors, so would be useless to extract if the next step were some sort of linear classifier or Mahalanobis metric learning.

But they don't do anything like that in the next step. Instead, for classification, they bin each of their features and use naive Bayes, comparing the probability of the features given hair against the probability given skin.
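A sketch of the features plus the binned naive Bayes step, with numpy and OpenCV; I've omitted their oriented-filter response, and the binning details are my guesses:

```python
import cv2
import numpy as np

def pixel_features(bgr):
    """Per-pixel features along the lines of Herrera et al.: gradient
    magnitude, color sum, and red-difference channels. (Their oriented-
    filter response along the dominant hair angle is omitted here.)"""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float64)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1)
    b, g, r = [c.astype(np.float64) for c in cv2.split(bgr)]
    return np.dstack([np.hypot(gx, gy),  # hair has high local gradients
                      r + g + b,         # overall brightness
                      r - g, r - b])     # red dominance, less specular

def naive_bayes_hair(feats, hists_hair, hists_skin, edges):
    """Bin each feature and sum per-feature log-likelihoods; hists_*
    are per-feature histograms precomputed from labeled pixels."""
    log_ratio = np.zeros(feats.shape[:2])
    for i in range(feats.shape[2]):
        bins = np.clip(np.digitize(feats[..., i], edges[i]) - 1,
                       0, len(hists_hair[i]) - 1)
        log_ratio += np.log(hists_hair[i][bins] + 1e-9)
        log_ratio -= np.log(hists_skin[i][bins] + 1e-9)
    return log_ratio > 0  # hair where hair is more likely than skin
```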

The rest of the paper is graphics.


Friday, September 2, 2011

Literature review (part 5)

Six new summaries for you today. We've got three hair papers, two supervised multi-class segmentation papers, and a paper which converts images to sketches and vice-versa.

"FREQUENTIAL AND COLOR ANALYSIS FOR HAIR MASK SEGMENTATION" by Rousset and Coulon: The task is hair segmentation (determine which pixels are hair). Images are frontal facing, and face detection is used to find the bounding box. Images are restricted to women so they don't have to deal with facial hair. The idea is to find some individual pixels which are definitely hair, and use alpha matting to get the rest of the hair pixels.

To find the definitely-hair pixels (the seeds), they use a combination of "frequential" (texture) and color information. They model the color of hair pixels with a single Gaussian, and threshold at a particular density to determine the pixels which have the right color (color mask). It's not clear to me exactly how they use the texture information, but they come up with a texture mask. The AND of the color and texture masks gives the pixels which are definitely hair. Because the matting technique they're using (Levin's, described in the last post) requires seed pixels for the background, they select those pixels which are (NOT TEXTURE) OR (NOT HEAD), where HEAD is true for the pixels in the automatically detected bounding box around the face.
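The mask logic is just boolean algebra; a sketch with hypothetical mask names:

```python
import numpy as np

def matting_seeds(color_mask, texture_mask, head_mask):
    """Combine binary masks into seeds for the alpha-matting step.
    color_mask: pixel color is dense under the hair Gaussian.
    texture_mask: pixel passes the frequential (texture) test.
    head_mask: pixel lies inside the detected face bounding box."""
    fg_seeds = color_mask & texture_mask        # definitely hair
    bg_seeds = (~texture_mask) | (~head_mask)   # definitely background
    unknown = ~(fg_seeds | bg_seeds)            # left to the matting
    return fg_seeds, bg_seeds, unknown
```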

They get what looks like reasonable precision and recall.

"Detection and Analysis of Hair" by Yacoob and Davis: This is probably the first computer vision paper to talk about leveraging hair recognition as part of a larger recognition system. Their idea is to enumerate a bunch of hair attributes and feed the values of those attributes for a particular image to a classifier, along with the internal face information. The attributes they measure are: 1) hair color, characterizing individual strands as cylindrical Lambertian surfaces and inferring the color-related physical parameters of the hair 2) hair parts (as in "you part your hair with a comb"), whether they occur, and if so, where 3) hair width (how much it sticks up off the head) 4) hair length, 5) they go on: surface area covered by hair, hair symmetry, inner and outer hairlines, hair texture (using Gabor wavelets). They find that these attributes provide useful additional information not exploited by the face recognition system they use, and that by combining internal face information and hair information, they can boost performance.

"A novel two-tier Bayesian based method for hair segmentation" by Wang et al.: A hair segmentation paper. The idea is that hair color can be pretty consistent within an image, but vary greatly between images, due to eg lighting. When a test image comes in, all pixels are scored for probability of being hair using location and color (color density is modeled with a Gaussian mixture model (GMM)). The image is then oversegmented, and if a fragment has a high total hair weight, everything inside it is labeled as hair. All the pixels from all these hair fragments are used to train a new GMM which captures hair color _in this image_. The new color GMM and the original location prior is used to classify the rest of the fragments, resulting in the final segmentation.

The main idea here is to adapt the classifier at test time based on information found in the query.
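A sketch of the two-tier adaptation, assuming fragment ids and per-pixel scores from the generic model are already in hand:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def adapt_hair_model(pixels, fragment_ids, prior_scores, tau=0.8):
    """Two-tier sketch: fragments whose mean score under the generic
    color + location model exceeds tau are labeled hair wholesale; a
    fresh GMM is then fit to just those pixels to capture hair color
    *in this image*, and used to rescore everything else."""
    confident = np.zeros(len(pixels), dtype=bool)
    for f in np.unique(fragment_ids):
        members = fragment_ids == f
        if prior_scores[members].mean() > tau:
            confident |= members  # whole fragment labeled hair

    # Image-specific hair color model, trained only on this image.
    image_gmm = GaussianMixture(n_components=3, random_state=0)
    image_gmm.fit(pixels[confident])
    return confident, image_gmm.score_samples(pixels)  # threshold downstream
```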

"Inducing semantic segmentation from an example" by Schnitman et al.: Task is supervised multi-label segmentation. The authors assume training and test images come from the same domain. They mention an application: photoshoot of a model, where the director wants to change the color of garment worn across all photos post-hoc.

They oversegment their test image, and from each fragment in the test image they extract several local patches. They compute the distance of the fragment to each of the labels as the median of the Euclidean distances from each local patch to the nearest patch in the label data; taking the median gives them some robustness to noise. The per-label costs for each fragment, along with a cost for neighboring fragments having different labels, are combined into an optimization problem which is solved by graph cuts using alpha-expansion, which allows graph cuts to deal with more than two labels.
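A sketch of the fragment-to-label distance with scipy, treating patches as flat vectors:

```python
import numpy as np
from scipy.spatial import cKDTree

def fragment_label_cost(frag_patches, label_patches):
    """Cost of assigning a fragment to a label: the median, over the
    fragment's local patches, of the Euclidean distance to the nearest
    training patch of that label. The median tolerates a few outliers."""
    tree = cKDTree(label_patches)        # label's patches, flattened
    dists, _ = tree.query(frag_patches)  # nearest-neighbor distances
    return np.median(dists)
```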


"TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context" by Shotton et al.: The task is supervised segmentation. In the author's own words, there are three contributions: 1) a novel type of feature, the texture-layout filter, 2) a new discriminative model that combines texture-layout filters with lower-level features, 3) techniques for making the training scalable to larger datasets.

A texture-layout filter is built with textons. Each pixel in each image is assigned a texton. A texture-layout filter is a pair (r, t) for a region r and a texton t, where r is specified as an offset relative to the pixel being classified (that is the "layout" part). Its value is proportional to the number of pixels assigned texton t within the region.
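With an integral image per texton, each filter evaluates in constant time; a sketch (bounds clamping omitted):

```python
import numpy as np

def texton_integral(texton_map, t):
    """Padded integral image of the indicator [texton == t], so any
    rectangle's count of texton t takes four lookups."""
    ind = (texton_map == t).astype(np.int64)
    return np.pad(ind.cumsum(axis=0).cumsum(axis=1), ((1, 0), (1, 0)))

def texture_layout(ii, y, x, r):
    """Value of the texture-layout filter (r, t) at pixel (y, x): the
    fraction of pixels carrying texton t inside the rectangle
    r = (dy0, dx0, dy1, dx1), offset relative to the pixel."""
    dy0, dx0, dy1, dx1 = r
    y0, x0, y1, x1 = y + dy0, x + dx0, y + dy1, x + dx1
    count = ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]
    return count / ((y1 - y0) * (x1 - x0))
```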


The model they use is a conditional random field that incorporates: 1) texture-layout features, 2) colors (modeling color probability using a Gaussian mixture model), 3) location information (sky tends to be at the top of the image if anywhere), 4) edge potentials which penalize changing from one class to another, except when there is an image gradient.


"Face photo-sketch synthesis and recognition" by Wang and Tang: The idea in this paper is to find a mapping between face photos and sketches. A potential application is law enforcement, as a sketch could be used to search a database of photos.


The mapping is similar to Freeman's super-resolution work, where low-res patches are identified with high-res patches. Here, instead, sketch patches are matched with image patches. But there are some differences: 1) unlike Freeman's, which is a single-scale MRF, the MRF used here is multi-scale, with the prior for a region coming not just from nearby small regions but also from far-away, larger regions. 2) they assume the faces are aligned, so when matching patches, they can spatially constrain where the patches they match to come from. For example, if I know I'm over the left ear, I will only retrieve database patches that were also extracted from the left ear. 3) there is not one patch per pixel; instead the patches form a grid, with small overlaps between the cells. The MRF error comes from the error between the overlapping pixels, and the optimized variable is which of K patch candidates to use for a particular patch location (the K candidates come from the K nearest neighbors of the original image patch). Even after the MRF optimization, there may be disagreements between the pixels, so the boundaries between patches are computed using "dynamic programming" (sounds like min cut).

One experiment they ran matched sketches against face databases. They got better performance synthesizing a face from the sketch and matching to faces directly than converting the whole database to sketches and matching in sketch space. The former approach lets them use existing face recognition techniques.

Somehow their system inferred skin tone based on the sketch. In other words, they get a grayscale sketch, and their system somehow figures out the skin tone should be Asian / Caucasian / etc. The system must be relying on facial shape and features to infer skin tone, which is amazing.