Wednesday, September 14, 2011

Literature review (part 13)

Today we come to the end of the third phase of my literature review (you didn't know there were phases, did you?). Hopefully we're quite near the end.

Today we have a quick citation and two works using SIFT for ear recognition.


"Statistical shape influence in geodesic active contours" by Leventon et al.: Presents a method of incorporating shape into the image segmentation process. Not sure why this paper was on my reading list.

"SIFT-based ear recognition by fusion of detected keypoints from color similarity slice regions" by Kisku et al.: The goal is to provide ear recognition despite pose variation and occlusion. They do this by detecting color "level-sets" of ear images and applying SIFT to those level sets. The level sets are in fact ranges of colors that they cluster together using a GMM.

In more detail: they fit GMMs to pixel colors of ear images in order to find bands of ear colors. For each band, they segment a new image of an ear into pixels which fall into the band (black) and everything else (white). Segmentations produced in this manner have characteristic ear-like shapes. SIFT features are extracted from these segmentations and all concatenated together to form one ear descriptor. Then they match descriptors.
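As a rough sketch of the level-set idea (my reconstruction, not their code; sklearn's GMM stands in for whatever fitting procedure they use, and the toy image is mine):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def color_band_masks(image, n_bands=4, seed=0):
    """Fit a GMM to pixel colors and return one binary mask per color band.

    Each Gaussian component defines a band of similar colors; each mask
    marks the pixels assigned to that band, giving the black/white
    segmentations that SIFT is then run on.
    """
    h, w, c = image.shape
    pixels = image.reshape(-1, c).astype(float)
    gmm = GaussianMixture(n_components=n_bands, random_state=seed).fit(pixels)
    labels = gmm.predict(pixels).reshape(h, w)
    return [(labels == k) for k in range(n_bands)]

# Toy "ear" image: two flat color regions plus a little noise.
rng = np.random.default_rng(0)
img = np.zeros((20, 20, 3))
img[:, :10] = [200, 150, 120]   # lighter, skin-ish region
img[:, 10:] = [60, 40, 30]      # darker region
img += rng.normal(0, 2, img.shape)
masks = color_band_masks(img, n_bands=2)
```

In the real pipeline each mask would be fed to SIFT and the resulting descriptors concatenated into one ear descriptor.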

"Ear identification by fusion of segmented slice regions using invariant features: an experimental manifold with dual-fusion approach" by Kisku et al.: This appears to be similar to their previous paper, though they introduce another feature inspired by Dempster-Shafer theory.

Tuesday, September 13, 2011

Literature review (part 12)

Today we have an ear detection paper and a paper which does eyebrow segmentation.

"Fast and Fully Automatic Ear Detection Using Cascaded AdaBoost" by Islam et al.: Used AdaBoost with custom weak learners to detect ears. The weak learners use custom features, including a center-surround feature used to detect ear pits. To improve efficiency, they actually use a cascade of AdaBoost classifiers, where the early classifiers should be quick to evaluate.
       
They get essentially perfect detection performance.

"Facial features segmentation by model-based snakes" by Radeva: Goal is segmentation of facial features, including eyes, eyebrows, and mouth using aligned images. They find the eyes using template matching. They then project the image along the x-axis, and look for a valley in the grayscale values indicating the eyebrow location (eyebrows are dark); this determines the eyebrow y location. To determine the x location, they project along the y-axis and use the fact that eyebrows are above eyes and both are dark, again looking for a valley in the grayscale values.
       
They use the approximate eyebrow locations to initialize an active-contour-like technique called a "rubber snake". The final position of the rubber snake provides the segmentation.

Monday, September 12, 2011

Literature review (part 11)

These are some "quick citations"; I skimmed these papers to get a quick gist of what they do.

"Estimation of the chin and cheek contours for precise face model adaptation" by Kampmann: The goal is to determine the chin and cheek contours.

They assume they know where the eyes and mouth are, and use this to find the probable location of the chin. They then use local gradient information to find three points on the chin (high gradient means likely chin boundary). They then fit a curve to these three points and call it the chin boundary.
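The curve-fitting step is just an exact quadratic through three points; something like this (the points are hypothetical):

```python
import numpy as np

# Three estimated chin points (x, y): high-gradient locations on either
# side plus the bottom of the chin. A degree-2 polynomial through three
# points is an exact fit, and sampling it gives the chin boundary.
points = np.array([(-30.0, 10.0), (0.0, 0.0), (30.0, 10.0)])
coeffs = np.polyfit(points[:, 0], points[:, 1], deg=2)
chin_y = np.polyval(coeffs, np.linspace(-30, 30, 61))
```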

To determine cheek boundaries (the boundary above and to either side of the chin), they use a similar technique.

They appear to use only frontal facing images.       

"Ear Biometrics in Computer Vision" by Burge and Burger: Model ears as a graph built from a Voronoi diagram of ear segments. Use a novel graph distance function.

"Force Field Feature Extraction for Ear Biometrics" by Hurley: This is a thesis. Pixels in the the image of an ear are treated like point masses, and the gravitational field is somehow used to construct a descriptor.

Literature review (part 10)

Three papers today: head segmentation, review paper on ear biometrics, and a cute one on image-based shaving.

"Segmentation of a head into face, ears, neck and hair for knowledge-based analysis-synthesis coding of videophone sequences" by Kampmann: Task is to segment a video stream into face ears, neck, and hair regions. A motivating idea is smart videoconferencing compression, as regions like the face should be transferred at a higher fidelity than other regions.
       
They assume they start with eyes and mouth center positions as well as chin and cheek contours. In a new image, they find areas with likely skin tone using the eyes and mouth locations, and they use this tone to segment the image between skin and background. Using the chin contour, they break the skin pixels into face, neck, and ears regions.
       
To find hair, they assume they have a segmentation of the head from the background and they subtract the skin pixels.

"The ear as a biometric" by Hurley et al.: This is an ear biometrics review paper.

The paper makes the point that there is significant variability in ear shape between people, and that the ear: 1) changes little with age, 2) doesn't change with facial expression, 3) unlike fingerprints, poses no hygiene issue, 4) unlike iris or retina scanning, prompts no user fear of harm, and 5) isn't affected by makeup or obscured by facial hair, though it can be changed with jewelry or obscured by head hair.

Burge and Burger demonstrated the potential for the ear as a biometric and also used computer vision to recognize ears. They segmented the ear using Canny edges and compared two such segmentations by comparing the Voronoi diagrams of the segmentations using a novel distance measure. They identify occlusion by hair as a major obstacle.

A number of authors have used principal components analysis on the raw pixels of registered images as a preprocessing step when recognizing ears.

The paper appears to discuss only approaches for registered ears.

"Image-based Shaving" by Nguyen et al.: This is a graphics paper discussing automatic removal of beards in images. They express a given image in terms of beard and non-beard PCA components.
       
Given a set of non-beard images, a naive approach is to find the PCA components accounting for most of the energy and, when a bearded image comes in, express it in terms of the non-beard principal components. However, as beards tend to be spatially large and significantly different from skin pixels, the reconstruction process tries to reconstruct the beard and produces poor results. As a first-pass fix for this problem, they use a robust estimator, where pixel mismatch imposes an asymptotically L1 error instead of the L2 error of regular PCA. This improves the results, but can also produce overly smooth reconstructions.
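A minimal stand-in for that robust projection step (this is generic IRLS with a Huber-style weight, not their exact estimator; the basis, scale, and toy data are all mine):

```python
import numpy as np

def robust_coefficients(d, basis, n_iter=30, scale=1.0):
    """Fit PCA coefficients c minimizing a Huber-like robust error on
    d - basis @ c via iteratively reweighted least squares (IRLS):
    pixels with large residuals (e.g. beard pixels) get an ~L1 weight
    instead of the L2 weight of a plain projection."""
    k = basis.shape[1]
    w = np.ones(len(d))
    c = np.zeros(k)
    for _ in range(n_iter):
        Bw = basis * w[:, None]
        c = np.linalg.solve(Bw.T @ basis, Bw.T @ d)   # weighted LS solve
        r = d - basis @ c
        w = np.minimum(1.0, scale / np.maximum(np.abs(r), 1e-12))
    return c

# Toy "face": one flat basis component; 20% of pixels corrupted by a "beard".
basis = np.ones((100, 1))
d = np.full(100, 5.0)
d[:20] += 50.0                                     # large localized outliers
c_l2 = np.linalg.lstsq(basis, d, rcond=None)[0]    # plain projection: pulled to 15
c_rob = robust_coefficients(d, basis)              # stays near the true 5
```

The robust fit largely ignores the outlier block, which is exactly the behavior wanted for beard pixels.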
       
As a further refinement, they estimate the beard space directly. To do this, they take a dataset of bearded faces, and for each bearded face, they compute the difference between the bearded face and the automatically shaved faces using the previous method. This set of difference vectors spans the beard space. They can then express a new bearded image with the standard PCA error (L2), using beard and non-beard principal components. They can then reweight the beard coefficients to delete the beard or even make it more pronounced.
       
As yet a further refinement, they exploit the spatial locality of beards. When determining the beard subspace by considering the difference images, they use an MRF to segment beard pixels (generally pixels with a large difference) from non-beard pixels. They then zero out the entries in the difference vectors that are not beard pixels.
       
With their technique, they can also perform beard addition, though they must blend the beard and face layers as a postprocessing step to ensure a smooth transition from beard to face.
       
They use hand-registered data.
       
They have preliminary results for automatic glasses removal, but they don't discuss them.

Wednesday, September 7, 2011

Literature review (part 9)

Took the wrong train today, so had extra time to read. Today we have ears, eyebrows, and head detection. I think the girl sitting next to me on the train was wondering why I was looking at arrays of photos of ears.


"On Model-Based Analysis of Ear Biometrics" by Arbab-Zavar et al.: The task is biometrics using the ear, claiming that ears are distinctive. Their dataset is a subset of the XM2VTS dataset, sides of heads with the ears clearly visible, but not aligned. 

They describe their model as a constellation model. They train it by taking one image from each of the 63 subjects and manually cropping to the ear. They run SIFT on all these images and cluster the resulting descriptors, using information from the descriptor space as well as the location space; they can use location space information because the ears are roughly registered.

Given a test image, they find the elliptical shape of the ear in the test image and crop to the ear. Then they extract SIFT features. The final match score is the cost of matching all of the cluster centers to the SIFT descriptors in the test image.
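As a bare-bones sketch of that matching cost (nearest-neighbour distances in descriptor space only; the paper also uses keypoint locations, and the data here is random):

```python
import numpy as np

def match_cost(model_centers, test_descriptors):
    """Cost of matching each model cluster centre to its nearest test
    descriptor: the sum of nearest-neighbour distances. Lower means the
    test image better explains the trained constellation."""
    d = np.linalg.norm(
        model_centers[:, None, :] - test_descriptors[None, :, :], axis=2)
    return d.min(axis=1).sum()

rng = np.random.default_rng(1)
centers = rng.normal(size=(5, 8))                            # toy model
same = centers + rng.normal(scale=0.01, size=centers.shape)  # near-duplicate ear
other = rng.normal(size=(40, 8))                             # unrelated ear
print(match_cost(centers, same) < match_cost(centers, other))  # True
```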

They do some occlusion tests by synthetically occluding images of ears (though the enrollment ears remain fully visible). Their performance drops with occlusion, but they find they do better than a competing method based on PCA.

"Robust 2D ear registration and recognition based on sift point matching" by Bustard and Nixon: Registers ears using SIFT (assuming the ear is planar). With SIFT for registration and a simple pixel-to-pixel distance, they get results equal to manual registration with PCA.

In their method, ears in the gallery are masked into ear pixels and non-ear pixels. Then SIFT features are extracted from the ear regions in the gallery. For a query image, SIFT features are extracted and matched against the gallery, with RANSAC used to recover a consistent homography. Based on language in the experiments section, the matching appears to be image-to-image. A score for each gallery image is computed by warping the query by the estimated homography and taking squared pixel error. The error is made robust to occlusion by capping the error any one pixel can contribute. Additionally, before comparison the images are normalized to minimize the effects of lighting.
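The capped-error scoring is easy to sketch (the cap of 40 gray levels is an arbitrary value of mine, and their lighting normalization is omitted):

```python
import numpy as np

def occlusion_robust_score(query, gallery, cap=40.0):
    """Squared pixel error between a registered query and a gallery image,
    with each pixel's contribution capped so an occluded region cannot
    dominate the score. Lower is better."""
    err = (query.astype(float) - gallery.astype(float)) ** 2
    return np.minimum(err, cap ** 2).sum()

# A query matching the gallery except for an occluded corner scores far
# better than the raw squared error would suggest.
gallery = np.full((32, 32), 100.0)
query = gallery.copy()
query[:8, :8] = 255.0                           # simulated occlusion (64 px)
print(occlusion_robust_score(query, gallery))   # 64 * 40**2 = 102400.0
```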

They test with the XM2VTS dataset as well as a dataset of their own devising, showing decent results even with occlusion and viewpoint variation.

They should have given results for manual registration and their image distance, to prove that for perfect registration, their distance is better than PCA. Then they could have made two contributions.

Contains a good review section.

"A method for estimating and accurately extracting the eyebrow in human face image" by Chen and Cham: They extract eyebrow contours using a k-means technique, which appears to be clustering of image patches. They use the initial estimate of the eyebrow vs non-eyebrow pixels to initialze a snake (contour fitting) which just uses image gradients.

Once they've segmented the eyebrows, they characterize the curves of the top halves of the eyebrows and compare the characterizations. On the Olivetti face dataset, they achieve 87% accuracy.

"Active shape models-their training and application" by Cootes et al.: The desire is to capture the shapes of objects while allowing some variability in shape. Unlike active contour models, active shape models parameterize the ways the shape can change, so that only class-specific deformations are allowed.

"Feature detection and tracking with constrained local models" by Cristinacce: Like the active appearance model, uses shape and texture to to parameterize a novel image. However, this work cites active appearance models and claims to have improved localization accuracy.

"An accurate algorithm for head detection based on XYZ and HSV hair and skin color models" by Gunes and Piccardi: The goal is head segmentation from background, including even back-of-head segmentation.

They learn a Gaussian mixture model for hair color, representing color in both XYZ and HSV space. Colors with sufficient density under the GMM are considered to be hair colors. Interestingly, they seem to make an error by adding probabilities in equation (3) instead of multiplying them.

They learn a similar model for skin color, but before feeding skin pixels to their model, they first filter them with the technique of Hsu et al. A pixel is estimated to be a head pixel if it is a hair or skin pixel. So no naked people. Given the detected head pixels in an image, they perform morphological closing to fill in gaps and fit an ellipse to the resulting shape.
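The gap-filling step, sketched with scipy (the 3x3 structuring element is my guess, and the ellipse fit is omitted):

```python
import numpy as np
from scipy.ndimage import binary_closing

# Toy head mask with a couple of interior holes left by misclassified pixels.
mask = np.ones((15, 15), dtype=bool)
mask[7, 7] = False
mask[3, 8] = False

# Closing (dilation then erosion) fills small interior gaps; note scipy's
# default border handling erodes the outermost ring of pixels.
closed = binary_closing(mask, structure=np.ones((3, 3)))
print(closed[1:-1, 1:-1].all())    # interior holes filled -> True
```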

Unfortunately, they do not talk about recognition; this is a detection-only paper.

"Eyes and eyebrows parametric models for automatic segmentation" by Hammal and Caplier: They work with video data, and assume the face in the first frame of each video has been detected. They then track faces using block matching.

They detect irises, eyes, and eyebrows, but I focused on eyebrows. Assuming they have a rectangle in which the eyebrow lies, they detect the x locations of the eyebrow endpoints by looking at the zero crossings of the first derivative of the vertical projection of the rectangle. They detect the shared y location by looking at the maximum of the horizontal projection of the rectangle. They fit a Bezier curve through the two endpoints as well as the point in between them. They refine the curve by adjusting it locally to maximize the flow of the luminance gradient through the curve.
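The Bezier step can be sketched directly: for a quadratic Bezier that hits the middle sample at t = 0.5, the control point has a closed form (the eyebrow points below are hypothetical):

```python
import numpy as np

def quad_bezier_through(p0, m, p2, n=51):
    """Quadratic Bezier through endpoints p0, p2 that passes through the
    middle sample m at t = 0.5: solve for the control point p1, then
    sample the curve at n parameter values."""
    p0, m, p2 = map(np.asarray, (p0, m, p2))
    p1 = 2.0 * m - 0.5 * (p0 + p2)        # chosen so B(0.5) == m
    t = np.linspace(0.0, 1.0, n)[:, None]
    return (1 - t) ** 2 * p0 + 2 * t * (1 - t) * p1 + t ** 2 * p2

# Hypothetical eyebrow points (x, y): left end, arch top, right end.
curve = quad_bezier_through([10, 40], [30, 30], [50, 40])
```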


New directions

I think I've mined everything useful in recognition from hair, hair matting, matting generally, and the cogsci angle on why external features are potentially useful. I hope to look more at ears, jawline, head shape, eyebrows, and automatic shaving. The last is because hair (head, eyebrows, face) as a biometric can be spoofed, and being able to identify and eliminate hair could make recognition systems robust to that spoofing.

Monday, September 5, 2011

Literature review (part 8)

"The role of eyebrows in face recognition" by Sadr et al.: They perform a lesion study, removing either eyes or eyebrows from images of celebrity faces where the task is to identify the celebrity. They find recognition performance for humans is worse with no eyebrows than it is with no eyes.

They speculate as to why eyebrows may be important to recognition: 1) since eyebrows are used to convey emotion, humans may naturally attend to them more, giving them higher weight in recognition; 2) eyebrows are large and high-contrast, so they survive lighting changes and image degradations, making them reliable, low-noise information sources.

The paper contains some interesting references:

"Given that the brow ridge may have been an important, sexually distinctive characteristic of our early ancestors' faces, it is not surprising that recent studies have found an important role for eyebrow thickness in discriminating between male and female faces (Bruce et al 1993)" pg 286
This suggests eyebrows as useful for gender discrimination with computer vision. If we consider eyebrows external features, determining gender is half the battle.

"In fact, research in the latter field has shown the eyebrows to be, in a kinematic sense, the most expressive part of the face (Linstrom et al 2000) and, we would suggest, a facial feature whose gestures would be easily recognized at a distance." pg 292
Just reinforcing the idea that eyebrows are good for low-visibility recognition.

"New appearance models for natural image matting" by Singaraju et al.: The task is image alpha matting using sparse trimaps. They build on Levin et al.'s work, but address a failure case of that work, where the model could overfit when the foreground and background layers are locally linear. Like Levin's, their solution is closed-form. They compare to Levin, and claim to get better matting results.

"Estimation of Alpha Mattes For Multiple Image Layers" by Singaraju and Vidal: Alpha matting typically assumes two layers, a foreground and a background. This work addresses (sparsely initialized) alpha matting for 2 or more layers. They use as an example two toy trolls standing next to each other, with overlapping hair.

The technique they propose is closed-form, but it does not necessarily produce alpha values which fall in the unit interval, breaking the traditional probabilistic interpretation of matting. When they add the constraint that the alpha values fall in the unit interval, they lose the closed form solution. However, the optimum value can still be obtained by solving a quadratic program.
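To make that last point concrete, here's a toy box-constrained quadratic solve by projected gradient (a generic sketch, not their QP formulation or solver; the problem data is mine):

```python
import numpy as np

def box_constrained_qp(Q, b, n_iter=500, lr=None):
    """Minimize 0.5 * a^T Q a - b^T a subject to 0 <= a <= 1 by projected
    gradient descent: take a gradient step, then clip back into the box.
    A tiny stand-in for the QP that replaces the closed-form solve once
    alpha values are constrained to the unit interval."""
    a = np.full(len(b), 0.5)
    if lr is None:
        lr = 1.0 / np.linalg.norm(Q, 2)        # step below 1/Lipschitz
    for _ in range(n_iter):
        grad = Q @ a - b
        a = np.clip(a - lr * grad, 0.0, 1.0)   # project into [0, 1]^n
    return a

# Toy problem whose unconstrained optimum (2, -1) lies outside the box.
Q = np.eye(2)
b = np.array([2.0, -1.0])
alpha = box_constrained_qp(Q, b)
print(alpha)                                   # -> [1. 0.]
```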