Inventors
Schneiderman, Henry
Kanade, Takeo
Application #
795208
Filed
Feb-28-2001
Published
Dec-7-2004
Current US Class
375/240.19 382/154 382/240 382/285
International Classes
G06K 009/00
Field of Search
382/115 382/118 382/159 382/180 382/209 382/154 382/162 382/165 382/285 382/240 706/21 706/52 700/83 375/240.19 345/810 345/840 348/42 348/51
Assignee
Carnegie Mellon University (Pittsburgh, PA)
Examiners
Mehta; Bhavesh M.
Attorney, Agent or Firm
Kirkpatrick & Lockhart LLP
Abstract
An object finder program for detecting presence of a 3D object in a 2D image containing a 2D representation of the 3D object. The object finder uses the wavelet transform of the input 2D image for object detection. A pre-selected number of view-based detectors are trained on sample images prior to performing the detection on an unknown image. These detectors then operate on the given input image and compute a quantized wavelet transform for the entire input image. The object detection then proceeds with sampling of the quantized wavelet coefficients at different image window locations on the input image and efficient look-up of pre-computed log-likelihood tables to determine object presence. The object finder's coarse-to-fine object detection strategy coupled with exhaustive object search across different positions and scales results in an efficient and accurate object detection scheme. The object finder detects a 3D object over a wide range in angular variation (e.g., 180 degrees) through the combination of a small number of detectors each specialized to a small range within this range of angular variation.
Claims
What is claimed is:
1. A method to detect presence of a 3D (three dimensional) object in a 2D (two dimensional) image containing a 2D representation of said 3D object, said method comprising:
receiving a digitized version of said 2D image;
selecting one or more view-based detectors;
for each view-based detector, computing a wavelet transform of said digitized version of said 2D image, wherein said wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents visual information from said 2D image that is localized in space, frequency, and orientation;
applying said one or more view-based detectors in parallel to respective plurality of transform coefficients, wherein each view-based detector is configured to:
compute a likelihood ratio based on visual information received from corresponding transform coefficients;
compare said likelihood ratio to a predetermined threshold value; and
detect a specific orientation of said 3D object in said 2D image based on said comparison of said likelihood ratio to said predetermined threshold value;
combining results of application of said one or more view-based detectors; and
determining orientation and location of said 3D object from said combination of results of application of said one or more view-based detectors.
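The final two steps of claim 1 (combining detector results and determining orientation and location) can be illustrated with a minimal Python sketch. The claim does not say how the results are combined, so the arbitration rule here (keep the strongest above-threshold detection among overlapping ones) is an assumption, as are the names `Detection`, `combine_detections`, `threshold`, and `min_distance`.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    orientation: str   # hypothetical label of the view a detector is specialized to
    x: int             # window location in the image
    y: int
    log_ratio: float   # log of the likelihood ratio at this location


def combine_detections(per_view, threshold=0.0, min_distance=8):
    """Merge the outputs of several view-based detectors.

    Assumed rule: discard detections at or below the threshold, then keep
    the strongest detection among any that lie closer than `min_distance`
    in both x and y.
    """
    candidates = [d for dets in per_view for d in dets if d.log_ratio > threshold]
    candidates.sort(key=lambda d: d.log_ratio, reverse=True)
    kept = []
    for d in candidates:
        far_from_all = all(
            abs(d.x - k.x) >= min_distance or abs(d.y - k.y) >= min_distance
            for k in kept
        )
        if far_from_all:
            kept.append(d)
    return kept


frontal = [Detection("frontal", 10, 10, 2.5), Detection("frontal", 50, 50, -1.0)]
profile = [Detection("right profile", 12, 11, 1.0), Detection("right profile", 80, 20, 0.7)]
result = combine_detections([frontal, profile])
```

In this toy input the frontal and right-profile detectors both fire near (10, 10); the sketch reports the location once, with the orientation of the detector that produced the higher score.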
2. The method of claim 1, further comprising developing said one or more view-based detectors from a pre-selected set of training images.
3. The method of claim 2, wherein developing said one or more view-based detectors includes the following for at least one of said one or more view-based detectors:
selecting at least one correction factor, wherein said at least one correction factor is configured to correct the light intensity level of at least one of said training images;
selecting a first value for said at least one correction factor;
applying said first value for said at least one correction factor to said at least one of said training images;
examining an effect on appearance of said at least one of said training images after application of said first value of said at least one correction factor thereto;
selecting a second value of said at least one correction factor based on said effect on the appearance of said at least one of said training images; and
continuing selection, application, and examination until a desired effect on the appearance of said at least one of said training images is obtained.
4. The method of claim 3, wherein selecting at least one correction factor includes selecting two correction factors, and wherein each of said two correction factors is applied to a different half of said at least one of said training images.
5. The method of claim 2, wherein developing said one or more view-based detectors includes the following for at least one of said one or more view-based detectors:
defining a plurality of attributes;
selecting an image window, wherein said image window is configured to be placed at a plurality of locations on each of said training images;
for each attribute, determining corresponding attribute values at a plurality of coordinates at one of said plurality of locations of said image window by training a plurality of instances of said at least one view-based detector on said pre-selected set of training images;
for each of said plurality of instances, computing a respective weight to be applied to corresponding attribute values for said each attribute at said one of said plurality of locations of said image window;
for each of said plurality of instances, applying said respective weight to corresponding attribute values for each attribute at said plurality of coordinates at said one of said plurality of locations of said image window, thereby generating a set of weighted attribute values for each attribute for each of said plurality of coordinates at said one of said plurality of locations of said image window; and
for each of said plurality of coordinates and for each attribute, combining corresponding weighted attribute values in said set of weighted attribute values, thereby generating a single attribute value for each attribute at said each of said plurality of coordinates at said one of said plurality of locations of said image window.
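The weighting and combining steps of claim 5 amount to a weighted sum over detector instances. The sketch below assumes each instance contributes one table of attribute values for a single attribute at a single coordinate, and that the weights are already known; the claim does not state how the weights are computed, and the names `combine_instances`, `instance_tables`, and `weights` are hypothetical.

```python
def combine_instances(instance_tables, weights):
    """Collapse several trained instances into a single table.

    instance_tables[i][v] is the value instance i assigns to attribute
    value v; weights[i] is the weight computed for instance i.
    """
    size = len(instance_tables[0])
    return [
        sum(w * table[v] for w, table in zip(weights, instance_tables))
        for v in range(size)
    ]


combined = combine_instances([[1.0, -1.0], [0.5, 0.5]], [2.0, 1.0])
```

Collapsing the instances into one table ahead of time means detection needs only one look-up per attribute sample, however many instances were trained.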
6. The method of claim 1, wherein computing said wavelet transform includes:
computing said wavelet transform for a first scale of said 2D image, thereby generating a plurality of wavelet transform levels at said first scale; and
reusing at least one of said plurality of wavelet transform levels as part of said wavelet transform for a second scale of said 2D image when computing said wavelet transform for said second scale.
7. The method of claim 6, wherein said first and said second scales differ from one another by one octave.
8. The method of claim 6, wherein said plurality of wavelet transform levels includes three levels, and wherein said wavelet transform for said second scale reuses two lower resolution levels from said three wavelet transform levels for said first scale.
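The reuse described in claims 6 to 8 can be checked numerically with a small sketch. It assumes a Haar filter and assumes that the second scale is obtained by 2x2 averaging (one octave down); the claims name neither choice, and `haar_level` and `haar_pyramid` are hypothetical helper names.

```python
def haar_level(img):
    """One 2D Haar analysis step on an even-sized image (list of rows).

    Returns (avg, d_rows, d_cols, d_diag), each half the input size.
    """
    h, w = len(img) // 2, len(img[0]) // 2
    avg = [[0.0] * w for _ in range(h)]
    d_rows = [[0.0] * w for _ in range(h)]
    d_cols = [[0.0] * w for _ in range(h)]
    d_diag = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            a, b = img[2 * i][2 * j], img[2 * i][2 * j + 1]
            c, d = img[2 * i + 1][2 * j], img[2 * i + 1][2 * j + 1]
            avg[i][j] = (a + b + c + d) / 4.0      # 2x2 local average
            d_rows[i][j] = (a + b - c - d) / 4.0   # top row minus bottom row
            d_cols[i][j] = (a - b + c - d) / 4.0   # left column minus right column
            d_diag[i][j] = (a - b - c + d) / 4.0   # diagonal difference
    return avg, d_rows, d_cols, d_diag


def haar_pyramid(img, levels):
    """Return the detail bands of each level (finest first) and the final average."""
    bands = []
    avg = img
    for _ in range(levels):
        avg, d_rows, d_cols, d_diag = haar_level(avg)
        bands.append((d_rows, d_cols, d_diag))
    return bands, avg


image = [[(3 * i + 5 * j) % 17 for j in range(16)] for i in range(16)]
first_scale, _ = haar_pyramid(image, 3)        # three levels at the first scale
half_image = haar_level(image)[0]              # the image one octave smaller
second_scale, _ = haar_pyramid(half_image, 2)  # two finest levels at the second scale
```

Under these assumptions the two coarser levels at the first scale are identical to the two finer levels at the second scale, so only the coarsest level of the second scale has to be computed afresh.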
9. The method of claim 1, wherein applying said one or more view-based detectors in parallel includes the following for at least one of said one or more view-based detectors:
defining a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
selecting an image window, wherein said image window is configured to represent a fixed size area of said 2D image;
placing said image window at one of a plurality of locations within said 2D image;
selecting two correction factors, wherein each of said two correction factors is configured to correct the light intensity level for a corresponding half of said image window at said one of said plurality of locations;
selecting a predetermined number of correction values for each of said two correction factors;
for each of said two correction factors and for each of said predetermined number of correction values therefor, evaluating the total log-likelihood ratio value for said plurality of attributes for the corresponding half of said image window at said one of said plurality of locations;
for each half of said image window at said one of said plurality of locations, selecting the largest total log-likelihood ratio value for said plurality of attributes; and
adding corresponding largest total log-likelihood ratio value for said each half of said image window to determine an overall log-likelihood ratio value, wherein said overall log-likelihood ratio value is used to estimate the presence of said 3D object, and wherein said overall log-likelihood ratio value is said computed likelihood ratio of claim 1.
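The last three steps of claim 9 reduce to a maximum over correction values for each half of the window, followed by a sum. In the sketch below, the scoring callables stand in for "apply this correction value to this half, then total the log-likelihood ratios over all attributes"; those callables, the candidate values, and the function names are hypothetical.

```python
def best_half_score(score_fn, correction_values):
    """Largest total log-likelihood ratio over the candidate correction values."""
    return max(score_fn(c) for c in correction_values)


def overall_log_likelihood_ratio(left_score_fn, right_score_fn, correction_values):
    """Sum of the best scores of the two halves of the image window."""
    return (best_half_score(left_score_fn, correction_values)
            + best_half_score(right_score_fn, correction_values))


# Toy scoring functions: each half responds best to a different correction value.
left_half = lambda c: 2.0 - (c - 1.0) ** 2
right_half = lambda c: 1.0 - (c - 1.5) ** 2
overall = overall_log_likelihood_ratio(left_half, right_half, [0.5, 1.0, 1.5, 2.0])
```

Because each half is maximized independently, a face lit more strongly on one side can still score well: the left half picks correction value 1.0 and the right half picks 1.5.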
10. The method of claim 1, wherein applying said one or more view-based detectors in parallel includes the following for at least one of said one or more view-based detectors:
defining a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
selecting an image window, wherein said image window is configured to represent a fixed size area of said 2D image;
placing said image window at a first one of a first plurality of locations within said 2D image;
for each of said plurality of attributes, determining a corresponding attribute value at each of a first plurality of coordinates within said image window at said first location;
for each of said plurality of attributes, obtaining a first class-conditional probability for an object class and a second class-conditional probability for a non-object class at said each of said first plurality of coordinates based on said corresponding attribute values determined at said first plurality of coordinates;
estimating presence of the 3D object in said image window at said first location based on said comparison of said likelihood ratio to said predetermined threshold value, wherein said likelihood ratio is defined by a ratio of a first product and a second product, wherein said first product includes a product of all of said first class-conditional probabilities and wherein said second product includes a product of all of said second class-conditional probabilities;
moving said image window to a second one of said first plurality of locations within said 2D image; and
continuing determination of said corresponding attribute values and said first and said second class-conditional probabilities, and estimation of the presence of said 3D object in said image window at said second location and at each remaining location in said first plurality of locations within said 2D image.
11. The method of claim 10, wherein said first and said second class-conditional probabilities are obtained by looking up a pre-computed set of log-likelihood tables using corresponding attribute values.
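The ratio of products in claim 10 and the table look-up in claim 11 are two views of the same quantity: the logarithm of a ratio of products is a sum of per-sample log ratios. A minimal sketch follows, assuming for brevity that each table is indexed only by attribute and attribute value (dependence on the coordinate within the window is left out); the probabilities and the names `build_log_table` and `window_log_ratio` are illustrative.

```python
import math


def build_log_table(p_object, p_non_object):
    """Pre-compute log(P(value | object) / P(value | non-object)) per value."""
    return [math.log(po / pn) for po, pn in zip(p_object, p_non_object)]


def window_log_ratio(tables, samples):
    """Sum the table entries for every (attribute index, attribute value) sample."""
    return sum(tables[k][v] for k, v in samples)


# One attribute with two possible values; probabilities are made up.
tables = [build_log_table([0.6, 0.4], [0.3, 0.7])]
samples = [(0, 0), (0, 0), (0, 1)]
log_ratio = window_log_ratio(tables, samples)

# The same quantity written as a ratio of products, as in claim 10.
direct_ratio = (0.6 * 0.6 * 0.4) / (0.3 * 0.3 * 0.7)
```

Comparing the summed logarithm with the logarithm of the threshold gives the same decision as comparing the ratio with the threshold, while replacing multiplications and divisions at detection time with additions of pre-computed entries.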
12. The method of claim 10, wherein said image window is rectangular.
13. The method of claim 10, wherein said plurality of attributes includes seventeen attributes.
14. The method of claim 10, wherein said each attribute is configured to sample and quantize eight transform coefficients.
15. The method of claim 10, wherein said each attribute quantizes said each of said predetermined number of transform coefficients into three levels.
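Claims 14 and 15 together fix the size of each attribute's value space: eight coefficients at three levels each give 3^8 = 6561 possible values. One way to turn the eight quantized coefficients into a table index is to read them as base-3 digits, sketched below. The thresholds `low` and `high` and the function names are assumptions; the claims do not give the quantization boundaries.

```python
def quantize(coefficient, low=-1.0, high=1.0):
    """Map one transform coefficient to level 0, 1, or 2 (thresholds are assumed)."""
    if coefficient < low:
        return 0
    if coefficient > high:
        return 2
    return 1


def attribute_value(coefficients, low=-1.0, high=1.0):
    """Pack the quantized coefficients into one integer, read as base-3 digits."""
    value = 0
    for c in coefficients:
        value = value * 3 + quantize(c, low, high)
    return value


smallest = attribute_value([-5.0] * 8)
middle = attribute_value([0.0] * 8)
largest = attribute_value([5.0] * 8)
```

With eight coefficients the index runs from 0 to 6560, so each attribute needs a table of 6561 entries.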
16. The method of claim 10, further comprising the following for at least one of said one or more view-based detectors:
scaling said 2D image to one of a predetermined number of scale levels, thereby generating a scaled image;
placing said image window at a third one of a second plurality of locations within said scaled image;
for each of said plurality of attributes, determining said corresponding attribute value at each of a second plurality of coordinates within said image window at said third location;
for each of said plurality of attributes, obtaining a third class-conditional probability for said object class and a fourth class-conditional probability for said non-object class at said each of said second plurality of coordinates based on said corresponding attribute values determined at said second plurality of coordinates;
estimating presence of the 3D object in said image window at said third location based on a ratio of a third product and a fourth product, wherein said third product includes a product of all of said third class-conditional probabilities and wherein said fourth product includes a product of all of said fourth class-conditional probabilities;
moving said image window to a fourth one of said second plurality of locations within said scaled image; and
continuing determination of said corresponding attribute values and said third and said fourth class-conditional probabilities, and estimation of the presence of said 3D object in said image window at said fourth location and at each remaining location in said second plurality of locations within said scaled image.
17. The method of claim 16, wherein said predetermined number of scale levels is determined based on the size of the 2D image.
18. The method of claim 16, wherein scaling said 2D image is continued until the scaled version of said 2D image is smaller than the size of said image window.
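The stopping rule of claim 18 also yields the number of scale levels in claim 17, since that number follows from the image size. A minimal sketch, assuming a constant shrink factor `step` between successive scales (the claims do not specify one; 2.0 is used here only to keep the arithmetic obvious):

```python
def scale_levels(width, height, window_w, window_h, step=2.0):
    """List the image sizes to search, stopping once the window no longer fits."""
    sizes = []
    scale = 1.0
    while True:
        w, h = int(width / scale), int(height / scale)
        if w < window_w or h < window_h:
            break  # scaled image is smaller than the window
        sizes.append((w, h))
        scale *= step
    return sizes


levels = scale_levels(256, 128, 32, 32)
```

A fixed-size window applied to each of these scaled images covers objects of progressively larger size in the original image.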
19. The method of claim 1, wherein applying said one or more view-based detectors in parallel includes the following for at least one of said one or more view-based detectors:
defining a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
selecting an image window, wherein said image window is configured to represent a fixed size area of said 2D image;
placing said image window at a plurality of locations within said 2D image;
for each attribute in a subset of said plurality of attributes, determining a corresponding attribute value at each of a plurality of coordinates within said image window at each of said plurality of locations;
for each attribute in said subset of said plurality of attributes, obtaining a first class-conditional probability for an object class and a second class-conditional probability for a non-object class at said each of said plurality of coordinates at said each of said plurality of locations based on said corresponding attribute values determined at said plurality of coordinates;
for each one of said plurality of locations of said image window, computing said likelihood ratio, wherein said likelihood ratio is a division of a first product and a second product, and wherein said first product includes a product of all of said first class-conditional probabilities and wherein said second product includes a product of all of said second class-conditional probabilities at corresponding one of said plurality of locations;
for each one of said plurality of locations of said image window, determining if said likelihood ratio at corresponding one of said plurality of locations is above said predetermined threshold value; and
for each one of said plurality of locations of said image window, estimating presence of said 3D object only if said likelihood ratio at corresponding one of said plurality of locations is above said predetermined threshold value.
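Claim 19 describes a pruning step: a subset of attributes is scored at every window location, and only locations above the threshold remain candidates. The sketch below adds a second pass with all attributes at the surviving locations, which is an assumption drawn from the coarse-to-fine strategy mentioned in the abstract rather than from this claim; `partial_score`, `full_score`, and the use of a single shared `threshold` are hypothetical.

```python
def coarse_to_fine(locations, partial_score, full_score, threshold):
    """Score a subset of attributes everywhere, all attributes only at survivors."""
    survivors = [loc for loc in locations if partial_score(loc) > threshold]
    return [loc for loc in survivors if full_score(loc) > threshold]


full_evaluations = []


def toy_partial(loc):
    # Cheap score from a subset of attributes.
    return loc - 5.0


def toy_full(loc):
    # Expensive score from all attributes; record how often it runs.
    full_evaluations.append(loc)
    return 1.0 if loc % 2 == 0 else -1.0


found = coarse_to_fine(range(10), toy_partial, toy_full, threshold=0.0)
```

In the toy run only four of the ten locations reach the expensive evaluation, which is where the saving comes from.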
20. The method of claim 1, wherein applying said one or more view-based detectors in parallel includes the following for at least one of said one or more view-based detectors:
defining a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
for each of said plurality of attributes, determining a corresponding attribute value at each of a plurality of coordinate locations within said 2D image;
selecting an image window, wherein said image window is configured to represent a fixed size area of said 2D image;
placing said image window at a first one of a plurality of locations within said 2D image;
for each of said plurality of attributes, selecting those corresponding attribute values that fall within said first location of said image window;
for each of said plurality of attributes, obtaining a first class-conditional probability for an object class and a second class-conditional probability for a non-object class based on said selected attribute values that fall within said first location of said image window;
estimating presence of the 3D object in said image window at said first location based on said comparison of said likelihood ratio to said predetermined threshold, wherein said likelihood ratio is defined by a ratio of a first product and a second product, wherein said first product includes a product of all of said first class-conditional probabilities and wherein said second product includes a product of all of said second class-conditional probabilities;
moving said image window to a second one of said plurality of locations within said 2D image; and
continuing selection of corresponding attribute values, determination of said first and said second class-conditional probabilities, and estimation of the presence of said 3D object in said image window at said second location and at each remaining location in said plurality of locations within said 2D image.
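Claim 20 differs from claim 10 in the order of operations: attribute values are determined once for the whole image, and each window position then selects the values that fall inside it. A minimal sketch for a single attribute, with `attribute_fn` as a hypothetical stand-in for sampling and quantizing the transform coefficients at a coordinate:

```python
def precompute_attribute_map(width, height, attribute_fn):
    """Evaluate the attribute once at every coordinate of the image."""
    return [[attribute_fn(x, y) for x in range(width)] for y in range(height)]


def window_values(attribute_map, left, top, win_w, win_h):
    """Select the precomputed values that fall inside one window position."""
    return [
        attribute_map[y][x]
        for y in range(top, top + win_h)
        for x in range(left, left + win_w)
    ]


calls = []


def toy_attribute(x, y):
    calls.append((x, y))
    return x + 10 * y


attribute_map = precompute_attribute_map(6, 4, toy_attribute)
first_window = window_values(attribute_map, 1, 1, 2, 2)
second_window = window_values(attribute_map, 2, 1, 2, 2)
```

Overlapping windows share most of their coordinates, so computing each attribute value once avoids repeating the same work at every window position.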
21. The method of claim 1, wherein said 3D object is a human face.
22. The method of claim 1, wherein said 3D object is a car.
23. The method of claim 1, further comprising placing a marker at said location of said 3D object upon detecting said location in said 2D image.
24. A computer-readable storage medium having stored thereon instructions, which, when executed by a processor, cause the processor to perform the following:
digitize a 2D (two dimensional) image, wherein said 2D image contains a 2D representation of a 3D (three dimensional) object;
compute a wavelet transform of said digitized version of said 2D image, wherein said wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents corresponding visual information from said 2D image;
place an image window of fixed size at a first plurality of locations within said 2D image;
evaluate a plurality of visual attributes at each of said first plurality of locations of said image window using corresponding transform coefficients to determine a likelihood ratio corresponding to said each of said first plurality of locations; and
estimate the presence of said 3D object in said 2D image based on a comparison of said corresponding likelihood ratio to a predetermined threshold value at said each of said first plurality of locations.
25. The computer-readable storage medium of claim 24 having stored thereon instructions, which, when executed by the processor, cause the processor to further perform the following:
evaluate a subset of said plurality of visual attributes at said each of said first plurality of locations of said image window using corresponding transform coefficients to determine said likelihood ratio corresponding to said each of said first plurality of locations; and
estimate the presence of said 3D object only at those of said first plurality of locations of said image window where said corresponding likelihood ratio for said subset of said plurality of visual attributes is above said predetermined threshold value.
26. The computer-readable storage medium of claim 24 having stored thereon instructions, which, when executed by the processor, cause the processor to further perform the following:
generate a scaled version of said 2D image;
place said image window of fixed size at a second plurality of locations within said scaled version of said 2D image;
evaluate said plurality of visual attributes at each of said second plurality of locations of said image window using corresponding transform coefficients to determine said likelihood ratio corresponding to said each of said second plurality of locations; and
estimate the presence of said 3D object in said scaled version of said 2D image based on a comparison of said corresponding likelihood ratio to said predetermined threshold value at said each of said second plurality of locations.
27. The computer-readable storage medium of claim 24 having stored thereon instructions, which, when executed by the processor, cause the processor to display said 2D image with a visual marker placed where the presence of said 3D object is estimated.
28. A computer system, which, upon being programmed, is configured to perform the following:
receive a digitized version of a 2D (two dimensional) image, wherein said 2D image contains a 2D representation of a 3D (three dimensional) object;
select one or more view-based detectors;
for each view-based detector, compute a wavelet transform of said digitized version of said 2D image, wherein said wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents corresponding visual information from said 2D image;
apply said one or more view-based detectors in parallel to respective plurality of transform coefficients, wherein each view-based detector is configured to:
compute a likelihood ratio based on visual information received from corresponding transform coefficients;
compare said likelihood ratio to a predetermined threshold value; and
detect a specific orientation of said 3D object in said 2D image based on said comparison of said likelihood ratio to said predetermined threshold value;
combine results of application of said one or more view-based detectors; and
determine orientation and location of said 3D object from said combination of results of application of said one or more view-based detectors.
29. The computer system of claim 28, which, upon being programmed, is further configured to perform the following for each of said one or more view-based detectors:
generate a scaled version of said 2D image;
place an image window of fixed size at a plurality of locations within said scaled version of said 2D image;
evaluate a plurality of visual attributes at each of said plurality of locations of said image window using corresponding transform coefficients to determine said likelihood ratio corresponding to said each of said plurality of locations; and
estimate the presence of said 3D object in said scaled version of said 2D image based on a comparison of said corresponding likelihood ratio to said predetermined threshold value at said each of said plurality of locations.
30. The computer system of claim 28, which, upon being programmed, is further configured to perform the following for each of said one or more view-based detectors:
select a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
place an image window at a first one of a plurality of locations within said 2D image, wherein said image window is configured to represent a fixed size area of said 2D image;
for each of said plurality of attributes, determine a corresponding attribute value at each of a plurality of coordinates within said image window at said first location;
for each of said plurality of attributes, compute a first class-conditional probability for an object class and a second class-conditional probability for a non-object class at said each of said plurality of coordinates based on said corresponding attribute values determined at said plurality of coordinates;
estimate the presence of the 3D object in said image window at said first location based on said comparison of said likelihood ratio to said predetermined threshold, wherein said likelihood ratio is defined by a ratio of a first product and a second product, wherein said first product includes a product of all of said first class-conditional probabilities and wherein said second product includes a product of all of said second class-conditional probabilities;
move said image window to a second one of said plurality of locations within said 2D image; and
continue determination of said corresponding attribute values and said first and said second class-conditional probabilities, and estimation of the presence of said 3D object in said image window at said second location and at each remaining location in said plurality of locations within said 2D image.
31. The computer system of claim 28, which, upon being programmed, is further configured to perform the following:
establish a communication link with a client computer over a communication network;
receive said digitized version of said 2D image from said client computer over said communication network;
determine the orientation and location of said 3D object in said 2D image received from said client computer; and
send a notification of said orientation and location of said 3D object to said client computer over said communication network.
32. The computer system of claim 28, which, upon being programmed, is further configured to perform the following for each of said one or more view-based detectors:
select a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
place an image window at a plurality of locations within said 2D image, wherein said image window is configured to represent a fixed size area of said 2D image;
for each attribute in a subset of said plurality of attributes, determine a corresponding attribute value at each of a plurality of coordinates within said image window at each of said plurality of locations;
for each attribute in said subset of said plurality of attributes, compute a first class-conditional probability for an object class and a second class-conditional probability for a non-object class at said each of said plurality of coordinates at said each of said plurality of locations based on said corresponding attribute values determined at said plurality of coordinates;
for each one of said plurality of locations of said image window, compute said likelihood ratio, wherein said likelihood ratio is a division of a first product and a second product, and wherein said first product includes a product of all of said first class-conditional probabilities and wherein said second product includes a product of all of said second class-conditional probabilities at corresponding one of said plurality of locations;
for each one of said plurality of locations of said image window, determine if said likelihood ratio at corresponding one of said plurality of locations is above said predetermined threshold value; and
for each one of said plurality of locations of said image window, estimate the presence of said 3D object only if said likelihood ratio at corresponding one of said plurality of locations is above said predetermined threshold value.
33. A method to detect presence of a 3D (three dimensional) object in a 2D (two dimensional) image containing a 2D representation of said 3D object, said method comprising:
receiving a digitized version of said 2D image;
selecting one or more view-based detectors;
for each view-based detector, computing a wavelet transform of said digitized version of said 2D image, wherein said wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents visual information from said 2D image that is localized in space, frequency, and orientation;
applying said one or more view-based detectors in parallel to respective plurality of transform coefficients, wherein each view-based detector is configured to detect a specific orientation of said 3D object in said 2D image based on visual information received from corresponding transform coefficients; and wherein applying said one or more view-based detectors in parallel includes the following for at least one of said one or more view-based detectors:
defining a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
selecting an image window, wherein said image window is configured to represent a fixed size area of said 2D image;
placing said image window at one of a plurality of locations within said 2D image;
selecting two correction factors, wherein each of said two correction factors is configured to correct the light intensity level for a corresponding half of said image window at said one of said plurality of locations;
selecting a predetermined number of correction values for each of said two correction factors;
for each of said two correction factors and for each of said predetermined number of correction values therefor, evaluating the total log-likelihood value for said plurality of attributes for the corresponding half of said image window at said one of said plurality of locations;
for each half of said image window at said one of said plurality of locations, selecting the largest total log-likelihood value for said plurality of attributes; and
adding corresponding largest total log-likelihood value for said each half of said image window to estimate the presence of said 3D object;
combining results of application of said one or more view-based detectors; and
determining orientation and location of said 3D object from said combination of results of application of said one or more view-based detectors.
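The lighting-correction steps recited in the preceding claim can be illustrated with a minimal sketch. It assumes, purely for illustration, that a correction value acts as a multiplicative scale on the coefficients of one half of the image window; the claim itself does not state how a correction value is applied. The names `half_score`, `window_score`, and `toy_log_likelihood` are hypothetical and do not come from the patent.

```python
def half_score(coeffs, correction, log_likelihood):
    # Total log-likelihood of one window half under one candidate correction.
    # ASSUMPTION: the correction scales the coefficients multiplicatively.
    return sum(log_likelihood(c * correction) for c in coeffs)


def window_score(left, right, corrections, log_likelihood):
    # For each half, try every candidate correction value and keep the
    # largest total log-likelihood, then add the two per-half maxima.
    best_left = max(half_score(left, k, log_likelihood) for k in corrections)
    best_right = max(half_score(right, k, log_likelihood) for k in corrections)
    return best_left + best_right


def toy_log_likelihood(value):
    # Hypothetical stand-in for a trained model: peaks when value == 1.0.
    return -(value - 1.0) ** 2


left_half = [0.5, 0.5]    # under-lit half of the window
right_half = [2.0, 2.0]   # over-lit half of the window
corrected = window_score(left_half, right_half, [0.5, 1.0, 2.0], toy_log_likelihood)
uncorrected = window_score(left_half, right_half, [1.0], toy_log_likelihood)
```

In this toy case each half independently picks the correction that best matches the model, so `corrected` is higher than `uncorrected`, which is restricted to a single correction value of 1.0.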
34. A method to detect presence of a 3D (three dimensional) object in a 2D (two dimensional) image containing a 2D representation of said 3D object, said method comprising:
receiving a digitized version of said 2D image;
selecting one or more view-based detectors;
for each view-based detector, computing a wavelet transform of said digitized version of said 2D image, wherein said wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents visual information from said 2D image that is localized in space, frequency, and orientation;
applying said one or more view-based detectors in parallel to respective plurality of transform coefficients, wherein each view-based detector is configured to detect a specific orientation of said 3D object in said 2D image based on visual information received from corresponding transform coefficients; and wherein applying said one or more view-based detectors in parallel includes the following for at least one of said one or more view-based detectors:
defining a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
selecting an image window, wherein said image window is configured to represent a fixed size area of said 2D image;
placing said image window at a first one of a first plurality of locations within said 2D image;
for each of said plurality of attributes, determining a corresponding attribute value at each of a first plurality of coordinates within said image window at said first location;
for each of said plurality of attributes, obtaining a first class-conditional probability for an object class and a second class-conditional probability for a non-object class at said each of said first plurality of coordinates based on said corresponding attribute values determined at said first plurality of coordinates;
estimating presence of the 3D object in said image window at said first location based on a ratio of a first product and a second product, wherein said first product includes a product of all of said first class-conditional probabilities and wherein said second product includes a product of all of said second class-conditional probabilities;
moving said image window to a second one of said first plurality of locations within said 2D image; and
continuing determination of said corresponding attribute values and said first and said second class-conditional probabilities, and estimation of the presence of said 3D object in said image window at said second location and at each remaining location in said first plurality of locations within said 2D image;
combining results of application of said one or more view-based detectors; and
determining orientation and location of said 3D object from said combination of results of application of said one or more view-based detectors.
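The ratio of products recited in the preceding claim may be easier to follow in code. The sketch below works in the log domain, where the ratio of two products becomes a sum of differences of logarithms. The table layout (dictionaries keyed by window coordinate and attribute value), the function names, and the toy probabilities are all assumptions made for illustration; a single attribute is used to keep the example short.

```python
import math


def log_likelihood_ratio(samples, p_object, p_non_object):
    # log( prod P(sample | object) / prod P(sample | non-object) )
    return sum(math.log(p_object[s]) - math.log(p_non_object[s]) for s in samples)


def scan(attr_map, win, p_object, p_non_object):
    # Place a win x win window at every location and score each placement.
    rows, cols = len(attr_map), len(attr_map[0])
    scores = {}
    for top in range(rows - win + 1):
        for left in range(cols - win + 1):
            # Each sample pairs a coordinate inside the window with the
            # attribute value found there.
            samples = [((r, c), attr_map[top + r][left + c])
                       for r in range(win) for c in range(win)]
            scores[(top, left)] = log_likelihood_ratio(samples, p_object, p_non_object)
    return scores


# Hypothetical tables: attribute value 1 is more likely under the object class.
coords = [(r, c) for r in range(2) for c in range(2)]
p_obj = {(xy, v): p for xy in coords for v, p in ((0, 0.2), (1, 0.8))}
p_non = {(xy, v): 0.5 for xy in coords for v in (0, 1)}

attr_map = [[0, 1, 1],
            [0, 1, 1]]
scores = scan(attr_map, 2, p_obj, p_non)
best_location = max(scores, key=scores.get)
```

Here the window placed at column 1 covers only attribute values of 1, so it receives the highest log-likelihood ratio. Summing logarithms rather than multiplying probabilities is a common way to avoid numerical underflow when many probabilities are combined.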
35. A method to detect presence of a 3D (three dimensional) object in a 2D (two dimensional) image containing a 2D representation of said 3D object, said method comprising:
receiving a digitized version of said 2D image;
selecting one or more view-based detectors;
for each view-based detector, computing a wavelet transform of said digitized version of said 2D image, wherein said wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents visual information from said 2D image that is localized in space, frequency, and orientation;
applying said one or more view-based detectors in parallel to respective plurality of transform coefficients, wherein each view-based detector is configured to detect a specific orientation of said 3D object in said 2D image based on visual information received from corresponding transform coefficients; and wherein applying said one or more view-based detectors in parallel includes the following for at least one of said one or more view-based detectors:
defining a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
selecting an image window, wherein said image window is configured to represent a fixed size area of said 2D image;
placing said image window at a plurality of locations within said 2D image;
for each attribute in a subset of said plurality of attributes, determining a corresponding attribute value at each of a plurality of coordinates within said image window at each of said plurality of locations;
for each attribute in said subset of said plurality of attributes, obtaining a first class-conditional probability for an object class and a second class-conditional probability for a non-object class at said each of said plurality of coordinates at said each of said plurality of locations based on said corresponding attribute values determined at said plurality of coordinates;
computing a plurality of ratios, wherein each ratio corresponds to a different one of said plurality of locations of said image window, wherein said each ratio is a division of a first product and a second product, and wherein said first product includes a product of all of said first class-conditional probabilities and wherein said second product includes a product of all of said second class-conditional probabilities at corresponding one of said plurality of locations of said image window;
determining which of said plurality of ratios are above a predetermined threshold value; and
estimating presence of said 3D object at only those of said plurality of locations where corresponding ratios are above said predetermined threshold value;
combining results of application of said one or more view-based detectors; and
determining orientation and location of said 3D object from said combination of results of application of said one or more view-based detectors.
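The thresholding step of the preceding claim can be sketched as follows. The ratio computed from a subset of the attributes is compared with a threshold, and the presence estimate is made only at the locations that pass. The function names and the toy scores are hypothetical, and the ratios are held as logarithms, which preserves the ordering against a correspondingly log-valued threshold.

```python
def detect_with_pruning(locations, subset_log_ratio, log_threshold, estimate_presence):
    # Evaluate the subset-based ratio at every location, but run the
    # presence estimate only where the ratio exceeds the threshold.
    results = {}
    for loc in locations:
        if subset_log_ratio(loc) > log_threshold:
            results[loc] = estimate_presence(loc)
    return results


# Hypothetical subset-based log ratios for five window locations.
subset_scores = {0: -2.0, 1: 0.5, 2: 3.0, 3: -0.1, 4: 1.2}
estimated_at = []  # records where the presence estimate was actually run


def toy_estimate(loc):
    # Stand-in for the presence estimate; here it simply reuses the score.
    estimated_at.append(loc)
    return subset_scores[loc]


results = detect_with_pruning(range(5), subset_scores.get, 0.0, toy_estimate)
```

In this toy run only three of the five locations reach the presence estimate, which is the source of the computational saving.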
36. A method to detect presence of a 3D (three dimensional) object in a 2D (two dimensional) image containing a 2D representation of said 3D object, said method comprising:
receiving a digitized version of said 2D image;
selecting one or more view-based detectors;
for each view-based detector, computing a wavelet transform of said digitized version of said 2D image, wherein said wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents visual information from said 2D image that is localized in space, frequency, and orientation;
applying said one or more view-based detectors in parallel to respective plurality of transform coefficients, wherein each view-based detector is configured to detect a specific orientation of said 3D object in said 2D image based on visual information received from corresponding transform coefficients; and wherein applying said one or more view-based detectors in parallel includes the following for at least one of said one or more view-based detectors:
defining a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
for each of said plurality of attributes, determining a corresponding attribute value at each of a plurality of coordinate locations within said 2D image;
selecting an image window, wherein said image window is configured to represent a fixed size area of said 2D image;
placing said image window at a first one of a plurality of locations within said 2D image;
for each of said plurality of attributes, selecting those corresponding attribute values that fall within said first location of said image window;
for each of said plurality of attributes, obtaining a first class-conditional probability for an object class and a second class-conditional probability for a non-object class based on said selected attribute values that fall within said first location of said image window;
estimating presence of the 3D object in said image window at said first location based on a ratio of a first product and a second product, wherein said first product includes a product of all of said first class-conditional probabilities and wherein said second product includes a product of all of said second class-conditional probabilities;
moving said image window to a second one of said plurality of locations within said 2D image; and
continuing selection of corresponding attribute values, determination of said first and said second class-conditional probabilities, and estimation of the presence of said 3D object in said image window at said second location and at each remaining location in said plurality of locations within said 2D image;
combining results of application of said one or more view-based detectors; and
determining orientation and location of said 3D object from said combination of results of application of said one or more view-based detectors.
37. A computer system, which, upon being programmed, is configured to perform the following:
receive a digitized version of a 2D (two dimensional) image, wherein said 2D image contains a 2D representation of a 3D (three dimensional) object;
select one or more view-based detectors;
for each view-based detector:
compute a wavelet transform of said digitized version of said 2D image, wherein said wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents corresponding visual information from said 2D image;
select a plurality of attributes, wherein each attribute is configured to sample and quantize each of a predetermined number of transform coefficients from said plurality of transform coefficients;
place an image window at a first one of a plurality of locations within said 2D image, wherein said image window is configured to represent a fixed size area of said 2D image;
for each of said plurality of attributes, determine a corresponding attribute value at each of a plurality of coordinates within said image window at said first location;
for each of said plurality of attributes, compute a first class-conditional probability for an object class and a second class-conditional probability for a non-object class at said each of said plurality of coordinates based on said corresponding attribute values determined at said plurality of coordinates;
estimate the presence of the 3D object in said image window at said first location based on a ratio of a first product and a second product, wherein said first product includes a product of all of said first class-conditional probabilities and wherein said second product includes a product of all of said second class-conditional probabilities;
move said image window to a second one of said plurality of locations within said 2D image; and
continue determination of said corresponding attribute values and said first and said second class-conditional probabilities, and estimation of the presence of said 3D object in said image window at said second location and at each remaining location in said plurality of locations within said 2D image;
apply said one or more view-based detectors in parallel to respective plurality of transform coefficients, wherein each view-based detector is configured to detect a specific orientation of said 3D object in said 2D image based on visual information received from corresponding transform coefficients;
combine results of application of said one or more view-based detectors; and
determine orientation and location of said 3D object from said combination of results of application of said one or more view-based detectors.
Description
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention broadly relates to image processing and image recognition, and more particularly, to a system and method for detecting presence of 3D (three dimensional) objects in a 2D (two dimensional) image containing 2D representation of the 3D objects.
2. Description of the Related Art
Object recognition is the problem of using computers to automatically locate objects in images, where an object can be any type of three dimensional physical entity such as a human face, automobile, airplane, etc. Object detection involves locating any object that belongs to a category such as the class of human faces, automobiles, etc. For example, a face detector would attempt to find all human faces in a photograph, but would not make finer distinctions such as identifying each face.
The challenge in object detection is coping with all the variations that can exist within a class of objects and the variations in visual appearance. FIG. 1A illustrates a picture slide 10 showing intra-class variations for human faces and cars. For example, cars vary in shape, size, coloring, and in small details such as the headlights, grill, and tires. Similarly, the class of human faces may contain human faces for males and females, young and old, bespectacled with plain eyeglasses or with sunglasses, etc. Also, the visual expression of a face may be different from human to human. One face may appear jovial whereas the other one may appear sad and gloomy. Visual appearance also depends on the surrounding environment and lighting conditions as illustrated by the picture slide 12 in FIG. 1B. Light sources will vary in their intensity, color, and location with respect to the object. Nearby objects may cast shadows on the object or reflect additional light on the object. Furthermore, the appearance of the object also depends on its pose; that is, its position and orientation with respect to the camera. FIG. 1C shows a picture slide 14 illustrating geometric variation among human faces. A person's race, age, gender, ethnicity, etc., may play a dominant role in defining the person's facial features. A side view of a human face will look much different than a frontal view.
Therefore, a computer-based object detector must accommodate all this variation and still distinguish the object from any other pattern that may occur in the visual world. For example, a human face detector must be able to find faces regardless of facial expression, variation from person to person, or variation in lighting and shadowing. Most methods for object detection use statistical modeling to represent this variability. Statistics is a natural way to describe a quantity that is not fixed or deterministic such as a human face. The statistical approach is also versatile. The same statistical model can potentially be used to build object detectors for different objects without re-programming.
Prior success in object detection has been limited to frontal face detection. Little success has been reported in detection of side (profile) views of faces or of other objects such as cars. Prior methods for frontal face detection include methods described in the following publications: (1) U.S. Pat. No. 5,642,431, titled "Network-based System And Method For Detection of Faces And The Like", issued on Jun. 24, 1997 to Poggio et al.; (2) U.S. Pat. No. 5,710,833, titled "Detection, Recognition And Coding of Complex Objects Using Probabilistic Eigenspace Analysis", issued on Jan. 20, 1998 to Moghaddam et al.; (3) U.S. Pat. No. 6,128,397, titled "Method For Finding All Frontal Faces In Arbitrarily Complex Visual Scenes", issued on Oct. 3, 2000 to Baluja et al.; (4) Henry A. Rowley, Shumeet Baluja, and Takeo Kanade, "Neural Network-Based Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1, January 1998, pp. 23-38; (5) Edgar Osuna, Robert Freund, and Federico Girosi, "Training Support Vector Machines: An Application To Face Detection", Conference on Computer Vision and Pattern Recognition, 1997, pp. 130-136; (6) M. C. Burl and P. Perona, "Recognition of Planar Object Classes", Conference on Computer Vision and Pattern Recognition, 1996, pp. 223-230; (7) H. Schneiderman and T. Kanade, "Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition", Conference on Computer Vision and Pattern Recognition, 1998, pp. 45-51; (8) L. Wiskott, J-M Fellous, N. Kruger, C. v. d. Malsburg, "Face Recognition by Elastic Bunch Graph Matching", IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:7, 1997, pp. 775-779; and (9) D. Roth, M-H Yang, and N. Ahuja, "A SNoW-Based Face Detector", NIPS-12 (Neural Information Processing Systems), 1999.
The methods discussed in publications (1) through (9) mentioned above differ primarily in the statistical model they use. The method of publication (1) represents object appearance by several prototypes consisting of a mean and a covariance about the mean. The method in publication (5) consists of a quadratic classifier. Such a classifier is mathematically equivalent to representation of each class by its mean and covariance. These methods as well as that of publication (2) emphasize statistical relationships over the full extent of the object. As a consequence, they compromise the ability to represent small areas in a rich and detailed way. The methods discussed in publications (3) and (4) address this limitation by decomposing the model in terms of smaller regions. The methods in publications (3) and (4) represent appearance in terms of approximately 100 inner products with portions of the image. Finally, the method discussed in publication (9) decomposes appearance further into a sum of independent models for each pixel.
However, the above methods are limited in that they represent the geometry of the object as a fixed rigid structure. These methods are also limited in their ability to accommodate differences in the relative distances between various features of a human face such as the eyes, nose, and mouth. Not only can these distances vary from person to person, but their projections into the image can vary with the viewing angle of the face. For this reason, these methods tend to fail for faces that are not fully frontal in posture. This limitation is addressed by the publications (6) and (8), which allow for small amounts of variation among small groups of hand-picked features such as the eyes, nose, and mouth. However, by using a small set of hand-picked features these representations have limited power. The method discussed in publication (7) allows for geometric flexibility with a more powerful representation by using richer features (each takes on a large set of values) sampled at regular positions across the full extent of the object. Each feature measurement is treated as statistically independent of all others. The disadvantage of this approach is that any relationship not explicitly represented by one of the features is not represented. Therefore, performance depends critically on the quality of the feature choices.
Finally, all of the above methods are structured such that the entire statistical model must be evaluated against the input image to determine if the object is present. This can be time consuming and inefficient. In particular, since the object can appear at any position and any size within the image, a detection decision must be made for every combination of possible object position and size within an image. It is therefore desirable to detect a 3D object in a 2D image over a wide range of variation in object location, orientation, and appearance. It is also desirable to perform the object detection in a computationally advantageous manner so as to conserve time and computing resources.
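The cost of deciding at every combination of position and size can be made concrete with a small counting sketch. All numbers here are hypothetical: a 64x64 image, a 32x32 fixed-size window, a one-pixel step between window positions, and a halving of the image between scales. None of these values is taken from the text above.

```python
def count_window_placements(width, height, window=32, stride=1):
    # Count fixed-size window placements over successively halved copies
    # of the image. Shrinking the image while keeping the window fixed is
    # one way to search over object size.
    # ASSUMPTION: the image is halved at each scale (hypothetical step).
    total = 0
    while width >= window and height >= window:
        across = (width - window) // stride + 1
        down = (height - window) // stride + 1
        total += across * down
        width, height = width // 2, height // 2
    return total


placements = count_window_placements(64, 64)
```

Even in this small case the full-resolution image alone contributes 33 x 33 = 1089 placements, each of which would require a complete evaluation of the statistical model under the prior methods.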
SUMMARY OF THE INVENTION
In one embodiment, the present invention contemplates a method to detect presence of a 3D (three dimensional) object in a 2D (two dimensional) image containing a 2D representation of the 3D object. The method comprises receiving a digitized version of the 2D image; selecting one or more view-based detectors; for each view-based detector, computing a wavelet transform of the digitized version of the 2D image, wherein the wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents visual information from the 2D image that is localized in space, frequency, and orientation; applying the one or more view-based detectors in parallel to respective plurality of transform coefficients, wherein each view-based detector is configured to detect a specific orientation of the 3D object in the 2D image based on visual information received from corresponding transform coefficients; combining results of application of the one or more view-based detectors; and determining orientation and location of the 3D object from the combination of results of application of the one or more view-based detectors.
In an alternative embodiment, the present invention contemplates a method of providing assistance in detecting the presence of a 3D object in a 2D image. The method comprises receiving a digitized version of the 2D image from a client site and over a communication network (e.g., the Internet); determining the location of the 3D object in the 2D image; and sending a notification of the location of the 3D object to the client site over the communication network.
In a still further embodiment, the present invention contemplates a computer-readable storage medium having stored thereon instructions, which, when executed by a processor, cause the processor to perform a number of tasks including the following: digitize a 2D image containing a 2D representation of a 3D object; compute a wavelet transform of the digitized version of the 2D image, wherein the wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents corresponding visual information from the 2D image; place an image window of fixed size at a first plurality of locations within the 2D image; evaluate a plurality of visual attributes at each of the first plurality of locations of the image window using corresponding transform coefficients; and estimate the presence of the 3D object in the 2D image based on evaluation of the plurality of visual attributes at the each of the first plurality of locations.
An object finder program according to the present invention improves upon existing methods of 3D object detection both in accuracy and computational properties. These improvements are based around the use of the wavelet transform for object detection. A pre-selected number of view-based detectors are trained on sample 2D images prior to performing the detection on an unknown 2D image. These detectors then operate on the given 2D input image and compute a quantized wavelet transform for the entire input image. The object detection then proceeds with sampling of the quantized wavelet coefficients at different image window locations on the input image and efficient look-up of pre-computed log-likelihood tables to determine object presence. The object finder's coarse-to-fine object detection strategy coupled with exhaustive object search across different positions and scales results in an efficient and accurate object detection scheme. The object finder detects a 3D object over a wide range in angular variation (e.g., 180 degrees) through the combination of a small number of detectors each specialized to a small range within this range of angular variation.
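The sequence of transform, quantization, and table look-up described above can be sketched in a few lines. This is an illustrative sketch only: a one-level Haar transform stands in for the filter bank actually used, and the three quantization levels, the step size, and the table contents are assumptions. The names `haar_level`, `quantize`, and `attribute_index` are hypothetical.

```python
def haar_level(image):
    # One level of a 2D Haar transform over 2x2 pixel blocks.
    # ASSUMPTION: Haar is used here only as a simple stand-in transform.
    out = {"avg": [], "row_diff": [], "col_diff": [], "diag": []}
    for r in range(0, len(image), 2):
        rows = {name: [] for name in out}
        for c in range(0, len(image[0]), 2):
            a, b = image[r][c], image[r][c + 1]
            d, e = image[r + 1][c], image[r + 1][c + 1]
            rows["avg"].append((a + b + d + e) / 4.0)
            rows["row_diff"].append((a + b - d - e) / 4.0)  # top vs. bottom row
            rows["col_diff"].append((a - b + d - e) / 4.0)  # left vs. right column
            rows["diag"].append((a - b - d + e) / 4.0)
        for name in out:
            out[name].append(rows[name])
    return out


def quantize(value, step=1.0):
    # Map a coefficient to one of three levels (assumed level count).
    if value < -step:
        return 0
    if value > step:
        return 2
    return 1


def attribute_index(coeffs, step=1.0, levels=3):
    # Combine several quantized coefficients into one table index.
    index = 0
    for c in coeffs:
        index = index * levels + quantize(c, step)
    return index


image = [[4, 4],
         [0, 0]]  # bright top row, dark bottom row
bands = haar_level(image)
coeffs = [bands["row_diff"][0][0], bands["col_diff"][0][0], bands["diag"][0][0]]
index = attribute_index(coeffs)

# Hypothetical pre-computed log-likelihood table with 3**3 entries.
table = [0.0] * 27
table[index] = 1.5
log_likelihood = table[index]
```

Because the attribute value is a small integer, evaluating it at detection time reduces to a single array look-up, which is what makes pre-computed tables attractive.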
The object finder may be trained to detect many different types of objects (e.g., airplanes, cats, trees, etc.) besides the human faces and cars as discussed hereinbelow. Some of the applications where the object finder may be used include: commercial image databases (e.g., stock photography) for automatically labeling and indexing images; an Internet-based image searching and indexing service; finding objects of military interest (e.g., mines, tanks, etc.) in satellite, radar, or visible imagery; as a tool for automatic description of the image content of an image database; to achieve accurate color balancing on human faces and remove red-eye from human faces in a digital photo development; for automatic adjustment of focus, contrast, and centering on human faces during digital photography; and enabling automatic zooming on human faces as part of a security and surveillance system.
BRIEF DESCRIPTION OF THE DRAWINGS
Further advantages of the present invention may be better understood by referring to the following description taken in conjunction with the accompanying drawings, in which:
FIGS. 1A-1C illustrate different challenges in object detection;
FIG. 2 illustrates an embodiment of a generalized operational flow for the object finder program according to the present invention;
FIG. 3 depicts an exemplary setup to utilize the object finder program according to the present invention;
FIG. 4 illustrates the decision-making involving a fixed object size, orientation, and alignment;
FIG. 5 shows the view-based classification approach utilized by the object finder program to detect object locations and orientations;
FIG. 6 is a real-life illustration of the object classification approach outlined in FIG. 5;
FIG. 7 shows an example of different orientations for human faces and cars that the object finder program is configured to model;
FIG. 8 depicts the general object detection approach used by the object finder program of the present invention;
FIG. 9 illustrates an exemplary histogram;
FIG. 10 shows a set of subbands produced by a wavelet transform based on a three-level decomposition of an input image using a 5/3 linear phase filter-bank;
FIG. 11 depicts an input image and its wavelet transform representation;
FIG. 12 shows a gradation of image details represented by a wavelet transform;
FIG. 13 shows three vertical subbands and three horizontal subbands in the wavelet decomposition shown in FIG. 11;
FIG. 14 shows seven intra-subband operators;
FIG. 15 shows three inter-orientation operators;
FIG. 16 shows six inter-frequency operators;
FIG. 17 shows one inter-frequency, inter-orientation operator;
FIG. 18 illustrates an example of how statistics for detectors are collected off-line using a set of training images;
FIGS. 19-22 illustrate how detectors are estimated using the AdaBoost algorithm;
FIG. 23 shows a simplified flow chart illustrating major operations performed by a view-based detector during detection of an object at a specific orientation;
FIG. 24 illustrates an input image along with its wavelet transform and quantized wavelet transform;
FIG. 25 illustrates how two local operators sample different arrangements of wavelet coefficients;
FIG. 26 shows a simplified illustration of how the overcomplete wavelet transform of an input image is generated;
FIG. 27 is a simplified illustration of three levels of wavelet transform coefficients in the image window-based object detection using the coarse-to-fine search strategy according to the present invention;
FIGS. 28-30 further illustrate the object detection process for one scale of the input image;
FIG. 31 illustrates the image scaling process as part of the overall object detection process shown in FIG. 23;
FIG. 32 further illustrates the details of the image scaling process and corresponding wavelet transform computation according to the present invention;
FIGS. 33-34 depict various images of humans with the object markers placed on the human faces detected by the object finder according to the present invention; and
FIGS. 35-36 illustrate various images of cars with the object markers placed on the cars detected by the object finder.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
FIG. 2 illustrates an embodiment of a generalized operational flow for the object finder program according to the present invention. The object finder program (simply, the "object finder") is represented by the block 18. A digitized image 16 is a typical input to the object finder 18, which operates on the image 16 and generates a list of object locations and orientations for the 3D objects represented in the 2D image 16. It is noted that the terms "image" and "digitized image" are used interchangeably hereinbelow. However, both of these terms are used to refer to a 2D image (e.g., a photograph) containing two dimensional representations of one or more 3D objects (e.g., human faces, cars, etc.). In one embodiment, as discussed hereinbelow in more detail, the object finder 18 may place object markers 52 (FIG. 6) on each object detected in the input image 16 by the object finder 18. The input image may be an image file digitized in one of many possible formats including, for example, a BMP (bitmap) file format, a PGM (Portable Grayscale bitMap graphics) file format, a JPG (Joint Photographic Experts Group) file format, or any other suitable graphic file format. In a digitized image, each pixel is represented as a set of one or more bytes corresponding to a numerical representation (e.g., a floating point number) of the light intensity measured by a camera at the sensing site.
FIG. 3 depicts an exemplary setup to utilize the object finder program 18 according to the present invention. An object finder terminal or computer 22 may execute or "run" the object finder program application 18 when instructed by a user. The digitized image 16 may first be displayed on the computer terminal or monitor display screen and, after application of the object finder program, a marked-up version of the input image (e.g., picture slide 50 in FIG. 6) may be displayed on the display screen of the object finder terminal 22. The program code for the object finder program application 18 may be initially stored on a portable data storage medium, e.g., a floppy diskette 24, a compact disc 26, a data cartridge tape (not shown) or any other magnetic or optical data storage medium. The object finder computer 22 may include appropriate disk drives to receive the portable data storage medium and to read the program code stored thereon, thereby facilitating execution of the object finder software. The object finder software 18, upon execution by the computer 22, may cause the computer 22 to perform a variety of data processing and display tasks including, for example, analysis and processing of the input image 16, display of a marked-up version of the input image 16 identifying locations and orientations of one or more 3D objects in the input image 16 detected by the object finder 18, transmission of the marked-up version of the input image 16 to a remote computer site 28 (discussed in more detail hereinbelow), etc.
As illustrated in FIG. 3, in one embodiment, the object finder computer terminal 22 may be remotely accessible from a client computer site 28 via a communication network 30. In one embodiment, the communication network 30 may be an Ethernet LAN (local area network) connecting all the computers within a facility, e.g., a university research laboratory or a corporate data processing center. In that case, the object finder terminal 22 and the client computer 28 may be physically located at the same site, e.g., a university research laboratory or a photo processing facility. In alternative embodiments, the communication network 30 may include, independently or in combination, any of the present or future wireline or wireless data communication networks, e.g., the Internet, the PSTN (public switched telephone network), a cellular telephone network, a WAN (wide area network), a satellite-based communication link, a MAN (metropolitan area network) etc.
The object finder computer 22 may be, e.g., a personal computer (PC), a graphics workstation, or a computer chip embedded as part of a machine or mechanism (e.g., a computer chip embedded in a digital camera, in a traffic control device, etc.). Similarly, the computer (not shown) at the remote client site 28 may also be capable of viewing and manipulating digital image files transmitted by the object finder terminal 22. In one embodiment, as noted hereinbefore, the client computer site 28 may also include the object finder terminal 22, which can function as a server computer and can be accessed by other computers at the client site 28 via a LAN. Each computer--the object finder computer 22 and the remote computer (not shown) at the client site 28--may include requisite data storage capability in the form of one or more volatile and non-volatile memory modules. The memory modules may include RAM (random access memory), ROM (read only memory) and HDD (hard disk drive) storage. Memory storage is desirable in view of sophisticated image processing and graphics display performed by the object finder terminal 22 as part of the object detection process.
Before discussing how the object detection process is performed by the object finder software 18, it is noted that the arrangement depicted in FIG. 3 may be used to provide a commercial, network-based object detection service that may perform customer-requested object detection in real time or near real time. For example, the object finder program 18 at the computer 22 may be configured to detect human faces in photographs or pictures remotely submitted to it over the communication network 30 (e.g., the Internet) by an operator at the client site 28. The client site 28 may be a photo processing facility specializing in removal of "red eyes" from photographs. In that case, the object finder computer 22 may first automatically detect all human faces in the photographs submitted and send the detection results to the client computer site 28, which can then automatically remove the red spots on the faces pointed out by the object finder program 18. Thus, the whole process can be automated. As another example, the object finder computer 22 may be a web server running the object finder software application 18. The client site 28 may be in the business of providing commercial image databases. The client site 28 may automatically search and index images on the world wide web as |