The ellipses show lines of equal probability density of the Gaussian. Figure 4.13: Two bivariate normal distributions whose priors are exactly the same. Figure 4.10: The covariance matrix for two features that have exactly the same variance. When normal distributions with a diagonal covariance matrix that is just a constant multiplied by the identity matrix are plotted, their clusters of points about the mean are spherical in shape.
The covariance matrix for two features x and y is diagonal, and x and y have exactly the same variance. As in the univariate case, this is equivalent to determining the region for which gi(x) is the maximum of all the discriminant functions. Expansion of the quadratic form yields

gi(x) = −(1/(2σ²))[xᵀx − 2μiᵀx + μiᵀμi] + ln P(wi)

The regions are separated by decision boundaries, surfaces in feature space where ties occur among the largest discriminant functions.
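As a concrete sketch of the discriminant-function rule above (all numbers hypothetical), the classifier simply evaluates every gi(x) and picks the largest; here each gi takes the equal-variance form gi(x) = −(x − μi)²/(2σ²) + ln P(wi):

```python
import math

import numpy as np

def classify(x, discriminants):
    """Assign x to the class whose discriminant g_i(x) is largest."""
    return int(np.argmax([g(x) for g in discriminants]))

# Hypothetical 1-D example with equal variances, so
# g_i(x) = -(x - mu_i)^2 / (2 sigma^2) + ln P(w_i).
sigma2 = 1.0
means = [0.0, 3.0]
priors = [0.5, 0.5]
gs = [lambda x, m=m, p=p: -(x - m) ** 2 / (2 * sigma2) + math.log(p)
      for m, p in zip(means, priors)]

print(classify(0.5, gs))  # 0: x lies closer to the first mean
print(classify(2.9, gs))  # 1: x lies closer to the second mean
```

With equal priors and equal variances this rule reduces to choosing the nearest mean, which is exactly the minimum-distance behavior discussed later in the section.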
We can consider p(x|wj) a function of wj (i.e., the likelihood function) and then form the likelihood ratio p(x|w1)/p(x|w2). Suppose further that we measure the lightness of a fish and discover that its value is x. This means that we allow for the situation where the color of a fruit may covary with its weight, but the way in which it does so is exactly the same for apples as it is for oranges. If this is true, then the covariance matrices will be identical.
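The likelihood ratio p(x|w1)/p(x|w2) can be computed directly once the class-conditional densities are specified; the sketch below assumes hypothetical univariate normal lightness densities for the two classes:

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def likelihood_ratio(x, mu1, sigma1, mu2, sigma2):
    """p(x|w1) / p(x|w2) for two univariate normal class densities."""
    return normal_pdf(x, mu1, sigma1) / normal_pdf(x, mu2, sigma2)

# Hypothetical lightness densities: w1 centered at 4.0, w2 at 7.0.
x = 5.0
ratio = likelihood_ratio(x, mu1=4.0, sigma1=1.0, mu2=7.0, sigma2=1.0)
print(ratio > 1.0)  # True: the measurement favors w1 over w2
```

With equal variances the ratio here equals exp(1.5); a ratio greater than 1 means x is more probable under w1 than under w2.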
From the multivariate normal density formula in Eq. 4.27, notice that the density is constant on surfaces where the squared distance (Mahalanobis distance) (x − μ)ᵀS⁻¹(x − μ) is constant. Regardless of whether the prior probabilities are equal or not, it is not actually necessary to compute distances. Because the state of nature is so unpredictable, we consider w to be a variable that must be described probabilistically. Thus, to minimize the average probability of error, we should select the wi that maximizes the posterior probability P(wi|x).
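A minimal sketch of the squared Mahalanobis distance (x − μ)ᵀS⁻¹(x − μ), using a hypothetical diagonal covariance matrix; it shows that points at equal Euclidean distance from the mean need not lie on the same constant-density surface:

```python
import numpy as np

def mahalanobis_sq(x, mu, S):
    """Squared Mahalanobis distance (x - mu)^T S^{-1} (x - mu)."""
    d = x - mu
    return float(d @ np.linalg.inv(S) @ d)

mu = np.array([0.0, 0.0])
S = np.array([[2.0, 0.0],
              [0.0, 0.5]])  # more spread along the first axis

# Both points are at Euclidean distance 1 from the mean, yet their
# Mahalanobis distances differ because the variances differ:
print(mahalanobis_sq(np.array([1.0, 0.0]), mu, S))  # 0.5
print(mahalanobis_sq(np.array([0.0, 1.0]), mu, S))  # 2.0
```

The direction with larger variance contributes less to the distance, which is why the constant-density surfaces are ellipses stretched along the high-variance axis.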
Since it is quite likely that we may not be able to measure features that are independent, this section allows an arbitrary covariance matrix for the density of each class. To understand how this tilting works, suppose that the distributions for class i and class j are bivariate normal and that the variance of feature 1 is σ1² and that of feature 2 is σ2² (Pattern Recognition for Human Computer Interface, Lecture Notes, web site, http://www-engr.sjsu.edu/~knapp/HCIRODPR/PR-home.htm). Let us reconsider the hypothetical problem posed in Chapter 1 of designing a classifier to separate two kinds of fish: sea bass and salmon.
The contour lines are stretched out in the x direction to reflect the fact that the distance spreads out at a lower rate in the x direction than it does in the y direction. Using the general discriminant function for the normal density, the constant terms are removed. Of the various forms in which the minimum-error-rate discriminant function can be written, the following two are particularly convenient:

gi(x) = p(x|wi)P(wi)

gi(x) = ln p(x|wi) + ln P(wi)

Instead, x and y have the same variance, but x varies with y in the sense that x and y tend to increase together.
The continuous univariate normal density is given by

p(x) = (1/(√(2π)σ)) exp[−(x − μ)²/(2σ²)]

Because both |Si| and the (d/2) ln 2π terms in Eq. 4.41 are independent of i, they can be ignored as superfluous additive constants. But since w = μi − μj, the hyperplane which separates Ri and Rj is orthogonal to the line that links their means. Instead, the boundary line will be tilted depending on how the two features covary and their respective variances (see Figure 4.19).
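The orthogonality claim can be checked numerically: with identical spherical covariances and equal priors, the boundary is the hyperplane through the midpoint x0 = (μi + μj)/2 with normal vector w = μi − μj. The means below are hypothetical:

```python
import numpy as np

# Hypothetical class means for the equal-covariance, equal-prior case.
mu_i = np.array([1.0, 1.0])
mu_j = np.array([4.0, 3.0])

w = mu_i - mu_j           # normal to the separating hyperplane
x0 = (mu_i + mu_j) / 2.0  # the boundary passes through the midpoint

def on_boundary(x):
    """A point is on the boundary when w^T (x - x0) = 0."""
    return np.isclose(w @ (x - x0), 0.0)

# Moving from x0 in any direction orthogonal to w stays on the boundary:
v = np.array([-w[1], w[0]])       # perpendicular to w
print(on_boundary(x0 + 2.5 * v))  # True
print(on_boundary(mu_i))          # False: mu_i lies on class i's side
```

Because the normal w is exactly the difference of the means, the boundary is orthogonal to the segment joining them, as the text states.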
Thus, it does not work well depending upon the values of the prior probabilities. Figure 4.14: As the priors change, the decision boundary through point x0 shifts away from the more common class mean (two-dimensional Gaussian distributions). Given the covariance matrix S of a Gaussian distribution, the eigenvectors of S are the principal directions of the distribution, and the eigenvalues are the variances of the corresponding principal directions. Thus, we obtain the equivalent linear discriminant functions

gi(x) = wiᵀx + wi0,  where wi = S⁻¹μi and wi0 = −(1/2)μiᵀS⁻¹μi + ln P(wi)
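A sketch of these linear discriminant functions for the shared-covariance case, with hypothetical means, covariance, and priors:

```python
import math

import numpy as np

def linear_discriminant(mu, Sigma, prior):
    """Build g_i(x) = w_i^T x + w_i0 for the shared-covariance case:
    w_i = Sigma^{-1} mu_i,  w_i0 = -0.5 mu_i^T Sigma^{-1} mu_i + ln P(w_i)."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ mu
    w0 = -0.5 * (mu @ Sinv @ mu) + math.log(prior)
    return lambda x: float(w @ x + w0)

# Hypothetical shared covariance with correlated features.
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
g1 = linear_discriminant(np.array([0.0, 0.0]), Sigma, 0.5)
g2 = linear_discriminant(np.array([3.0, 3.0]), Sigma, 0.5)

x = np.array([0.5, 0.2])
print(g1(x) > g2(x))  # True: x lies near the first mean
```

Because the quadratic term xᵀS⁻¹x is the same for every class, it cancels, leaving discriminants that are linear in x and hence linear (hyperplane) decision boundaries.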
The classifier is said to assign a feature vector x to class wi if gi(x) > gj(x) for all j ≠ i. For the minimum-error-rate case, we can simplify things further by taking gi(x) = P(wi|x), so that the maximum discriminant function corresponds to the maximum posterior probability. The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk. Then the posterior probability can be computed by Bayes formula as:

P(wj|x) = p(x|wj)P(wj) / p(x),  where the evidence p(x) = Σj p(x|wj)P(wj)
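Bayes formula turns likelihoods and priors into posteriors by normalizing with the evidence p(x); a minimal sketch with hypothetical normal class-conditional densities:

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def posteriors(x, likelihoods, priors):
    """P(w_j|x) = p(x|w_j) P(w_j) / p(x), with p(x) the sum of the joints."""
    joint = [lik(x) * p for lik, p in zip(likelihoods, priors)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical two-class problem with equal priors.
liks = [lambda x: normal_pdf(x, 0.0, 1.0),
        lambda x: normal_pdf(x, 2.0, 1.0)]
post = posteriors(0.2, liks, [0.5, 0.5])

print(post[0] > post[1])  # True: x = 0.2 is more probably class 0
```

The posteriors always sum to 1 by construction, and choosing the class with the largest posterior is exactly the minimum-error-rate rule gi(x) = P(wi|x).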
Figure 4.21: Two bivariate normals with completely different covariance matrices, showing a hyperquadratic decision boundary. Figure 4.5: Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean. Then consider making a measurement at point P in Figure 4.17: Figure 4.17: The discriminant function evaluated at P is smaller for class apple than it is for class orange.
The position of x0 is affected in the exact same way by the a priori probabilities. For this reason, the decision boundary is tilted. Figure 4.18: The contour lines are elliptical in shape because the covariance matrix is not diagonal. One of the most useful is in terms of a set of discriminant functions gi(x), i = 1, …, c.
If we can find a boundary such that the constant of proportionality is 0, then the risk is independent of priors. Does the tilting of the decision boundary from the orthogonal direction make intuitive sense? This approach is based on quantifying the tradeoffs between various classification decisions using probability and the costs that accompany such decisions. Figure 4.11: The covariance matrix for two features that have exactly the same variances, but x varies with y in the sense that x and y tend to increase together.
When this happens, the optimum decision rule can be stated very simply: the decision rule is based entirely on the distance from the feature vector x to the different mean vectors. If we view matrix A as a linear transformation, an eigenvector represents an invariant direction in the vector space. Figure 4.9: The covariance matrix for two features x and y that do not covary, but feature x varies more than feature y. Therefore, the covariance matrix for both classes would be diagonal, being merely σ² times the identity matrix I.
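The minimum-distance rule above can be sketched as a nearest-mean classifier (the means below are hypothetical); it is the optimal rule when every class has covariance σ²I and the priors are equal:

```python
import numpy as np

def nearest_mean_class(x, means):
    """Minimum-distance rule: assign x to the class whose mean is closest.
    Optimal when S_i = sigma^2 I for all classes and priors are equal."""
    dists = [np.linalg.norm(x - m) for m in means]
    return int(np.argmin(dists))

# Three hypothetical class means.
means = [np.array([0.0, 0.0]),
         np.array([5.0, 0.0]),
         np.array([0.0, 5.0])]

print(nearest_mean_class(np.array([1.0, 1.0]), means))  # 0
print(nearest_mean_class(np.array([4.0, 1.0]), means))  # 1
```

With unequal priors the same rule still applies after adding ln P(wi) to each (negative, scaled) squared distance, which shifts the boundary toward the less probable class.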
If the prior probabilities P(wi) are the same for all c classes, then the ln P(wi) term can be ignored. Geometrically, this corresponds to the situation in which the samples fall in equal-size hyperspherical clusters, the cluster for the ith class being centered about the mean vector μi (see Figure 4.12). If we assume there are no other types of fish relevant here, then P(w1) + P(w2) = 1. If the catch produced as much sea bass as salmon, we would say that the next fish is equally likely to be sea bass or salmon.
This leads to the requirement that the quadratic form wᵀSw never be negative. Samples from normal distributions tend to cluster about the mean, and the extent to which they spread out depends on the variance (Figure 4.4).
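The nonnegativity of wᵀSw can be verified through the eigenvalues of S, since a valid covariance matrix is positive semidefinite; the matrix below is a hypothetical example:

```python
import numpy as np

# A valid covariance matrix is positive semidefinite: w^T S w >= 0 for
# every w, equivalently all eigenvalues of S are nonnegative.
S = np.array([[2.0, 0.8],
              [0.8, 1.0]])

eigvals, eigvecs = np.linalg.eigh(S)
print(np.all(eigvals >= 0))  # True for a valid covariance matrix

# Spot-check the quadratic form on random directions:
rng = np.random.default_rng(0)
ok = all(w @ S @ w >= 0 for w in rng.normal(size=(100, 2)))
print(ok)  # True
```

The eigenvectors returned here are the principal directions of the distribution and the eigenvalues the variances along them, matching the eigenvector discussion earlier in the section.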