Which probability distribution is best




















Another is an extreme value distribution, which can also be altered to generate either positive or negative skewness, depending upon whether the extreme outcomes are the maximum positive or minimum negative values (see Figure 6A). There are often natural limits on the values that data can take on, and using a distribution that does not respect these limits can create problems.

When data are constrained, the questions that need to be answered are whether the constraints apply to one side of the distribution or to both, and what the limits on values are. Once these questions have been answered, there are two choices. One is to find a continuous distribution that conforms to these constraints. For instance, the lognormal distribution can be used to model data, such as revenues and stock prices, that are constrained to never fall below zero.

For data that have both upper and lower limits, you could use the uniform distribution if the probabilities are even across outcomes, or a triangular distribution if the data cluster around a central value. An alternative approach is to take a continuous distribution that normally allows data to assume any value and to impose upper and lower limits on it, as in the sketch below.
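One way to impose such limits, for illustration, is to truncate an otherwise unbounded distribution. Here is a minimal sketch using scipy's truncated normal; the mean, standard deviation, and limits are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical assumptions: roughly normal around 10 with a standard
# deviation of 3, but physically constrained to lie in [0, 25].
mu, sigma = 10.0, 3.0
lower, upper = 0.0, 25.0

# truncnorm expects its bounds expressed in standard deviations from the mean.
a, b = (lower - mu) / sigma, (upper - mu) / sigma
constrained = stats.truncnorm(a, b, loc=mu, scale=sigma)

samples = constrained.rvs(size=10_000, random_state=42)
print(samples.min(), samples.max())  # every draw falls inside [0, 25]
```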

Note that the cost of imposing these constraints is small in distributions like the normal, where the probabilities of extreme values are very small, but it increases as the distribution exhibits fatter tails. As we noted in the earlier section, a key consideration in choosing a distribution to describe the data is the likelihood of extreme values relative to the middle values. In the case of the normal distribution, this likelihood is small, and it increases as you move to the logistic and Cauchy distributions.

While it may often be more realistic to use the latter distributions to describe real-world data, the benefits of a better distributional fit have to be weighed against the ease with which parameters can be estimated from the normal distribution.

Consequently, it may make sense to stay with the normal distribution for symmetric data, unless the likelihood of extreme values increases above a threshold.

The same considerations apply for skewed distributions, though the concern will generally be more acute for the skewed side of the distribution.

In other words, with a positively skewed distribution, the question of which distribution to use will depend upon how much more likely large positive values are than large negative values, with the candidates ranging from the lognormal to the exponential. In summary, the question of which distribution best fits the data cannot be answered without looking at whether the data are discrete or continuous, symmetric or asymmetric, and where the outliers lie.

The simplest test of distributional fit is visual: compare a histogram of the actual data with the fitted distribution. Consider Figure 6A. The distributions are so clearly divergent that the normal distribution assumption does not hold up.

A slightly more sophisticated test is to compute the moments of the actual data distribution (the mean, the standard deviation, skewness and kurtosis) and examine how well they match the chosen distribution. With the price-earnings data above, for instance, the moments of the distribution and key statistics are summarized in Table 6A.
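As an illustration, this moment check can be done in a few lines. The sample below is a synthetic, positively skewed stand-in for the price-earnings data, which are not reproduced here.

```python
import numpy as np
from scipy import stats

# Synthetic, positively skewed stand-in for the price-earnings sample.
rng = np.random.default_rng(0)
pe_ratios = rng.lognormal(mean=3.0, sigma=0.5, size=500)

print("mean              :", np.mean(pe_ratios))
print("standard deviation:", np.std(pe_ratios, ddof=1))
print("skewness          :", stats.skew(pe_ratios))      # 0 for a normal distribution
print("excess kurtosis   :", stats.kurtosis(pe_ratios))  # 0 for a normal distribution
```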

Table 6A: moments of the current PE sample versus the normal distribution (mean, standard deviation, skewness, kurtosis). Since the normal distribution has no skewness and zero excess kurtosis, we can easily reject the hypothesis that price-earnings ratios are normally distributed. The typical tests of goodness of fit compare the empirical cumulative distribution function of the data with the cumulative distribution function of the candidate distribution, and either accept or reject the hypothesis that the chosen distribution fits the data.

Not surprisingly, given its widespread use, there are more tests for normality than for any other distribution. The Kolmogorov-Smirnov test is one of the oldest goodness-of-fit tests [2]. Improved versions include the Shapiro-Wilk and Anderson-Darling tests.

In each case, the test statistic is calculated and the p-value associated with it determined. The null hypothesis is that the data fit the specified distribution: a low p-value rejects that assumption, meaning the data do not fit the distribution, while a high p-value is consistent with the data fitting the distribution.
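For illustration, here is a minimal scipy sketch of such tests on the synthetic price-earnings stand-in from the previous sketch. The Kolmogorov-Smirnov and Shapiro-Wilk tests are shown because scipy reports p-values for them directly (its Anderson-Darling routine returns critical values rather than p-values).

```python
import numpy as np
from scipy import stats

# Synthetic, positively skewed stand-in for the price-earnings sample.
rng = np.random.default_rng(0)
pe_ratios = rng.lognormal(mean=3.0, sigma=0.5, size=500)

# Kolmogorov-Smirnov test against a normal distribution with parameters
# estimated from the sample. (Estimating parameters from the same data biases
# the p-value slightly; Lilliefors-type corrections address this.)
mu, sigma = np.mean(pe_ratios), np.std(pe_ratios, ddof=1)
ks_stat, ks_p = stats.kstest(pe_ratios, "norm", args=(mu, sigma))

# Shapiro-Wilk test of normality.
sw_stat, sw_p = stats.shapiro(pe_ratios)

print(f"Kolmogorov-Smirnov: statistic={ks_stat:.3f}, p-value={ks_p:.4f}")
print(f"Shapiro-Wilk:       statistic={sw_stat:.3f}, p-value={sw_p:.4f}")
# Small p-values reject the hypothesis that the data are normally distributed.
```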

Consider Table 2. The p-values for the Anderson-Darling statistic are given in the third column, and the fourth column lists the p-value for the likelihood ratio test (LRT). Note that there is an LRT value only when two distributions from the same family are compared, e.g., a two-parameter distribution and its three-parameter counterpart. In these cases, the second distribution is created by the addition of a threshold parameter.

The LRT determines whether the addition of the threshold parameter produces a significant improvement in fit. The smaller the p-value in the LRT column, the more likely it is that the extra parameter created a significant improvement in fit.
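To illustrate the mechanics (not necessarily the exact procedure behind Table 2), the sketch below compares a two-parameter lognormal, with the threshold fixed at zero, against a three-parameter lognormal with an estimated threshold, on simulated data.

```python
import numpy as np
from scipy import stats

# Simulated positive data with a true threshold of 5 (illustration only).
rng = np.random.default_rng(1)
data = 5.0 + rng.lognormal(mean=1.0, sigma=0.4, size=300)

# Restricted model: two-parameter lognormal (threshold/location fixed at 0).
shape0, loc0, scale0 = stats.lognorm.fit(data, floc=0)
ll_restricted = np.sum(stats.lognorm.logpdf(data, shape0, loc0, scale0))

# Full model: three-parameter lognormal (threshold estimated as well).
shape1, loc1, scale1 = stats.lognorm.fit(data)
ll_full = np.sum(stats.lognorm.logpdf(data, shape1, loc1, scale1))

# Likelihood ratio statistic; one extra parameter, so one degree of freedom.
lrt = 2.0 * (ll_full - ll_restricted)
p_value = stats.chi2.sf(lrt, df=1)
print(f"LRT statistic = {lrt:.2f}, p-value = {p_value:.4f}")
# A small p-value suggests the threshold parameter significantly improves the fit.
```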

The three-parameter lognormal distribution has a small LRT p-value, which implies that the extra threshold parameter improved the fit. The three-parameter gamma distribution has a larger p-value, which implies that the extra parameter did not improve the fit significantly. The fifth column contains the Akaike information criterion (AIC) value. You can use the AIC to select the distribution that best fits the data.

The distribution with the smallest AIC value is usually the preferred model. AIC is defined as AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood. Note that the AIC value for a single distribution does not tell us anything on its own; it is not a test like the p-value from the Anderson-Darling statistic. The AIC only compares the relative quality of the candidate distributions, so if none of them fits the data well, the AIC value will not reveal this. You need to combine the p-values for the Anderson-Darling statistic, the LRT, and the AIC value to determine which distribution fits the data best.
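A minimal sketch of this comparison, using scipy maximum likelihood fits on simulated data (the candidates and the data here are illustrative, not those in Table 2):

```python
import numpy as np
from scipy import stats

# Simulated positive-valued sample (illustrative only); substitute your own data.
rng = np.random.default_rng(2)
data = rng.weibull(1.8, size=400) * 10.0

candidates = {
    "Weibull (2p)":   (stats.weibull_min, {"floc": 0}),
    "lognormal (2p)": (stats.lognorm,     {"floc": 0}),
    "gamma (2p)":     (stats.gamma,       {"floc": 0}),
    "normal":         (stats.norm,        {}),
}

for name, (dist, fixed) in candidates.items():
    params = dist.fit(data, **fixed)              # maximum likelihood fit
    loglik = np.sum(dist.logpdf(data, *params))   # maximized log-likelihood
    k = len(params) - len(fixed)                  # fixed parameters are not estimated
    aic = 2 * k - 2 * loglik                      # AIC = 2k - 2 ln(L)
    print(f"{name:<14} AIC = {aic:.1f}")

# The smallest AIC marks the relatively best candidate; it says nothing about
# whether any candidate fits the data well in an absolute sense.
```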

Based on the results, it appears that the Weibull and the three-parameter Weibull both fit the data well, while the Smallest Extreme Value distribution fits the data the worst. There are also visual methods you can use to judge the fit. One is to overlay the probability density function (pdf) of the fitted distribution on a histogram of the data. Figure 3 shows this for the Weibull distribution: the pdf does seem to follow the histogram, an indication that the Weibull distribution fits the data. A sketch of such an overlay follows.
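This sketch uses simulated data rather than the data behind Figure 3, and a two-parameter Weibull fit.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated sample (illustrative only); substitute your own data.
rng = np.random.default_rng(3)
data = rng.weibull(1.8, size=400) * 10.0

# Fit a two-parameter Weibull (threshold fixed at zero) and overlay its pdf
# on a density-scaled histogram of the data.
shape, loc, scale = stats.weibull_min.fit(data, floc=0)
x = np.linspace(data.min(), data.max(), 200)

plt.hist(data, bins=30, density=True, alpha=0.5, label="data")
plt.plot(x, stats.weibull_min.pdf(x, shape, loc, scale), label="fitted Weibull pdf")
plt.legend()
plt.show()
```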

In the corresponding plot for the Smallest Extreme Value distribution, the pdf does not appear to overlay the histogram very well, an indication that this distribution does not fit the data. Another visual check is to construct a P-P (probability-probability) plot, which plots the empirical cumulative probabilities against those of the fitted distribution. If the P-P plot is close to a straight line, the specified distribution fits the data.

Figure 5 shows the P-P plot for the Weibull distribution results: the points fall along the straight line, indicating that the distribution does fit the data. In the P-P plot for the Smallest Extreme Value distribution, by contrast, the points do not fall along the straight line, another indication that this distribution does not fit the data. A sketch of how such a plot can be constructed follows.
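This sketch again uses simulated data and a fitted two-parameter Weibull, not the sample behind Figure 5.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Simulated sample (illustrative only); substitute your own data.
rng = np.random.default_rng(3)
data = np.sort(rng.weibull(1.8, size=400) * 10.0)

# Fitted-distribution cumulative probabilities versus empirical plotting
# positions ((i - 0.5) / n) for the sorted data.
shape, loc, scale = stats.weibull_min.fit(data, floc=0)
theoretical = stats.weibull_min.cdf(data, shape, loc, scale)
empirical = (np.arange(1, len(data) + 1) - 0.5) / len(data)

plt.plot(theoretical, empirical, "o", markersize=3)
plt.plot([0, 1], [0, 1], "--")  # reference line: perfect agreement
plt.xlabel("fitted cumulative probability")
plt.ylabel("empirical cumulative probability")
plt.show()
```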

At this point you have determined which distribution fits your data best. This is an important step, and it is easy to do with software. But you should have a reason for using a certain distribution; it must make sense in terms of your process. For example, the Weibull distribution is widely used in reliability and life data analysis. If this is the distribution that fits the data best, does it make sense in terms of your process? It may not always be possible, but you should have a reason to believe that the data fit a certain distribution, beyond the numbers saying that this is the best distribution.

In the example above, there is probably very little difference between how well the Weibull and Gamma distributions fit the data.

Which one makes the most sense for your process? Now that it has been determined that the Weibull distribution fits the data best, we can perform a non-normal process capability analysis. A previous publication covered how to do this.
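That method is not reproduced here, but as an illustration the sketch below shows one common percentile-based capability calculation for a fitted Weibull distribution. The data and the specification limit are hypothetical, and this is not necessarily the calculation performed by the SPC for Excel software.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and upper specification limit (illustrative only).
rng = np.random.default_rng(4)
data = rng.weibull(1.8, size=400) * 2.0
usl = 7.0

# Fit the chosen Weibull distribution and take its percentiles.
shape, loc, scale = stats.weibull_min.fit(data, floc=0)
median = stats.weibull_min.ppf(0.5, shape, loc, scale)
p99865 = stats.weibull_min.ppf(0.99865, shape, loc, scale)  # upper 0.135% point

# Percentile-based upper capability index, analogous to Ppu for normal data.
ppu = (usl - median) / (p99865 - median)
print(f"Ppu = {ppu:.2f}")
```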

The upper specification limit is 7. The SPC for Excel software was used to generate the non-normal process capability analysis, and the chart is shown in Figure 7. Not where you want to be for your PPAP! Back to work on reducing variation in your process. More generally, when the conditions of a variable match the conditions of the binomial distribution, the binomial distribution is the correct distribution type for that variable.

If historical data are available, use distribution fitting to select the distribution that best describes your data. The feature is described in detail in Fitting Distributions to Data. You can also populate a custom distribution with your historical data.

After you select a distribution type, determine the parameter values for the distribution. Each distribution type has its own set of parameters.

For example, there are two parameters for the binomial distribution: trials and probability. The conditions of a variable contain the values for these parameters; in the example used, the conditions specify the number of trials and the probability of success. In addition to the standard parameter set, each continuous distribution except the uniform also lets you choose from alternate parameter sets, which substitute percentiles for one or more of the standard parameters.
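As a small illustration of the binomial's two parameters (the trial count and probability below are hypothetical, since the values from the original example are not reproduced here):

```python
from scipy import stats

# Hypothetical binomial parameters: number of trials and probability of success.
trials, probability = 50, 0.1
dist = stats.binom(n=trials, p=probability)

print("P(exactly 5 successes) :", dist.pmf(5))
print("P(5 or fewer successes):", dist.cdf(5))
print("mean, variance         :", dist.mean(), dist.var())
```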

One popular risk management metric used in investing is value-at-risk (VaR). VaR yields the minimum loss that can occur at a given probability over a given time frame for a portfolio; alternatively, an investor can obtain the probability of losing a given amount over a given time frame. Misuse of and overreliance on VaR have been implicated as major contributors to the financial crisis. As a simple example of a probability distribution, let us look at the number observed when rolling two standard six-sided dice.
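That distribution can be enumerated directly, since each of the 36 ordered outcomes is equally likely; a short sketch:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Probability distribution of the sum of two standard six-sided dice.
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
n_outcomes = sum(counts.values())  # 36 equally likely ordered outcomes

for s in sorted(counts):
    print(s, Fraction(counts[s], n_outcomes))
# The distribution is symmetric and peaks at 7, with probability 6/36 = 1/6.
```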





