5.1.4 Distribution Fit



Summary

Knowing the distribution model of your data helps you choose the right analysis or make estimates from the data. The Distribution Fit tool helps you examine the distribution of your data and estimate the parameters of that distribution.

What you will learn

This tutorial will show you:

User Story

A house builder is trying to decide how many new houses he should build in the next year based on past sales of houses in the surrounding area. He would like to know the following:


To solve this problem, the house builder needs to:

Choosing Distributions

  1. Start with a new project or a new workbook. Import the data file: \Samples\Statistics\HouseSold.dat
  2. Highlight column B, then select Plot > Statistical > Histogram from the Origin menu.
    Dist fit histogram.png
  3. Consider the following facts when choosing candidate distributions:

Performing Distribution Fit

  1. Return to the HouseSold worksheet and highlight column B. From the menu bar, select Statistics: Descriptive Statistics: Distribution Fit.
  2. In the dialog that opens, go to the Distributions tab, clear Normal, and select the following three distributions, based on the conclusions in the Choosing Distributions section.
    DOC-2411 Dist fit dlg settings a.png
  3. In the Plots tab, select Probability Plot.
    DOC-2411 Dist fit dlg settings b.png
  4. In the Goodness of Fit tab, check all three methods. Click OK to apply the settings and close the dialog.
    DOC-2411 Dist fit dlg settings c.png
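
To make the idea behind the fit more concrete, here is a minimal Python sketch of what estimating lognormal parameters and checking the fit can look like. It is not Origin's implementation: it assumes maximum likelihood estimation, uses a Kolmogorov-Smirnov distance as one common goodness-of-fit measure (not necessarily one of the three methods in the dialog), and runs on a hypothetical `sales` list rather than the contents of HouseSold.dat. The helper names `fit_lognormal` and `ks_statistic` are made up for this illustration.

```python
import math
from statistics import NormalDist, fmean

# Hypothetical yearly sales counts; NOT the contents of HouseSold.dat.
sales = [38, 45, 52, 61, 47, 70, 55, 43, 66, 58]

def fit_lognormal(values):
    """Return (mu, sigma) that maximize the lognormal likelihood."""
    logs = [math.log(v) for v in values]
    mu = fmean(logs)
    # The maximum likelihood estimate uses the 1/n variance of the log-values.
    sigma = math.sqrt(fmean([(y - mu) ** 2 for y in logs]))
    return mu, sigma

def ks_statistic(values, mu, sigma):
    """Largest gap between the empirical CDF and the fitted lognormal CDF."""
    fitted = NormalDist(mu, sigma)  # distribution of log(X)
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = fitted.cdf(math.log(x))
        # Compare against the empirical CDF just after and just before x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

mu, sigma = fit_lognormal(sales)
print(f"mu={mu:.5f} sigma={sigma:.5f} D={ks_statistic(sales, mu, sigma):.5f}")
```

A smaller distance D suggests a closer agreement between the data and the fitted model. The values reported by Origin for the real data set should be taken from the Report Sheet, not from this sketch.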

Comparing and Selecting Fitting Models

We can compare and select a fitting model based on the following results of distribution fit:

From the Probability (P-P) Plot and the Goodness of Fit Tests table, we can conclude that lognormal and gamma are both good choices. Here we choose lognormal as an example for further analysis.
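
For readers unfamiliar with P-P plots, the following Python sketch shows how the plotted points can be computed: each sorted observation is paired with its empirical cumulative probability and the cumulative probability predicted by the fitted model. The plotting position (i - 0.5)/n is an assumption and may differ from the score method Origin uses; the `sales` list is hypothetical, and `pp_points` is a name invented for this example. The mu and sigma values are the lognormal estimates quoted later in this tutorial.

```python
import math
from statistics import NormalDist

# Hypothetical yearly sales counts; NOT the contents of HouseSold.dat.
sales = [38, 45, 52, 61, 47, 70, 55, 43, 66, 58]

def pp_points(values, mu, sigma):
    """Return (empirical, theoretical) cumulative probability pairs."""
    fitted = NormalDist(mu, sigma)  # distribution of log(X)
    xs = sorted(values)
    n = len(xs)
    # Assumed plotting position: (i - 0.5) / n for the i-th smallest value.
    return [((i - 0.5) / n, fitted.cdf(math.log(x)))
            for i, x in enumerate(xs, start=1)]

points = pp_points(sales, 3.94262, 0.35614)
max_dev = max(abs(e - t) for e, t in points)
print(f"largest deviation from the diagonal: {max_dev:.4f}")
```

The closer the points lie to the diagonal line y = x, the better the candidate distribution describes the data.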

Making Estimations

Once the best distribution model is found, we can use the CDF and INV functions to answer the house builder's questions:

  1. To answer the first question, open the Command Window or the Script Window from the Windows menu, and type the command below:
    logncdf(80, 3.94262, 0.35614) =
    where 3.94262 is mu and 0.35614 is sigma, obtained from the Parameter Estimates table in the Report Sheet.
    Dist fit results parameter estimates.png
  2. You will get
    logncdf(80, 3.94262, 0.35614) =  0.89136185728793
    We can conclude that if the house builder builds 80 new houses, there is an 89% probability that he will NOT sell all of those houses.
  3. To answer the second question, run the script below in the Command Window or the Script Window:
    logninv(1-0.6, 3.94262, 0.35614) =
  4. You will get
    logninv(1-0.6, 3.94262, 0.35614) =  47.105650533425
    We can conclude that if the house builder builds 47 new houses, there is about a 60% probability that he will sell all of them.

We chose the lognormal model in the Comparing and Selecting Fitting Models section, so we use logncdf and logninv for the estimation. Had we chosen gamma, we could use gamcdf and gaminv instead, which would lead to a similar conclusion.
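
The two lognormal estimates above can be cross-checked outside Origin. The Python sketch below relies on the fact that if X is lognormal with parameters mu and sigma, then log(X) is normal with mean mu and standard deviation sigma. The Python functions `logncdf` and `logninv` are stand-ins written for this example that mirror the names of the Origin functions; they are not Origin's code.

```python
import math
from statistics import NormalDist

# Estimates taken from the Parameter Estimates table quoted in this tutorial.
MU, SIGMA = 3.94262, 0.35614

def logncdf(x, mu, sigma):
    """P(X <= x) for a lognormal X with log-scale parameters mu, sigma."""
    return NormalDist(mu, sigma).cdf(math.log(x))

def logninv(p, mu, sigma):
    """Value x such that P(X <= x) = p for the same lognormal X."""
    return math.exp(NormalDist(mu, sigma).inv_cdf(p))

# Probability that fewer than 80 houses are demanded (not all 80 sell).
print(logncdf(80, MU, SIGMA))
# Number of houses with a 60% chance of demand exceeding it.
print(logninv(1 - 0.6, MU, SIGMA))
```

The printed values should agree with the Origin results above to within rounding, about 0.8914 and 47.1 respectively.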

Notes: The Distribution Fit results also include other descriptive statistics and graphs that give you a quick overview of your data:
  • Descriptive Statistics table
  • Quantiles table
  • Histogram
  • Box Chart
  • CDF Plot (Cumulative Distribution Function)