The histogram is one of the most commonly used charts in machine learning. It applies to continuous variables such as sales, age, salary, profit, or number of customers, and can be plotted with the built-in hist() function of a pandas DataFrame.

You can plot a histogram for any column in your data that is continuous in nature, that is, one that can take any value between a minimum and a maximum.

A very common mistake is to plot a histogram for a categorical column just because it contains numbers, e.g. Gender (1/0) or Ticket Priority (1/2/3/4/5). The correct graph for a categorical column is a bar chart.
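As a rough illustration of the bar chart alternative, the sketch below counts the rows in each category and plots one bar per category. The Gender values are made up for this example and are not taken from any real dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data: Gender is stored as 1/0, but it is categorical, not continuous
df = pd.DataFrame({"Gender": [1, 0, 1, 1, 0, 1, 0, 1]})

# Count how many rows fall into each category
gender_counts = df["Gender"].value_counts().sort_index()
print(gender_counts)

# Bar chart: one bar per category, bar height = number of rows
fig, ax = plt.subplots()
gender_counts.plot(kind="bar", ax=ax, title="Gender")
ax.set_xlabel("Gender")
ax.set_ylabel("Number of rows")
```

In a script, call plt.show() afterwards to display the chart; in a Jupyter notebook it normally renders on its own.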

A histogram helps you understand the distribution of values in a single column. For example, consider data that contains three continuous columns (Salary, Age, and Cibil) and one categorical column (Approve_Loan). You can visualize the distribution of the continuous columns Salary, Age, and Cibil using histograms.
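A minimal sketch of how this could look with hist() is shown here. The column names follow the example above, but the values, the number of bins, and the figure size are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical loan data; every value here is made up for illustration
loan_data = pd.DataFrame({
    "Salary": [32000, 45000, 51000, 38000, 62000, 47000, 55000, 41000, 58000, 49000],
    "Age": [23, 24, 27, 22, 31, 24, 29, 25, 33, 26],
    "Cibil": [620, 700, 745, 655, 790, 710, 760, 680, 775, 730],
    "Approve_Loan": ["No", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes"],
})

# Histogram for a single column; hist() returns an array of matplotlib Axes
single_axes = loan_data.hist(column="Age", bins=5)

# Histograms for several continuous columns at once, laid out side by side
multi_axes = loan_data.hist(
    column=["Salary", "Age", "Cibil"], bins=5, figsize=(12, 4), layout=(1, 3)
)
```

The categorical column Approve_Loan is deliberately left out of both calls. In a script, call matplotlib.pyplot.show() afterwards to display the figures.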

Sample Output:

[Figure: Histogram for a single column]
[Figure: Histograms for multiple columns in data]

The X-axis of a histogram represents the range of values present in the column, divided into intervals (bins). The Y-axis represents the frequency, that is, how many values fall into each interval. For example, in the histogram for the Age column you can see that there are four values between 22.5 and 25.0 years. In the same way, you can read off how many values fall in each range.
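To make the relationship between the ranges on the X-axis and the frequencies on the Y-axis concrete, the sketch below computes the counts directly with numpy.histogram instead of drawing them. The ages and the bin width of 2.5 years are assumptions for illustration, so the counts will not match the figure above.

```python
import numpy as np

# Hypothetical ages, made up for illustration
ages = np.array([23, 24, 27, 22, 31, 24, 29, 25, 33, 26])

# Split the range 20 to 35 into bins of width 2.5 and count the values in each bin
counts, bin_edges = np.histogram(ages, bins=np.arange(20, 37.5, 2.5))

# Each pair of neighbouring edges is one bar; the count is the height of that bar
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:4.1f} to {right:4.1f}: {count} value(s)")
```

Every value lands in exactly one bin, so the counts add up to the number of rows in the column.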

A histogram gives you the following information:

  1. The spread of the data: what are the minimum and maximum values in the column?
  2. The central tendency of the data: a rough idea of the median and mean values.
  3. The skewness of the data: whether there is a long tail or outliers on the left side (negatively skewed) or on the right side (positively skewed).
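The three points above can also be checked numerically. The following sketch uses standard pandas Series methods on a made-up Age column; the values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical ages, made up for illustration
age = pd.Series([23, 24, 27, 22, 31, 24, 29, 25, 33, 26], name="Age")

# 1. Spread: minimum and maximum
print("Min:", age.min(), "Max:", age.max())

# 2. Central tendency: mean and median
print("Mean:", age.mean(), "Median:", age.median())

# 3. Skewness: below 0 suggests a longer left tail, above 0 a longer right tail
print("Skewness:", age.skew())
```

A mean that sits above the median, as it does here, is a typical sign of a longer tail on the right side.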

What is the ideal output of the histogram?

The ideal output of a histogram is a shape like a bell curve, which indicates that the data is approximately normally distributed.

For example, if you generate 100 random Age values normally distributed around a mean of 30 years, plotting their histogram will produce a shape close to a bell curve.
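One possible way to generate such data is sketched below with NumPy's random generator. The mean of 30 and the sample size of 100 come from the text; the standard deviation of 5, the seed, and the number of bins are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Fixed seed so the example is reproducible (the seed value is arbitrary)
rng = np.random.default_rng(seed=42)

# 100 Age values around a mean of 30 years; the standard deviation of 5 is an assumption
ages = pd.Series(rng.normal(loc=30, scale=5, size=100), name="Age")

# Plot the histogram on a fresh figure
fig, ax = plt.subplots()
ages.hist(bins=10, ax=ax)
ax.set_xlabel("Age")
ax.set_ylabel("Frequency")
```

With only 100 values the bars will be somewhat uneven, so expect a rough bell shape rather than a perfectly smooth curve.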

This is the type of output you would like to see from the histogram of any continuous column. Slight deviations from this normal curve are acceptable, but if the deviation is too large, then either outlier treatment is required or the column is rejected.

Sample Output:

[Figure: Histogram for normally distributed data]

Lead Data Scientist

Farukh is an innovator in solving industry problems using Artificial intelligence. His expertise is backed with 10 years of industry experience. Being a senior data scientist he is responsible for designing the AI/ML solution to provide maximum gains for the clients. As a thought leader, his focus is on solving the key business problems of the CPG Industry. He has worked across different domains like Telecom, Insurance, and Logistics. He has worked with global tech leaders including Infosys, IBM, and Persistent systems. His passion to teach inspired him to create this website!