How to summarize the data in a table? – Palin Analytics
After categorizing the variables, there are various ways to summarize the data. Some of them are:
- Frequency distribution
- Grouped frequency distribution
- Cumulative frequency distribution
Frequency distribution: an overview of all the distinct values in a variable and how many times each one occurs. It tells us how frequencies are distributed over the values and is used to summarize categorical variables. In simple words, it gives the number of times each value of the variable occurs.
Frequency distribution example:
We will use the same banking data for reference. I want to know how many customers hold each number of credit cards. Refer to variable no. 4, the number of credit cards: customers hold between 1 and 10 cards, and the count for each value shows how frequencies are distributed over the values. Now the question arises: how do we calculate the number of customers who hold 1 credit card? In Excel there is a very basic formula, COUNTIF, which gives the exact number of customers who hold 1 credit card.
In this formula we have to mention the range, meaning all customers from row 1 up to row n, and the number of credit cards we want to count.
In this picture we can see that we calculated the number of customers with 1 credit card, i.e. 150.
Here we can see the total number of customers, which should equal the number of rows in our data; if it does not, there are some missing values.
The next question is which number of credit cards is the most common, and the answer is 4: 660 customers hold 4 credit cards.
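The same counting can be sketched outside Excel. The snippet below is only an illustration in Python: the list `cards_per_customer` is made-up sample data, not the banking data used in this article, and the variable names are assumptions chosen for the example.

```python
from collections import Counter

# Hypothetical sample: number of credit cards held by each customer.
cards_per_customer = [1, 4, 2, 4, 3, 1, 4, 2, 4, 5]

# Rough equivalent of COUNTIF(range, 1): customers holding exactly 1 card.
customers_with_one_card = cards_per_customer.count(1)

# Full frequency distribution: distinct value -> number of customers.
frequency = Counter(cards_per_customer)

# The frequencies should add up to the number of rows; a shortfall
# would point to missing values in the column.
total_customers = sum(frequency.values())

# The most common value and how many customers have it.
most_common_value, most_common_count = frequency.most_common(1)[0]

for value in sorted(frequency):
    print(value, frequency[value])
```

Here `Counter` builds the whole table in one pass, while `list.count` answers the single COUNTIF-style question.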
Grouped frequency distribution: grouping the distribution into certain brackets is called a grouped frequency distribution. Suppose we want to know how salary is distributed. In our data there is a variable, annual salary, which ranges from low to high, and I want to group it. How do we do that? First, find the min and max salary, which tell us the lowest and highest salary in the data.
Then find the range (max - min). If we want a 10-bucket distribution (which will help us in classification models as well), the bucket size will be range/10.
Here we want to do the grouping in steps of 1 lakh per annum in salary.
Next we calculate the number of customers that fall in each group.
Now you can see where the maximum number of customers fall, i.e. 341, in the 8 lakhs to 9 lakhs group. You may ask why we do bucketing: grouping makes the significant patterns in the data easier to see.
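The bucketing steps above can be sketched in Python as follows. This is a minimal illustration, assuming made-up salaries in rupees; `grouped_frequency`, `LAKH` and the sample list are hypothetical names and data, not part of the original workbook.

```python
from collections import Counter

LAKH = 100_000  # 1 lakh = 100,000 rupees

# Hypothetical annual salaries in rupees.
salaries = [250_000, 310_000, 480_000, 820_000, 890_000,
            845_000, 600_000, 950_000, 1_200_000, 870_000]

def grouped_frequency(values, width):
    """Count values per bracket [lower, lower + width).

    Brackets that contain no values are simply absent from the result.
    """
    counts = Counter((v // width) * width for v in values)
    return {lower: counts[lower] for lower in sorted(counts)}

# Range and the bucket size for a 10-bucket distribution (range / 10).
salary_range = max(salaries) - min(salaries)
ten_bucket_width = salary_range // 10

# Grouping in steps of 1 lakh, as in the article.
by_lakh = grouped_frequency(salaries, LAKH)

# Lower bound of the bracket holding the most customers.
busiest_lower = max(by_lakh, key=by_lakh.get)

for lower, count in by_lakh.items():
    print(f"{lower // LAKH} to {lower // LAKH + 1} lakhs: {count}")
```

Integer arithmetic is used on purpose so that a salary sitting exactly on a bracket boundary always lands in the upper bracket, which avoids floating-point surprises.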
Cumulative frequency: the cumulative frequency is the sum of the frequency of the current class and all previous classes. In simple words, if you are paid 3000 for the 1st week, 5000 for the 2nd week and 4500 for the 3rd week, the cumulative figure after the 3rd week is the total of the current and previous weeks, 12500 (= 3000 + 5000 + 4500). In our data, the value in the last row is equal to the total number of customers. You can see this in the image below.
In this image you can see that the 1st row contains the same value as its frequency, the 2nd row contains the total of the 1st and 2nd rows, and the last row contains the total number of customers in our data.
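As a small sketch of the running total described above, Python's `itertools.accumulate` can be applied to the weekly pay example and to a list of bracket counts. The `bracket_counts` values are hypothetical and only stand in for the frequency column of a grouped table.

```python
from itertools import accumulate

# Weekly pay from the example in the text.
weekly_pay = [3000, 5000, 4500]
running_total = list(accumulate(weekly_pay))

# Hypothetical frequencies per salary bracket, lowest bracket first.
bracket_counts = [1, 1, 1, 1, 4, 1, 1]
cumulative = list(accumulate(bracket_counts))

print(running_total)  # [3000, 8000, 12500]
print(cumulative)     # [1, 2, 3, 4, 8, 9, 10]
```

The first entry equals the first frequency, and the last entry equals the total number of observations, which matches the check on the last row described in the text.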
Some of the other frequency distributions are:
- Ungrouped frequency distribution
- Relative frequency distribution
- Relative cumulative frequency distribution
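For the relative variants, a minimal sketch in Python follows. It assumes the usual definitions (each frequency divided by the total, and each cumulative frequency divided by the total); the `counts` list is made-up sample data.

```python
from itertools import accumulate

# Hypothetical frequencies for five distinct values.
counts = [2, 2, 1, 4, 1]
total = sum(counts)

# Relative frequency: share of all observations in each class.
relative = [c / total for c in counts]

# Relative cumulative frequency: cumulative count divided by the total.
relative_cumulative = [c / total for c in accumulate(counts)]

print(relative)
print(relative_cumulative)
```

Dividing the cumulative counts by the total, rather than adding up the relative frequencies, keeps the last value at exactly 1.0.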
In this way we can summarize the data.