Basic Fundamentals of Statistics for Data Science – Palin Analytics
Data science deals with the properties of data and the statistical relationships within it. A statistical relationship has no fixed formula, so we look for patterns in the data instead.
For example, consider the relationship between the speed and the mileage of a car. There is no fixed formula that calculates mileage from speed. At any given speed there are many possible mileage values, because different drivers of the same car model get different results at that speed. The data therefore forms a scatter of points rather than a single line, which makes this a statistical relationship, not a deterministic one.
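The difference can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the baseline mileage formula and the amount of driver-to-driver noise are invented, not measured from real cars.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def distance_km(speed_kmh, hours):
    # Deterministic relationship: the same inputs always give the same output.
    return speed_kmh * hours


def observed_mileage(speed_kmh):
    # Hypothetical statistical relationship: a made-up baseline plus
    # random variation standing in for differences between drivers.
    baseline = 20.0 - 0.05 * abs(speed_kmh - 60)
    return baseline + rng.gauss(0, 1.5)


# Five simulated drivers, all driving the same model at 60 km/h.
readings = [observed_mileage(60) for _ in range(5)]

print(distance_km(60, 2))   # always the same value
print(readings)             # a different value for each driver
```

The deterministic function returns one value per input, while the simulated mileage gives a spread of values at the same speed, which is the pattern the paragraph above describes.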
STATISTICAL CONCEPTS FOR BUSINESS ANALYTICS
The scope of exploration is limited to the data we have, so data science looks for patterns that come from the inherent nature of that data. To draw inferences from the data, we use statistical and mathematical concepts such as central tendency and distribution.
MEAN, MEDIAN, MODE
The mean is the average of all numbers and is sometimes called the arithmetic mean. To calculate mean, add together all of the numbers in a set and then divide the sum by the total count of numbers.
The statistical median is the middle number in a sequence of numbers. To find the median, arrange the numbers in order by size; the number in the middle is the median. If the set has an even count of numbers, the median is the average of the two middle numbers.
The mode is the number that occurs most often within a set of numbers. Mode helps identify the most common or frequent occurrence of a characteristic. It is possible to have two modes (bimodal), three modes (trimodal) or more modes within larger sets of numbers.
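As a minimal sketch, the three measures can be computed with Python's standard `statistics` module (`multimode` needs Python 3.8 or later). The `scores` list is a made-up example, not data from this post.

```python
from statistics import mean, median, multimode

# Hypothetical data set, chosen only for illustration.
scores = [2, 3, 3, 5, 7, 10]

avg = mean(scores)          # sum of the values divided by how many there are
middle = median(scores)     # even count, so the average of the two middle values
most_common = multimode(scores)  # list of the most frequent value(s)

print(avg)          # 5
print(middle)       # 4.0
print(most_common)  # [3]

# A set with two equally frequent values is bimodal.
print(multimode([1, 2, 2, 3, 3, 4]))  # [2, 3]
```

`multimode` is used here instead of `mode` because it returns every most-frequent value, which matches the bimodal and trimodal cases mentioned above.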
Mean, median and mode: these three measures describe the central tendency of the data. A measure of central tendency is a single value that represents the whole data set. For the mean in particular, if every number in the data is replaced by the mean, the total and the average stay the same.
For example, suppose the data contains the following numbers:
20 30 40 50 60
The average of this data is 40, so if we replace each of these numbers with 40, the average is still 40. So 40 is the central tendency of this data.
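The replacement idea can be checked directly. This is a small sketch using the five numbers from the example; the variable names are only illustrative.

```python
from statistics import mean

data = [20, 30, 40, 50, 60]

center = mean(data)                # 40
replaced = [center] * len(data)    # every value swapped for the mean

print(sum(data), sum(replaced))    # 200 200
print(mean(replaced))              # 40
```

The total and the average are unchanged after the replacement, which is why the mean can stand in for the whole data set.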
Standard deviation is a measure used to evaluate the amount of variation or dispersion in a dataset. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.
Data collected through a survey is just a set of alphanumeric values, which can be strings or numbers. We can measure the central tendency of this data and see its dispersion. Mean, median and mode are used to measure central tendency, whereas standard deviation is used to measure dispersion.
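To make low and high dispersion concrete, here is a short sketch comparing two data sets that share the same mean. The `tight` list is a made-up example, and treating each list as a complete population (so `pstdev` rather than the sample version `stdev`) is an assumption.

```python
from statistics import mean, pstdev

tight = [38, 39, 40, 41, 42]   # hypothetical values close to the mean
wide = [20, 30, 40, 50, 60]    # values spread over a wider range

# Both sets have the same central tendency.
print(mean(tight), mean(wide))    # 40 40

# Population standard deviation: square root of the average squared
# distance from the mean.
print(round(pstdev(tight), 2))    # 1.41
print(round(pstdev(wide), 2))     # 14.14
```

The means are identical, so only the standard deviation reveals that the second set is far more spread out.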
CENTRAL LIMIT THEOREM
When many samples of size N are taken from a given population, the distribution of the sample means (the average value of each sample) is approximately a normal distribution, whatever the original distribution of the population. This result is called the central limit theorem.
Population and Samples
In statistics, the population means all the members of a certain group, whereas a sample is a subset of the population.
Given a large enough sample size, the sampling distribution of the sample mean will be approximately a normal distribution, irrespective of the distribution of the original population (provided its variance is finite).
According to the central limit theorem, the means of samples cluster more tightly around the mean of the overall population as the sample size increases, whether the underlying data is normally distributed or not.
As a general rule, sample sizes equal to or greater than 30 are considered sufficient for the central limit theorem to hold, meaning the sample means are approximately normally distributed.
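The theorem can be explored with a small simulation. This is only a sketch under stated assumptions: the population is an exponential distribution with rate 1.0 (mean 1, standard deviation 1), picked because it is clearly skewed, and the number of repeated samples and the seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)   # fixed seed so the run is repeatable

SAMPLE_SIZE = 30          # the rule-of-thumb sample size
NUM_SAMPLES = 2000        # assumed number of repeated samples


def draw_sample_mean():
    # Draw one sample from the skewed population and return its mean.
    sample = [rng.expovariate(1.0) for _ in range(SAMPLE_SIZE)]
    return mean(sample)


sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

center = mean(sample_means)    # should be near the population mean, 1.0
spread = stdev(sample_means)   # should be near 1 / sqrt(30)

# For a normal distribution, roughly 68% of values lie within one
# standard deviation of the mean.
within_one_sd = sum(abs(m - center) <= spread for m in sample_means) / NUM_SAMPLES

print(center, spread, 1 / sqrt(SAMPLE_SIZE), within_one_sd)
```

Even though the individual values are skewed, the sample means should centre on the population mean with a spread close to the population standard deviation divided by the square root of the sample size. Rerunning with a larger `SAMPLE_SIZE` should shrink that spread further.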
I hope you liked this post. If you have any questions, feel free to comment below.
For more, visit: https://palin.co.in/