Introduction to statistics
Data science is among the highest-paying jobs of the 21st century, with an average salary of around $120,000 per year. If you want to foray into the world of data science, you need a good command of statistics, as it forms the base of all data science concepts. In this blog, we will learn more about statistics for data science.
Statistics is primarily an applied branch of mathematics that tries to make sense of observations about the world. It is usually considered one of the pillars of data science.
Statistics is a mathematical science concerned with the collection, presentation, analysis, and interpretation of data. Let's begin with the statistics tutorial.
An analysis of any situation can be done in two ways:
1. Statistical analysis
It is the science of collecting, exploring, and presenting large amounts of data to identify patterns and trends. It is also called quantitative analysis.
2. Non-statistical analysis
It provides general information and includes text, sound, still images, and moving images. It is also called qualitative analysis.
There are two major categories of statistics:
1. Descriptive statistics
Descriptive statistics help organize data and focus on the main characteristics of the data. It provides a summary of the data numerically or graphically.
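As a quick illustration, here is a minimal sketch using Python's standard library to summarize a small dataset numerically; the exam scores are made-up numbers:

```python
import statistics

# Made-up exam scores for a small class
scores = [72, 85, 90, 68, 77, 85, 95, 60]

print("Mean:  ", statistics.mean(scores))             # arithmetic average
print("Median:", statistics.median(scores))           # middle value of the sorted data
print("Mode:  ", statistics.mode(scores))             # most frequent value
print("Stdev: ", round(statistics.stdev(scores), 2))  # sample standard deviation
```

Each of these one-number summaries describes a different aspect of the data, which is exactly what descriptive statistics is for.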
2. Inferential statistics
Inferential statistics generalizes from a sample to the larger population and applies probability theory to draw conclusions. It allows you to infer population parameters based on sample statistics and to model relationships within the data. Modelling allows you to develop mathematical equations that describe the interrelationships between two or more variables.
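To make the idea concrete, the sketch below simulates a hypothetical population of commute times (made-up parameters), draws a sample, and infers the population mean with an approximate 95% confidence interval; it uses only Python's standard library:

```python
import random
import statistics

random.seed(42)  # reproducible draw for this illustration

# Hypothetical population: 10,000 simulated daily commute times (minutes)
population = [random.gauss(30, 5) for _ in range(10_000)]

# Inferential step: estimate the population mean from a small sample
sample = random.sample(population, 100)
sample_mean = statistics.mean(sample)
standard_error = statistics.stdev(sample) / 100 ** 0.5

# Approximate 95% confidence interval for the population mean
low, high = sample_mean - 1.96 * standard_error, sample_mean + 1.96 * standard_error
print(f"Sample mean: {sample_mean:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```

The interval expresses the uncertainty that comes from observing only a sample rather than the whole population.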
Correlation in statistics
Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It is a common tool for describing simple relationships without making a statement about cause and effect.
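A minimal sketch of the Pearson correlation coefficient, computed from its definition on made-up study-hours and exam-score data:

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up data: hours studied vs. exam score
hours_studied = [1, 2, 3, 4, 5]
exam_score    = [52, 58, 65, 70, 80]
print(round(pearson_r(hours_studied, exam_score), 3))  # close to +1: strong linear relationship
```

A value near +1 or -1 indicates a strong linear relationship; a value near 0 indicates little or none.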
Covariance in statistics
Covariance is a statistical tool that is used to determine the relationship between the movements of two asset prices. When two stocks tend to move together, they are seen as having a positive covariance; when they move inversely, the covariance is negative.
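The sketch below computes the sample covariance from its definition, using made-up daily returns for two hypothetical stocks:

```python
def sample_covariance(x, y):
    """Sample covariance between two equal-length lists (divides by n - 1)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

# Made-up daily returns (%) for two hypothetical stocks
stock_a = [1.2, -0.5, 0.8, 1.5, -1.0]
stock_b = [0.9, -0.3, 0.6, 1.1, -0.8]
cov = sample_covariance(stock_a, stock_b)
print(f"Covariance: {cov:.4f}")  # positive => the stocks tend to move together
```

Unlike correlation, covariance is not scaled to [-1, 1], so its magnitude depends on the units of the data; only its sign is directly interpretable.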
Binomial distribution in statistics
The binomial distribution can be thought of as simply the probability of a success or failure outcome in an experiment or survey that is repeated multiple times. The binomial is a type of distribution that has two possible outcomes (the prefix "bi" means two or twice).
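For example, the binomial probability mass function can be written directly from its formula; the coin-flip scenario below is just an illustration:

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 3 heads in 5 fair coin flips
print(round(binomial_pmf(3, 5, 0.5), 4))  # comb(5, 3) = 10, so 10 / 32 = 0.3125
```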
Poisson distribution in statistics
In statistics, a Poisson distribution is a probability distribution that can be used to show how many times an event is likely to occur within a specified period of time.
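The Poisson probability mass function is equally easy to write out; the customer-arrival numbers here are made up for illustration:

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(exactly k events in an interval, given an average rate of lam events)."""
    return lam ** k * exp(-lam) / factorial(k)

# A shop averages 4 customers per hour; probability of exactly 2 in the next hour
print(round(poisson_pmf(2, 4), 4))
```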
Moving ahead in this blog on Statistics for Data Science, let us understand some common terms. There are various statistical terms that one should be aware of while dealing with statistics. Some of them are as follows:
1. Population
A population is a group from which data is to be collected.
2. Sample
A sample is a subset of the population.
3. Variable
A variable is a feature characteristic of any member of a population, differing in quality or quantity from another member.
4. Quantitative variable
A variable differing in quantity is called a quantitative variable. Example: the weight of a person and the number of people in a car.
5. Qualitative variable
A variable differing in quality is called a qualitative variable or attribute. Example: Colour and the degree of damage to a car in an accident.
6. Discrete variable
A discrete variable is one in which no value can be assumed between two given values. Example: the number of children in a family.
7. Continuous variable
A continuous variable is one in which any value can be assumed between two given values. Example: the time taken for a 100-metre run.
Types of statistical measures
There are four types of statistical measures used to describe data.
1. Measure of frequency
The frequency of the data indicates the number of occurrences of any particular data value in the given dataset. Frequency is measured as a count or a percentage.
2. Measure of central tendency
Central tendency indicates whether the data values accumulate in the middle of the distribution or towards the ends. The measures of central tendency are the mean, median, and mode.
3. Measure of spread
Spread describes how similar or varied the set of observed values is for a particular variable. The measures of spread are the standard deviation, variance, and quartiles; they are also called measures of dispersion.
4. Measure of position
The position identifies the exact location of a particular data value in the given data set. The measures of position are percentiles, quartiles and standard scores.
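All four types of measures can be computed with Python's standard library; the dataset below is made up for illustration:

```python
import statistics
from collections import Counter

data = [4, 8, 6, 5, 3, 8, 9, 5, 8, 7]

# 1. Frequency: occurrences of each value
print("Frequencies:", Counter(data))

# 2. Central tendency: mean, median, and mode
print("Mean:", statistics.mean(data), "Median:", statistics.median(data),
      "Mode:", statistics.mode(data))

# 3. Spread: variance and standard deviation
print("Variance:", round(statistics.variance(data), 2),
      "Stdev:", round(statistics.stdev(data), 2))

# 4. Position: quartiles split the sorted data into four equal parts
print("Quartiles:", statistics.quantiles(data, n=4))
```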
The Statistical Analysis System (SAS) provides a list of procedures to perform descriptive statistics. They are as follows:
- Proc Print – Prints all the variables in a SAS dataset.
- Proc Contents – Describes the structure of the dataset.
- Proc Means – Provides data summarization tools to compute descriptive statistics.
- Proc Freq – Produces one-way to n-way frequency and cross-tabulation tables.
- Proc Univariate – Conducts some basic statistical analyses and includes high-resolution graphical features.
- Proc GChart – Produces six types of charts: block, horizontal, vertical bar, pie, doughnut and star.
- Proc Boxplot – Creates side by side box and whisker plots of measurements organized in groups.
- Proc Gplot – Creates two-dimensional graphs, including simple scatter plots, overlay plots, plots against a second vertical axis, bubble plots, and logarithmic plots.
Hypothesis testing in statistics
Hypothesis testing is an inferential statistical technique to determine whether there is enough evidence in a data sample to infer that a certain condition holds true for the entire population.
1. Null Hypothesis (H0)
The null hypothesis is assumed to be true unless there is strong evidence to the contrary. It states that no variation exists between variables. For example, a pharmaceutical company has introduced a medicine in the market for a particular disease, people have been using it for a considerable period of time, and it is generally considered safe. The claim that the medicine is safe is the null hypothesis.
2. Alternative Hypothesis (H1)
The alternative hypothesis is any hypothesis other than the null hypothesis, and it is assumed to be true when the null hypothesis is proved false. Continuing the medicine example, to reject the null hypothesis we would have to prove that the medicine is unsafe. If the null hypothesis is rejected, the alternative hypothesis is accepted.
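The decision procedure can be sketched with a simple two-sided z-test using Python's standard library; the safety scores, population mean, and standard deviation below are all made-up numbers for illustration:

```python
from statistics import NormalDist, mean

# Hypothetical scenario: a safety score averages 100 in the general
# population (null hypothesis H0) with a known standard deviation of 15.
measurements = [112, 108, 96, 104, 110, 101, 99, 107, 103, 105]

pop_mean, pop_sd, n = 100, 15, len(measurements)
z = (mean(measurements) - pop_mean) / (pop_sd / n ** 0.5)

# Two-sided p-value: probability of a result at least this extreme under H0
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.2f}, p = {p_value:.3f}")
if p_value < 0.05:
    print("Reject H0 in favour of the alternative hypothesis")
else:
    print("Fail to reject H0")
```

With these made-up numbers the p-value exceeds the usual 0.05 threshold, so the sample does not provide enough evidence to reject the null hypothesis.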
This brings us to the end of the statistics for data science tutorial. We hope that you were able to gain more insight into the topic and learn statistics for data science.