Introduction to Statistical Methods and Regression Models
Lead Author(s): Lelys Bravo de Guenni
This book is an introduction to the main methods used in statistical inference and regression analysis, using real-life examples and output from a statistical package.
Chapter 1: Introduction
Part 1: Why Statistics?
Statistics is taking a leading role in science and technology. Karl Pearson (1857-1936), a very influential British mathematician, is quoted as saying that "statistics is the grammar of science". Good science cannot be written without a sound statistical analysis in support of the scientific findings.
Major activities in Statistics are:
- Design of experiments and surveys to test hypotheses
- Exploration and visualization of sample data
- Summary description of sample data
- Stochastic modeling of uncertainty
- Forecasting based on suitable models
- Hypothesis testing and statistical inference
- Development of new statistical theory and methods
History of Statistics
There was a strong motivation to make sense of the large amounts of data collected by population surveys in the new states of Europe during the 1600s.
Mathematical foundations of Statistics advanced significantly thanks to advances in Probability Theory (inspired by games of chance and gambling).
For more information about the history of statistics, see Johnson and Kotz (1998) and Kotz and Johnson (1993).
A nice summary of key dates can be found in this link.
Jobs in Statistics
According to the Bureau of Labor Statistics, employment of statisticians is projected to grow 34 percent from 2014 to 2024.
A recent study by McKinsey Global Institute predicts that the United States will need vastly more professionals—between 140,000 and 190,000—with expertise in statistical methods by 2018.
Choosing the appropriate statistical technique
The selection of an appropriate statistical technique depends on the type of research, the types of variables and the particular question you are trying to answer.
In most types of analysis you have a response or dependent variable and you have other variables called predictors.
Example: Relationship between alcohol consumption (frequency of use) and body-mass index [(weight in kg)/(height in m)^2]
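The body-mass index formula in this example can be written as a short function. This is a minimal Python sketch; the function name `body_mass_index` and the sample values are illustrative assumptions, not data from the study.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI = (weight in kg) / (height in m)^2."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Illustrative values only: a person weighing 70 kg who is 1.75 m tall.
bmi = body_mass_index(70.0, 1.75)
print(round(bmi, 1))
```

In a regression setting, a value computed this way would play the role of the response variable.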
You want to understand the response variable y as a function of the explanatory variable x.
1. Effect of an experimental diet (tannin content) on caterpillar growth
We can observe that when the tannin concentration increases, caterpillar growth slows down. The objective is to find a statistical model that describes this behavior in an optimal way.
2. Climate change and fish growth
Climate change has an important effect on many biological processes and on life on earth. An example is the impact of climate change on the size of the cod caught.
Again, we would like to have a model that is able to describe the impact of climate change on cod size.
This link has the full article on the impact of climate change on the fish size.
Types of Variables: Continuous or Categorical?
It is fundamental to recognize the types of variables we are dealing with and discern whether they are continuous or categorical:
- A continuous variable is a measurement that can take any real value within a range. Example: Weight, height, blood pressure, rainfall amount.
- A categorical variable is called a factor and might have two or more levels. Example: Sex is a two-level factor (male or female); hair color is a four-level factor (brown, black, blonde, other).
Before you start your statistical analysis, try to answer the following questions:
- Which variable is the response variable?
- Which are the explanatory variables?
- Are the explanatory variables continuous, categorical or a mixture?
- What type of response variable do you have: continuous, a count, a proportion, a time at death, or a category?
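The answers to these questions can be turned into a simple lookup. The following Python sketch uses conventional textbook pairings of data types and methods; the pairings, the dictionary names and the function `suggest_method` are assumptions of this sketch and are not taken from the author's own guide.

```python
# Conventional pairings (an assumption of this sketch, not the author's list).
# Used when the response variable is continuous:
METHOD_BY_EXPLANATORY = {
    "continuous": "regression",
    "categorical": "analysis of variance (ANOVA)",
    "mixture": "analysis of covariance (ANCOVA)",
}

# Used when the response variable is not continuous:
METHOD_BY_RESPONSE = {
    "proportion": "logistic regression",
    "count": "log-linear (Poisson) model",
    "time at death": "survival analysis",
    "category": "multinomial logistic regression",
}


def suggest_method(response_type: str, explanatory_type: str) -> str:
    """Suggest a starting method from the types of the variables."""
    if response_type == "continuous":
        return METHOD_BY_EXPLANATORY[explanatory_type]
    return METHOD_BY_RESPONSE[response_type]


# Caterpillar example, treating growth and tannin content as continuous.
print(suggest_method("continuous", "continuous"))
```

A lookup like this only suggests a starting point; the final choice still depends on the research question and on checking the model's assumptions.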
A guide to the analysis
Here is a list of statistical methods you should use according to the data types you have available:
Data Sources from different types of research
We can mention three types of research depending on the nature of the observational units (or subjects):
- Experimental: Controlled study; subjects randomly assigned to different factor levels
- Quasi-experimental: Subjects are not assigned randomly; less expensive studies; less control
- Observational: No randomization or manipulation of study subjects
Aim of the Analysis
A statistical analysis can be broken down into the following steps:
- Select an appropriate model
- Determine the values of the parameters of a specific model (least squares, maximum likelihood, restricted likelihood, Bayesian paradigm)
- Get the best fit of the model to the data
- Check model adequacy
- Occam's Razor Principle (Principle of Parsimony)
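Determining parameter values by least squares, one of the estimation approaches listed in the steps above, can be sketched for a straight-line model y = a + b*x. This is a minimal Python illustration using the closed-form least-squares estimates; the function names and the data are made up for the sketch and do not come from any of the studies mentioned in this chapter.

```python
def fit_line(x, y):
    """Ordinary least-squares estimates (intercept, slope) for y = a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_sum_of_squares(x, y, intercept, slope):
    """Sum of squared differences between observed and fitted values."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up numbers, chosen to lie exactly on the line y = 10 - 2x.
x = [0.0, 1.0, 2.0, 3.0]
y = [10.0, 8.0, 6.0, 4.0]
intercept, slope = fit_line(x, y)
rss = residual_sum_of_squares(x, y, intercept, slope)
print(intercept, slope, rss)
```

The residual sum of squares is one simple quantity to look at when checking model adequacy: the smaller it is, the closer the fitted line is to the observations.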
Occam's Razor Principle
Here are some guidelines, in the spirit of William of Occam (c. 1287–1347), to be followed in statistical analysis:
- As few parameters as possible
- Linear models preferred to non-linear models
- Few assumptions better than many assumptions
- Minimal adequate model
- Simple explanations better than complex explanations
Please note: This principle is a guide, not a rule.
Part 2: Classification of Variables
The following diagram shows the main categories for classifying the different types of data we normally work on in statistics. It is very important to identify the type of data we are going to work on, before we start any statistical analysis:
Classification of Quantitative Variables
We classify a quantitative variable according to whether there are gaps between its possible values:
- Discrete: There are gaps between possible values, and values in between are not possible. Example: Number of children a couple might have: 0, 1, 2, ...
- Continuous: There are no gaps between possible values, and values in between are possible. Example: Annual inflation rate in different countries: 0.5%, 1.1%, 2.7%, 5.6%, 10.1%, ...
Variables can also be classified by level of measurement:
Nominal Level: Categorical data only. Data cannot be arranged in order.
Example: Survey responses (yes, no, na).
Ordinal Level: Data can be arranged in order. Differences are meaningless.
Example: Course grades (A, B, C, D, F).
Interval Level: Like ordinal data, but differences are meaningful. There is no natural zero starting point.
Example: Years of strong El Niño since 1980: 1982, 1987, 1997.
Ratio Level: Similar to interval level, but with a natural zero starting point. The zero starting point makes ratios meaningful.
Example: Weight or distance.
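The four levels of measurement form a hierarchy: each level allows everything the previous one allows, plus one more kind of comparison. The following Python sketch encodes that hierarchy; the class and function names are hypothetical, chosen only for this illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """Levels of measurement, ordered from least to most informative."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


def can_order(level: Level) -> bool:
    """Values can be arranged in order from the ordinal level upward."""
    return level >= Level.ORDINAL


def differences_meaningful(level: Level) -> bool:
    """Differences are meaningful from the interval level upward."""
    return level >= Level.INTERVAL


def ratios_meaningful(level: Level) -> bool:
    """Ratios need a natural zero, which only the ratio level has."""
    return level == Level.RATIO


# The examples given in the text, tagged with their level.
EXAMPLES = {
    "survey response (yes, no, na)": Level.NOMINAL,
    "course grade (A, B, C, D, F)": Level.ORDINAL,
    "year of a strong El Niño": Level.INTERVAL,
    "weight": Level.RATIO,
}

for name, level in EXAMPLES.items():
    print(name, level.name, can_order(level),
          differences_meaningful(level), ratios_meaningful(level))
```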
Classify the following variables according to the level of measurement:
- Age of an individual
- GPA of a college student
- Type of living accommodation
- Income earned in a week
- Number of miles walked in a day
- Final grade in a class
Descriptive orientation of variables
This refers to the role a variable might play in a given statistical analysis. We can distinguish three main groups:
- Response variable or dependent variable
- Predictor or independent variable
- Confounders or nuisance variables
A study variable such as freedom of the press can be classified as nominal if it is recorded in three categories: Free, Partly Free and Not Free. The variable is originally evaluated in each country and territory through 23 methodology questions, and each country receives a final score from 0 to 100. A total score of 0 to 30 means Free, a total score of 31 to 60 means Partly Free, and a total score of 61 to 100 means Not Free. If the scores are considered instead, the variable is discrete; from the level-of-measurement point of view it can be classified as an interval variable, and from its descriptive orientation it is a dependent variable.
For more information on this example you can see the following link.
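The score bands in the freedom of the press example translate directly into a small function that converts a numeric score into its category. This is a minimal Python sketch; the function name `press_freedom_category` is an assumption of the sketch, while the cut-offs are the ones stated in the example.

```python
def press_freedom_category(score: int) -> str:
    """Map a total score (0 to 100) to Free, Partly Free or Not Free."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score <= 30:
        return "Free"
    if score <= 60:
        return "Partly Free"
    return "Not Free"


# Scores on either side of each cut-off.
for score in (0, 30, 31, 60, 61, 100):
    print(score, press_freedom_category(score))
```

Converting scores to categories discards information: two countries with scores of 31 and 60 land in the same category even though their scores differ by 29 points.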
Social security number can be classified as
None of the above