# Introduction to Statistical Methods and Regression Models

Lead Author(s): Lelys Bravo de Guenni

This book introduces the main methods of statistical inference and regression analysis, using real-life examples and output from a statistical package.

# Part 1: Why Statistics?

Statistics is taking a leading role in science and technology. Karl Pearson (1857–1936), an influential British mathematician, is credited with the statement that "statistics is the grammar of science". Good science cannot be written without sound statistical analysis to support its findings.

Image courtesy of Pixabay under CC0 1.0 by Creative Commons.

Major activities in Statistics are:

• Design of experiments and surveys to test hypotheses
• Exploration and visualization of sample data
• Summary description of sample data
• Stochastic modeling of uncertainty
• Forecasting based on suitable models
• Hypothesis testing and statistical inference
• Development of new statistical theory and methods

### History of Statistics

During the 1600s, the new states of Europe had a strong motivation to make sense of the large amounts of data collected through population surveys.

The mathematical foundations of statistics advanced significantly thanks to developments in probability theory, which was itself inspired by games of chance and gambling.

For more information about the history of statistics, see Johnson and Kotz (1998) and Kotz and Johnson (1993).

A nice summary of key dates can be found in this link.

### Jobs in Statistics

According to the Bureau of Labor Statistics, employment of statisticians is projected to grow 34 percent from 2014 to 2024.

A recent study by McKinsey Global Institute predicts that the United States will need vastly more professionals—between 140,000 and 190,000—with expertise in statistical methods by 2018.

### Choosing the appropriate statistical technique

The selection of an appropriate statistical technique depends on the type of research, the types of variables, and the particular question you are trying to answer.

In most types of analysis you have a response (or dependent) variable and other variables called predictors (or explanatory variables).

Example: the relationship between alcohol consumption (frequency of use) and body-mass index, $\text{BMI} = (\text{weight in kg}) / (\text{height in m})^2$.

You want to understand the response variable y as a function of the explanatory variable x.
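The body-mass index in the example above is a simple arithmetic formula, so it can be computed directly. The following is a minimal Python sketch; the function name `body_mass_index` and the sample person (70 kg, 1.75 m) are hypothetical and chosen only for illustration.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """Return BMI = weight (kg) divided by height (m) squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
bmi = body_mass_index(70.0, 1.75)
print(round(bmi, 2))  # prints 22.86
```

In a study like the one described, a value computed this way for each subject would play the role of the response variable y.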

### Examples

1. Effect of an experimental diet (tannin content) on caterpillar growth

Figure: Scatter diagram of growth vs. tannin content for caterpillars. Provided by the author.

We can observe that when the tannin concentration increases, caterpillar growth slows down. The objective is to find a statistical model that best describes this behavior.

2. Climate change and fish growth

Climate change has an important effect on many biological processes and on life on earth. An example is the impact of climate change on the size of the cod caught.

Again, we would like a model that describes the impact of climate change on cod size.

Image courtesy of U.S. Fish and Wildlife Service by SA-3.0 via Creative Commons.
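For the first example, one simple candidate model is a straight line fitted by ordinary least squares. The sketch below is only an illustration in plain Python: the function name `fit_line` is hypothetical, and the numbers are made up for the example, not the data behind the author's figure.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustrative values (NOT the data shown in the figure).
tannin = [0.0, 1.0, 2.0, 3.0, 4.0]
growth = [10.0, 9.0, 7.0, 6.0, 3.0]

intercept, slope = fit_line(tannin, growth)
print(f"fitted line: growth = {intercept:.2f} + ({slope:.2f}) * tannin")
```

A negative slope, as in these invented values, corresponds to growth slowing down as tannin concentration increases.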

### Types of Variables: Continuous or Categorical?

It is fundamental to recognize the types of variables we are dealing with and discern whether they are continuous or categorical:

• A continuous variable is a measurement that can take any real value within some range. Examples: weight, height, blood pressure, rainfall amount.
• A categorical variable is called a factor and has two or more levels. Examples: sex recorded as male or female is a two-level factor; hair color recorded as brown, black, blonde, or other is a four-level factor.

Before you start your statistical analysis, try to answer the following questions:

• ​Which variable is the response variable?
• Which are the explanatory variables?
• Are the explanatory variables continuous, categorical or a mixture?
• What type is the response variable: continuous, a count, a proportion, a time at death, or a category?

### A guide to the analysis

Here is a list of statistical methods you should use according to the data types you have available:
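One widely used textbook convention for matching data types to methods can be sketched as a small lookup. The mapping below is an assumption based on standard practice, not necessarily the author's own list, and it is not exhaustive; the function name `suggest_method` and the type labels are hypothetical.

```python
def suggest_method(response: str, explanatory: str = "continuous") -> str:
    """Suggest a family of methods from the variable types.

    response:    "continuous", "proportion", "count", "binary" or "time at death"
    explanatory: "continuous", "categorical" or "mixed"
                 (only used when the response is continuous)
    """
    # Continuous response: the choice depends on the explanatory variables.
    by_explanatory = {
        "continuous": "regression",
        "categorical": "analysis of variance (ANOVA)",
        "mixed": "analysis of covariance (ANCOVA)",
    }
    # Other response types: the choice depends on the response itself.
    by_response = {
        "proportion": "logistic regression",
        "count": "log-linear (Poisson) model",
        "binary": "binary logistic regression",
        "time at death": "survival analysis",
    }
    if response == "continuous":
        if explanatory not in by_explanatory:
            raise ValueError(f"unknown explanatory type: {explanatory!r}")
        return by_explanatory[explanatory]
    if response not in by_response:
        raise ValueError(f"unknown response type: {response!r}")
    return by_response[response]


print(suggest_method("continuous", "categorical"))
print(suggest_method("count"))
```

Such a lookup only narrows the choice; the research question and the study design still decide which model is appropriate.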

### Data Sources from different types of research

We can mention three types of research depending on the nature of the observational units (or subjects):

• Experimental: a controlled study in which subjects are randomly assigned to different factor levels
• Quasi-experimental: subjects are not assigned randomly; these studies are less expensive but offer less control
• Observational: no randomization or manipulation of the study subjects

### Aim of the Analysis

A statistical analysis can be broken down into the following steps:

• Select an appropriate model
• Estimate the parameters of the chosen model (by least squares, maximum likelihood, restricted maximum likelihood, or Bayesian methods)
• Apply Occam's razor (the principle of parsimony)

### Occam's Razor Principle

Here are some guidelines for statistical analysis, derived from the principle attributed to William of Occam (c. 1287–1347):

• As few parameters as possible
• Linear models preferred to non-linear models
• Few assumptions better than many assumptions
• Simple explanations better than complex explanations

Please note: This principle is a guide, not a rule.
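One way to make "as few parameters as possible" concrete is to compare models with a criterion that penalizes each extra parameter. The sketch below uses the Akaike information criterion (AIC) for Gaussian least-squares fits, $\text{AIC} = n \ln(\text{RSS}/n) + 2k$ up to an additive constant. AIC is not mentioned in the text and is used here only as an assumed example of such a criterion; the function names and data values are hypothetical.

```python
import math


def rss_mean_only(y):
    """Residual sum of squares when every y is predicted by the mean."""
    mean_y = sum(y) / len(y)
    return sum((yi - mean_y) ** 2 for yi in y)


def rss_straight_line(x, y):
    """Residual sum of squares after a least-squares straight-line fit."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


def aic(rss, n, n_params):
    """AIC of a Gaussian least-squares model, up to an additive constant."""
    return n * math.log(rss / n) + 2 * n_params


# Made-up illustrative values, not data from the text.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [10.0, 9.0, 7.0, 6.0, 3.0]
n = len(y)

# Parameter counts include the error variance.
aic_mean = aic(rss_mean_only(y), n, n_params=2)      # mean, variance
aic_line = aic(rss_straight_line(x, y), n, n_params=3)  # intercept, slope, variance

print(f"mean-only model: AIC = {aic_mean:.2f}")
print(f"straight line:   AIC = {aic_line:.2f}")
```

The model with the lower AIC is preferred. For these invented values the straight line wins: its extra parameter reduces the residual error by more than enough to offset the penalty.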

# Part 2: Classification of Variables

The following diagram shows the main categories used to classify the types of data we normally work with in statistics. It is very important to identify the type of data before starting any statistical analysis:

### Classification of Quantitative Variables

We classify quantitative variables according to whether there are gaps between their possible values:

• Discrete: There are gaps between possible values, and values in between cannot occur. Example: the number of children a couple might have: 0, 1, 2, ...
• Continuous: There are no gaps between possible values, so any value in between can occur. Example: the annual inflation rate in different countries: 0.5%, 1.1%, 2.7%, 5.6%, 10.1%, ...

Variables can also be classified by their level of measurement:

Nominal level: Categorical data only. The data cannot be arranged in order.

Example: Survey responses (yes, no, NA).

Ordinal level: The data can be arranged in order, but differences between values are meaningless.

Interval level: Like ordinal data, but differences are meaningful. There is no natural zero starting point.

Example: Years of strong El Niño events since 1980: 1982, 1987, 1997.

Ratio level: Like the interval level, but with a natural zero starting point, which makes ratios meaningful.

Example: Weight or distance.
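The four levels form a hierarchy: each level supports the comparisons of the previous one plus one more. A minimal Python sketch of that idea follows; the names `LEVELS`, `NEW_OPERATION`, and `meaningful_operations` are hypothetical, chosen only for illustration.

```python
# Levels of measurement, from least to most informative.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# The operation that becomes meaningful at each level.
NEW_OPERATION = {
    "nominal": "test equality (same or different category)",
    "ordinal": "compare order (rank)",
    "interval": "take meaningful differences",
    "ratio": "take meaningful ratios",
}


def meaningful_operations(level: str) -> list:
    """Return every operation that is meaningful at the given level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level!r}")
    idx = LEVELS.index(level)
    # A level inherits the operations of all lower levels.
    return [NEW_OPERATION[name] for name in LEVELS[: idx + 1]]


for op in meaningful_operations("interval"):
    print(op)
```

For example, calendar years (interval level) can be compared and subtracted, but saying one year is "twice" another is not meaningful, so the ratio operation is absent from the list printed above.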
### Level of Measurement

Classify the following variables according to their level of measurement.

| Premise | Response |
| --- | --- |
| 1. Age of an individual | A. Ordinal |
| 2. GPA of a college student | B. Ratio |
| 3. Type of living accommodation | C. Nominal |
| 4. Blood group | D. Interval |
| 5. Income earned in a week | E. Ratio |
| 6. Number of miles walked in a day | F. Ratio |
| 7. | G. Nominal |

### Descriptive orientation of variables

This refers to the role a variable might play in a given statistical analysis. We can distinguish three main groups:

• ​Response variable or dependent variable
• Predictor or independent variable
• Confounders or nuisance variables

Example:

A study variable such as freedom of the press can be classified as ordinal if it is recorded in three ordered categories: Free, Partly Free, and Not Free. The variable is originally evaluated in each country and territory through 23 methodology questions, giving each country a final score from 0 to 100. A total score of 0 to 30 means Free, 31 to 60 means Partly Free, and 61 to 100 means Not Free. If the scores are used instead, the variable is discrete; in terms of level of measurement it can be classified as an interval variable, and in terms of descriptive orientation it is a dependent variable.
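The score-to-category rule in this example can be written as a short function. This is a minimal sketch using the cut-offs stated above; the function name `press_freedom_status` is hypothetical.

```python
def press_freedom_status(score: int) -> str:
    """Map a total score (0 to 100) to its press freedom category."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score <= 30:
        return "Free"
    if score <= 60:
        return "Partly Free"
    return "Not Free"


# Hypothetical scores, one from each band.
for score in (12, 45, 88):
    print(score, press_freedom_status(score))
```

Collapsing the score into categories discards information: two countries scoring 31 and 60 fall in the same category even though their scores differ by 29 points.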

### Level of Measurement

A social security number can be classified as:

• A. Nominal measurement
• B. Ordinal measurement
• C. Interval measurement
• D. Ratio measurement
• E. None of the above