# A Research Voyage for Psychologists

Lead Author(s): **Guillermo Campitelli**

From traditional methods to model comparison and Bayesian inference

# Chapter 1: Taking off

- Get a picture of the approach and structure of the book.
- Grasp the strategies scientists use to investigate the world.
- Understand the following concepts: variable, value, population, sample, probabilistic causal model, statistical model, research design.
- Have a general picture of the statistical and inferential approaches that will be presented in the book.
- Be introduced to the pedagogical tools used in the book: graphs, metaphors, data sets.

Welcome to this voyage! In this book about research methods in psychology I use a lot of images, diagrams, metaphors, videos and other pedagogical tools to facilitate the understanding of complex methodological and analytical concepts. It is assumed that you already know basic statistical topics: samples, populations, measures of central tendency (e.g., mean), measures of dispersion (e.g., standard deviation), the concept of variable, and other basic issues in descriptive statistics. However, we will briefly touch on those topics, so do not panic if you have already forgotten them.

So, you will start this book with very little background knowledge, but at the end of the book you will be able to understand and use complex methodological tools such as Bayesian inferential methods. I am using a step-by-step approach focusing on understanding, not on mathematical rigor.

Science is a discipline that aims to understand what goes on in the world. It does that by scrutinising the world in a structured manner. Based on this scrutiny scientists obtain **data**, and, based on those data, they generate **theories** of how the world works. Theories help make more precise **hypotheses** about what goes on in the world, and, in turn, to design more effective ways of scrutinising the world to obtain as much knowledge as possible.

In this book I will not explain all the methods scientists in general, or psychologists in particular, use to extract knowledge from the world. But I will be able to explain the rationale of the methods psychological researchers use to understand behavior and subjective experience.

In this chapter I will first explain strategies that scientists have developed in order to be able to investigate the world scientifically. I will also introduce you to the tools I will be using throughout the book. Moreover, I will present the different approaches I will explain in the book.

In the following video I explain the ten strategies:

## Strategy 1: Reduce the complexity of the world to variables

Scientists have discovered that it is very difficult to give an account of things that happen in the world using our raw observations of what happens. The trick is to simplify the complexity of the myriad of things that happen in the world into **variables**. A variable is a characteristic or attribute of things (persons, animals, places, events, objects, etc.) that may take more than one **value**. If the *thing* we are considering is humans, examples of **variables** are:

- Gender
- Age
- Country of birth
- Height
- Educational level
- Intelligence
- Psychological wellbeing

These are all characteristics of humans that can take more than one value. For example, the variable gender can take the values female, male, and others; the variable age may take the values 0, 1, 2, 3, etc.; the variable country of birth can take the values China, India, Brazil, United States, etc.

On the other hand the characteristic species is **NOT** a **variable**, because that characteristic has only one value (i.e., human) for all humans. Thus, in this context species is a **constant**. If the *thing* we are interested in is all animals, then species becomes a variable because it can take many values (i.e., humans, dogs, cats, cows, etc.).
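The distinction between a variable and a constant can be sketched in a few lines of Python. This is only an illustration: the records and the helper `classify` are hypothetical and are not part of the book's datasets.

```python
# Hypothetical records: each dict is one unit; keys are characteristics.
people = [
    {"species": "human", "gender": "female", "age": 34, "country_of_birth": "India"},
    {"species": "human", "gender": "male", "age": 21, "country_of_birth": "Brazil"},
    {"species": "human", "gender": "female", "age": 58, "country_of_birth": "China"},
]

# Widening the set of "things" from humans to animals (one hypothetical dog added).
animals = people + [
    {"species": "dog", "gender": "male", "age": 4, "country_of_birth": "United States"},
]


def classify(units, characteristic):
    """Return 'variable' if the characteristic takes more than one value
    across the given units, and 'constant' otherwise."""
    distinct_values = {unit[characteristic] for unit in units}
    return "variable" if len(distinct_values) > 1 else "constant"


for name in ["species", "gender", "age", "country_of_birth"]:
    print(name, "->", classify(people, name))

print("species among animals ->", classify(animals, "species"))
```

The same characteristic (species) is classified differently depending on which set of units is passed in, which is the point made above: whether something is a variable depends on the *thing* under consideration.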

If we are considering only human females, which of the following characteristics is a variable?

- Species
- Gender
- Age
- Female

By using variables instead of raw observation we lose huge amounts of information, but at the same time, it allows us to make a complex phenomenon manageable.

For example, considering cities as the *thing* of interest, temperature is a variable. If we know the range of values that temperature might take during a day we can choose appropriate clothing. In this scenario all the things that happen in the world that make a thermometer take a particular value are ignored and only that value is considered.

## Strategy 2: Develop theoretical models of things that happen in the world

Defining variables to understand the world makes complexity more manageable. Scientists go a step further, and they aim at establishing how variables are causally related to each other. Therefore, they develop theoretical models that establish causal relationships between variables. Ideally science develops theories of mechanisms, which are sets of structures and processes that explain how things in the world work. And we have a number of such theories in all the sciences (e.g., the theory of relativity, the theory of the cell).

But sometimes it is impossible, or difficult, or we still do not have the knowledge to develop a theory or model of the mechanisms of things that happen in the world. In those cases we may still be able to develop **probabilistic causal models**. That is, models of how variables affect each other. In the causal model below, an intervention that changes the values of variable A will produce a change in values in variable B. In most cases, though, the values that variable B adopts are not fully explained by the effect of only one variable.

It is more likely that changes in other variables also affect variable B. And in a lot of cases, we do not even know which variables affect variable B. In that case, an intervention in variable A may sometimes cause changes in variable B, and sometimes may not cause changes. In this case, we do not know with absolute certainty if a change in variable A will cause a change in variable B, but we can establish the probability of a change in variable B given a change in variable A.

So far I have considered whether a change would occur or not, but changes may come in degrees. So, probabilistic causal models can also aim to establish the **probability** that a change of a certain magnitude will occur.
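The idea of a probability of change in B given an intervention on A can be made concrete with a small simulation. The sketch below is purely illustrative: the function name and the two probabilities (0.10 for unknown variables, 0.60 for the intervention on A) are assumptions chosen for the example, not values from any study.

```python
import random


def simulate_b_changes(intervene_on_a, n_trials=10_000, seed=1):
    """Estimate how often B changes, with or without an intervention on A.

    Hypothetical assumptions: unknown variables change B with probability
    0.10 on their own, and an intervention on A changes B with probability
    0.60, independently of the unknown variables.
    """
    rng = random.Random(seed)
    changes = 0
    for _ in range(n_trials):
        changed_by_unknown = rng.random() < 0.10
        changed_by_a = intervene_on_a and rng.random() < 0.60
        if changed_by_unknown or changed_by_a:
            changes += 1
    return changes / n_trials


p_with_intervention = simulate_b_changes(intervene_on_a=True)
p_without_intervention = simulate_b_changes(intervene_on_a=False)
print("Estimated P(B changes | intervention on A):", round(p_with_intervention, 3))
print("Estimated P(B changes | no intervention):  ", round(p_without_intervention, 3))
```

Under these assumptions an intervention on A does not guarantee a change in B, and B sometimes changes without any intervention, yet the two estimated probabilities differ clearly. That difference is what a probabilistic causal model tries to capture.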

Developing accurate probabilistic causal models is very important because it not only allows us to understand the world, but also affords us the possibility to make interventions in the world that are beneficial to us. If we are able to develop a probabilistic causal model that indicates that taking aspirin reduces the chance of having a heart attack, we can take aspirin and increase our life expectancy.

Although some mechanistic models have been developed in psychology, probabilistic causal models are more common; thus, I will focus on probabilistic causal models in this book.

Probabilistic causal models are developed to represent:

- The deterministic causal relationship among variables.
- The set of structures and processes that explain a phenomenon.
- The things that occur in the world in their pure form.
- The probabilistic causal relationship among variables.

## Strategy 3: Define populations

An important aspect of science is to determine the scope of applicability of the probabilistic causal models. In other words, we need to establish to which **population** the model applies. And there are many types of populations that we can define. The figure below shows three categories of populations, but they are not the only categories we can use to classify the world.

The term **population** has one meaning in the sciences and another one in statistics. And we need to use both meanings of the term. I was tempted to coin a new name to differentiate between these two meanings but I decided that that would be more confusing. So, I will make sure throughout the book that it is clear which meaning I am referring to.

In sciences, a population is a set of units of interest. Examples of populations in science are: all humans, all human males, all human children from age 2 to age 4, all cities, all schools, all mammals, all children diagnosed with dyslexia, etc.

In statistics, a population is a large set of values. For example, if we are interested in intelligence in all adult women, the population is the set of all values of the variable intelligence in all adult women.

In psychology we pay a lot of attention to populations of people. This is understandable because our subject matter is people's behavior and subjective experiences. Thus, we want to find causal models that apply to the population of all humans. However, sometimes we want to compare groups of humans with different characteristics. For example, a causal model that applies to human males, may not be adequate for human females. Or, a causal model may apply to adults, and a different one must be developed for children. Or, a different model might be needed for different cultures.

It could also be the case that we are interested in specific populations. For example, all people with depression, all university students, all lawyers, all athletes. I must not forget that psychology is not only about humans: psychologists investigate psychological processes in animals and also in artificial systems.

Given that objects are not our subject matter, we do not pay as much attention to them as we do to people. However, determining to which population of objects a human behavior applies is a very important part of the research endeavor. Consider, for example, researchers who are investigating how fast humans read words. Their object population of interest is all words. However, when we try to develop a causal model of this phenomenon we may find that the model only applies to how fast people read familiar words; unfamiliar words may require the development of a different causal model.

Another important type of population is the situation. And I am using the term situation in a very broad sense. I am referring to different aspects of situations such as time, space, positions, goals, behaviors, etc. Following the reading words example, we may be interested in whether our causal model applies to different times of the day (morning vs. afternoon), to different distances (50 cm from the screen vs. 1 m from the screen), to different body positions (reading while sitting vs. reading while standing), to different reading goals (reading for understanding vs. reading for memorising), to different behaviors (indicating whether a word is a noun or an adjective vs. indicating the number of letters a word has).

A researcher is investigating the role of vitamin C in general cognitive ability in children. The population of interest is:

- General cognitive ability.
- All humans.
- All children.
- All types of vitamins.

## Strategy 4: Obtain representative samples

It is impossible for scientists to obtain measurements of all people interacting with all objects in all possible situations. Therefore, researchers obtain measurements in a subset of the populations. This subset is a **sample**.

As with populations, there is one meaning of sample in the sciences and another meaning in statistics. In the sciences a sample is a set of units of analysis that belongs to a population. In statistics a sample is a set of values obtained from a population of values. For example, if we are interested in measuring anxiety in humans we obtain a sample of people from the population of all humans, and by measurement, we obtain a sample of values of anxiety.

There is an important difference here. We roughly know that the population of all humans is composed of around 7 billion units. On the other hand, we do not know all the 7 billion values of anxiety, simply because we did not obtain measures of anxiety in the whole population. Because of that, a population of values is typically represented by a **probability distribution** (see chapter 4).

The process of obtaining samples is called **sampling**. The key to the sampling process is that the outcome should be a **representative sample** of the population of interest. But obtaining a representative sample in all respects is impossible. Even if we use the best sampling method (i.e., **random sampling**) the obtained sample may differ in some aspects from the whole population. Instead, psychological researchers aim to obtain samples that are representative of the population on the variables that may affect the variables in the causal model.

Consider, for example, a researcher who is investigating the effect of the variable *intuition* on the variable *decision making*. Let's assume a third variable, say *education level*, is unrelated to decision making. In this case, we do not need a sample that is representative of the population of interest in *education level*. But if *education level* is associated with *decision making*, then the variable *education level* in the sample should have a similar distribution to that in the population of interest. If this is not possible, we can still use statistical techniques to cope with this problem. However, those techniques sometimes work and sometimes do not work. The other alternative is to indicate that our causal model only applies to a population that has similar characteristics to our sample. Obviously, this is not ideal and diminishes the relevance of the research.

Another way of dealing with this issue is by having large samples. The larger a randomly obtained sample, the more similar it tends to be to the population. Of course, we can still have an unrepresentative large sample, but having large samples helps reduce the problem.
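The effect of sample size can be illustrated with a short simulation. This is a sketch under stated assumptions: the population below is artificial (100,000 values drawn from a normal distribution with mean 100 and standard deviation 15), and `typical_error` is a hypothetical helper written for this example.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of values (e.g., scores on some test).
population = [rng.gauss(100, 15) for _ in range(100_000)]
population_mean = statistics.fmean(population)


def typical_error(sample_size, n_repeats=200):
    """Average absolute gap between the sample mean and the population mean
    over repeated simple random samples of the given size."""
    gaps = []
    for _ in range(n_repeats):
        sample = rng.sample(population, sample_size)
        gaps.append(abs(statistics.fmean(sample) - population_mean))
    return statistics.fmean(gaps)


errors = {n: typical_error(n) for n in (10, 100, 1000)}
for n, error in errors.items():
    print(f"sample size {n:>4}: typical gap from population mean = {error:.2f}")
```

With random sampling, the typical gap shrinks as the sample grows. The sketch says nothing about non-random samples, which can stay unrepresentative no matter how large they are.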

A different way of dealing with the issue of representativeness of the sample is by using efficient research designs, which are explained in the next section.

So far, I have been talking about the representativeness of samples in terms of people. But it is also important to have representative samples of objects and situations. This is not very easy to achieve, and researchers spend almost all their resources in obtaining representative samples of people. But sometimes we can achieve representative samples of objects without too much effort. Consider a researcher investigating the knowledge people have about countries. The researcher obtains a sample of the first 10 countries that come to his/her mind. It is very probable that those countries will have large populations or are well known for other reasons. If the aim is to investigate the knowledge people have about all the countries, it is better to obtain the sample of 10 countries by randomly choosing them from the population of all countries in the world.

In order to make inferences about populations, what is the main characteristic of a sample?

- It must be homogeneous.
- It must be representative.
- It must be small.

## Strategy 5: Plan efficient research designs

When we want to know the effect of a variable A on another variable B, or even better, when we want to test a causal model that includes a causal effect of variable A on B, we want to control for the effects of other variables over variable B as much as we can.

Let's say a researcher is investigating the effectiveness of a psychotherapy to reduce stress. So, variable A is *treatment*, which takes the value *yes* in the participants who received psychotherapy and the value *no* in those participants who were not exposed to psychotherapy. Variable B is *stress reduction*. Let's say there is another variable (e.g., *age*) that we suspect may affect variable B. Perhaps younger people tend to reduce their stress without need for psychotherapy. In order to control for the effect of *age* on *stress reduction* we should assign participants to the values of variable A in a way that variable A and age are unrelated. That is, the mean age of the participants in the treatment-yes group should be very similar to that of the participants in the treatment-no group.

But what happens when we do not know which other variables affect variable B, or when we cannot measure variables that affect B? This is the case shown in the figure below.

In fact, it is always the case that there are unknown variables that affect variable B. Fortunately, there is a way of controlling for this. We can use the technique called **random assignment of participants to values (a.k.a. levels) of the independent variable**.

What is the purpose of randomly assigning participants to levels of the independent variable?

- Controlling for the effect of unknown variables over the dependent variable.
- Controlling for the effect of the independent variable over the dependent variable.
- Controlling for the effect of the dependent variable.

We call variable A the **independent variable** because it is independent from measurement (i.e., the values it takes do not depend on measurement), and we call variable B the **dependent variable** because the values it takes depend on measurement.

So, the technique involves assigning participants to the values of the independent variable using a random procedure.

The goal of this procedure is that the distribution of values of the unknown variables in the group of participants in the treatment-yes group and in the group of participants in the treatment-no group is similar. The larger the sample, the more likely these distributions will be similar.
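Random assignment can be sketched as follows. The participants and their ages are simulated, and age stands in for a variable the researcher may not have measured; the function `randomly_assign` is a hypothetical helper, not a procedure prescribed by the book.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical participants: age plays the role of an unmeasured variable.
ages = [rng.randint(18, 70) for _ in range(200)]


def randomly_assign(values, rng):
    """Shuffle a copy of the values and split it in half: the first half goes
    to the treatment-yes group, the second half to the treatment-no group."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


treatment_yes, treatment_no = randomly_assign(ages, rng)
gap = abs(statistics.fmean(treatment_yes) - statistics.fmean(treatment_no))

print("mean age, treatment-yes:", round(statistics.fmean(treatment_yes), 1))
print("mean age, treatment-no: ", round(statistics.fmean(treatment_no), 1))
print("difference in mean age: ", round(gap, 1))
```

Nothing in the assignment step looks at age, yet the two groups tend to end up with similar age distributions. The same holds for variables nobody thought to record, which is why the technique controls for unknown variables.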

Notice that in the figure above I added distribution graphs to the probabilistic causal model. This is a way of presenting the research design of your experiment. The distribution for variable A shows that variable A has two values, and the number of participants in each value is the same. The distribution of variable B is unknown because we will only know about it when we collect data. Likewise, the distribution of the unknown variables is also unknown but, unlike that of variable B, it will remain unknown after collecting data too.

## Strategy 6: Choose appropriate statistical models

Even if we manage to obtain a sample that is a perfect representation of the population of interest (which means that our independent variable is unrelated to all the unknown variables), we will only know the values of the dependent variable after the study is conducted. So, it may be the case that the distribution of values in the dependent variable is different from that of the population.

Fortunately, we have a tool, or a set of tools I should say, that allows us to establish how different the sample could be from the population. We refer to this set of tools as **statistics**, which uses statistical models and probability theory.

Statistical models are typically mathematical formulas with no graphical representations. In this book we love graphs, so we will use a graphical representation of statistical models. Now, before you read what comes next, do not panic, I will explain everything again step-by-step in chapter 5. So, do not worry at all if you do not understand the following explanation.

The graphical representation for statistical models that I will use in this book is illustrated above. It has some similarities with the probabilistic causal models, but it also has **latent variables**, the variables with a light shade. Please do not panic when you see Greek letters. They do not have any mysterious meaning. They are placeholders for values that we do not know at the moment. Using statistical tools we are going to estimate the most appropriate values for those variables.

This graphical representation has another component: the **plate**. In this case there is one plate, which is the rounded rectangle surrounding variable A, variable B and variable μ. This plate is called the i plate, and you can see that all the variables inside the plate have an i subscript. In this context i refers to the ID of each participant. The plate tells us that in this model we have N variables A, N variables B, and N variables μ. If N = 100, that is, there are 100 participants, then there are 100 variables A, 100 variables B and 100 variables μ. On the other hand, this model contains only one instance of the variables β_{0}, β_{1}, and σ, because these variables are outside the plate.

The variable μ is special because it has a double border. This indicates that the actual value that this variable takes is fully determined by other variables. In this case the value that each variable μ takes is determined by the following formula:
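A plausible form of this formula, assuming the simplest linear model consistent with the parameters β_{0} and β_{1} named above (this is an assumption; the model in the figure may contain additional terms), is:

```latex
\mu_i = \beta_0 + \beta_1 A_i
```

Under this assumed form, A_i is the value of variable A for participant i, β_{0} is the predicted value of B when A_i is 0, and β_{1} is the change in the prediction for each unit of change in A_i.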

μ is the prediction of the model for the value each variable B will adopt. But this prediction is not perfect. In order to represent the probabilistic nature of this model we indicate the probability for each possible value that each variable B can take with a **probability distribution**. In this case that probability distribution is a normal distribution with location parameter μ_{i} (the location parameter is typically known as the mean of the distribution) and scale parameter σ (the scale parameter is typically known as the standard deviation of the distribution). Note that in this model μ is different for each variable B (because μ is inside the i plate), but σ is the same for each variable B (because σ is outside the i plate).
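To see how the pieces of such a model fit together, the sketch below generates values of variable B from a normal distribution whose location is computed separately for each participant. It assumes a linear rule for μ_{i} (β_{0} plus β_{1} times A_i), a 0/1 coding of variable A, and arbitrary parameter values; all of these are assumptions made for illustration only.

```python
import random
import statistics


def simulate_model(a_values, beta_0, beta_1, sigma, seed=0):
    """Draw one value of B per participant from Normal(mu_i, sigma).

    Assumption for this sketch: mu_i = beta_0 + beta_1 * a_i.
    mu_i is computed inside the loop (one per participant, like a variable
    inside the i plate), while beta_0, beta_1 and sigma are shared by all
    participants (like variables outside the plate).
    """
    rng = random.Random(seed)
    b_values = []
    for a_i in a_values:
        mu_i = beta_0 + beta_1 * a_i
        b_values.append(rng.gauss(mu_i, sigma))
    return b_values


# Hypothetical design: N = 100 participants, half with A = 0 and half with A = 1.
a = [0] * 50 + [1] * 50
b = simulate_model(a, beta_0=10.0, beta_1=3.0, sigma=2.0)

mean_b_when_a0 = statistics.fmean(b[:50])
mean_b_when_a1 = statistics.fmean(b[50:])
print("mean of B when A = 0:", round(mean_b_when_a0, 2))
print("mean of B when A = 1:", round(mean_b_when_a1, 2))
```

The simulated values of B scatter around their μ_{i} rather than matching it exactly, which is the sense in which the model's prediction is not perfect. In real research the direction is reversed: B is observed and the parameters are estimated.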

Which of the following statements is true about statistical models?

- They make perfect predictions of the values of the dependent variable.
- They cannot be represented graphically.
- They are the same as causal models.
- They contain parameters.

## Strategy 7: Collect data with valid measurement tools

Once you have a probabilistic causal model to test, have come up with an appropriate research design, and have decided which statistical model you will use to analyse the data, it is time to collect the data. When collecting the data we need to use measurement tools that are valid (i.e., they measure what they are supposed to measure) and reliable (i.e., if you use the measurement tool twice in the same unit of analysis the result should be the same or very similar).

After collecting the data, we can complete the graph presented in strategy 5 with the distribution of observed values in variable B. This is what is called **descriptive statistics**. Descriptive statistics should include the distribution of values in the dependent variable, or as shown in the graph a different distribution for each value of variable A. In this case, variable A has two values, so there are two distributions of values of variable B. Typically other statistics are reported: the most popular are the **mean** and **standard deviation** of the whole distribution of variable B, and for the distributions of values of variable B for each value of variable A.
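As a small illustration of these descriptive statistics, the sketch below computes the mean and standard deviation of variable B for the whole sample and for each value of variable A. The six observations are made up for the example, and `describe` is a hypothetical helper.

```python
import statistics

# Hypothetical observations: (value of variable A, value of variable B).
observations = [
    ("yes", 8.0), ("yes", 10.0), ("yes", 12.0),
    ("no", 3.0), ("no", 5.0), ("no", 7.0),
]


def describe(values):
    """Return the mean and the sample standard deviation of a list of values."""
    return statistics.fmean(values), statistics.stdev(values)


# Values of B for the whole sample and grouped by the value of A.
all_b = [b for _, b in observations]
by_group = {}
for a, b in observations:
    by_group.setdefault(a, []).append(b)

summary = {"all": describe(all_b)}
for a, values in by_group.items():
    summary[a] = describe(values)

for label, (mean, sd) in summary.items():
    print(f"{label}: mean = {mean:.2f}, SD = {sd:.2f}")
```

Note that `statistics.stdev` computes the sample standard deviation (dividing by n - 1); `statistics.pstdev` would divide by n instead.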

Which of the following is not a statistic (i.e., a numeric summary of a sample)?

- Mean.
- Standard deviation.
- Variable.
- Median.

## Strategy 8: Make inferences from sample to population

Based on the **data** we obtained in our **sample** and the statistical model we have chosen, we are able to make inferences about the relationship between variables in the **population**.

In this book we are going to see three types of inferential approaches:

- **Parameter estimation** [introduced in chapter 5]
- **Model comparison** [introduced in chapter 6]
- **Hypothesis testing** [introduced in chapter 7]

In parameter estimation we make an inference on characteristics of the population of interest, including distributions of the variables and the effect of one variable over another. Given that our estimation cannot possibly be perfect, we provide a range of values within which we are fairly confident the actual value of the population is located.
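One way to obtain such a range is by resampling. The sketch below uses a percentile bootstrap for the mean, which is only one of several possible techniques and is chosen here for its simplicity, not because it is necessarily the procedure used later in the book. The scores and the function `bootstrap_interval` are hypothetical.

```python
import random
import statistics


def bootstrap_interval(sample, n_resamples=2000, level=0.95, seed=3):
    """Percentile bootstrap interval for the mean.

    Resamples the data with replacement many times, computes the mean of
    each resample, and returns the values that cut off the central `level`
    proportion of those means.
    """
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(sample, k=len(sample)))
        for _ in range(n_resamples)
    )
    lower_index = round((1 - level) / 2 * n_resamples)
    upper_index = round((1 - (1 - level) / 2) * n_resamples) - 1
    return means[lower_index], means[upper_index]


# Hypothetical sample of 20 anxiety scores.
scores = [12, 15, 9, 14, 11, 13, 16, 10, 12, 14,
          13, 11, 15, 12, 10, 14, 13, 12, 11, 15]

lower, upper = bootstrap_interval(scores)
print("sample mean:", statistics.fmean(scores))
print("95% bootstrap interval for the mean:", round(lower, 2), "to", round(upper, 2))
```

The interval expresses the uncertainty of the estimate: instead of reporting a single number for the population mean, we report a range of plausible values.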

In model comparison, instead of starting with one causal model, we start with two or more causal models. So, the goal of inferential statistics in this approach is to determine which of the models best explains the data.

In hypothesis testing we start with a null hypothesis (e.g., there is no causal effect of variable A over variable B), and we determine whether this null hypothesis holds after observing the data.
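The logic of testing a null hypothesis can be sketched with a permutation test, a resampling-based procedure used here only as an illustration; it is an assumption of this example, not a description of the specific tests covered in chapter 7. The data and the function `permutation_p_value` are hypothetical.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=11):
    """Two-sided permutation test for a difference in means.

    If the null hypothesis is true (group membership has no effect), the
    group labels are interchangeable. We shuffle the pooled values many
    times and count how often the shuffled difference in means is at least
    as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical stress-reduction scores for the two treatment groups.
treatment_yes = [8, 10, 12, 9, 11, 13, 10, 12]
treatment_no = [3, 5, 7, 4, 6, 8, 5, 7]

p_value = permutation_p_value(treatment_yes, treatment_no)
print("proportion of shuffles at least as extreme as the data:", p_value)
```

A small proportion means that a difference as large as the observed one would rarely arise if the null hypothesis were true, which is conventionally taken as evidence against the null hypothesis.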

Which of the following approaches does not make inferences from samples to populations?

- Descriptive statistics.
- Parameter estimation.
- Model comparison.
- Hypothesis testing.

## Strategy 9: Update theoretical models

The end result of statistical inference is the acquisition of new knowledge about the world. This new knowledge comes in the form of updating the values of parameters of causal models or, more radically, changing a model completely.

If the data contradict a theoretical model, what is the correct thing to do?

- Ignore the data.
- Do not make any changes in the theoretical model.
- Make the necessary changes in the model based on the new data.
- Change the data to be in line with the predictions of the theoretical model.

## Strategy 10: Use the new knowledge and start all over again

Science never has an end point. It is a continuous process to gradually improve our understanding of the world. We create theoretical models of how the world works, we design experiments to collect data and test those models, and by statistical inference we update our theoretical models, which in turn leads to new questions that we aim to answer with new research.

Which of the following statements is correct?

- Science is static.
- Science follows a cycle which includes data collection and theoretical refinement.
- Scientific theories do not change.
- Scientific theories always predict the data perfectly.

## Structure of this book

The first part of the book (chapters 1 to 7) is conceptual. In the second part of the book I present cases in which the knowledge acquired in the first part must be used to conduct data analyses.

At the beginning of this chapter I introduced causal models; in **chapter 2** I explain **causal models** in greater detail. I mentioned that causal models are probabilistic, that we are going to use probability distributions, and that statistical inference is based on probability theory. Thus, probability is a very important aspect of research methods. In **chapter 3** I explain the concept of **probability** using the **drone delivery metaphor**, and in **chapter 4** I introduce **probability distributions**.

The following three chapters are dedicated to explaining the three types of inferential statistics we are going to apply to different research situations. **Chapter 5** is dedicated to **parameter estimation**, **chapter 6** deals with **model comparison** and **chapter 7** shows **hypothesis testing**. The three types of inference can be carried out with three different approaches: the **traditional or frequentist approach**, the **Bayesian approach**, and the **resampling approach**. The three approaches are introduced in chapter 5, but they also appear in chapters 6 and 7 because they also apply to the corresponding types of statistical inference.

The second part of the book starts with **chapter 8**. In this second part we will apply three parameter estimation approaches (frequentist or traditional, Bayesian and resampling), two model comparison approaches (maximum likelihood and Bayesian) and two hypothesis testing approaches (null hypothesis statistical significance testing [NHST] and Bayesian). Chapter 8 illustrates analyses with one variable, including analyses with a nominal variable and analyses with a numerical variable. **Chapter 9** deals with studies in which the interest is to investigate the effect of a nominal variable on a numerical variable, and it includes between-subjects designs and within-subjects designs. **Chapter 10** presents studies in which the interest is to investigate the role of two nominal variables on a numerical variable in between-subjects designs and within-subjects designs. **Chapter 11** explains studies in which the independent variable(s) is nominal or numerical, and the dependent variable is of the same type. In **chapter 12** we are going to see hierarchical models. Finally, **chapter 13** integrates the knowledge acquired throughout the voyage in order to lead us to a safe **landing**.

## Pedagogical elements of this book

Throughout the book I will use 5 pedagogical elements:

- Datasets.
- Step-by-step approach.
- Graphical approach.
- Questions.
- YouTube videos.

### Datasets

In **this link** you can download the datasets that will be used throughout the book.

### Step-by-step approach

Given that the focus of this book is understanding, I spend a lot of time on explaining the rationale of things, instead of just showing you how to do something. This is because, when you understand the rationale of the different analytic techniques you will be able to decide which technique is more appropriate in each situation.

Because of that, I explain things slowly, adding a bit of information step by step, so you do not get lost in the process.

### Graphical approach

I love graphs because they help you understand things. So expect to find lots and lots of graphs. This is related to the focus on understanding. I use graphs to explain complex procedures.

### Questions

Each chapter contains 10 questions that will help you determine whether you are understanding the material or you need to read it again.

### YouTube videos

Each chapter contains one or two YouTube videos, some created by me, some created by others. The videos explain concepts or show you how to use statistical software to conduct analyses.

So, we are done with the introduction! I hope you enjoy the rest of the book!!

Scientists use ten strategies to conduct research. The goal of these strategies is to acquire knowledge about how things work in the world. Scientists organise the myriad of things that happen in the world into variables and they propose models of how variables are causally linked. They then test their models by obtaining representative samples, and measure individuals with valid tools in order to make inferences about populations based on statistical models. Scientists revise their models and continue a cycle of model refining and data collection to know a bit more about the world in each cycle.

- Variable
- Value
- Population
- Representative sample
- Probabilistic causal model
- Statistical model
- Research design
- Inference
- Research cycle
