How to Be Critical of Research
From day one, a key foundation of my content has been respecting your intelligence. Whether it’s writing a balanced article on Gaming Disorder or critiquing my own research, I always aim to equip people with the information to develop their own informed opinions. In a world where you may not know which content to trust, it is important to be critical and strive to be informed.
In academia, research papers have traditionally been locked behind paywalls unless you are a member of an academic institution such as a university. However, this is changing as more researchers publish their work open-access (Grant, 2017). As the name suggests, open-access research papers can be read by everyone for free. Open-access papers have been shown to be read more than paywalled papers (Jump, 2014), meaning researchers’ hard work gets more recognition and people get to enjoy research for free.
While you may have an increasing number of research papers at your disposal, the skills and knowledge to be critical of these studies are usually locked behind years of studying and thousands in student debt. As I want to help you be the most informed and knowledgeable person you can be, this article will be dedicated to teaching you some ways of being critical of research without accruing thousands in student debt. In keeping with the theme of the website, I will be using examples from video game research to demonstrate my points.
As usual, there will be a summary below if you do not wish to read everything. Thank you and please enjoy!
Contents
- Who Was Studied?
- How the Study Was Conducted
- Statistics and Data Analysis
- Peer Review
- Two Contradictory Studies?
- Summary
- References
Who Was Studied?
When looking at who was studied in a piece of research, the three key questions are ‘how?’, ‘who?’ and ‘how many?’.
How?
The ‘how?’ of a sample refers to how people were selected to participate in a study. When we are researching something as complex as human behaviour, we ideally want a sample that reflects the variety of human beliefs, behaviours and circumstances. Capturing this variety is why we typically want our sample to be gathered randomly.
Let’s say someone is interested in researching the effects of violent video games in teenagers from Fakeville. A good way to pick a random sample would be to make a spreadsheet of all high schools in Fakeville and pick a number of high schools randomly. The theory behind this is that picking these high schools randomly will give us a variety of students from different backgrounds and socioeconomic statuses. This random selection procedure was used in the Health Behaviours in School-Aged Children (HBSC) study (Currie et al., 2000), the data that I used for my violence and video games data analysis.
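If it helps to see the idea as code, here is a minimal sketch of random selection in Python; the school names and the number of schools drawn are invented for illustration and are not taken from the HBSC study.

```python
import random

# An invented list of every high school in Fakeville (illustrative names only)
fakeville_schools = [f"Fakeville High School #{i}" for i in range(1, 41)]

random.seed(2018)  # fixed seed so the example is reproducible
# Every school has an equal chance of selection, so no particular type of
# school (wealthy, rural, inner-city and so on) is systematically favoured
selected_schools = random.sample(fakeville_schools, k=5)
print(selected_schools)
```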
So what other ways are there of picking study participants? One way of recruiting participants is volunteer sampling: simply letting participants sign themselves up. One problem with this is that it can lead to a ‘Teacher’s Pet’ effect: volunteer sampling may only recruit super helpful people who want to please you. If you are researching video games and youth violence, you might not get rich data on violence and aggression from kind, helpful sweethearts.
Conversely, the researcher themselves may be in charge of the study’s sample through opportunity sampling. Opportunity sampling involves picking whoever is available to study at that moment in time. An example would be a researcher entering a classroom and picking ten students. While this looks like a random sample, it is not, as the researcher can be biased in their selection. For example, the researcher might select students who are making eye contact with them or implicitly select students who look aggressive. Non-random participant samples can result in a lack of generalisability, which I will discuss in more detail below.
Who?
The ‘who?’ of a sample refers to who was recruited to participate in a study. This is important as we want to make sure our participants reflect the people that we want to apply the study’s findings to. It isn’t appropriate for us to sample 20-25 year olds and use the results to talk about youth violence. If we sample a group of people that isn’t representative of our target population, we end up with a lack of generalisability. Let me illustrate this with two examples from video game research.
In Yang et al.’s research (2014), the effects of playing as a black video game character were assessed in relation to racist beliefs and aggression. However, this study cannot be generalised to ‘gamers’ as a target population as only white participants were included in the study. Gamers represent a wide variety of ethnicities, so the findings of this study cannot be applied to populations outside of white gamers. The validity of these findings will also be discussed in the data analysis section.
A second example of a generalisability problem may be one you are already familiar with: the BullyHunters fiasco, which claimed that ‘around three million women have stopped playing games altogether because of harassment’. This statement came from ‘projections’ based on all women who own a console in America. However, the claim came from a study whose largest sample base was PC gamers (Matthew, 2012), meaning that its generalisability to console users is limited. The ‘three million women’ statement will be discussed in more detail in the data analysis section.
This study was also conducted using volunteer sampling rather than random sampling. Alongside the ‘Teacher’s Pet’ effect, volunteer sampling tends to attract people who have strong beliefs or experiences relating to the subject matter. While these experiences are valid and deserve to be heard, the result can be reported incidence rates that are higher than those in the target population, limiting the findings’ generalisability.
How Many?
A common criticism I have seen of research in non-academic spaces is that the sample size is ‘too small’. It seems that people have enough understanding of research to know that a small sample size isn’t ideal, but not why it isn’t ideal.
Entire papers have been written detailing the negatives of small sample sizes (Faber & Fonseca, 2014; Hackshaw, 2008). To keep things brief, I will discuss two reasons.
Let’s say you’re interested in the best-selling games of 2018. You survey ten people and write an article titled ‘The Best-Selling Games of 2018’. I’m sure you would have many criticisms on seeing this, such as how ten people are unlikely to capture the variety of platforms or genres in gaming. This is a fair analogy for the problem with small sample sizes as a whole. While we are interested in understanding human thoughts, beliefs and behaviours, a small sample size reduces our likelihood of capturing the wide variety of these phenomena.
However, there is something much more sinister about a small sample size.
While I will be discussing statistics in more detail down below, I would like to introduce something called ‘statistical power’. Statistical power refers to the likelihood that a study will detect a relationship that genuinely exists. A study with low power is prone to what is known as a Type II error: failing to find a relationship between x and y that is really there. Low power also means that when a study does report a significant relationship, that finding is more likely to be a fluke or an exaggeration of the true effect.

Something that lowers statistical power is having a small sample size. This means that if you spot a research paper with a small number of participants, the relationship that they claim to have found could well have been found by accident, and any real relationships may have been missed entirely. There are online tools that help you calculate the number of participants required to achieve adequate statistical power. If you see a study with a small sample size that references making this calculation, it is fairer to trust the statistics that come afterwards.
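To make this concrete, here is a minimal sketch of the kind of calculation those online tools perform, using the statsmodels library in Python. The effect size, significance level and power values are common illustrative defaults rather than figures from any real study.

```python
from statsmodels.stats.power import TTestIndPower

# A priori sample size calculation for comparing two groups (e.g. gamers vs non-gamers)
analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,  # assumes a 'medium' standardised difference between the groups
    alpha=0.05,       # the conventional significance threshold (see the p value section)
    power=0.80,       # an 80% chance of detecting the effect if it really exists
)
print(round(n_per_group))  # roughly 64 participants per group
```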
How the Study Was Conducted
There are two types of studies that people may be most familiar with. The first type involves trying to directly measure what we are interested in within a controlled environment (known as ‘artificial’ or ‘laboratory’ experiments). The second type involves collecting data on beliefs, attitudes, behaviours or events that the researcher isn’t directly trying to manipulate (known as ‘survey’ or ‘questionnaire’ studies). I will discuss the pros and cons of both types using examples from video game research.
Artificial Experiments
To talk about artificial experiments, I will use Hollingdale and Greitemeyer’s (2014) study on video games and aggression. In this study, the researchers wanted to directly measure the effect of video games on aggression by trying to heighten aggression using Call of Duty. Aggression was then measured by asking participants to pour out a portion of hot sauce for someone who dislikes spicy food.
Artificial experiments are praised for coming as close to a cause-effect relationship as social science can permit. In this artificial setting, there is just you and the game. If you are in a controlled environment and you become more ‘aggressive’ after playing a video game, this is evidence for the cause-effect relationship between video games and aggression.
When it comes to measuring things like violence and aggression in an artificial experiment, I cannot stress enough how much legal and ethical red tape is involved. Ethics committees oversee research in order to keep participants safe, healthy and happy, and researchers cannot under any circumstances harm participants or encourage them to harm other participants. As a result, some research topics are a poor fit for artificial experiments, which brings us to something known as ecological validity.
Ecological validity refers to whether the methods used to gather data on a behaviour accurately reflect that behaviour in real life. If you looked at the hot sauce methodology and thought ‘Wait a second, that’s not very representative of violence or hurting someone in real life’, you’d be right. Other ways of measuring aggression in artificial video game experiments include the length of time spent playing a clip of static noise to a computer (Hasan et al., 2013). Because of the ethical limitations of research, researchers must find convoluted methods of measuring and replicating aggressive behaviour. However, these methods end up being low in ecological validity – they do not accurately reflect aggression or violence in real life.
While artificial experiments seem impressive because a relationship between x and y can be found in a controlled environment, I always encourage people to examine the methods section to see if data was gathered in an ecologically valid manner.
Survey Studies
If you are interested in a sensitive topic such as violence, you can bypass the ethical and legal red tape around causing harm by collecting data on harm that has already happened. To do this in a manner that gives you numerical data, you can provide people with questionnaires to fill out.
The downside of survey studies is that you lose the cause-effect dynamic of artificial experiments. However, if you are indeed dealing with something such as violence, you end up with data that is much more ecologically valid. Instead of trying to justify why a measure of hot sauce is representative of getting into a fist fight, you can just ask people about their history of fist fights.
When dealing with questionnaire data, you can run a very simple statistical analysis to check whether the questions in your questionnaire hang together as a reliable measure of the same thing. This statistic is known as Cronbach’s alpha (α). The higher the α value, the more reliable our questionnaire is. Scores lower than 0.5 are considered to be poor, while scores of 0.8 and above are ideal (DeVellis, 2012).
Let’s say I developed a questionnaire looking to measure video games and aggression, featuring questions such as ‘I have thrown a controller while playing a game’. If I included a question that doesn’t belong with the others (as an extreme example, ‘video games make me sneeze’), checking how α changes when each question is removed will flag it as one that drags the scale’s reliability down. This can help us develop optimised questionnaires that measure what we want to measure. As this analysis is so simple and takes so little space to report, you have the right to be sceptical of research papers that do not report the α value of their questionnaires.
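For the curious, here is a minimal sketch of how α is calculated, using fabricated responses rather than data from any real questionnaire. It applies the standard formula, which only needs the number of questions, the variance of each question and the variance of participants’ total scores.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """Cronbach's alpha for a matrix shaped (participants, questionnaire items)."""
    item_variances = item_scores.var(axis=0, ddof=1)      # variance of each question
    total_variance = item_scores.sum(axis=1).var(ddof=1)  # variance of total scores
    k = item_scores.shape[1]                              # number of questions
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Fabricated 1-5 agreement ratings from six participants on four hypothetical questions
responses = np.array([
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [3, 3, 4, 3],
    [2, 1, 2, 2],
])
print(round(cronbach_alpha(responses), 2))  # close to 1, so these items hang together well
```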
When participants fill out questionnaires for a study, the data gathered is known as self-report data. If data is gathered anonymously, it is argued that this self-report data will be truthful: under the cloak of anonymity, people have no reason to lie and face no consequences for being honest. However, it is still possible for self-report data to be unintentionally inaccurate.
An example of this can be seen in Carras and Kardefelt-Winther’s research (2018). When exploring the proposed criteria for Internet Gaming Disorder, they found that some people self-reported being more addicted to video games than they actually were. While self-report data is great for avoiding harm and being ecologically valid, unfortunately it’s not always 100% accurate.
Statistics and Data Analysis
Data analysis is easily the hardest part of trying to convey research findings to general audiences. Not only is it challenging to break down for people who aren’t researchers, but it can even be difficult to explain to budding researchers. During my time lecturing in statistics, I’ve seen my fair share of temper tantrums and have even witnessed people slam down their notes in frustration. I can’t turn you into statisticians overnight, but I can give you two helpful pieces of advice on how to use and understand statistics.
Extrapolation
For the first lesson, let’s return to the statement that ‘three million women have stopped playing games altogether because of harassment’. Let’s break down how that statement emerged. Setting aside the fact that the statement changed the findings of the original study (‘stopped playing games altogether’ vs ‘stopped playing a game’), roughly 10% of participants (84 out of 874) reported this. That 10% rate was then applied to all console-owning women in America, producing the figure of three million women.
The practice of applying prevalence rates to the target or general population is known as extrapolation. Let’s say I’m interested in testing reaction times between people who do and do not play video games. It is impossible for me to gather data from every person in existence who does and does not play games. Instead, I (ideally) take a random sample from both groups, conduct tests and analyse their data. Having taken random samples from both target populations, I can extrapolate and make inferences about the rate of behaviour in the target populations.
The key word here is ‘inferences’. If I found that 70% of gamers had better reaction times than non-gamers, it is fair for me to infer that video games have the potential to improve reaction times in up to 70% of those who play them. What is not fair for me to do is state something like ‘700 million gamers have better reaction times than non-gamers’. This is unscientific and an improper use of statistics. Because of examples such as this, I always encourage people to examine participant numbers for themselves in case of any faulty extrapolation.
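To see why this is an inference rather than a headcount, here is the harassment figure recalculated with a confidence interval in Python. The 84 and 874 come from the survey discussed above; the 30 million console-owning women is simply the round population figure implied by the ‘10% equals three million’ claim, used here for illustration only.

```python
from statsmodels.stats.proportion import proportion_confint

affected, sample_size = 84, 874
rate = affected / sample_size  # roughly 0.096, i.e. about 10% of the sample

# 95% confidence interval for the sample proportion (Wilson method)
low, high = proportion_confint(affected, sample_size, alpha=0.05, method="wilson")

population = 30_000_000  # illustrative figure implied by the original claim
print(f"Sample rate: {rate:.1%} (95% CI {low:.1%} to {high:.1%})")
print(f"Naive extrapolation: {rate * population:,.0f} women")
# Even this range still assumes the sample represents console owners, which it does not.
```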
The p Value
The second lesson involves one of the key foundations of data analysis. When determining whether there is a relationship between variables (such as number of friends and time spent playing games), you may see the phrase ‘significant relationship’, ‘significant difference’ or some sort of phrase mentioning the word ‘significant’. So what makes things ‘significant’?
For significance testing, we turn to the p value. There’s no real secret about what the p value is – ‘p’ simply stands for probability. When we test a relationship between two factors, the p value tells us how likely we would be to see a relationship at least this strong purely by chance if, in reality, there were no relationship at all.
When it comes to deciding how unlikely is unlikely enough, scholars have debated for decades about where the threshold should sit. The widely accepted convention is 5%: if a result this strong would turn up by chance less than 5% of the time when no real relationship exists, we call it statistically significant. In research papers, this will look like ‘p < .05’. While there are some rare occasions where we want our p greater than .05 (such as the Chi-Square Model Fit test for Structural Equation Models), a lot of the time we want to see p < .05 and don’t want to see p > .05.
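If you would like to see where a p value actually comes out of an analysis, here is a minimal sketch using invented numbers for friends and weekly hours played; the data exist purely to show the mechanics, not to suggest any real relationship.

```python
from scipy import stats

# Invented data: number of friends and weekly hours spent playing games for ten people
friends = [3, 7, 5, 9, 4, 8, 6, 2, 10, 5]
hours_played = [4, 9, 6, 12, 5, 11, 8, 3, 13, 7]

r, p = stats.pearsonr(friends, hours_played)
print(f"r = {r:.2f}, p = {p:.3f}")
# If p < .05, a correlation this strong would rarely appear by chance alone
# when no real relationship exists, so it would be reported as 'significant'.
```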
Let’s revisit some video game research, particularly Yang et al.’s research on video games and racism. In this study, it was stated that a significant relationship was found between playing as a black avatar and having negative attitudes towards black people. When you read the results section, you will see that the p value is 0.054. This exceeds the widely accepted threshold of 0.05, so by convention the finding should not be called significant. The authors have said that there is a relationship between playing as a black character and racist beliefs, but many other academics, myself included, would reject this claim.
If this finding violates an established rule of data analysis, why did it get published? To answer this, I would like to shed some light on something you may have heard of, but may not fully understand – peer review.
Peer Review
When talking about the gold standard of academic research, you may have come across the phrase ‘peer-reviewed academic journal’. People may recognise that there is value behind a peer-reviewed piece of research, but what exactly is peer review?
Peer review involves submitting your research article to an academic journal to be reviewed by one or more academics. The academic(s) chosen to review the paper will be as close to experts in the relevant research field (such as video games and mental health) as possible. If the article survives the scrutiny of these experts, it can then be published and read around the world. If experts in the field feel that you’ve done a good job and your research is an impressive contribution to the field, that is high praise indeed.
The belief behind peer review is that academic guardians permit only the highest quality research to enter the field after it has been subjected to rigorous scrutiny. While this is an admirable belief to strive for, academics recognise that peer review is a necessary but ultimately flawed system. Allow me to discuss some of these flaws.
Academics are Tired Mortals
Let’s return to the video game study that claimed statistical significance at p = .054. While many academics would disagree that this relationship is significant, the peer reviewer/s at the time deemed this finding to be acceptable. This point leads to the harshest reality of peer review: peer reviewers are subjective and imperfect mortals just like the rest of us. If a peer reviewer feels that the p < .05 standard is arbitrary gatekeeping nonsense, they are free to exercise this belief when reviewing research papers.
To add to their mortality, peer reviewers are also working academics. The same academics that have lectures to deliver, essays and coursework to mark, PhD students to supervise, Masters students to supervise, undergraduate students to supervise, conferences to attend, the list goes on. On top of these responsibilities, peer reviewers may receive a bundle of research papers to review, for which they receive no financial compensation.
Two years ago, a research paper of mine was rejected in peer review because I didn’t include something. Except it was very, very clearly there. The peer reviewer seemed to be so tired that they either misread it or rushed my peer review, leading to my paper being rejected on false grounds. I later submitted it to another journal where I received a personal email from the Editor-in-Chief thanking me for such a high quality piece of work (/humblebrag).
I have no doubt that there are many wonderful peer reviewers out there. However, it is important to remember that a piece of research isn’t 100% perfect or error-free just because it was peer-reviewed; the process itself isn’t 100% perfect or error-free.
“Please Submit to Our Journal!”
When you submit a research paper for publication, one author must be nominated as the corresponding author (often the lead author). In this role, you are in charge of correspondence for the paper, and your email address is attached to the publication so that readers can contact you.
As a result of your email address being available online, it is rare for a week to pass without receiving some sort of email asking you to submit a paper to a journal. This can result in quite a paradoxical experience. While some journals work hard to maintain a high degree of integrity and publish only the highest quality research, other journals are practically pleading with you to submit a paper.
This points to another question worth asking of a peer-reviewed paper: how reputable is the journal it is published in? One way to check is something known as the ‘impact factor’. The impact factor is roughly the average number of citations that recent articles in the journal receive: total citations divided by the number of articles published over a set period. For example, popular journals such as The Lancet and Nature have impact factors of 53.254 and 41.577 respectively.
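As a rough sketch of the arithmetic behind the standard two-year impact factor, here is an example with invented citation and article counts; the official calculation includes some extra rules about which items count as ‘citable’.

```python
# Invented numbers for a hypothetical journal, purely to show the arithmetic
citations_in_2018 = 9500   # citations received in 2018 by articles published in 2016-2017
articles_2016_2017 = 250   # articles the journal published in 2016-2017

impact_factor = citations_in_2018 / articles_2016_2017
print(impact_factor)  # 38.0: recent articles were cited 38 times each on average
```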
Researchers will try to publish their research in high impact factor journals to maximise their visibility and citations. Due to the number of researchers competing for these journals, only the highest quality research is selected for publication. Because of this high quality control, readers know that the journal is of high repute and will be more likely to cite research published in it. As you can see, it’s quite a cyclical relationship between quality and impact factor.
The impact factor is always something you can check when you see a published paper that you feel to be of low quality. It is possible that the journal needs papers to fill its issue and/or has low quality control. If the paper is published in a low impact factor journal, it is unlikely to receive much attention or to be cited as a source in future research.
Two Contradictory Studies?
Let’s say you’re having a disagreement with someone and you use a piece of research to demonstrate your point. A few minutes later, the person returns the favour and links you to a research paper, this time with contradictory findings. Whose paper ‘wins’?
I wanted to include this section as I’ve previously written about the power of cognitive dissonance. Our brains are inclined to value any piece of research that supports our world view and dismiss any research that violates it. In this scenario, it is important for you to be critical of both studies with some guidance provided by this article.
Imagine you critiqued both papers and found that the research that you linked was superior: it had a larger sample size, reliable and valid methodology, and sound data analysis. Congratulations, you have engaged in a key part of the scientific method. In research, we call this ‘falsifiability’.
The idea behind falsifiability is that as researchers are critical beings, we shouldn’t be going out into the world looking to support our beliefs. Instead, we should be finding every opportunity to discredit them. If your belief remains strong and intact no matter how much you try to discredit it, that’s a sign that it’s a really, really solid belief.
However, let’s imagine that you critique both papers and find that they are equally good and valid pieces of research. What do you do then? Who ‘wins’?
When talking about who ‘wins’, it’s time to address the elephant in the room – whether psychology is a science. While there’s still debate in the field, the general consensus is that psychology can be considered a ‘pre-science’ (Kuhn, 1970). While psychological research abides by scientific principles such as falsifiability, there is an important component that separates it from a ‘normal science’.
When a contradictory research finding emerges, there isn’t a massive uprising in the field and there isn’t a rush to recall textbooks. Instead, the new research is allowed to co-exist in the literature alongside the old. Psychologists don’t rush to homogenise a field because human behaviour itself isn’t homogeneous. If we want to research the complexities of humanity, we have to acknowledge its complexity. It is entirely possible for research findings to differ by gender, age, location and other factors. It’s also possible for findings to differ because of how the study was conducted, but that’s where your newly-acquired critiquing skills come in handy!
Summary
- When it comes to selecting participants, participants who are selected randomly have the highest chances of representing a wide range of human thoughts, beliefs, behaviours, life circumstances and so on. Other participant selection methods may have their own biases. For example, volunteer sampling may result in sampling a high proportion of kind and helpful people, or people who have strong beliefs they would like to voice. Alternatively, opportunity sampling can lead to researchers picking a biased sample (e.g. picking people who look aggressive).
- It is always a good idea to check whether the sampled population fits the description of who the researchers are applying results to. For example, using a college-aged sample to talk about video games and youth violence may not be generalisable.
- Studies with a small sample size have low statistical power. This means they can miss relationships that genuinely exist (a Type II error), and any significant relationships they do report are more likely to be flukes.
- Artificial/laboratory experiments are good for trying to establish a cause-effect relationship. However, they can result in something known as a lack of ecological validity, meaning that how data is gathered isn’t representative of the behaviour in real life. In video games and aggression research, aggression has been measured through pouring out hot sauce and playing a loud noise to a computer.
- Surveys and questionnaire studies allow researchers to gather data that is more ecologically valid while sacrificing the cause-effect dynamic. It is also important to note that self-report data can sometimes be unreliable.
- In most statistical tests, we want our p value to be below 0.05, as it indicates that a relationship this strong would be very unlikely to appear by accident if no real relationship existed. While results sections of research papers may be challenging to read, I recommend at least verifying this for yourself.
- Peer review may be considered the gold standard for academic research, but it is not without its flaws. Peer reviewers are tired, biased humans just like the rest of us. This can result in things getting missed or things getting pushed through due to individual biases. Some journals may also have a low barrier to entry and will spam researchers with emails in order to receive papers. To identify journals such as this, consider looking at the ‘impact factor’ of a journal; the larger the impact factor, the better the reputation.
- If you come across two research papers that contradict one another, you can evaluate their merit based on some suggestions from this article. If they seem to have equal merit, it is totally fine for these results to co-exist. After all, people are complex and varied; it’s only natural for human research to represent the variety of humanity.
Thank you all very much for reading! This hard work would not be possible without the support of my wonderful Patrons. I would particularly like to thank my Platinum Patrons: Matt Demers, Albert S Calderon, Kyle T, Andrew Shirvis, redKheld, DigitalPsyche, Brent Halen, Colton Ballou, Dimelo ‘Derp’ Waterson, Hagbard Celine, Senpai, Aprou, Austin Enright, Dr Shane Tilton, SK120, Teodoro Elizondo and Stephen Gac. Thank you!
References
Carras, M. C., & Kardefelt-Winther, D. (2018). When addiction symptoms and life problems diverge: A latent class analysis of problematic gaming in a representative multinational sample of European adolescents. European Child & Adolescent Psychiatry, 27(4), 513-525.
Currie, C., Hurrelmann, K., Settertobulte, W., Smith, R., & Todd, J. (2000). Health and health behaviour among young people (Health Policy for children and Adolescents, No.1). Copenhagen, Denmark: WHO Regional Office for Europe.
DeVellis, R.F. (2012). Scale development: Theory and applications. Los Angeles: Sage.
Faber, J., & Fonseca, L. M. (2014). How sample size influences research outcomes. Dental Press Journal of Orthodontics, 19(4), 27-9.
Grant, B. (2017). Open Access On The Rise: Study. Retrieved 10th December, 2018 from https://www.the-scientist.com/daily-news/open-access-on-the-rise-study-31125.
Hackshaw, A. (2008). Small studies: strengths and limitations. European Respiratory Journal, 32, 1141-1143.
Hasan, Y., Bègue, L., Scharkow, M., & Bushman, B. J. (2013). The more you play, the more aggressive you become: A long-term experimental study of cumulative violent video game effects on hostile expectations and aggressive behavior. Journal of Experimental Social Psychology, 49(2), 224-227.
Hollingdale, J., & Greitemeyer, T. (2014). The effect of online violent video games on levels of aggression. PLOS ONE, 9(11), e111790.
Jump, P. (2014). Open access papers ‘gain more traffic and citations’. Retrieved 10th December, 2018 from https://www.timeshighereducation.com/home/open-access-papers-gain-more-traffic-and-citations/2014850.article#survey-answer.
Kuhn, T. (1970). The Structure of Scientific Revolutions (2nd ed.). Chicago: University of Chicago Press.
Matthew, E. (2012). Sexism in Video Games [Study]: There is Sexism in Gaming. Retrieved 8th December, 2018 from https://blog.pricecharting.com/2012/09/emilyami-sexism-in-video-games-study.html.
Yang, G. S., Gibson, B., Lueke, A. K., Huesmann, R., & Bushman, B. J. (2014). Effects of Avatar Race in Violent Video Games on Racial Attitudes and Aggression. Social Psychological and Personality Science, 5(6), 698-704.