New paper: What happens to an individual visual working memory representation when it is interrupted?

Bae, G.-Y., & Luck, S. J. (2018). What happens to an individual visual working memory representation when it is interrupted? British Journal of Psychology. https://onlinelibrary.wiley.com/doi/full/10.1111/bjop.12339

Working memory is often conceived as a buffer that holds information currently being operated upon. However, many studies have shown that fairly complex tasks (e.g., visual search) can be interposed during the retention interval of a change detection task with minimal interference (and especially minimal load-dependent interference). One possible explanation is that the information from the change detection task can be held in some other form (e.g., activity-silent memory) while the interposed task is being performed.  If so, this might be expected to have subtle effects on the memory for the stimulus.

To test this, we had subjects perform a delayed estimation task, in which a single teardrop-shaped stimulus was held in memory and was reproduced at the end of the trial (see figure below). A single letter stimulus was presented during the delay period on some trials. We asked whether performing a very simple task with this interposed stimulus would cause a subtle disruption in the memory for the teardrop's orientation.  In some trial blocks, subjects simply ignored the interposed letter, and we found that it produced no disruption of the memory for the teardrop. In other trial blocks, subjects had to make a speeded response to the interposed letter, indicating whether it was a C or a D. Although this was a simple task, and only a single object was being maintained in working memory, the interposed stimulus caused the memory of the teardrop to become less precise and more categorical.

Thus, performing even a simple task on an interposed stimulus can disrupt a previously encoded working memory representation. The representation is not destroyed, but becomes less precise and more categorical, perhaps indicating that it had been offloaded into a different form of storage while the interposed task was being performed. Interestingly, we did not find this effect when an auditory interposed task was used, consistent with modality-specific representations.

Interruption_Paradigm.jpg

How to p-hack (and avoid p-hacking) in ERP research

Luck, S. J., & Gaspelin, N. (2017). How to Get Statistically Significant Effects in Any ERP Experiment (and Why You Shouldn’t). Psychophysiology, 54, 146-157.

How to get a significant effect.jpg

In this article, we show how ridiculously easy it is to find significant effects in ERP experiments by using the observed data to guide the selection of time windows and electrode sites. We also show that including multiple factors in your ANOVAs can dramatically increase the rate of false positives (Type I errors). We provide some suggestions for methods to avoid inflating the Type I error rate.
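
As a rough illustration of the multiple-factor point (this is my own toy simulation sketch, not one of the simulations reported in the paper), consider a fully null between-subjects 2 × 2 × 2 design. A standard ANOVA produces seven p values per experiment (three main effects, three two-way interactions, and the three-way interaction), so the chance that at least one of them falls below .05 is roughly 1 - .95^7 ≈ 30%, even though nothing real is going on:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
n_sims, n_per_cell, false_alarms = 500, 10, 0
cells = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]

for _ in range(n_sims):
    # Pure noise: no factor has any true effect on y
    df = pd.DataFrame(
        [(a, b, c, rng.normal()) for (a, b, c) in cells for _ in range(n_per_cell)],
        columns=["A", "B", "C", "y"],
    )
    aov = anova_lm(smf.ols("y ~ C(A) * C(B) * C(C)", data=df).fit(), typ=2)
    if (aov["PR(>F)"].drop("Residual") < 0.05).any():
        false_alarms += 1  # at least one of the 7 effects was "significant" by chance

print(false_alarms / n_sims)  # roughly 0.30, not 0.05
```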

This paper was part of a special issue of Psychophysiology on Reproducibility edited by Emily Kappenman and Andreas Keil.

Some thoughts about the hypercompetitive academic job market

Many young academics are (justifiably) stressed out about their career prospects, ranging from the question of whether they will be able to get a tenure-track position to whether they will be able to publish in top-tier journals, get grants, get tenure, and do all of this without going insane.  Life in academia has been challenging for a long time, but the level of competition seems to be getting out of control. The goal of this piece is to discuss some ideas from population biology that might help explain the current state of hypercompetition and perhaps shed light on what kinds of changes might be helpful (or unhelpful).

Here’s the problem in a nutshell: if we want to provide a tenure-track faculty position for every new PhD who wants one, the number of available positions would need to increase exponentially with no limit.  This is shown in the graph below.  

Growth1.jpg

If we assume that a typical faculty member has a couple grad students at any given time, and most of them want jobs in academia, this faculty member will have a student who graduates and wants a faculty position approximately every three years.  As a result, we would need to create a new faculty position approximately every three years just to keep up with the students from a single current faculty member.  As if this wasn’t bad enough, these recent PhDs will then get their own grad students, who will also need faculty positions. This leads to an exponential growth in the number of positions needed to fill the demand.  

For example, if we have 1000 positions in a given field in the year 2018, we will need another 1000 positions in that field by the year 2021 to accommodate the new students who have received their PhDs by that time, leading to a total of 2000 positions to accommodate the demand that year.  The faculty in these 2000 positions will have students who will need another 2000 positions by the year 2024, leading to a total need for 4000 positions that year.  

If the number of positions kept increasing over time to fill the demand, we would need over a million positions by the year 2048!  This doesn’t account for retirements, etc., but those factors have a very small effect (unless we start forcing faculty to retire when they reach the age of 40 or some such thing).  There are various other assumptions here (e.g., a new PhD every 3 years), but virtually any realistic set of parameters will lead to an exponential or nearly-exponential growth function.
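
For anyone who wants to check that arithmetic, here is a trivial sketch of the projection (under the same simplifying assumptions as above: 1000 positions in 2018 and a doubling of demand every three years):

```python
# Positions needed if demand doubles every 3 years, starting from 1000 in 2018
positions = 1000
for year in range(2018, 2049, 3):
    print(year, positions)
    positions *= 2
# The 2048 entry is 1000 * 2**10 = 1,024,000 positions -- over a million.
```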

This is just like the exponential increase you might see in the size of a population of organisms, with a rate factor (r) that describes the rate of reproduction.  However, an exponential increase can happen only if reproduction is not capped by resource limitations.  Resource limitations lead to a maximum population size, which population biologists call K (for the “carrying capacity” of the environment).  When the exponential growth with rate r is combined with carrying capacity K, you get a logistic function.  This is shown in the picture below (from Khan Academy), which illustrates the growth of a population of organisms with no limit on the population size (the exponential function on the left) and with a limit at K (the logistic function on the right).

Population Growth.png

At early time points, the two functions are very similar: K doesn’t have much impact on the rate of growth in the logistic function early in time, and growth is mainly limited by r (the replication rate).  This is called “r-limited” growth.  However, later in time, the resource limitations start impacting the rate of growth in the logistic function, and the population size asymptotes at K.  This is called “K-limited” growth.  It’s much nicer to live in a period of r-limited growth, when there are plenty of resources.  When growth is K-limited, this means that the organisms in the population have so few resources that they die before they can reproduce, or are so hungry they can’t reproduce, or their offspring are so hungry they can’t survive, etc.  Not a very pleasant life.
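
Here is a minimal sketch of the two growth curves being contrasted (the equations are the standard textbook forms; the particular values of r, K, and the starting size are arbitrary choices for illustration, not estimates for any real field):

```python
import numpy as np

r = np.log(2) / 3     # reproduction rate equivalent to doubling every 3 years
K = 10_000            # arbitrary carrying capacity
N0 = 1000             # starting population (or number of positions)
t = np.arange(0, 31)  # years after 2018

exponential = N0 * np.exp(r * t)                       # dN/dt = r * N
logistic = K / (1 + ((K - N0) / N0) * np.exp(-r * t))  # dN/dt = r * N * (1 - N/K)

for i in range(0, 31, 6):
    print(2018 + i, round(exponential[i]), round(logistic[i]))
# Early on the two curves are nearly identical (r-limited growth); later the
# logistic curve flattens out near K (K-limited growth).
```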

In academia, r-limited growth means that jobs are plentiful, and the main limitations on growth are the number of students per lab and the rate at which they complete their degrees.  By contrast,  K-limited growth basically means that a faculty member needs to die or retire before a new PhD can get a position, and only a small fraction of new PhDs will ever get tenure-track jobs and start producing their own students.  This also means that the competition for tenure-track jobs and research grants will be fierce.  Sound familiar?

In the context of academia, K represents the maximum number of faculty positions that can be supported by the society.  The maximum number of faculty positions might increase gradually over time, as the overall population size increases or as a society becomes wealthier.  However, there is no way we can sustain an exponential growth forever (especially if that means we need over a million positions by 2048 in a field that has only a thousand positions in 2018).  

I think it’s pretty clear that we’re now in a K-limited period, where the number of positions is increasing far too slowly to keep up with the demand for positions from people getting PhDs.  When I was on the job market in the early 1990s, there were already more people with PhDs than available faculty positions.  However, the problem of an oversupply of PhDs was partially masked by an increase in the availability of postdoc positions.  Also, it was becoming more common for faculty at “second-tier” universities to conduct and publish research, so the actual number of positions that combined research and teaching was increasing.  But this balloon has stretched about as far as it can, and highly qualified young scholars are now having trouble getting the kind of position they are seeking (and we’re seeing 200+ applicants for a single position in our department).

In addition to a limited number of tenure-track faculty positions, we have a limited amount of grant money.  In some departments and subfields, getting a major grant is required for getting tenure.  Even if this isn’t a formal requirement, the resources provided by a grant (e.g., funding for grad students and postdocs) may be essential for an assistant professor to be sufficiently productive to receive tenure.  But an increase in grant funding without a commensurate increase in permanent positions can actually make things worse rather than better.  We saw that when the NIH budget was doubled between 1998 and 2003.  This led to an increase in funding for grad students and postdocs (leading to the balloon I mentioned earlier).  However, without an increase in the number of tenure-track faculty positions, there was nowhere for these people to go when they finished their training.  Their CVs were more impressive, but this just increased the expectations of search committees.  Also, a lot of the increased NIH funding was absorbed by senior faculty (like me) who now had 2, 3, or even 4 grants instead of just 1.  As usual, the rich got richer.

One might argue that competition is good, because it means that only the very best people get tenure-track positions and grants.  And I would be the first person to agree that competition can help inspire people to be as creative and productive as possible.  However, the current state of hypercompetition clearly has a dark side.  Some people write tons of grants, often with little thought, in the hopes of getting lucky.  This can lead to poorly-conceived projects, and it can leave people with little time to think about and actually conduct high-quality research.  And it can lead to p-hacking and other questionable research practices, or even outright fraud.  I think we’re way beyond the point at which the level of competition is beneficial.

Now let’s talk about solutions.  Should we increase the number of tenure-track faculty positions at research universities? I would argue that any solution of this nature is doomed to failure in the long run.  Increasing the number of positions is an increase in K, and this just postpones the point at which the job market becomes saturated.  It would certainly help the people who are seeking a position now, but the problem will come back eventually. There just isn’t a way for the number of positions to increase exponentially forever.

We could also try to limit the number of students we accept into PhD programs.  This would be equivalent to decreasing r, the rate of “reproduction.”  However, for this to fully solve the problem, we would need the “birth rate” (number of new PhDs per year in a field) to equal the “death rate” (the number of retirements per year in the field).  Here’s another way to look at it: if the number of positions in a field remains constant, a given faculty member can expect to place only a single student in a tenure-track position over the course of the faculty member’s entire career.  Is it realistic to restrict the number of PhD students so that faculty can have only one student per career?  Or even one per decade?  Probably not.

I have only one realistic idea for a solution, which is to create more good positions for PhDs that don’t involve “reproduction” (i.e., training PhD students).  For example, if there were good positions outside of academia for a large number of PhDs, this would reduce the demand for tenure-track positions and decrease r, the rate of reproduction (assuming that there would be fewer people “spawning” new students as a result).  Tenure-track positions at teaching-oriented institutions have the same effect (as long as these institutions don’t decide to start granting PhDs).   I don’t think it’s realistic to increase the number of these teaching-oriented positions (except insofar as they increase with overall changes in population size).  However, in many areas of the mind and brain sciences, it appears that the availability of industry positions could increase substantially.  Indeed, we are already seeing many of our students and postdocs take jobs at places like Google and Netflix.

Many faculty in research-oriented universities think that success in graduate school means getting a tenure-track faculty position in a research-oriented university.  However, if I’m right that the current K-limited growth curve—and the associated hypercompetition—is a major problem, then we should place a much higher value on industry and teaching positions.  The availability of these positions will mean that we can continue to have lots of bright graduate students in our labs without dooming them to work as Uber drivers after they get their PhDs.  And teaching positions are intrinsically valuable: A great teacher can have a tremendous positive impact on thousands of students over the course of a career.

This doesn’t mean that we should focus our students’ training on teaching skills and data science skills, especially when these are not our own areas of expertise.  Excellent research training will be important for both industry positions and teaching-oriented faculty positions.  But we should encourage our students to think about getting some significant training in teaching and/or data science, which will be important even if they take positions in research-oriented universities.  And we should encourage some of our students to take industry internships and get teaching experience.  But mostly we should avoid sending the implicit or explicit message to our students that they are failures if they don’t pursue tenure-track research university positions.  If, as a field, we increase the number of our PhDs who take positions outside of research universities, this will make life better for everyone.

VSS Poster: An illusion of opposite-direction motion

At the 2018 VSS meeting, Gi-Yeul Bae will be presenting a poster describing a motion illusion that, as far as we can tell, has never before been reported even though it has been "right under the noses" of many researchers.  As shown in the video below, this illusion arises in the standard "random dot kinematogram" displays that have been used to study motion perception for decades. In the standard task, the motion is either leftward or rightward. However, we allowed the dots to move in any direction in the 360° space, and the task was to report the exact direction at the end of the trial.

In the example video, the coherence level is 25% on some trials and 50% on others (i.e., on average, 25% or 50% of the dots move in one direction, and the other dots move randomly). A line appears at the end of the trial to indicate the direction of motion for that trial.  When you watch a given trial, try to guess the precise direction of motion.  If you are like most people, you will find that you guess a direction that is approximately 180° away from the true direction on a substantial fraction of trials.  You may even see the motion start in one direction and then reverse to the true direction. We recommend that you maximize the video and view it in HD.
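
For readers who are curious about how displays like this are typically generated, here is a minimal sketch of one common way to implement the frame-by-frame update of a random-dot kinematogram (this is just an illustration of the general approach, not the code behind our actual displays; the number of dots, step size, and aperture handling are made-up values):

```python
import numpy as np

rng = np.random.default_rng(0)
n_dots, coherence, step = 200, 0.25, 0.02   # illustrative values only
signal_dir = np.deg2rad(45)                 # the "true" direction of motion

xy = rng.uniform(-1, 1, size=(n_dots, 2))   # dot positions in a square aperture

def update(xy):
    # On each frame, roughly `coherence` of the dots step in the signal direction;
    # the remaining dots step in independent random directions.
    coherent = rng.random(n_dots) < coherence
    angles = np.where(coherent, signal_dir, rng.uniform(0, 2 * np.pi, n_dots))
    xy = xy + step * np.column_stack([np.cos(angles), np.sin(angles)])
    return (xy + 1) % 2 - 1                 # crudely wrap dots that leave the aperture

for frame in range(60):                     # about one second of frames at 60 Hz
    xy = update(xy)
```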

In the controlled laboratory experiments described in our poster (which you can download here), we find that 180° errors are much more common than other errors. In addition, our studies suggest that this is a bona fide illusion, in which people confidently perceive a direction of motion that is the opposite of the true direction. If you know of any previous reports of this phenomenon, let us know!

New Paper: Combined Electrophysiological and Behavioral Evidence for the Suppression of Salient Distractors

Gaspelin, N., & Luck, S. J. (in press). Combined Electrophysiological and Behavioral Evidence for the Suppression of Salient Distractors. Journal of Cognitive Neuroscience.

Gaspelin-Pd.jpg

Evidence that people can suppress salient-but-irrelevant color singletons has come from ERP studies and from behavioral studies.  The ERP studies find that, under appropriate conditions, singleton distractors will elicit a Pd component, a putative electrophysiological signature of suppression (discovered by Hickey, Di Lollo, and McDonald, 2009). The behavioral studies show that processing at the location of the singleton is suppressed below the level of nonsingleton distractors (reviewed by Gaspelin & Luck, 2018).  Are these electrophysiological and behavioral signatures of suppression actually related?

In the present study, Nick Gaspelin and I used an experimental paradigm in which it was possible to assess both the ERP and behavioral measures of suppression.  First, we were able to demonstrate that suppression of the salient singleton distractors was present according to both measures.  Second, we found that these two measures were correlated: participants who showed a larger Pd also showed greater behavioral suppression.

Correlations like these can be difficult to find (and believe).  First, both the ERP and behavioral measures can be noisy, which attenuates the strength of the correlation and reduces power.  Second, spurious correlations are easy to find when there are a lot of possible variables to correlate and relatively small Ns.  A typical ERP session is about 3 hours, so it's difficult to have the kinds of Ns that one might like in a correlational study.  To address these problems, we conducted two experiments.  The first was not well powered to detect a correlation (in part because we had no idea how large the correlation would be, making it difficult to assess the power). We did find a correlation, but we were skeptical because of the small N.  We then used the results of the first experiment to design a second experiment that was optimized and powered to detect the correlation, using an a priori analysis approach developed from the first experiment.  This gave us much more confidence that the correlation was real.
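
To illustrate the first point about noisy measures, here is a quick simulation sketch (with made-up numbers, not values from our data): two variables that are truly correlated at r = .6, each observed through measurement noise comparable in size to the true scores, yield an observed correlation of only about .3 at a typical ERP-study sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_r, n_sims = 30, 0.6, 2000   # hypothetical sample size and true correlation

obs_rs = []
for _ in range(n_sims):
    latent = rng.normal(size=n)
    x_true = latent
    y_true = true_r * latent + np.sqrt(1 - true_r**2) * rng.normal(size=n)
    # Add measurement noise as large as the true scores (reliability of about .5 each)
    x_obs = x_true + rng.normal(size=n)
    y_obs = y_true + rng.normal(size=n)
    obs_rs.append(np.corrcoef(x_obs, y_obs)[0, 1])

print(np.mean(obs_rs))  # roughly .3: noise attenuates the .6 true correlation
```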

Gaspelin-Pd-Exp3.jpg

We also included a third experiment that was suggested by the always-thoughtful John McDonald. As you can see from the image above, the Pd component was quite early in Experiments 1 and 2. Some authors have argued that an early contralateral positivity of this nature is not actually the suppression-related Pd component but instead reflects an automatic salience detection process.  To address this possibility, we simply made the salient singleton the target.  If the early positivity reflects an automatic salience detection process, then it should be present whether the singleton is a distractor or a target.  However, if it reflects a task-dependent suppression mechanism, then it should be eliminated when subjects are trying to focus attention onto the singleton. We found that most of this early positivity was eliminated when the singleton was the target. The very earliest part (before 150 ms) was still present when the singleton was the target, but most of the effect was present only when the singleton was a to-be-ignored distractor. In other words, the positivity was not driven by salience per se, but occurred primarily when the task required suppressing the singleton.  This demonstrates very clearly that the suppression-related Pd component can appear as early as 150 ms when elicited by a highly salient (but irrelevant) singleton.

An old-school approach to science: "You've got to get yourself a phenomenon"

Given all the questions that have been raised about the reproducibility of scientific findings and the appropriateness of various statistical approaches, it would be easy to get the idea that science is impossible and we haven't learned a single thing about the mind and brain. But that's simply preposterous.  We've learned an amazing amount over the years.

In a previous blog post (and follow-up), I mentioned my graduate mentor's approach, which emphasized self-replication. In this post, I go back to my intellectual grandfather, Bob Galambos, whose discoveries you learned about as a child even if you didn't learn his name. I hope you find his advice useful. It's impractical in some areas of science, but it's what a lot of cognitive psychologists have done for decades and still do today (even though you can't easily tell from their journal articles).  I previously wrote about this in the second edition of An Introduction to the Event-Related Potential Technique, and the following is an excerpt. I am "recycling" this previous text because the relevance of this story goes way beyond ERP research.


Galambos.jpg

My graduate school mentor was Steve Hillyard, who inherited his lab from his own graduate school mentor, Bob Galambos (shown in the photo).  Dr. G (as we often called him) was still quite active after he retired.  He often came to our weekly lab meetings, and I had the opportunity to work on an experiment with him.  He was an amazing scientist who made really fundamental contributions to neuroscience.  For example, when he was a graduate student, he and fellow graduate student Donald Griffin provided the first convincing evidence that bats use echolocation to navigate.  He was also the first person to recognize that glia are not just passive support cells (and this recognition essentially cost him his job at the time).  You can read the details of his interesting life in his autobiography and in his NY Times obituary.

Bob was always a font of wisdom.  My favorite quote from him is this: “You’ve got to get yourself a phenomenon” (he pronounced phenomenon in a slightly funny way, like “pheeeenahmenahn”).  This short statement basically means that you need to start a program of research with a robust experimental effect that you can reliably measure.  Once you’ve figured out the instrumentation, experimental design, and analytic strategy that allows you to reliably measure the effect, then you can start using it to answer interesting scientific questions.  You can’t really answer any interesting questions about the mind or brain unless you have a “phenomenon” that provides an index of the process of interest.  And unless you can figure out how to record this phenomenon in a robust and reliable manner, you will have a hard time making real progress.  So, you need to find a nice phenomenon (like a new ERP component) and figure out the best ways to see that phenomenon clearly and reliably.  Then you will be ready to do some real science!

 

Why I've lost faith in p values, part 2

In a previous post, I gave some examples showing that null hypothesis statistical testing (NHST) doesn’t actually tell us what we want to know.  In practice, we want to know the probability that we are making a mistake when we conclude that an effect is present (i.e., we want to know the probability of a Type I error in the cases where p < .05).  A genetics paper calls this the False Positive Report Probability (FPRP).

However, when we use NHST, we instead know the probability that we will get a Type I error when the null hypothesis is true. In other words, when the null hypothesis is true, we have a 5% chance of finding p < .05.  But this 5% rate of false positives occurs only when the null hypothesis is actually true.  We don’t usually know that the null hypothesis is true, and if we knew it, we wouldn't bother doing the experiment and we wouldn’t need statistics.

In reality, we want to know the false positive rate (Type I error rate) in a mixture of experiments in which the null is sometimes true and sometimes false.  In other words, we want to know how often the null is true when p < .05.  In one of the examples shown in the previous post, this probability (FPRP) was about 9%, and in another it was 47%.  These examples differed in terms of statistical power (i.e., the probability that a real effect will be significant) and the probability that the alternative hypothesis is true [p(H1)].

The table below (Table 2 from the original post) shows the example with a 47% false positive rate.  In this example, we take a set of 1000 experiments in which the alternative hypothesis is true in only 10% of experiments and the statistical power is 0.5. The box in yellow shows the False Positive Report Probability (FPRP).  This is the probability that, in the set of experiments where we get a significant effect (p < .05), the null hypothesis is actually true.  In this example, we have a 47% FPRP.  In other words, nearly half of our “significant” effects are completely bogus.

Why I lost faith in p values-2.jpeg

The point of this example is not that any individual researcher actually has a 47% false positive rate.  The point is that NHST doesn’t actually guarantee that our false positive rate is 5% (even when we assume there is no p-hacking, etc.).  The actual false positive rate is unknown in real research, and it might be quite high for some types of studies.  As a result, it is difficult to see why we should ever care about p values or use NHST.

In this follow-up post, I’d like to address some comments/questions I’ve gotten over social media and from the grad students and postdocs in my lab.  I hope this clarifies some key aspects of the previous post.  Here I will focus on 4 issues:

  1. What happens with other combinations of statistical power and p(H1)? Can we solve this problem by increasing our statistical power?
  2. Why use examples with 1000 experiments?
  3. What happens when power and p(H1) vary across experiments?
  4. What should we do about this problem?

If you don’t have time to read the whole blog, here are four take-home messages:

  • Even when power is high, the false positive rate is still very high when H1 is unlikely to be true. We can't "power our way" out of this problem.
  • However, when power is high (e.g., .9) and the hypothesis being tested is reasonably plausible, the actual rate of false positives is around 5%, so NHST may be reasonable in this situation.
  • In most studies, we’re either not in this situation or we don’t know whether we’re in this situation, so NHST is still problematic in practice.
  • The more surprising an effect, the more important it is to replicate it.

1. What happens with other combinations of statistical power and p(H1)? Can we solve this problem by increasing our statistical power?

My grad students and postdocs wanted to see the false positive rate for a broader set of conditions, so I made a little Excel spreadsheet (which you can download here).  This spreadsheet can calculate the false positive rate (FPRP) for any combination of statistical power and p(H1).  This spreadsheet also produces the following graph, which shows 100 different combinations of these two factors.

Why I lost faith in p values-3.jpg

This figure shows the probability that you will falsely reject the null hypothesis (make a Type I error) given that you find a significant effect (p < .05) for a given combination of statistical power and likelihood that the alternative hypothesis is true.  For example, if you look at the point where power = .5 and p(H1) = .1, you will see that the probability is .47.  This is the example shown in the table above.  Several interesting questions can be answered by looking at the pattern of false positive rates in this figure.
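
If you prefer code to a spreadsheet, here is a sketch of the calculation that produces these values (it follows directly from the definitions above, assuming alpha = .05):

```python
def fprp(power, p_h1, alpha=0.05):
    # Expected proportion of significant results for which the null is true:
    # (1 - p(H1)) * alpha / [(1 - p(H1)) * alpha + p(H1) * power]
    false_pos = (1 - p_h1) * alpha
    true_pos = p_h1 * power
    return false_pos / (false_pos + true_pos)

for p_h1 in (0.1, 0.3, 0.5, 0.7, 0.9):
    row = [round(fprp(power, p_h1), 2) for power in (0.1, 0.3, 0.5, 0.7, 0.9, 1.0)]
    print(f"p(H1) = {p_h1}: {row}")
# e.g., fprp(0.5, 0.1) ≈ .47, fprp(1.0, 0.1) ≈ .31, and fprp(0.9, 0.5) ≈ .05
```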

Can we solve this problem by increasing our statistical power? Take a look at the cases at the far right of the figure, where power = 1.  Because power = 1, you have a 100% chance of finding a significant result if H1 is actually true.  But even with 100% power, you have a fairly high chance of a Type I error if p(H1) is low.  For example, if some of your experiments test really risky hypotheses, in which p(H1) is only 10%, you will have a false positive rate of over 30% in these experiments even if you have incredibly high power (e.g., because you have 1,000,000 participants in your study).  The Type I error rate declines as power increases, so more power is a good thing.  But we can’t “power our way out of this problem” when the probability of H1 is low.

Is the FPRP ever <= .05? The figure shows that we do have a false positive rate of <= .05 under some conditions.  Specifically, when the alternative hypothesis is very likely to be true (e.g., p(H1) >= .9), our false positive rate is <= .05 no matter whether we have low or high power.  When would p(H1) actually be this high?  This might happen when your study includes a factor that is already known to have an effect (usually combined with some other factor).  For example, imagine that you want to know if the Stroop effect is bigger in Group A than in Group B.  This could be examined in a 2 x 2 design, with factors of Stroop compatibility (compatible versus incompatible) and Group (A versus B).  p(H1) for the main effect of Stroop compatibility is nearly 1.0.  In other words, this effect has been so consistently observed that you can be nearly certain that it is present in your experiment (whether or not it is actually statistically significant).  [H1 for this effect could be false if you’ve made a programming error or created an unusual compatibility manipulation, so p(H1) might be only 0.98 instead of 1.0.]  Because p(H1) is so high, it is incredibly unlikely that H1 is false and that you nonetheless found a significant main effect of compatibility (which is what it means to have a false positive in this context).  Cases where p(H1) is very high are not usually interesting — you don’t do an experiment like this to see if there is a Stroop effect; you do it to see if this effect differs across groups.

A more interesting case is when H1 is moderately likely to be true (e.g., p(H1) = .5) and our power is high (e.g., .9).  In this case, our false positive rate is pretty close to .05.  This is good news for NHST: As long as we are testing hypotheses that are reasonably plausible, and our power is high, our false positive rate is only around 5%. 

This is the “sweet spot” for using NHST.  And this probably characterizes a lot of research in some areas of psychology and neuroscience.  Perhaps this is why the rate of replication for experiments in cognitive psychology is fairly reasonable (especially given that real effects may fail to replicate for a variety of reasons).  Of course, the problem is that we can only guess the power of a given experiment and we really don’t know the probability that the alternative hypothesis is true.  This makes it difficult for us to use NHST to control the probability that our statistically significant effects are bogus (null). In other words, although NHST works well for this particular situation, we never know whether we’re actually in this situation.

2. Why use examples with 1000 experiments?

The example shown in Table 2 may seem odd, because it shows what we would expect in a set of 1000 experiments.  Why talk about 1000 experiments?  Why not talk about what happens with a single experiment? Similarly, the Figure shows "probabilities" of false positives, but a hypothesis is either right or wrong. Why talk about probabilities?

The answer to these questions is that p values are useful only in telling you the long-run likelihood of making a Type I error in a large set of experiments.  P values do not represent the probability of a Type I error in a given experiment.  (This point has been made many times before, but it's worth repeating.)

NHST is a heuristic that aims to minimize the proportion of experiments in which we make a Type I error (falsely reject the null hypothesis).  So, the only way to talk about p values is to talk about what happens in a large set of experiments.  This can be the set of experiments that are submitted to a given journal, the set of experiments that use a particular method, the set of experiments that you run in your lifetime, the set of experiments you read about in a particular journal, the set of experiments on a given topic, etc.  For any of these classes of studies, NHST is designed to give us a heuristic for minimizing the proportion of false positives (Type I errors) across a large number of experiments.  My examples use 1000 experiments simply because this is a reasonably large, round number.

We’d like the probability of a Type I error in any given set of experiments to be ~5%, but this is not what NHST actually gives us.  NHST guarantees a 5% error rate only in the experiments in which the null hypothesis is actually true.  But this is not what we want to know.  We want to know how often we’ll have a false positive across a set of experiments in which the null is sometimes true and sometimes false.  And we mainly care about our error rate when we find a significant effect (because these are the effects that, in reality, we will be able to publish).  In other words, we want to know the probability that the null hypothesis is true in the set of experiments in which we get a significant effect [which we can represent as a conditional probability: p(null | significant effect); this is the FPRP].  Instead, NHST gives us the probability that we will get a significant effect when the null is true [p(significant effect | null)]. These seem like they’re very similar, but the example above shows that they can be wildly different.  In this example, the probability that we care about [p(null | significant effect)] is .47, whereas the probability that NHST gives us [p(significant effect | null)] is .05.

3. What happens when power and p(H1) vary across experiments?

For each of the individual points shown in the figure above, we have a fixed and known statistical power along with a fixed and known probability that the alternative hypothesis is true [p(H1)].  However, we don’t actually know these values in real research.  We might have a guess about statistical power (but only a guess because power calculations require knowing the true effect size, which we never know with any certainty).  We don’t usually have any basis (other than intuition) for knowing the probability that the alternative hypothesis is true in a given set of experiments.  So, why should we care about examples with a specific level of power and a specific p(H1)?

Here’s one reason: Without knowing these, we can’t know the actual probability of a false positive (the FPRP, p(null is true | significant effect)).  As a result, unless you know your power and p(H1), you don’t know what false positive rate to expect.  And if you don’t know what false positive rate to expect, what’s the point of using NHST?  So, if you find it strange that we are assuming a specific power and p(H1) in these examples, then you should find it strange that we regularly use NHST (because NHST doesn’t tell us the actual false positive rate unless we know these things).

The purpose of examples like the one shown above is that they can tell you what might happen for specific classes of experiments.  For example, when you see a paper in which the result seems counterintuitive (i.e., unlikely to be true given everything you know), this experiment falls into a class in which p(H1) is low and the probability of a false positive is therefore high.  And if you can see that the data are noisy, then the study probably has low power, and this also tends to increase the probability of a false positive.  So, even though you never know the actual power and p(H1), you can probably make reasonable guesses in some cases.

Most real research consists of a mixture of different power levels and p(H1) levels.  This makes it even harder to know the effective false positive rate, which is one more reason to be skeptical of NHST.

4. What should we do about this problem?

I ended the previous post with the advice that my graduate advisor, Steve Hillyard, liked to give: Replication is the best statistic.  Here’s something else he told me on multiple occasions: The more important a result is, the more important it is for you to replicate it before publishing it.  Given the false positive rates shown in the figure above, I would like to rephrase this as: The more surprising a result is, the more important it is to replicate the result before believing it.

In practice, a result can be surprising for at least two different reasons.  First, it can be surprising because the effect is unlikely to be true.  In other words, p(H1) is low.  A widely discussed example of this is the hypothesis that people have extrasensory perception.

However, a result can also seem surprising because it’s hard to believe that our methods are sensitive enough to detect it.  This is essentially saying that the power is low.  For example, consider the hypothesis that breast-fed babies grow up to have higher IQs than bottle-fed babies.  Personally, I think this hypothesis is likely to be true.  However, the effect is likely to be small, there are many other factors that affect IQ, and there are many potential confounds that would need to be ruled out. As a result, it seems unlikely that this effect could be detected in a well-controlled study with a realistic number of participants. 

For both of these classes of surprising results (i.e., low p(H1) and low power), the false positive rate is high.  So, when a statistically significant result seems surprising for either reason, you shouldn’t believe it until you see a replication (and preferably a preregistered replication).  Replications are easy in some areas of research, and you should expect to see replications reported within a given paper in these areas (but see this blog post by Uli Schimmack for reasons to be skeptical when the p value for every replication is barely below .05). Replications are much more difficult in other areas, but you should still be cautious about surprising or low-powered results in those areas.

Electrophysiological Evidence for Spatial Hyperfocusing in Schizophrenia

Kreither, J., Lopez-Calderon, J., Leonard, C. J., Robinson, B. M., Ruffle, A., Hahn, B., Gold, J. M., & Luck, S. J. (2017). Electrophysiological Evidence for Spatial Hyperfocusing in Schizophrenia. The Journal of Neuroscience, 37, 3813-3823.

Double Oddball P3 Bar Graph.jpg

This paper from last spring describes new evidence for our hyperfocusing theory of cognitive dysfunction in schizophrenia.  Remarkably, we found that people with schizophrenia were actually better able to focus centrally and filter peripheral distractors than were control subjects. Under the right conditions, we even observed a (slightly) larger P3 wave in patients than in controls.
 

New Paper: Visual short-term memory guides infants’ visual attention

Mitsven, S. G., Cantrell, L. M., Luck, S. J., & Oakes, L. M. (in press). Visual short-term memory guides infants’ visual attention. Cognition. https://doi.org/10.1016/j.cognition.2018.04.016 (Freely available until June 14, 2018 at https://authors.elsevier.com/a/1Wxvg2Hx2bbMQ)

Mitsven.jpg

This new paper shows that visual short-term memory guides attention in infants. Whereas adults orient toward items matching the contents of VSTM, infants orient toward non-matching items.

Why I've lost faith in p values

[Note: There is a followup to this post.]

There has been a lot written over the past decade (and even longer) about problems associated with null hypothesis statistical testing (NHST) and p values.  Personally, I have found most of these arguments unconvincing. However, one of the problems with p values has been gnawing at me for the past couple years, and it has finally gotten to the point that I'm thinking about abandoning p values.  Note: this has nothing to do with p-hacking (which is a huge but separate issue).

Here's the problem in a nutshell: If you run 1000 experiments over the course of your career, and you get a significant effect (p < .05) in 95 of those experiments, you might expect that 5% of these 95 significant effects would be false positives.  However, as an example shown later in this blog will show, the actual false positive rate may be 47%, even if you're not doing anything wrong (p-hacking, etc.).  In other words, nearly half of your significant effects may be false positives, leading you to draw completely bogus conclusions that you are able to publish.  On the other hand, your false positive rate might instead be 3%.  Or 20%.  And my false positive rate might be very different from your false positive rate, even though we are both using p < .05 as our criterion for significance (even if neither of us is engaged in p-hacking, etc.). In other words, p values do not actually tell you anything meaningful about the false positive rate.  

But isn't this exactly what p values are supposed to tell us?  Don't they tell us the false positive rate?  Not if you define "false positive rate" in a way that is actually useful. Here's why:

The false positive rate (Type I error rate) as defined by NHST is the probability that you will falsely reject the null hypothesis when the null hypothesis is true.  In other words, if you reject the null hypothesis when p < .05, this guarantees that you will get a significant (but bogus) effect in only 5% of experiments in which the null hypothesis is true.  However, this is a statement about what happens when the null hypothesis is actually true. In real research, we don't know whether the null hypothesis is actually true.  If we knew that, we wouldn't need any statistics!  In real research, we have a p value, and we want to know whether we should accept or reject the null hypothesis.  The probability of a false positive in that situation is not the same as the probability of a false positive when the null hypothesis is true.  It can be way higher.

For example, imagine that I am a journal editor, and I accept papers when the studies are well designed, well executed, and statistically significant (p < .05 without any p-hacking).  I would like to believe that no more than 5% of these effects are actually Type I errors (false positives).  In other words: I want to know the probability that the null is true given that an observed effect is significant. We can call this probability "p(null | significant effect)".  However, what NHST actually tells me is the probability that I will get a significant effect if the null is true. We can call this probability "p(significant effect | null)". These two probabilities seem pretty similar, because they have exactly the same terms (but in opposite orders). Despite the superficial similarity, in practice they can be vastly different.

The rest of this blog provides concrete examples of how these two probabilities can be very different and how the probability of a false positive can be much higher than 5%.  These examples involve a little bit of math (just multiplication and division — no algebra and certainly no calculus). But you can't avoid a little bit of math if you want to understand what p values can and cannot tell you.  If you've never gone through one of these examples before, it's well worth the small amount of effort needed.  It will change your understanding of p values.

Why I lost faith in p values-1.jpeg

The first example simulates a simple situation in which—because it is a simulation—I can make assumptions that I couldn't make in actual research.  These assumptions let us see exactly what would happen under a set of simple, known conditions.  The simulation, which is summarized in Table 1, shows what I would expect to find if I ran 1000 experiments in which two things are assumed to be true: 1) the null and alternative hypotheses are equally likely to be true (i.e., the probability that there really is an effect is .5); 2) when an effect is present, there is a 50% chance that it will be statistically significant (i.e., my power to detect an effect is .5).  These two assumptions are somewhat arbitrary, but they are a reasonable approximation of a lot of studies.  

Table 1 shows what I would expect to find in this situation.  The null will be true in 500 of my 1000 experiments (as a result of assumption 1).  In those 500 experiments, I would expect a significant effect 5% of the time, because my Type I error rate is 5% (assuming an alpha of .05).  This Type I error rate is what I previously called p(significant effect | null), because it's the probability that I will get a significant effect when the null hypothesis is actually true.  In the other 500 experiments, the alternative hypothesis is true.  Because my power to detect an effect is .5 (as a result of assumption 2), I get a significant effect in half of these 500 experiments.  Unless you are running a lot of subjects in your experiments, this is a pretty typical level of statistical power.

However, the Type I error rate of 5% does not help me determine the likelihood that I am falsely rejecting the null hypothesis when I get a significant effect, p(null | significant effect). This probability is shown in the yellow box.  In other words, in real research, I don't actually know when the null is actually true or false; all I know is whether the p value is < .05.  This example shows that—if the null is true in half of my experiments and my power is .5—I would expect to get 275 significant effects (i.e., 275 experiments in which p < .05), and I would expect that the null is actually true in 25 of these 275 experiments.  In other words, the probability that one of my significant effects is actually bogus (a false positive) is 9%, not 5%.
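
Here is the same bookkeeping in a few lines of code (a sketch of the expected counts, not a simulation), which makes it easy to plug in other assumptions and see how the answer changes:

```python
n_experiments, p_h1, power, alpha = 1000, 0.5, 0.5, 0.05  # assumptions 1 and 2

null_true = n_experiments * (1 - p_h1)  # 500 experiments in which the null is true
h1_true = n_experiments * p_h1          # 500 experiments in which the alternative is true

false_positives = null_true * alpha     # 25 significant-but-bogus effects
true_positives = h1_true * power        # 250 genuine significant effects

significant = false_positives + true_positives  # 275 significant effects in total
print(false_positives / significant)            # ~0.09, not 0.05
# Plugging in p_h1 = 0.1 gives ~0.47 (Table 2), and power = 0.1 gives ~0.33.
```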

Why I lost faith in p values-2.jpeg

This might not seem so bad.  I'm still drawing the right conclusion over 90% of the time when I get a significant effect (assuming that I've done everything appropriately in running and analyzing my experiments).  However, there are many cases where I am testing bold, risky hypotheses—that is, hypotheses that are unlikely to be true.  As Table 2 shows, if there is a true effect in only 10% of the experiments I run, almost half of my significant effects will be bogus (i.e., p(null | significant effect) = .47).

The probability of a bogus effect is also high if I run an experiment with low power.  For example, if the null and alternative are equally likely to be true (as in Table 1), but my power to detect an effect (when an effect is present) is only .1, fully 1/3 of my significant effects would be expected to be bogus (i.e., p(null | significant effect) = .33).  

Of course, the research from most labs (and the papers submitted to most journals) consists of a mixture of high-risk and low-risk studies and a mixture of different levels of statistical power.  But without knowing the probability of the null and the statistical power, I can't know what proportion of the significant results are likely to be bogus.  This is why, as I stated earlier, p values do not actually tell you anything meaningful about the false positive rate.  In a real experiment, you do not know when the null is true and when it is false, and a p value only tells you about what will happen when the null is true.  It does not tell you the probability that a significant effect is bogus. This is why I've lost my faith in p values.  They just don't tell me anything.

Yesterday, one of my postdocs showed me a small but statistically significant effect that seemed unlikely to be true.  That is, if he had asked me how likely this effect was before I saw the result, I would have said something like 20%.  And the power to detect this effect, if real, was pretty small, maybe .25.  So I told him that I didn't believe the result, even though it was significant, because p(null | significant effect) is high when an effect is unlikely and when power is low.  He agreed.

Tables 1 and 2 make me wonder why anyone ever thought that we should use p values as a heuristic to avoid publishing a lot of effects that are actually bogus.  The whole point of NHST is supposedly to maintain a low probability of false positives.  However, this would require knowing p(null | significant effect), which is something we can never know in real research. We can see what would be expected by conducting simulations (like those in Tables 1 and 2).  However, we do not know the probability that the null hypothesis is true (assumption 1) and we do not know the statistical power (assumption 2), and we would need to know these to be able to calculate p(null | significant effect). So why did statisticians tell us that we should use this approach?  And why did we believe them? [Moreover, why did they not insist that we do a correction for multiple comparisons when we do a factorial ANOVA that produces multiple p values? See this post on the Virtual ERP Boot Camp blog and this related paper from the Wagenmakers lab.]

Here's an even more pressing, practical question: What should we do given that p values can't tell us what we actually need to know?  I've spent the last year exploring Bayes factors as an alternative.  I've had a really interesting interchange with advocates of Bayesian approaches about this on Facebook (see the series of posts beginning on April 7, 2018). This interchange has convinced me that Bayes factors are potentially useful.  However, they don't really solve the problem of wanting to know the probability that an effect is actually null.  This isn't what Bayes factors are for: this would be using a Bayesian statistic to ask a frequentist question.

Another solution is to make sure that statistical power is high by testing larger sample sizes. I'm definitely in favor of greater power, and the typical N in my lab is about twice as high now as it was 15 years ago. But this doesn't solve the problem, because the false positive rate is still high when you are testing bold, novel hypotheses.  The fundamental problem is that p values don't mean what we "need" them to mean, that is p(null | significant effect).

Many researchers are now arguing that we should, more generally, move away from using statistics to make all-or-none decisions and instead use them for "estimation".  In other words, instead of asking whether an effect is null or not, we should ask how big the effect is likely to be given the data.  However, at the end of the day, editors need to make an all-or-none decision about whether to publish a paper, and if we do not have an agreed-upon standard of evidence, it would be very easy for people's theoretical biases to impact decisions about whether a paper should be published (even more than they already do). But I'm starting to warm up to the idea that we should focus more on estimation than on all-or-none decisions about the null hypothesis.

I've come to the conclusion that the best solution, at least in my areas of research, is what I was told many times by my graduate advisor, Steve Hillyard: "Replication is the best statistic."  Some have argued that replication can also be problematic.  However, most of these potential problems are relatively minor in my areas of research.  And the major research findings in these areas have held up pretty well over time, even in registered replications.

I would like to end by noting that lots of people have discussed this issue before, and there are some great papers talking about this problem.  The most famous is Ioannidis (2005, PLoS Medicine).  A neuroscience-specific example is Button et al. (2013, Nature Reviews Neuroscience) (but see Nord et al., 2017, Journal of Neuroscience for an important re-analysis). However, I often find that these papers are bombastic and/or hard to understand.  I hope that this post helps more people understand why p values are so problematic.

For more, see this follow-up post.

Classic Article: "Features and Objects in Visual Processing" by Anne Treisman

Treisman, A. (1986). Features and objects in visual processing. Scientific American, 255, 114-125.

Treisman_Circles_and_Lollies.png

I read this article—a review of the then-new feature integration theory—early in my first year of grad school.  It totally changed my life.  My first real experiment in grad school was an ERP version of the "circles and lollies" experiment shown in the attached image:

Luck, S. J., & Hillyard, S. A. (1990). Electrophysiological evidence for parallel and serial processing during visual search. Perception & Psychophysics, 48, 603-617.

In that experiment, I discovered the N2pc component (because I followed some smart advice from Steve Hillyard about including event codes that indicated whether the target was in the left or right visual field).  I've ended up publishing dozens of N2pc papers over the years (along with at least 100 N2pc papers by other labs).

The theory presented in this Scientific American paper was also one of the inspirations for my first study of visual working memory:

Luck, S. J., & Vogel, E. K. (1997). The capacity of visual working memory for features and conjunctions. Nature, 390, 279-281.

As you may know, Anne passed away recently (see NY Times obituary).  Anne was my most important scientific role model (other than my official mentors).  I'm sure she had no idea how much impact she had on me. She probably thought that I was an idiot, because I became a blathering fool anytime I was in her presence (even after I had moved on from grad student to new assistant professor and then to senior faculty).  But her intelligence and creativity just turned me to jello...

Anyway, this is a great paper, and very easy to read.  I recommend it to anyone who is interested in visual cognition.

 

 

Review article: How do we avoid being distracted by salient but irrelevant objects in the environment?

Gaspelin, N., & Luck, S. J. (2018). The Role of Inhibition in Avoiding Distraction by Salient Stimuli. Trends in Cognitive Sciences, 22, 79-92.

TICS Suppression.jpg

In this recent TICS paper, Nick Gaspelin and I review the growing evidence that the human brain can actively suppress objects that might otherwise capture our attention.

Postdoc position available in the Luck Lab

A postdoctoral position is available in the laboratory of Steve Luck at the UC-Davis Center for Mind & Brain (http://lucklab.ucdavis.edu).  Both U.S. and international applicants are welcome. Multiple years of funding are possible. Our lab places great emphasis on postdoctoral training, and we have an excellent track record of placing our postdocs in faculty positions.

The research will focus on a broad range of topics in visual cognition, using a combination of traditional behavioral methods, ERPs, eye tracking, and possibly fMRI. We are seeking an individual with an excellent background in the theories and methods of high-level vision science and cognitive psychology.  Experience with eye tracking and ERPs is not required; this will be an ideal position for someone who is interested in learning these methods or someone who already has experience but wants to become a world-class expert.  However, good quantitative and programming skills are essential.

Salary will depend on experience, with a minimum set by the University of California postdoc salary scale (which is higher than NIH scale). The position will remain open until a suitable candidate is identified. We are aiming for a start date between June 1, 2018 and September 30, 2018.

Davis is a vibrant college town in Northern California, located approximately 20 minutes from Sacramento, 75 minutes from San Francisco, 45 minutes from Napa, and 2 hours from Lake Tahoe.  The Center for Mind & Brain is an interdisciplinary research center devoted to cognitive science and cognitive neuroscience, located in a beautiful new building with state-of-the-art laboratories (see http://mindbrain.ucdavis.edu/).

To apply, send a cover letter describing your background and interests, a CV, and at least two letters of recommendation to Aaron Simmons (lucklab.manager@gmail.com). 


Decoding the contents of working memory from scalp EEG/ERP signals

Bae, G. Y., & Luck, S. J. (2018). Dissociable Decoding of Working Memory and Spatial Attention from EEG Oscillations and Sustained Potentials. The Journal of Neuroscience, 38, 409-422.

In this recent paper, we show that it is possible to decode the exact orientation of a stimulus as it is being held in working memory from sustained (CDA-like) ERPs.  A key finding is that we could decode both the orientation and the location of the attended stimulus with these sustained ERPs, whereas alpha-band EEG signals contained information only about the location.  

Our decoding accuracy is only about 50% above the chance level, but it's still pretty amazing that such precise information can be decoded from brain activity that we're recording from electrodes on the scalp!
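
For readers who are new to decoding, here is a generic sketch of the cross-validated classification idea using simulated data (this is not the pipeline from the paper — the number of classes, channels, trials, and the signal strength are all made up — it just illustrates how a classifier can recover stimulus identity from weak multichannel signals):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_classes, trials_per_class, n_channels = 16, 40, 27  # made-up dimensions

labels = np.repeat(np.arange(n_classes), trials_per_class)
patterns = rng.normal(size=(n_classes, n_channels))   # one "scalp pattern" per class
X = 0.2 * patterns[labels] + rng.normal(size=(labels.size, n_channels))  # weak signal + noise

scores = cross_val_score(SVC(kernel="linear"), X, labels, cv=5)
print(f"decoding accuracy: {scores.mean():.2f}, chance: {1 / n_classes:.2f}")
```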

Stay tuned for more cool EEG/ERP decoding results — we will be submitting a couple more studies in the near future.