New Paper: Using ERPs and RSA to examine saliency maps and meaning maps for natural scenes
Kiat, J. E., Hayes, T. R., Henderson, J. M., & Luck, S. J. (in press). Rapid extraction of the spatial distribution of physical saliency and semantic informativeness from natural scenes in the human brain. The Journal of Neuroscience. https://doi.org/10.1523/JNEUROSCI.0602-21.2021 [preprint]
The influence of physical salience on visual attention in real-world scenes has been studied extensively over the past few decades. Intriguingly, however, recent research has shown that semantically informative scene features often trump physical salience in predicting even the fastest eye movements during natural scene viewing. These results suggest that the brain very rapidly extracts visual information that is, at the very least, predictive of the spatial distribution of potentially meaningful scene regions.
In this new paper, Steve Luck, Taylor Hayes, John Henderson, and I sought to assess the evidence for a neural representation of the spatial distribution of meaningful features and (assuming we found such a link!) to compare the onset of its emergence with the onset of the corresponding representation of physical saliency. To do so, we recorded 64-channel EEG data from subjects viewing a series of real-world scene photographs while they performed a modified 1-back task: on 10% of trials, subjects were probed to identify which of four scene quadrants was part of the most recently presented image (see Figure 1).
With this dataset in hand, we next obtained spatial maps of meaning and saliency for each of the scenes. To measure the spatial distribution of meaningful features, we leveraged the “meaning maps” that had previously been obtained by the Henderson group. These maps are derived from crowd-sourced human judgments of the meaningfulness of each patch of a given scene. The scene is first decomposed into a series of partially overlapping, tiled circular patches, and subjects rate each patch for informativeness (see Figure 2 and Henderson & Hayes, 2017). These ratings are then averaged and smoothed to produce a “meaning map,” which reflects the extent to which each location in a scene contains meaningful information. Note that these maps do not indicate the specific meanings; they simply indicate the extent to which any kind of meaningful information is present at each location.
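To make the patch-averaging step concrete, here is a minimal sketch of how ratings could be aggregated into a map. All function and parameter names are ours for illustration; for the actual procedure, see Henderson & Hayes (2017):

```python
# Minimal sketch of meaning-map construction (illustrative names/values only).
import numpy as np
from scipy.ndimage import gaussian_filter

def build_meaning_map(ratings, centers, radius, shape, sigma=20):
    """Average overlapping patch ratings into a smoothed spatial map.

    ratings : 1D array of mean informativeness ratings, one per patch
    centers : (n_patches, 2) array of patch-center (row, col) coordinates
    radius  : patch radius in pixels
    shape   : (height, width) of the scene image
    """
    rating_sum = np.zeros(shape)
    patch_count = np.zeros(shape)
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    for rating, (r, c) in zip(ratings, centers):
        mask = (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
        rating_sum[mask] += rating
        patch_count[mask] += 1
    # average the overlapping patch ratings at each pixel
    mean_map = np.divide(rating_sum, patch_count,
                         out=np.zeros(shape), where=patch_count > 0)
    return gaussian_filter(mean_map, sigma)  # smooth into a continuous map
```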
The spatial distribution of physical saliency was estimated algorithmically using the Graph-Based Visual Saliency approach (Harel et al., 2006). This algorithm extracts low-level color, orientation, and contrast feature vectors from an image using biologically inspired filters. These features are then used to compute activation maps for each feature type. Finally, these maps are normalized, additively combined, and smoothed to produce an overall “saliency map”. A few examples of meaning and saliency maps for specific scenes are shown in Figure 3. We chose this algorithm in particular because it combines biological plausibility with strong performance at matching human eye movement data.
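As a toy illustration of that final normalize-combine-smooth stage (this is not the GBVS implementation, which derives its activation maps from a graph-based Markov process; it only illustrates the combination step described above):

```python
# Simplified illustration of combining per-feature activation maps into a
# single saliency map: normalize each map, sum them, and smooth the result.
import numpy as np
from scipy.ndimage import gaussian_filter

def combine_feature_maps(activation_maps, sigma=10):
    """activation_maps: list of 2D arrays (e.g., color, orientation, contrast)."""
    normalized = []
    for m in activation_maps:
        rng = m.max() - m.min()
        # rescale each feature map to [0, 1] so no feature dominates by scale
        normalized.append((m - m.min()) / rng if rng > 0 else np.zeros_like(m))
    saliency = np.sum(normalized, axis=0)    # additive combination
    return gaussian_filter(saliency, sigma)  # smooth into the final map
```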
We then used the meaning maps and saliency maps to predict our ERP signals using representational similarity analysis (RSA). For an overview of RSA in the context of ERPs, check out this video and this blog post.
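To give a flavor of the basic logic, here is a heavily simplified RSA sketch: at each timepoint, build a neural representational dissimilarity matrix (RDM) from the scene-wise ERP scalp patterns and correlate it with a model RDM derived from the maps. The data shapes and distance measures here are illustrative assumptions, not the paper's exact pipeline:

```python
# Minimal ERP-RSA sketch (assumed data shapes, not the published pipeline).
# erp: (n_scenes, n_channels, n_times) scene-averaged ERPs
# model_rdm: (n_scenes, n_scenes) dissimilarity matrix computed from the maps
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

def rsa_timecourse(erp, model_rdm):
    n_scenes, _, n_times = erp.shape
    tri = np.triu_indices(n_scenes, k=1)  # unique scene pairs
    model_vec = model_rdm[tri]
    rho = np.empty(n_times)
    for t in range(n_times):
        # neural RDM: correlation distance between scalp topographies at time t
        neural_rdm = squareform(pdist(erp[:, :, t], metric="correlation"))
        rho[t] = spearmanr(neural_rdm[tri], model_vec)[0]
    return rho  # representational-similarity timecourse across the epoch
```

A model RDM for the meaning or saliency maps can be built the same way, e.g., by applying pdist to the vectorized maps.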
The results are summarized in Figure 4. Not surprisingly, we found that a link between physical saliency and the ERPs emerged rapidly (ca. 78 ms after stimulus onset). The main question was how long it would take for a link to the meaning maps to be present. Would the spatial distribution of semantic informativeness take hundreds of milliseconds to develop, or would the brain rapidly determine which locations likely contained meaningful information? We found that the link between the meaning maps and the ERPs also emerged extremely rapidly (ca. 87 ms after stimulus onset), less than 10 ms after the link to the saliency maps. You can see the timecourse of changes in the strength of the representational link for saliency and meaning in Panel A (colored horizontal lines mark FDR-corrected p < .05 timepoints) and the jackknifed mean onset latencies for the representational link of saliency and meaning in Panel B (error bars denote standard errors).
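For readers unfamiliar with the jackknife approach used in Panel B, the idea is to estimate the onset latency from each leave-one-subject-out grand average (which is far more stable than estimating onsets from noisy single-subject waveforms) and then compute a standard error with an inflation factor that corrects for the reduced variability of the jackknifed estimates (Miller et al., 1998). A hedged sketch, with onset_latency standing in as a hypothetical function implementing whatever onset criterion one prefers:

```python
# Sketch of jackknifed onset-latency estimation; `onset_latency` is a
# hypothetical user-supplied function returning the first timepoint at which
# a waveform crosses some onset criterion.
import numpy as np

def jackknife_onset(waveforms, onset_latency):
    """waveforms: (n_subjects, n_times) per-subject correlation timecourses."""
    n = waveforms.shape[0]
    # onset estimated from each leave-one-subject-out grand average
    latencies = np.array([
        onset_latency(np.delete(waveforms, i, axis=0).mean(axis=0))
        for i in range(n)
    ])
    mean_latency = latencies.mean()
    # jackknife standard error (note the (n - 1) / n inflation factor)
    se = np.sqrt((n - 1) / n * np.sum((latencies - mean_latency) ** 2))
    return mean_latency, se
```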
Note that the waveforms show semipartial correlations (i.e., the unique contribution of one type of map after the variance it shares with the other type is factored out). These findings therefore show that meaning maps have a neurophysiological basis that is distinct from that of saliency.
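If you would like to compute something similar yourself, a semipartial correlation can be obtained by regressing one model RDM out of the other and correlating the residual with the neural RDM. A minimal sketch with illustrative names (each vector is the lower triangle of an RDM; the paper's exact regression setup may differ):

```python
# Semipartial correlation sketch: the unique relation between the neural RDM
# and the meaning RDM after saliency is regressed out of the meaning RDM.
import numpy as np
from scipy.stats import spearmanr

def semipartial(neural_vec, meaning_vec, saliency_vec):
    # residualize the meaning RDM with respect to the saliency RDM
    X = np.column_stack([np.ones_like(saliency_vec), saliency_vec])
    beta, *_ = np.linalg.lstsq(X, meaning_vec, rcond=None)
    meaning_resid = meaning_vec - X @ beta
    # correlate the neural RDM with what remains of meaning
    return spearmanr(neural_vec, meaning_resid)[0]
```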
The rapid time course of the meaning map waveform also indicates that information about which locations contain potentially meaningful content is computed rapidly, early enough to influence even the earliest eye movements. Because this is a correlation-based approach, these results do not indicate that meaning per se is calculated by 87 ms. They do indicate, however, that information predicting the locations of meaningful scene elements is computed by 87 ms. Presumably, this information would be useful for directing shifts of covert and/or overt attention that would in turn allow the actual meanings to be computed.
The data and code are available at https://osf.io/zg7ue/. Please feel free to use this code and dataset (high-density ERP averages for 50 real-world scenes from 32 subjects) to explore research questions that interest you!