### Participants

Fifty 8-month-old infants (M = 8 months 0 days, SD = 12 days, 25 females) were recruited from a database of volunteer families. Infants had to complete at least 20 trials to be included in the analysis. Six infants failed to reach this threshold and were excluded. One additional infant was excluded due to a MATLAB crash. The final sample consisted of 43 infants (M = 7 months 29 days, SD = 12 days, 24 females). Families received a book or 10 Euros for their participation. The local ethics review board approved the study (Ethical approval number: ECSW2017-3001-470), and the institutional review board guidelines were followed.

### Experimental design

The stimuli consisted of eight shapes (star, heart, trefoil, triangle, crescent, rhombus, octagon, and cross). The shapes were presented as cues and as targets within a frame (see Fig. 1). Vertical and horizontal lines divided the screen into four quadrants (the target locations), and a central circular area defined the cue location. Cues loomed in the center of the screen. Targets rotated in one of the four locations around the center of the screen (see movie S1).

Infants were presented with 16 sequences of cue-target couplings. Each sequence was composed of 15 trials. Every trial consisted of a cue phase (1000 ms), an interstimulus interval (750 ms), a target phase (1500 ms), and an intertrial interval (750 ms). In the cue phase, a simple shape (e.g., a star) appeared in the middle of the screen. In the target phase, the same shape appeared in one of four quadrants.

Every sequence consisted of only one type of shape (e.g., only stars). The shape location during the target phase was systematically manipulated. In four of the 16 sequences, the target appeared in the same location in 100% of the trials; in six sequences, the target appeared in one quadrant 80% of the time, and the remaining 20% of its appearances were distributed over the other three locations; in the remaining six sequences, the target appeared in one quadrant 60% of the time, and the remaining 40% of its appearances were distributed over the other three locations. Hence, the cue was always predictive of the target location, but its degree of predictability varied.

We created the sequences in MATLAB. First, 16 sequences were sampled pseudo-randomly, with the probabilities specified above as the only constraint. Then, the sequences were concatenated. To check that the target location could be predicted only by relying on cue-target conditional probabilities, we fed the result of the sampling into a random forest classifier. If the classifier was able to reliably predict the target location with no information about the cue-target conditional probabilities (e.g., it successfully predicted the target location at trial N based only on the target location at trial N-1), then the entire process was repeated and new sequences were sampled. The sequences obtained through this procedure were presented to all participants. The only element of the sequences that was pseudo-randomized across participants was the exact location of the target. For example, participants 1 and 2 would see the same deterministic sequence, but for participant 1, the target always appeared in the upper left corner, while for participant 2, the target always appeared in the bottom right corner. In this way, every participant was exposed to the same statistical regularities, but we were able to control for other biases (e.g., toward the left side of the screen) that might have influenced participants’ performance. Averaging across all trials of all sequences, each of the four target locations had the same probability of showing the target (25%).
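As an illustration of the sampling constraint only, the sketch below builds 15-trial sequences at the three predictability levels. It is written in Python rather than MATLAB, and several details are assumptions not stated in the text: the helper names `make_sequence` and `make_design` are hypothetical, the 80% and 60% levels are realized as exact counts (12 and 9 of 15 trials), and the remaining trials are split evenly over the other three locations. The random forest check and the balancing of locations across the whole task are not reproduced here.

```python
import random

def make_sequence(dominant, p_dominant, n_trials=15, n_locations=4, rng=None):
    """Return a shuffled list of target locations for one sequence (hypothetical helper)."""
    rng = rng or random.Random()
    n_dominant = round(p_dominant * n_trials)  # 15, 12, or 9 trials
    others = [k for k in range(n_locations) if k != dominant]
    # Assumption: non-dominant trials are split evenly over the other locations.
    rest = [others[i % len(others)] for i in range(n_trials - n_dominant)]
    sequence = [dominant] * n_dominant + rest
    rng.shuffle(sequence)
    return sequence

def make_design(rng=None):
    """Four 100% sequences, six 80% sequences, and six 60% sequences, in random order."""
    rng = rng or random.Random()
    levels = [1.0] * 4 + [0.8] * 6 + [0.6] * 6
    rng.shuffle(levels)
    # Assumption: the dominant location of each sequence is drawn at random.
    return [make_sequence(rng.randrange(4), p, rng=rng) for p in levels]

design = make_design(random.Random(0))
print(len(design), "sequences of", len(design[0]), "trials")
```

Locations are coded 0 to 3 here; remapping these codes to screen quadrants per participant would correspond to the across-participant randomization described above.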

Throughout the presentation of the stimuli, background music was played to increase overall attention toward the screen.

### Procedure

The study was conducted in a quiet room without daylight. Infants were seated in a baby seat placed on their caregiver’s lap, 60 to 65 cm from a 23″ monitor. Their looking behavior was recorded using a Tobii X300 eye-tracker (www.tobii.com). Infants’ behavior was monitored through an external video camera. Stimulus presentation and data collection were carried out using MATLAB Psychtoolbox. For every infant, the eye-tracker was calibrated with a five-point calibration sequence. If more than two points were not accurately calibrated, the calibration was repeated up to three times.

The sequences were played one after the other. When the infant looked away from the screen for 1 s or more, the sequence was stopped. When the infant looked back to the screen, the following sequence was played. The experiment lasted until the infant had watched all 16 sequences or became fussy. Parents were instructed not to interact with their child, unless infants sought their attention and, even in that case, not to try to bring infants’ attention back to the screen.

### Data processing

Raw eye-tracking data were first processed through identification by 2-means clustering (I2MC) (*36*). Settings for the I2MC algorithm were the following: interpolation window of 100 ms; interpolation edge of 6.7 ms; clustering window size of 200 ms; downsampling to 150, 60, and 30 Hz; window step size of 20 ms; clustering-weight cutoff of 2 SDs above the mean; merge fixation distance of 0.7°; merge fixation time of 40 ms; and minimum fixation duration of 40 ms. It has been shown that, when sampling at 300 Hz, these settings make I2MC very robust to high-noise infant data (*36*). The output of I2MC is a list of fixation points, each consisting of *x*–*y* coordinates (expressed in pixels) and a timestamp (expressed in milliseconds).

Areas of interest of 400 × 400 pixels were then delineated around the four target locations and the central cue location. Saccadic latencies, looking times to the targets, and look-away trials were extracted using MATLAB. These variables were standardized for every individual participant by computing *z*-scores using each participant’s mean and SD in lieu of the group-level mean and SD.
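The within-participant standardization can be sketched as follows. This is a minimal Python illustration, not the MATLAB code used in the study; the function name and the toy latencies are hypothetical, and the use of the sample SD (n − 1 denominator) is an assumption, since the text does not specify it.

```python
from statistics import mean, stdev

def zscore_within_participant(values_by_participant):
    """Standardize each participant's values with that participant's own mean and SD."""
    standardized = {}
    for participant, values in values_by_participant.items():
        m = mean(values)
        s = stdev(values)  # assumption: sample SD (n - 1 denominator)
        standardized[participant] = [(v - m) / s for v in values]
    return standardized

# Toy saccadic latencies in ms (made-up numbers, for illustration only)
latencies = {"p01": [200.0, 300.0, 400.0], "p02": [1000.0, 1200.0, 1400.0]}
print(zscore_within_participant(latencies))
```

After this step, a slow and a fast participant contribute values on the same scale, so group-level models are not dominated by between-infant differences in overall speed.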

### Statistical analysis

*Look-aways*. We examined what factors influenced infants’ probability of looking away from a certain sequence. To do so, we used additive Cox models with time-varying covariates. This type of model allowed us to explore any kind of relationship between independent and dependent variables and not just linear relationships. It also allowed us to analyze truncated data such as look-aways, which violated the assumptions of the more common generalized linear model (GLM). We fitted the models using the R-package “mgcv.”

First, we performed a model comparison procedure to select the model with the highest goodness of fit. The aim of model comparison is to identify which of the available statistical models best explains the pattern of the behavioral data. To score the goodness of fit of each model, we used Akaike’s information criterion (AIC). However, AIC ignores uncertainty related to smoothing parameters, which makes larger models more likely to fit better. We solved this problem as suggested by Wood *et al.* (*37*). Specifically, we computed the conditional AIC in the conventional way (*38*) with an additive correction that accounts for the uncertainty of the smoothing parameters. This correction makes complex models less likely to win over simpler ones.

A common way of comparing two models is to check the difference between their AIC. Here, ΔAIC is computed as the difference between the AIC of a given model and the AIC of the best model. Hence, the higher the ΔAIC is, the worse the model is.
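The ΔAIC computation itself is simple arithmetic, sketched below in Python. The model names and AIC values are made up for illustration and are not taken from table S1.

```python
def delta_aic(aic_by_model):
    """Return each model's AIC minus the AIC of the best (lowest-AIC) model."""
    best = min(aic_by_model.values())
    return {name: aic - best for name, aic in aic_by_model.items()}

# Hypothetical AIC values, for illustration only
example = {"full": 1500.0, "no_surprise": 1512.5, "null": 1540.0}
print(delta_aic(example))  # the best model gets a ΔAIC of 0.0
```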

The result of model selection is reported in table S1. The winning model had surprise, predictability, learning progress, and time as covariates and subjects as a random factor. Time was expressed in two ways: first, as the trial number within a given sequence (sequence-wise time); second, as the overall number of trials seen during the task (task-wise time). The parameters of the winning model are reported in table S2. All the independent variables had a significant effect on the probability of looking away.

Additive models provide the potential for better fits to data than purely parametric models but arguably with some loss of interpretability: The effect of additive parameters cannot be quantified as clearly as the effect of β parameters. Hence, we fitted another model where we specified the relation between independent and dependent variables, instead of leaving it unspecified. This allowed us to obtain beta coefficients and effect sizes. Given the results of the additive model, we specified a linear effect of surprise and learning progress and a quadratic effect of predictability. The results confirm the effects found with the additive model and are reported in table S3. As in Kidd *et al.* (*5*), and to allow direct comparison across studies, we used *e*^{|β|} as a measure of the effect size. Learning progress shows the strongest effect size (*e*^{|β|} = 7.02), followed by surprise (*e*^{|β|} = 2.44) and predictability (*e*^{|β|} = 1.27).

*Saccadic latency and looking time*. Since the distributions of both saccadic latency and looking time to the target were not normal [as is common for reaction time data, see (*39*)], we used GLMs rather than a linear model. GLMs allow the specification of the distribution of the data, leading to a better model fit and respecting the assumptions of linear regression. Specifically, we used a Cullen and Frey graph to identify the distribution type that most closely resembled that of our data. We did so by bootstrapping 500 values from the distribution of each dependent variable. This method showed that saccadic latencies and looking time followed a logistic distribution rather than a normal distribution. The logistic distribution is similar to the normal distribution but has heavier tails. The models were fitted in R using the GAMLSS package.

First, we estimated the effects of the information-theoretic measures on saccadic latencies. Time was added as a covariate, as saccadic latencies might decrease as a function of time just because of a practice effect or familiarity with the task. Participants were added as a random factor to control for interindividual differences. As reported in table S4, the results show a significant effect of surprise and predictability. The selected model fitted better than a null model with no regressors (ΔAIC = 119). It also fitted better than a more common linear model, which assumes normally distributed data (ΔAIC = 147). Last, to make sure that the correlation between information-theoretic values would not hinder the estimation of beta coefficients, we computed the variance inflation factor (VIF) for every independent variable. When the VIF is above 5, there might be a problem with multicollinearity. However, the VIFs were all below 5 (2.28 for surprise, 2.45 for predictability, 2.78 for learning progress, and 1.04 for time).
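For readers unfamiliar with the VIF, the sketch below shows one standard way to compute it: the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictors' correlation matrix, which is equivalent to 1 / (1 − R²) from regressing that predictor on the others. The text does not state how the VIFs were obtained, so this Python version (with hypothetical helper names and toy data) is an illustration of the quantity, not the authors' procedure.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def variance_inflation_factors(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    n = len(columns)
    corr = [[1.0 if i == j else pearson(columns[i], columns[j]) for j in range(n)]
            for i in range(n)]
    inverse = invert(corr)
    return [inverse[i][i] for i in range(n)]

# Toy predictors (made-up numbers): two strongly correlated columns
print(variance_inflation_factors([[1.0, 2.0, 3.0], [1.0, 2.0, 4.0]]))
```

In the toy example the two columns are almost collinear, so both VIFs come out far above the threshold of 5 mentioned in the text.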

A similar model was fitted for fixation times. The only difference was that, in addition to the information-theoretic measures and time, saccadic latencies were also added as a covariate. Given that saccadic latencies and fixation times were related (*r* = −0.59), this allowed us to estimate the relationship between information-theoretic measures and fixation time while controlling for fluctuations in saccadic latencies. The results are reported in table S5. The selected model fitted better than a null model (ΔAIC = 59) and also better than a simple linear model (ΔAIC = 248). Last, in this model too, the VIFs were all below 5 (2.38 for surprise, 2.49 for predictability, 2.81 for learning progress, 1.05 for saccadic latencies, and 1.04 for time).

### Ideal learner model

In the current study, we expected infants to keep track of the probabilities with which targets appeared in the four quadrants, update these probabilities at each trial, track the level of surprise of each event, the level of predictability of the sequence at each trial, and the amount of learning progress that the trial offered. Following previous literature (*23*, *40*), we developed an ideal learner model that performs the same computations.

The model is presented with a set of events *x*. An event is, for example, the target appearing in the upper left corner of the screen. The events followed each other until the sequence ended (or the infant looked away). The last event of a sequence, which also coincides with the length of the sequence, is named *j*, and the sequence can thus be denoted by *X*^{j} = {*x*^{1}, …, *x*^{j}}. The first goal of the model is to estimate the probability with which a certain event *x* will occur. Given that the target can appear in one of four possible locations *k*, the distribution of probabilities can be parameterized by the random vector *p* = [*p*_{1}, …, *p*_{k}], where *p*_{k} is the probability of the target appearing in the *k*th location. In our specific case, the target locations are four, and thus, *p* = [*p*_{1}, *p*_{2}, *p*_{3}, *p*_{4}]. The ideal learner treats *p*_{1:4} as parameters that must be estimated trial by trial given *X*^{j}. In other words, given the past events up until the current trial, the ideal learner will estimate the probabilities with which the target will appear in any of the four possible target locations.

At the very beginning of each sequence, the ideal learner expects the target to appear in one of the four target locations with equal probability. This is expressed here as a prior Dirichlet distribution

$$P(p \mid \alpha) = \text{Dir}(p; \alpha_k)$$ (1)

where all elements of α are equal to one, α = [1, 1, 1, 1]. In this case, the parameter α determines prior expectations. If there is an imbalance between the values of α (e.g., α = [100, 1, 1, 1]), the model is biased toward expecting the target in one location (*p*_{1} in the example) rather than the others. Conversely, when the numbers are equal, the ideal learner has no biases toward any location. Moreover, high numbers indicate that the model has strong expectations, while low numbers indicate that the model will quickly change its expectations when presented with new evidence. Thus, by specifying α = [1, 1, 1, 1], we define a weak uniform prior distribution. In other words, the model has no bias toward any location but is ready to change these expectations if presented with contradicting evidence.

At every trial, the prior distribution is updated given the observation of the new event *x* from the set *X*^{j}. The posterior distribution of such update is given by

$$P(p \mid X^{j}, \alpha) = \text{Dir}(p; \alpha_k + n_k^{j})$$ (2)

where ${n}_{k}^{j}$ refers to the number of outcomes of type *k* observed up until the trial *j*. As a practical example, imagine that, at trial 1, the model observes a target in the location 1 (i.e., [1, 0, 0, 0]). The values of α will be updated with the evidence accumulated, thus moving from [1, 1, 1, 1] to [2, 1, 1, 1]. This implies that now it is slightly more likely to see the target in location 1 than in any of the other locations. Specifically, the probability of the target appearing in any location can be computed from the posterior distribution *P*(*p* ∣ *X*^{j}, α) in the following fashion

$$p(x = k \mid X^{j}, \alpha) = \frac{n_k^{j} + \alpha_k}{\sum_{k'=1}^{4} \left(n_{k'}^{j} + \alpha_{k'}\right)}$$ (3)

In words, how likely the target is to appear in a certain corner is given by the total number of times it appeared in that corner, plus one (the value of α), divided by the total number of observations, plus 4 (the sum of the values of α). This updating rule implies that as evidence accumulates, new evidence will weigh less. Given that our sequences are stationary (i.e., the most likely location does not change within the same sequence), this assumption is justified for the current task.
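The update rule described in words above can be sketched in a few lines of Python. The function names are hypothetical; the numbers reproduce the practical example from the text, in which one observation in location 1 moves α from [1, 1, 1, 1] to [2, 1, 1, 1].

```python
def posterior_alpha(prior_alpha, counts):
    """Add the observed counts per location to the prior Dirichlet parameters."""
    return [a + n for a, n in zip(prior_alpha, counts)]

def location_probabilities(prior_alpha, counts):
    """Probability of each location: (count + alpha) / (total count + sum of alpha)."""
    post = posterior_alpha(prior_alpha, counts)
    total = sum(post)
    return [a / total for a in post]

prior = [1, 1, 1, 1]   # weak uniform prior
counts = [1, 0, 0, 0]  # one target seen in location 1
print(posterior_alpha(prior, counts))         # [2, 1, 1, 1]
print(location_probabilities(prior, counts))  # [0.4, 0.2, 0.2, 0.2]
```

With each further observation the denominator grows by one, which is why later observations shift the estimates less than early ones.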

At every trial *j*, the posterior Dirichlet distribution of trial *j* − 1 becomes the new prior distribution. The new prior is updated using (2), and the probability estimates are computed using (3). When infants look away and a new sequence is played, the prior is set back to (1). This means that we assume that when infants start looking at a new sequence, they consider it as independent of the previous ones. Previous research in adults demonstrated the suitability of this assumption (*24*).

The ideal learner model uses information theory (*24*) to compute the surprise of each event, the predictability of the sequence at each trial, and the learning progress at each trial. Surprise is quantified in terms of Shannon Information, *I*

$$I(x^{j} = k) = -\log p(x^{j} = k \mid X^{j-1}, \alpha)$$ (4)

where *p*(*x*^{j} = *k*) is the probability that an event *x* (i.e., the appearance of the target) will occur in a given location *k* (e.g., the upper left corner). This probability depends on the prior α and on the evidence accumulated on the previous trials, *X*^{j−1}. By taking the negative logarithm of a probability, events that are highly probable will have low levels of surprise, while low-probability events will have a high level of surprise.
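A minimal Python sketch of the surprise computation follows. The function name is hypothetical, and the base-2 logarithm (surprise in bits) is an assumption, since the text only specifies a negative logarithm. The counts passed in are those accumulated before the current trial, matching the dependence on the previous trials.

```python
import math

def surprise(counts_before, observed, alpha=(1, 1, 1, 1)):
    """Negative log probability of the observed location, given earlier trials only."""
    post = [a + n for a, n in zip(alpha, counts_before)]
    p_observed = post[observed] / sum(post)
    return -math.log2(p_observed)  # assumption: base 2, i.e., bits

print(surprise([0, 0, 0, 0], observed=0))  # 2.0 (all four locations equally likely)
print(surprise([5, 0, 0, 0], observed=0))  # expected location: lower surprise
print(surprise([5, 0, 0, 0], observed=3))  # unexpected location: higher surprise
```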

Predictability is quantified in terms of negative entropy, −*H*

$$-H(X^{j}) = \sum_{k=1}^{4} p(x = k \mid X^{j}, \alpha) \log p(x = k \mid X^{j}, \alpha)$$ (5)

Note that, different from surprise, here, predictability is estimated considering also the event *j*, and not just up to *j* − 1. This formula was applied when relating predictability to infants’ looking away and looking time, as they have the information relative to trial *j* when they decide whether to look away and when they look at the target of trial *j*. However, saccadic latencies do not depend on *X*^{j} but rather on *X*^{j−1}, as when planning a saccade toward the target of trial *j*, the target has not appeared yet. Hence, a formula slightly different from (5) was used when relating predictability to saccadic latencies, in which *X*^{j} was replaced by *X*^{j−1}.
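Predictability as negative entropy can be sketched in Python as below. The function name is hypothetical and base 2 is again an assumption. Passing the counts up to and including trial *j* gives the version used for look-aways and looking times; passing the counts up to trial *j* − 1 gives the version used for saccadic latencies.

```python
import math

def negative_entropy(counts, alpha=(1, 1, 1, 1)):
    """Sum over locations of p * log2(p); closer to 0 means more predictable."""
    post = [a + n for a, n in zip(alpha, counts)]
    total = sum(post)
    # alpha > 0 keeps every probability positive, so log2 is always defined
    return sum((a / total) * math.log2(a / total) for a in post)

print(negative_entropy([0, 0, 0, 0]))  # -2.0, the least predictable state
print(negative_entropy([5, 0, 0, 0]))  # higher: one location now dominates
```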

Last, the learning progress is quantified in terms of Kullback-Leibler Divergence (or information gain), *D*_{KL}

$$D_{KL}(p^{j} \parallel p^{j-1}) = \sum_{k=1}^{4} p_k^{j} \log \frac{p_k^{j}}{p_k^{j-1}}$$ (6)

where *p*^{j} is the estimate of the parameters *p*_{1:k} at trial *j*, while *p*^{j−1} is the estimate of the parameters *p*_{1:k} that was performed on the previous trial *j* − 1. Learning progress has been defined as the reduction in the error of an agent’s prediction (*15*). *D*_{KL} is the divergence between a weighted average of prediction error at trial *j* and a weighted average of prediction error at trial *j* − 1, and hence, it is a suitable way to model learning progress in this task.
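As a final illustration, learning progress can be sketched in Python as the Kullback-Leibler divergence from the previous trial's estimate to the current one. The function names are hypothetical, and base 2 and the direction of the divergence (current estimate relative to previous estimate) are assumptions based on the definitions given above.

```python
import math

def estimate(counts, alpha=(1, 1, 1, 1)):
    """Probability estimate per location from counts and prior."""
    post = [a + n for a, n in zip(alpha, counts)]
    total = sum(post)
    return [a / total for a in post]

def learning_progress(counts_before, observed, alpha=(1, 1, 1, 1)):
    """KL divergence between the estimate after and before observing one trial."""
    p_prev = estimate(counts_before, alpha)
    counts_after = list(counts_before)
    counts_after[observed] += 1
    p_now = estimate(counts_after, alpha)
    return sum(pn * math.log2(pn / pp) for pn, pp in zip(p_now, p_prev))

print(learning_progress([0, 0, 0, 0], observed=0))   # first trial: large update
print(learning_progress([10, 0, 0, 0], observed=0))  # later trial: small update
```

The second value is much smaller than the first: once the dominant location is well established, one more confirming observation changes the estimates very little, so there is little left to learn from the sequence.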