Data dictionary

This data frame contains information about the word-forms included in the questionnaire, together with some identifiers used to relate participants’ responses to the information of the word-forms they responded to.

items |>
  head(10) |>
  knitr::kable(digits = 2)

te	meaning	language	item	ipa	xsampa	lv	n_phon	n_syll	syll	freq	freq_syll	list
115	witch	Catalan	bruixa	ˈbɾu.ʃə	"b4u.S@	0.60	5	2	b4u, S@	6.23	13.00	A, B, C, D
115	witch	Spanish	bruja	ˈbɾu.xa	"b4u.xa	0.60	5	2	b4u, xa	6.23	13.40	A, B, C, D
148	bee	Catalan	abella	əˈβɛ.ʎə	@"BE.L@	0.20	5	3	@ , BE, L@	6.09	20.48	A, B, C, D
148	bee	Spanish	abeja	aˈβe.xa	a"Be.xa	0.20	5	3	a , Be, xa	6.09	21.36	A, B, C, D
149	animal	Catalan	animals	ˈæn.ɪ.məlz	"{n.I.m@lz	0.25	8	3	{n , I , m@lz	5.90	17.71	D
149	animal	Spanish	animales	a.niˈma.les	a.ni"ma.les	0.25	8	4	a , ni , ma , les	5.90	27.70	D
150	spider	Catalan	aranya	əˈɾa.ɲə	@"4a.J@	0.60	5	3	@ , 4a, J@	6.04	19.55	A
150	spider	Spanish	arana	aˈɾa.ɲa	a"4a.Ja	0.60	5	3	a , 4a, Ja	6.04	21.31	A
153	owl	Catalan	mussol	muˈsɔɫ	mu"sO5	0.20	5	2	mu , sO5	5.95	12.57	A, B, C, D
153	owl	Spanish	buho	ˈbu.o	"bu.o	0.20	3	2	bu, o	5.95	13.60	A, B, C, D

te: integer that uniquely labels a translation equivalent, and is only repeated across the word-forms from Catalan and Spanish that are part of the same translation equivalent
meaning: character string that uniquely labels the concept associated with the word-form, and is only repeated across the word-forms from Catalan and Spanish that are part of the same translation equivalent
language: character string indicating the language (Catalan or Spanish) to which the word-form belongs
item: character string that uniquely identifies the word-form in the questionnaire, and links it to the formr item
ipa: phonological transcription of the word-form in IPA format, generated from the X-SAMPA transcription of the word-form using the ipa::xsampa() function
xsampa: phonological transcription of the word-form in X-SAMPA format
lv: numeric value indicating the normalised inverse of the Levenshtein distance between the X-SAMPA phonological transcriptions of the word-form and of its translation equivalent, calculated using the stringdist::stringsim() function (see the Methods section in the main manuscript for more details)
n_phon: integer indicating the number of phonemes included in the X-SAMPA phonological transcription of the word-form
n_syll: integer indicating the number of syllables included in the X-SAMPA phonological transcription of the word-form
syll: list of character strings in which each element is a syllable included in the word-form
freq: numeric value indicating the lexical frequency of the word-form in Zipf scores, from the English CHILDES corpora
freq_syll: numeric value indicating the sum of the frequencies of the syllables in the word-form, expressed as counts per million tokens
list: character string indicating the questionnaire sub-list(s) in which the word-form appears
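
The sketch below illustrates how n_syll, n_phon, syll and lv relate to the X-SAMPA transcriptions shown in the table. It is not the code used to build the data set: it assumes that the stress (") and syllable (.) markers are removed before counting or comparing, that each phoneme is written with a single X-SAMPA character, and it uses base R's utils::adist() as a stand-in for stringdist::stringsim(). The helper names (split_syllables, strip_markers, lv_similarity) are made up for this example.

```r
# X-SAMPA transcriptions copied from the table above
xsampa_cat <- c(bruixa = '"b4u.S@', abella = '@"BE.L@', mussol = 'mu"sO5')
xsampa_spa <- c(bruja = '"b4u.xa', abeja = 'a"Be.xa', buho = '"bu.o')

# Split on the stress (") and syllable (.) markers, dropping empty pieces
split_syllables <- function(x) {
  lapply(strsplit(x, '[".]'), function(s) s[nzchar(s)])
}

# Remove the markers so that only phoneme symbols remain
strip_markers <- function(x) gsub('[".]', "", x)

syll <- split_syllables(xsampa_cat)          # list of syllables per word-form
n_syll <- lengths(syll)                      # number of syllables
n_phon <- nchar(strip_markers(xsampa_cat))   # one character per phoneme (assumed)

# Normalised Levenshtein similarity:
# 1 - (edit distance / length of the longer transcription)
lv_similarity <- function(a, b) {
  a <- strip_markers(a)
  b <- strip_markers(b)
  d <- mapply(function(x, y) drop(utils::adist(x, y)), a, b)
  1 - d / pmax(nchar(a), nchar(b))
}

lv <- unname(lv_similarity(xsampa_cat, xsampa_spa))
lv
```

For these three translation equivalents the result agrees with the lv column above (0.60, 0.20, 0.20). Transcriptions that use multi-character X-SAMPA symbols would need a different phoneme count than nchar().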

This data frame contains demographic and linguistic information about participants, together with some identifiers used to relate participants’ responses to their corresponding information.

participants |>
  head(10) |>
  knitr::kable(digits = 2)

child_id	response_id	time	time_stamp	list	age	sex	lp	doe_catalan	doe_spanish	edu_parent
54531	872	1	2020-04-19	bvq-short	31.64	male	Monolingual	0.9	0.1	University
54794	828	1	2020-04-15	bvq-short	29.31	female	Monolingual	1.0	0.0	University
54828	826	1	2020-06-11	bvq-short	30.78	male	Monolingual	1.0	0.0	University
54881	1197	1	2020-06-03	bvq-short	29.21	female	Bilingual	0.5	0.5	Complementary
54939	952	1	2020-05-08	bvq-short	28.91	female	Bilingual	0.2	0.7	University
54974	1051	1	2020-05-23	bvq-short	29.24	male	Bilingual	0.4	0.6	Vocational
54978	940	1	2020-05-18	bvq-short	28.98	female	Monolingual	0.1	0.9	University
55011	942	1	2020-05-18	bvq-short	28.45	male	Monolingual	0.2	0.8	Vocational
55027	811	1	2020-04-13	bvq-short	27.14	male	Bilingual	0.6	0.3	University
55056	950	1	2020-05-08	bvq-short	27.99	male	Monolingual	0.1	0.8	University

child_id: integer that uniquely labels a participant, and is only repeated across responses to the questionnaire from the same participant
response_id: integer that uniquely labels questionnaire administrations, and is never repeated across questionnaire administrations or participants
time: integer indicating the cumulative number of times the participant has provided a valid response to the questionnaire
time_stamp: date at which the response to the questionnaire was recorded (last item responded)
list: character string indicating the questionnaire sub-list to which the participant responded, which is virtually always the same for the same participant
age: numeric value indicating the age of the participant when their response to the questionnaire was recorded, calculated as the difference in months between that date and the birth date of the participant
lp: character string indicating the language profile of the participant (Monolingual or Bilingual), calculated from doe_catalan and doe_spanish ("Monolingual" if >=80% DoE to Catalan or Spanish, "Bilingual" otherwise)
doe_catalan: numeric value indicating participant’s degree of exposure (DoE) to Catalan, as reported by their caregivers
doe_spanish: numeric value indicating participant’s degree of exposure (DoE) to Spanish, as reported by their caregivers
edu_parent: factor indicating the caregivers’ maximum educational attainment
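
As an illustration of the lp rule described above, the following sketch classifies a few participants from their doe_catalan and doe_spanish values. The rows are copied from the table; the object names (participants_toy, threshold) are invented for this example, and the actual preprocessing code may differ.

```r
# A few rows copied from the participants table above
participants_toy <- data.frame(
  child_id    = c(54531, 54881, 54939, 55056),
  doe_catalan = c(0.9, 0.5, 0.2, 0.1),
  doe_spanish = c(0.1, 0.5, 0.7, 0.8)
)

# "Monolingual" if exposure to either language reaches 80%, "Bilingual" otherwise
threshold <- 0.8
participants_toy$lp <- ifelse(
  pmax(participants_toy$doe_catalan, participants_toy$doe_spanish) >= threshold,
  "Monolingual",
  "Bilingual"
)

participants_toy
```

Note that doe_catalan and doe_spanish do not always add up to 1 (for example 0.2 and 0.7), so the rule is applied to the larger of the two values rather than to their difference.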

This data frame is the one used to fit the main model, and the model included in Appendix A. It contains participants’ responses to each item included in the sub-list of the questionnaire they responded to, together with participant- and word-level predictors of interest.

responses |>
  head(10) |>
  knitr::kable(digits = 2)

child_id	response_id	time	age	age_std	te	language	meaning	item	response	lv	lv_std	freq	freq_std	n_phon	n_phon_std	doe	doe_std	exposure	exposure_std
54531	872	1	31.64	1.96	115	Catalan	witch	bruixa	Understands and Says	0.60	0.98	6.23	1.10	5	-0.21	0.9	1.36	5.61	1.46
54531	872	1	31.64	1.96	115	Spanish	witch	bruja	Understands	0.60	0.98	6.23	1.10	5	-0.21	0.1	-1.31	0.62	-1.29
54531	872	1	31.64	1.96	148	Catalan	bee	abella	Understands and Says	0.20	-0.58	6.09	0.31	5	-0.21	0.9	1.36	5.48	1.39
54531	872	1	31.64	1.96	148	Spanish	bee	abeja	No	0.20	-0.58	6.09	0.31	5	-0.21	0.1	-1.31	0.61	-1.30
54531	872	1	31.64	1.96	153	Catalan	owl	mussol	Understands and Says	0.20	-0.58	5.95	-0.45	5	-0.21	0.9	1.36	5.35	1.32
54531	872	1	31.64	1.96	153	Spanish	owl	buho	No	0.20	-0.58	5.95	-0.45	3	-1.49	0.1	-1.31	0.59	-1.31
54531	872	1	31.64	1.96	159	Catalan	snail	cargol	Understands and Says	0.29	-0.25	5.84	-1.03	6	0.43	0.9	1.36	5.25	1.26
54531	872	1	31.64	1.96	159	Spanish	snail	caracol	Understands	0.29	-0.25	5.84	-1.03	7	1.07	0.1	-1.31	0.58	-1.32
54531	872	1	31.64	1.96	160	Catalan	zebra	zebra	Understands	0.60	0.98	6.13	0.54	5	-0.21	0.9	1.36	5.52	1.41
54531	872	1	31.64	1.96	160	Spanish	zebra	cebra	No	0.60	0.98	6.13	0.54	5	-0.21	0.1	-1.31	0.61	-1.30

child_id: integer that uniquely labels a participant, and is only repeated across responses to the questionnaire from the same participant
response_id: integer that uniquely labels questionnaire administrations, and is never repeated across questionnaire administrations or participants
age: numeric value indicating the age of the participant when their response to the questionnaire was recorded, calculated as the difference in months between that date and the birth date of the participant
age_std: numeric value indicating the participant’s standardised age
te: integer that uniquely labels a translation equivalent, and is only repeated across the word-forms from Catalan and Spanish that are part of the same translation equivalent
language: character string indicating the language (Catalan or Spanish) to which the word-form belongs
meaning: character string that uniquely labels the concept associated with the word-form, and is only repeated across the word-forms from Catalan and Spanish that are part of the same translation equivalent
item: character string that uniquely identifies the word-form in the questionnaire, and links it to the formr item
response: ordered factor indicating participant’s response to the item, which can take "No", "Understands", or "Understands and Says" as values
lv: numeric value indicating the normalised inverse of the Levenshtein distance between the X-SAMPA phonological transcriptions of the word-form and of its translation equivalent, calculated using the stringdist::stringsim() function (see the Methods section in the main manuscript for more details)
lv_std: numeric value indicating the word-form’s standardised lv
freq: numeric value indicating the lexical frequency of the word-form in Zipf scores, from the English CHILDES corpora
freq_std: numeric value indicating the word-form’s standardised freq
n_phon: integer indicating the number of phonemes included in the X-SAMPA phonological transcription of the word-form
n_phon_std: numeric value indicating the word-form’s standardised n_phon
doe: numeric value indicating participant’s degree of exposure (DoE) to the language the item belongs to
doe_std: numeric value indicating the participant’s standardised doe to the language the item belongs to
exposure: numeric value indicating the participant’s exposure score to the word-form, calculated as the product of freq and doe
exposure_std: numeric value indicating the participant’s standardised exposure score for the item (see the Methods section in the main manuscript for more details)
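
The sketch below shows one way the participant- and word-level columns could be combined into doe, exposure and the *_std variables. It is an illustration under stated assumptions, not the code used for the analysis: the toy data frames and the standardise() helper are invented here, and standardisation is assumed to mean centring on the mean and dividing by the standard deviation (the Methods section of the main manuscript describes the procedure actually used).

```r
# Word-level and participant-level values copied from the tables above
items_toy <- data.frame(
  te       = c(115, 115, 148, 148),
  language = c("Catalan", "Spanish", "Catalan", "Spanish"),
  item     = c("bruixa", "bruja", "abella", "abeja"),
  freq     = c(6.23, 6.23, 6.09, 6.09)
)
participant_toy <- data.frame(
  child_id = 54531, doe_catalan = 0.9, doe_spanish = 0.1
)

# Cartesian product: one row per participant-item pair
responses_toy <- merge(participant_toy, items_toy, by = NULL)

# doe is the degree of exposure to the language the item belongs to
responses_toy$doe <- ifelse(
  responses_toy$language == "Catalan",
  responses_toy$doe_catalan,
  responses_toy$doe_spanish
)

# exposure is the product of lexical frequency and degree of exposure
responses_toy$exposure <- responses_toy$freq * responses_toy$doe

# Assumed standardisation: z-scores computed over the rows of the data frame
standardise <- function(x) (x - mean(x)) / sd(x)
responses_toy$exposure_std <- standardise(responses_toy$exposure)

responses_toy[, c("item", "doe", "exposure", "exposure_std")]
```

Rounded to two decimals, the exposure values reproduce those in the table (5.61, 0.62, 5.48, 0.61). The standardised values will not match the table, because here the mean and standard deviation come from four toy rows rather than from the full data set.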