Vocabulary sizes: ml_vocabulary

library(multilex)
my_email <- "gonzalo.garciadecastro@upf.edu"
ml_connect(google_email = my_email)

The ml_vocabulary function extracts vocabulary sizes for individual responses to any of the questionnaires:

ml_connect()
#> [1] TRUE
p <- ml_participants()
r <- ml_responses()
ml_vocabulary(participants = p, responses = r)
#> # A tibble: 1,166 x 9
#>    id      time   age type   vocab_count_total vocab_count_dom~ vocab_count_dom~
#>    <chr>  <dbl> <dbl> <chr>              <int>            <int>            <int>
#>  1 bilex~     1  20.9 produ~               240              112              128
#>  2 bilex~     1  20.9 under~               405              206              199
#>  3 bilex~     1  17.1 produ~                 3                3                0
#>  4 bilex~     1  17.1 under~               114              114                0
#>  5 bilex~     1  16.4 produ~                11                6                5
#>  6 bilex~     1  16.4 under~               217              145               72
#>  7 bilex~     1  16.4 produ~                 4                4                0
#>  8 bilex~     1  16.4 under~               125              124                1
#>  9 bilex~     1  16.4 produ~                 5                2                3
#> 10 bilex~     1  16.4 under~                95               78               17
#> # ... with 1,156 more rows, and 2 more variables: vocab_count_conceptual <int>,
#> #   vocab_count_te <int>

Vocabulary sizes can be computed in two different scales:

Counts (default): sum of the total number of items the child was reported to know.
Proportion: proportion of items the child was reported to know, out of the total number of items that were included in the questionnaire and answered by parents.
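The difference between the two scales can be illustrated with a minimal base R sketch. The vector knows is hypothetical toy data, not the package's internal representation, and the code only mirrors the definitions above rather than the actual implementation of ml_vocabulary:

```r
# Hypothetical responses of one child to a five-item questionnaire.
# NA marks an item the parent left unanswered (an assumption for illustration).
knows <- c(TRUE, FALSE, TRUE, NA, TRUE)

# Count scale: number of items the child was reported to know.
vocab_count <- sum(knows, na.rm = TRUE)

# Proportion scale: known items divided by the items that were
# included in the questionnaire and answered.
vocab_prop <- vocab_count / sum(!is.na(knows))

vocab_count
vocab_prop
```

Under this reading, unanswered items are excluded from the denominator, so the proportion is relative to the answered items only.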

By default, five modalities of vocabulary size are computed:

Total: total number of items the child was reported to know, summing both languages together.
L1: number of items the child was reported to know in their dominant language (e.g., Catalan words for a child whose language of most exposure is Catalan).
L2: number of items the child was reported to know in their non-dominant language (e.g., Spanish words for a child whose language of most exposure is Catalan).
Conceptual: number of concepts for which the child knows at least one item (regardless of the language the item belongs to).
TE: number of translation equivalents the child knows (i.e., the number of concepts for which the child knows one item in each language).
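As a rough illustration of how these modalities relate to each other, the sketch below computes them from a small item-level table in base R. The data frame items, its column names, and the dominant variable are all hypothetical and chosen for illustration only. The package may compute these quantities differently.

```r
# Hypothetical item-level responses for one child whose dominant language
# is Catalan. Each concept has one item per language.
items <- data.frame(
  concept  = c("dog", "dog", "table", "table", "apple", "apple", "sun", "sun"),
  language = rep(c("Catalan", "Spanish"), times = 4),
  knows    = c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)
dominant <- "Catalan"

# Total: known items, summing both languages together.
total <- sum(items$knows)

# L1 and L2: known items in the dominant and non-dominant language.
l1 <- sum(items$knows[items$language == dominant])
l2 <- sum(items$knows[items$language != dominant])

# Number of known items per concept (0, 1 or 2 in this toy data).
known_per_concept <- tapply(items$knows, items$concept, sum)

# Conceptual: concepts with at least one known item.
conceptual <- sum(known_per_concept >= 1)

# TE: concepts with a known item in each language.
te <- sum(known_per_concept == 2)

c(total = total, l1 = l1, l2 = l2, conceptual = conceptual, te = te)
```

In this toy data the child knows four items in total but only three concepts, because both the Catalan and the Spanish item for "dog" are known. That pair is the only translation equivalent.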

Vocabulary sizes are also computed in two types:

Comprehension: number of items the child understands
Production: number of items the child says

Vocabulary size as counts

This is what the default output looks like:

library(multilex)
ml_connect()
#> [1] TRUE
p <- ml_participants()
r <- ml_responses(update = FALSE)
ml_vocabulary(participants = p, responses = r)
#> # A tibble: 1,166 x 9
#>    id      time   age type   vocab_count_total vocab_count_dom~ vocab_count_dom~
#>    <chr>  <dbl> <dbl> <chr>              <int>            <int>            <int>
#>  1 bilex~     1  20.9 produ~               240              112              128
#>  2 bilex~     1  20.9 under~               405              206              199
#>  3 bilex~     1  17.1 produ~                 3                3                0
#>  4 bilex~     1  17.1 under~               114              114                0
#>  5 bilex~     1  16.4 produ~                11                6                5
#>  6 bilex~     1  16.4 under~               217              145               72
#>  7 bilex~     1  16.4 produ~                 4                4                0
#>  8 bilex~     1  16.4 under~               125              124                1
#>  9 bilex~     1  16.4 produ~                 5                2                3
#> 10 bilex~     1  16.4 under~                95               78               17
#> # ... with 1,156 more rows, and 2 more variables: vocab_count_conceptual <int>,
#> #   vocab_count_te <int>

This data frame includes two rows per response, one for comprehension vocabulary and one for production vocabulary, and contains the following columns:

id: participant ID. This ID is unique for every participant and is the same across all responses to the questionnaire from the same participant.
time: number of times the participant has completed any of the questionnaires, including this one
age: age in months at time of completion
type: vocabulary size type (understands for comprehension, produces for production)
vocab_count_total: total number of items the child was reported to know, summing both languages together
vocab_count_dominance_l1: number of items the child was reported to know in their dominant language (e.g., Catalan words for a child whose language of most exposure is Catalan)
vocab_count_dominance_l2: number of items the child was reported to know in their non-dominant language (e.g., Spanish words for a child whose language of most exposure is Catalan)
vocab_count_conceptual: number of concepts for which the child knows at least one item (regardless of the language the item belongs to)
vocab_count_te: number of translation equivalents the child knows (i.e., the number of concepts for which the child knows one item in each language)

Vocabulary size as proportions

This is what the output looks like when scale = "prop":


ml_vocabulary(p, r, scale = "prop")
#> # A tibble: 1,166 x 9
#>    id      time   age type   vocab_prop_total vocab_prop_domin~ vocab_prop_domi~
#>    <chr>  <dbl> <dbl> <chr>             <dbl>             <dbl>            <dbl>
#>  1 bilex~     1  20.9 produ~          0.338             0.302            0.378  
#>  2 bilex~     1  20.9 under~          0.570             0.555            0.587  
#>  3 bilex~     1  17.1 produ~          0.00422           0.00877          0      
#>  4 bilex~     1  17.1 under~          0.160             0.333            0      
#>  5 bilex~     1  16.4 produ~          0.0153            0.0173           0.0135 
#>  6 bilex~     1  16.4 under~          0.303             0.419            0.194  
#>  7 bilex~     1  16.4 produ~          0.00574           0.0115           0      
#>  8 bilex~     1  16.4 under~          0.179             0.356            0.00287
#>  9 bilex~     1  16.4 produ~          0.00703           0.00585          0.00813
#> 10 bilex~     1  16.4 under~          0.134             0.228            0.0461 
#> # ... with 1,156 more rows, and 2 more variables: vocab_prop_conceptual <dbl>,
#> #   vocab_prop_te <dbl>

This data frame follows a similar structure to the one returned by ml_vocabulary when run with default arguments, but vocabulary sizes are now expressed as proportions:

id: participant ID. This ID is unique for every participant and is the same across all responses to the questionnaire from the same participant.
time: number of times the participant has completed any of the questionnaires, including this one
age: age in months at time of completion
type: vocabulary size type (understands for comprehension, produces for production)
vocab_prop_total: proportion of items the child was reported to know, summing both languages together
vocab_prop_dominance_l1: proportion of items the child was reported to know in their dominant language (e.g., Catalan words for a child whose language of most exposure is Catalan)
vocab_prop_dominance_l2: proportion of items the child was reported to know in their non-dominant language (e.g., Spanish words for a child whose language of most exposure is Catalan)
vocab_prop_conceptual: proportion of concepts for which the child knows at least one item (regardless of the language the item belongs to)
vocab_prop_te: proportion of translation equivalents the child knows (i.e., the proportion of concepts for which the child knows one item in each language)

We can also ask for vocabulary sizes expressed in both scales (counts and proportions):

ml_vocabulary(p, r, scale = c("count", "prop"))
#> # A tibble: 1,166 x 14
#>    id      time   age type   vocab_count_total vocab_count_dom~ vocab_count_dom~
#>    <chr>  <dbl> <dbl> <chr>              <int>            <int>            <int>
#>  1 bilex~     1  20.9 produ~               240              112              128
#>  2 bilex~     1  20.9 under~               405              206              199
#>  3 bilex~     1  17.1 produ~                 3                3                0
#>  4 bilex~     1  17.1 under~               114              114                0
#>  5 bilex~     1  16.4 produ~                11                6                5
#>  6 bilex~     1  16.4 under~               217              145               72
#>  7 bilex~     1  16.4 produ~                 4                4                0
#>  8 bilex~     1  16.4 under~               125              124                1
#>  9 bilex~     1  16.4 produ~                 5                2                3
#> 10 bilex~     1  16.4 under~                95               78               17
#> # ... with 1,156 more rows, and 7 more variables: vocab_count_conceptual <int>,
#> #   vocab_count_te <int>, vocab_prop_total <dbl>,
#> #   vocab_prop_dominance_l1 <dbl>, vocab_prop_dominance_l2 <dbl>,
#> #   vocab_prop_conceptual <dbl>, vocab_prop_te <dbl>

Conditional vocabulary size: the `by` argument

We can also compute vocabulary sizes conditional on variables at the item or participant level, such as semantic/functional category (category), cognate status (cognate) or language profile (lp), using the argument by. To see which variables are available, take a look at the columns of the data frames returned by ml_participants() or at the pool of items. You can use this argument as follows:

ml_vocabulary(p, r, by = "dominance")
#> # A tibble: 1,166 x 10
#>    id         time   age type     dominance vocab_count_tot~ vocab_count_domina~
#>    <chr>     <dbl> <dbl> <chr>    <chr>                <int>               <int>
#>  1 bilexico~     1  20.9 produces Spanish                240                 112
#>  2 bilexico~     1  20.9 underst~ Spanish                405                 206
#>  3 bilexico~     1  17.1 produces Catalan                  3                   3
#>  4 bilexico~     1  17.1 underst~ Catalan                114                 114
#>  5 bilexico~     1  16.4 produces Catalan                 11                   6
#>  6 bilexico~     1  16.4 underst~ Catalan                217                 145
#>  7 bilexico~     1  16.4 produces Catalan                  4                   4
#>  8 bilexico~     1  16.4 underst~ Catalan                125                 124
#>  9 bilexico~     1  16.4 produces Catalan                  5                   2
#> 10 bilexico~     1  16.4 underst~ Catalan                 95                  78
#> # ... with 1,156 more rows, and 3 more variables:
#> #   vocab_count_dominance_l2 <int>, vocab_count_conceptual <int>,
#> #   vocab_count_te <int>

This data frame follows a similar structure to the ones above, but preserves a column for the grouping variable (dominance in this example, with values such as Spanish or Catalan). The value of this argument is passed to dplyr’s group_by under the hood. As with group_by, you can compute vocabulary sizes for combinations of variables:

ml_vocabulary(p, r, by = c("dominance", "lp"))
#> # A tibble: 1,166 x 11
#>    id       time   age type   dominance lp    vocab_count_tot~ vocab_count_domi~
#>    <chr>   <dbl> <dbl> <chr>  <chr>     <chr>            <int>             <int>
#>  1 bilexi~     1  20.9 produ~ Spanish   Bili~              240               112
#>  2 bilexi~     1  20.9 under~ Spanish   Bili~              405               206
#>  3 bilexi~     1  17.1 produ~ Catalan   Mono~                3                 3
#>  4 bilexi~     1  17.1 under~ Catalan   Mono~              114               114
#>  5 bilexi~     1  16.4 produ~ Catalan   Mono~               11                 6
#>  6 bilexi~     1  16.4 under~ Catalan   Mono~              217               145
#>  7 bilexi~     1  16.4 produ~ Catalan   Mono~                4                 4
#>  8 bilexi~     1  16.4 under~ Catalan   Mono~              125               124
#>  9 bilexi~     1  16.4 produ~ Catalan   Bili~                5                 2
#> 10 bilexi~     1  16.4 under~ Catalan   Bili~               95                78
#> # ... with 1,156 more rows, and 3 more variables:
#> #   vocab_count_dominance_l2 <int>, vocab_count_conceptual <int>,
#> #   vocab_count_te <int>
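The grouping logic behind by can be sketched on toy data. The example below uses base R's aggregate in place of dplyr's group_by so that it runs without extra packages. The data frame items and its values are hypothetical, and the code illustrates grouped counting in general rather than the internals of ml_vocabulary:

```r
# Hypothetical item-level responses for two children, with the
# semantic/functional category of each item.
items <- data.frame(
  id       = rep(c("child_1", "child_2"), each = 4),
  category = rep(c("Animals", "Animals", "Food", "Food"), times = 2),
  knows    = c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Count known items within each combination of participant and category,
# which is what grouping by "category" amounts to in this sketch.
vocab_by_category <- aggregate(knows ~ id + category, data = items, FUN = sum)
names(vocab_by_category)[3] <- "vocab_count_total"

vocab_by_category
```

The result has one row per participant and category, so a grouping variable defined at the item level multiplies the number of rows per response.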
