Different languages, similar encoding efficiency: Comparable information rates across the human communicative niche, Coupé et al., 2019

Paper, Tags: #linguistics, #social-sciences

Using quantitative methods on a large cross-linguistic corpus of 17 languages, we show that the coupling between language-level (information per syllable) and speaker-level (speech rate, SR) properties results in languages encoding similar information rates (IR): ~39 bits/s. This highlights the intimate feedback loops between languages and their speakers due to communicative pressures.

We suggest that this phenomenon is rooted in the human neurocognitive capacity, and that human language can be analyzed as the product of a multiscale communicative and cultural niche construction process involving biology, environment and culture.

The Uniform Information Density hypothesis suggests that speakers distribute information along the speech signal smoothly rather than with high-amplitude fluctuations. Compatible with Shannon's theory, this optimization process guarantees robust information transmission at a rate close to the channel capacity.

In this work we compare, across very different languages (note: only 17 though), the average rates at which information is emitted. This enables us to estimate the channel capacity and to assess whether the differences observed among languages in terms of encoding result in:

  • analogous differences in channel capacity, or
  • compensating strategies that go beyond the local adaptation operating during speech production

We investigate the interaction between information encoding and average SR, and whether the variation among languages in IR is regulated by communicative constraints. Low IR -> communicative inefficiency? Very high IR -> heavy physiological and cognitive costs?

We've chosen to focus on the syllable as the information encoding unit for both linguistic and cognitive reasons:

  1. Syllables are much less prone than phonemes to complete deletion in casual speech -> more robust estimates of SR
  2. Morphemes or words rely more on a language-specific linguistic analysis
  3. Several recent neurocognitive models and studies support the pivotal role of the syllabic time scale in speech comprehension.

We studied 17 languages from 9 language families spread across Europe and Asia, showing a remarkable diversity in terms of linguistic and typological features at all levels: phonetics, phonology, morphology, syntax, semantics, pragmatics.

  • Phonemes
    • Spanish, Japanese: 25
    • English, Thai: 40
  • Distinct syllables:
    • Japanese: a few hundred
    • English: 7000
  • Tonal complexity:
    • From none to six contrastive tones

Because of the size (note: size?) and diversity, the sample is adequate to reveal robust trends reflecting phenomena that can potentially be extrapolated to human language in general. (note: really?)

Experiment

We collected recordings of 170 native adult speakers, each reading at their normal rate a standardized set of 15 semantically similar texts across the languages (~240k syllables in total). For each recording we extracted the duration in seconds and the total number of syllables (NS) of the text's 'canonical' pronunciation (= the standard pronunciation found in dictionaries and lexical databases). We obtain the speech rate SR as the ratio of NS to duration.
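The SR definition above can be sketched in a few lines. The function name and the example recording (130 syllables read in 20 seconds) are hypothetical, chosen only to illustrate the ratio; they are not taken from the paper's data.

```python
def speech_rate(num_syllables: int, duration_s: float) -> float:
    """Speech rate SR in syllables per second: NS divided by duration."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return num_syllables / duration_s


# Hypothetical recording: 130 canonical syllables read in 20 seconds.
sr = speech_rate(130, 20.0)
print(f"SR = {sr:.2f} syllables/s")  # SR = 6.50 syllables/s
```

Because NS comes from the canonical pronunciation, syllables that a speaker drops in practice still count, so SR here measures the pace through the text rather than the number of syllables actually articulated.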

In parallel, we estimated each language's information density ID from independently available written corpora, using the syllable conditional entropy to take word-internal syllable-bigram dependencies into account. We then compute the average IR by multiplying the ID by the SR for each text read by each speaker in our dataset. Individual SR varies by a factor of more than 2: the slowest speaker reads at 4.3 syllables/s, the fastest at 9.1 syllables/s. ID also varies, from 4.8 bits/syllable in Basque to 8 bits/syllable in Vietnamese.
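A minimal sketch of the two quantities described above, under stated assumptions: the lexicon format (a mapping from syllable tuples to token frequencies), the "#" word-boundary marker, and the toy lexicon are all hypothetical, and the paper's actual estimation pipeline may differ in its details.

```python
import math
from collections import Counter


def conditional_entropy(words):
    """H(syllable | previous syllable in the same word), in bits.

    `words` maps a tuple of syllables to its token frequency.
    "#" stands in for the word boundary before the first syllable
    (an assumption of this sketch).
    """
    pair_counts = Counter()     # counts of (previous, current) pairs
    context_counts = Counter()  # counts of each `previous` context
    for syllables, freq in words.items():
        prev = "#"
        for syl in syllables:
            pair_counts[(prev, syl)] += freq
            context_counts[prev] += freq
            prev = syl
    total = sum(pair_counts.values())
    h = 0.0
    for (prev, syl), n in pair_counts.items():
        p_joint = n / total                # P(prev, syl)
        p_cond = n / context_counts[prev]  # P(syl | prev)
        h -= p_joint * math.log2(p_cond)
    return h


def information_rate(id_bits_per_syllable, sr_syllables_per_s):
    """IR in bits/s: information density times speech rate."""
    return id_bits_per_syllable * sr_syllables_per_s


# Toy lexicon (not real data): every syllable is a fair binary choice
# given its context, so the conditional entropy is exactly 1 bit.
toy_lexicon = {("a", "x"): 1, ("a", "y"): 1, ("b", "x"): 1, ("b", "y"): 1}
print(conditional_entropy(toy_lexicon))  # 1.0

# Hypothetical speaker: ID of 6.0 bits/syllable read at 6.5 syllables/s.
print(information_rate(6.0, 6.5))  # 39.0
```

Conditioning on the previous syllable lowers the estimate relative to a plain unigram entropy whenever syllables within a word are predictable from their predecessor.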

Results

We check if our definition of ID provides a relevant measure of linguistic ID using the syntagmatic density of information ratio (SDIR): this quantifies the relative informational density of language L compared to a reference language, based on semantic information expressed in the context of a limited oral corpus. It provides the ground truth on the semantic information conveyed by the sentences in the spoken corpus. We find a high correlation between ID and SDIR which suggests that despite differences in:

  • material
    • heterogeneous and written corpus vs.
    • parallel and spoken corpus
  • nature
    • entropy measured on a large lexicon vs.
    • normalized ratio derived from small texts

our ID is a good estimate of the average amount of information per syllable.
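The validation step amounts to correlating two per-language measures. The sketch below uses a plain Pearson coefficient, which is an assumption here (the notes do not say which correlation the authors report), and the four value pairs are made-up placeholders, not figures from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for four imaginary languages (NOT from the paper):
# entropy-based ID in bits/syllable and SDIR relative to a reference language.
id_values = [5.0, 6.0, 7.0, 8.0]
sdir_values = [0.6, 0.7, 0.9, 1.0]
r = pearson(id_values, sdir_values)
print(f"r = {r:.2f}")
```

A value of r close to 1 would mean that languages ranked as dense by the entropy measure are also ranked as dense by the semantic measure, which is the pattern the text describes.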

We used generalized additive models for location, scale and shape (GAMLSS) because they allow modelling both the mean and the SD of Gaussian distributions, with sex as a fixed effect and text, language and speaker as random effects.

We found that IR is centered on a mean of 39.15 bits/s with an SD of 5.10 bits/s, while SR is centered on a mean of 6.63 syllables/s with an SD of 1.15 syllables/s. Females have significantly lower means and SDs. Language has the most impact on SR, while speaker has the largest impact on IR. In other words, while SR is mainly clustered by language, the influence of this language-level clustering is reduced for IR.
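One rough way to compare the spread of IR and SR, which are in different units, is the coefficient of variation (SD divided by mean). This is a back-of-the-envelope check added here using the means and SDs reported above (IR mean read as 39.15 bits/s); it is not the paper's own analysis, which relies on the GAMLSS models.

```python
def coefficient_of_variation(mean, sd):
    """Unitless relative spread: standard deviation divided by the mean."""
    if mean == 0:
        raise ValueError("mean must be non-zero")
    return sd / mean


# Means and SDs as reported in the summary above.
cv_ir = coefficient_of_variation(39.15, 5.10)  # IR in bits/s
cv_sr = coefficient_of_variation(6.63, 1.15)   # SR in syllables/s

print(f"CV(IR) = {cv_ir:.3f}")  # CV(IR) = 0.130
print(f"CV(SR) = {cv_sr:.3f}")  # CV(SR) = 0.173
```

The relative spread of IR comes out smaller than that of SR, which is at least consistent with the claim that multiplying by ID evens out part of the variation in SR.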

To explore the relationship between SR and ID, we included ID as a fixed effect in the GAMLSS modeling of SR, while dropping language as a random effect. ID has a negative effect on both the mean and the SD of SR. This negative relationship indicates that there's a trade-off between SR and ID, with lower-ID languages being spoken faster. We also found that languages are significantly more similar to each other in IR than in NS (number of syllables per text) and SR.
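The trade-off follows directly from IR = ID × SR: if IR were held at roughly 39 bits/s, a language's ID would fix the speech rate needed to reach it. The sketch below works through that arithmetic for the two ID extremes mentioned earlier. The constant-IR assumption is an idealization, and the resulting rates are implied values, not measured speech rates for Basque or Vietnamese.

```python
def implied_speech_rate(ir_bits_per_s, id_bits_per_syllable):
    """SR (syllables/s) needed to reach a given IR at a given ID."""
    if id_bits_per_syllable <= 0:
        raise ValueError("ID must be positive")
    return ir_bits_per_s / id_bits_per_syllable


TARGET_IR = 39.0  # bits/s, the approximate cross-language mean (idealized)

# ID extremes reported in the notes: 4.8 and 8 bits/syllable.
sr_low_id = implied_speech_rate(TARGET_IR, 4.8)
sr_high_id = implied_speech_rate(TARGET_IR, 8.0)

print(f"ID 4.8 -> SR {sr_low_id:.1f} syllables/s")   # ID 4.8 -> SR 8.1 syllables/s
print(f"ID 8.0 -> SR {sr_high_id:.1f} syllables/s")  # ID 8.0 -> SR 4.9 syllables/s
```

Both implied rates fall inside the 4.3 to 9.1 syllables/s range observed across individual speakers, so the compensation described here does not require implausible speech rates.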

Discussion

We investigated the relationship between ID (estimated from written data) and SR (computed from parallel spoken data) across 17 languages. Our results, based on controlled linguistic material consisting of read speech, do reflect actual phenomena found in more natural settings. The effects of text in the SR and IR models are much smaller than those of speaker and language.

While there's wide interspeaker variation in SR and IR, this variation is also structured by language: an individual's speech behavior isn't entirely due to individual characteristics but is further constrained by the language being spoken.

note: get polyglots doing this test and get their SR and ID in different languages

Human communication seems to avoid two extreme sociolinguistic profiles: high-ID languages spoken fast by their speakers (high-fast), and low-ID languages spoken slowly by their speakers (low-slow). Both speakers and listeners have an interest in avoiding high-fast languages. Low-slow languages, if they existed, would lead to longer turns in interaction, and they would require their speakers to keep longer chunks in working memory for a given informational content. One can hypothesize that speakers of such a language would accelerate their articulation rate to compensate for the low ID.

ID depends on the language itself (its grammar, lexicon and long-term usage), while SR reflects individual speakers' behavior: individuals continuously monitor and adapt their SR to the specific linguistic and communicative context.

Code, data, analysis and plots can be found on GitHub.