Speech-To-Screen: Improving speech intelligibility for small screen devices using binaural auralisation and head tracking.

Posted on 14.05.2022 - 11:17 by Philippa Demonte
Speech-To-Screen was a subjective, quantitative psychoacoustic listening experiment conducted in the Listening Room at the University of Salford in 2017 as part of the EPSRC-funded S3A - Future Spatial Audio in the Home project. The experiment addressed the question of how best to improve the speech intelligibility of audio content played over headphones from small-screen devices, such as mobile phones or tablets, particularly when the listener is out and about. One possible solution is to spatialise speech and noise and separate them from each other. Thanks to rapid technological development, this will become increasingly feasible with binaural auralisation enabled by: object-based approaches to audio; the generation of personalised HRTFs; head-tracking capability from accelerometers in headphones and earbuds; and the development of audio and head-tracking rendering software for mobile devices.

The Speech-To-Screen listening experiment involved a speech-in-noise test: subjects listened over headphones to spoken sentences played with either speech-shaped noise (SSN)** (stationary noise) or speech-modulated noise (SMN)** (time-varying noise). They were tasked with identifying target words, in this case letter and number pairs from the GRID speech corpus, which they entered via a graphical user interface. As a proxy for the effect of each independent variable condition on speech intelligibility, the collected responses were scored for correct word identification. The scores were then collated across all subjects for each condition. The data were analysed with a repeated-measures ANOVA (RMANOVA), whose assumptions the data fulfilled, followed by post-hoc pairwise comparisons with the Bonferroni correction applied.

** The SSN was set at -9 dB SNR and the SMN at -12 dB SNR, with speech presented at a calibrated level of 69 dB(A).
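
As an illustration of what these SNR settings imply, the following minimal Python sketch scales a noise signal relative to a speech signal to reach a target SNR. It assumes an RMS-based SNR computed over the whole signal, which is not stated in this description; the function names and the random placeholder signals are hypothetical, and the calibrated playback level of 69 dB(A) is not modelled.

```python
import numpy as np


def rms(x):
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))


def snr_db(speech, noise):
    """SNR in dB, assuming an RMS-based definition (an assumption, see lead-in)."""
    return 20.0 * np.log10(rms(speech) / rms(noise))


def scale_noise_to_snr(speech, noise, target_snr_db):
    """Return a scaled copy of `noise` so that snr_db(speech, result) equals the target."""
    gain = rms(speech) / (rms(noise) * 10.0 ** (target_snr_db / 20.0))
    return noise * gain


# Placeholder signals only; these are not the GRID sentences or the experiment's noise.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
noise = rng.standard_normal(16000)

ssn_scaled = scale_noise_to_snr(speech, noise, -9.0)   # SSN setting from the text
smn_scaled = scale_noise_to_snr(speech, noise, -12.0)  # SMN setting from the text
```

A negative SNR means the noise is louder than the speech, so both scaled noise signals end up with a higher RMS level than the speech placeholder.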

The independent variables tested were:


- INT: speech internalised; noise internalised
- SN: speech internalised; noise externalised
- NS: speech externalised; noise internalised
- EXT: speech externalised; noise externalised

i.e. two conditions in which speech and noise are co-located, and two conditions in which the impression of spatial separation between speech and noise is generated. INT is the control condition, i.e. speech + noise played simultaneously in regular stereo for headphones such that the sound source is perceived to be within the listener's head (internalised). To promote externalisation, the relevant sound was rendered with binaural room impulse responses (BRIRs) selected by dynamic head tracking: speech was convolved with the 1 m / 0 deg azimuth BRIR set, and noise was convolved with the +/- 30 deg azimuth BRIR set recorded previously in the same room by Satongar, Pike and Lam (2014).


In addition, video of the speaker was either shown or not:

- video of speaker off (0)
- video of speaker on (1)
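
The externalised rendering described above, convolving a signal with a BRIR chosen according to the tracked head orientation, can be sketched roughly as follows. This is a simplified illustration and not the rendering software used in the experiment: `select_brir` and `render_binaural` are hypothetical names, the BRIR set is assumed to be a dictionary keyed by head yaw in degrees, and the toy impulse responses are invented. A real dynamic renderer would also update and crossfade the filters block by block as the head moves.

```python
import numpy as np


def select_brir(brir_set, head_yaw_deg):
    """Pick the BRIR whose (assumed) head-yaw key is nearest to the tracked yaw."""
    nearest = min(brir_set, key=lambda yaw: abs(yaw - head_yaw_deg))
    return brir_set[nearest]


def render_binaural(mono, brir):
    """Convolve a mono signal with a two-channel BRIR of shape (taps, 2)."""
    left = np.convolve(mono, brir[:, 0])
    right = np.convolve(mono, brir[:, 1])
    return np.stack([left, right], axis=1)


# Toy two-tap "BRIRs" (invented values), keyed by head yaw in degrees.
toy_set = {
    -30: np.array([[0.5, 1.0], [0.0, 0.0]]),
    0: np.array([[1.0, 1.0], [0.0, 0.0]]),
    30: np.array([[1.0, 0.5], [0.0, 0.0]]),
}

mono = np.array([1.0, 2.0, 3.0])
out = render_binaural(mono, select_brir(toy_set, head_yaw_deg=12.0))
```

The output has one column per ear and is longer than the input by the BRIR length minus one sample, as expected for a full linear convolution.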

A total of 20 normal-hearing, native English speakers aged 18 to 40 took part in the listening experiment. Each participant completed 20 trials per combination of independent variables, i.e. 320 trials in total per participant.
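
The total of 320 trials implies 16 combinations at 20 trials each. The short sketch below shows one factorisation that gives this count: four spatial conditions crossed with two video conditions and two noise types. Treating noise type (SSN or SMN) as a fully crossed factor is an inference from the numbers and not something stated explicitly here.

```python
from itertools import product

spatial = ["INT", "SN", "NS", "EXT"]  # spatial conditions listed above
video = [0, 1]                        # video of speaker off / on
noise = ["SSN", "SMN"]                # assumption: noise type is a crossed factor

conditions = list(product(spatial, video, noise))
trials_per_condition = 20
total_trials = len(conditions) * trials_per_condition

print(len(conditions), total_trials)
```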

This Salford Figshare collection contains:

- a zip file with the raw data collected via a Matlab-generated graphical user interface for each trial by each subject;

- a Read Me file regarding the raw data;

- a .txt file with the word recognition percentages (as a quantifiable proxy for speech intelligibility) from the data collated across trials by subject and condition;

- an Excel spreadsheet with a full overview of the data and statistical analyses.
See the tab 'ratios (scores-div-20)' > Percentages (about halfway down the sheet) for the processed data used in the statistical analyses.
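
The tab name suggests that each processed value is a score divided by 20 and expressed as a percentage. A minimal sketch of that conversion is given below, assuming a score is the number of trials counted as correct out of the 20 trials for one subject and condition. How a trial is counted as correct (letter, number, or both) is not specified here, and the keys and values in `example_scores` are made up for illustration.

```python
def percent_correct(score, n_trials=20):
    """Convert a correct-trial count into a percentage of n_trials."""
    if not 0 <= score <= n_trials:
        raise ValueError("score must lie between 0 and n_trials")
    return 100.0 * score / n_trials


# Made-up scores keyed by (subject, spatial condition, video); not experiment data.
example_scores = {
    ("S01", "INT", 0): 13,
    ("S01", "EXT", 1): 18,
}

percentages = {key: percent_correct(value) for key, value in example_scores.items()}
```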

Full details of the Speech-To-Screen listening experiment, including the BRIRs and head tracking used for the binaural auralisation as well as the final results and analyses, can be found in:

* Demonte et al. (2018). Speech-To-Screen: Spatial separation of dialogue from noise towards improved speech intelligibility for the small screen. Audio Engineering Society Convention Paper 10011. Presented at the 144th Convention, 2018 May 23-26, Milan, Italy.

* University of Salford PhD thesis by P. Demonte (2022).

Contact: Philippa Demonte:
email (1) p.demonte@edu.salford.ac.uk
email (2) philippademonte@gmail.com


Demonte, Philippa; Tang, Yan; Hughes, Richard James; Cox, Trevor John; Fazenda, Bruno Miguel; Shirley, Ben Guy (2022): Speech-To-Screen: Improving speech intelligibility for small screen devices using binaural auralisation and head tracking. University of Salford. Collection. https://doi.org/10.17866/rd.salford.c.5974915.v1


S3A: Future Spatial Audio for an Immersive Listener Experience at Home

Engineering and Physical Sciences Research Council

