Data Dictionary
Complete reference guide for the CorpusWeave speech dataset structure, fields, languages, and quality standards.
submission_id (Required): Unique identifier for each audio submission
Example:
550e8400-e29b-41d4-a716-446655440000
contributor_id (Required): Unique identifier for the contributor
Example:
660e8400-e29b-41d4-a716-446655440000
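Both identifier examples above have the shape of a canonical hyphenated UUID. Assuming that holds for every submission_id and contributor_id (the dictionary does not say so explicitly), a minimal check might look like the sketch below. The helper name is_valid_id is hypothetical.

```python
import uuid


def is_valid_id(value) -> bool:
    """Return True if value is a UUID string in canonical hyphenated form.

    Assumption: identifiers are UUIDs, as the examples suggest.
    """
    try:
        # Round-trip through uuid.UUID so that only the canonical
        # 8-4-4-4-12 hyphenated form is accepted.
        return str(uuid.UUID(value)) == value.lower()
    except (ValueError, AttributeError, TypeError):
        return False
```

The round-trip comparison rejects strings that uuid.UUID would otherwise accept, such as 32 hex digits without hyphens.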
language (Required): Language of the audio sample
Example:
lg (Luganda), lsg (Lusoga), lms (Lumasaba), ach (Acholi), run (Runyankore), ate (Ateso), lug (Lugbara), en (English)
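A membership check against the listed codes could be sketched as follows. It assumes the list above is exhaustive, which the dictionary does not state, and the names SUPPORTED_LANGUAGES and is_supported_language are hypothetical. Note that these codes are the dataset's own labels: in ISO 639-3, for instance, lug denotes Ganda (Luganda) and run denotes Rundi, so passing these values to an ISO 639 lookup library would likely give misleading results.

```python
# Codes copied from the language field's example list. They are the
# dataset's own labels, not necessarily ISO 639 codes.
# Assumption: this list is complete.
SUPPORTED_LANGUAGES = frozenset(
    {"lg", "lsg", "lms", "ach", "run", "ate", "lug", "en"}
)


def is_supported_language(code: str) -> bool:
    """Return True if code is one of the listed dataset language codes."""
    return code in SUPPORTED_LANGUAGES
```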
audio_url (Required): URL to the recorded audio file
Example:
https://corpus.example.com/audio/submission_12345.wav
duration_seconds (Required): Duration of the audio in seconds
Example:
15, 30, 45
text_content (Required): The sentence or text that was recorded
Example:
The quick brown fox jumps over the lazy dog
quality_score: Average quality rating from validators (0-5 scale)
Example:
4.5, 3.2, 5.0
validation_count: Number of validators who reviewed this submission
Example:
5, 12, 8
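Read together, quality_score and validation_count describe the mean and the count of the same set of validator ratings. The sketch below shows one way the two values could be derived. It is an assumption about how the fields relate, not the dataset's actual pipeline: the function name summarize_ratings is hypothetical, and the handling of a submission with no ratings (None and 0) is a guess.

```python
from statistics import mean


def summarize_ratings(ratings):
    """Return (quality_score, validation_count) for a list of 0-5 ratings.

    Assumption: quality_score is the plain arithmetic mean, unrounded.
    """
    for rating in ratings:
        if not 0 <= rating <= 5:
            raise ValueError(f"rating out of range 0-5: {rating!r}")
    if not ratings:
        # No validator has reviewed the submission yet.
        return None, 0
    return mean(ratings), len(ratings)
```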
is_approved: Whether the submission has been approved for the dataset
Example:
true, false
points_awarded (Required): Points awarded to the contributor for this submission
Example:
50, 125, 200
metadata: Additional metadata, such as microphone type and background noise level
Example:
{"microphone": "built-in", "noise_level": "low", "accent": "neutral"}
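The metadata example is a JSON object, so it can be decoded with a standard JSON parser. The sketch below assumes the field arrives as a JSON string and may be absent, since it is not marked Required. The helper name parse_metadata is hypothetical, and no fixed set of keys is assumed beyond those shown in the example.

```python
import json


def parse_metadata(raw):
    """Decode the metadata field into a dict, or return {} if it is absent."""
    if raw is None or raw == "":
        return {}
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("metadata must be a JSON object")
    return data


example = '{"microphone": "built-in", "noise_level": "low", "accent": "neutral"}'
metadata = parse_metadata(example)
# Use .get() for individual keys, since none of them is documented as mandatory.
noise_level = metadata.get("noise_level")
```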
created_at (Required): When the submission was created
Example:
2025-03-15T10:30:00Z
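The created_at example is an ISO 8601 timestamp in UTC. As a minimal sketch, the code below parses that timestamp and reports which of the fields marked Required are missing from a record. It assumes records are handled as plain dictionaries keyed by the field names in this dictionary; the helper names parse_created_at and missing_required are hypothetical.

```python
from datetime import datetime

# Fields carrying the "Required" marker in this data dictionary.
REQUIRED_FIELDS = (
    "submission_id",
    "contributor_id",
    "language",
    "audio_url",
    "duration_seconds",
    "text_content",
    "points_awarded",
    "created_at",
)


def missing_required(record: dict) -> list:
    """Return the names of required fields that are absent or None."""
    return [name for name in REQUIRED_FIELDS if record.get(name) is None]


def parse_created_at(value: str) -> datetime:
    """Parse a timestamp such as 2025-03-15T10:30:00Z into an aware datetime."""
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11,
    # so it is rewritten as an explicit UTC offset for older versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))
```

Checking with `is None` rather than truthiness keeps legitimate falsy values, such as 0 points awarded, from being reported as missing.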