CorpusWeave

Data Dictionary

Complete reference guide for the CorpusWeave speech dataset structure, fields, languages, and quality standards.

submission_id (Required)

Unique identifier for each audio submission

UUID

Example:

550e8400-e29b-41d4-a716-446655440000

contributor_id (Required)

Unique identifier for the contributor

UUID

Example:

660e8400-e29b-41d4-a716-446655440000
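
Both identifier fields are typed as UUID. As a minimal sketch, assuming identifiers arrive as strings in the hyphenated form shown in the examples, they could be checked with Python's standard uuid module. The helper name is_valid_uuid is hypothetical and not part of CorpusWeave.

```python
import uuid


def is_valid_uuid(value: str) -> bool:
    """Return True if value is a UUID in canonical hyphenated form."""
    try:
        parsed = uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    # uuid.UUID also accepts un-hyphenated hex and braces, so compare
    # against the canonical string form (assumption: only that form is wanted).
    return str(parsed) == value.lower()


print(is_valid_uuid("550e8400-e29b-41d4-a716-446655440000"))
print(is_valid_uuid("submission_12345"))
```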

language (Required)

Language of the audio sample

String (language code)

Example:

lg (Luganda), lsg (Lusoga), lms (Lumasaba), ach (Acholi), run (Runyankore), ate (Ateso), lug (Lugbara), en (English)
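
The codes above can be kept in a small lookup table. This is only a sketch: it assumes the eight codes in the example are the complete set, which this dictionary does not state, and the names LANGUAGE_NAMES and language_name are hypothetical.

```python
# Codes and names copied from the example list for the language field.
LANGUAGE_NAMES = {
    "lg": "Luganda",
    "lsg": "Lusoga",
    "lms": "Lumasaba",
    "ach": "Acholi",
    "run": "Runyankore",
    "ate": "Ateso",
    "lug": "Lugbara",
    "en": "English",
}


def language_name(code: str) -> str:
    """Map a dataset language code to its display name."""
    try:
        return LANGUAGE_NAMES[code]
    except KeyError:
        raise ValueError(f"unknown language code: {code!r}") from None


print(language_name("ach"))
```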

audio_url (Required)

URL to the recorded audio file

URL

Example:

https://corpus.example.com/audio/submission_12345.wav

duration_seconds (Required)

Duration of audio in seconds

Integer

Example:

15, 30, 45

text_content (Required)

The sentence or text that was recorded

String

Example:

The quick brown fox jumps over the lazy dog

quality_score

Average quality rating from validators (0-5 scale)

Float (0-5)

Example:

4.5, 3.2, 5.0

validation_count

Number of validators who reviewed this submission

Integer

Example:

5, 12, 8
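
The quality_score and validation_count fields are related: one is an average over validator ratings and the other is how many ratings there were. The sketch below assumes an unweighted arithmetic mean and that a submission with no ratings has no score; neither detail is specified here, and summarize_ratings is a hypothetical helper.

```python
from statistics import fmean
from typing import Optional, Sequence, Tuple


def summarize_ratings(ratings: Sequence[float]) -> Tuple[Optional[float], int]:
    """Return (quality_score, validation_count) for a list of 0-5 ratings."""
    for rating in ratings:
        if not 0 <= rating <= 5:
            raise ValueError(f"rating outside the 0-5 scale: {rating}")
    if not ratings:
        # Assumption: an unreviewed submission has no quality_score yet.
        return None, 0
    return fmean(ratings), len(ratings)


score, count = summarize_ratings([4, 5, 4.5])
print(score, count)
```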

is_approved

Whether the submission has been approved for the dataset

Boolean

Example:

true, false

points_awarded (Required)

Points awarded to the contributor for this submission

Integer

Example:

50, 125, 200

metadata

Additional metadata, such as microphone type and background noise level

JSON Object

Example:

{"microphone": "built-in", "noise_level": "low", "accent": "neutral"}
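
Because metadata is optional and free-form, a consumer may want to read it defensively. A minimal sketch, assuming the field is delivered as a JSON string and that its keys are not guaranteed to be present; parse_metadata is a hypothetical helper.

```python
import json
from typing import Any, Dict, Optional


def parse_metadata(raw: Optional[str]) -> Dict[str, Any]:
    """Decode the metadata field, treating a missing value as empty."""
    if not raw:
        return {}
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("metadata must be a JSON object")
    return data


raw = '{"microphone": "built-in", "noise_level": "low", "accent": "neutral"}'
meta = parse_metadata(raw)
# Use .get so a record without a given key does not raise.
print(meta.get("microphone"), meta.get("noise_level", "unknown"))
```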

created_at (Required)

When the submission was created

ISO 8601 Timestamp

Example:

2025-03-15T10:30:00Z
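
Taken together, the fields above describe one record per submission. The sketch below models that record as a Python dataclass and parses created_at; it assumes records arrive as plain dictionaries keyed by field name and that optional fields may be absent. The names Submission, REQUIRED_FIELDS, parse_timestamp, and submission_from_dict are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, Optional

REQUIRED_FIELDS = (
    "submission_id", "contributor_id", "language", "audio_url",
    "duration_seconds", "text_content", "points_awarded", "created_at",
)


def parse_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp such as 2025-03-15T10:30:00Z."""
    # datetime.fromisoformat accepts a trailing "Z" only from Python 3.11,
    # so rewrite it as an explicit UTC offset for older versions.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


@dataclass
class Submission:
    submission_id: str
    contributor_id: str
    language: str
    audio_url: str
    duration_seconds: int
    text_content: str
    points_awarded: int
    created_at: datetime
    # Fields not marked Required default to "absent".
    quality_score: Optional[float] = None
    validation_count: Optional[int] = None
    is_approved: Optional[bool] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def submission_from_dict(row: Dict[str, Any]) -> Submission:
    """Build a Submission, rejecting rows that lack a required field."""
    missing = [name for name in REQUIRED_FIELDS if row.get(name) is None]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    values = dict(row)
    values["created_at"] = parse_timestamp(row["created_at"])
    # Unknown keys raise TypeError here, which surfaces schema drift early.
    return Submission(**values)


row = {
    "submission_id": "550e8400-e29b-41d4-a716-446655440000",
    "contributor_id": "660e8400-e29b-41d4-a716-446655440000",
    "language": "lg",
    "audio_url": "https://corpus.example.com/audio/submission_12345.wav",
    "duration_seconds": 15,
    "text_content": "The quick brown fox jumps over the lazy dog",
    "points_awarded": 50,
    "created_at": "2025-03-15T10:30:00Z",
}
submission = submission_from_dict(row)
print(submission.created_at.isoformat())
```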
