We have published a speech detection dataset with more than 150 thousand hours of audio in 6,000+ languages

A very large dataset for speech detection (voice activity detection) is now publicly available.

The dataset contains about 150 thousand hours of audio in more than 6,000 languages. The number of unique ISO codes in the dataset does not match the actual number of languages, since closely related languages can share the same code.

The data was labeled for the voice detection task with a time resolution of approximately 30 milliseconds (512 samples at a sampling rate of 16 kilohertz).
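As a quick check of the frame arithmetic, here is a minimal sketch that computes the frame duration and maps a timestamp in seconds to a frame index. The function names are hypothetical, and whether the published timings are aligned to frame boundaries is an assumption, not something stated here.

```python
SAMPLE_RATE = 16_000   # Hz, as stated in the text
FRAME_SAMPLES = 512    # samples per labeled frame, as stated in the text


def frame_duration_ms(frame_samples=FRAME_SAMPLES, sample_rate=SAMPLE_RATE):
    """Duration of one labeled frame in milliseconds."""
    return 1000.0 * frame_samples / sample_rate


def second_to_frame(t_seconds, frame_samples=FRAME_SAMPLES, sample_rate=SAMPLE_RATE):
    """Index of the frame that contains the given timestamp (hypothetical helper)."""
    sample_index = round(t_seconds * sample_rate)  # round to avoid float truncation
    return sample_index // frame_samples


print(frame_duration_ms())   # 32.0, i.e. "approximately 30 milliseconds"
print(second_to_frame(1.0))  # 31, since 16000 // 512 == 31
```

So the exact step is 32 milliseconds, which the text rounds to about 30.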

This dataset is distributed under the CC BY-NC-SA 4.0 license.

Details

The dataset is built from the following source datasets:

Name	Number of hours	Number of languages	Link	License
Bible.is	53,138	1,596	URL	Unique
globalrecordings.net	9,743	6,171	URL	CC BY-NC-SA
VoxLingua107	6,628	107	URL	CC BY
Common Voice	30,329	120	URL	CC0
MLS	50,709	8	URL	CC BY
Total	150,547	6,171+

The dataset is provided as .feather files containing annotations for open audio datasets, together with a short description of each dataset and loading examples. The .feather files can be opened with the pandas library:

import pandas as pd
dataframe = pd.read_feather(PATH_TO_FEATHER_FILE)

Each .feather annotation file contains the following columns:

speech_timings – the annotation for this audio. This is a list of dictionaries of the form {'start': START_SECOND, 'end': END_SECOND}, where START_SECOND and END_SECOND are the start and end times of speech in seconds. The number of dictionaries equals the number of speech segments found in the audio;
language – ISO language code for this audio.
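To illustrate how these two columns can be used together, here is a minimal sketch that sums the speech duration per language. The helper names and the toy rows (including the codes "eng" and "deu") are illustrative assumptions. With a real file, the rows could be built as zip(dataframe["language"], dataframe["speech_timings"]).

```python
from collections import defaultdict


def total_speech_seconds(speech_timings):
    """Sum the durations of all speech segments in one recording."""
    return float(sum(seg["end"] - seg["start"] for seg in speech_timings))


def speech_seconds_by_language(rows):
    """Aggregate speech time over (language, speech_timings) pairs."""
    totals = defaultdict(float)
    for language, timings in rows:
        totals[language] += total_speech_seconds(timings)
    return dict(totals)


# Toy rows standing in for the two columns of a loaded .feather file
rows = [
    ("eng", [{"start": 0.5, "end": 2.0}, {"start": 3.0, "end": 4.5}]),
    ("deu", [{"start": 0.0, "end": 1.0}]),
    ("eng", []),  # a recording with no detected speech
]

totals = speech_seconds_by_language(rows)
print(totals)
```

A recording with an empty speech_timings list contributes zero seconds, so it does not need special handling.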

All other details are available at the link.

License

The CC BY-NC-SA 4.0 license was chosen out of necessity, because one of the most interesting datasets, globalrecordings.net, is published under this “viral” license, which requires derivative works to be distributed under the same license.

The interpretation of the Bible.is license is not entirely clear. If we are asked to remove this part of the dataset, we will have to do so.

Citation and affiliations

The dataset was created with the support of the Innovation Promotion Fund within the framework of the federal project “Artificial Intelligence” of the national program “Digital Economy of the Russian Federation”.

You can cite the dataset as follows:

@misc{SileroVADDataset,
  author = {Silero Team},
  title = {Silero-VAD Dataset: a large public Internet-scale dataset for voice activity detection for 6000+ languages},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad/datasets/README.md}},
  email = {hello@silero.ai}
}