We have published a speech detection dataset of more than 150 thousand hours in 6,000+ languages
A giant dataset for speech detection (voice activity detection) has been made publicly available.
The dataset contains about 150 thousand hours of audio in more than 6,000 languages. The number of unique ISO codes in the dataset does not match the actual number of languages, since similar languages can share the same code.
The data is labeled for the voice detection task with a time resolution of approximately 30 milliseconds (512 samples at a sampling rate of 16 kilohertz).
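To make the frame arithmetic concrete, here is a minimal sketch. It uses only the two figures stated above (512 samples per frame, 16 kHz); the helper name `frames_in_clip` is hypothetical and is not part of the dataset or of any library.

```python
SAMPLE_RATE = 16_000   # Hz, as stated in the text
FRAME_SAMPLES = 512    # samples per labeled frame, as stated in the text

# Duration of one labeled frame: 512 / 16000 = 0.032 s, i.e. 32 ms,
# which is the "approximately 30 milliseconds" quoted above.
frame_seconds = FRAME_SAMPLES / SAMPLE_RATE


def frames_in_clip(duration_seconds: float) -> int:
    """Number of whole 512-sample frames that fit in a clip (hypothetical helper)."""
    return int(duration_seconds * SAMPLE_RATE) // FRAME_SAMPLES


print(f"{frame_seconds * 1000:.0f} ms per frame")
print(frames_in_clip(10.0), "frames in a 10-second clip")
```

A trailing partial frame is dropped here; whether the original labeling pads or drops it is not stated in the post.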
This dataset is distributed under the CC BY-NC-SA 4.0 license.
Details
The dataset contains the following data sets in the following languages:
The dataset is provided as .feather files containing labeled open audio datasets, along with a short description of each dataset and loading examples. The .feather files can be opened with the pandas library:
import pandas as pd
dataframe = pd.read_feather(PATH_TO_FEATHER_FILE)
Each .feather annotation file contains the following columns:
speech_timings – the annotation for the audio: a list of dictionaries of the form {'start': START_SECOND, 'end': END_SECOND}, where START_SECOND and END_SECOND are the start and end times of speech in seconds. The number of dictionaries equals the number of speech segments found in the audio;
language – the ISO language code for the audio.
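As an illustration of how these two columns might be used, the sketch below sums the speech duration per language and converts the segment list into per-frame 0/1 labels. Several things here are assumptions rather than facts about the dataset: the toy DataFrame stands in for the result of `pd.read_feather`, the language codes and timings are made up, the clip duration is passed in by hand, and marking a frame as speech when its centre falls inside a segment is just one possible convention. The helper names `speech_seconds` and `frame_labels` are hypothetical.

```python
import pandas as pd

SAMPLE_RATE = 16_000   # Hz, from the dataset description
FRAME_SAMPLES = 512    # samples per frame, from the dataset description


def speech_seconds(timings) -> float:
    """Total speech duration for one row's list of {'start', 'end'} dicts."""
    return sum(t["end"] - t["start"] for t in timings)


def frame_labels(timings, duration_seconds: float) -> list[int]:
    """Per-frame 0/1 labels. Assumption: a frame counts as speech
    when its centre lies inside any annotated segment."""
    frame_len = FRAME_SAMPLES / SAMPLE_RATE
    n_frames = int(duration_seconds * SAMPLE_RATE) // FRAME_SAMPLES
    labels = []
    for i in range(n_frames):
        center = (i + 0.5) * frame_len
        is_speech = any(t["start"] <= center < t["end"] for t in timings)
        labels.append(1 if is_speech else 0)
    return labels


# Toy data with the two documented columns (values are illustrative only).
df = pd.DataFrame({
    "speech_timings": [
        [{"start": 0.5, "end": 1.5}, {"start": 2.0, "end": 2.25}],
        [],
    ],
    "language": ["en", "ru"],
})

df["speech_seconds"] = df["speech_timings"].apply(speech_seconds)
per_language = df.groupby("language")["speech_seconds"].sum()
print(per_language)

labels = frame_labels([{"start": 0.0, "end": 0.5}], duration_seconds=1.0)
print(sum(labels), "speech frames out of", len(labels))
```

On real files, the same functions would be applied to the DataFrame returned by `pd.read_feather`; the list column may come back as an array of dictionaries rather than a Python list, which the iteration above handles either way.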
All other details can be found at the link.
License
The choice of the CC BY-NC-SA 4.0 license was unavoidable, because one of the most interesting source datasets, globalrecordings.net, is published under this “viral” license, which requires derivative works to be released under the same license.
There is some ambiguity in how the Bible.is license should be interpreted; if we are asked to remove that part of the dataset, we will have to do so.
Citation and affiliations
The dataset was created with the support of the Innovation Promotion Fund within the framework of the federal project “Artificial Intelligence” of the national program “Digital Economy of the Russian Federation”.
You can cite the dataset as follows:
@misc{Silero_VAD_Dataset,
author = {Silero Team},
title = {Silero-VAD Dataset: a large public Internet-scale dataset for voice activity detection for 6000+ languages},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/snakers4/silero-vad/datasets/README.md}},
email = {hello@silero.ai}
}