We published a speech detection dataset with more than 150 thousand hours of audio in 6,000+ languages

We have made a giant dataset for speech detection (voice activity detection) publicly available.

The dataset contains about 150 thousand hours of audio in more than 6,000 languages. The number of unique ISO codes in the dataset does not coincide with the actual number of languages, since similar languages can be encoded with the same code.

The data was labeled for the task of voice detection with a time resolution of approximately 30 milliseconds (512 samples at a sampling rate of 16 kilohertz).
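The relation between the window size in samples and its duration can be checked with a couple of lines of arithmetic. This is only an illustrative sketch; the constant names are our own and are not part of the dataset.

```python
# Values taken from the text above; the variable names are illustrative.
SAMPLE_RATE_HZ = 16_000   # 16 kHz sampling rate
WINDOW_SAMPLES = 512      # labeling window, in samples

# Duration of one window in milliseconds.
window_ms = WINDOW_SAMPLES * 1000 / SAMPLE_RATE_HZ

# How many such windows fit into one second of audio.
windows_per_second = SAMPLE_RATE_HZ / WINDOW_SAMPLES

print(window_ms)           # 32.0
print(windows_per_second)  # 31.25
```

So "approximately 30 milliseconds" corresponds to exactly 32 ms per window at 16 kHz.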

This dataset is distributed under the CC BY-NC-SA 4.0 license.

Details

The dataset combines the following source datasets:

| Name | Number of hours | Number of languages | Link | License |
|---|---|---|---|---|
| Bible.is | 53,138 | 1,596 | URL | Unique |
| globalrecordings.net | 9,743 | 6,171 | URL | CC BY-NC-SA |
| VoxLingua107 | 6,628 | 107 | URL | CC BY |
| Common Voice | 30,329 | 120 | URL | CC0 |
| MLS | 50,709 | 8 | URL | CC BY |
| Total | 150,547 | 6,171+ | | |

The dataset is presented as .feather files containing the markup for the open audio datasets, together with a short description of each dataset and loading examples. The .feather files can be opened using the pandas library:

import pandas as pd
dataframe = pd.read_feather(PATH_TO_FEATHER_FILE)

Each .feather markup file contains the following columns:

  • speech_timings – the markup for this audio. This is a list of dictionaries of the form {'start': START_SECOND, 'end': END_SECOND}, where START_SECOND and END_SECOND are the start and end times of speech in seconds. The number of dictionaries equals the number of speech segments found in the audio;

  • language – ISO language code for this audio.
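To make the column layout more concrete, here is a minimal sketch of how the markup could be used. It works on a small hand-made DataFrame rather than a real .feather file. The rows, the helper names `speech_seconds` and `frame_labels`, the `num_samples` argument (the audio length is not one of the columns described above), and the rule of labeling a window by its centre sample are all assumptions made for illustration, not the authors' method.

```python
import pandas as pd

SAMPLE_RATE_HZ = 16_000  # sampling rate mentioned in the article
WINDOW_SAMPLES = 512     # labeling window mentioned in the article


def speech_seconds(speech_timings):
    """Total duration of speech in one audio, in seconds."""
    return sum(seg["end"] - seg["start"] for seg in speech_timings)


def frame_labels(speech_timings, num_samples):
    """Hypothetical helper: 1 for every 512-sample window whose centre
    sample lies inside a speech segment, 0 otherwise."""
    n_windows = -(-num_samples // WINDOW_SAMPLES)  # ceiling division
    labels = [0] * n_windows
    for seg in speech_timings:
        start = round(seg["start"] * SAMPLE_RATE_HZ)
        end = round(seg["end"] * SAMPLE_RATE_HZ)
        for i in range(n_windows):
            centre = i * WINDOW_SAMPLES + WINDOW_SAMPLES // 2
            if start <= centre < end:
                labels[i] = 1
    return labels


# Toy rows in the documented format (made up, not taken from the dataset).
df = pd.DataFrame({
    "speech_timings": [
        [{"start": 0.5, "end": 1.5}, {"start": 2.0, "end": 2.75}],
        [{"start": 0.0, "end": 4.0}],
    ],
    "language": ["eng", "deu"],
})

# Seconds of speech per audio, then summed per language code.
df["speech_seconds"] = df["speech_timings"].apply(speech_seconds)
per_language = df.groupby("language")["speech_seconds"].sum()

# Window-level labels for the first audio, assuming it is 3 seconds long.
labels = frame_labels(df.loc[0, "speech_timings"], num_samples=3 * SAMPLE_RATE_HZ)
print(per_language)
print(len(labels), sum(labels))
```

With real data, `df` would come from `pd.read_feather` as shown earlier, and the audio length would have to be taken from the audio file itself.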

All other details can be found at the link.

License

The CC BY-NC-SA 4.0 license was the only possible choice, because one of the most interesting datasets, globalrecordings.net, is published under this "viral" license, which requires derivative works to be distributed under the same license.

The interpretation of the Bible.is license is somewhat unclear; if we are asked to remove this part of the dataset, we will have to do so.

Citation and affiliations

The dataset was created with the support of the Innovation Promotion Fund within the framework of the federal project “Artificial Intelligence” of the national program “Digital Economy of the Russian Federation”.

You can cite the dataset as follows:

@misc{Silero_VAD_Dataset,
  author = {Silero Team},
  title = {Silero-VAD Dataset: a large public Internet-scale dataset for voice activity detection for 6000+ languages},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snakers4/silero-vad/datasets/README.md}},
  email = {hello@silero.ai}
}
