Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages
Code for the paper titled "Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages"
File organization
- Preprocessing : contains all files used to preprocess the data (Python 3.6)
- Data : contains data required to run this code
- Statistics : contains all files that contains statistics of the dataset
Dataset
file name |
discription |
train/test/dev.csv |
This is the dataset for code-mixed Speech Translation. |
chopped_audios |
This contains all the audios, transcription and translation. |
Statistics of Corpora contained
Languages |
#types |
#tokens |
Types per line |
Tokens per line |
Avg. token length |
English[100%] |
40,324 |
601889 |
10.58 |
11.27 |
4.92 |
French (France) |
50510 |
645651 |
11.38 |
12.09 |
5.08 |
German[100%] |
50748 |
584575 |
10.44 |
10.95 |
5.57 |
Gujarati[100%] |
41959 |
584989 |
10.37 |
10.95 |
4.46 |
Hindi[100%] |
29744 |
716800 |
12.36 |
13.42 |
3.74 |
Hungarian[100%] |
84872 |
506608 |
9.13 |
9.49 |
5.89 |
Indonesian[100%] |
39365 |
653374 |
11.54 |
12.23 |
6.14 |
Italian[100%] |
52372 |
512061 |
9.23 |
9.59 |
5.37 |
Latvian[100%] |
70040 |
477106 |
8.69 |
8.93 |
5.72 |
Lithuanian[100%] |
75222 |
491558 |
8.92 |
9.2 |
6.04 |
Nepali[100%] |
52630 |
570268 |
10.03 |
10.68 |
4.88 |
Persian (Farsi)[100%] |
51722 |
598096 |
10.61 |
11.2 |
4.1 |
Polish[100%] |
71662 |
494263 |
8.99 |
9.25 |
5.86 |
Portuguese (Brazil)[100%] |
50087 |
608432 |
10.8 |
11.39 |
5.12 |
Russian[100%] |
72162 |
490908 |
8.96 |
9.19 |
5.79 |
Slovak[100%] |
73789 |
520465 |
9.39 |
9.75 |
5.37 |
Slovenian[100%] |
68619 |
516649 |
9.35 |
9.67 |
5.3 |
Spanish[100%] |
49806 |
608868 |
10.75 |
11.4 |
5.07 |
Swedish[100%] |
48233 |
581751 |
10.31 |
10.89 |
5 |
Tamil[100%] |
84183 |
460678 |
8.37 |
8.63 |
7.65 |
Telugu[100%] |
72006 |
464665 |
8.34 |
8.7 |
6.56 |
Turkish[100%] |
78957 |
453521 |
8.27 |
8.49 |
6.35 |
Bulgarian[100%] |
60712 |
564150 |
10.1 |
10.56 |
5.24 |
Croatian[100%] |
73075 |
531326 |
9.58 |
9.95 |
5.28 |
Danish[100%] |
50170 |
587253 |
10.4 |
11 |
4.98 |
Dutch[100%] |
42716 |
595464 |
10.52 |
11.15 |
5.05 |
Code-mixing
All languages in Code-mixing
Language |
Total Words |
Unique Words |
Percentage |
English |
500136 |
6312 |
83.6 |
Bengali |
46933 |
3907 |
7.84 |
Sanskrit |
51246 |
7202 |
8.56 |
Total |
598315 |
17421 |
100 |
Types of Code-mixing
|
English-Sanskrit |
Sanskrit-English |
English-Bengali |
Bengali-English |
Inter-Sentential |
2356 |
2366 |
339 |
339 |
Intra-Sentential |
2338 |
851 |
124 |
0 |