Code for the paper titled "Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages"

Prabhupadavani: A Code-mixed Speech Translation Data for 25 languages

File organization

  • Preprocessing: scripts used to preprocess the data (Python 3.6)
  • Data: the data required to run this code
  • Statistics: files containing statistics of the dataset

Dataset

| File name | Description |
| --- | --- |
| train/test/dev.csv | The dataset for code-mixed speech translation. |
| chopped_audios | All audio files, transcriptions, and translations. |
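
A minimal sketch of reading one split is given here for orientation. It assumes each split is a comma-separated file named `train.csv`, `test.csv`, or `dev.csv` with a header row. The exact file layout and column names are not stated in this README, so `load_split` and its arguments are hypothetical.

```python
import csv
from pathlib import Path


def load_split(data_dir, split):
    """Read one split into a list of row dictionaries keyed by column name.

    Assumption: the split lives at <data_dir>/<split>.csv and has a header
    row. Column names are taken from that header, not hard-coded here.
    """
    if split not in ("train", "dev", "test"):
        raise ValueError(f"unknown split: {split!r}")
    path = Path(data_dir) / f"{split}.csv"
    # newline="" lets the csv module handle quoted fields containing newlines.
    with path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

For example, `load_split("Data", "train")` would return the training rows, provided the CSV files sit directly inside the Data directory.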

Statistics of Corpora contained

| Language | #Types | #Tokens | Types per line | Tokens per line | Avg. token length |
| --- | --- | --- | --- | --- | --- |
| English[100%] | 40324 | 601889 | 10.58 | 11.27 | 4.92 |
| French (France) | 50510 | 645651 | 11.38 | 12.09 | 5.08 |
| German[100%] | 50748 | 584575 | 10.44 | 10.95 | 5.57 |
| Gujarati[100%] | 41959 | 584989 | 10.37 | 10.95 | 4.46 |
| Hindi[100%] | 29744 | 716800 | 12.36 | 13.42 | 3.74 |
| Hungarian[100%] | 84872 | 506608 | 9.13 | 9.49 | 5.89 |
| Indonesian[100%] | 39365 | 653374 | 11.54 | 12.23 | 6.14 |
| Italian[100%] | 52372 | 512061 | 9.23 | 9.59 | 5.37 |
| Latvian[100%] | 70040 | 477106 | 8.69 | 8.93 | 5.72 |
| Lithuanian[100%] | 75222 | 491558 | 8.92 | 9.2 | 6.04 |
| Nepali[100%] | 52630 | 570268 | 10.03 | 10.68 | 4.88 |
| Persian (Farsi)[100%] | 51722 | 598096 | 10.61 | 11.2 | 4.1 |
| Polish[100%] | 71662 | 494263 | 8.99 | 9.25 | 5.86 |
| Portuguese (Brazil)[100%] | 50087 | 608432 | 10.8 | 11.39 | 5.12 |
| Russian[100%] | 72162 | 490908 | 8.96 | 9.19 | 5.79 |
| Slovak[100%] | 73789 | 520465 | 9.39 | 9.75 | 5.37 |
| Slovenian[100%] | 68619 | 516649 | 9.35 | 9.67 | 5.3 |
| Spanish[100%] | 49806 | 608868 | 10.75 | 11.4 | 5.07 |
| Swedish[100%] | 48233 | 581751 | 10.31 | 10.89 | 5 |
| Tamil[100%] | 84183 | 460678 | 8.37 | 8.63 | 7.65 |
| Telugu[100%] | 72006 | 464665 | 8.34 | 8.7 | 6.56 |
| Turkish[100%] | 78957 | 453521 | 8.27 | 8.49 | 6.35 |
| Bulgarian[100%] | 60712 | 564150 | 10.1 | 10.56 | 5.24 |
| Croatian[100%] | 73075 | 531326 | 9.58 | 9.95 | 5.28 |
| Danish[100%] | 50170 | 587253 | 10.4 | 11 | 4.98 |
| Dutch[100%] | 42716 | 595464 | 10.52 | 11.15 | 5.05 |
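
To make the column definitions concrete, the sketch below computes the same five quantities for a list of text lines. It assumes whitespace tokenization and case-sensitive types, and it reads "types per line" as the average number of distinct tokens in a line. The tokenizer actually used for the table is not stated here, so `corpus_statistics` is an illustrative helper and its output need not match the table exactly.

```python
def corpus_statistics(lines):
    """Compute types, tokens, per-line averages, and average token length.

    Assumption: tokens are whitespace-separated and compared case-sensitively.
    """
    vocabulary = set()          # distinct tokens across the whole corpus
    total_tokens = 0
    total_line_types = 0        # sum over lines of distinct tokens per line
    total_characters = 0
    for line in lines:
        tokens = line.split()
        vocabulary.update(tokens)
        total_tokens += len(tokens)
        total_line_types += len(set(tokens))
        total_characters += sum(len(token) for token in tokens)
    line_count = len(lines)
    return {
        "types": len(vocabulary),
        "tokens": total_tokens,
        "types_per_line": total_line_types / line_count if line_count else 0.0,
        "tokens_per_line": total_tokens / line_count if line_count else 0.0,
        "avg_token_length": total_characters / total_tokens if total_tokens else 0.0,
    }
```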

Code-mixing

All languages in Code-mixing

| Language | Total words | Unique words | Percentage |
| --- | --- | --- | --- |
| English | 500136 | 6312 | 83.6 |
| Bengali | 46933 | 3907 | 7.84 |
| Sanskrit | 51246 | 7202 | 8.56 |
| Total | 598315 | 17421 | 100 |
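
The Percentage column appears to be each language's share of the total word count. A short check using the word counts from the table follows; the variable names are illustrative only.

```python
# Total word counts per language, copied from the table above.
word_counts = {"English": 500136, "Bengali": 46933, "Sanskrit": 51246}

total = sum(word_counts.values())  # 598315, matching the Total row

# Share of all words contributed by each language, in percent.
shares = {
    language: 100 * count / total
    for language, count in word_counts.items()
}

for language, share in shares.items():
    print(f"{language}: {share:.2f}%")
```

The computed shares agree with the table to within rounding.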

Types of Code-mixing

| Type | English-Sanskrit | Sanskrit-English | English-Bengali | Bengali-English |
| --- | --- | --- | --- | --- |
| Inter-Sentential | 2356 | 2366 | 339 | 339 |
| Intra-Sentential | 2338 | 851 | 124 | 0 |
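
As a rough illustration of the two categories, the sketch below counts language switch points by ordered (from, to) pair. It uses the usual definitions: an inter-sentential switch happens at a sentence boundary, and an intra-sentential switch happens between adjacent tokens in one sentence. The input format (one list of per-token language tags per sentence) and the function `count_switches` are hypothetical. How the counts in the table were actually produced, including how direction was assigned, is not described here, so this is not the authors' procedure.

```python
from collections import Counter


def count_switches(tagged_sentences):
    """Count switch points by (from_language, to_language) pair.

    tagged_sentences: list of sentences, each a list of per-token language
    tags such as "English" or "Sanskrit" (hypothetical input format).
    Returns two Counters: (inter_sentential, intra_sentential).
    """
    inter = Counter()
    intra = Counter()
    previous_last = None  # language of the last token of the previous sentence
    for tags in tagged_sentences:
        if not tags:
            continue  # skip empty sentences
        # A switch across the sentence boundary is inter-sentential.
        if previous_last is not None and tags[0] != previous_last:
            inter[(previous_last, tags[0])] += 1
        # A switch between adjacent tokens in a sentence is intra-sentential.
        for left, right in zip(tags, tags[1:]):
            if left != right:
                intra[(left, right)] += 1
        previous_last = tags[-1]
    return inter, intra
```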

Owner
Ayush Daksh
IIT Kharagpur | Mathematics & Computing | 3rd Year | NLP | UG Researcher