To create a deep learning model which can explain the content of an image in the form of speech through caption generation with attention mechanism on Flickr8K dataset.

Ragesh Hajela

Last update: Feb 8, 2022

Related tags

Text Data & NLP Eye_for_the_blind

Overview

Eye for the blind

To create a deep learning model which can explain the content of an image in the form of speech through caption generation with attention mechanism on Flickr8K dataset. This kind of model is a use-case for blind people so that they can understand any image with the help of speech. The caption generated through a CNN-RNN model will be converted to speech using a text to speech library.

This problem statement is an application of both deep learning and natural language processing. The features of an image will be extracted by CNN-based encoder and this will be decoded by an RNN model.

The project is an extended application of Show, Attend and Tell: Neural Image Caption Generation with Visual Attention paper. https://arxiv.org/abs/1502.03044

The dataset is taken from the Kaggle website and it consists of sentence-based image description having a list of 8,000 images that are each paired with five different captions which provide clear descriptions of the salient entities and events of the image.

Project Pipeline

The project pipeline can be briefly summarized in the following four steps:

Data Understanding: Here, you need to load the data and understand the representation.
Data preprocessing: In this step, you will process both images and captions to the desired format.
Train/Test Split: Combine both images and captions to create the train and test dataset.
Model-Building: This is the stage where you will create your image captioning model by building Encoder , Attention and Decoder model.
Model Evaluation: Evaluate the models using greedy search and BLEU score.

Speech Recognition for Uyghur using Speech transformer

Speech Recognition for Uyghur using Speech transformer Training: this model using CTC loss and Cross Entropy loss for training. Download pretrained mo

11 Nov 17, 2022

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

3.2k Dec 31, 2022

PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

1k Dec 30, 2022

Simple Speech to Text, Text to Speech

Simple Speech to Text, Text to Speech 1. Download Repository Opsi 1 Download repository ini, extract di lokasi yang diinginkan Opsi 2 Jika sudah famil

5 Dec 28, 2021

A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Nav Module The solution for voice related stuff in Python Nav is a Python module which simplifies voice related stuff in Python. Just import the Modul

1 Dec 20, 2021

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

MTFAA-Net Unofficial PyTorch implementation of Baidu's MTFAA-Net: "Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speec

87 Dec 19, 2022

TTS is a library for advanced Text-to-Speech generation.

TTS is a library for advanced Text-to-Speech generation. It's built on the latest research, was designed to achieve the best trade-off among ease-of-training, speed and quality. TTS comes with pretrained models, tools for measuring dataset quality and already used in 20+ languages for products and research projects.

6.5k Jan 8, 2023

PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

StyleSpeech - PyTorch Implementation PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation. Status (2021.06.09

142 Jan 6, 2023

A 10000+ hours dataset for Chinese speech recognition

309 Dec 16, 2022

To create a deep learning model which can explain the content of an image in the form of speech through caption generation with attention mechanism on Flickr8K dataset.

Related tags

Overview

Eye for the blind

Project Pipeline

You might also like...

Speech Recognition for Uyghur using Speech transformer

Silero Models: pre-trained speech-to-text, text-to-speech models and benchmarks made embarrassingly simple

PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

Simple Speech to Text, Text to Speech

A Python module made to simplify the usage of Text To Speech and Speech Recognition.

Multi-Scale Temporal Frequency Convolutional Network With Axial Attention for Speech Enhancement

TTS is a library for advanced Text-to-Speech generation.

PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

A 10000+ hours dataset for Chinese speech recognition

Owner

Ragesh Hajela

Text to speech is a process to convert any text into voice. Text to speech project takes words on digital devices and convert them into audio. Here I have used Google-text-to-speech library popularly known as gTTS library to convert text file to .mp3 file. Hope you like my project!

The following links explain a bit the idea of semantic search and how search mechanisms work by doing retrieve and rerank

Generate vector graphics from a textual caption

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Samantha, A covid-19 information bot which will provide basic information about this pandemic in form of conversation.

DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022

Code for ACL 2022 main conference paper "STEMM: Self-learning with Speech-text Manifold Mixup for Speech Translation".

Explore different way to mix speech model(wav2vec2, hubert) and nlp model(BART,T5,GPT) together

An IVR Chatbot which can exponentially reduce the burden of companies as well as can improve the consumer/end user experience.

Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding