Open Crawl Vietnamese Text

QAI Research

Last update: Jan 5, 2022

This repo contains Vietnamese text crawled from multiple sources.

This is a list of high-quality, topic-centric public data sources. We collected and cleaned them from multiple sources. All of the datasets listed below are free.

Here are the ways we clean the data:

Removal of emojis
Removal of emoticons
Removal of URLs
Removal of HTML tags
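
The four cleaning steps above could be sketched in Python roughly as follows. This is a minimal illustration, not the code actually used to build these datasets: the regular expressions, the order of the steps, and the `clean_text` name are all assumptions, and the emoji and emoticon patterns cover only common cases.

```python
import re

# All patterns below are assumptions for illustration; the repo does not
# document its exact cleaning rules.

# HTML tags: "<" followed by an optional "/" and a tag name, up to ">".
HTML_TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")

# URLs starting with http://, https:// or www.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")

# Common emoji code point blocks (not the full Unicode emoji set).
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, faces, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\U0000FE0F\U0000200D"   # variation selector-16 and zero-width joiner
    "]+"
)

# A small, non-exhaustive set of standalone ASCII emoticons such as :) ;-( =D
EMOTICON_RE = re.compile(r"(?<!\S)[:;=]-?[)(DPp\]\[/\\|](?!\S)")


def clean_text(text: str) -> str:
    """Strip HTML tags, URLs, emojis and emoticons, then collapse whitespace."""
    # Tags go first so that URLs inside attributes disappear with the tag.
    for pattern in (HTML_TAG_RE, URL_RE, EMOJI_RE, EMOTICON_RE):
        text = pattern.sub(" ", text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return " ".join(text.split())
```

For example, `clean_text("Xin chào :) <b>Việt Nam</b> 😀 xem https://example.com/a?b=1 nhé")` returns `"Xin chào Việt Nam xem nhé"`. Vietnamese diacritics are untouched because they fall outside the emoji ranges used here.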

1. Binhvq News Corpus:

The Binhvq News Corpus was crawled from online news sites and contains 50 GB of text.

link_raw, link_clean

2. OSCAR corpus, Vietnamese crawl:

OSCAR (Open Super-large Crawled Aggregated coRpus) is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the goclassy architecture. The Vietnamese portion of OSCAR contains about 32 GB of text after duplicates are discarded.

link_raw, link_clean

3. Vietnamese story dataset:

Short and long stories crawled from the internet by QAI, 10 GB of text in total.

link_clean

4. Vietnamese poem dataset:

More than 1 million sentences collected from the internet by QAI.

link_clean
