A universal package of scraper scripts for humans

Overview

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Sponsors
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides access to a variety of scraper scripts for the most commonly used machine learning and data science domains. Scrapera scrapes directly and asynchronously from public API endpoints, removing the heavy browser overhead and making it extremely fast and robust to DOM changes. Currently, Scrapera supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous
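
The endpoint-based, asynchronous approach can be pictured with a short, generic sketch (this uses aiohttp and a placeholder URL for illustration; it is not Scrapera's internal code):

import asyncio
import aiohttp

async def fetch_json(session, url):
    # Request a public JSON endpoint directly instead of rendering the page in a browser
    async with session.get(url) as response:
        return await response.json()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        # Issue all requests concurrently and collect the responses
        return await asyncio.gather(*(fetch_json(session, url) for url in urls))

# Placeholder endpoint for illustration only
results = asyncio.run(fetch_all(['https://example.com/api/items?page=1']))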

  • The main aim of this package is to cluster common scraping tasks so that ML researchers and engineers can focus on their models rather than on the data collection process.

    DISCLAIMER: Neither the owner nor the contributors take any responsibility for misuse of data obtained through Scrapera. Contact the owner if any module provided by Scrapera violates copyright terms.

    Getting Started

    Prerequisites

    Prerequisites can be installed separately through the requirements.txt file as shown below:

    pip install -r requirements.txt

    Installation

    Scrapera is built with Python 3 and can be installed directly with pip:

    pip install scrapera

    Alternatively, to install the latest version directly from GitHub, run:

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git

    Usage

    To use any sub-module, you just need to import it, instantiate the scraper, and call its scrape method:

    from scrapera.video.vimeo import VimeoScraper

    # Instantiate the Vimeo scraper and download the video at 540p quality
    scraper = VimeoScraper()
    scraper.scrape('https://vimeo.com/191955190', '540p')
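
    For example, to download several videos in one run you can reuse the same scraper instance (a minimal sketch built only on the call shown above; the second URL is a placeholder):

    from scrapera.video.vimeo import VimeoScraper

    scraper = VimeoScraper()
    # The second URL is for illustration only
    for url in ['https://vimeo.com/191955190', 'https://vimeo.com/76979871']:
        scraper.scrape(url, '540p')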

    For more examples, please refer to the individual test folders in the respective modules.

    Contributing

    Scrapera welcomes any and all contributions and scraper requests. Please raise an issue if a scraper fails in any instance. Feel free to fork the repository and add your own scrapers to help the community!
    For more guidelines, refer to CONTRIBUTING

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors


    Contact

    Feel free to reach out for any issues or requests related to Scrapera.

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements

    Comments
    • NSE stock price scraper

      I would like to contribute an NSE scraper that will scrape the following:

      1. Nifty50 index value
      2. Last traded price of a particular stock
      3. All nifty50 stock prices
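
      A minimal sketch of how such a scraper might fetch a last traded price (the endpoint URL and response fields below are placeholders, not NSE's actual API):

      import requests

      def last_traded_price(symbol):
          # Hypothetical quote endpoint; the real NSE API differs and may require session cookies
          url = f'https://example.com/api/quote?symbol={symbol}'
          data = requests.get(url, timeout=10).json()
          return data['lastPrice']

      print(last_traded_price('RELIANCE'))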

      P.S. This repository is a great initiative ✨

      opened by pratik-choudhari 7
    • Fixed Code Quality Issues

      Description

      Summary:

      • Use is to compare the types of objects
      • Remove unnecessary f-strings
      • Remove unnecessary generators
      • Remove multiple import names
      • Add .deepsource.toml
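
      For context, the kinds of fixes listed above look roughly like this (illustrative snippets, not the actual diff):

      value = {}
      items = [{'name': 'vimeo'}, {'name': 'reddit'}]

      # Use is when comparing type objects (instead of: type(value) == dict)
      if type(value) is dict:
          print('value is a dict')

      # Drop the f prefix from a string with no placeholders (instead of: f'Download complete')
      message = 'Download complete'

      # Use a comprehension instead of an unnecessary generator (instead of: list(item['name'] for item in items))
      names = [item['name'] for item in items]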

      I ran a DeepSource Analysis on my fork of this repository. You can see all the issues raised by DeepSource here.

      DeepSource helps you automatically find and fix issues in your code during code reviews. This tool looks for anti-patterns, bug risks, and performance problems, and raises issues. There are plenty of other issues related to Bug Discovery and Anti-Patterns which you may be interested in taking a look at.

      If you do not want to use DeepSource to continuously analyze this repo, I'll remove the .deepsource.toml from this PR and you can merge the rest of the fixes. If you want to set up DeepSource for continuous analysis, I can help you set that up.

      opened by HarshCasper 4
    • Potential Bug Risks and Anti-Patterns

      Description

      Hi @DarshanDeshpande 👋

      I ran DeepSource Static Code Analysis upon the Project, the results for which are available here.

      The Static Code Analysis Tool found potential bugs and anti-patterns in the code that can become detrimental to the project later on. DeepSource helps you automatically find and fix issues in your code during code reviews. This tool looks for anti-patterns, bug risks, and performance problems, and raises issues.

      Some of the notable issues are:

      • Missing Argument in Function Call (here)
      • Unnecessary Generator (here)
      • f-string used without any expression (here)
      • Detected subprocess popen call with shell equals True (here)
      • Bad Except Order (here)
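
      Two of these issues are easy to illustrate with generic snippets (not the repository's actual code):

      import subprocess

      # Except order: handle the more specific exception before the general one
      try:
          value = int('not-a-number')
      except ValueError:    # specific handler first
          value = 0
      except Exception:     # general handler last, otherwise it would shadow ValueError
          value = None

      # Avoid shell=True; pass the command as a list so no shell injection is possible
      subprocess.run(['python', '--version'], check=True)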

      There are plenty of other issues related to Bug Discovery and Anti-Patterns which you may be interested in taking a look at.

      If you would like to integrate DeepSource to autofix some of the commonly occurring issues, I can help you set that up :)

      opened by HarshCasper 4
    • Reddit posts scraper

      I would like to contribute a program to scrape Reddit posts returned when a specific topic is searched. The following information will be recorded:

      • number of upvotes
      • number of comments
      • title
      • author
      • link
      • subreddit name
      • isSponsored flag

      The program will:

      • make use of Reddit endpoints
      • support explicit proxies
      • allow capping the maximum number of posts to scrape
      • allow specifying a sleep interval between requests
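
      A rough sketch of that plan, using Reddit's public JSON search endpoint (the field names, defaults, and pagination details below are assumptions, not the final implementation):

      import time
      import requests

      def scrape_posts(topic, max_posts=50, sleep=1.0, proxies=None):
          # Reddit exposes search results as JSON by appending .json to the search URL
          url = 'https://www.reddit.com/search.json'
          headers = {'User-Agent': 'scrapera-sketch'}
          posts, after = [], None
          while len(posts) < max_posts:
              params = {'q': topic, 'limit': 100, 'after': after}
              data = requests.get(url, params=params, headers=headers,
                                  proxies=proxies, timeout=10).json()
              children = data['data']['children']
              if not children:
                  break
              for child in children:
                  d = child['data']
                  posts.append({'title': d['title'], 'author': d['author'],
                                'upvotes': d['ups'], 'comments': d['num_comments'],
                                'subreddit': d['subreddit'], 'link': d['url']})
              after = data['data']['after']
              time.sleep(sleep)  # respect the configured delay between requests
          return posts[:max_posts]
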
      opened by pratik-choudhari 3
    • Make Reddit scraper Asynchronous

      This update includes the same functionality as the previous version, but execution is much faster. Here are the results:

      POSTS SCRAPED | NORMAL | ASYNC
      ------------- | ------ | -----
      20            | 123s   | 33s
      50            | 213s   | 63s
      100           | 375s   | 143s
      200           | 738s   | 262s

      File sizes remain almost constant in both versions.

      opened by pratik-choudhari 2
    • List of available crawlers on README.md

      Hi,

      First of all, thank you for starting such a wonderful project!

      While reading this repo, I found it would be convenient to see which crawlers are implemented from the README, instead of looking into the code.

      Not sure if this is a good idea at this time. Feel free to leave a comment or simply close it.

      opened by zychen423 2
    • potential code refactor

      This PR includes:

      • Use is instead of ==
        • It is recommended to use an identity test (is) instead of an equality test (==) when comparing the types of two objects.
      • Simplify boolean expressions
      • Remove self
        • The method does not use its bound instance. Decorate it with the @staticmethod decorator so that Python does not have to create a bound method for every instance of the class, saving memory and computation.
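
      As a small illustration of the last point (a generic example, not the repository's actual code):

      class Downloader:
          @staticmethod
          def build_filename(url):
              # No self is used, so the method can be a @staticmethod
              return url.rstrip('/').split('/')[-1] + '.mp4'

      print(Downloader.build_filename('https://vimeo.com/191955190'))
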
      opened by tusharnankani 1