A universal package of scraper scripts for humans

Overview

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Contributing
  5. License
  6. Sponsors
  7. Contact
  8. Acknowledgements

About The Project

Scrapera is a completely Chromedriver-free package that provides access to a variety of scraper scripts for the most commonly used machine learning and data science domains. Scrapera scrapes directly and asynchronously from public API endpoints, removing the heavy browser overhead and making it extremely fast and robust to DOM changes. Currently, Scrapera supports the following crawlers:

  • Images
  • Text
  • Audio
  • Videos
  • Miscellaneous
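
The endpoint-based, asynchronous approach can be pictured with a short, generic sketch (this uses aiohttp and a placeholder URL for illustration; it is not Scrapera's internal code):

import asyncio
import aiohttp

async def fetch_json(session, url):
    # Request a public JSON endpoint directly instead of rendering the page in a browser
    async with session.get(url) as response:
        return await response.json()

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        # Issue all requests concurrently and collect the responses
        return await asyncio.gather(*(fetch_json(session, url) for url in urls))

# Placeholder endpoint for illustration only
results = asyncio.run(fetch_all(['https://example.com/api/items?page=1']))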

  • The main aim of this package is to cluster common scraping tasks so that ML researchers and engineers can focus on their models rather than on the data collection process.

    DISCLAIMER: Neither the owner nor the contributors take any responsibility for misuse of data obtained through Scrapera. Contact the owner if any module provided by Scrapera violates copyright terms.

    Getting Started

    Prerequisites

    Prerequisites can be installed separately through the requirements.txt file as shown below:

    pip install -r requirements.txt

    Installation

    Scrapera is built with Python 3 and can be installed directly with pip:

    pip install scrapera

    Alternatively, to install the latest version directly from GitHub, run:

    pip install git+https://github.com/DarshanDeshpande/Scrapera.git

    Usage

    To use any sub-module, you just need to import it, instantiate the scraper, and call its scrape method:

    from scrapera.video.vimeo import VimeoScraper

    # Instantiate the Vimeo scraper and download the video at 540p quality
    scraper = VimeoScraper()
    scraper.scrape('https://vimeo.com/191955190', '540p')
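
    For example, to download several videos in one run you can reuse the same scraper instance (a minimal sketch built only on the call shown above; the second URL is a placeholder):

    from scrapera.video.vimeo import VimeoScraper

    scraper = VimeoScraper()
    # The second URL is for illustration only
    for url in ['https://vimeo.com/191955190', 'https://vimeo.com/76979871']:
        scraper.scrape(url, '540p')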

    For more examples, please refer to the individual test folders in the respective modules.

    Contributing

    Scrapera welcomes any and all contributions and scraper requests. Please raise an issue if a scraper fails in any instance. Feel free to fork the repository and add your own scrapers to help the community!
    For more guidelines, refer to CONTRIBUTING

    License

    Distributed under the MIT License. See LICENSE for more information.

    Sponsors


    Contact

    Feel free to reach out for any issues or requests related to Scrapera.

    Darshan Deshpande (Owner) - Email | LinkedIn

    Acknowledgements

    Comments
    • NSE stock price scraper

      I would like to contribute an NSE scraper that will scrape the following:

      1. Nifty50 index value
      2. Last traded price of a particular stock
      3. All nifty50 stock prices
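
      A minimal sketch of how such a scraper might fetch a last traded price (the endpoint URL and response fields below are placeholders, not NSE's actual API):

      import requests

      def last_traded_price(symbol):
          # Hypothetical quote endpoint; the real NSE API differs and may require session cookies
          url = f'https://example.com/api/quote?symbol={symbol}'
          data = requests.get(url, timeout=10).json()
          return data['lastPrice']

      print(last_traded_price('RELIANCE'))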

      P.S. This repository is a great initiative ✨

      opened by pratik-choudhari 7
    • Fixed Code Quality Issues

      Description

      Summary:

      • Use is to compare the types of objects
      • Remove unnecessary f-strings
      • Remove unnecessary generators
      • Remove multiple import names
      • Add .deepsource.toml
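
      For context, the kinds of fixes listed above look roughly like this (illustrative snippets, not the actual diff):

      value = {}
      items = [{'name': 'vimeo'}, {'name': 'reddit'}]

      # Use is when comparing type objects (instead of: type(value) == dict)
      if type(value) is dict:
          print('value is a dict')

      # Drop the f prefix from a string with no placeholders (instead of: f'Download complete')
      message = 'Download complete'

      # Use a comprehension instead of an unnecessary generator (instead of: list(item['name'] for item in items))
      names = [item['name'] for item in items]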

      I ran a DeepSource Analysis on my fork of this repository. You can see all the issues raised by DeepSource here.

      DeepSource helps you automatically find and fix issues in your code during code reviews. This tool looks for anti-patterns, bug risks, and performance problems, and raises issues. There are plenty of other issues related to Bug Discovery and Anti-Patterns which you may be interested in taking a look at.

      If you do not want to use DeepSource to continuously analyze this repo, I'll remove the .deepsource.toml from this PR and you can merge the rest of the fixes. If you want to set up DeepSource for continuous analysis, I can help you set that up.

      opened by HarshCasper 4
    • Potential Bug Risks and Anti-Patterns

      Description

      Hi @DarshanDeshpande 👋

      I ran DeepSource Static Code Analysis upon the Project, the results for which are available here.

      The Static Code Analysis Tool found potential bugs and anti-patterns in the code that can become detrimental to the project later on. DeepSource helps you automatically find and fix issues in your code during code reviews. This tool looks for anti-patterns, bug risks, and performance problems, and raises issues.

      Some of the notable issues are:

      • Missing Argument in Function Call (here)
      • Unnecessary Generator (here)
      • f-string used without any expression (here)
      • Detected subprocess popen call with shell equals True (here)
      • Bad Except Order (here)
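
      Two of these issues are easy to illustrate with generic snippets (not the repository's actual code):

      import subprocess

      # Except order: handle the more specific exception before the general one
      try:
          value = int('not-a-number')
      except ValueError:    # specific handler first
          value = 0
      except Exception:     # general handler last, otherwise it would shadow ValueError
          value = None

      # Avoid shell=True; pass the command as a list so no shell injection is possible
      subprocess.run(['python', '--version'], check=True)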

      There are plenty of other issues related to Bug Discovery and Anti-Patterns which you may be interested in taking a look at.

      If you would like to integrate DeepSource to autofix some of the commonly occurring issues, I can help you set that up :)

      opened by HarshCasper 4
    • Reddit posts scraper

      I would like to contribute a program to scrape Reddit posts returned when a specific topic is searched. The following information will be recorded:

      • number of upvotes
      • number of comments
      • title
      • author
      • link
      • subreddit name
      • isSponsored flag

      The program will:

      • make use of Reddit endpoints
      • support explicit proxies
      • allow capping the maximum number of posts to scrape
      • allow specifying a sleep interval between requests
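
      A rough sketch of that plan, using Reddit's public JSON search endpoint (the field names, defaults, and pagination details below are assumptions, not the final implementation):

      import time
      import requests

      def scrape_posts(topic, max_posts=50, sleep=1.0, proxies=None):
          # Reddit exposes search results as JSON by appending .json to the search URL
          url = 'https://www.reddit.com/search.json'
          headers = {'User-Agent': 'scrapera-sketch'}
          posts, after = [], None
          while len(posts) < max_posts:
              params = {'q': topic, 'limit': 100, 'after': after}
              data = requests.get(url, params=params, headers=headers,
                                  proxies=proxies, timeout=10).json()
              children = data['data']['children']
              if not children:
                  break
              for child in children:
                  d = child['data']
                  posts.append({'title': d['title'], 'author': d['author'],
                                'upvotes': d['ups'], 'comments': d['num_comments'],
                                'subreddit': d['subreddit'], 'link': d['url']})
              after = data['data']['after']
              time.sleep(sleep)  # respect the configured delay between requests
          return posts[:max_posts]
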
      opened by pratik-choudhari 3
    • Make Reddit scraper Asynchronous

      This update includes the same functionality as the previous version, but execution is much faster. Here are the results:

      POSTS SCRAPED | NORMAL | ASYNC
      ------------- | ------ | -----
      20            | 123s   | 33s
      50            | 213s   | 63s
      100           | 375s   | 143s
      200           | 738s   | 262s

      File sizes remain almost constant in both versions.

      opened by pratik-choudhari 2
    • List of available crawlers on README.md

      Hi,

      First of all, thank you for starting such a wonderful project!

      While reading this repo, I found it would be convenient to see which crawlers are implemented from the README, instead of looking into the code.

      Not sure if this is a good idea at this time. Feel free to leave a comment or simply close it.

      opened by zychen423 2
    • potential code refactor

      This PR includes:

      • Use is instead of ==
        • It is recommended to use an identity test (is) instead of an equality test (==) when comparing the types of two objects.
      • Simplify boolean expressions
      • Remove self
        • The method does not use its bound instance. Decorate it with the @staticmethod decorator so that Python does not have to create a bound method for every instance of the class, saving memory and computation.
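
      As a small illustration of the last point (a generic example, not the repository's actual code):

      class Downloader:
          @staticmethod
          def build_filename(url):
              # No self is used, so the method can be a @staticmethod
              return url.rstrip('/').split('/')[-1] + '.mp4'

      print(Downloader.build_filename('https://vimeo.com/191955190'))
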
      opened by tusharnankani 1