 __  __  _ __   ____
/\ \/\ \/\`'__\/',__\
\ \ \_\ \ \ \//\__, `\
 \ \____/\ \_\\/\____/
  \/___/ \/_/ \/___/... Universal Reddit Scraper
usage: $ Urs.py
     [-h]
     [-e]
     [-v]
     [-t [OPTIONAL_DATE]]
     [--check]
     [-r SUBREDDIT (H|N|C|T|R|S) N_RESULTS_OR_KEYWORDS [OPTIONAL_TIME_FILTER]]
     [-y]
     [--csv]
     [--rules]
     [-u REDDITOR N_RESULTS]
     [-c SUBMISSION_URL N_RESULTS]
     [--raw]
     [-b]
     [--csv]
     [-lr SUBREDDIT]
     [-lu REDDITOR]
     [--nosave]
     [--stream-submissions]
     [-f FILE_PATH]
     [--csv]
     [-wc FILE_PATH [OPTIONAL_EXPORT_FORMAT]]
     [--nosave]
Table of Contents
- Contact
- Introduction
- Installation
- Exporting
- URS Overview
- Sponsors
- Contributors
- Contributing
- Derivative Projects
- Supplemental Documents
Contact
Whether you are using URS for enterprise or personal use, I am very interested in hearing about your use case and how it has helped you achieve a goal.
Additionally, please send me an email if you would like to contribute, have questions, or want to share something you have built on top of it.
You can send me an email or leave a note by clicking on either of these badges. I look forward to hearing from you!
Introduction
This is a comprehensive Reddit scraping tool that integrates multiple features:
- Scrape Reddit via PRAW (the official Python Reddit API Wrapper)
    - Scrape Subreddits
    - Scrape Redditors
    - Scrape submission comments
- Livestream Reddit via PRAW
    - Livestream comments submitted within Subreddits or by Redditors
    - Livestream submissions submitted within Subreddits or by Redditors
- Analytical tools for scraped data
    - Generate frequencies for words that are found in submission titles, bodies, and/or comments
    - Generate a wordcloud from scrape results
See the Getting Started section to get your API credentials.
Installation
NOTE: Requires Python 3.7+
git clone --depth=1 https://github.com/JosephLai241/URS.git
cd URS
pip3 install . -r requirements.txt
Troubleshooting
ModuleNotFoundError
You may run into an error that looks like this:
Traceback (most recent call last):
File "/home/joseph/URS/urs/./Urs.py", line 30, in
from urs.utils.Logger import LogMain
ModuleNotFoundError: No module named 'urs'
This means you will need to add the URS directory to your PYTHONPATH. Here is a link that explains how to do so for each operating system.
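On Linux or macOS, for example, this typically amounts to exporting the variable; the path below is a placeholder for wherever you cloned the repository:

export PYTHONPATH="${PYTHONPATH}:/path/to/URS"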
Exporting
Export File Format
All files except for those generated by the wordcloud tool are exported to JSON by default. Wordcloud files are exported to PNG by default.
URS supports exporting to CSV as well, but JSON is the more versatile option.
Exporting to CSV
You will have to include the --csv flag to export to CSV.
You can only export to CSV when using:
- The Subreddit scrapers
- The word frequencies generator
These tools lend themselves well to the CSV format and have been optimized to support it if you would rather use that format instead.
The --csv flag is ignored if it is present while using any of the other scrapers.
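For example, appending the flag to a hypothetical Subreddit scrape run exports that scrape to CSV instead of JSON:

$ ./Urs.py -r askreddit h 10 --csv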
Export Directory Structure
All exported files are saved within the scrapes directory and stored in a sub-directory labeled with the date. Additional sub-directories may be created within the date directory; each is only created when its respective tool is run. For example, if you only use the Subreddit scraper, only the subreddits directory is created.
PRAW Scrapers
The subreddits, redditors, or comments directories may be created.
PRAW Livestream Scrapers
The livestream directory is created when you run any of the livestream scrapers. Within it, the subreddits or redditors directories may be created.
Analytical Tools
The analytics directory is created when you run any of the analytical tools. Within it, the frequencies or wordclouds directories may be created. See the Analytical Tools section for more information.
Example Directory Structure
This is a sample directory structure generated by the tree command.
scrapes/
└── 06-02-2021
├── analytics
│ ├── frequencies
│ │ ├── comments
│ │ │ └── What’s something from the 90s you miss_-all.json
│ │ ├── livestream
│ │ │ └── subreddits
│ │ │ └── askreddit-comments-20_44_11-00_01_10.json
│ │ └── subreddits
│ │ └── cscareerquestions-search-'job'-past-year-rules.json
│ └── wordcloud
│ ├── comments
│ │ └── What’s something from the 90s you miss_-all.png
│ ├── livestream
│ │ └── subreddits
│ │ └── askreddit-comments-20_44_11-00_01_10.png
│ └── subreddits
│ └── cscareerquestions-search-'job'-past-year-rules.png
├── comments
│ └── What’s something from the 90s you miss_-all.json
├── livestream
│ └── subreddits
│ ├── askreddit-comments-20_44_11-00_01_10.json
│ └── askreddit-submissions-20_46_12-00_01_52.json
├── redditors
│ └── spez-5-results.json
├── subreddits
│ ├── askreddit-hot-10-results.json
│ └── cscareerquestions-search-'job'-past-year-rules.json
└── urs.log
URS Overview
Scrape Speeds
Your internet connection speed is the primary bottleneck that determines scrape duration; however, there are additional bottlenecks such as:
- The number of results returned for Subreddit or Redditor scraping.
- The submission's popularity (total number of comments) for submission comments scraping.
Scraping Reddit via PRAW
Getting Started
It is very quick and easy to get Reddit API credentials. Refer to my guide to get your credentials, then update the environment variables located in .env.
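For reference, authenticating with PRAW boils down to passing these credentials to praw.Reddit. The following is only a minimal sketch; the environment variable names are hypothetical placeholders rather than URS's actual .env keys:

# Minimal sketch of authenticating with PRAW. The environment variable names
# below are hypothetical placeholders, not necessarily URS's actual .env keys.
import os

import praw

reddit = praw.Reddit(
    client_id=os.environ["CLIENT_ID"],
    client_secret=os.environ["CLIENT_SECRET"],
    user_agent=os.environ["USER_AGENT"],
    username=os.environ["REDDIT_USERNAME"],
    password=os.environ["REDDIT_PASSWORD"],
)

print(reddit.user.me())  # prints your username if the credentials are valid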
Rate Limits
Yes, PRAW has rate limits. These limits are proportional to how much karma you have accumulated - the higher the karma, the higher the rate limit. This has been implemented to mitigate spammers and bots that utilize PRAW.
Rate limit information for your account is displayed in a small table underneath the successful login message each time you run any of the PRAW scrapers. I have also added a --check flag if you want to quickly view this information.
URS will display an error message as well as the rate limit reset date if you have used all your available requests.
There are a couple ways to circumvent rate limits:
- Scrape intermittently
- Use an account with high karma to get your PRAW credentials
- Scrape fewer results per run
Available requests are refilled if you use the PRAW scrapers intermittently, which might be the best solution. This can be especially helpful if you have automated URS and are not looking at the output on each run.
A Table of All Subreddit, Redditor, and Submission Comments Attributes
These attributes are included in each scrape.
| Subreddits (submissions) | Redditors | Submission Comments |
|---|---|---|
| author | comment_karma | author |
| created_utc | created_utc | body |
| distinguished | fullname | body_html |
| edited | has_verified_email | created_utc |
| id | icon_img | distinguished |
| is_original_content | id | edited |
| is_self | is_employee | id |
| link_flair_text | is_friend | is_submitter |
| locked | is_mod | link_id |
| name | is_gold | parent_id |
| num_comments | link_karma | score |
| nsfw | name | stickied |
| permalink | subreddit | |
| score | *trophies | |
| selftext | *comments | |
| spoiler | *controversial | |
| stickied | *downvoted (may be forbidden) | |
| title | *gilded | |
| upvote_ratio | *gildings (may be forbidden) | |
| url | *hidden (may be forbidden) | |
| | *hot | |
| | *moderated | |
| | *multireddits | |
| | *new | |
| | *saved (may be forbidden) | |
| | *submissions | |
| | *top | |
| | *upvoted (may be forbidden) | |
*Includes additional attributes; see Redditors section for more information.
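For context, these attributes correspond to fields exposed on PRAW's Subreddit, Redditor, and Comment objects. A minimal sketch of reading a few of them, assuming the authenticated reddit instance from the sketch in the Getting Started section:

# Minimal sketch: reading a few of the attributes listed above from PRAW objects.
# Assumes `reddit` is an authenticated praw.Reddit instance.
for submission in reddit.subreddit("askreddit").hot(limit=10):
    print(submission.title, submission.score, submission.num_comments)

redditor = reddit.redditor("spez")
print(redditor.comment_karma, redditor.link_karma, redditor.created_utc)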
Available Flags
[-r SUBREDDIT (H|N|C|T|R|S) N_RESULTS_OR_KEYWORDS [OPTIONAL_TIME_FILTER]]
[-y]
[--csv]
[--rules]
[-u REDDITOR N_RESULTS]
[-c SUBMISSION_URL N_RESULTS]
[--raw]
[-b]
[--csv]
Subreddits
*This GIF is uncut.
Usage: $ ./Urs.py -r SUBREDDIT (H|N|C|T|R|S) N_RESULTS_OR_KEYWORDS
Supported export formats: JSON and CSV. To export to CSV, include the --csv flag.
You can specify Subreddits, the submission category, and how many results are returned from each scrape. I have also added a search option where you can search for keywords within a Subreddit.
These are the submission categories:
- Hot
- New
- Controversial
- Top
- Rising
- Search
The file names for all categories except for Search will follow this format:
"[SUBREDDIT]-[POST_CATEGORY]-[N_RESULTS]-result(s).[FILE_FORMAT]"
If you searched for keywords, file names will follow this format:
"[SUBREDDIT]-Search-'[KEYWORDS]'.[FILE_FORMAT]"
Scrape data is exported to the subreddits directory.
NOTE: Up to 100 results are returned if you search for keywords within a Subreddit. You will not be able to specify how many results to keep.
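For example, a run with hypothetical values like the following would scrape the 10 hottest submissions from r/AskReddit and produce a file such as the askreddit-hot-10-results.json shown in the example directory structure:

$ ./Urs.py -r askreddit h 10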
Time Filters
Time filters may be applied to some categories. Here is a table of the categories on which you can apply a time filter as well as the valid time filters.
| Categories | Time Filters |
|---|---|
| Controversial | All (default) |
| Top | Day |
| Search | Hour |
| | Month |
| | Week |
| | Year |
Specify the time filter after the number of results returned or keywords you want to search for.
Usage: $ ./Urs.py -r SUBREDDIT (C|T|S) N_RESULTS_OR_KEYWORDS OPTIONAL_TIME_FILTER
If no time filter is specified, the default time filter all is applied. The Subreddit settings table will display None for categories that do not offer the additional time filter option.
If you specified a time filter, -past-[TIME_FILTER] will be appended to the file name before the file format like so:
"[SUBREDDIT]-[POST_CATEGORY]-[N_RESULTS]-result(s)-past-[TIME_FILTER].[FILE_FORMAT]"
Or if you searched for keywords:
"[SUBREDDIT]-Search-'[KEYWORDS]'-past-[TIME_FILTER].[FILE_FORMAT]"
Subreddit Rules and Post Requirements
You can also include the Subreddit's rules and post requirements in your scrape data by including the --rules flag. This only works when exporting to JSON. This data will be included in the subreddit_rules field.
If rules are included in your file, -rules will be appended to the end of the file name.
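For example, a run like the following (hypothetical values) searches r/cscareerquestions for "job" within the past year and includes the Subreddit's rules, matching the cscareerquestions-search-'job'-past-year-rules.json file in the example directory structure:

$ ./Urs.py -r cscareerquestions s job year --rules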
Bypassing the Final Settings Check
After the arguments are submitted and validated against Reddit, URS will display a table of Subreddit scraping settings as a final check before executing. You can include the -y flag to bypass this check and scrape immediately.
Redditors
*This GIF has been cut for demonstration purposes.
Usage: $ ./Urs.py -u REDDITOR N_RESULTS
Supported export formats: JSON.
You can also scrape Redditor profiles and specify how many results are returned.
Redditor information will be included in the information field and includes the following attributes:
| Redditor Information |
|---|
| comment_karma |
| created_utc |
| fullname |
| has_verified_email |
| icon_img |
| id |
| is_employee |
| is_friend |
| is_mod |
| is_gold |
| link_karma |
| name |
| subreddit |
| trophies |
Redditor interactions will be included in the interactions field. Here is a table of all Redditor interaction attributes that are also included, how they are sorted, and what type of Reddit objects are included in each.
Attribute Name | Sorted By/Time Filter | Reddit Objects |
---|---|---|
Comments | Sorted By: New | Comments |
Controversial | Time Filter: All | Comments and submissions |
Downvoted | Sorted By: New | Comments and submissions |
Gilded | Sorted By: New | Comments and submissions |
Gildings | Sorted By: New | Comments and submissions |
Hidden | Sorted By: New | Comments and submissions |
Hot | Determined by other Redditors' interactions | Comments and submissions |
Moderated | N/A | Subreddits |
Multireddits | N/A | Multireddits |
New | Sorted By: New | Comments and submissions |
Saved | Sorted By: New | Comments and submissions |
Submissions | Sorted By: New | Submissions |
Top | Time Filter: All | Comments and submissions |
Upvoted | Sorted By: New | Comments and submissions |
These attributes contain comments or submissions. Subreddit attributes are also included within both.
This is a table of all attributes that are included for each Reddit object:
| Subreddits | Comments | Submissions | Multireddits | Trophies |
|---|---|---|---|---|
| can_assign_link_flair | body | author | can_edit | award_id |
| can_assign_user_flair | body_html | created_utc | copied_from | description |
| created_utc | created_utc | distinguished | created_utc | icon_40 |
| description | distinguished | edited | description_html | icon_70 |
| description_html | edited | id | description_md | name |
| display_name | id | is_original_content | display_name | url |
| id | is_submitter | is_self | name | |
| name | link_id | link_flair_text | nsfw | |
| nsfw | parent_id | locked | subreddits | |
| public_description | score | name | visibility | |
| spoilers_enabled | stickied | num_comments | | |
| subscribers | *submission | nsfw | | |
| user_is_banned | subreddit_id | permalink | | |
| user_is_moderator | | score | | |
| user_is_subscriber | | selftext | | |
| | | spoiler | | |
| | | stickied | | |
| | | *subreddit | | |
| | | title | | |
| | | upvote_ratio | | |
| | | url | | |
* Contains additional metadata.
The file names will follow this format:
"[USERNAME]-[N_RESULTS]-result(s).json"
Scrape data is exported to the redditors directory.
NOTE: If you are not allowed to access a Redditor's lists, PRAW will raise a 403 HTTP Forbidden exception, and the program will just append "FORBIDDEN" underneath that section in the exported file.
NOTE: The number of results returned is applied to all attributes. I have not implemented code to allow users to specify a different number of results for individual attributes.
Submission Comments
*This GIF has been cut for demonstration purposes.
Usage: $ ./Urs.py -c SUBMISSION_URL N_RESULTS
Supported export formats: JSON.
You can also scrape comments from submissions and specify the number of results returned.
Submission metadata will be included in the submission_metadata field and includes the following attributes:
| Submission Attributes |
|---|
| author |
| created_utc |
| distinguished |
| edited |
| is_original_content |
| is_self |
| link_flair_text |
| locked |
| nsfw |
| num_comments |
| permalink |
| score |
| selftext |
| spoiler |
| stickied |
| subreddit |
| title |
| upvote_ratio |
If the submission contains a gallery, the attributes gallery_data and media_metadata will be included.
Comments are written to the comments field. They are sorted by "Best", which is the default sorting option when you visit a submission.
PRAW returns submission comments in level order, which means scrape speeds are proportional to the submission's popularity.
The file names will generally follow this format:
"[POST_TITLE]-[N_RESULTS]-result(s).json"
Scrape data is exported to the comments directory.
Number of Comments Returned
You can scrape all comments from a submission by passing in 0 for N_RESULTS. Subsequently, [N_RESULTS]-result(s) in the file name will be replaced with all.
Otherwise, specify the number of results you want returned. If you passed in a specific number of results, the structured export will return up to N_RESULTS top-level comments and include all of their replies.
Structured Comments
This is the default export style. Structured scrapes resemble comment threads on Reddit. This style takes just a little longer to export compared to the raw format because URS uses depth-first search to create the comment Forest after retrieving all comments from a submission.
If you want to learn more about how it works, refer to this additional document where I describe how I implemented the Forest.
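To illustrate the idea only (this is a sketch, not URS's actual Forest implementation), a nested, thread-like structure can be built by recursively descending into each comment's replies:

# Sketch of building a nested comment structure via depth-first search.
# Illustration only, not URS's actual Forest implementation.
# Assumes `reddit` is an authenticated praw.Reddit instance.
def build_tree(comment):
    return {
        "author": str(comment.author),
        "body": comment.body,
        "replies": [build_tree(reply) for reply in comment.replies],
    }

submission = reddit.submission(url="SUBMISSION_URL")  # placeholder: replace with a real submission URL
submission.comments.replace_more(limit=None)          # resolve "load more comments" placeholders
forest = [build_tree(top_level) for top_level in submission.comments]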
Raw Comments
Raw scrapes do not resemble comment threads; instead, all comments on a submission are returned in level order: all top-level comments are listed first, followed by all second-level comments, then third, and so on.
You can export to raw format by including the --raw flag. -raw will also be appended to the end of the file name.
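For comparison, PRAW's CommentForest.list() method already yields this level-order view. A minimal sketch, assuming the submission object from the previous sketch:

# Sketch of the raw, level-order view: CommentForest.list() returns all comments
# breadth first (top-level comments, then second-level, and so on).
for comment in submission.comments.list():
    print(comment.id, comment.parent_id, comment.body[:80])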
Livestreaming Reddit via PRAW
These tools may be used to livestream comments or submissions submitted within Subreddits or by Redditors.
Comments are streamed by default. To stream submissions instead, include the --stream-submissions flag.
New comments or submissions will continue to display within your terminal until you abort the stream using Ctrl + C.
The filenames will follow this format:
[SUBREDDIT_OR_REDDITOR]-[comments_OR_submissions]-[START_TIME_IN_HOURS_MINUTES_SECONDS]-[DURATION_IN_HOURS_MINUTES_SECONDS].json
This file is saved within the main livestream directory, in either the subreddits or redditors sub-directory depending on which stream was run.
Reddit objects will be written to this JSON file in real time. After aborting the stream, the filename will be updated with the start time and duration.
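Since URS livestreams through PRAW, the underlying mechanism is PRAW's streaming generators. A minimal sketch, assuming an authenticated reddit instance (an illustration only, not URS's implementation):

# Sketch of PRAW's streaming API: yields new Reddit objects as they are created.
# Assumes `reddit` is an authenticated praw.Reddit instance.
for comment in reddit.subreddit("askreddit").stream.comments(skip_existing=True):
    print(comment.author, comment.body[:80])  # runs until you abort with Ctrl + C

# Streaming a Redditor's new submissions works the same way:
# for submission in reddit.redditor("spez").stream.submissions(skip_existing=True):
#     print(submission.title)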
Displayed vs. Saved Attributes
Displayed comment and submission attributes have been stripped down to essential fields to declutter the output. Here is a table of what is shown during the stream:
| Comment Attributes | Submission Attributes |
|---|---|
| author | author |
| body | created_utc |
| created_utc | is_self |
| is_submitter | link_flair_text |
| submission_author | nsfw |
| submission_created_utc | selftext |
| submission_link_flair_text | spoiler |
| submission_nsfw | stickied |
| submission_num_comments | title |
| submission_score | url |
| submission_title | |
| submission_upvote_ratio | |
| submission_url | |
Comment and submission attributes that are written to file will include the full list of attributes found in the Table of All Subreddit, Redditor, and Submission Comments Attributes.
Available Flags
[-lr SUBREDDIT]
[-lu REDDITOR]
[--nosave]
[--stream-submissions]
Livestreaming Subreddits
*This GIF has been cut for demonstration purposes.
Usage: $ ./Urs.py -lr SUBREDDIT
Supported export formats: JSON.
Default stream objects: Comments. To stream submissions instead, include the --stream-submissions flag.
You can livestream comments or submissions that are created within a Subreddit.
Reddit object information will be displayed in a PrettyTable as new objects are submitted.
NOTE: PRAW may not be able to catch all new submissions or comments within a high-volume Subreddit, as mentioned in these disclaimers located in the "Note" boxes.
Livestreaming Redditors
Livestream demo was not recorded for Redditors because its functionality is identical to the Subreddit livestream.
Usage: $ ./Urs.py -lu REDDITOR
Supported export formats: JSON.
Default stream objects: Comments. To stream submissions instead, include the --stream-submissions flag.
You can livestream comments or submissions that are created by a Redditor.
Reddit object information will be displayed in a PrettyTable as new objects are submitted.
Do Not Save Livestream to File
Include the --nosave flag if you do not want to save the livestream to file.
Analytical Tools
This suite of tools can be used after scraping data from Reddit. Both of these tools analyze the frequencies of words found in submission titles and bodies, or comments within JSON scrape data.
There are a few ways you can quickly get the correct filepath to the scrape file:
- Drag and drop the file into the terminal.
- Partially type the path and rely on tab completion support to finish the full path for you.
Running either tool will create the analytics directory within the date directory. This directory is located in the same directory in which the scrape data resides. For example, if you run the frequencies generator on February 16th for scrape data that was captured on February 14th, analytics will be created in the February 14th directory. Command history will still be written in the February 16th urs.log.
The sub-directories frequencies or wordclouds are created in analytics depending on which tool is run. These directories mirror the directories in which the original scrape files reside. For example, if you run the frequencies generator on a Subreddit scrape, the directory structure will look like this:
analytics/
└── frequencies
└── subreddits
└── SUBREDDIT_SCRAPE.json
A shortened export path is displayed once URS has completed exporting the data, informing you where the file is saved within the scrapes directory. You can open urs.log to view the full path.
Target Fields
The data varies depending on the scraper, so these tools target different fields for each type of scrape data:
| Scrape Data | Targets |
|---|---|
| Subreddit | selftext, title |
| Redditor | selftext, title, body |
| Submission Comments | body |
| Livestream | selftext and title, or body |
For Subreddit scrapes, data is pulled from the selftext and title fields of each submission (the submission body and title).
For Redditor scrapes, data is pulled from all three fields because both submission and comment data is returned. The selftext and title fields are targeted for submissions, and the body field is targeted for comments.
For submission comments scrapes, data is only pulled from the body field of each comment.
For livestream scrapes, comments or submissions may be included depending on user settings. The selftext and title fields are targeted for submissions, and the body field is targeted for comments.
File Names
File names are identical to the original scrape data so that it is easier to distinguish which analytical file corresponds to which scrape.
Available Flags
[-f FILE_PATH]
[--csv]
[-wc FILE_PATH [OPTIONAL_EXPORT_FORMAT]]
[--nosave]
Generating Word Frequencies
*This GIF is uncut.
Usage: $ ./Urs.py -f FILE_PATH
Supported export formats: JSON and CSV. To export to CSV, include the --csv flag.
You can generate a dictionary of word frequencies created from the words within the target fields. These frequencies are sorted from highest to lowest.
Frequencies are exported to JSON by default, but this tool also works well with the CSV format.
Exported files will be saved to the analytics/frequencies directory.
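As an illustration of the idea (not URS's actual implementation), word frequencies for a Subreddit scrape can be tallied with a simple counter over the target fields. The "data" key used below is a hypothetical placeholder for the actual structure of URS's JSON export:

# Sketch of generating word frequencies from a Subreddit scrape file.
# Illustration only; the "data" key is a hypothetical placeholder.
import json
from collections import Counter

with open("askreddit-hot-10-results.json", encoding="utf-8") as scrape_file:
    scrape = json.load(scrape_file)

frequencies = Counter()
for submission in scrape["data"]:
    for field in ("title", "selftext"):   # target fields for Subreddit scrapes
        frequencies.update(submission.get(field, "").lower().split())

print(frequencies.most_common(10))        # sorted from highest to lowest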
Generating Wordclouds
*This GIF is uncut.
Usage: $ ./Urs.py -wc FILE_PATH
Supported export formats: eps, jpeg, jpg, pdf, png (default), ps, rgba, tif, tiff.
Taking word frequencies to the next level, you can generate wordclouds based on word frequencies. This tool is independent of the frequencies generator - you do not need to run the frequencies generator before creating a wordcloud.
PNG is the default format, but you can also export to any of the options listed above by including the format as the second flag argument.
Usage: $ ./Urs.py -wc FILE_PATH OPTIONAL_EXPORT_FORMAT
Exported files will be saved to the analytics/wordclouds directory.
Display Wordcloud Instead of Saving
Wordclouds are saved to file by default. If you do not want to keep a file, include the --nosave flag to only display the wordcloud.
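As an illustration (not URS's actual implementation), the wordcloud package can render frequencies like those above into an image, either saving it to file or displaying it:

# Sketch of generating a wordcloud from word frequencies using the `wordcloud`
# package. Illustration only; assumes `frequencies` from the previous sketch.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

cloud = WordCloud(width=1920, height=1080).generate_from_frequencies(frequencies)

cloud.to_file("wordcloud.png")               # save to PNG (the default behavior)

plt.imshow(cloud, interpolation="bilinear")  # or display it instead, as with --nosave
plt.axis("off")
plt.show()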
Utilities
This section briefly outlines the utilities included with URS.
Available Flags
[-t [OPTIONAL_DATE]]
[--check]
Display Directory Tree
Usage: $ ./Urs.py -t
If no date is provided, you can quickly view the directory structure for the current date. This is a quick alternative to the tree command.
You can also display a different day's scrapes by providing a date after the -t flag.
Usage: $ ./Urs.py -t OPTIONAL_DATE
The following date formats are supported:
MM-DD-YYYY
MM/DD/YYYY
An error is displayed if URS was not run on the entered date (if the date directory is not found within the scrapes directory).
Check PRAW Rate Limits
Usage: $ ./Urs.py --check
You can quickly check the rate limits for your account by using this flag.
Sponsors
This is a shout-out section for my patrons - thank you so much for sponsoring this project!
Contributing
See the Contact section for ways to reach me.
Before Making Pull or Feature Requests
Consider the scope of this project before submitting a pull or feature request. URS stands for Universal Reddit Scraper. Two important aspects are listed in its name - universal and scraper.
I will not approve feature or pull requests that deviate from its sole purpose. This may include scraping a specific aspect of Reddit or adding functionality that allows you to post a comment with URS. Adding either of these requests will no longer allow URS to be universal or merely a scraper. However, I am more than happy to approve requests that enhance the current scraping capabilities of URS.
Building on Top of URS
Although I will not approve requests that deviate from the project scope, feel free to reach out if you have built something on top of URS or have made modifications to scrape something specific on Reddit. I will add your project to the Derivative Projects section!
Making Pull or Feature Requests
You can suggest new features or changes by going to the Issues tab and filling out the Feature Request template. If there is a good reason for a new feature, I will consider adding it.
You are also more than welcome to create a pull request - adding additional features, improving runtime, or refactoring existing code. If it is approved, I will merge the pull request into the master branch and credit you for contributing to this project.
Contributors
Date | User | Contribution |
---|---|---|
March 11, 2020 | ThereGoesMySanity | Created a pull request adding 2FA information to README |
October 6, 2020 | LukeDSchenk | Created a pull request fixing the "[Errno 36] File name too long" issue, which made it impossible to save comment scrapes with long titles |
October 10, 2020 | IceBerge421 | Created a pull request fixing a cloning error occurring on Windows machines due to an illegal file name character, ", found in two scrape samples |
Derivative Projects
This is a showcase for projects that are built on top of URS!
skiwheelr/URS
Contains a bash script built on URS which counts ticker mentions in Subreddits, subsequently cURLs all the relevant links in parallel, and counts the mentions of those.