Scans pdfs for links written in plaintext and checks if they are active or returns an error code.

Marshal Miller

Last update: Nov 21, 2022

Introduction

linkrot scans PDFs for links written in plaintext and checks whether each one is active or returns an error code, then generates a report of its findings. It can also extract references (PDF, URL, DOI, arXiv) and metadata from a PDF.

Features

Extract references and metadata from a given PDF.
Detect PDF, URL, arXiv and DOI references.
Check for a valid SSL certificate.
Find broken hyperlinks (using the -c flag).
Output as text or JSON (using the -j flag).
Extract the PDF text (using the --text flag).
Use as a command-line tool or Python package.
Work with local and online PDFs.

Installation

Grab a copy of the code with pip:

pip install linkrot

Usage

linkrot can be used to extract info from a PDF in two ways:

- Command line/Terminal tool: linkrot
- Python library: import linkrot

1. Command Line/Terminal tool

linkrot [pdf-file-or-url]

Run linkrot -h to see the help output:

linkrot -h

usage:

linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf

Extract metadata and references from a PDF, and optionally download all referenced PDFs.

Arguments

positional arguments:

pdf (Filename or URL of a PDF file)

optional arguments:

-h, --help            Show this help message and exit
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                      Download all referenced PDFs into specified directory
-c, --check-links     Check for broken links
-j, --json            Output infos as JSON (instead of plain text)
-v, --verbose         Print all references (instead of only PDFs)
-t, --text            Only extract text (no metadata or references)
-o OUTPUT_FILE, --output-file OUTPUT_FILE
                      Output to specified file instead of console
--version             Show program's version number and exit

Examples

Extract text to console

linkrot https://example.com/example.pdf -t

Extract text to file

linkrot https://example.com/example.pdf -t -o pdf-text.txt

Check Links

linkrot https://example.com/example.pdf -c
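
When linkrot is driven from a script, the flags listed above can be assembled programmatically. The following is a minimal sketch, assuming the linkrot executable is on the PATH. `build_linkrot_command` and `run_linkrot` are hypothetical helpers, not part of linkrot, and they use only the flags shown in the help output.

```python
import subprocess


def build_linkrot_command(pdf, check_links=False, as_json=False, output_file=None):
    """Assemble an argument list for the linkrot CLI (hypothetical helper)."""
    command = ["linkrot", pdf]
    if check_links:
        command.append("-c")  # check for broken links
    if as_json:
        command.append("-j")  # JSON instead of plain text
    if output_file is not None:
        command.extend(["-o", output_file])  # write to a file instead of the console
    return command


def run_linkrot(pdf, **options):
    """Run linkrot and return the completed process; needs linkrot installed."""
    return subprocess.run(
        build_linkrot_command(pdf, **options),
        capture_output=True,
        text=True,
        check=False,
    )


print(build_linkrot_command("example.pdf", check_links=True, output_file="report.txt"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.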

2. Main Python Library

Import the library:

import linkrot

Create an instance of the linkrot class like so:

pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class

Now the following functions can be used to extract specific data from the PDF:

get_metadata()

Arguments: None

Usage:

metadata = pdf.get_metadata() #pdf is the instance of the linkrot class

Return type: Dictionary

Information Provided: All metadata stored in the PDF, including hidden metadata, such as creation date, creator, title, etc.
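
Because the result is a plain dictionary, it can be read with ordinary dictionary operations. The sketch below assumes key names such as "Title", "Creator" and "CreationDate"; the keys actually present depend on the PDF, so the hypothetical helper `summarize_metadata` tolerates missing ones. A stand-in dictionary is used in place of a real `pdf.get_metadata()` result.

```python
def summarize_metadata(metadata, fields=("Title", "Creator", "CreationDate")):
    """Return 'field: value' lines, marking fields the PDF does not provide."""
    lines = []
    for field in fields:
        value = metadata.get(field, "<not present>")
        lines.append(f"{field}: {value}")
    return lines


# Stand-in for the dictionary returned by pdf.get_metadata()
example_metadata = {"Title": "Sample Paper", "Creator": "LaTeX"}
for line in summarize_metadata(example_metadata):
    print(line)
```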

get_text()

Arguments: None

Usage:

text = pdf.get_text() #pdf is the instance of the linkrot class

Return type: String

Information Provided: The entire content of the PDF in string form.

get_references(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed 
	 values: 'pdf', 'url', 'doi', 'arxiv'. 
	 default: Provides all reference types.

sort: Whether reference should be sorted or not
      values: True or False. 
      default: Is not sorted.

Usage:

references_list = pdf.get_references() #pdf is the instance of the linkrot class

Return type: Set of linkrot.backends.Reference objects

Each linkrot.backends.Reference object has 3 member variables:
- ref: actual URL/PDF/DOI/ARXIV
- reftype: type of reference
- page: page on which it was referenced

Information Provided: All references with their corresponding type and page number.
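
Since each reference carries its page number, the set can be regrouped by page. The sketch below uses a stand-in namedtuple with the three member variables listed above, so it runs without a PDF; in real use the set would come from `pdf.get_references()`. `group_by_page` is a hypothetical helper, not part of linkrot.

```python
from collections import defaultdict, namedtuple

# Stand-in with the three documented member variables (ref, reftype, page)
Reference = namedtuple("Reference", ["ref", "reftype", "page"])


def group_by_page(references):
    """Map each page number to a sorted list of the references found on it."""
    pages = defaultdict(list)
    for reference in references:
        pages[reference.page].append(reference.ref)
    return {page: sorted(refs) for page, refs in sorted(pages.items())}


example_references = {
    Reference("https://a.example", "url", 2),
    Reference("10.1000/x", "doi", 1),
    Reference("https://b.example", "url", 2),
}
print(group_by_page(example_references))
```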

get_references_as_dict(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed 
	 values: 'pdf', 'url', 'doi', 'arxiv'. 
	 default: Provides all reference types.

sort: Whether reference should be sorted or not
      values: True or False. 
      default: Is not sorted.

Usage:

references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class

Return type: Dictionary with keys 'pdf', 'url', 'doi', 'arxiv' that each have a list of refs of that type.

Information Provided: All references in their corresponding type list.
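
The dictionary form is convenient for a quick per-type summary. A minimal sketch, using a stand-in dictionary shaped like the documented return value; `count_references` is a hypothetical helper, not part of linkrot.

```python
def count_references(references_dict):
    """Return the number of references found for each reference type."""
    return {reftype: len(refs) for reftype, refs in references_dict.items()}


# Stand-in for the dictionary returned by pdf.get_references_as_dict()
example_references = {
    "pdf": ["https://example.com/paper.pdf"],
    "url": ["https://example.com", "https://example.org"],
    "doi": [],
    "arxiv": [],
}
for reftype, count in count_references(example_references).items():
    print(f"{reftype}: {count}")
```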

download_pdfs(target_dir)

Arguments:

target_dir: The path of the directory to which the reference pdfs should be downloaded

Usage:

pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class

Return type: None

Information Provided: Downloads all the reference pdfs to specified directory.

3. Linkrot downloader functions

Import:

from linkrot.downloader import sanitize_url, get_status_code, check_refs

sanitize_url(url)

Arguments:

url: The url to be sanitized.

Usage:

new_url = sanitize_url(old_url)

Return type: String

Information Provided: The URL, prefixed with 'http://' if it did not already have a prefix, and encoded as UTF-8.
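
The behavior described above can be illustrated with a small standalone function. This is only a sketch of the idea, not linkrot's implementation: `sanitize_url_sketch` is a hypothetical name, and treating "no '://' in the string" as "no prefix" is an assumption.

```python
def sanitize_url_sketch(url):
    """Illustrative stand-in: add 'http://' when no scheme is present."""
    # Assumption: a URL without '://' has no scheme and needs the prefix
    if "://" not in url:
        url = "http://" + url
    # Round-trip through UTF-8, replacing anything that cannot be encoded
    return url.encode("utf-8", errors="replace").decode("utf-8")


print(sanitize_url_sketch("example.com/paper.pdf"))
print(sanitize_url_sketch("https://example.com"))
```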

get_status_code(url)

Arguments:

url: The url to be checked for its status.

Usage:

status_code = get_status_code(url)

Return type: String

Information Provided: The status of the URL, indicating whether it is active or broken.

check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)

Arguments:

refs: set of linkrot.backends.Reference objects
verbose: whether it should print every reference with its code or just the summary of the link checker
max_threads: number of threads for multithreading

Usage:

check_refs(pdf.get_references()) #pdf is the instance of the linkrot class

Return type: None

Information Provided: Prints references with their status code and a summary of all the broken/active links on terminal.
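
The general pattern of checking many links with a bounded number of threads can be sketched with the standard library. This is not linkrot's code: `check_urls` and `summarize_results` are hypothetical names, the default of 5 threads is an assumption (the value of MAX_THREADS_DEFAULT is not stated here), and treating only 200 as active is also an assumption. The status function is passed in, so `get_status_code` could be supplied in real use; a lookup table stands in for it below to avoid network access.

```python
from concurrent.futures import ThreadPoolExecutor


def check_urls(urls, get_status, max_threads=5):
    """Map each URL to get_status(url), running the checks in a thread pool."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        statuses = list(pool.map(get_status, urls))
    return dict(zip(urls, statuses))


def summarize_results(results, ok_codes=(200,)):
    """Split checked URLs into active and broken lists (assumed rule)."""
    active = sorted(url for url, code in results.items() if code in ok_codes)
    broken = sorted(url for url, code in results.items() if code not in ok_codes)
    return {"active": active, "broken": broken}


# Stand-in status lookup so the example runs offline
fake_statuses = {"http://a.example": 200, "http://b.example": 404}
results = check_urls(fake_statuses, fake_statuses.get, max_threads=2)
print(summarize_results(results))
```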

4. Linkrot extractor functions

Import:

from linkrot.extractor import extract_urls, extract_doi, extract_arxiv

Get pdf text:

text = pdf.get_text() #pdf is the instance of the linkrot class

extract_urls(text)

Arguments:

text: String of text to extract urls from

Usage:

urls = extract_urls(text)

Return type: Set of URLs

Information Provided: All URLs in the text

extract_arxiv(text)

Arguments:

text: String of text to extract arxivs from

Usage:

arxiv = extract_arxiv(text)

Return type: Set of arxivs

Information Provided: All arxivs in the text

extract_doi(text)

Arguments:

text: String of text to extract dois from

Usage:

doi = extract_doi(text)

Return type: Set of dois

Information Provided: All dois in the text
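
To give a feel for what pattern-based extraction of this kind involves, here is a simplified sketch using regular expressions. The patterns are illustrative assumptions, not the ones linkrot uses, and they will miss or over-match some real-world identifiers; `find_dois` and `find_arxiv_ids` are hypothetical names.

```python
import re

# Simplified patterns for illustration only
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')
ARXIV_PATTERN = re.compile(r"arXiv:\s?(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def find_dois(text):
    """Return DOI-like strings, with trailing punctuation stripped."""
    return {match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)}


def find_arxiv_ids(text):
    """Return arXiv identifiers written in the 'arXiv:NNNN.NNNNN' form."""
    return set(ARXIV_PATTERN.findall(text))


sample = "See doi:10.1000/xyz123, and arXiv:1706.03762v5."
print(find_dois(sample))
print(find_arxiv_ids(sample))
```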

Code of Conduct

To view our code of conduct please visit our Code of Conduct page.

License

This program is licensed under the MIT License.

Comments

xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 55, column 10

Receive this error when I run the file. Traceback below. File Attached.

> Traceback (most recent call last):
>   File "c:\python38\lib\runpy.py", line 193, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "c:\python38\lib\runpy.py", line 86, in _run_code
>     exec(code, run_globals)
>   File "C:\Python38\Scripts\linkrot.exe\__main__.py", line 7, in <module>
>   File "c:\python38\lib\site-packages\linkrot\cli.py", line 182, in main
>     pdf = linkrot.linkrot(args.pdf)
>   File "c:\python38\lib\site-packages\linkrot\__init__.py", line 131, in __init__
>     self.reader = PDFMinerBackend(self.stream)
>   File "c:\python38\lib\site-packages\linkrot\backends.py", line 213, in __init__
>     self.metadata.update(xmp_to_dict(metadata))
>   File "c:\python38\lib\site-packages\linkrot\libs\xmp.py", line 92, in xmp_to_dict
>     return XmpParser(xmp).meta
>   File "c:\python38\lib\site-packages\linkrot\libs\xmp.py", line 41, in __init__
>     self.tree = ET.XML(xmp)
>   File "c:\python38\lib\xml\etree\ElementTree.py", line 1320, in XML
>     parser.feed(text)
> xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 55, column 10

ah-5.pdf
bug help wanted good first issue hacktoberfest python

opened by marshalmiller 11
Remove Python 2 checks and functionality.

Keeping support for Python 2 might be slowing down some of the process. Of more concern is that, in order to patch vulnerabilities in some libraries that Python 2 depends on, we have had to cut support for some versions of Python 3, specifically 3.6 and 3.7. Python 3.7 is still fairly widely used, so I think I'd prefer to remove Python 2 support and bring back 3.7, even though it's clearly a bigger task.
enhancement help wanted good first issue dependencies python

opened by marshalmiller 10
Move from `requirements.txt`, `requirements_dev.txt`, `setup.cfg`, and `setup.py` to `pyproject.toml`.

Is your feature request related to a problem? Please describe.
Hey @marshalmiller. As you may already know, the use of setup.cfg, setup.py, and requirements.txt files is quite outdated. Because of PEP 517, PEP 660, and PEP 631, the packaging is now being standardized on the usage of the pyproject.toml file.

Describe the solution you'd like
Given the above info, the project packaging should add support for pyproject.toml.

Describe alternatives you've considered
Not available.

Additional context
That's pretty much it. What do you think? Also, I would like to work on this issue.
enhancement hacktoberfest python

opened by wiseaidev 7
(Bug) AttributeError: 'NoneType' object has no attribute 'findall'
Describe the bug
Certain PDFs raise an AttributeError.

To Reproduce
Steps to reproduce the behavior:

Download Research_Ethics.pdf

Open terminal and run:

linkrot <path_to_above_file>

Expected behavior
It should generate the expected linkrot report.
bug help wanted hacktoberfest
opened by aditirao7 7
Add Link Archiving

I'd like to add a feature that takes all links that are verified to be active and add them to the Internet Archive Wayback Machine to preserve them in time. There is a draft python script in lib called archive.py. The idea is that you navigate to https://web.archive.org/save/{url} the service automatically archives that page. So after verifying that it returns a valid code, we would just connect to all of those sites and it would create a snapshot. I'd love for this to be an optional argument like -a or something. This way it is optional and we don't take more resources than we need. Anyone able to complete this task, please take a stab at it.
enhancement help wanted good first issue hacktoberfest python

opened by marshalmiller 6
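
The archiving idea described in the issue above, requesting https://web.archive.org/save/{url} for each link already verified as active, can be sketched as follows. This is an illustration of the idea only, not the archive.py draft mentioned in the issue. `wayback_save_url` and `archive` are hypothetical names, and the User-Agent string and timeout are assumptions.

```python
from urllib.request import Request, urlopen

WAYBACK_SAVE_PREFIX = "https://web.archive.org/save/"


def wayback_save_url(url):
    """Build the Wayback Machine 'save' address for a URL."""
    return WAYBACK_SAVE_PREFIX + url


def archive(url, timeout=30):
    """Request a snapshot and return the HTTP status; needs network access."""
    request = Request(
        wayback_save_url(url),
        headers={"User-Agent": "linkrot-archive-sketch"},  # assumed header value
    )
    with urlopen(request, timeout=timeout) as response:
        return response.status


print(wayback_save_url("https://example.com/page"))
```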

UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>

Receiving this error when running the file. Traceback Below. File Attached.

> Traceback (most recent call last):
>   File "c:\python38\lib\runpy.py", line 193, in _run_module_as_main
>     return _run_code(code, main_globals, None,
>   File "c:\python38\lib\runpy.py", line 86, in _run_code
>     exec(code, run_globals)
>   File "C:\Python38\Scripts\linkrot.exe\__main__.py", line 7, in <module>
>   File "c:\python38\lib\site-packages\linkrot\cli.py", line 182, in main
>     pdf = linkrot.linkrot(args.pdf)
>   File "c:\python38\lib\site-packages\linkrot\__init__.py", line 131, in __init__
>     self.reader = PDFMinerBackend(self.stream)
>   File "c:\python38\lib\site-packages\linkrot\backends.py", line 204, in __init__
>     self.metadata[k] = make_compat_str(v)
>   File "c:\python38\lib\site-packages\linkrot\backends.py", line 67, in make_compat_str
>     out_str = in_str.decode(enc["encoding"])
>   File "c:\python38\lib\encodings\cp1254.py", line 15, in decode
>     return codecs.charmap_decode(input,errors,decoding_table)
> UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>

ah-1.pdf

bug help wanted hacktoberfest python

opened by marshalmiller 5

(Update) documentation for python library usage

The main documentation needs to be updated to include the usage of linkrot as a python library as well. Some of it can be found in the docstrings of this file.
enhancement

opened by aditirao7 5
Separate code from data
Is your feature request related to a problem? Please describe.

The current size of the repo is too big because of pdf data samples:

➜ du -sh * | sort -h
4.0K CONTRIBUTING.md
4.0K LICENSE
4.0K Makefile
4.0K pyproject.toml
4.0K SECURITY.md
8.0K code_of_conduct.md
8.0K README.md
44K branding
68K linkrot
1.7M tests
919M Random PDF Samples

Describe the solution you'd like I suggest either storing the pdf files in a separate repo or on a cloud provider's bucket.

Describe alternatives you've considered Not available.

Additional context That's pretty much it. I am currently working on this issue.
documentation enhancement hacktoberfest
opened by wiseaidev 4
Add Link Check Results to CLI Output

Right now, if you use the -o argument to export the results to a text file, the document metadata and the list of links are the only components listed. I would like to add the results of the link check to this output as well.
enhancement help wanted good first issue hacktoberfest python hacktoberfest-accepted

opened by marshalmiller 4
Displays Page Number Wrong in Results

When it returns the results of links that it tests, it gives a list of the links, along with a page number. The page number would appear to be the page the link was found on but it is actually just the total number of pages in the PDF. It would be extremely helpful if we could get it to display the correct page number.
bug enhancement help wanted hacktoberfest python hacktoberfest-accepted

opened by marshalmiller 4
Update Tests

The tests written for this repo were developed during the very early stages of this project. I don't think they are a great representation of where the project is now. I'd love to have them updated to be more rigorous and keep the quality of the project high.
enhancement help wanted good first issue hacktoberfest python

opened by marshalmiller 2
Update ReadMe to Include Changes from Hacktoberfest.

We have had a lot of great improvements already during Hacktoberfest. I will update the ReadMe with all the changes once the event is over, if not before.
documentation enhancement hacktoberfest

opened by marshalmiller 3
Consider Replacing Threadpool with Redis

Given the performance and timeout issues with the flask app, I am wondering if I should be replacing the current thread pool with a Redis model, as suggested by other forums and Heroku.

https://python-rq.org/
enhancement help wanted dependencies hacktoberfest python

opened by marshalmiller 2

Releases(3.9.5)

3.9.5(Oct 3, 2022)
What's Changed

Add test cases for detecting embedded URLs by @marwansalem in https://github.com/marshalmiller/linkrot/pull/161

rm Random PDF Samples by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/163

updated .gitignore, added mega.py, rm pdfs, cleanups by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/164

cleanup python 2 syntax by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/165

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.4...3.9.5
3.9.4(Oct 2, 2022)
What's Changed

Migrating from setup.py to pyproject.toml by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/149

Upgrade to PyProject by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/156

add missing dependencies by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/158

add missing cli entry point by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/157

handle UnicodeDecode exception by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/159

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.3...3.9.4
3.9.3(Oct 2, 2022)
What's Changed

Resolved Add Link Archiving #102 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/150

add etree xml_parser to ignore invalid tags by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/155

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.2...3.9.3
3.9.2(Oct 1, 2022)
What's Changed

Fix the page number error, in the link checker by @ajratnam in https://github.com/marshalmiller/linkrot/pull/147

Add Link Check Results to CLI Output #120 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/145

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.1...3.9.2
3.9.1(Oct 1, 2022)
What's Changed

Bump mypy from 0.971 to 0.981 by @dependabot in https://github.com/marshalmiller/linkrot/pull/142

Bump coverage from 6.4.4 to 6.5.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/143

Resolved Add DOIs to References Summary #128 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/144

Remove numpy import by @ajratnam in https://github.com/marshalmiller/linkrot/pull/146

New Contributors

@mailtodanish made their first contribution in https://github.com/marshalmiller/linkrot/pull/144

@ajratnam made their first contribution in https://github.com/marshalmiller/linkrot/pull/146

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9...3.9.1
3.9(Sep 25, 2022)
What's Changed

Bump flake8 from 5.0.3 to 5.0.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/131

Bump coverage from 6.4.2 to 6.4.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/132

Bump numpy from 1.23.1 to 1.23.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/133

Bump coverage from 6.4.3 to 6.4.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/134

Bump pylint from 2.14.5 to 2.15.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/135

Bump black from 22.6.0 to 22.8.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/136

Bump pytest from 7.1.2 to 7.1.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/137

Bump pylint from 2.15.0 to 2.15.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/138

Bump numpy from 1.23.2 to 1.23.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/139

Bump pylint from 2.15.2 to 2.15.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/141

Resolve issue130 by @westofwest in https://github.com/marshalmiller/linkrot/pull/140

New Contributors

@westofwest made their first contribution in https://github.com/marshalmiller/linkrot/pull/140

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.8.8...3.9
3.8.8(Aug 2, 2022)

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.8.7...3.8.8
3.8.5(Aug 2, 2022)
What's Changed

Bump flake8 from 5.0.1 to 5.0.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/129

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.8.4...3.8.5
3.5(Jun 1, 2022)
What's Changed

Bump mypy from 0.910 to 0.920 by @dependabot in https://github.com/marshalmiller/linkrot/pull/71

Bump mypy from 0.920 to 0.930 by @dependabot in https://github.com/marshalmiller/linkrot/pull/73

Bump mypy from 0.930 to 0.931 by @dependabot in https://github.com/marshalmiller/linkrot/pull/75

Bump mccabe from 0.6.1 to 0.7.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/76

Bump coverage from 6.2 to 6.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/77

Bump black from 21.12b0 to 22.1.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/78

Bump coverage from 6.3 to 6.3.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/79

Bump pytest from 6.2.5 to 7.0.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/80

Bump pytest from 7.0.0 to 7.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/81

Bump coverage from 6.3.1 to 6.3.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/82

Bump pytest from 7.0.1 to 7.1.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/84

Bump mypy from 0.931 to 0.940 by @dependabot in https://github.com/marshalmiller/linkrot/pull/83

Bump mypy from 0.940 to 0.941 by @dependabot in https://github.com/marshalmiller/linkrot/pull/85

Bump pytest from 7.1.0 to 7.1.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/86

Bump pdfminer-six from 20211012 to 20220319 by @dependabot in https://github.com/marshalmiller/linkrot/pull/87

Bump mypy from 0.941 to 0.942 by @dependabot in https://github.com/marshalmiller/linkrot/pull/88

Bump pylint from 2.12.2 to 2.13.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/89

Bump pylint from 2.13.0 to 2.13.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/90

Bump black from 22.1.0 to 22.3.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/91

Bump pylint from 2.13.2 to 2.13.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/92

Bump pylint from 2.13.3 to 2.13.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/93

Bump pylint from 2.13.4 to 2.13.5 by @dependabot in https://github.com/marshalmiller/linkrot/pull/94

Bump pylint from 2.13.5 to 2.13.7 by @dependabot in https://github.com/marshalmiller/linkrot/pull/95

Bump pytest from 7.1.1 to 7.1.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/96

Bump mypy from 0.942 to 0.950 by @dependabot in https://github.com/marshalmiller/linkrot/pull/97

Bump pylint from 2.13.7 to 2.13.8 by @dependabot in https://github.com/marshalmiller/linkrot/pull/98

Bump pdfminer-six from 20220319 to 20220506 by @dependabot in https://github.com/marshalmiller/linkrot/pull/99

Bump coverage from 6.3.2 to 6.3.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/100

Bump pylint from 2.13.8 to 2.13.9 by @dependabot in https://github.com/marshalmiller/linkrot/pull/101

Bump coverage from 6.3.3 to 6.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/103

Bump pdfminer-six from 20220506 to 20220524 by @dependabot in https://github.com/marshalmiller/linkrot/pull/104

Bump mypy from 0.950 to 0.960 by @dependabot in https://github.com/marshalmiller/linkrot/pull/105

A fix for: Exclude Email Addresses #106 by @marwansalem in https://github.com/marshalmiller/linkrot/pull/107

New Contributors

@marwansalem made their first contribution in https://github.com/marshalmiller/linkrot/pull/107

Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.4...3.5
3.4(Dec 11, 2021)
What's Changed

Added documentation for library by @aditirao7 in https://github.com/marshalmiller/linkrot/pull/41

fix(downloader.py): change string comparison to use regex by @sousatg in https://github.com/marshalmiller/linkrot/pull/42

Bump flake8 from 4.0.0 to 4.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/43

Bump coverage from 6.0.1 to 6.0.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/44

Bump pdfminer-six from 20201018 to 20211012 by @dependabot in https://github.com/marshalmiller/linkrot/pull/46

Bring up to date by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/47

Replace pagenos with a safe default value by @alanyee in https://github.com/marshalmiller/linkrot/pull/48

Staging to Main 10-17-2021 by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/49

Start testing for Python 3.10 by @alanyee in https://github.com/marshalmiller/linkrot/pull/50

Checking the rdftree before parsing the metadata #45 by @rosdyana in https://github.com/marshalmiller/linkrot/pull/51

Staging by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/52

Bump black from 21.9b0 to 21.10b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/55

Bump coverage from 6.0.2 to 6.1.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/54

Add comments to colorprint.py by @vacom13 in https://github.com/marshalmiller/linkrot/pull/56

Bump coverage from 6.1.1 to 6.1.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/57

Bump black from 21.10b0 to 21.11b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/58

Add Comments to cli.py by @vacom13 in https://github.com/marshalmiller/linkrot/pull/60

Bump black from 21.11b0 to 21.11b1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/59

Bump pylint from 2.11.1 to 2.12.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/61

Bump coverage from 6.1.2 to 6.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/63

Bump black from 21.11b1 to 21.12b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/67

Bump pylint from 2.12.1 to 2.12.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/66

New Contributors

@sousatg made their first contribution in https://github.com/marshalmiller/linkrot/pull/42

@alanyee made their first contribution in https://github.com/marshalmiller/linkrot/pull/48

@rosdyana made their first contribution in https://github.com/marshalmiller/linkrot/pull/51

@vacom13 made their first contribution in https://github.com/marshalmiller/linkrot/pull/56

Full Changelog: https://github.com/marshalmiller/linkrot/compare/2.1.1...3.4
2.3(Oct 24, 2021)
What's Changed

Added documentation for library by @aditirao7 in https://github.com/marshalmiller/linkrot/pull/41

fix(downloader.py): change string comparison to use regex by @sousatg in https://github.com/marshalmiller/linkrot/pull/42

Bump flake8 from 4.0.0 to 4.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/43

Bump coverage from 6.0.1 to 6.0.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/44

Bump pdfminer-six from 20201018 to 20211012 by @dependabot in https://github.com/marshalmiller/linkrot/pull/46

Bring up to date by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/47

Replace pagenos with a safe default value by @alanyee in https://github.com/marshalmiller/linkrot/pull/48

Staging to Main 10-17-2021 by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/49

Start testing for Python 3.10 by @alanyee in https://github.com/marshalmiller/linkrot/pull/50

Checking the rdftree before parsing the metadata #45 by @rosdyana in https://github.com/marshalmiller/linkrot/pull/51

Staging by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/52

New Contributors

@sousatg made their first contribution in https://github.com/marshalmiller/linkrot/pull/42

@alanyee made their first contribution in https://github.com/marshalmiller/linkrot/pull/48

@rosdyana made their first contribution in https://github.com/marshalmiller/linkrot/pull/51

Full Changelog: https://github.com/marshalmiller/linkrot/compare/2.1.1...2.3

