Scans PDFs for links written in plaintext and checks whether each one is active or returns an error code.

Overview

Introduction

linkrot scans PDFs for links written in plaintext, checks whether each link is active or returns an error code, and generates a report of its findings. It can also extract references (PDF, URL, DOI, arXiv) and metadata from a PDF.

Features

  • Extract references and metadata from a given PDF.
  • Detect PDF, URL, arXiv and DOI references.
  • Check for a valid SSL certificate.
  • Find broken hyperlinks (using the -c flag).
  • Output as text or JSON (using the -j flag).
  • Extract the PDF text (using the --text flag).
  • Use as a command-line tool or Python package.
  • Work with local and online PDFs.

Installation

Grab a copy of the code with pip:

pip install linkrot

Usage

linkrot can be used to extract info from a PDF in two ways:

  • Command line/Terminal tool linkrot
  • Python library import linkrot

1. Command Line/Terminal tool

linkrot [pdf-file-or-url]

Run linkrot -h to see the help output:

linkrot -h

usage:

linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf

Extract metadata and references from a PDF, and optionally download all referenced PDFs.

Arguments

positional arguments:

pdf (Filename or URL of a PDF file)

optional arguments:

-h, --help            (Show this help message and exit)
-d OUTPUT_DIRECTORY, --download-pdfs OUTPUT_DIRECTORY
                      (Download all referenced PDFs into specified directory)
-c, --check-links     (Check for broken links)
-j, --json            (Output infos as JSON (instead of plain text))
-v, --verbose         (Print all references (instead of only PDFs))
-t, --text            (Only extract text (no metadata or references))
-o OUTPUT_FILE, --output-file OUTPUT_FILE
                      (Output to specified file instead of console)
--version             (Show program's version number and exit)
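
To make the option list above concrete, here is a minimal sketch of an argparse parser that accepts the same flags. It is an illustration only and is not taken from linkrot's own cli.py; the function name build_parser and the version string are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented linkrot options."""
    parser = argparse.ArgumentParser(
        prog="linkrot",
        description="Extract metadata and references from a PDF, "
        "and optionally download all referenced PDFs.",
    )
    parser.add_argument("pdf", help="Filename or URL of a PDF file")
    parser.add_argument("-d", "--download-pdfs", metavar="OUTPUT_DIRECTORY",
                        help="Download all referenced PDFs into specified directory")
    parser.add_argument("-c", "--check-links", action="store_true",
                        help="Check for broken links")
    parser.add_argument("-j", "--json", action="store_true",
                        help="Output infos as JSON (instead of plain text)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print all references (instead of only PDFs)")
    parser.add_argument("-t", "--text", action="store_true",
                        help="Only extract text (no metadata or references)")
    parser.add_argument("-o", "--output-file", metavar="OUTPUT_FILE",
                        help="Output to specified file instead of console)")
    # The version string is a placeholder, not linkrot's real version.
    parser.add_argument("--version", action="version", version="%(prog)s (sketch)")
    return parser


args = build_parser().parse_args(["example.pdf", "-c", "-o", "report.txt"])
print(args.pdf, args.check_links, args.output_file)
```

Flags without a value are stored as booleans, while -d and -o take the directory or file name that follows them.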

Examples

Extract text to console

linkrot https://example.com/example.pdf -t

Extract text to file

linkrot https://example.com/example.pdf -t -o pdf-text.txt

Check Links

linkrot https://example.com/example.pdf -c

2. Main Python Library

Import the library:

import linkrot

Create an instance of the linkrot class like so:

pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class

The following methods can then be used to extract specific data from the PDF:

get_metadata()

Arguments: None

Usage:

metadata = pdf.get_metadata() #pdf is the instance of the linkrot class

Return type: Dictionary

Information Provided: All metadata (including secret metadata) associated with the PDF, such as Creation date, Creator, Title, etc.

get_text()

Arguments: None

Usage:

text = pdf.get_text() #pdf is the instance of the linkrot class

Return type: String

Information Provided: The entire content of the PDF in string form.

get_references(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed 
	 values: 'pdf', 'url', 'doi', 'arxiv'. 
	 default: Provides all reference types.

sort: Whether reference should be sorted or not
      values: True or False. 
      default: Is not sorted.

Usage:

references_list = pdf.get_references() #pdf is the instance of the linkrot class

Return type: Set of linkrot.backends.Reference objects

Each linkrot.backends.Reference object has 3 member variables:
- ref: actual URL/PDF/DOI/ARXIV
- reftype: type of reference
- page: page on which it was referenced

Information Provided: All references with their corresponding type and page number.

get_references_as_dict(reftype=None, sort=False)

Arguments:

reftype: The type of reference that is needed 
	 values: 'pdf', 'url', 'doi', 'arxiv'. 
	 default: Provides all reference types.

sort: Whether reference should be sorted or not
      values: True or False. 
      default: Is not sorted.

Usage:

references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class

Return type: Dictionary with keys 'pdf', 'url', 'doi', 'arxiv' that each have a list of refs of that type.

Information Provided: All references in their corresponding type list.
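
The relationship between the set of Reference objects and the per-type dictionary can be sketched as follows. This is a stand-alone illustration, not linkrot's implementation: Ref is a hypothetical stand-in for linkrot.backends.Reference, group_by_type is an assumed name, and unlike the documented method this sketch only creates keys for types that actually occur.

```python
from collections import namedtuple

# Hypothetical stand-in with the three documented member variables.
Ref = namedtuple("Ref", ["ref", "reftype", "page"])


def group_by_type(refs, sort=False):
    """Group reference values into lists keyed by their reftype."""
    grouped = {}
    for reference in refs:
        grouped.setdefault(reference.reftype, []).append(reference.ref)
    if sort:
        for values in grouped.values():
            values.sort()
    return grouped


refs = {
    Ref("https://b.example/paper.pdf", "pdf", 2),
    Ref("https://a.example/paper.pdf", "pdf", 1),
    Ref("10.1000/xyz123", "doi", 3),
}
print(group_by_type(refs, sort=True))
```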

download_pdfs(target_dir)

Arguments:

target_dir: The path of the directory to which the reference pdfs should be downloaded 

Usage:

pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class

Return type: None

Information Provided: Downloads all the reference pdfs to specified directory.

3. Linkrot downloader functions

Import:

from linkrot.downloader import sanitize_url, get_status_code, check_refs

sanitize_url(url)

Arguments:

url: The url to be sanitized.

Usage:

new_url = sanitize_url(old_url) 

Return type: String

Information Provided: Prefixes the URL with 'http://' if it does not already have it and makes sure the URL is in UTF-8 format.
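
A minimal sketch of the kind of clean-up described for sanitize_url, assuming that "sanitizing" means decoding bytes as UTF-8 and adding a scheme when none is present. The name sanitize_url_sketch is hypothetical and the real function may differ in detail.

```python
def sanitize_url_sketch(url):
    """Hypothetical illustration: return a str URL that starts with a scheme."""
    if isinstance(url, bytes):
        # Assumption: undecodable bytes are replaced rather than raising.
        url = url.decode("utf-8", errors="replace")
    url = url.strip()
    if "://" not in url:
        url = "http://" + url
    return url


print(sanitize_url_sketch("example.com/paper.pdf"))
```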

get_status_code(url)

Arguments:

url: The url to be checked for its status. 

Usage:

status_code = get_status_code(url) 

Return type: String

Information Provided: Checks if the url is active or broken.
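
One way such a check could work is sketched below with the standard library's urllib. It is an assumption-laden illustration, not linkrot's code: status_code_sketch and its opener parameter are hypothetical, and returning the exception name for non-HTTP failures is a design choice of this sketch.

```python
import urllib.error
import urllib.request


def status_code_sketch(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status as a string, or the error name on failure."""
    try:
        with opener(url, timeout=timeout) as response:
            return str(response.status)
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404.
        return str(err.code)
    except Exception as err:
        # DNS failures, timeouts, SSL problems and similar.
        return type(err).__name__
```

Passing the opener as a parameter keeps the function testable without network access; by default it uses urllib.request.urlopen.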

check_refs(refs, verbose=True, max_threads=MAX_THREADS_DEFAULT)

Arguments:

refs: set of linkrot.backends.Reference objects
verbose: whether it should print every reference with its code or just the summary of the link checker
max_threads: number of threads for multithreading

Usage:

check_refs(pdf.get_references()) #pdf is the instance of the linkrot class

Return type: None

Information Provided: Prints references with their status code and a summary of all the broken/active links on terminal.
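
The multithreaded checking that max_threads refers to can be illustrated with a thread pool from the standard library. This is a sketch under assumptions, not the check_refs implementation: check_all, summarize and the get_status callable are hypothetical names, and the status function is passed in so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, get_status, max_threads=5):
    """Call get_status on every URL using a pool of worker threads."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map returns results in the same order as the input URLs.
        codes = list(pool.map(get_status, urls))
    return dict(zip(urls, codes))


def summarize(results):
    """Count how many URLs returned each status code."""
    summary = {}
    for code in results.values():
        summary[code] = summary.get(code, 0) + 1
    return summary


# Canned statuses stand in for real HTTP requests.
canned = {"http://a.example": "200", "http://b.example": "404"}
results = check_all(canned, canned.get, max_threads=2)
print(results, summarize(results))
```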

4. Linkrot extractor functions

Import:

from linkrot.extractor import extract_urls, extract_doi, extract_arxiv

Get pdf text:

text = pdf.get_text() #pdf is the instance of the linkrot class

extract_urls(text)

Arguments:

text: String of text to extract urls from

Usage:

urls = extract_urls(text)

Return type: Set of URLs

Information Provided: All URLs in the text
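
As a rough illustration of plaintext URL extraction, the sketch below uses a deliberately simple regular expression. The pattern and the name extract_urls_sketch are assumptions; the library's own pattern is likely more thorough (for example, this one only finds URLs that start with http:// or https://).

```python
import re

# Simplified pattern: a scheme followed by characters that are not
# whitespace, quotes, angle brackets or closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)


def extract_urls_sketch(text):
    """Return the set of http(s) URLs found in text."""
    # Trailing sentence punctuation is stripped from each match.
    return {match.rstrip(".,;:") for match in URL_PATTERN.findall(text)}


sample = "See https://example.com/a.pdf and (http://example.org/page)."
print(extract_urls_sketch(sample))
```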

extract_arxiv(text)

Arguments:

text: String of text to extract arxivs from

Usage:

arxiv = extract_arxiv(text)

Return type: Set of arxivs

Information Provided: All arxivs in the text
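
A minimal sketch of how arXiv references might be recognised, assuming only new-style identifiers of the form arXiv:YYMM.NNNNN with an optional version suffix. The pattern and the name extract_arxiv_sketch are hypothetical and are not taken from linkrot.

```python
import re

# Matches e.g. "arXiv:1706.03762v5" or "arxiv: 2101.00001".
ARXIV_PATTERN = re.compile(r"arxiv:\s?(\d{4}\.\d{4,5}(?:v\d+)?)", re.IGNORECASE)


def extract_arxiv_sketch(text):
    """Return the set of new-style arXiv identifiers found in text."""
    return set(ARXIV_PATTERN.findall(text))


print(extract_arxiv_sketch("Attention paper: arXiv:1706.03762v5"))
```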

extract_doi(text)

Arguments:

text: String of text to extract dois from

Usage:

doi = extract_doi(text)

Return type: Set of dois

Information Provided: All dois in the text
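
DOI detection can be sketched in the same style. The pattern below relies on the fact that DOIs begin with the prefix "10." followed by a registrant code and a slash; the exact expression and the name extract_doi_sketch are assumptions made for illustration.

```python
import re

# "10.", a 4 to 9 digit registrant code, "/", then the suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_doi_sketch(text):
    """Return the set of DOI strings found in text."""
    # Trailing punctuation that commonly follows a DOI in prose is stripped.
    return {match.rstrip(".,;)") for match in DOI_PATTERN.findall(text)}


print(extract_doi_sketch("doi:10.1000/xyz123."))
```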

Code of Conduct

To view our code of conduct please visit our Code of Conduct page.

License

This program is licensed under the MIT License.

Comments
  • xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 55, column 10

    Receive this error when I run the file. Traceback below. File Attached.

    Traceback (most recent call last):
      File "c:\python38\lib\runpy.py", line 193, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "c:\python38\lib\runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "C:\Python38\Scripts\linkrot.exe\__main__.py", line 7, in <module>
      File "c:\python38\lib\site-packages\linkrot\cli.py", line 182, in main
        pdf = linkrot.linkrot(args.pdf)
      File "c:\python38\lib\site-packages\linkrot\__init__.py", line 131, in __init__
        self.reader = PDFMinerBackend(self.stream)
      File "c:\python38\lib\site-packages\linkrot\backends.py", line 213, in __init__
        self.metadata.update(xmp_to_dict(metadata))
      File "c:\python38\lib\site-packages\linkrot\libs\xmp.py", line 92, in xmp_to_dict
        return XmpParser(xmp).meta
      File "c:\python38\lib\site-packages\linkrot\libs\xmp.py", line 41, in __init__
        self.tree = ET.XML(xmp)
      File "c:\python38\lib\xml\etree\ElementTree.py", line 1320, in XML
        parser.feed(text)
    xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 55, column 10

    ah-5.pdf

    bug help wanted good first issue hacktoberfest python 
    opened by marshalmiller 11
  • Remove Python 2 checks and functionality.

    Keeping support for Python 2 might be slowing down some of the process. Of more concern is that, in order to patch vulnerabilities in some libraries that Python 2 depends on, we have had to cut support for some versions of Python 3, specifically 3.6 and 3.7. Version 3.7 is still fairly widely used, and I think I'd prefer to remove Python 2 support and bring back 3.7, even though it's clearly a bigger task.

    enhancement help wanted good first issue dependencies python 
    opened by marshalmiller 10
  • Move from `requirements.txt`, `requirements_dev.txt`, `setup.cfg`, and `setup.py` to `pyproject.toml`.

    Is your feature request related to a problem? Please describe. Hey @marshalmiller. As you may already know, the use of setup.cfg, setup.py, and requirements.txt files is quite outdated. Because of PEP 517, PEP 660, and PEP 631, the packaging is now being standardized on the usage of the pyproject.toml file.

    Describe the solution you'd like Given the above info, the project packaging should add support for pyproject.toml.

    Describe alternatives you've considered Not available.

    Additional context That's pretty much it. What do you think? Also, I would like to work on this issue.

    enhancement hacktoberfest python 
    opened by wiseaidev 7
  • (Bug) AttributeError: 'NoneType' object has no attribute 'findall'

    Describe the bug Certain PDFs give Attribute Error

    To Reproduce Steps to reproduce the behavior:

    1. Download Research_Ethics.pdf
    2. Open terminal and run:
    linkrot <path_to_above_file>
    

    Expected behavior It should generate the expected linkrot report.

    Screenshots Screenshot from 2021-10-12 23-37-47

    bug help wanted hacktoberfest 
    opened by aditirao7 7
  • Add Link Archiving

    I'd like to add a feature that takes all links that are verified to be active and adds them to the Internet Archive Wayback Machine to preserve them in time. There is a draft Python script in lib called archive.py. The idea is that when you navigate to https://web.archive.org/save/{url}, the service automatically archives that page. So after verifying that a link returns a valid code, we would just connect to all of those sites and it would create a snapshot. I'd love for this to be an optional argument like -a or something. This way it is optional and we don't take more resources than we need. Anyone able to complete this task, please take a stab at it.

    enhancement help wanted good first issue hacktoberfest python 
    opened by marshalmiller 6
  • UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>.

    Receiving this error when running the file. Traceback Below. File Attached.

    > Traceback (most recent call last):
    >   File "c:\python38\lib\runpy.py", line 193, in _run_module_as_main
    >     return _run_code(code, main_globals, None,
    >   File "c:\python38\lib\runpy.py", line 86, in _run_code
    >     exec(code, run_globals)
    >   File "C:\Python38\Scripts\linkrot.exe\__main__.py", line 7, in <module>
    >   File "c:\python38\lib\site-packages\linkrot\cli.py", line 182, in main
    >     pdf = linkrot.linkrot(args.pdf)
    >   File "c:\python38\lib\site-packages\linkrot\__init__.py", line 131, in __init__
    >     self.reader = PDFMinerBackend(self.stream)
    >   File "c:\python38\lib\site-packages\linkrot\backends.py", line 204, in __init__
    >     self.metadata[k] = make_compat_str(v)
    >   File "c:\python38\lib\site-packages\linkrot\backends.py", line 67, in make_compat_str
    >     out_str = in_str.decode(enc["encoding"])
    >   File "c:\python38\lib\encodings\cp1254.py", line 15, in decode
    >     return codecs.charmap_decode(input,errors,decoding_table)
    > UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 0: character maps to <undefined>
    

    ah-1.pdf

    bug help wanted hacktoberfest python 
    opened by marshalmiller 5
  • (Update) documentation for python library usage

    The main documentation needs to be updated to include the usage of linkrot as a python library as well. Some of it can be found in the docstrings of this file.

    enhancement 
    opened by aditirao7 5
  • Separate code from data

    Is your feature request related to a problem? Please describe.

    The current size of the repo is too big because of pdf data samples:

    ➜  du -sh * | sort -h
    4.0K	CONTRIBUTING.md
    4.0K	LICENSE
    4.0K	Makefile
    4.0K	pyproject.toml
    4.0K	SECURITY.md
    8.0K	code_of_conduct.md
    8.0K	README.md
    44K	branding
    68K	linkrot
    1.7M	tests
    919M	Random PDF Samples
    

    Describe the solution you'd like I suggest either storing the pdf files in a separate repo or on a cloud provider's bucket.

    Describe alternatives you've considered Not available.

    Additional context That's pretty much it. I am currently working on this issue.

    documentation enhancement hacktoberfest 
    opened by wiseaidev 4
  • Add Link Check Results to CLI Output

    Right now, if you use the -o argument to export the results to a text file, the document metadata and the list of links are the only components listed. I would like to add the results of the link check to this output as well.

    enhancement help wanted good first issue hacktoberfest python hacktoberfest-accepted 
    opened by marshalmiller 4
  • Displays Page Number Wrong in Results

    When it returns the results of links that it tests, it gives a list of the links, along with a page number. The page number would appear to be the page the link was found on but it is actually just the total number of pages in the PDF. It would be extremely helpful if we could get it to display the correct page number.

    bug enhancement help wanted hacktoberfest python hacktoberfest-accepted 
    opened by marshalmiller 4
  • Update Tests

    The tests written for this repo were developed during the very early stages of this project. I don't think they are a great representation of where the project is now. I'd love to have them updated to be more rigorous and keep the quality of the project high.

    enhancement help wanted good first issue hacktoberfest python 
    opened by marshalmiller 2
  • Update ReadMe to Include Changes from Hacktoberfest.

    We have had a lot of great improvements already during Hacktoberfest. I will update the ReadMe with all the changes once the event is over, if not before.

    documentation enhancement hacktoberfest 
    opened by marshalmiller 3
  • Consider Replacing Threadpool with Redis

    Given the performance and timeout issues with the flask app, I am wondering if I should be replacing the current thread pool with a Redis model, as suggested by other forums and Heroku.

    https://python-rq.org/

    enhancement help wanted dependencies hacktoberfest python 
    opened by marshalmiller 2
Releases(3.9.5)
  • 3.9.5(Oct 3, 2022)

    What's Changed

    • Add test cases for detecting embedded URLs by @marwansalem in https://github.com/marshalmiller/linkrot/pull/161
    • rm Random PDF Samples by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/163
    • updated .gitignore, added mega.py, rm pdfs, cleanups by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/164
    • cleanup python 2 syntax by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/165

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.4...3.9.5

  • 3.9.4(Oct 2, 2022)

    What's Changed

    • Migrating from setup.py to pyproject.toml by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/149
    • Upgrade to PyProject by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/156
    • add missing dependencies by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/158
    • add missing cli entry point by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/157
    • handle UnicodeDecode exception by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/159

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.3...3.9.4

  • 3.9.3(Oct 2, 2022)

    What's Changed

    • Resolved Add Link Archiving #102 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/150
    • add etree xml_parser to ignore invalid tags by @wiseaidev in https://github.com/marshalmiller/linkrot/pull/155

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.2...3.9.3

  • 3.9.2(Oct 1, 2022)

    What's Changed

    • Fix the page number error, in the link checker by @ajratnam in https://github.com/marshalmiller/linkrot/pull/147
    • Add Link Check Results to CLI Output #120 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/145

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9.1...3.9.2

  • 3.9.1(Oct 1, 2022)

    What's Changed

    • Bump mypy from 0.971 to 0.981 by @dependabot in https://github.com/marshalmiller/linkrot/pull/142
    • Bump coverage from 6.4.4 to 6.5.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/143
    • Resolved Add DOIs to References Summary #128 by @mailtodanish in https://github.com/marshalmiller/linkrot/pull/144
    • Remove numpy import by @ajratnam in https://github.com/marshalmiller/linkrot/pull/146

    New Contributors

    • @mailtodanish made their first contribution in https://github.com/marshalmiller/linkrot/pull/144
    • @ajratnam made their first contribution in https://github.com/marshalmiller/linkrot/pull/146

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.9...3.9.1

  • 3.9(Sep 25, 2022)

    What's Changed

    • Bump flake8 from 5.0.3 to 5.0.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/131
    • Bump coverage from 6.4.2 to 6.4.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/132
    • Bump numpy from 1.23.1 to 1.23.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/133
    • Bump coverage from 6.4.3 to 6.4.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/134
    • Bump pylint from 2.14.5 to 2.15.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/135
    • Bump black from 22.6.0 to 22.8.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/136
    • Bump pytest from 7.1.2 to 7.1.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/137
    • Bump pylint from 2.15.0 to 2.15.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/138
    • Bump numpy from 1.23.2 to 1.23.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/139
    • Bump pylint from 2.15.2 to 2.15.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/141
    • Resolve issue130 by @westofwest in https://github.com/marshalmiller/linkrot/pull/140

    New Contributors

    • @westofwest made their first contribution in https://github.com/marshalmiller/linkrot/pull/140

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.8.8...3.9

  • 3.8.8(Aug 2, 2022)

  • 3.8.5(Aug 2, 2022)

    What's Changed

    • Bump flake8 from 5.0.1 to 5.0.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/129

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.8.4...3.8.5

  • 3.5(Jun 1, 2022)

    What's Changed

    • Bump mypy from 0.910 to 0.920 by @dependabot in https://github.com/marshalmiller/linkrot/pull/71
    • Bump mypy from 0.920 to 0.930 by @dependabot in https://github.com/marshalmiller/linkrot/pull/73
    • Bump mypy from 0.930 to 0.931 by @dependabot in https://github.com/marshalmiller/linkrot/pull/75
    • Bump mccabe from 0.6.1 to 0.7.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/76
    • Bump coverage from 6.2 to 6.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/77
    • Bump black from 21.12b0 to 22.1.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/78
    • Bump coverage from 6.3 to 6.3.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/79
    • Bump pytest from 6.2.5 to 7.0.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/80
    • Bump pytest from 7.0.0 to 7.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/81
    • Bump coverage from 6.3.1 to 6.3.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/82
    • Bump pytest from 7.0.1 to 7.1.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/84
    • Bump mypy from 0.931 to 0.940 by @dependabot in https://github.com/marshalmiller/linkrot/pull/83
    • Bump mypy from 0.940 to 0.941 by @dependabot in https://github.com/marshalmiller/linkrot/pull/85
    • Bump pytest from 7.1.0 to 7.1.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/86
    • Bump pdfminer-six from 20211012 to 20220319 by @dependabot in https://github.com/marshalmiller/linkrot/pull/87
    • Bump mypy from 0.941 to 0.942 by @dependabot in https://github.com/marshalmiller/linkrot/pull/88
    • Bump pylint from 2.12.2 to 2.13.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/89
    • Bump pylint from 2.13.0 to 2.13.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/90
    • Bump black from 22.1.0 to 22.3.0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/91
    • Bump pylint from 2.13.2 to 2.13.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/92
    • Bump pylint from 2.13.3 to 2.13.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/93
    • Bump pylint from 2.13.4 to 2.13.5 by @dependabot in https://github.com/marshalmiller/linkrot/pull/94
    • Bump pylint from 2.13.5 to 2.13.7 by @dependabot in https://github.com/marshalmiller/linkrot/pull/95
    • Bump pytest from 7.1.1 to 7.1.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/96
    • Bump mypy from 0.942 to 0.950 by @dependabot in https://github.com/marshalmiller/linkrot/pull/97
    • Bump pylint from 2.13.7 to 2.13.8 by @dependabot in https://github.com/marshalmiller/linkrot/pull/98
    • Bump pdfminer-six from 20220319 to 20220506 by @dependabot in https://github.com/marshalmiller/linkrot/pull/99
    • Bump coverage from 6.3.2 to 6.3.3 by @dependabot in https://github.com/marshalmiller/linkrot/pull/100
    • Bump pylint from 2.13.8 to 2.13.9 by @dependabot in https://github.com/marshalmiller/linkrot/pull/101
    • Bump coverage from 6.3.3 to 6.4 by @dependabot in https://github.com/marshalmiller/linkrot/pull/103
    • Bump pdfminer-six from 20220506 to 20220524 by @dependabot in https://github.com/marshalmiller/linkrot/pull/104
    • Bump mypy from 0.950 to 0.960 by @dependabot in https://github.com/marshalmiller/linkrot/pull/105
    • A fix for: Exclude Email Addresses #106 by @marwansalem in https://github.com/marshalmiller/linkrot/pull/107

    New Contributors

    • @marwansalem made their first contribution in https://github.com/marshalmiller/linkrot/pull/107

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/3.4...3.5

  • 3.4(Dec 11, 2021)

    What's Changed

    • Added documentation for library by @aditirao7 in https://github.com/marshalmiller/linkrot/pull/41
    • fix(downloader.py): change string comparison to use regex by @sousatg in https://github.com/marshalmiller/linkrot/pull/42
    • Bump flake8 from 4.0.0 to 4.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/43
    • Bump coverage from 6.0.1 to 6.0.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/44
    • Bump pdfminer-six from 20201018 to 20211012 by @dependabot in https://github.com/marshalmiller/linkrot/pull/46
    • Bring up to date by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/47
    • Replace pagenos with a safe default value by @alanyee in https://github.com/marshalmiller/linkrot/pull/48
    • Staging to Main 10-17-2021 by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/49
    • Start testing for Python 3.10 by @alanyee in https://github.com/marshalmiller/linkrot/pull/50
    • Checking the rdftree before parsing the metadata #45 by @rosdyana in https://github.com/marshalmiller/linkrot/pull/51
    • Staging by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/52
    • Bump black from 21.9b0 to 21.10b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/55
    • Bump coverage from 6.0.2 to 6.1.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/54
    • Add comments to colorprint.py by @vacom13 in https://github.com/marshalmiller/linkrot/pull/56
    • Bump coverage from 6.1.1 to 6.1.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/57
    • Bump black from 21.10b0 to 21.11b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/58
    • Add Comments to cli.py by @vacom13 in https://github.com/marshalmiller/linkrot/pull/60
    • Bump black from 21.11b0 to 21.11b1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/59
    • Bump pylint from 2.11.1 to 2.12.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/61
    • Bump coverage from 6.1.2 to 6.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/63
    • Bump black from 21.11b1 to 21.12b0 by @dependabot in https://github.com/marshalmiller/linkrot/pull/67
    • Bump pylint from 2.12.1 to 2.12.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/66

    New Contributors

    • @sousatg made their first contribution in https://github.com/marshalmiller/linkrot/pull/42
    • @alanyee made their first contribution in https://github.com/marshalmiller/linkrot/pull/48
    • @rosdyana made their first contribution in https://github.com/marshalmiller/linkrot/pull/51
    • @vacom13 made their first contribution in https://github.com/marshalmiller/linkrot/pull/56

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/2.1.1...3.4

  • 2.3(Oct 24, 2021)

    What's Changed

    • Added documentation for library by @aditirao7 in https://github.com/marshalmiller/linkrot/pull/41
    • fix(downloader.py): change string comparison to use regex by @sousatg in https://github.com/marshalmiller/linkrot/pull/42
    • Bump flake8 from 4.0.0 to 4.0.1 by @dependabot in https://github.com/marshalmiller/linkrot/pull/43
    • Bump coverage from 6.0.1 to 6.0.2 by @dependabot in https://github.com/marshalmiller/linkrot/pull/44
    • Bump pdfminer-six from 20201018 to 20211012 by @dependabot in https://github.com/marshalmiller/linkrot/pull/46
    • Bring up to date by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/47
    • Replace pagenos with a safe default value by @alanyee in https://github.com/marshalmiller/linkrot/pull/48
    • Staging to Main 10-17-2021 by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/49
    • Start testing for Python 3.10 by @alanyee in https://github.com/marshalmiller/linkrot/pull/50
    • Checking the rdftree before parsing the metadata #45 by @rosdyana in https://github.com/marshalmiller/linkrot/pull/51
    • Staging by @marshalmiller in https://github.com/marshalmiller/linkrot/pull/52

    New Contributors

    • @sousatg made their first contribution in https://github.com/marshalmiller/linkrot/pull/42
    • @alanyee made their first contribution in https://github.com/marshalmiller/linkrot/pull/48
    • @rosdyana made their first contribution in https://github.com/marshalmiller/linkrot/pull/51

    Full Changelog: https://github.com/marshalmiller/linkrot/compare/2.1.1...2.3

Owner
Marshal Miller