This tool crawls a list of websites and downloads all PDF and office documents

AccessibilityLU

Last update: Sep 30, 2022

Overview

simplA11yPDFCrawler

simplA11yReport is a tool supporting the simplified accessibility monitoring method described in Commission Implementing Decision (EU) 2018/1524. It is used by SIP (Information and Press Service) in Luxembourg to monitor the websites of public sector bodies.

This tool crawls a list of websites and downloads all PDF and office documents. It then analyses the PDF documents and tries to detect accessibility issues. The generated files can be used by the tool simplA11yGenReport to give an overview of the state of document accessibility on the monitored websites.

Most of the accessibility reports (in French) published by SIP on data.public.lu have been generated using simplA11yGenReport and data coming from this tool.

Accessibility Tests

On all PDF files we execute the following tests:

name	description	WCAG SC	WCAG technique	EN 301 549
EmptyText	Does the file contain text, or only images (for example a scanned document)?	1.4.5 Images of Text (AA)?	PDF7	10.1.4.5
Tagged	Is the document tagged?
Protected	Is the document protected in a way that blocks screen readers?
hasTitle	Does the document have a title?	2.4.2 Page Titled (A)	PDF18	10.2.4.2
hasLang	Does the document have a default language?	3.1.1 Language of Page (A)	PDF16	10.3.1.1
hasBookmarks	Does the document have bookmarks?	2.4.1 Bypass Blocks (A)		10.2.4.1
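
To make the table more concrete, here is a minimal sketch of how four of these tests could be expressed against the standard PDF catalog and document information entries (/MarkInfo, /Lang, /Outlines, /Title). It is not the tool's actual implementation: the function name check_catalog is hypothetical, and the two dictionaries are assumed to have been extracted beforehand by a PDF library such as pikepdf.

```python
from typing import Any, Dict, Mapping


def check_catalog(catalog: Mapping[str, Any], info: Mapping[str, Any]) -> Dict[str, bool]:
    """Evaluate a few document-level tests on already-extracted PDF dictionaries.

    `catalog` stands for the document catalog and `info` for the document
    information dictionary. Both are plain mappings here (an assumption made
    so the sketch runs without any PDF library).
    """
    mark_info = catalog.get("/MarkInfo") or {}
    outlines = catalog.get("/Outlines") or {}
    return {
        # A tagged PDF declares /Marked true in its /MarkInfo dictionary.
        "Tagged": bool(mark_info.get("/Marked", False)),
        # The title is looked up in the information dictionary only.
        "hasTitle": bool(str(info.get("/Title", "")).strip()),
        # The default language is the /Lang entry of the catalog.
        "hasLang": bool(str(catalog.get("/Lang", "")).strip()),
        # An outline with at least one item has a /First entry.
        "hasBookmarks": "/First" in outlines,
    }


if __name__ == "__main__":
    sample_catalog = {"/MarkInfo": {"/Marked": True}, "/Lang": "fr-LU"}
    sample_info = {"/Title": "Rapport annuel"}
    for name, passed in check_catalog(sample_catalog, sample_info).items():
        print(f"{name}: {'ok' if passed else 'issue'}")
```

EmptyText and Protected are left out because they depend on page content and encryption settings rather than on these two dictionaries.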

Installation

git clone https://github.com/accessibility-luxembourg/simplA11yPDFCrawler.git
cd simplA11yPDFCrawler
npm install
pip install -r requirements.txt
mkdir crawled_files ; mkdir out 
chmod a+x *.sh

Usage

To be able to use this tool, you need a list of websites to crawl. Store this list in a file named list-sites.txt, one domain per line (without protocol and without path). Example of content for this file:

test.public.lu
etat.public.lu
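
If the list starts out as full URLs, a small helper can reduce each entry to the bare domain expected in list-sites.txt. This is only a sketch: the function names to_domain and normalise_sites are hypothetical and not part of the tool.

```python
from urllib.parse import urlsplit


def to_domain(entry: str) -> str:
    """Reduce a URL-like entry to a bare host name, or '' if none is found."""
    entry = entry.strip()
    if not entry:
        return ""
    # urlsplit only recognises the host when the string has a scheme or
    # starts with '//', so bare domains get a '//' prefix first.
    if "://" not in entry:
        entry = "//" + entry
    return urlsplit(entry).hostname or ""


def normalise_sites(lines):
    """Return the unique domains in their original order."""
    domains = []
    for line in lines:
        domain = to_domain(line)
        if domain and domain not in domains:
            domains.append(domain)
    return domains


if __name__ == "__main__":
    raw = ["https://test.public.lu/fr/index.html", "etat.public.lu", "", "ETAT.public.lu/"]
    print("\n".join(normalise_sites(raw)))
```

The output of normalise_sites can be written to list-sites.txt, one domain per line.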

Then the tool is used in two steps:

Crawl all the files. Run crawl.sh. It crawls all the sites listed in list-sites.txt. Each site is crawled for a maximum of 4 hours (this can be adjusted in crawl.sh). The resulting files are placed in the crawled_files folder. This step can take quite a long time.
Analyse the files and detect accessibility issues. Run analyse.sh. The resulting files are placed in the out folder.
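
Between the two steps it can be useful to check what the crawl produced. The sketch below counts the PDF files per site, assuming that crawled_files contains one sub-folder per domain (the layout suggested by paths such as ./crawled_files/apprenticeship.gov/29_cf_30_regs_only.pdf). The function name pdfs_per_site is hypothetical.

```python
from pathlib import Path


def pdfs_per_site(root):
    """Map each site folder under `root` to its number of PDF files."""
    counts = {}
    for site_dir in sorted(Path(root).iterdir()):
        if site_dir.is_dir():
            counts[site_dir.name] = sum(
                1 for path in site_dir.iterdir() if path.suffix.lower() == ".pdf"
            )
    return counts


if __name__ == "__main__":
    root = Path("crawled_files")
    if root.is_dir():
        for site, count in pdfs_per_site(root).items():
            print(f"{site}: {count} PDF file(s)")
```

A site with a count of 0 matches the situation where find reports that no *.pdf file exists for that folder.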

License

This software is developed by the Information and Press Service of the Luxembourg government and licensed under the MIT license.

Comments

Include incremental update mode

It would be interesting to be able to launch the crawler several times on the same site, detect new files and analyse only the accessibility of these new files.
enhancement

opened by AlainVagner 0
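
One possible approach to this request, sketched under assumptions: keep a JSON manifest of content hashes between runs and report only the files whose hash is new or changed. The manifest file and the function names are hypothetical and nothing like them exists in the tool today.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's content."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def find_new_files(root, manifest_path):
    """Return files under `root` that are new or changed since the last call.

    The manifest maps relative paths to digests and is rewritten on each call.
    It should live outside `root`, otherwise it would be listed as well.
    """
    root = Path(root)
    manifest_path = Path(manifest_path)
    known = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    new_files = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        key = path.relative_to(root).as_posix()
        digest = file_digest(path)
        if known.get(key) != digest:
            new_files.append(key)
            known[key] = digest
    manifest_path.write_text(json.dumps(known, indent=2))
    return new_files
```

Only the returned paths would then need to be passed to the analysis step.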

ModuleNotFoundError: No module named 'pikepdf'

The crawl.sh script seemed to work just fine. Was able to scrape a good list of other documents this way.

Trying to run the analysis wasn't so good:

% ./analyse.sh 
find: ./crawled_files/10x.gsa.gov/*.pdf: No such file or directory
./crawled_files/apprenticeship.gov/2021%20Apprenticeship%20Mailer.pdf
Traceback (most recent call last):
  File "/Users/mgifford/Documents/GitHub/simplA11yPDFCrawler/./pdfCheck.py", line 1, in <module>
    from pikepdf import Pdf, String, _qpdf
ModuleNotFoundError: No module named 'pikepdf'
./crawled_files/apprenticeship.gov/29_cf_30_regs_only.pdf
Traceback (most recent call last):
  File "/Users/mgifford/Documents/GitHub/simplA11yPDFCrawler/./pdfCheck.py", line 1, in <module>
    from pikepdf import Pdf, String, _qpdf
ModuleNotFoundError: No module named 'pikepdf'

I'm running on a Mac, but didn't think that would be a problem:

% pip3 install pikepdf
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
Requirement already satisfied: pikepdf in /usr/local/lib/python3.9/site-packages (4.2.0)
Requirement already satisfied: lxml>=4.0 in /usr/local/lib/python3.9/site-packages (from pikepdf) (4.6.3)
Requirement already satisfied: Pillow>=6.0 in /usr/local/lib/python3.9/site-packages (from pikepdf) (8.4.0)
DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621

I've tried installing pikepdf on its own with pip & pip3.

I notice that pikepdf is in the requirements.txt

Not sure if this is a problem at my end or not.

I did cut short my crawl.sh as I seemed to be getting a lot more errors. Anyways, don't think that's the cause of this. It is finding lots of files in the directory.

opened by mgifford 0
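
A common cause of this kind of error is that pip3 installs into one Python installation while the script runs under another. Whether that applies here is an assumption. The following sketch shows which interpreter is running and where it finds a module; the function name module_report is hypothetical.

```python
import importlib.util
import sys


def module_report(name: str) -> str:
    """Describe where the running interpreter finds a top-level module."""
    spec = importlib.util.find_spec(name)
    where = spec.origin if spec else "NOT FOUND"
    return f"{sys.executable}: {name} -> {where}"


if __name__ == "__main__":
    print(module_report("pikepdf"))
```

Run it with the same interpreter that launches pdfCheck.py. If the module is reported as not found, installing with `python3 -m pip install -r requirements.txt`, using that same interpreter, ties pip to it.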

Integrate VeraPDF

When this code was developed, we were not aware of the VeraPDF tool. It could be integrated for our accessibility tests (PDF/UA-1).
enhancement

opened by AlainVagner 0

Set exempt date as an environment variable

In some countries, not all PDF files have to be compliant: files published before a given date are exempt. This date is currently hard-coded for Luxembourg; it should be a parameter.
enhancement

opened by AlainVagner 0
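
A minimal sketch of what this could look like, assuming an environment variable named EXEMPT_DATE that holds an ISO date (YYYY-MM-DD). The variable name and both function names are hypothetical, and the date in the example is made up.

```python
import os
from datetime import date


def exempt_cutoff(env=os.environ, var="EXEMPT_DATE"):
    """Read the cut-off date from the environment, or None if it is unset."""
    raw = env.get(var, "").strip()
    return date.fromisoformat(raw) if raw else None


def is_exempt(published, cutoff):
    """A file is exempt when it was published before the cut-off date."""
    return cutoff is not None and published < cutoff


if __name__ == "__main__":
    # Illustrative value only, not the date used for Luxembourg.
    cutoff = exempt_cutoff({"EXEMPT_DATE": "2020-01-01"})
    print(is_exempt(date(2019, 6, 1), cutoff))
```

With the variable unset, no file is treated as exempt, so every country gets a safe default.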

Remove GET parameters in file names

In the crawler, remove GET parameters from the file name when saving. Find a solution for duplicates. The file extension is used for some statistics, so this can lead to further issues.
bug

opened by AlainVagner 0
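
One way to address both points, shown as a Python sketch rather than as a patch to the crawler itself: derive the file name from the URL path only, and when that name is already taken, add a short hash of the full URL before the extension. The function name filename_from_url and the suffix format are hypothetical.

```python
import hashlib
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, taken):
    """Build a file name without GET parameters, unique within `taken`.

    `taken` is the set of names already used for the current site. It is
    updated in place.
    """
    # The path component excludes the query string and the fragment.
    name = posixpath.basename(urlsplit(url).path) or "index"
    if name in taken:
        stem, ext = posixpath.splitext(name)
        suffix = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
        # The suffix goes before the extension so that statistics based on
        # the extension keep working.
        name = f"{stem}-{suffix}{ext}"
    taken.add(name)
    return name


if __name__ == "__main__":
    used = set()
    print(filename_from_url("https://example.org/docs/report.pdf?v=1", used))
    print(filename_from_url("https://example.org/docs/report.pdf?v=2", used))
```

Because the suffix is derived from the URL, crawling the same URL again yields the same name instead of a new duplicate.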

Owner

AccessibilityLU

