simplA11yPDFCrawler

Overview

simplA11yPDFCrawler is a tool supporting the simplified accessibility monitoring method described in Commission Implementing Decision (EU) 2018/1524. It is used by SIP (Information and Press Service) in Luxembourg to monitor the websites of public sector bodies.

This tool crawls a list of websites and downloads all PDF and office documents. It then analyses the PDF documents and tries to detect accessibility issues. The generated files can then be used by the simplA11yGenReport tool to give an overview of the state of document accessibility on the monitored websites.

Most of the accessibility reports (in French) published by SIP on data.public.lu have been generated using simplA11yGenReport and data produced by this tool.

Accessibility Tests

On all PDF files we execute the following tests:

| name | description | WCAG SC | WCAG technique | EN 301 549 |
|---|---|---|---|---|
| EmptyText | does the file contain text or only images (scanned document)? | 1.4.5 Images of Text (AA) | PDF 7 | 10.1.4.5 |
| Tagged | is the document tagged? | | | |
| Protected | is the document protected in a way that blocks screen readers? | | | |
| hasTitle | does the document have a title? | 2.4.2 Page Titled (A) | PDF 18 | 10.2.4.2 |
| hasLang | does the document have a default language? | 3.1.1 Language of Page (A) | PDF 16 | 10.3.1.1 |
| hasBookmarks | does the document have bookmarks? | 2.4.1 Bypass Blocks (A) | | 10.2.4.1 |
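
The analysis step relies on pikepdf (the library imported by pdfCheck.py). The sketch below shows how checks of this kind can be expressed with pikepdf; the function name and the exact heuristics are illustrative and do not reproduce the tool's actual implementation.

# Illustrative sketch only: document-level checks with pikepdf.
# The heuristics are simplified compared to pdfCheck.py, and the
# EmptyText check (detecting scanned, image-only files) is omitted
# because it requires text extraction.
from pikepdf import Pdf

def basic_checks(path):
    with Pdf.open(path) as pdf:
        root = pdf.Root
        return {
            # Protected: encryption that denies assistive-technology access
            "Protected": pdf.is_encrypted and not pdf.allow.accessibility,
            # Tagged: the catalog declares a tagged structure tree
            "Tagged": "/MarkInfo" in root and bool(root.MarkInfo.get("/Marked", False)),
            # hasTitle: a non-empty /Title in the document information dictionary
            "hasTitle": bool(str(pdf.docinfo.get("/Title", "")).strip()),
            # hasLang: a default document language is declared in the catalog
            "hasLang": "/Lang" in root,
            # hasBookmarks: an outline (bookmark) tree is present
            "hasBookmarks": "/Outlines" in root,
        }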

Installation

git clone https://github.com/accessibility-luxembourg/simplA11yPDFCrawler.git
cd simplA11yPDFCrawler
npm install
pip install -r requirements.txt
mkdir crawled_files ; mkdir out 
chmod a+x *.sh

Usage

To be able to use this tool, you need a list of websites to crawl. Store this list in a file named list-sites.txt, one domain per line (without protocol and without path). Example of content for this file:

test.public.lu
etat.public.lu

Then the tool is used in two steps:

  1. Crawl all the files. Launch the crawl.sh script. It will crawl all the sites mentioned in list-sites.txt. Each site is crawled for a maximum of 4 hours (this can be adjusted in crawl.sh). The resulting files will be placed in the crawled_files folder. This step can be quite long.
  2. Analyse the files and detect accessibility issues. Launch the analyse.sh script. The resulting files will be placed in the out folder.

License

This software is developed by the Information and Press Service of the Luxembourg government and licensed under the MIT license.

Comments
  • Include incremental update mode

    It would be interesting to be able to launch the crawler several times on the same site, detect new files, and analyse only the accessibility of these new files (a rough sketch of such a mode follows after this issue).

    enhancement 
    opened by AlainVagner 0
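
    One possible shape for such an incremental mode, sketched with a hypothetical manifest file that records hashes of already-analysed documents (none of these names exist in the current code):

    # Illustrative sketch only: remember which documents were already seen
    # between runs, so that only new files are passed to the analysis step.
    # The manifest file name and layout are hypothetical.
    import hashlib
    import json
    from pathlib import Path

    MANIFEST = Path("crawled_files/manifest.json")

    def new_documents(folder="crawled_files"):
        seen = set(json.loads(MANIFEST.read_text())) if MANIFEST.exists() else set()
        new_files = []
        for path in Path(folder).rglob("*.pdf"):
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            if digest not in seen:
                new_files.append(path)
                seen.add(digest)
        MANIFEST.write_text(json.dumps(sorted(seen)))
        return new_files  # only these would go through pdfCheck.py
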
  • ModuleNotFoundError: No module named 'pikepdf'

    The crawl.sh script seemed to work just fine. Was able to scrape a good list of other documents this way.

    Trying to run the analysis wasn't so good:

    % ./analyse.sh 
    find: ./crawled_files/10x.gsa.gov/*.pdf: No such file or directory
    ./crawled_files/apprenticeship.gov/2021%20Apprenticeship%20Mailer.pdf
    Traceback (most recent call last):
      File "/Users/mgifford/Documents/GitHub/simplA11yPDFCrawler/./pdfCheck.py", line 1, in <module>
        from pikepdf import Pdf, String, _qpdf
    ModuleNotFoundError: No module named 'pikepdf'
    ./crawled_files/apprenticeship.gov/29_cf_30_regs_only.pdf
    Traceback (most recent call last):
      File "/Users/mgifford/Documents/GitHub/simplA11yPDFCrawler/./pdfCheck.py", line 1, in <module>
        from pikepdf import Pdf, String, _qpdf
    ModuleNotFoundError: No module named 'pikepdf'
    

    I'm running on a Mac, but didn't think that would be a problem:

    % pip3 install pikepdf
    DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
    Requirement already satisfied: pikepdf in /usr/local/lib/python3.9/site-packages (4.2.0)
    Requirement already satisfied: lxml>=4.0 in /usr/local/lib/python3.9/site-packages (from pikepdf) (4.6.3)
    Requirement already satisfied: Pillow>=6.0 in /usr/local/lib/python3.9/site-packages (from pikepdf) (8.4.0)
    DEPRECATION: Configuring installation scheme with distutils config files is deprecated and will no longer work in the near future. If you are using a Homebrew or Linuxbrew Python, please see discussion at https://github.com/Homebrew/homebrew-core/issues/76621
    

    I've tried installing pikepdf on its own with pip & pip3.

    I notice that pikepdf is in the requirements.txt

    Not sure if this is a problem at my end or not.

    I did cut short my crawl.sh as I seemed to be getting a lot more errors. Anyways, don't think that's the cause of this. It is finding lots of files in the directory.

    opened by mgifford 0
  • Integrate VeraPDF

    When this code was developed, we were not aware of the existence of the VeraPDF tool. It could be integrated for our accessibility tests (PDF/UA-1).

    enhancement 
    opened by AlainVagner 0
  • Set exempt date as an environment variable

    In some countries, not all PDF files have to be compliant: files published before a given date are exempt. Currently this date is hard-coded for Luxembourg; it should be a parameter (a possible approach is sketched after this issue).

    enhancement 
    opened by AlainVagner 0
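
    A possible way to parameterise this, assuming a hypothetical EXEMPT_DATE environment variable; the default below is the 23 September 2018 exemption date from Directive (EU) 2016/2102:

    # Illustrative sketch only: read the exemption date from the environment
    # instead of hard-coding it. EXEMPT_DATE is a hypothetical variable name.
    import os
    from datetime import date

    EXEMPT_BEFORE = date.fromisoformat(os.environ.get("EXEMPT_DATE", "2018-09-23"))

    def is_exempt(published: date) -> bool:
        # Documents published before the exemption date need not be compliant.
        return published < EXEMPT_BEFORE
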
  • Remove GET parameters in file names

    In the crawler, remove GET parameters from the file name when saving, and find a solution for duplicates. The file extension is used for some statistics, so this can lead to further issues (one possible approach is sketched after this issue).

    bug 
    opened by AlainVagner 0
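
    A sketch of one way to derive local file names without the query string while still avoiding collisions; the helper name is hypothetical and not part of the current crawler:

    # Illustrative sketch only: strip GET parameters from the saved file name
    # and disambiguate URLs that differ only in their query string.
    import hashlib
    from urllib.parse import unquote, urlparse

    def local_filename(url: str) -> str:
        parsed = urlparse(url)
        name = unquote(parsed.path.rsplit("/", 1)[-1]) or "index"
        if not parsed.query:
            return name
        suffix = hashlib.sha1(parsed.query.encode()).hexdigest()[:8]
        if "." in name:
            stem, ext = name.rsplit(".", 1)
            return f"{stem}-{suffix}.{ext}"  # keep the extension usable for statistics
        return f"{name}-{suffix}"

    For example, report.pdf?lang=fr and report.pdf?lang=de would be saved under two different hashed names while keeping the .pdf extension.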