simplA11yPDFCrawler
simplA11yPDFCrawler is a tool supporting the simplified accessibility monitoring method described in Commission Implementing Decision (EU) 2018/1524. It is used by SIP (Information and Press Service) in Luxembourg to monitor the websites of public sector bodies.
This tool crawls a list of websites and downloads all PDF and office documents. It then analyses the PDF documents and tries to detect accessibility issues. The generated files can then be used by the tool simplA11yGenReport to give an overview of the state of document accessibility on the monitored websites.
Most of the accessibility reports (in French) published by SIP on data.public.lu have been generated using simplA11yGenReport and data coming from this tool.
Accessibility Tests
On all PDF files we execute the following tests:
name | description | WCAG SC | WCAG technique | EN 301 549 |
---|---|---|---|---|
EmptyText | Does the file contain text, or only images (e.g. a scanned document)? | 1.4.5 Images of Text (AA)? | PDF7 | 10.1.4.5 |
Tagged | Is the document tagged? | | | |
Protected | Is the document protected in a way that blocks screen readers? | | | |
hasTitle | Does the document have a title? | 2.4.2 Page Titled (A) | PDF18 | 10.2.4.2 |
hasLang | Does the document have a default language? | 3.1.1 Language of Page (A) | PDF16 | 10.3.1.1 |
hasBookmarks | Does the document have bookmarks? | 2.4.1 Bypass Blocks (A) | | 10.2.4.1 |
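To make the table more concrete, here is a minimal Python sketch of how the six flags could be derived from values found in a PDF. It is an illustration only and not necessarily how this tool implements its tests. `PdfFacts` and `run_checks` are hypothetical names; the sketch assumes the catalog, the information dictionary, the extracted text and the encryption state have already been read with a PDF library of your choice. The keys `/MarkInfo`, `/Marked`, `/Lang`, `/Outlines`, `/Title` and the permission flags `/P` come from the PDF specification. Treating "Protected" as "encrypted without the accessibility extraction permission" is an assumption.

```python
from dataclasses import dataclass

# Bit 10 of the PDF permission flags (/P): extraction for accessibility purposes.
ACCESSIBILITY_EXTRACT_BIT = 1 << 9


@dataclass
class PdfFacts:
    """Hypothetical container for values read from one PDF file."""
    catalog: dict          # document catalog entries, e.g. {"/Lang": "fr-LU"}
    info: dict             # document information dictionary, e.g. {"/Title": "..."}
    extracted_text: str    # text extracted from all pages
    encrypted: bool = False
    permissions: int = -1  # value of /P when encrypted; -1 means every bit is set


def run_checks(pdf: PdfFacts) -> dict:
    """Return one boolean per test name used in the table above."""
    mark_info = pdf.catalog.get("/MarkInfo", {})
    return {
        # True when no text could be extracted (assumed: likely a scanned document).
        "EmptyText": pdf.extracted_text.strip() == "",
        "Tagged": bool(mark_info.get("/Marked", False)),
        # Assumption: protected means encrypted AND accessibility extraction denied.
        "Protected": pdf.encrypted
        and not (pdf.permissions & ACCESSIBILITY_EXTRACT_BIT),
        "hasTitle": bool(str(pdf.info.get("/Title", "")).strip()),
        "hasLang": bool(str(pdf.catalog.get("/Lang", "")).strip()),
        "hasBookmarks": "/Outlines" in pdf.catalog,
    }
```

For example, `run_checks(PdfFacts(catalog={}, info={}, extracted_text=""))` flags the file as `EmptyText` and reports every other property as absent.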
Installation
git clone https://github.com/accessibility-luxembourg/simplA11yPDFCrawler.git
cd simplA11yPDFCrawler
npm install
pip install -r requirements.txt
mkdir crawled_files ; mkdir out
chmod a+x *.sh
Usage
To use this tool, you need a list of websites to crawl. Store this list in a file named list-sites.txt, one domain per line (without protocol and without path). Example content for this file:
test.public.lu
etat.public.lu
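If your starting point is a list of full URLs, a small helper can reduce them to the bare domains expected in list-sites.txt. The following is a sketch using only the Python standard library; `to_domain` and `normalise` are hypothetical helper names and are not part of this tool.

```python
from urllib.parse import urlsplit


def to_domain(entry: str) -> str:
    """Strip protocol, port and path from one entry; return "" if nothing is left."""
    entry = entry.strip()
    # urlsplit only recognises the host part when a "//" separator is present.
    if "://" not in entry:
        entry = "//" + entry
    return urlsplit(entry).hostname or ""


def normalise(lines):
    """Return unique lower-cased domains in their original order, skipping blanks."""
    domains = []
    for line in lines:
        domain = to_domain(line)
        if domain and domain not in domains:
            domains.append(domain)
    return domains
```

The result can then be written to the file with `Path("list-sites.txt").write_text("\n".join(domains) + "\n")`, after importing `Path` from `pathlib`.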
Then the tool is used in two steps:
- Crawl all the files. Launch the command crawl.sh. It will crawl all the sites listed in list-sites.txt. Each site is crawled for a maximum of 4 hours (this can be adjusted in crawl.sh). The resulting files will be placed in the crawled_files folder. This step can take quite a long time.
- Analyse the files and detect accessibility issues. Launch the command analyse.sh. The resulting files will be placed in the out folder.
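The two steps can also be chained from a script. The sketch below is one possible wrapper, assuming it is given the repository root as working directory; `run_pipeline` and its `runner` parameter are hypothetical and exist only to keep the example testable without launching a real crawl.

```python
import subprocess
from pathlib import Path


def run_pipeline(workdir: Path, runner=subprocess.run) -> list:
    """Run crawl.sh, then analyse.sh, inside workdir; return the scripts executed."""
    sites = workdir / "list-sites.txt"
    if not sites.is_file() or not sites.read_text().strip():
        raise FileNotFoundError(f"{sites} is missing or empty")

    # Both folders are created during installation; recreate them if needed.
    for folder in ("crawled_files", "out"):
        (workdir / folder).mkdir(exist_ok=True)

    executed = []
    for script in ("./crawl.sh", "./analyse.sh"):
        # check=True stops the pipeline as soon as one step fails.
        runner([script], cwd=workdir, check=True)
        executed.append(script)
    return executed
```

Calling `run_pipeline(Path("."))` from the repository root runs both steps in order. Since the crawl can take up to 4 hours per site, expect this call to block for a long time.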
License
This software is developed by the Information and Press Service of the Luxembourg government and licensed under the MIT license.