https://github.com/accessibility-luxembourg/simplA11yPDFCrawler

This tool crawls a list of websites and downloads all PDF and office documents. It then analyses the PDF documents and tries to detect accessibility issues.

Host: GitHub
URL: https://github.com/accessibility-luxembourg/simplA11yPDFCrawler
Owner: accessibility-luxembourg
License: mit
Created: 2021-12-10T12:13:04.000Z (about 3 years ago)
Default Branch: main
Last Pushed: 2024-10-22T07:19:48.000Z (4 months ago)
Last Synced: 2024-10-23T10:31:46.428Z (4 months ago)
Language: Python
Size: 16.6 KB
Stars: 22
Watchers: 4
Forks: 3
Open Issues: 9
Metadata Files:
- Readme: README.md
- License: LICENSE


# simplA11yPDFCrawler

simplA11yPDFCrawler is a tool supporting the simplified accessibility monitoring method described in the [Commission Implementing Decision (EU) 2018/1524](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32018D1524&from=EN). It is used by [SIP (Information and Press Service)](https://sip.gouvernement.lu/en.html) in Luxembourg to monitor the websites of public sector bodies.

This tool crawls a list of websites and downloads all PDF and office documents. It then analyses the PDF documents and tries to detect accessibility issues.
The generated files can then be used by the tool [simplA11yGenReport](https://github.com/accessibility-luxembourg/simplA11yGenReport) to give an overview of the state of document accessibility on the monitored websites.

Most of the [accessibility reports (in French)](https://data.public.lu/fr/datasets/audits-simplifies-de-laccessibilite-numerique-2020-2021/) published by SIP on [data.public.lu](https://data.public.lu) were generated using [simplA11yGenReport](https://github.com/accessibility-luxembourg/simplA11yGenReport) and data produced by this tool.

## Accessibility Tests

On all PDF files we execute the following tests:

| Name | Description | WCAG SC | WCAG technique | EN 301 549 |
|------|-------------|---------|----------------|------------|
| EmptyText | Does the file contain text, or only images (e.g. a scanned document)? | 1.4.5 Images of Text (AA)? | PDF7 | 10.1.4.5 |
| Tagged | Is the document tagged? | | | |
| Protected | Is the document protected in a way that blocks screen readers? | | | |
| hasTitle | Does the document have a title? | 2.4.2 Page Titled (A) | PDF18 | 10.2.4.2 |
| hasLang | Does the document have a default language? | 3.1.1 Language of Page (A) | PDF16 | 10.3.1.1 |
| hasBookmarks | Does the document have bookmarks? | 2.4.1 Bypass Blocks (A) | | 10.2.4.1 |
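
When post-processing results, it can help to have this table available as data. The following is a minimal sketch, not part of the tool itself: the `TEST_CRITERIA` dictionary and the `failed_clauses` helper are hypothetical names, and the values are simply transcribed from the table (including the 1.4.5 mapping, which the table marks as uncertain).

```python
# Mapping transcribed from the table of tests; this structure is illustrative
# and is not taken from the tool's source code.
# Tests for which the table lists no criterion (Tagged, Protected) map to None.
TEST_CRITERIA = {
    # The table marks the 1.4.5 mapping with a question mark (uncertain).
    "EmptyText": {"wcag_sc": "1.4.5", "technique": "PDF7", "en_301_549": "10.1.4.5"},
    "Tagged": None,
    "Protected": None,
    "hasTitle": {"wcag_sc": "2.4.2", "technique": "PDF18", "en_301_549": "10.2.4.2"},
    "hasLang": {"wcag_sc": "3.1.1", "technique": "PDF16", "en_301_549": "10.3.1.1"},
    "hasBookmarks": {"wcag_sc": "2.4.1", "technique": None, "en_301_549": "10.2.4.1"},
}


def failed_clauses(results):
    """Return the sorted EN 301 549 clauses of every failed test.

    `results` maps a test name to True (passed) or False (failed).
    Tests without a listed clause are skipped.
    """
    clauses = set()
    for name, passed in results.items():
        criteria = TEST_CRITERIA.get(name)
        if not passed and criteria is not None:
            clauses.add(criteria["en_301_549"])
    return sorted(clauses)
```

For example, a file that fails `Tagged`, `hasTitle` and `hasLang` would be reported against clauses 10.2.4.2 and 10.3.1.1, since the table lists no clause for `Tagged`.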

## Installation

```
git clone https://github.com/accessibility-luxembourg/simplA11yPDFCrawler.git
cd simplA11yPDFCrawler
npm install
pip install -r requirements.txt
mkdir crawled_files ; mkdir out
chmod a+x *.sh
```

On macOS, the `timeout` and `gtimeout` commands are not available by default. Install the coreutils package via brew:
```
brew install coreutils
```

## Usage

To be able to use this tool, you need a list of websites to crawl. Store this list in a file named `list-sites.txt`, one domain per line (without protocol and without path). Example of content for this file:

```
test.public.lu
etat.public.lu

```
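
Since each entry must be a bare domain, a small pre-check of the list can catch entries pasted with a protocol or a path. The sketch below is only an illustration and is not shipped with the tool; the function name `normalise_domains` is hypothetical.

```python
from urllib.parse import urlsplit


def normalise_domains(lines):
    """Return bare host names, dropping blank lines, protocols, paths and duplicates."""
    domains = []
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue  # skip blank lines such as a trailing empty line
        # urlsplit only recognises the host part when the entry starts with
        # "scheme://" or "//", so bare domains get a "//" prefix first.
        if "://" not in entry:
            entry = "//" + entry
        host = urlsplit(entry).hostname  # lower-cased, without port or path
        if host and host not in domains:
            domains.append(host)
    return domains
```

A possible use is to read a draft list, pass its lines through `normalise_domains`, and write the result to `list-sites.txt`, one domain per line.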

Then the tool is used in two steps:
1. Crawl all the files. Run `crawl.sh`. It crawls all the sites listed in `list-sites.txt`. Each site is crawled for at most 4 hours (this limit can be adjusted in `crawl.sh`). The resulting files are placed in the `crawled_files` folder. This step can take a long time.
2. Analyse the files and detect accessibility issues. Run `analyse.sh`. The resulting files are placed in the `out` folder.
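
The per-site time limit in step 1 follows a common pattern: start a long-running command and stop it once a deadline passes. The sketch below shows that pattern in Python for illustration only. It does not reproduce `crawl.sh`, whose contents are not shown here; `run_with_limit` and `FOUR_HOURS` are hypothetical names.

```python
import subprocess

FOUR_HOURS = 4 * 60 * 60  # seconds; mirrors the default limit described above


def run_with_limit(command, limit_seconds=FOUR_HOURS):
    """Run `command` (a list of arguments) with a time limit.

    Returns True if the command finished within the limit and False if it
    was stopped because the limit expired. The exit status is ignored.
    """
    try:
        # subprocess.run kills the child process when the timeout expires
        # and then raises TimeoutExpired.
        subprocess.run(command, timeout=limit_seconds, check=False)
    except subprocess.TimeoutExpired:
        return False
    return True
```

With such a helper, a driver script could loop over the domains in `list-sites.txt` and continue with the next site whenever one crawl runs out of time.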

## Output file

The file generated by this tool is a CSV file with the following columns:

- Accessible: False if any accessibility issue is detected in the file.
- TotallyInaccessible: True if at least one of the following tests fails: TaggedTest, EmptyTextTest, ProtectedTest.
- BrokenFile: was the tool able to read the file? If not, the file is considered "broken".
- TaggedTest: is the file tagged? If not, this is a serious accessibility issue.
- EmptyTextTest: does the file contain text, or only images? The latter can indicate a scanned document on which no OCR has been run. This is potentially a serious accessibility issue.
- ProtectedTest: is the file protected against use by assistive technologies? This is a serious accessibility issue.
- TitleTest: does the file have a defined title, and is the "Display Document Title" flag set so that the title is shown in the title bar of the PDF reader window?
- LanguageTest: does the file have a valid language defined?
- BookmarksTest: if the file has more than 20 pages, does it have bookmarks?
- Exempt: is the file outside the scope of the Luxembourgish law? This can be the case if it was published before 23 September 2018. This test is only an estimate based on the creation date of the file.
- Date: creation date of the file
- hasTitle: is a title defined for the file?
- hasDisplayDocTitle: is the DisplayDocTitle flag set?
- hasLang: does the file have a defined language?
- InvalidLang: is the defined language invalid?
- Form: does the file contain a form?
- xfa: does the file contain a dynamic XFA form?
- hasBookmarks: does the file have bookmarks?
- hasXmp: does the file contain XMP metadata?
- PDFVersion: PDF version of the file
- Creator: creator of the file (software)
- Producer: producer of the file (software)
- Pages: number of pages
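
The relationship between the summary columns and the individual tests can be sketched as follows. This is a hedged illustration based only on the column descriptions above, not the tool's actual code. It assumes that each `*Test` value is True when the file passes that test, and the names `summarise` and `CUTOFF` are hypothetical.

```python
from datetime import date

# Files created before this date are estimated to be exempt (see "Exempt").
CUTOFF = date(2018, 9, 23)


def summarise(tests, pages, has_bookmarks, created):
    """Derive the summary columns for one file.

    `tests` maps a test column name (e.g. "TaggedTest") to True when the
    file passed that test. This polarity is an assumption of the sketch.
    """
    # Bookmarks are only expected in documents longer than 20 pages.
    bookmarks_ok = pages <= 20 or has_bookmarks
    # Failing any of these three tests makes the file totally inaccessible.
    blocking = ("TaggedTest", "EmptyTextTest", "ProtectedTest")
    totally_inaccessible = any(not tests[name] for name in blocking)
    # A file is accessible only if no issue at all was detected.
    accessible = all(tests.values()) and bookmarks_ok
    return {
        "Accessible": accessible,
        "TotallyInaccessible": totally_inaccessible,
        "BookmarksTest": bookmarks_ok,
        "Exempt": created < CUTOFF,
    }
```

Under these assumptions, a tagged, unprotected, text-based file with a missing title is neither `Accessible` nor `TotallyInaccessible`: the issue is real, but it does not block assistive technologies outright.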

## License
This software is developed by the [Information and Press Service](https://sip.gouvernement.lu/en.html) of the Luxembourgish government and licensed under the MIT license.