https://github.com/vadimkantorov/natudump
Scraping LegiFrance naturalisation decrees for fun and OSINT profit
legifrance osint python selenium
- Host: GitHub
- URL: https://github.com/vadimkantorov/natudump
- Owner: vadimkantorov
- Created: 2022-09-08T14:08:08.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2023-05-27T21:09:42.000Z (over 1 year ago)
- Last Synced: 2024-04-14T03:11:50.729Z (8 months ago)
- Topics: legifrance, osint, python, selenium
- Language: Python
- Homepage:
- Size: 57.6 KB
- Stars: 6
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
This is an example of scraping naturalisation decrees from the public LegiFrance registry, for research purposes only (`naturalisation par mariage` is not included in these decrees). The code is licensed under MIT.
```shell
pip install selenium charset_normalizer

mkdir -p jo
# python3 natudump.py -o jo --years $(seq 2000 2021) --output-directory-prefix "$(wslpath -a -w "$PWD")\\" # for WSL systems, must be on an NTFS drive
python3 natudump.py -o jo --years $(seq 2000 2021) --output-directory-prefix "$PWD/"
ls jo | wc -l

mkdir -p txtjo
# https://github.com/pdfminer/pdfminer.six/issues/809
git clone --branch 20220524 --depth 1 https://github.com/pdfminer/pdfminer.six
PYTHONPATH="pdfminer.six:pdfminer.six/tools:$PYTHONPATH" find jo -name '*.pdf' -exec python3 -m pdf2txt {} -o txt{}.txt \;
ls txtjo | wc -l

python3 tabulate.py -i txtjo -o natufrance_2000_2021.tsv
grep 'Russie\|URSS\|U.R.S.S' natufrance_2000_2021.tsv | wc -l

mkdir -p catjo
git clone --branch v0.4 --depth 1 https://github.com/pmaupin/pdfrw
rm $(PYTHONPATH="$PWD/pdfrw:$PYTHONPATH" find jo/ -type f -not -exec python3 -c 'import sys, pdfrw; pdfrw.PdfReader(sys.argv[1])' {} \; -print)
for years in $(seq 2000 2021); do PYTHONPATH="$PWD/pdfrw:$PYTHONPATH" python3 pdfrw/examples/cat.py jo/JORF_${years}*; mv cat.JORF_${years}*.pdf catjo; done
ls catjo | wc -l

mkdir -p tarjo
for years in $(seq 2000 2021); do tar -cf tarjo/jo${years}.tar jo/*_${years}*; done
ls tarjo | wc -l
```
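
The `grep ... | wc -l` step counts rows of the resulting TSV that mention a given country. The same count can be sketched in Python as below. This is not part of the repository: the names `COUNTRY_PATTERN` and `count_matching_lines` are hypothetical. The TSV column layout is not described here, so the sketch matches on whole lines, as `grep` does. One deliberate difference: the dots in `U.R.S.S` are escaped, so they match literal dots only, whereas in the `grep` pattern an unescaped `.` matches any character.

```python
import re
from typing import Iterable

# Same alternatives as the grep step, with the dots escaped so that
# "U.R.S.S" matches literal dots only (an assumption about the intent).
COUNTRY_PATTERN = re.compile(r"Russie|URSS|U\.R\.S\.S")


def count_matching_lines(lines: Iterable[str],
                         pattern: "re.Pattern[str]" = COUNTRY_PATTERN) -> int:
    """Count lines containing at least one match, like `grep PATTERN | wc -l`."""
    return sum(1 for line in lines if pattern.search(line))
```

An open file handle can be passed directly, since iterating over it yields lines, for example `count_matching_lines(open("natufrance_2000_2021.tsv"))`. The file's encoding is not stated here, so an explicit `encoding=` argument may be needed.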