https://github.com/honzajavorek/czap
Scraping czap.cz data so you can filter available psychotherapists by any criteria you wish
czech czech-republic czechia git-scraping psychoterapists psychotherapy registry scraper scrapy
- Host: GitHub
- URL: https://github.com/honzajavorek/czap
- Owner: honzajavorek
- License: unlicense
- Created: 2024-02-25T13:53:12.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2024-04-30T04:51:36.000Z (over 1 year ago)
- Last Synced: 2024-05-02T01:14:55.698Z (over 1 year ago)
- Topics: czech, czech-republic, czechia, git-scraping, psychoterapists, psychotherapy, registry, scraper, scrapy
- Language: Python
- Homepage:
- Size: 1.26 MB
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# 💆 czap.cz members
Scraping [czap.cz members](https://czap.cz/adresar) so you can filter available psychotherapists by any criteria you wish:
- [Download CSV](http://honzajavorek.github.io/czap/items.csv)
- [Download JSON](https://honzajavorek.github.io/czap/items.json)

I wanted to filter a list of Czech psychotherapists by criteria other than those available at the [registry website](https://czap.cz/adresar). For example, the registry lets you filter by location, but only down to the level of region. As there are 700+ therapists in Prague alone, that's not very useful.
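To give an idea of the kind of filtering the downloads make possible, here is a minimal sketch in Python. The inline sample stands in for `items.json`, and the field names `name` and `city` are assumptions made up for illustration, so check the real file for the actual keys. The helper `filter_items` is hypothetical as well.

```python
import json

# Inline sample standing in for items.json. The keys "name" and "city"
# are assumptions for illustration, not the confirmed schema of the file.
SAMPLE_JSON = """
[
  {"name": "Therapist A", "city": "Praha 2"},
  {"name": "Therapist B", "city": "Brno"},
  {"name": "Therapist C", "city": "Praha 6"}
]
"""


def filter_items(items, field, needle):
    """Keep items whose `field` contains `needle`, ignoring case.

    Items that lack the field (or have it set to null) are skipped.
    """
    needle = needle.casefold()
    return [
        item
        for item in items
        if needle in str(item.get(field) or "").casefold()
    ]


items = json.loads(SAMPLE_JSON)

# Narrow the list down to a single city district, which is finer
# than the region-level filter described above.
prague_6 = filter_items(items, "city", "praha 6")
print([item["name"] for item in prague_6])
```

With the real data you would load the downloaded file instead of `SAMPLE_JSON` and pass whichever field you care about.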
## Monitoring changes
I don't think it's particularly useful to monitor changes in the registry, but I used [git scraping](https://simonwillison.net/2020/Oct/9/git-scraping/) nevertheless, because why not:
- [History of changes](https://github.com/honzajavorek/czap/commits/main/items.json)
- [Feed of changes](https://github.com/honzajavorek/czap/commits/main.atom) (aka RSS)

## Notes on development
The scraper uses my favorite [Scrapy](https://docs.scrapy.org/) framework.
So far I scrape only a few fields.
If you want to build on top of the data and you're missing something, let me know in [issues](https://github.com/honzajavorek/czap/issues).
However, because I won't have time to add the fields, you'd better edit the code and add them yourself.

The scraper first downloads the whole registry with a single request.
The data is encoded not as JSON, but as a non-standard JavaScript mess.
I figured out that the `demjson3` library can parse it, but it takes many minutes (e.g. 30 min) to get the result.
I added a cache so that the parsed result stays around for at least a day.

That data contains some info about members.
It is structured, but the structure is very cryptic and needs to be reverse-engineered.
If you're the kind of person who is into such things, feel free to add fields there.

If you prefer good old HTML scraping, the scraper also makes requests to all individual member profile pages.
There you can use [Scrapy selectors](https://docs.scrapy.org/en/latest/topics/selectors.html) to add fields to the data.
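
Going back to the slow parse step: the idea of keeping the parsed result around for a day can be sketched roughly as below. This is not the code from the repository, only an illustration of the technique. The helper `cached_parse`, its parameters, and the cache file location are all hypothetical, and the slow parser is passed in as a plain function so the sketch runs without `demjson3` installed.

```python
import json
import time
from pathlib import Path

ONE_DAY = 24 * 60 * 60  # seconds


def cached_parse(raw_text, cache_path, parse, max_age=ONE_DAY):
    """Return parse(raw_text), reusing a saved result when it is fresh.

    The parsed result is stored as plain JSON at cache_path. If that file
    exists and was modified less than max_age seconds ago, it is loaded
    and the (slow) parse function is not called at all.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        age = time.time() - cache_path.stat().st_mtime
        if age < max_age:
            return json.loads(cache_path.read_text(encoding="utf-8"))

    result = parse(raw_text)  # the slow step, e.g. the half-hour parse
    cache_path.write_text(
        json.dumps(result, ensure_ascii=False), encoding="utf-8"
    )
    return result
```

Any function that turns the downloaded text into JSON-serializable Python objects can be passed as `parse`. The sketch keys freshness on the file's modification time only, so it assumes the downloaded text doesn't change meaningfully within a day.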