https://github.com/honzajavorek/czap

Scraping czap.cz data so you can filter available psychotherapists by any criteria you wish

czech czech-republic czechia git-scraping psychoterapists psychotherapy registry scraper scrapy


Host: GitHub
URL: https://github.com/honzajavorek/czap
Owner: honzajavorek
License: unlicense
Created: 2024-02-25T13:53:12.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-04-30T04:51:36.000Z (over 1 year ago)
Last Synced: 2024-05-02T01:14:55.698Z (over 1 year ago)
Topics: czech, czech-republic, czechia, git-scraping, psychoterapists, psychotherapy, registry, scraper, scrapy
Language: Python
Homepage:
Size: 1.26 MB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md
- License: LICENSE


# 💆 czap.cz members

Scraping [czap.cz members](https://czap.cz/adresar) so you can filter available psychotherapists by any criteria you wish:

- [Download CSV](http://honzajavorek.github.io/czap/items.csv)
- [Download JSON](https://honzajavorek.github.io/czap/items.json)

I wanted to filter a list of Czech psychotherapists according to different criteria than those available at the [registry website](https://czap.cz/adresar). For example, the registry lets you filter by location, but only down to the level of region. As there are 700+ therapists in Prague alone, that's not very useful.
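
Once the JSON is downloaded, filtering it takes only a few lines of Python. The following is a minimal sketch, not part of the scraper: the helper `filter_items` and the field names `name`, `city`, and `region` are made up for illustration, so check the actual `items.json` for the fields it really contains.

```python
import json


def filter_items(items, **criteria):
    """Keep items whose fields contain the given text (case-insensitive)."""
    def matches(item):
        return all(
            str(wanted).lower() in str(item.get(field, "")).lower()
            for field, wanted in criteria.items()
        )
    return [item for item in items if matches(item)]


# Made-up sample records; the real items.json may use different field names.
sample = json.loads("""
[
  {"name": "A", "city": "Praha 6", "region": "Praha"},
  {"name": "B", "city": "Praha 2", "region": "Praha"},
  {"name": "C", "city": "Brno", "region": "Jihomoravský kraj"}
]
""")

# Narrow the list down to a single district instead of the whole region.
print(filter_items(sample, city="praha 6"))
```

With the real data you would replace `sample` with the result of `json.load()` on the downloaded file.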

## Monitoring changes

I don't think it's particularly useful to monitor changes in the registry, but I used [git scraping](https://simonwillison.net/2020/Oct/9/git-scraping/) nevertheless, because why not:

- [History of changes](https://github.com/honzajavorek/czap/commits/main/items.json)
- [Feed of changes](https://github.com/honzajavorek/czap/commits/main.atom) (aka RSS)

## Notes on development

The scraper uses my favorite [Scrapy](https://docs.scrapy.org/) framework.

So far I scrape only a few fields.
If you want to build on top of the data and you're missing something, let me know in [issues](https://github.com/honzajavorek/czap/issues).
However, because I won't have time to add the fields, you'd better edit the code and add them yourself.

The scraper first downloads the whole registry with a single request.
The data is encoded not as JSON, but as a non-standard JavaScript mess.
I found that the `demjson3` library can parse it, but parsing takes a long time (e.g. 30 minutes).
I added a cache so that the parse result stays around for at least a day.
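
The caching idea can be sketched roughly like this. It is not the scraper's actual code: the names `slow_parse`, `cached_parse`, and `ONE_DAY` are hypothetical, and the sketch assumes the parse result is plain data that can be stored as a JSON file and reused while the file is younger than a day.

```python
import json
import time
from pathlib import Path

ONE_DAY = 24 * 60 * 60  # seconds


def slow_parse(raw):
    """Parse lenient JavaScript-style data with demjson3 (the slow step)."""
    import demjson3  # imported lazily, only needed on a cache miss
    return demjson3.decode(raw)


def cached_parse(raw, cache_path, parse=slow_parse, max_age=ONE_DAY):
    """Return the parsed data, reusing a cached JSON file if it is fresh."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        age = time.time() - cache_path.stat().st_mtime
        if age < max_age:
            return json.loads(cache_path.read_text(encoding="utf-8"))
    # Cache miss or stale cache: do the expensive parse and store the result.
    data = parse(raw)
    cache_path.write_text(json.dumps(data), encoding="utf-8")
    return data
```

Note that this sketch does not check whether the raw input changed; it relies only on the age of the cache file.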

That data contains some info about members.
It is structured, but it's in a very cryptic structure which needs to be reverse-engineered.
If you're the kind of person who is into that sort of thing, feel free to add fields there.

If you prefer good old HTML scraping, the scraper also makes requests to all individual member profile pages.
There you can use [Scrapy selectors](https://docs.scrapy.org/en/latest/topics/selectors.html) to add fields to the data.
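
Adding a field from a profile page could look something like the sketch below. The function name `parse_profile` and all the CSS selectors are placeholders, not the real czap.cz markup or the scraper's real callback, so inspect a profile page and adjust them accordingly.

```python
def parse_profile(response):
    """Extract extra fields from a member profile page.

    `response` is the Scrapy response passed to a spider callback.
    The CSS selectors below are placeholders, not the real czap.cz markup.
    """
    return {
        "url": response.url,
        # First <h1> text, or an empty string if the page has none.
        "name": response.css("h1::text").get(default="").strip(),
        # Value of the first tel: link, or None if there is no such link.
        "phone": response.css("a[href^='tel:']::attr(href)").get(),
        # All matching texts as a list (empty if nothing matches).
        "languages": response.css(".languages li::text").getall(),
    }
```

Here `.get()` returns the first match and `.getall()` returns every match, which covers most single-value and multi-value fields.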
