https://github.com/sakhezech/cardscraper
Webscraping tool for generating Anki packages.
- Host: GitHub
- URL: https://github.com/sakhezech/cardscraper
- Owner: sakhezech
- License: mit
- Created: 2023-09-27T00:27:16.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-02-16T01:49:59.000Z (10 months ago)
- Last Synced: 2024-11-16T17:14:03.438Z (about 1 month ago)
- Topics: anki, flashcards, python, scraping
- Language: Python
- Homepage: https://pypi.org/project/cardscraper/
- Size: 101 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# cardscraper
Webscraping tool for generating Anki packages.
## Installation
From [PyPI](https://pypi.org/project/cardscraper/):
```sh
pip install cardscraper
```

From git:
```sh
pip install git+https://github.com/sakhezech/cardscraper
```

## Usage
`cardscraper ...` or `python -m cardscraper ...`
Generate a skeleton input file:
```sh
cardscraper init filename.yaml
```

Edit it with your favorite text editor:
```sh
nvim filename.yaml
```

Generate the package:
```sh
cardscraper gen filename.yaml
```

For more info use `cardscraper -h`.
## Input files
You can generate a skeleton input file by using `cardscraper init filename.yaml`.
Here is a large, self-explanatory example input file:
```yaml
# here you can specify which function to use for each step
# (every one defaults to 'default')
meta:
  # controls package details and package dumping
  package: default
  # controls deck creation
  deck: default
  # controls model creation
  model: default
  # controls scraping and note creation
  scraping: default

# anki package info
package:
  # package name
  name: package_name
  # output folder (defaults to '.')
  output: ./out/
  # media folder (defaults to null)
  # the directory will be walked recursively
  # every pattern matched file will be added to the package as media
  media: ./media/
  # pattern to match files against for media (defaults to **/*.*)
  pattern: "**/*.png"

# anki deck info
deck:
  # deck name
  name: Deck
  # deck id
  # don't forget to make this value unique
  id: 987

# anki model info
model:
  # model name
  name: Model
  # model id
  # don't forget to make this value unique
  id: 321
  # card styling (defaults to '')
  css: |
    .question, .answer {
      text-align: center;
    }
    .question {
      font-size: 5rem;
      font-weight: 700;
    }
    .answer {
      font-size: 3rem;
    }
  # list of cards
  templates:
    # card name
    - name: Front
      # front side
      qfmt: |
        {{Question}}
      # back side
      afmt: |
        {{FrontSide}}
        {{Answer}}
    # same here
    - name: Back
      qfmt: |
        {{Answer}}
      afmt: |
        {{FrontSide}}
        {{Question}}

# scraping info
scraping:
  # list of urls to scrape
  urls:
    - https://www.scrapethissite.com/pages/simple/
  # you can set your own custom user agent (defaults to null)
  agent: Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0
  # list of queries
  # each query selects an html element and lets you use its text in the templates
  # each child query runs inside the parent one
  queries:
    # query name which you can use in the templates like {{Country}}
    - name: Country
      # css selector
      query: .country
      # you can select something specific from the query by providing a regex
      # this is a python regex with re.DOTALL enabled i.e. '.' captures '\n'
      # uses the first captured group
      # (defaults to null)
      regex: null
      # if true: we select every instance and iterate over them
      # if false: we only select the first one
      # basically it's querySelector() vs querySelectorAll()
      # (defaults to false)
      many: true
      children:
        - name: Question
          query: .country-info
          many: false
          regex: (Area .*)$
          children: null
        - name: Answer
          query: .country-name
          many: false
          regex: null
          children: null
```

## Usage in code
cardscraper can be used programmatically, although it is designed primarily as a CLI application.
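If you write your own `get_notes`, you may want to reproduce the `regex` behavior documented in the input file (a Python regex with `re.DOTALL`, keeping the first captured group). The following is a minimal sketch using only the standard library. `apply_regex` is a hypothetical helper, not part of cardscraper's API, and the use of `re.search` as well as returning `None` on a miss are assumptions.

```python
import re
from typing import Optional


def apply_regex(text: str, pattern: Optional[str]) -> Optional[str]:
    """Return the first captured group of `pattern` found in `text`.

    Hypothetical helper for illustration only. A null pattern leaves the
    text unchanged; a failed match returns None (an assumption).
    """
    if pattern is None:
        return text
    # re.DOTALL lets '.' match newlines, as the input file comments describe.
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1) if match else None


if __name__ == '__main__':
    # Made-up element text, loosely shaped like the example site's markup.
    info = 'Capital: Andorra la Vella\nPopulation: 84000\nArea (km2): 468.0'
    print(apply_regex(info, r'(Area .*)$'))
```

How a failed match is handled is a choice made for this sketch; the default scraping function may treat that case differently.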
```py
import yaml
from cardscraper import (
    Config,
    generate_anki_package,
    select_function_by_step_and_name,
    write_package,
)
from genanki import Model, Note

if __name__ == '__main__':
    with open('/path/to/config.yaml', 'r') as f:
        config: Config = yaml.load(f, yaml.Loader)
    # or you can make a config manually

    get_model = select_function_by_step_and_name('model', 'default')
    get_deck = select_function_by_step_and_name('deck', 'default')
    get_package = select_function_by_step_and_name('package', 'default')

    def get_notes(config: Config, model: Model) -> list[Note]:
        notes = []
        ...
        return notes

    package, path = generate_anki_package(
        config, get_model, get_notes, get_deck, get_package
    )
    write_package(package, path)
```

## Plugin system
cardscraper has a plugin system. To make your own functions available to cardscraper, register them in an entry point group named `cardscraper.STEPNAME`, where `STEPNAME` is one of `package`, `deck`, `model`, or `scraping`.
This is how the default functions are exposed:
```toml
[project.entry-points.'cardscraper.model']
default = 'cardscraper.default:get_model'
[project.entry-points.'cardscraper.scraping']
default = 'cardscraper.default:get_notes'
[project.entry-points.'cardscraper.deck']
default = 'cardscraper.default:get_deck'
[project.entry-points.'cardscraper.package']
default = 'cardscraper.default:get_package'
```