https://github.com/rafelafrance/angiospermtraiter
Using rule-based parsers to extract information from plant treatments
- Host: GitHub
- URL: https://github.com/rafelafrance/angiospermtraiter
- Owner: rafelafrance
- License: mit
- Created: 2024-08-16T15:17:59.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-11-21T16:25:21.000Z (2 months ago)
- Last Synced: 2024-11-21T17:30:38.278Z (2 months ago)
- Topics: botany, python, spacy
- Language: Python
- Homepage:
- Size: 746 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# AngiospermTraiter ![Python application](https://github.com/rafelafrance/AngiospermTraiter/workflows/CI/badge.svg)
Extract traits about plants from treatments.
I should also mention that this repository builds upon other repositories:
- `traiter`: This is the base code for all the rule-based parsers (aka traiters) that I write. The details change but the underlying process is the same for all.
- `https://github.com/rafelafrance/traiter`

## What I'm trying to accomplish
**Challenge**: Extract trait information from plant treatments. That is, if I'm given treatment text like: (Reformatted to emphasize targeted traits.)
**TODO**
## Rule-based parsing strategy
1. There is a lot of overlap in trait terms, for example `biseriate` is used for `perianth`, `androecium`, etc. Fortunately, each major plant section has its own paragraph, so I can split the text into paragraphs and parse each separately and with its own vocabulary and patterns.
2. I label terms using spaCy's phrase and rule-based matchers.
3. Then I match terms using rule-based matchers to yield a trait.

For example, given the text: `Gynoecium (1–)3–5(–6) carpelled.`:
- NOTE: Each web page refers to a specific taxonomic unit, in this case a family, so I know the taxon from other information on the page, like the title.

1. First I recognize that this is a text paragraph dealing with gynoecia, so I use a parser tailored for those terms.
    1. The first sentence in the paragraph contains the word `Gynoecium`.
2. I then recognize various other terms in the paragraph.
    1. `(1–)3–5(–6)` is a numeric range term. These are integers and there are no units (like cm), making it a count range and not a measurement range like length or width.
        - `1` = the minimum value seen
        - `3` = the commonly seen low value
        - `5` = the commonly seen high value
        - `6` = the maximum value seen
    2. `carpelled` is a term applied to gynoecia.
3. The parser recognizes the pattern of a count range followed by `carpelled`, and returns a carpel count for this plant taxon.

There are, of course, complications and subtleties not outlined above, but you should get the gist of what is going on here.
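The walkthrough above can be sketched in a few lines of Python. This is only an illustration: it uses the standard library's `re` module in place of spaCy's matchers, and the names `SECTIONS`, `COUNT_RANGE`, `CARPEL_COUNT`, `CountRange`, `section_of`, and `parse_carpel_count` are hypothetical, not taken from this repository's code.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from a paragraph's first word to a plant section.
SECTIONS = {
    "perianth": "perianth",
    "androecium": "androecium",
    "gynoecium": "gynoecium",
}

# A count range such as "(1–)3–5(–6)"; only the low value is required.
# Either an en dash or a plain hyphen is accepted as the separator.
COUNT_RANGE = (
    r"(?:\((?P<min>\d+)[–-]\))?"  # "(1–)"  minimum value seen
    r"(?P<low>\d+)"               # "3"     commonly seen low value
    r"(?:[–-](?P<high>\d+))?"     # "–5"    commonly seen high value
    r"(?:\([–-](?P<max>\d+)\))?"  # "(–6)"  maximum value seen
)

# The trait pattern: a count range followed by the term "carpelled".
CARPEL_COUNT = re.compile(COUNT_RANGE + r"\s+carpelled")


@dataclass
class CountRange:
    min: Optional[int]
    low: int
    high: Optional[int]
    max: Optional[int]


def section_of(paragraph: str) -> Optional[str]:
    """Choose a section from the first word of the paragraph."""
    words = paragraph.split(maxsplit=1)
    if not words:
        return None
    return SECTIONS.get(words[0].strip(".,:;").lower())


def parse_carpel_count(paragraph: str) -> Optional[CountRange]:
    """Return the carpel count for a gynoecium paragraph, if one is found."""
    if section_of(paragraph) != "gynoecium":
        return None
    match = CARPEL_COUNT.search(paragraph)
    if match is None:
        return None
    # Convert each captured group to an int, keeping None for absent parts.
    values = {
        name: int(text) if text is not None else None
        for name, text in match.groupdict().items()
    }
    return CountRange(**values)


print(parse_carpel_count("Gynoecium (1–)3–5(–6) carpelled."))
```

Run on the example sentence, this prints `CountRange(min=1, low=3, high=5, max=6)`. Checking the section first is what allows a shared term to be interpreted with a section-specific vocabulary, as described in the strategy above.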
## Install
You will need to have Python 3.12+ installed, as well as pip, a package manager for Python.
You can install the requirements into your Python environment like so:

```bash
git clone https://github.com/rafelafrance/AngiospermTraiter.git
cd AngiospermTraiter
make install
```

Every time you run any script in this repository, you'll have to activate the virtual environment once at the start of your session.
```bash
cd AngiospermTraiter
source .venv/bin/activate
```

### Extract traits
You'll need to download some treatment web pages, one treatment per downloaded page.
The target data is generously provided in this [zip file](https://www.delta-intkey.com/angio/angiodata.zip) by DELTA IntKey.

Example:
```bash
parse-treatments --treatment-dir /path/to/treatments --json-dir /path/to/output/traits --html-file /path/to/traits.html
```

## Tests
There are tests which you can run like so:
```bash
make test
```