https://github.com/rafelafrance/angiospermtraiter

Using rule-based parsers to extract information from plant treatments

botany python spacy


Host: GitHub
URL: https://github.com/rafelafrance/angiospermtraiter
Owner: rafelafrance
License: mit
Created: 2024-08-16T15:17:59.000Z (5 months ago)
Default Branch: main
Last Pushed: 2024-11-21T16:25:21.000Z (2 months ago)
Last Synced: 2024-11-21T17:30:38.278Z (2 months ago)
Topics: botany, python, spacy
Language: Python
Homepage:
Size: 746 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# AngiospermTraiter ![Python application](https://github.com/rafelafrance/AngiospermTraiter/workflows/CI/badge.svg)

Extract traits about plants from treatments.

This repository builds upon another repository:

- `traiter`: This is the base code for all the rule-based parsers (aka traiters) that I write. The details change but the underlying process is the same for all.
  - https://github.com/rafelafrance/traiter

## What I'm trying to accomplish

**Challenge**: Extract trait information from plant treatments. That is, if I'm given treatment text like: (Reformatted to emphasize targeted traits.)

**TODO**

## Rule-based parsing strategy

1. There is a lot of overlap in trait terms, for example `biseriate` is used for `perianth`, `androecium`, etc. Fortunately, each major plant section has its own paragraph, so I can split the text into paragraphs and parse each separately and with its own vocabulary and patterns.
2. I label terms using spaCy's phrase and rule-based matchers.
3. Then I match terms using rule-based matchers to yield a trait.
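
Step 1 above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain string matching from the standard library instead of spaCy's matchers, and the `SECTION_VOCAB` table, the function names, and the terms other than `biseriate` and `carpelled` are hypothetical, not taken from this repository.

```python
import re

# Hypothetical per-section vocabularies (assumption: the real term lists differ).
# Note that "biseriate" appears in more than one section.
SECTION_VOCAB = {
    "perianth": {"biseriate", "sepaloid", "petaloid"},
    "androecium": {"biseriate"},
    "gynoecium": {"carpelled", "syncarpous"},
}


def split_paragraphs(text):
    """Split treatment text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def section_of(paragraph):
    """Pick the section named in the paragraph's first sentence, or None."""
    first_sentence = re.split(r"(?<=\.)\s+", paragraph, maxsplit=1)[0].lower()
    words = set(re.findall(r"[a-z]+", first_sentence))
    for section in SECTION_VOCAB:
        if section in words:
            return section
    return None


def label_terms(paragraph):
    """Label only the terms that belong to this paragraph's section vocabulary."""
    section = section_of(paragraph)
    if section is None:
        return None, []
    vocab = SECTION_VOCAB[section]
    words = re.findall(r"[a-z]+", paragraph.lower())
    return section, [w for w in words if w in vocab]


text = "Perianth biseriate; sepaloid.\n\nGynoecium (1–)3–5(–6) carpelled."
for paragraph in split_paragraphs(text):
    print(label_terms(paragraph))
```

Because each paragraph is labeled against its own vocabulary, a shared term such as `biseriate` is always interpreted in the context of the section it appears in.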

For example, given the text `Gynoecium (1–)3–5(–6) carpelled.`:

- NOTE: Each web page refers to a specific taxonomic unit, in this case a family, so I know that from other information on the page, like the title.

1. First I recognize that this is a text paragraph dealing with gynoecia, so I use a parser tailored for those terms.
   1. The first sentence in the paragraph contains the word `Gynoecium`.
2. I then recognize the other terms in the paragraph.
   1. `(1–)3–5(–6)` is a numeric range term. The values are integers and there are no units (like cm), making it a count range and not a measurement range like length or width.
      - `1` = the minimum value seen
      - `3` = the commonly seen low value
      - `5` = the commonly seen high value
      - `6` = the maximum value seen
   2. `carpelled` is a term applied to gynoecia.
3. The parser recognizes the pattern of a count range followed by `carpelled`, and returns a carpel count for this plant taxon.

There are, of course, complications and subtleties not outlined above, but you should get the gist of what is going on here.
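
To make the worked example concrete, here is a minimal sketch that parses the count range and the `carpelled` term with a regular expression. It is not how this repository does it (the real parsers use spaCy matchers built on `traiter`); the names `RANGE`, `CARPEL_COUNT`, and `parse_carpel_count` and the output keys are assumptions made for illustration.

```python
import re

# Count range: optional "(min–)", required low, optional "–high", optional "(–max)".
# Both the en dash and the ASCII hyphen are accepted as range separators.
RANGE = r"""
    (?:\(\s*(?P<min>\d+)\s*[–-]\s*\)\s*)?    # (1–)  minimum value seen
    (?P<low>\d+)                             # 3     commonly seen low value
    (?:\s*[–-]\s*(?P<high>\d+))?             # –5    commonly seen high value
    (?:\s*\(\s*[–-]\s*(?P<max>\d+)\s*\))?    # (–6)  maximum value seen
"""

# Trait pattern: a count range immediately followed by the term "carpelled".
CARPEL_COUNT = re.compile(RANGE + r"\s+carpelled\b", re.VERBOSE)


def parse_carpel_count(sentence):
    """Return the carpel count range found in the sentence, or None."""
    match = CARPEL_COUNT.search(sentence)
    if match is None:
        return None
    # Keep only the parts of the range that were actually present.
    return {k: int(v) for k, v in match.groupdict().items() if v is not None}


print(parse_carpel_count("Gynoecium (1–)3–5(–6) carpelled."))
```

A range followed by a unit, such as `3–5 cm`, does not match this pattern, which mirrors the distinction between count ranges and measurement ranges described above.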

## Install

You will need Python 3.12+ installed, as well as pip, a package manager for Python.
You can install the requirements into your Python environment like so:

```bash
git clone https://github.com/rafelafrance/AngiospermTraiter.git
cd AngiospermTraiter
make install
```

Before you run any script in this repository, activate the virtual environment once at the start of each session.

```bash
cd AngiospermTraiter
source .venv/bin/activate
```

### Extract traits

You'll need to download some treatment web pages, one treatment per downloaded page.
The target data is generously provided in this [zip file](https://www.delta-intkey.com/angio/angiodata.zip) by DELTA IntKey.

Example:

```bash
parse-treatments --treatment-dir /path/to/treatments --json-dir /path/to/output/traits --html-file /path/to/traits.html
```

## Tests

You can run the tests like so:

```bash
make test
```