https://github.com/trendmicro/nlp-securespacy
- Host: GitHub
- URL: https://github.com/trendmicro/nlp-securespacy
- Owner: trendmicro
- License: other
- Created: 2024-04-26T07:12:10.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-05-08T18:33:53.000Z (about 2 years ago)
- Last Synced: 2025-03-16T06:26:53.459Z (about 1 year ago)
- Language: Python
- Size: 200 KB
- Stars: 2
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# securespacy
`securespacy` is a Python module that contains our custom tokenizer and named entity extractor for spaCy v3. It can extract the following named entities:
- IP
- URL
- DOMAIN
- EMAIL
- MALWARE
- CVE
- HASH
- INTRUSION_SET
- CITY
- COUNTRY
`securespacy` uses spaCy's **Entity Ruler**, a rule-based matching component, to extract additional named entities from text. In other words, it applies regular expressions and other static rules to detect entities, complementing spaCy's named entity recognition (NER), which relies on trained language models.
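To make the idea of static rules concrete, here is a minimal standard-library sketch of regex-based detection for two of the labels listed above. The patterns and the `find_entities` helper are illustrative assumptions, not the rules that `securespacy` actually ships, and the sketch does not use spaCy at all.

```python
import re

# Illustrative patterns only (assumptions); securespacy defines its own rules.
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
RULES = [
    ("CVE", re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE)),
    ("IP", re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")),
]


def find_entities(text):
    """Return (label, matched_text, start, end) tuples sorted by position."""
    hits = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            hits.append((label, match.group(0), match.start(), match.end()))
    return sorted(hits, key=lambda hit: hit[2])


sample = "Host 10.231.31.8 was exploited via CVE-2021-44228."
for label, value, _start, _end in find_entities(sample):
    print(f"{label:<15} {value}")
```

Rules like these are precise for well-structured indicators, which is why they pair well with a trained model that handles free-form names.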
## Installation
```bash
pip install git+https://github.com/trendmicro/NLP-SecureSpacy.git
```
## Usage
```python
import spacy
import securespacy
from securespacy import tagger
from securespacy.tokenizer import custom_tokenizer
from securespacy.patterns import add_entity_ruler_pipeline

text = ('The quick brown fox owns the domain quickbrownfox[.]sh with the ip address 10.231.31.8 '
        'with the server located in Manila, Philippines.')

# Load a spaCy model, swap in the custom tokenizer, then add the entity ruler.
nlp = spacy.load("en_core_web_sm")
nlp.tokenizer = custom_tokenizer(nlp)
add_entity_ruler_pipeline(nlp)

doc = nlp(text)
for ent in doc.ents:
    print(f"{ent.label_:<15} {ent}")

# Output:
# DOMAIN          quickbrownfox[.]sh
# IP              10.231.31.8
# CITY            Manila
# COUNTRY         Philippines
```
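Downstream code often wants the extracted entities grouped by label rather than printed one by one. The following is a minimal sketch of that step; `group_by_label` is a hypothetical helper, and the `pairs` list is a stand-in for `[(ent.label_, ent.text) for ent in doc.ents]`, filled with the values from the example output above so that it runs without spaCy installed.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Collect entity texts under their label, dropping exact duplicates."""
    grouped = defaultdict(list)
    for label, value in pairs:
        if value not in grouped[label]:
            grouped[label].append(value)
    return dict(grouped)


# Stand-in for [(ent.label_, ent.text) for ent in doc.ents].
pairs = [
    ("DOMAIN", "quickbrownfox[.]sh"),
    ("IP", "10.231.31.8"),
    ("CITY", "Manila"),
    ("COUNTRY", "Philippines"),
]
indicators = group_by_label(pairs)
print(indicators["IP"])
```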
## Flair Wrapper
`securespacy` can also be used with Flair. The API is slightly different.
**N.B.** To speed up `phrase_matcher()`, a dictionary is cached in `~/.tokenized_matcher.pickle`.
Delete this file to regenerate it whenever the dictionary files are updated (usually when you update SecureSpacy).
```python
from flair.models import SequenceTagger
from flair.data import Sentence
from securespacy.flair import SecureSpacyFlairWrapper

# Load Flair's pretrained NER tagger.
tagger = SequenceTagger.load('ner')
text = 'We were able to find a second variant (detected as Trojan.MacOS.GMERA.B) that was uploaded to VirusTotal.'

wrapper = SecureSpacyFlairWrapper()
# Tokenize with the securespacy tokenizer so security-specific tokens stay intact.
sentence = Sentence(text, use_tokenizer=wrapper.tokenizer)

# Run Flair's model first, then add the rule-based matches.
tagger.predict(sentence)
wrapper.phrase_matcher(sentence)

for ent in sentence.get_spans('ner'):
    print(ent)
```
The type of `sentence` is `flair.data.Sentence`.
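The cache file mentioned in the note above can be removed by hand or with a few lines of Python. This is a minimal sketch: the path comes from the note, while `clear_matcher_cache` is a hypothetical helper and not part of `securespacy`.

```python
from pathlib import Path

# Path taken from the note above; the helper name is an assumption.
DEFAULT_CACHE = Path.home() / ".tokenized_matcher.pickle"


def clear_matcher_cache(cache_path=DEFAULT_CACHE):
    """Delete the cached dictionary if present; return True when a file was removed."""
    cache_path = Path(cache_path)
    if cache_path.is_file():
        cache_path.unlink()
        return True
    return False


if __name__ == "__main__":
    removed = clear_matcher_cache()
    print("cache removed" if removed else "no cache file found")
```

The next call to `phrase_matcher()` should then rebuild the file, as described in the note.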
## References
- https://spacy.io/usage/rule-based-matching#entityruler
## Maintenance
To import the latest data from [MITRE ATT&CK Techniques](https://github.com/mitre-attack/attack-stix-data/tree/master/enterprise-attack), download the latest JSON and run
`./src/securespacy/data/convert-mitre-enterprise.py`. Review the converted files manually before merging them, because
short software names (such as `Net`, `at`, `at.exe`) can cause false classifications.
Then merge `mitre-malware.txt` into the case-sensitive list `malware-cased.txt`.
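The manual pass can be made easier by flagging short names up front. The sketch below is one possible approach, not part of the repository: `flag_short_names` and its `max_len` threshold are assumptions, and the sample list is illustrative data rather than real converter output.

```python
def flag_short_names(names, max_len=3):
    """Split names into (keep, review); short entries go to review.

    max_len is an assumed threshold. A trailing extension is ignored,
    so "at.exe" is judged by "at".
    """
    keep, review = [], []
    for raw in names:
        name = raw.strip()
        if not name:
            continue  # skip blank lines
        stem = name.rsplit(".", 1)[0]
        (review if len(stem) <= max_len else keep).append(name)
    return keep, review


# Illustrative entries; in practice, read them from the converted file.
converted = ["Net", "at", "at.exe", "Mimikatz", "Cobalt Strike"]
keep, review = flag_short_names(converted)
print("review manually:", review)
```

Entries in `review` still need a human decision; the helper only narrows down where to look.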
## License
See LICENSE.