https://github.com/trendmicro/nlp-securespacy

Last synced: 12 months ago
JSON representation

Host: GitHub
URL: https://github.com/trendmicro/nlp-securespacy
Owner: trendmicro
License: other
Created: 2024-04-26T07:12:10.000Z (about 2 years ago)
Default Branch: main
Last Pushed: 2024-05-08T18:33:53.000Z (about 2 years ago)
Last Synced: 2025-03-16T06:26:53.459Z (about 1 year ago)
Language: Python
Size: 200 KB
Stars: 2
Watchers: 1
Forks: 1
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

README

          # securespacy

`securespacy` is a Python module that contains our custom tokenizer and named entity extractor for Spacy v3. The following named entities can be extracted by using `securespacy`:

- IP

- URL

- DOMAIN

- EMAIL

- MALWARE

- CVE

- HASH

- INTRUSION_SET

- CITY

- COUNTRY

`securespacy` uses Spacy's **Entity Ruler**, which is a rules-based matching approach in order to extract additional named entities from the text. In other words, this is a fancy way of saying that we're using regex and other static rules to detect entities, in order to complement Spacy's named entity recognition (NER) that uses trained language models.

## Installation

```bash

pip install git+https://github.com/trendmicro/NLP-SecureSpacy.git

```

## Usage

```

import spacy

import securespacy

from securespacy import tagger

from securespacy.tokenizer import custom_tokenizer

from securespacy.patterns import add_entity_ruler_pipeline

text = ('The quick brown fox owns the domain quickbrownfox[.]sh with the ip address 10.231.31.8 '

'with the server located in Manila, Philippines.')

nlp = spacy.load("en_core_web_sm")

nlp.tokenizer = custom_tokenizer(nlp)

add_entity_ruler_pipeline(nlp)

doc = nlp(text)

for ent in doc.ents:

    print(f"{ent.label_:<15} {ent}")

DOMAIN          quickbrownfox[.]sh

IP              10.231.31.8

CITY            Manila

COUNTRY         Philippines

```

## Flair Wrapper

securespacy can be used with Flair. The API is slightly different.

**N.B.** In order to accelerate `phrase_matcher()`, a dictionary will be written in `~/.tokenized_matcher.pickle`.

Delete the file to regenerate it when dictionary files are updated (usually when you update SecureSpacy.)

```python

from flair.models import SequenceTagger

from flair.data import Sentence

from securespacy.flair import SecureSpacyFlairWrapper

tagger = SequenceTagger.load('ner')

text = 'We were able to find a second variant (detected as Trojan.MacOS.GMERA.B) that was uploaded to VirusTotal.'

wrapper = SecureSpacyFlairWrapper()

sentence = Sentence(text, use_tokenizer=wrapper.tokenizer)

model.predict(sentence)

wrapper.phrase_matcher(sentence)

for ent in sentence.get_spans('ner'):

    print(ent)

```

The type of sentence is `flair.data.sentence`.

## References

- https://spacy.io/usage/rule-based-matching#entityruler

## Maintenance

To import the latest data from [MITRE ATT&CK Techniques](https://github.com/mitre-attack/attack-stix-data/tree/master/enterprise-attack), download the latest JSON and run

`./src/securespacy/data/convert-mitre-enterprise.py`. Do a manual pass before mergin the converted files, as

short software names (such as `Net`, `at`, `at.exe`) can cause false classifications.

Merge `mitre-malware.txt` into the case-sensitive list `malware-cased.txt`.

## License

See LICENSE.

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/trendmicro/nlp-securespacy

Awesome Lists containing this project

README