https://github.com/pvnieo/searchy
Implementation of a search engine on the cacm and CS276 (Stanford) collections.
boolean-search cacm python-3 search-engine stanford-corpus vector-space-model
- Host: GitHub
- URL: https://github.com/pvnieo/searchy
- Owner: pvnieo
- Created: 2017-12-17T15:31:04.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2019-08-17T12:22:47.000Z (over 5 years ago)
- Last Synced: 2024-08-27T14:16:51.470Z (5 months ago)
- Topics: boolean-search, cacm, python-3, search-engine, stanford-corpus, vector-space-model
- Language: Jupyter Notebook
- Homepage:
- Size: 36.3 MB
- Stars: 4
- Watchers: 2
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Search engine
[![Build Status](https://travis-ci.org/pvnieo/searchy.svg?branch=master)](https://travis-ci.org/pvnieo/searchy)
Implementation of a search engine for a collection of files.
## Installation
Searchy runs on Python >= 3.6. Use pip to install the dependencies:
```
pip3 install -r requirements.txt
```
Install the resources required by nltk with the following command:
```
python3 -c "import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet');"
```
## Usage
Use the `searchy.py` script to index a collection:
```
usage: searchy.py [-h] [-q QUERY] [-m {bool,vect}]
                  [-n {cos,dice,jaccard,overlap}] [-t THRESHOLD]
                  [-w {f,tfidf,nf}] [-s] [-f] [--no-cache]
                  collection

Builds a search engine on a collection of documents

positional arguments:
  collection            Path to collection file (CACM format), directory or
                        url to zip

optional arguments:
  -h, --help            show this help message and exit
  -q QUERY, --query QUERY
                        Execute a search query
  -m {bool,vect}, --model {bool,vect}
                        Search engine model
  -n {cos,dice,jaccard,overlap}, --norm {cos,dice,jaccard,overlap}
                        Vectorial search norm
  -t THRESHOLD, --threshold THRESHOLD
                        Vectorial search norm threshold
  -w {f,tfidf,nf}, --weighting {f,tfidf,nf}
                        Vectorial weighting method
  -s, --silent          Disable verbose mode
  -f, --force           Force re-indexing overwrite cache
  --no-cache            Disable disk cache
```
## Usage example
### Vector model
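The vector model scores each document against the query using one of the `--norm` measures listed in the usage text (`cos`, `dice`, `jaccard`, `overlap`). As a rough illustration, the sketch below applies the textbook vector forms of these four measures to sparse term-weight vectors. It is not taken from the searchy source: the function names `weights`, `dot`, `sq_norm` and `similarity` are hypothetical, raw term counts are used as weights purely for simplicity, and the exact formulas searchy applies may differ.

```python
import math
from collections import Counter


def weights(tokens):
    """Hypothetical weighting: raw term counts for a list of tokens."""
    return Counter(tokens)


def dot(u, v):
    """Dot product of two sparse vectors (dicts mapping term -> weight)."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def sq_norm(u):
    """Squared Euclidean norm of a sparse vector."""
    return sum(w * w for w in u.values())


def similarity(u, v, norm="cos"):
    """Similarity between two sparse vectors under the chosen measure."""
    d, nu, nv = dot(u, v), sq_norm(u), sq_norm(v)
    if nu == 0 or nv == 0:
        return 0.0  # an empty vector matches nothing
    if norm == "cos":
        return d / math.sqrt(nu * nv)
    if norm == "dice":
        return 2 * d / (nu + nv)
    if norm == "jaccard":
        return d / (nu + nv - d)
    if norm == "overlap":
        return d / min(nu, nv)
    raise ValueError(f"unknown norm: {norm}")


query = weights("proofs of programs".split())
doc = weights("proofs of theorems and programs".split())
for name in ("cos", "dice", "jaccard", "overlap"):
    print(name, round(similarity(query, doc, name), 4))
```

A threshold such as the one set with `--threshold` would then presumably drop documents whose score falls below it, although that behaviour is an assumption here.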
Queries are plain sentences. Here we search the CACM collection.
```
$ ./searchy.py data/CACM/cacm.all
```
```
Loading data/CACM/cacm.all
Using cache 64f76a63
documents 3204
tokens 113754
terms 5961
memory: 0.42 mb
🔍 > Processes and Proofs of Theorems and Programs
-----
3079. An Algorithm for Reasoning About Equality [93.99%]
-----
.T
An Algorithm for Reasoning About Equality
.W
A simple technique for reasoning about equalities
that is fast and complete for ground formulas
...
-----
3140. Social Processes and Proofs of Theorems and Programs [93.87%]
-----
.T
Social Processes and Proofs of Theorems and Programs
.W
It is argued that formal verifications of
programs, no matter how obtained, will not play the
same key role in the development of computer science and software
engineering as proofs do in mathematics. Furthermore the absence
...

total results: 260 2.94 s
```
To load the Stanford collection quickly, you can download it and extract it into the `dumps/pa1-data/pa1-data` folder, giving a structure similar to:
```
dumps/pa1-data/pa1-data/0
dumps/pa1-data/pa1-data/1
...
dumps/pa1-data/pa1-data/9
```
Then load it with searchy:
```
$ ./searchy.py dumps/pa1-data
```
Alternatively, you can pass the URL directly as the argument, which performs the previous steps automatically.
```
$ ./searchy.py http://web.stanford.edu/class/cs276/pa/pa1-data.zip
```
### Boolean model
Queries must follow this boolean format: `(mot1 & mot2) | ~mot3`, where `mot1`, `mot2` and `mot3` are search terms.
The allowed boolean operators are `&` (and), `|` (or) and `~` (negation).
```
$ ./searchy.py -m bool data/CACM/cacm.all
```
```
Loading data/CACM/cacm.all
Using cache 64f76a63
documents 3204
tokens 113754
terms 5961
memory: 0.42 mb
🔍 > processes & Proofs & theorems & programs
-----
3140. Social Processes and Proofs of Theorems and Programs [100.00%]
-----
.T
Social Processes and Proofs of Theorems and Programs
.W
It is argued that formal verifications of
programs, no matter how obtained, will not play the
same key role in the development of computer science and software
engineering as proofs do in mathematics. Furthermore the absence
of continuity, the inevitability of change, and the complexity of
specification of significantly many real programs make the form
al verification process difficult to justify and manage. It is felt
that ease of formal verification should not dominate program
language design.
.K
Formal mathematics, mathematical proofs,
program verification, program specification
2.10 4.6 5.24

total results: 1 2.96 s
```
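To make the boolean query format concrete, here is a minimal sketch of how an expression using `&`, `|` and `~` could be evaluated against an inverted index. It is an illustration only, not searchy's actual implementation: the names `tokenize` and `evaluate`, the toy index, the lowercasing of terms, and the precedence (`~` binds tighter than `&`, which binds tighter than `|`) are all assumptions.

```python
import re


def tokenize(query):
    """Split a query into operators, parentheses and lowercased terms."""
    return re.findall(r"[()&|~]|\w+", query.lower())


def evaluate(query, index, all_docs):
    """Evaluate a boolean query against an inverted index (term -> set of ids)."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        result = parse_and()
        while peek() == "|":
            advance()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == "&":
            advance()
            result = result & parse_not()
        return result

    def parse_not():
        if peek() == "~":
            advance()
            # Negation is the complement within the whole collection.
            return all_docs - parse_not()
        return parse_atom()

    def parse_atom():
        if peek() is None:
            raise ValueError("unexpected end of query")
        tok = advance()
        if tok == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return result
        if tok in {")", "&", "|"}:
            raise ValueError(f"unexpected token: {tok!r}")
        return set(index.get(tok, set()))

    result = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return result


# Toy inverted index with made-up postings, for illustration only.
index = {
    "processes": {10, 3140},
    "proofs": {3079, 3140},
    "theorems": {3140},
    "programs": {10, 3079, 3140},
}
all_docs = {7, 10, 3079, 3140}
print(evaluate("processes & Proofs & theorems & programs", index, all_docs))
```

A recursive-descent parser keeps the operator precedence explicit; a real engine would also normalise terms (for example stemming or stop-word removal) the same way at indexing and query time.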