https://github.com/abdullah0297445/jyi

This repository holds a Python script that scrapes research articles from jyi.org and finds which articles are most similar to each other.

Host: GitHub
URL: https://github.com/abdullah0297445/jyi
Owner: Abdullah0297445
License: mit
Created: 2019-03-19T13:29:54.000Z (almost 6 years ago)
Default Branch: master
Last Pushed: 2019-03-19T15:52:02.000Z (almost 6 years ago)
Last Synced: 2024-11-08T16:20:31.475Z (2 months ago)
Topics: cosine-similarity, nltk, pandas, python3, selenium, selenium-python, webscraping
Language: Python
Homepage:
Size: 158 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# JYI

This repository holds a Python script that scrapes research articles from jyi.org and finds which articles are most similar to each other.

Each article is represented with a TF-IDF model, and cosine similarity between the resulting vectors measures how similar two articles are.
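
To make the TF-IDF and cosine similarity idea concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the repository's actual code: the function names, the whitespace tokenizer, the smoothed IDF formula, and the sample texts are all made up for this example.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse {term: weight} TF-IDF vector per document."""
    # Assumption: lowercase whitespace tokenization is good enough for a demo.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            # Smoothed inverse document frequency keeps every weight positive.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_pair(docs):
    """Return (i, j, score) for the pair of documents with the highest similarity."""
    vectors = tfidf_vectors(docs)
    return max(
        ((i, j, cosine_similarity(vectors[i], vectors[j]))
         for i, j in combinations(range(len(docs)), 2)),
        key=lambda triple: triple[2],
    )


# Made-up article texts, for illustration only.
articles = [
    "gene expression in yeast cells",
    "yeast gene regulation and expression",
    "galaxy formation in the early universe",
]
i, j, score = most_similar_pair(articles)
print(f"Most similar: articles {i} and {j} (cosine similarity {score:.2f})")
```

In practice, scikit-learn (listed in the requirements below) provides `TfidfVectorizer` and `cosine_similarity` for the same computation, with proper tokenization and stop-word handling.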

# About JYI

JYI is a student-led initiative to broaden the undergraduate scientific experience, allowing students to participate in the scientific review and publication processes of its peer-reviewed undergraduate journal. Incorporated as a non-profit, student-run corporation, JYI represents over 50 different academic institutions from over half a dozen countries.

# Requirements
This project depends on the following Python packages, so install them first.

1. Pandas
```shell
pip install pandas
```
2. NLTK
```shell
pip install nltk
```
After installing NLTK, download the text data it provides, such as stopwords.
You can do that in 3 simple steps:

1. Open CMD.

2. Type `python` at the prompt to start a Python interpreter.

3. Enter these two lines of code at the interpreter prompt.

```python
import nltk
nltk.download('all')
```
Wait until the download finishes.

3. BeautifulSoup
```shell
pip install beautifulsoup4
```
4. Scikit-Learn
```shell
pip install scikit-learn
```
5. Selenium
```shell
pip install selenium
```
After installing Selenium, download the Chrome WebDriver (ChromeDriver) from:
http://chromedriver.chromium.org/downloads
Choose the ChromeDriver version that matches the version of your Google Chrome browser.
Then add the downloaded ChromeDriver EXE to your PATH variable.
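
A quick way to confirm the PATH step worked is to look the executable up from Python. This is only a sketch using the standard library; the helper name `find_chromedriver` is hypothetical and not part of this repository.

```python
import shutil


def find_chromedriver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    # shutil.which searches PATH (and honors PATHEXT on Windows, so
    # "chromedriver" also matches "chromedriver.exe").
    return shutil.which(name)


path = find_chromedriver()
if path is None:
    print("chromedriver was not found on PATH")
else:
    print(f"chromedriver found at {path}")
```

If the lookup reports nothing, open a new command prompt after editing PATH, since already-open prompts keep the old value.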

That's all for the dependencies.

# Usage

    usage: similarityscript.py [-h] [-i INPUTFILE] [-s SHEET] [-o OUTPUTFILE]

    optional arguments:
      -h, --help            show this help message and exit
      -i INPUTFILE, --inputfile INPUTFILE
                            Specify the input xlsx file path. E.g. C:\user\downloads\Excel.xlsx
      -s SHEET, --sheet SHEET
                            Specify the sheet name in xlsx file. E.g. Dataset1
      -o OUTPUTFILE, --outputfile OUTPUTFILE
                            Specify the directory you want to save output xlsx file. E.g. C:\user\downloads\

![](img/Example%20Usage.jpg)

If no input file is specified, the script looks for a file named "input.xlsx" in the script's directory (the current directory).

If no sheet name is specified, "Dataset1" is used as the default sheet name of the input xlsx file.

If no output path is specified, the output file is placed in the same folder as the script.
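
The options and defaults described above could be wired up with `argparse` roughly as follows. This is a sketch of one possible parser, not the script's actual source: the helpers `script_dir` and `build_parser` are hypothetical names, and the help strings are shortened.

```python
import argparse
import os
import sys


def script_dir():
    """Directory of the running script, falling back to the current directory."""
    script = sys.argv[0] if sys.argv and sys.argv[0] else ""
    return os.path.dirname(os.path.abspath(script)) if script else os.getcwd()


def build_parser():
    """Build a parser with the three options and the documented defaults."""
    parser = argparse.ArgumentParser(prog="similarityscript.py")
    parser.add_argument(
        "-i", "--inputfile",
        default=os.path.join(script_dir(), "input.xlsx"),
        help="Specify the input xlsx file path.",
    )
    parser.add_argument(
        "-s", "--sheet",
        default="Dataset1",
        help="Specify the sheet name in xlsx file.",
    )
    parser.add_argument(
        "-o", "--outputfile",
        default=script_dir(),
        help="Specify the directory you want to save output xlsx file.",
    )
    return parser


# Parsing an explicit argument list shows how omitted options get defaults.
args = build_parser().parse_args(["-s", "Dataset1"])
print(args.inputfile, args.sheet, args.outputfile)
```

Passing an empty list to `parse_args` yields all three defaults, which mirrors running the script with no options.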

###### Example input and output XLSX files are included alongside the Python script.
This script has been tested on Windows 10 with Python 3.6.2.