Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/george-gca/ai_papers_analysis
Do some analysis based on main AI conferences
- Host: GitHub
- URL: https://github.com/george-gca/ai_papers_analysis
- Owner: george-gca
- License: mit
- Created: 2022-11-03T20:53:17.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2024-10-28T20:24:57.000Z (2 months ago)
- Last Synced: 2024-10-28T21:26:35.242Z (2 months ago)
- Topics: conferences, data-analysis, fasttext, fasttext-embeddings, fasttext-python, python, scikit-learn, top2vec
- Language: Python
- Homepage: https://www.comet.com/george-gca/ai-papers/
- Size: 367 KB
- Stars: 0
- Watchers: 4
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# AI Papers Analysis
Trying to understand trends in the latest AI papers.
## Requirements
[Docker](https://www.docker.com/) or, for local installation:
- Python 3.11+
- [Poetry](https://python-poetry.org/docs/)

> Note: Poetry installation is currently not working due to [a bug when installing fasttext](https://github.com/facebookresearch/fastText/pull/1292).
## Usage
To make it easier to run the code, with or without Docker, I created a few helpers. Both ways use `start_here.sh` as an entry point. Since there are a few quirks when calling the specific code, this file gathers all the necessary commands to run it. All you need to do is uncomment the relevant lines and run the script:
```bash
# uncomment the steps you want to run
cluster_conferences=1
find_words_usage_over_conf=1
```

### Running without Docker
You first need to install [Python Poetry](https://python-poetry.org/docs/). Then, you can install the dependencies and run the code:
```bash
poetry install
bash start_here.sh
```

### Running with Docker
To help with the Docker setup, I created a `Dockerfile` and a `Makefile`. The `Dockerfile` contains all the instructions to create the Docker image. The `Makefile` contains the commands to build the image, run the container, and run the code inside the container. To build the image, simply run:
```bash
make
```

To call `start_here.sh` inside the container, run:
```bash
make run
```

## Data

The data used in this project is generated by running [AI Papers Search Tool](https://github.com/george-gca/ai_papers_search_tool). We need both the `data/` and `model_data/` directories it produces.
## Code Explanation
All the analysis is based on the papers' abstracts. It uses the [fasttext](https://fasttext.cc/) library to build paper representations, [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) to reduce the dimensionality of the data, and [k-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) to cluster the papers.
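A minimal sketch of that pipeline, not the repository's actual code: the input file `abstracts.txt` (one abstract per line) and the cluster count of 10 are placeholder assumptions.

```python
import fasttext
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

# Train an unsupervised fastText model on the abstracts (one per line).
model = fasttext.train_unsupervised("abstracts.txt", model="skipgram")

with open("abstracts.txt") as f:
    abstracts = [line.strip() for line in f if line.strip()]

# Represent each paper by the fastText sentence vector of its abstract.
embeddings = np.array([model.get_sentence_vector(a) for a in abstracts])

# Cluster the papers with k-means in the original embedding space.
labels = KMeans(n_clusters=10, random_state=42).fit_predict(embeddings)

# Project to 2D with t-SNE for visualization only.
points = TSNE(n_components=2, random_state=42).fit_transform(embeddings)
```

Note that in this sketch clustering runs on the full-dimensional vectors; t-SNE is used only for plotting, since distances in the reduced space are distorted.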
### cluster_conference_papers.py
This script clusters the papers from a specific conference/year.
### cluster_conference_words.py
This script clusters the words from a specific conference/year.
### cluster_filtered_papers.py
This script clusters the papers that contain a specific word or similar words.
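The exact matching logic lives in the script itself; as a hedged illustration, "similar words" could come from fastText's nearest-neighbor lookup (the query word `transformer` and `k=10` below are arbitrary placeholders):

```python
import fasttext

model = fasttext.train_unsupervised("abstracts.txt", model="skipgram")
with open("abstracts.txt") as f:
    abstracts = [line.strip() for line in f if line.strip()]

# Expand the query word with its nearest fastText neighbors.
query = "transformer"
neighbors = model.get_nearest_neighbors(query, k=10)  # [(score, word), ...]
similar_words = {query} | {word for _, word in neighbors}

# Keep only the abstracts that mention the query or a similar word.
filtered = [a for a in abstracts if similar_words & set(a.lower().split())]
```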
## Visualizing Data
The best way to visualize the embeddings is through the [Embedding Projector](https://projector.tensorflow.org/), which I use inside [Comet](https://www.comet.ml/). If you want to use Comet, just create a file named `.comet.config` in the root folder of this repository and add the following lines:
```config
[comet]
api_key=YOUR_API_KEY
```

An example of these experiments logged in Comet can be found [here](https://www.comet.com/george-gca/ai-papers/). To visualize the embeddings, click on the experiment on the left, then navigate to `Assets & Artifacts`, open the `embeddings` directory, and click `Open in Embedding Projector`.
![How to open embedding](img/comet_embedding.png)
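For completeness, here is a minimal sketch of how embeddings can be logged to Comet so they show up under `Assets & Artifacts`. It assumes the `comet_ml` SDK's `Experiment.log_embedding` method and uses random placeholder data; the repository's actual logging code may differ.

```python
from comet_ml import Experiment  # reads the API key from .comet.config

import numpy as np

experiment = Experiment(project_name="ai-papers")

# Placeholder data; in practice these come from the clustering step.
embeddings = np.random.rand(100, 100).astype("float32")
labels = np.random.randint(0, 10, size=100)

# Comet exposes logged embeddings to the Embedding Projector
# under the experiment's Assets & Artifacts tab.
experiment.log_embedding(vectors=embeddings, labels=labels.tolist())
experiment.end()
```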
## TODO
- create n-grams from the abstracts before clustering