{"id":16548655,"url":"https://github.com/george-gca/ai_papers_analysis","last_synced_at":"2026-04-29T10:05:08.799Z","repository":{"id":150309485,"uuid":"561505217","full_name":"george-gca/ai_papers_analysis","owner":"george-gca","description":"Do some analysis based on main AI conferences","archived":false,"fork":false,"pushed_at":"2024-10-28T20:24:57.000Z","size":376,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-06-10T21:10:53.943Z","etag":null,"topics":["conferences","data-analysis","fasttext","fasttext-embeddings","fasttext-python","python","scikit-learn","top2vec"],"latest_commit_sha":null,"homepage":"https://www.comet.com/george-gca/ai-papers/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/george-gca.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-11-03T20:53:17.000Z","updated_at":"2024-10-28T20:25:00.000Z","dependencies_parsed_at":"2023-11-15T17:35:55.186Z","dependency_job_id":"df3c6a03-f1a3-4137-9880-c18c9a585b45","html_url":"https://github.com/george-gca/ai_papers_analysis","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/george-gca/ai_papers_analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/george-gca%2Fai_papers_analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/george-gca%2Fai_papers_analysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/george-gca%2Fai_papers_analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/george-gca%2Fai_papers_analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/george-gca","download_url":"https://codeload.github.com/george-gca/ai_papers_analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/george-gca%2Fai_papers_analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32420377,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T06:29:02.080Z","status":"ssl_error","status_checked_at":"2026-04-29T06:29:00.631Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["conferences","data-analysis","fasttext","fasttext-embeddings","fasttext-python","python","scikit-learn","top2vec"],"created_at":"2024-10-11T19:26:37.614Z","updated_at":"2026-04-29T10:05:08.765Z","avatar_url":"https://github.com/george-gca.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AI Papers Analysis\n\nTrying to understand trends in the latest AI papers.\n\n## Requirements\n\n[Docker](https://www.docker.com/) or, for local installation:\n\n- Python 3.11+\n- [Poetry](https://python-poetry.org/docs/)\n\n\u003e Note: Poetry installation currently not working due to [a bug when installing fasttext](https://github.com/facebookresearch/fastText/pull/1292).\n\n## Usage\n\nTo make it easier to run the code, with or without Docker, I created a few helpers. Both ways use `start_here.sh` as an entry point. Since there are a few quirks when calling the specific code, I created this file with all the necessary commands to run the code. All you need to do is to uncomment the relevant lines and run the script:\n\n```bash\ncluster_conferences=1\nfind_words_usage_over_conf=1\n```\n\n### Running without Docker\n\nYou first need to install [Python Poetry](https://python-poetry.org/docs/). Then, you can install the dependencies and run the code:\n\n```bash\npoetry install\nbash start_here.sh\n```\n\n### Running with Docker\n\nTo help with the Docker setup, I created a `Dockerfile` and a `Makefile`. The `Dockerfile` contains all the instructions to create the Docker image. The `Makefile` contains the commands to build the image, run the container, and run the code inside the container. To build the image, simply run:\n\n```bash\nmake\n```\n\nTo call `start_here.sh` inside the container, run:\n\n```bash\nmake run\n```\n\n## Data\n\nThe data used in this project is the result from running [AI Papers Search Tool](https://github.com/george-gca/ai_papers_search_tool). We need both the `data/` and `model_data/` directories.\n\n## Code Explanation\n\nAll the work is done based on the abstracts of the papers. It uses the [fasttext](https://fasttext.cc/) library to build paper representations, [t-SNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) to reduce the dimensionality of the data, and [k-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) to cluster the papers.\n\n### cluster_conference_papers.py\n\nThis script clusters the papers from a specific conference/year.\n\n### cluster_conference_words.py\n\nThis script clusters the words from a specific conference/year.\n\n### cluster_filtered_papers.py\n\nThis script clusters the papers that contain a specific word or similar words.\n\n## Visualizing Data\n\nThe best way to visualize the embeddings is through the [Embedding Projector](https://projector.tensorflow.org/), which I use inside [Comet](https://www.comet.ml/). If you want to use Comet, just create a file named `.comet.config` in the root folder here, and add the following lines:\n\n```config\n[comet]\napi_key=YOUR_API_KEY\n```\n\nAn example of these experiments logged in Comet can be found [here](https://www.comet.com/george-gca/ai-papers/). To visualize the embeddings, click on the experiment on the left, then navigate to `Assets \u0026 Artifacts`, open the `embeddings` directory, and click `Open in Embedding Projector`.\n\n![How to open embedding](img/comet_embedding.png)\n\n## TODO\n- create n-gram from abstracts before clustering\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeorge-gca%2Fai_papers_analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgeorge-gca%2Fai_papers_analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgeorge-gca%2Fai_papers_analysis/lists"}