https://github.com/kernix13/github-readme-seo-analysis
A Jupyter Notebook GitHub README and Repo SEO Analysis to determine what makes a repo rank in the SERPs
accessibility data-analysis readme seo seo-analysis
- Host: GitHub
- URL: https://github.com/kernix13/github-readme-seo-analysis
- Owner: Kernix13
- Created: 2026-04-19T17:23:33.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2026-05-11T17:13:21.000Z (about 1 month ago)
- Last Synced: 2026-05-11T19:15:47.622Z (about 1 month ago)
- Topics: accessibility, data-analysis, readme, seo, seo-analysis
- Language: Jupyter Notebook
- Homepage:
- Size: 354 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Code of conduct: CODE_OF_CONDUCT.md
README
# GitHub README SEO: Data Analysis of What Makes Repos Rank
This project performs a GitHub README SEO Analysis using Jupyter Notebook and data from GitHub Explore & Google to determine the metrics needed to rank in the SERPs.
> [!NOTE]
> I am new to Data Analysis and Jupyter Notebook. I am in the early stages of this analysis and it will take me a long time to finish unless I get help.
## Overview
The goal of this project is to understand why certain GitHub repositories rank in Google & GitHub Explore search results while others do not.
I collected a dataset of repositories using 46 search phrases and recorded both Google rankings and GitHub Explore rankings. For each repository, I also gathered metrics related to README content, repository activity, and available SEO data like titles and meta descriptions.
The end goal is to turn these findings into practical insights that can be applied to improve repository discoverability. This includes refining my own repositories as well as sharing useful patterns, the dataset, and the results with other developers.
Table of Contents
1. [Key Questions](#key-questions)
1. [Key Findings](#key-findings)
1. [Data Sources](#data-sources)
1. [Methodology](#methodology)
1. [Visualizations](#visualizations)
1. [Data Dictionary](#data-dictionary)
1. [Project Structure](#project-structure)
1. [Tech Stack](#tech-stack)
1. [Installation](#installation)
1. [Usage](#usage)
1. [Future Improvements](#future-improvements)
1. [AI Usage](#ai-usage)
1. [Acknowledgments](#acknowledgments)
1. [Contributing](#contributing)
1. [License](#license)
## Key Questions
> 🚧 Section under construction (Too many questions?)
- Which factors are associated with a repository appearing in Google search results (SERPs)?
- Which factors are associated with higher rankings within Google SERPs and GitHub Explore?
- How closely do GitHub Explore rankings align with Google search rankings?
- Is there a relationship between README structure (e.g., H1 usage, table of contents, introduction) and ranking?
- Does the presence of a clear, descriptive introduction impact visibility or ranking?
- Do content characteristics (e.g., word count, links, images) correlate with ranking performance?
- Do broken or low-quality links (e.g., `http://localhost`) correlate with lower rankings?
- Which repository features within a developer's control appear most associated with higher rankings?
- How often do repositories use default titles (e.g., username/repository) versus descriptive titles? What causes that difference?
- Is GitHub "About" text reused in Google SEO titles or meta descriptions?
### Specific questions (remove this section later maybe)
1. Does `about_text` get reused in:
   - `seo_title`
   - `meta_desc`
2. Do SERP fields reuse leading substrings of:
   - About text
   - README title
   - Intro paragraphs (I need intro text)
3. Does Google fall back to:
   - username/repo for SEO Title when no usable text exists?
4. Does having a good repo name with `-` as a separator result in a higher rank on average?
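The reuse and separator questions above could be checked with a few small string helpers. This is a minimal sketch, not code from the notebooks: the function names and the 20-character threshold are assumptions chosen for illustration, and the inputs are expected to be the `about_text`, `seo_title`, `meta_desc`, and `user_reponame` values from the dataset.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return re.sub(r"\s+", " ", text or "").strip().lower()


def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading characters two normalized strings have in common."""
    count = 0
    for ch_a, ch_b in zip(normalize(a), normalize(b)):
        if ch_a != ch_b:
            break
        count += 1
    return count


def is_reused(source: str, serp_field: str, min_chars: int = 20) -> bool:
    """True if the start of `source` appears anywhere in `serp_field`.

    `min_chars` is an arbitrary threshold (assumption), not a value from the project.
    """
    src = normalize(source)
    if not src:
        return False
    return src[:min_chars] in normalize(serp_field)


def uses_hyphen_separator(user_reponame: str) -> bool:
    """True if the repo name (the part after 'user/') contains a hyphen."""
    repo_name = user_reponame.split("/", 1)[-1]
    return "-" in repo_name
```

Applied row by row, `is_reused(about_text, seo_title)` would give a 0/1 column that can be compared against rank, and `uses_hyphen_separator` does the same for question 4.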
## Key Findings
Summarize the most important insights discovered during the analysis.
> 🚧 Section under construction
### 🔍 Google SERP Insights
- 🚧 Nothing yet (this sub-section may not be needed)
### 🔍 GitHub Explore Insights
- 🚧 Nothing yet (this sub-section may not be needed)
## Data Sources
The dataset for this project was created manually using a combination of Google search results and GitHub Explore.
Two primary data files were generated:
- `data/all_metrics.csv`
Contains repository-level data, including README structure and content metrics (e.g., headings, word count, links, images), repository metadata (stars, forks, contributors), and SEO-related fields where available.
- `data/search_ranks.csv`
Contains ranking data for each repository across multiple search phrases.
These datasets are joined using the `user_reponame` field to enable combined analysis of repository features and ranking performance (see `data/merged_data.csv`).
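A join like this can be sketched with pandas. The example below uses tiny made-up frames in place of the real CSV files, and it assumes a left join with one metrics row per repository; the notebooks may merge differently.

```python
import pandas as pd

# Toy rows standing in for data/search_ranks.csv (values are made up).
ranks = pd.DataFrame(
    {
        "user_reponame": ["alice/todo-app", "alice/todo-app", "bob/weather-cli"],
        "search_phrase": ["todo app", "react todo", "weather cli"],
        "google_rank": [3, 12, 1],
    }
)

# Toy rows standing in for data/all_metrics.csv (values are made up).
metrics = pd.DataFrame(
    {
        "user_reponame": ["alice/todo-app", "bob/weather-cli"],
        "word_count": [850, 320],
        "stars": [42, 7],
    }
)

# Keep every ranking record and attach the repo-level metrics to it.
# validate="many_to_one" raises if a repo appears twice in the metrics table.
merged = ranks.merge(metrics, on="user_reponame", how="left", validate="many_to_one")
print(merged)
```

With the real files, the two frames would come from `pd.read_csv("data/search_ranks.csv")` and `pd.read_csv("data/all_metrics.csv")`.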
### 🗃️ APIs Used
- GitHub API: Used in `github_api.py` to collect repository metadata. This significantly reduced the need for manual data collection and improved consistency across records. More code could be added to get additional repo and README metrics.
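For readers who have not used the GitHub REST API, here is a rough sketch of how repository metadata can be requested and reduced to a few dataset fields. It is not the contents of `github_api.py`: the function names and the `GITHUB_TOKEN` variable name are assumptions, and only the standard library is used.

```python
import json
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def build_request(user_reponame: str, token: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request for the 'get a repository' endpoint."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        # A personal access token raises the rate limit; it is optional for public repos.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{API_ROOT}/repos/{user_reponame}", headers=headers)


def extract_metrics(repo: dict) -> dict:
    """Map the API's JSON fields onto column names used in the dataset."""
    return {
        "user_reponame": repo["full_name"],
        "stars": repo["stargazers_count"],
        "forks": repo["forks_count"],
        "topics": len(repo.get("topics", [])),
        "about_text": repo.get("description") or "",
    }


def fetch_metrics(user_reponame: str, token: Optional[str] = None) -> dict:
    """Call the API once and return the reduced metrics (needs network access)."""
    with urllib.request.urlopen(build_request(user_reponame, token), timeout=10) as resp:
        return extract_metrics(json.load(resp))


# Example call (hypothetical token variable name):
# import os
# fetch_metrics("kernix13/github-readme-seo-analysis", os.environ.get("GITHUB_TOKEN"))
```

Keeping `extract_metrics` separate from the network call makes it easy to add more fields later and to test the mapping without hitting the API.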
## Methodology
> 🚧 Section under construction
- Data collection: search phrases on Google and GitHub Explore
- Data Processing and Transformation: ???
- Data Analysis: ???
> Repositories without a README file were excluded from content-based analysis where applicable, as key metrics (e.g., word count, structure, and links) could not be derived.
### 🗃️ Data Collection
Data was collected from both Google search results (SERPs) and GitHub Explore using a set of 46 targeted search phrases. A custom script (`github_api.py`) was used to retrieve repository metadata via the GitHub API.
For each search phrase:
- The top 10 results from GitHub Explore were recorded.
- The top results from Google search were collected (cutoff at 50), including variations where the term "github" was appended to the query.
This process resulted in:
- 335 unique repositories
- 455 total ranking records across all search phrases
### 🔧 Data Processing and Transformation
> 🚧 Section under construction
- Processing: organizing / filtering / restructuring, selecting columns, grouping/sorting, "prepare for analysis"
- Transformation: creating new variables/features, aggregations, encoding / scaling, "change the data into new forms"
### 📊 Data Analysis
> 🚧 Section under construction
> Visualize relationships between fields and rank positions to derive insights
The analysis focused on identifying relationships between repository and README features and their ranking positions in both Google and GitHub search results.
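One possible starting point for this kind of analysis is a rank correlation between each numeric feature and the rank column. The sketch below computes Spearman's rho as a Pearson correlation on ranked values; the helper name is hypothetical and the toy numbers are invented purely to show the mechanics, not taken from the dataset.

```python
import pandas as pd


def spearman_by_feature(df: pd.DataFrame, features: list[str], rank_col: str) -> pd.Series:
    """Spearman correlation of each feature with rank_col (Pearson on ranked values)."""
    ranked_target = df[rank_col].rank()
    return pd.Series(
        {name: df[name].rank().corr(ranked_target) for name in features},
        name=f"spearman_vs_{rank_col}",
    )


# Invented toy data: longer READMEs happen to sit at better (lower) positions.
toy = pd.DataFrame(
    {
        "word_count": [1200, 900, 700, 400, 150],
        "stars": [5, 50, 10, 80, 1],
        "google_rank": [1, 2, 3, 4, 5],
    }
)

rho = spearman_by_feature(toy, ["word_count", "stars"], "google_rank")
print(rho)
```

Because position 1 is the best rank, a negative value means the feature tends to be larger for better-ranked repositories.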
## Visualizations
> 🚧 Section under construction
Show key charts or plots.
Include screenshots of graphs from the notebook.
Explain what each chart demonstrates.
## Data Dictionary
Here are all the fields in `merged_data.csv`:
Data Dictionary fields
| Field Name | Data Type | Description |
| :--------- | :-------- | :---------- |
| `user_reponame` | str | The repo: user_name/repo_name |
| `search_phrase` | str | The search phrase used |
| `explore_rank` | int64 | Position in GitHub Explore results |
| `google_rank` | int64 | Position in Google SERPs |
| `source` | str | One of: Google SERPs; Google SERPs with "github" appended to search phrase; GitHub Explore results |
| `1st_el` | str | 1st text element in README |
| `2nd_el` | str | 2nd text element in README |
| `3rd_el` | str | 3rd text element in README |
| `h1_ct` | int64 | # of H1 elements |
| `h2_ct` | int64 | # of H2 elements |
| `h3_ct` | int64 | # of H3 elements |
| `toc` | int64 | 0 = No table of contents; 1 = Table of contents present |
| `images` | int64 | # of images in README |
| `alt_text_ct` | int64 | # of images with alt text |
| `code_blocks` | int64 | # of code blocks in README |
| `internal_links` | int64 | # of links to repo files |
| `external_links` | int64 | # of links to external sites or repos |
| `live_link` | int64 | 0 = No link to live deploy; 1 = Link to live deploy in sidebar |
| `watchers` | int64 | # of repo watchers |
| `contributors` | int64 | # of repo contributors |
| `rank` | int64 | My opinion on the quality of the README: 1 = Bad; 2 = Good/okay; 3 = Very Good |
| `type` | str | My main classification of the repo |
| `type2` | str | My sub-class for the repo |
| `word_count` | int64 | README word count |
| `forks` | int64 | # of forks for repo |
| `stars` | int64 | # of stars for repo |
| `topics` | int64 | # of topics in sidebar |
| `about_text` | str | The About (description) text for the repo |
| `seo_title` | str | The title from the Google SERPs |
| `meta_desc` | str | The description from the Google SERPs |
| `title_text` | str | The About text from the repo sidebar |
| `intro_len` | int64 | The length of the intro text if good_intro = 1 |
| `good_intro` | int64 | My judgement based on the text elements, and the quality & length of the text at the top of the repo: 0 = No; 1 = Yes |
| `primary_lang` | str | Language used in search phrase |
| `yr` | int64 | # of years since last update |
| `mo` | int64 | # of months since last update |
| `wk` | int64 | # of weeks since last update |
| `has_blog` | int64 | 0 = Repo owner has no blog/website; 1 = Repo owner has blog/website; 2 = Repo owner has posts on Hashnode, Medium, YouTube, etc. |
## Project Structure
> Current structure as of 4-19-2026
```text
github-readme-seo-analysis/
│
├── .github/ # Issue & PR templates
│
├── data/ # All datasets used in analysis
│ ├── all_metrics.csv # Repo & README metrics
│ ├── merged_data.csv # The 2 csv files merged
│ └── search_ranks.csv # Google and GitHub Explore ranks + search phrases
│
├── notebooks/ # Jupyter notebooks for analysis
│ ├── 01-eda_overview.ipynb
│ ├── 02-google_rank.ipynb
│ └── 03-explore_rank.ipynb
│
├── src/ # Python scripts (data collection, processing)
│ └── github_api.py
│
├── venv/ # ???
│
├── visuals/ # Charts/images for README (optional but recommended)
│
├── .env # API keys
├── .env.example
├── .gitignore
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── LICENSE # Add later
├── README.md # Project overview (SEO target)
└── requirements.txt
```
## Tech Stack
| Tool | Version |
| :--------------------------------------- | :------ |
| [Python](https://www.python.org/) | 3.14.0 |
| [Jupyter Notebook](https://jupyter.org/) | 7.4.5 |
| [Pandas](https://pandas.pydata.org/) | 3.0.1 |
| [Matplotlib](https://matplotlib.org/) | 3.10.6 |
| [Seaborn](https://seaborn.pydata.org/) | 0.13.2 |
| [NumPy](https://numpy.org/) | 2.4.3 |
## Installation
Follow these steps to set up the project locally.
1. Clone the repository:
```bash
git clone https://github.com/Kernix13/github-readme-seo-analysis
cd github-readme-seo-analysis
```
2. Create a Virtual Environment
```bash
# Linux/Mac Command
python3 -m venv venv
# GitBash Command (Windows)
python -m venv venv
```
3. Activate the virtual environment
```bash
# Linux/Mac Command
source venv/bin/activate
# GitBash Command (Windows)
source venv/Scripts/activate
```
4. Install dependencies
```bash
pip install -r requirements.txt
# register kernel (one-time)
python -m ipykernel install --user --name=venv --display-name "Python (venv)"
```
### ⚡ Quick Start (Windows)
```sh
git clone https://github.com/Kernix13/github-readme-seo-analysis.git
cd github-readme-seo-analysis
python -m venv venv
source venv/Scripts/activate
pip install -r requirements.txt
```
### ⚡ Quick Start (Linux / macOS)
```sh
git clone https://github.com/Kernix13/github-readme-seo-analysis.git
cd github-readme-seo-analysis
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```
## Usage
1. Start Jupyter:
```sh
# recommended:
jupyter lab
# or, if you prefer the classic interface:
jupyter notebook
```
2. Open the `.ipynb` notebook files in the browser interface and run the cells
3. Deactivate the virtual environment when done
```sh
deactivate
```
**Note**: If you are using Anaconda or another environment manager, you can open the notebook using your preferred tool (e.g., Anaconda Navigator or jupyter lab) after installing the required dependencies.
**Note to self**: Running `jupyter notebook` does not work for me. To get `jupyter lab` to run, I first have to register the kernel with `python -m ipykernel install --user --name=venv --display-name "Python (venv)"`. Open question: how do I stop the kernel from the browser, or do I just run `deactivate`?
## Future Improvements
> 🚧 Section under construction
- The dataset needs more repos/examples, and the project needs contributors (only 337 repos so far)
- Maybe a related Web Dev project that converts a README to HTML, then runs an SEO analysis and an accessibility check, and outputs what is missing along with suggestions. Alternatively, run the HTML through Lighthouse
## AI Usage
> 🚧 Section under construction
I am in the early stages of learning Python, so I used ChatGPT to write the code in `src/github_api.py` to speed up the process of collecting metrics for the repos. The script takes a list of username/reponame entries and returns these values for each one:
- README word count
- Number of repo forks
- Number of repo stars
- Number of repo topics
- About text
There are other fields I may be able to get but for now I get the rest of the metrics by going to the repo.
Repo-level metrics I should also get using the GitHub API are:
- Whether there is a live link in the sidebar or not
- The number of watchers
- The primary language IF it is part of the search query
- The number of years, months, weeks, or days since the last update
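The repo-level items above map fairly directly onto fields in the GitHub API's repository response (`homepage`, `subscribers_count`, `language`, `pushed_at`). Here is a hedged sketch of that mapping. The function name is hypothetical, and it assumes that "last update" means the last push, that `yr`, `mo`, and `wk` are whole elapsed units, and that a language counts only when it appears as a word in the search phrase.

```python
from datetime import datetime, timezone


def extra_repo_metrics(repo: dict, search_phrase: str, now: datetime) -> dict:
    """Derive extra dataset columns from one repository API response."""
    # pushed_at is an ISO 8601 UTC timestamp such as "2026-05-11T17:13:21Z".
    pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    days = (now - pushed).days

    language = repo.get("language") or ""
    in_phrase = language.lower() in search_phrase.lower().split()

    return {
        # The sidebar link is exposed as "homepage"; empty or null means no link.
        "live_link": int(bool(repo.get("homepage"))),
        # "subscribers_count" is the watcher total; "watchers_count" mirrors stars.
        "watchers": repo.get("subscribers_count", 0),
        "primary_lang": language if in_phrase else "",
        "days": days,
        "wk": days // 7,
        "mo": days // 30,   # approximate month length (assumption)
        "yr": days // 365,  # ignores leap years (assumption)
    }


# Example with a hand-written response fragment (made-up values):
sample = {
    "pushed_at": "2026-05-11T17:13:21Z",
    "homepage": "",
    "subscribers_count": 0,
    "language": "Python",
}
result = extra_repo_metrics(
    sample,
    "python seo analysis",
    now=datetime(2026, 6, 11, 17, 13, 21, tzinfo=timezone.utc),
)
print(result)
```

Multi-word languages such as "Jupyter Notebook" would need a different matching rule than the single-word check used here.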
README-level metrics I should also get using the GitHub API are:
- The "title" text (some READMEs do not have an H1 or H2 as the 1st heading)
- The number of internal links
- The number of external links
- The number of images (both `![]()` and `<img>`)
- The number of images with alt text
- The count of H1, H2, and H3 elements (both `#` and `<h1>`)
- Whether there is a Table of Contents
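Most of the README-level counts in this list can be approximated with regular expressions once the README text has been downloaded. The sketch below is one rough way to do it, not the project's method: the function names are hypothetical, the table-of-contents test is a heuristic (a "table of contents" label or at least three `#anchor` links), and links to the same repo written as full URLs are counted as external.

```python
import re

MD_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")       # ![alt](url)
MD_LINK = re.compile(r"(?<!!)\[[^\]]*\]\(([^)\s]+)[^)]*\)")     # [text](url), not images
HTML_IMG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
HTML_ALT = re.compile(r"\balt\s*=\s*[\"']\s*[^\"'\s]", re.IGNORECASE)  # non-empty alt


def split_code_blocks(markdown: str) -> tuple[str, int]:
    """Return the text outside fenced code blocks and the number of fenced blocks."""
    prose, fences, in_fence = [], 0, False
    for line in markdown.splitlines():
        if line.lstrip().startswith(("`" * 3, "~" * 3)):
            fences += 1
            in_fence = not in_fence
        elif not in_fence:
            prose.append(line)
    return "\n".join(prose), fences // 2


def heading_count(text: str, level: int) -> int:
    """Count Markdown (#) and HTML (<hN>) headings of one level."""
    md = re.findall(rf"^#{{{level}}}[ \t]+\S", text, re.MULTILINE)
    html = re.findall(rf"<h{level}[\s>]", text, re.IGNORECASE)
    return len(md) + len(html)


def readme_metrics(markdown: str) -> dict:
    """Approximate README metrics; code blocks are removed before counting."""
    text, code_blocks = split_code_blocks(markdown)
    md_images = MD_IMAGE.findall(text)          # list of (alt, url) pairs
    html_images = HTML_IMG.findall(text)
    targets = MD_LINK.findall(text)
    anchors = [t for t in targets if t.startswith("#")]
    external = [t for t in targets if re.match(r"(https?:)?//", t)]
    has_toc_label = re.search(r"table of contents", text, re.IGNORECASE) is not None
    with_alt = sum(1 for alt, _ in md_images if alt.strip())
    with_alt += sum(1 for tag in html_images if HTML_ALT.search(tag))
    return {
        "h1_ct": heading_count(text, 1),
        "h2_ct": heading_count(text, 2),
        "h3_ct": heading_count(text, 3),
        "toc": int(has_toc_label or len(anchors) >= 3),
        "images": len(md_images) + len(html_images),
        "alt_text_ct": with_alt,
        "code_blocks": code_blocks,
        "internal_links": len(targets) - len(anchors) - len(external),
        "external_links": len(external),
    }
```

Stripping fenced code first matters because a shell comment such as `# install` inside a code block would otherwise be counted as an H1. Linked badges (an image wrapped in a link) are only partly handled and would need a closer look.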
I also need the first elements that are text elements, ideally H1 followed by a paragraph followed by an H2, ignoring images. It would be hard to program that since I have seen other elements at the top of the repo, plus there are other issues. I am doing all of that manually.
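As a starting point for automating this, the following sketch pulls the first few text elements from a README while skipping blank lines, images, badges, and bare HTML wrappers. It is only a rough heuristic under stated assumptions: the function names are hypothetical, consecutive text lines are merged into one paragraph, and code blocks, blockquotes, and lists at the top of a README are not treated specially.

```python
import re

BADGE = re.compile(r"\[!\[[^\]]*\]\([^)]*\)\]\([^)]*\)")   # [![alt](img)](link)
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")                # ![alt](img)
TAG = re.compile(r"<[^>]+>")                               # any HTML tag
MD_HEADING = re.compile(r"^(#{1,6})[ \t]+(.*\S)")
HTML_HEADING = re.compile(r"^\s*<h([1-6])\b", re.IGNORECASE)


def visible_text(line: str) -> str:
    """Remove badges, images, and HTML tags, leaving only readable text."""
    for pattern in (BADGE, IMAGE, TAG):
        line = pattern.sub("", line)
    return line.strip()


def first_text_elements(markdown: str, n: int = 3) -> list[tuple[str, str]]:
    """Return up to n (kind, text) pairs, where kind is 'h1'..'h6' or 'p'."""
    elements: list[tuple[str, str]] = []
    paragraph: list[str] = []

    def flush() -> None:
        # Close the paragraph that is currently being collected, if any.
        if paragraph:
            elements.append(("p", " ".join(paragraph)))
            paragraph.clear()

    for raw in markdown.splitlines():
        text = visible_text(raw)
        md = MD_HEADING.match(text)
        html = HTML_HEADING.match(raw)
        if md:
            flush()
            elements.append((f"h{len(md.group(1))}", md.group(2)))
        elif html and text:
            flush()
            elements.append((f"h{html.group(1)}", text))
        elif text:
            paragraph.append(text)
        else:
            flush()  # blank or image-only line ends the paragraph
        if len(elements) >= n and not paragraph:
            break

    flush()
    return elements[:n]
```

The result maps onto `1st_el`, `2nd_el`, and `3rd_el`, and an H1, paragraph, H2 sequence would show up as `h1`, `p`, `h2`. Unusual openings would still need a manual check.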
I am also counting the number of code blocks which may or may not be useful. There are other metrics I am collecting that are subjective and would be difficult to add to a function.
## Acknowledgments
- [5 tips for making your GitHub profile page accessible](https://github.blog/developer-skills/github/5-tips-for-making-your-github-profile-page-accessible/): The article that got me thinking about repo SEO
- [Awesome SEO tools](https://github.com/serpapi/awesome-seo-tools): decent list of tools
- [GitHub Search Engine Optimization (SEO): how to rank your repository in GitHub search](https://www.markepear.dev/blog/github-search-engine-optimization): Good article on specifics for GitHub Explore rank
- [GitHub SEO: Rank your repo and get adoption in 2026](https://nakora.ai/blog/github-seo): excellent tips
- [GitHub Pages SEO Analyzer](https://www.jekyllpad.com/tools/github-pages-seo-analyzer): Enter your GitHub page URL to get a report
## Contributing
Contributions are welcome! If you'd like to help improve this project, please read our [contribution guidelines](./CONTRIBUTING.md) on how to get started, our workflow, and code style expectations.
## License
This project is licensed under the MIT License (coming soon).