https://github.com/kernix13/github-readme-seo-analysis

A Jupyter Notebook GitHub README and Repo SEO Analysis to determine what makes a repo rank in the SERPS
accessibility data-analysis readme seo seo-analysis

Last synced: about 2 months ago

Host: GitHub
URL: https://github.com/kernix13/github-readme-seo-analysis
Owner: Kernix13
Created: 2026-04-19T17:23:33.000Z (3 months ago)
Default Branch: main
Last Pushed: 2026-05-11T17:13:21.000Z (2 months ago)
Last Synced: 2026-05-11T19:15:47.622Z (2 months ago)
Topics: accessibility, data-analysis, readme, seo, seo-analysis
Language: Jupyter Notebook
Homepage:
Size: 354 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Code of conduct: CODE_OF_CONDUCT.md

# GitHub README SEO: Data Analysis of What Makes Repos Rank

This project performs a GitHub README SEO Analysis using Jupyter Notebook and data from GitHub Explore & Google to determine the metrics needed to rank in the SERPs.

> [!NOTE]
> I am new to Data Analysis and Jupyter Notebook. I am in the early stages of this analysis and it will take me a long time to finish unless I get help.

## Overview

The goal of this project is to understand why certain GitHub repositories rank in Google & GitHub Explore search results while others do not.

I collected a dataset of repositories using 46 search phrases and recorded both Google rankings and GitHub Explore rankings. For each repository, I also gathered metrics related to README content, repository activity, and available SEO data like titles and meta descriptions.

The end goal is to turn these findings into practical insights that can be applied to improve repository discoverability. This includes refining my own repositories as well as sharing useful patterns, the dataset, and the results with other developers.

Table of Contents

1. [Key Questions](#key-questions)
1. [Key Findings](#key-findings)
1. [Data Sources](#data-sources)
1. [Methodology](#methodology)
1. [Visualizations](#visualizations)
1. [Data Dictionary](#data-dictionary)
1. [Project Structure](#project-structure)
1. [Tech Stack](#tech-stack)
1. [Installation](#installation)
1. [Usage](#usage)
1. [Future Improvements](#future-improvements)
1. [AI Usage](#ai-usage)
1. [Acknowledgments](#acknowledgments)
1. [Contributing](#contributing)
1. [License](#license)

## Key Questions

> 🚧 Section under construction (Too many questions?)

- Which factors are associated with a repository appearing in Google search results (SERPs)?
- Which factors are associated with higher rankings within Google SERPs and GitHub Explore?
- How closely do GitHub Explore rankings align with Google search rankings?
- Is there a relationship between README structure (e.g., H1 usage, table of contents, introduction) and ranking?
- Does the presence of a clear, descriptive introduction impact visibility or ranking?
- Do content characteristics (e.g., word count, links, images) correlate with ranking performance?
- Do broken or low-quality links (e.g., `http://localhost`) correlate with lower rankings?
- Which repository features within a developer's control appear most associated with higher rankings?
- How often do repositories use default titles (e.g., username/repository) versus descriptive titles? What causes that difference?
- Is GitHub "About" text reused in Google SEO titles or meta descriptions?

### Specific questions (remove this section later maybe)

1. Does `about_text` get reused in:
   - `seo_title`
   - `meta_desc`
2. Do SERP fields reuse leading substrings of:
   - About text
   - README title
   - Intro paragraphs (I need intro text)
3. Does Google fall back to `username/repo` for the SEO title when no usable text exists?
4. Does having a good repo name with `-` as a separator result in a higher rank on average?
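
One possible way to test questions 1 and 2 is sketched below. It is only a rough heuristic, not the project's actual method: the helper names (`normalize`, `shared_prefix_words`, `is_reused`) and the `min_words` threshold are assumptions made for illustration. Text counts as reused if it appears whole inside the SERP field, or if the two share a long run of leading words (which also catches truncated descriptions).

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so comparisons are lenient."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def shared_prefix_words(source: str, serp_field: str) -> int:
    """Count how many leading words of `source` also lead `serp_field`."""
    count = 0
    for a, b in zip(normalize(source).split(), normalize(serp_field).split()):
        if a != b:
            break
        count += 1
    return count


def is_reused(source: str, serp_field: str, min_words: int = 4) -> bool:
    """True if `source` appears whole in `serp_field` or shares a long prefix."""
    src, serp = normalize(source), normalize(serp_field)
    if not src or not serp:
        return False
    # Pad with spaces so only whole-word matches count.
    if f" {src} " in f" {serp} ":
        return True
    return shared_prefix_words(source, serp_field) >= min_words


about = "A Jupyter Notebook GitHub README and Repo SEO Analysis"
title = (
    "GitHub - kernix13/github-readme-seo-analysis: "
    "A Jupyter Notebook GitHub README and Repo SEO Analysis"
)
print(is_reused(about, title))
```

Applied to each row of the merged data, this would give a 0/1 column per comparison (for example `about_text` vs. `seo_title`) that can be summarized like any other metric.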

⇡ Back to Top

## Key Findings

Summarize the most important insights discovered during the analysis.

> 🚧 Section under construction

### 🔍 Google SERP Insights

- 🚧 Nothing yet (this sub-section may not be needed)

### 🔍 GitHub Explore Insights

- 🚧 Nothing yet (this sub-section may not be needed)

⇡ Back to Top

## Data Sources

The dataset for this project was created manually using a combination of Google search results and GitHub Explore.

Two primary data files were generated:

- `data/all_metrics.csv`: Contains repository-level data, including README structure and content metrics (e.g., headings, word count, links, images), repository metadata (stars, forks, contributors), and SEO-related fields where available.
- `data/search_ranks.csv`: Contains ranking data for each repository across multiple search phrases.

These datasets are joined using the `user_reponame` field to enable combined analysis of repository features and ranking performance (see `merged_data.csv`).
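
The join can be sketched with pandas as follows. This is a minimal illustration, not the code used in the notebooks: the sample rows are made up, and only the column names come from the data dictionary. Because one repository can rank for several search phrases, the rank table has more rows than the metrics table, so the sketch treats it as a many-to-one join.

```python
import io

import pandas as pd

# Tiny illustrative stand-ins for data/all_metrics.csv and data/search_ranks.csv.
# The rows are invented; only the column names come from the data dictionary.
all_metrics_csv = io.StringIO(
    "user_reponame,word_count,stars\n"
    "octo/alpha,850,120\n"
    "octo/beta,300,15\n"
)
search_ranks_csv = io.StringIO(
    "user_reponame,search_phrase,google_rank\n"
    "octo/alpha,python cli tool,3\n"
    "octo/alpha,cli tool github,7\n"
    "octo/beta,python cli tool,12\n"
)

metrics = pd.read_csv(all_metrics_csv)
ranks = pd.read_csv(search_ranks_csv)

# Many rank rows map to exactly one metrics row. `validate` raises an error
# if the metrics table unexpectedly contains duplicate repositories.
merged = ranks.merge(metrics, on="user_reponame", how="left", validate="many_to_one")

print(merged)
```

A left join from the rank table keeps every ranking record, even if a repository is missing from the metrics file; those rows simply get empty metric columns.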

### 🗃️ APIs Used

- GitHub API: Used in `github_api.py` to collect repository metadata. This significantly reduced the need for manual data collection and improved consistency across records. More code could be added to get additional repo and README metrics.

⇡ Back to Top

## Methodology

> 🚧 Section under construction

- Data collection: search phrases on Google and GitHub Explore
- Data Processing and Transformation: ???
- Data Analysis: ???

> Repositories without a README file were excluded from content-based analysis where applicable, as key metrics (e.g., word count, structure, and links) could not be derived.

### 🗃️ Data Collection

Data was collected from both Google search results (SERPs) and GitHub Explore using a set of 46 targeted search phrases. A custom script (`github_api.py`) was used to retrieve repository metadata via the GitHub API.

For each search phrase:

- The top 10 results from GitHub Explore were recorded.
- The top results from Google search were collected (cutoff at 50), including variations where the term "github" was appended to the query.

This process resulted in:

- 335 unique repositories
- 455 total ranking records across all search phrases

### 🔧 Data Processing and Transformation

> 🚧 Section under construction

- Processing: organizing / filtering / restructuring, selecting columns, grouping/sorting, "prepare for analysis"
- Transformation: creating new variables/features, aggregations, encoding / scaling, "change the data into new forms"

### 📊 Data Analysis

> 🚧 Section under construction

> Visualize relationships between fields and rank positions to derive insights

The analysis focused on identifying relationships between repository and README features and their ranking positions in both Google and GitHub search results.
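
Since rank positions are ordinal, a rank-based (Spearman) correlation is one reasonable starting point. The sketch below uses invented numbers and is an assumption about how the analysis could be done, not a result from this dataset.

```python
import pandas as pd

# Made-up rows for illustration; column names follow the data dictionary.
df = pd.DataFrame(
    {
        "google_rank": [1, 2, 3, 4, 5, 6],
        "word_count": [2400, 1800, 1500, 900, 700, 200],
        "stars": [50, 900, 10, 300, 5, 120],
    }
)

# Spearman correlation is Pearson correlation computed on ranks, so ranking
# every column first and then calling .corr() gives Spearman coefficients.
spearman = df.rank().corr()

# Rank 1 is the best position, so a negative coefficient means
# "a larger feature value goes with a better (smaller) rank number".
print(spearman["google_rank"].drop("google_rank").sort_values())
```

A coefficient only describes association within the sample. With a few hundred repositories collected by hand, it is best read as a hint about which features to look at more closely.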

⇡ Back to Top

## Visualizations

> 🚧 Section under construction

Show key charts or plots.

Include screenshots of graphs from the notebook.

Explain what each chart demonstrates.

⇡ Back to Top

## Data Dictionary

Here are all the fields in `merged_data.csv`:

Data Dictionary fields

| Field Name | Data Type | Description |
| :--- | :--- | :--- |
| `user_reponame` | str | The repo: `user_name/repo_name` |
| `search_phrase` | str | The search phrase used |
| `explore_rank` | int64 | Position in GitHub Explore results |
| `google_rank` | int64 | Position in Google SERPs |
| `source` | str | Google SERPs<br>Google SERPs with "github" appended to search phrase<br>GitHub Explore results |
| `1st_el` | str | 1st text element in README |
| `2nd_el` | str | 2nd text element in README |
| `3rd_el` | str | 3rd text element in README |
| `h1_ct` | int64 | # of H1 elements |
| `h2_ct` | int64 | # of H2 elements |
| `h3_ct` | int64 | # of H3 elements |
| `toc` | int64 | 0 = No table of contents<br>1 = Table of contents present |
| `images` | int64 | # of images in README |
| `alt_text_ct` | int64 | # of images with alt text |
| `code_blocks` | int64 | # of code blocks in README |
| `internal_links` | int64 | # of links to repo files |
| `external_links` | int64 | # of links to external sites or repos |
| `live_link` | int64 | 0 = No link to live deploy<br>1 = Link to live deploy in sidebar |
| `watchers` | int64 | # of repo watchers |
| `contributors` | int64 | # of repo contributors |
| `rank` | int64 | My opinion on the quality of the README<br>1 = Bad<br>2 = Good/okay<br>3 = Very Good |
| `type` | str | My main classification of the repo |
| `type2` | str | My sub-class for the repo |
| `word_count` | int64 | README word count |
| `forks` | int64 | # of forks for repo |
| `stars` | int64 | # of stars for repo |
| `topics` | int64 | # of topics in sidebar |
| `about_text` | str | The About (description) text for the repo |
| `seo_title` | str | The title from the Google SERPs |
| `meta_desc` | str | The description from the Google SERPs |
| `title_text` | str | The About text from the repo sidebar |
| `intro_len` | int64 | The length of the intro text if good_intro = 1 |
| `good_intro` | int64 | My judgement based on the text elements, and the quality & length of the text at the top of the repo<br>0 = No<br>1 = Yes |
| `primary_lang` | str | Language used in search phrase |
| `yr` | int64 | # of years since last update |
| `mo` | int64 | # of months since last update |
| `wk` | int64 | # of weeks since last update |
| `has_blog` | int64 | 0 = Repo owner has no blog/website<br>1 = Repo owner has blog/website<br>2 = Repo owner has posts on Hashnode, Medium, YouTube, etc. |

⇡ Back to Top

## Project Structure

> Current structure as of 4-19-2026

```text
github-readme-seo-analysis/
│
├── .github/ # Issue & PR templates
│
├── data/ # All datasets used in analysis
│ ├── all_metrics.csv # Repo & README metrics
│ ├── merged_data.csv # The 2 csv files merged
│ └── search_ranks.csv # Google and GitHub Explore ranks + search phrases
│
├── notebooks/ # Jupyter notebooks for analysis
│ ├── 01-eda_overview.ipynb
│ ├── 02-google_rank.ipynb
│ └── 03-explore_rank.ipynb
│
├── src/ # Python scripts (data collection, processing)
│ └── github_api.py
│
├── venv/ # Local virtual environment (created during installation)
│
├── visuals/ # Charts/images for README (optional but recommended)
│
├── .env # API keys
├── .env.example
├── .gitignore
├── CONTRIBUTING.md
├── CODE_OF_CONDUCT.md
├── LICENSE # Add later
├── README.md # Project overview (SEO target)
└── requirements.txt
```

⇡ Back to Top

## Tech Stack

| Tool | Version |
| :--------------------------------------- | :------ |
| [Python](https://www.python.org/) | 3.14.0 |
| [Jupyter Notebook](https://jupyter.org/) | 7.4.5 |
| [Pandas](https://pandas.pydata.org/) | 3.0.1 |
| [Matplotlib](https://matplotlib.org/) | 3.10.6 |
| [Seaborn](https://seaborn.pydata.org/) | 0.13.2 |
| [NumPy](https://numpy.org/) | 2.4.3 |

⇡ Back to Top

## Installation

Follow these steps to set up the project locally.

1. Clone the repository:

```bash
git clone https://github.com/Kernix13/github-readme-seo-analysis
cd github-readme-seo-analysis
```

2. Create a Virtual Environment

```bash
# Linux/Mac Command
python3 -m venv venv

# GitBash Command (Windows)
python -m venv venv
```

3. Activate the virtual environment

```bash
# Linux/Mac Command
source venv/bin/activate

# GitBash Command (Windows)
source venv/Scripts/activate
```

4. Install dependencies

```bash
pip install -r requirements.txt

# register kernel (one-time)
python -m ipykernel install --user --name=venv --display-name "Python (venv)"
```

### ⚡ Quick Start (Windows)

```sh
git clone https://github.com/Kernix13/github-readme-seo-analysis.git
cd github-readme-seo-analysis
python -m venv venv
source venv/Scripts/activate
pip install -r requirements.txt
```

### ⚡ Quick Start (Linux / macOS)

```sh
git clone https://github.com/Kernix13/github-readme-seo-analysis.git
cd github-readme-seo-analysis
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

⇡ Back to Top

## Usage

1. Start Jupyter:

```sh
# recommended:
jupyter lab
# or, if you prefer the classic interface:
jupyter notebook
```

2. Open the `.ipynb` notebook files in the browser interface and run the cells
3. Deactivate the virtual environment when done

```sh
deactivate
```

Note: If you are using Anaconda or another environment manager, you can open the notebook using your preferred tool (e.g., Anaconda Navigator or jupyter lab) after installing the required dependencies.

Running `jupyter notebook` does not work for me. To get `jupyter lab` to run, I first have to run `python -m ipykernel install --user --name=venv --display-name "Python (venv)"` (ChatGPT sucks!). Open question: how do I stop the kernel from the browser, or do I just run `deactivate`?

⇡ Back to Top

## Future Improvements

> 🚧 Section under construction

- The dataset needs more repos/examples, and the project needs contributors (only 337 repos so far)
- Possibly build a related web dev project that converts a README to HTML, then runs an SEO analysis and an accessibility check and outputs what is missing and/or suggestions. Alternatively, run the HTML through Lighthouse

⇡ Back to Top

## AI Usage

> 🚧 Section under construction

I am in the early stages of learning Python, so I used ChatGPT to write the code in `src/github_api.py` to speed up the process of collecting metrics for the repos. There is a list where you enter the username/reponame, and the returned values are:

- README word count
- Number of repo forks
- Number of repo stars
- Number of repo topics
- About text

There are other fields I may be able to get but for now I get the rest of the metrics by going to the repo.
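
For readers who want to see the shape of such a script, here is a minimal sketch using only the standard library. It is not the contents of `src/github_api.py`: the function names and the `GITHUB_TOKEN` variable name are assumptions, and the word count is a plain whitespace split of the raw Markdown. The two endpoints used are the GitHub REST API's repository endpoint and its README endpoint, which returns the file base64-encoded.

```python
import base64
import json
import os
import urllib.request

API_ROOT = "https://api.github.com"


def github_get(path: str) -> dict:
    """GET a GitHub REST API path and decode the JSON body."""
    headers = {"Accept": "application/vnd.github+json"}
    token = os.environ.get("GITHUB_TOKEN")  # assumed variable name
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(f"{API_ROOT}{path}", headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def basic_metrics(repo: dict, readme: dict) -> dict:
    """Pick the five fields listed above out of the two API payloads."""
    # The README endpoint returns the file contents base64-encoded.
    text = base64.b64decode(readme["content"]).decode("utf-8", errors="replace")
    return {
        "word_count": len(text.split()),  # raw Markdown tokens, not rendered words
        "forks": repo["forks_count"],
        "stars": repo["stargazers_count"],
        "topics": len(repo.get("topics", [])),
        "about_text": repo.get("description") or "",
    }


def collect(user_reponame: str) -> dict:
    """Fetch both payloads for one `user_name/repo_name` string."""
    repo = github_get(f"/repos/{user_reponame}")
    readme = github_get(f"/repos/{user_reponame}/readme")
    return basic_metrics(repo, readme)
```

Calling `collect("Kernix13/github-readme-seo-analysis")` would make two API requests. Unauthenticated requests are rate-limited much more tightly than authenticated ones, so a token helps when looping over a few hundred repositories.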

Repo-level metrics I should also get using the GitHub API are:

- Whether there is a live link in the sidebar or not
- The number of watchers
- The primary language IF it is part of the search query
- The number of years, months, weeks, or days since the last update
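
All four of these can be derived from the same repository payload (`GET /repos/{owner}/{repo}`). The sketch below is one possible reading of the fields, with several assumptions: `extra_repo_metrics` is a hypothetical helper, "last update" is taken to mean `pushed_at`, and `yr`, `mo`, and `wk` are each computed as whole units elapsed rather than as parts of one duration.

```python
from datetime import datetime, timezone


def extra_repo_metrics(repo: dict, search_phrase: str, now: datetime) -> dict:
    """Derive sidebar and activity fields from a repository API payload."""
    language = (repo.get("language") or "").strip()
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    pushed = datetime.fromisoformat(repo["pushed_at"].replace("Z", "+00:00"))
    days = (now - pushed).days
    # Whole-word match only; multi-word languages such as "Jupyter Notebook"
    # would need a different check.
    in_phrase = bool(language) and language.lower() in search_phrase.lower().split()
    return {
        # `homepage` holds the sidebar website link, if the owner set one.
        "live_link": int(bool(repo.get("homepage"))),
        # `subscribers_count` is the watcher count; `watchers_count` mirrors stars.
        "watchers": repo.get("subscribers_count", 0),
        "primary_lang": language if in_phrase else "",
        # Approximate month and year lengths (30 and 365 days).
        "yr": days // 365,
        "mo": days // 30,
        "wk": days // 7,
    }


example = {
    "homepage": "https://example.com",
    "subscribers_count": 4,
    "language": "Python",
    "pushed_at": "2026-01-01T00:00:00Z",
}
print(extra_repo_metrics(example, "python readme generator", datetime(2026, 3, 2, tzinfo=timezone.utc)))
```

Passing `now` in explicitly keeps the function repeatable, which matters if the dataset is refreshed over several days.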

README-level metrics I should also get using the GitHub API are:

- The "title" text (some READMEs do not have an H1 or H2 as the 1st heading)
- The number of internal links
- The number of external links
- The number of images (both `![]()` and `<img>`)
- The number of images with alt text
- The count of H1, H2, and H3 elements (both `#` and HTML heading tags such as `<h1>`)
- Whether or not there is a Table of Contents
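
Most of these counts can be approximated with regular expressions over the raw README text. The sketch below is a rough heuristic, not a Markdown parser, and every rule in it is an assumption: fenced code blocks are skipped so that `# comment` lines are not counted as headings, links starting with `http(s)://` count as external, links starting with `#` are treated as anchors, everything else counts as internal, and a table of contents is guessed from the phrase "table of contents" or three or more anchor links. HTML `<a>` links and underlined (setext) headings are not handled.

```python
import re

MD_IMAGE = re.compile(r"!\[([^\]]*)\]\([^)]*\)")
HTML_IMAGE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
HTML_ALT = re.compile(r"""\balt\s*=\s*["']\s*[^"'\s]""", re.IGNORECASE)
MD_LINK = re.compile(r"\[[^\]]*\]\(\s*([^)\s]+)[^)]*\)")
MD_HEADING = re.compile(r"^ {0,3}(#{1,6})[ \t]+\S", re.MULTILINE)
HTML_HEADING = re.compile(r"<h([1-6])\b", re.IGNORECASE)


def readme_metrics(markdown: str) -> dict:
    """Count structural README features, skipping fenced code blocks."""
    prose_lines, code_blocks, in_fence = [], 0, False
    for line in markdown.splitlines():
        if line.lstrip().startswith(("```", "~~~")):
            if not in_fence:
                code_blocks += 1  # count each opening fence once
            in_fence = not in_fence
            continue
        if not in_fence:
            prose_lines.append(line)
    prose = "\n".join(prose_lines)

    md_alts = MD_IMAGE.findall(prose)  # one alt-text string per Markdown image
    html_images = HTML_IMAGE.findall(prose)
    levels = [len(hashes) for hashes in MD_HEADING.findall(prose)]
    levels += [int(level) for level in HTML_HEADING.findall(prose)]

    # Drop image syntax first so `[![badge](img)](url)` counts as one link to `url`.
    targets = MD_LINK.findall(MD_IMAGE.sub("", prose))
    anchors = [t for t in targets if t.startswith("#")]
    external = [t for t in targets if re.match(r"(?i)https?://", t)]

    return {
        "h1_ct": levels.count(1),
        "h2_ct": levels.count(2),
        "h3_ct": levels.count(3),
        "images": len(md_alts) + len(html_images),
        "alt_text_ct": sum(1 for alt in md_alts if alt.strip())
        + sum(1 for tag in html_images if HTML_ALT.search(tag)),
        "code_blocks": code_blocks,
        "internal_links": len(targets) - len(anchors) - len(external),
        "external_links": len(external),
        "toc": int(len(anchors) >= 3 or "table of contents" in prose.lower()),
    }
```

The returned keys match the field names in the data dictionary, so the output for each repository could be appended to `all_metrics.csv` and spot-checked against the manually collected values.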

I also need the first elements that are text elements, ideally H1 followed by a paragraph followed by an H2, ignoring images. It would be hard to program that since I have seen other elements at the top of the repo, plus there are other issues. I am doing all of that manually.
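
A partial automation is still possible as a first pass before manual review. The sketch below is a simple line-based heuristic under stated assumptions: `first_text_elements` is a hypothetical helper, images, badges, and HTML tags are stripped before looking for text, and only `#`-style headings are recognized (an HTML `<h1>` would come out as a paragraph). It will misread READMEs that start with code blocks, tables, or lists.

```python
import re

IMAGE_OR_TAG = re.compile(r"!\[[^\]]*\]\([^)]*\)|<[^>]+>")
LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")
HEADING = re.compile(r"(#{1,6})\s+(.*)")


def first_text_elements(markdown: str, limit: int = 3) -> list[tuple[str, str]]:
    """Return up to `limit` (kind, text) pairs for the leading text elements."""
    elements: list[tuple[str, str]] = []
    paragraph: list[str] = []

    def flush() -> None:
        # Close the paragraph being collected, if any.
        if paragraph:
            elements.append(("p", " ".join(paragraph)))
            paragraph.clear()

    for raw in markdown.splitlines():
        # Strip images and HTML tags, then reduce links to their visible text.
        text = LINK.sub(r"\1", IMAGE_OR_TAG.sub("", raw)).strip()
        heading = HEADING.match(text)
        if heading:
            flush()
            elements.append((f"h{len(heading.group(1))}", heading.group(2).strip()))
        elif text:
            paragraph.append(text)
        else:
            flush()  # a blank or image-only line ends the current paragraph
        if len(elements) >= limit:
            break
    flush()
    return elements[:limit]


sample = (
    "![banner](banner.png)\n\n# My Tool\n[![ci](ci.svg)](https://ci.example.com)\n\n"
    "A small tool that\ndoes one thing.\n\n## Install\n\nMore text."
)
print(first_text_elements(sample))
```

The `kind` values (`h1`, `p`, `h2`, ...) make it easy to check whether a README follows the ideal H1, paragraph, H2 pattern described above, while the text itself can fill `1st_el`, `2nd_el`, and `3rd_el`.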

I am also counting the number of code blocks which may or may not be useful. There are other metrics I am collecting that are subjective and would be difficult to add to a function.

⇡ Back to Top

## Acknowledgments

- [5 tips for making your GitHub profile page accessible](https://github.blog/developer-skills/github/5-tips-for-making-your-github-profile-page-accessible/): The article that got me thinking about repo SEO
- [Awesome SEO tools](https://github.com/serpapi/awesome-seo-tools): decent list of tools
- [GitHub Search Engine Optimization (SEO): how to rank your repository in GitHub search](https://www.markepear.dev/blog/github-search-engine-optimization): Good article on specifics for GitHub Explore rank
- [GitHub SEO: Rank your repo and get adoption in 2026](https://nakora.ai/blog/github-seo): excellent tips
- [GitHub Pages SEO Analyzer](https://www.jekyllpad.com/tools/github-pages-seo-analyzer): Enter your GitHub page URL to get a report

⇡ Back to Top

## Contributing

Contributions are welcome! If you'd like to help improve this project, please read our [contribution guidelines](./CONTRIBUTING.md) on how to get started, our workflow, and code style expectations.

## License

This project is licensed under the MIT License (coming soon).

⇡ Back to Top
