Scrape Category pages of Wikipedia for Special Exports
- Host: GitHub
- URL: https://github.com/swamikannan/wiki-1-scraping-the-wikipedia-category-hierarchy
- Owner: SwamiKannan
- License: mit
- Created: 2023-12-08T07:47:37.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2023-12-26T07:46:47.000Z (almost 2 years ago)
- Last Synced: 2025-03-03T16:48:36.577Z (8 months ago)
- Topics: category, special-export, wikipedia, wikipedia-category, wikipedia-scraper
- Language: Python
- Homepage:
- Size: 216 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Processing Wikipedia data - I: Scraping the page names from Wikipedia category hierarchy
*(Cover image: the Wikipedia category hierarchy and the notion of depth used below)*
This repository exists for the purpose of populating a list of pages / categories that can be entered into Wikipedia's [Special:Export](https://en.wikipedia.org/wiki/Special:Export) page to request XML files.
I was in the process of creating an AI assistant for Physics and needed to download the requisite information from Wikipedia. Wikipedia allows us to:
- Download all articles for a single category (say, Physics) from the Special:Export page.
- Download an arbitrary, unrelated set of pages as a single XML file from the Special:Export page.
- Download the XML file for the current revision of a single article.
- Download the entire Wikipedia database dump and parse it ourselves.
However, I needed not just the articles under the "Physics" category, but also the articles under its subcategories, e.g. the pages under Astrophysics or Physicists by nationality. Parsing the whole database and then filtering out the categories I wanted was troublesome. Hence, this repository.
## How to run the library
### 1. Clone the repository
```
git clone https://github.com/SwamiKannan/Scraping_Wikipedia_categories.git
```
### 2. Pip install the requirements
From the command line, navigate to the repository folder and run:
```
pip install -r requirements.txt
```
#### Note 1: This assumes that you already have Python installed, along with pip and git.
### 3. Decide your parameters
1. Get the URL from which you want to scrape the subcategories and pages. This URL must be a **category** page on Wikipedia, i.e. a URL of the format **https://en.wikipedia.org/wiki/Category:**
2. Decide on the maximum number of sub-categories you would like to scrape (optional)
3. Decide on the maximum number of page names you would like to extract (optional)
4. Decide on the depth of the category tree that you would like to extract the page names for (depth is explained in the cover image above)
#### Note 2: If you provide (2), (3) and (4), whichever criterion is met first will halt the scraping
#### Note 3: If you do not provide (2), (3) or (4) above, the script will keep running until all subcategories are exhausted. This is not recommended, since within 7 levels of depth you can go from Physics to Selena Gomez's We Own the Night Tour page, as below:
*(Image: a 7-level category path from Physics to Selena Gomez's We Own the Night Tour)*
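To make the stopping criteria concrete: the actual logic lives in `src/get_pages.py`, but the sketch below shows, under stated assumptions, how a breadth-first walk over category pages can halt at whichever of the three limits is hit first. The CSS selectors (`#mw-subcategories`, `#mw-pages`), the function name and the argument names are illustrative assumptions, not the repository's code.
```
from collections import deque
from urllib.parse import unquote

import requests
from bs4 import BeautifulSoup

BASE = "https://en.wikipedia.org"

def scrape_category(start_url, max_categories=None, max_pages=None, max_depth=None):
    """Illustrative breadth-first walk over a Wikipedia category tree.

    Stops as soon as ANY provided limit is hit (category count, page count
    or depth), mirroring Note 2 above. Not the repository's implementation.
    """
    seen_categories, pages = {start_url}, []
    queue = deque([(start_url, 0)])  # (category URL, depth)

    while queue:
        url, depth = queue.popleft()
        if max_depth is not None and depth > max_depth:
            continue
        soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")

        # Article links sit inside the "mw-pages" block of a category page.
        for a in soup.select("#mw-pages a[href^='/wiki/']"):
            pages.append(unquote(a["href"].removeprefix("/wiki/")))
            if max_pages is not None and len(pages) >= max_pages:
                return seen_categories, pages

        # Sub-category links sit inside the "mw-subcategories" block.
        for a in soup.select("#mw-subcategories a[href^='/wiki/Category:']"):
            sub = BASE + a["href"]
            if sub not in seen_categories:
                seen_categories.add(sub)
                queue.append((sub, depth + 1))
                if max_categories is not None and len(seen_categories) >= max_categories:
                    return seen_categories, pages

    return seen_categories, pages

if __name__ == "__main__":
    cats, pages = scrape_category(
        "https://en.wikipedia.org/wiki/Category:Physics",
        max_categories=50, max_pages=200, max_depth=2,
    )
    print(len(cats), "categories,", len(pages), "page names")
```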
### 4. Run the code below:
First navigate to the 'src' directory.
Then run the code below:
```
python get_pages.py "<category url>" [-o <output directory>] [-pl <max pages>] [-cl <max categories>] [-d <max depth>]
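
# Hypothetical example, assuming -o = output directory, -pl = max pages,
# -cl = max categories and -d = max depth: scrape the Physics category tree,
# stopping at 500 categories, 5000 pages or a depth of 3, whichever comes first.
python get_pages.py "https://en.wikipedia.org/wiki/Category:Physics" -o output -pl 5000 -cl 500 -d 3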
```
## Outputs:
A folder "data" is created in the chosen output directory (or in the root directory of the repository if no output directory is provided), containing:
- category_names.txt - A text file listing the categories / sub-categories that have been identified
- category_links.txt - A text file listing the **URLs** of the categories / sub-categories that have been identified
- page_names.txt - A text file listing the names of the pages that have been collected
- page_links.txt - A text file listing the **URLs** of the pages that have been collected
- done_links.txt - A text file listing the categories that have been identified **and traversed**. This is a reference in case we want to restart the session with the same parent category.
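As a quick sanity check on these outputs, a small snippet like the one below can count the collected pages and split them into batches for pasting into Special:Export. It assumes page_names.txt holds one page title per line inside a data folder next to where you run it; adjust the paths to your setup.
```
from pathlib import Path

# Assumption: data/page_names.txt holds one page title per line.
names = [line.strip()
         for line in Path("data/page_names.txt").read_text(encoding="utf-8").splitlines()
         if line.strip()]
print(f"{len(names)} page names collected")

# Split into batches of 100 titles, e.g. for pasting into the Special:Export textarea.
batch_size = 100
for i in range(0, len(names), batch_size):
    batch = names[i:i + batch_size]
    Path(f"data/export_batch_{i // batch_size:03d}.txt").write_text("\n".join(batch), encoding="utf-8")
```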
## Usage:
### Option 1: Through the browser
1a. Go to Wikipedia's [Special:Export](https://en.wikipedia.org/wiki/Special:Export) page
1b. Enter the details from category_names.txt or page_names.txt as below:
*(Screenshot: Wikipedia's Special:Export page with the page names pasted in)*
OR
### Option 2: Through Python
2a. Install the requests library:
```
pip install requests
```
2b. Inside a Python console, run the following code:
```
import requests

# Replace with the page you want to export (hypothetical example value).
page_name = "Physics"

url = 'https://en.wikipedia.org/wiki/Special:Export/' + page_name
response = requests.get(url)
if response.status_code == 200:
    content = response.text  # the exported XML, decoded as text
    if content:
        # Replace the file name with wherever you want the XML saved.
        with open(page_name + '.xml', 'w', encoding='utf-8') as outfile:
            outfile.write(content)
```
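Special:Export can also be driven programmatically for a whole batch of titles, e.g. the contents of page_names.txt. The sketch below is an assumption-laden example rather than part of this repository: it posts a newline-separated list of titles to Special:Export, assuming the form fields `pages` (titles) and `curonly` (current revision only) behave as they do on the Special:Export page today, and saves the combined XML.
```
import requests

# Hypothetical batch export: POST many titles to Special:Export at once.
# Assumed form fields: 'pages' (newline-separated titles) and 'curonly'
# (export the current revision only).
titles = ["Physics", "Astrophysics", "Quantum mechanics"]  # e.g. read from data/page_names.txt

response = requests.post(
    "https://en.wikipedia.org/wiki/Special:Export",
    data={"pages": "\n".join(titles), "curonly": "1"},
    timeout=60,
)
response.raise_for_status()

# The response body is the exported XML; write it out as raw bytes.
with open("export.xml", "wb") as outfile:
    outfile.write(response.content)
```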