Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/alvarofpp/network-from-wikipedia
Script for constructing a network from Wikipedia pages
dataset graphml networkx networkx-graph python3 wikipedia
- Host: GitHub
- URL: https://github.com/alvarofpp/network-from-wikipedia
- Owner: alvarofpp
- License: other
- Created: 2021-08-22T18:18:33.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2024-03-08T00:13:03.000Z (8 months ago)
- Last Synced: 2024-03-08T01:26:24.790Z (8 months ago)
- Topics: dataset, graphml, networkx, networkx-graph, python3, wikipedia
- Language: Python
- Homepage:
- Size: 160 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Network from Wikipedia
`to_graphml.py` can be used to construct a network from Wikipedia pages.
This script is based on a Colab notebook from [ivanovitchm/network_analysis](https://github.com/ivanovitchm/network_analysis)
(Week 07 Directed networks: case study of Wikipedia pages).

An example output file is available [here](output.graphml).

The snowballing process starts from the page given in your `source` argument.

> "When you start the snowballing, you will eventually (and quite soon) bump
into the pages describing ISBN and ISSN numbers, the arXiv, PubMed, and the
like. Almost all other Wikipedia pages refer to one or more of those pages.
This hyper-connectedness transforms any network into a collection of almost
perfect gigantic stars, making all Wikipedia-based networks look similar. To
avoid the stardom syndrome, treat the known 'star' pages as stop words in
information retrieval—in other words, ignore any links to them.
Constructing the black list of stop words, STOPS, is a matter of trial and
error. We put thirteen subjects on it; you may want to add more when you
come across other “stars.” We also excluded pages whose names begin with
"List of", because they are simply lists of other
subjects." - Ivanovitch's Jupyter notebook

You can change the `STOPS` words in [`constants.py`](https://github.com/alvarofpp/dataset-network-from-wikipedia/blob/main/utils/constants.py#L4).
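
The snowballing idea described in the quote can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `to_graphml.py`: the `snowball` function, its `get_links` parameter (any callable that returns the titles a page links to) and the three-entry `STOPS` set are hypothetical names chosen for the example, and the link table is made-up data so that the sketch runs without network access.

```python
from collections import deque

# Hypothetical stop list for illustration only; the project's real list
# lives in utils/constants.py and may contain different entries.
STOPS = {"International Standard Book Number", "PubMed", "ArXiv"}


def snowball(source, get_links, layers=2, stops=STOPS):
    """Breadth-first crawl from `source`; returns a set of (page, link) edges.

    `get_links` is an assumed callable mapping a page title to the titles
    it links to. Links to stop pages and to "List of ..." pages are ignored.
    """
    edges = set()
    seen = {source}
    queue = deque([(0, source)])
    while queue:
        layer, page = queue.popleft()
        if layer >= layers:
            continue  # pages on the last layer stay in the graph but are not expanded
        for link in get_links(page):
            if link in stops or link.startswith("List of"):
                continue  # skip "star" pages and list pages
            edges.add((page, link))
            if link not in seen:
                seen.add(link)
                queue.append((layer + 1, link))
    return edges


# Made-up link table standing in for real Wikipedia lookups.
FAKE_LINKS = {
    "Complex network": ["Graph theory", "PubMed", "List of network scientists"],
    "Graph theory": ["Complex network", "Leonhard Euler"],
    "Leonhard Euler": ["Mathematics"],
}

edges = snowball("Complex network", lambda page: FAKE_LINKS.get(page, []), layers=2)
for edge in sorted(edges):
    print(edge)
```

With `layers=2`, the page "Leonhard Euler" is reached but not expanded, so "Mathematics" never enters the result. The returned edge set can be passed to `networkx.DiGraph(edges)` if a NetworkX graph is wanted.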
## Examples of use
- [Javier Barriuso (Barri) - Graphs of the different leaders of each political party in Argentina.](https://x.com/BarriPdmx/status/1437720971746631680)
## Requirements
The script is written in Python. Dependencies can be installed via:
```shell
pip install -r requirements.txt
```

## How to run
Run `to_graphml.py` providing:
| Argument | Required | Description | Default value |
| -------- | -------- | ----------- | ------------- |
| `-s`, `--source` | Yes | URL or title of a Wikipedia page. | - |
| `-d`, `--degree` | No | Minimum degree used to filter nodes (degree equal to or greater than this value). | `2` |
| `-l`, `--layers` | No | Number of search layers. | `2` |
| `-o`, `--output` | No | Output filename, without the `.graphml` extension. | `'output'` |
| `-v`, `--verbose` | No | Increase output verbosity. | `False` |

### Examples
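
For readers who want to see how the arguments in the table fit together, a parser for them could be declared with `argparse` along these lines. This is a sketch assuming the flags and defaults listed in the table; the `build_parser` helper and the help texts are hypothetical, and the parser actually used by `to_graphml.py` may be written differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the argument table (illustrative only)."""
    parser = argparse.ArgumentParser(
        description="Construct a network from Wikipedia pages."
    )
    parser.add_argument("-s", "--source", required=True,
                        help="URL or title of a Wikipedia page.")
    parser.add_argument("-d", "--degree", type=int, default=2,
                        help="Minimum degree used to filter nodes.")
    parser.add_argument("-l", "--layers", type=int, default=2,
                        help="Number of search layers.")
    parser.add_argument("-o", "--output", default="output",
                        help="Output filename, without extension.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Increase output verbosity.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# can be run as-is.
args = build_parser().parse_args(
    ["--source=Complex network", "--layers=5", "--degree=10"]
)
print(args.source, args.layers, args.degree, args.output, args.verbose)
```

Using `action="store_true"` for `--verbose` gives the `False` default shown in the table without requiring a value on the command line.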
Basic usage:
```shell
python3 to_graphml.py --source='Complex network'
# Or (tested only on `en.wikipedia`)
python3 to_graphml.py --source='https://en.wikipedia.org/wiki/Complex_network'
```

Verbose mode:
```shell
python3 to_graphml.py --source='Complex network' --verbose
```

Search deeper, filter by a higher degree, and change the output file:
```shell
python3 to_graphml.py --source='Complex network' --layers=5 --degree=10 --output=graph_5l_10d
# The output file is `graph_5l_10d.graphml`
```

## Contributing
Contributions are more than welcome. Fork, improve and make a pull request.
For bugs, ideas for improvement, or anything else, please open an [issue](https://github.com/alvarofpp/dataset-network-from-wikipedia/issues).

## License
This project is licensed under the GNU Affero General Public License - see
the [LICENSE.md](LICENSE) file for details.