Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/absurdprofit/multiuse-spider
A scraping algorithm that extracts all tags from a website and organizes the tags and their attributes in a dictionary. The implementation is written in Python; it is long, repetitive, and inefficient and could be done better, but it does the job quite well in a short amount of time and works with any website it can get through to. Enjoy :)
Last synced: 10 days ago
- Host: GitHub
- URL: https://github.com/absurdprofit/multiuse-spider
- Owner: absurdprofit
- Created: 2019-11-22T01:22:52.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2019-11-24T05:58:14.000Z (about 5 years ago)
- Last Synced: 2024-07-25T04:56:17.383Z (5 months ago)
- Language: Python
- Size: 13.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
README
# Multiuse-Spider
A scraping algorithm that extracts all tags from a website and organizes the tags and their attributes in a dictionary. The implementation is written in Python; it is long, repetitive, and inefficient and could be done better, but it does the job quite well in a short amount of time and works with any website it can get through to. Enjoy :)

The object is instantiated with any URL passed to it, e.g. `spider = Spider("https://google.com")`. Calling the method `scrape` scrapes the site and stores the data in the attribute `parse_dict`.
`parse_dict` is structured in the form:

    {
        <tag-name>: [
            {
                <attrib-name>: <value>
            }
        ]
    }
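A dictionary of this shape can be built with the standard library alone. The sketch below is not the project's implementation: `TagCollector` and `build_parse_dict` are hypothetical names, and splitting `class` into a list and adding a `text` key to every entry are assumptions inferred from the example entries in this README.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot contain text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagCollector(HTMLParser):
    """Hypothetical helper: maps each tag name to a list of attribute dicts."""

    def __init__(self):
        super().__init__()
        self.parse_dict = defaultdict(list)
        self._open = []  # stack of (tag, entry) pairs for tags still open

    def handle_starttag(self, tag, attrs):
        entry = {}
        for name, value in attrs:
            # Assumption: 'class' values are split into a list of class names.
            entry[name] = value.split() if name == "class" and value else value
        entry["text"] = ""
        self.parse_dict[tag].append(entry)
        if tag not in VOID_TAGS:
            self._open.append((tag, entry))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Attach non-blank text to the innermost open tag only.
        if self._open and data.strip():
            self._open[-1][1]["text"] += data.strip()


def build_parse_dict(html):
    """Parse an HTML string and return {tag: [{attribute: value, ...}, ...]}."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return dict(collector.parse_dict)
```

Fetching the page is left out on purpose; any HTML string can be passed to `build_parse_dict`, for example the decoded body returned by `urllib.request.urlopen`.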
The method `sort_structure` takes input data, which is any list of attribute dictionaries stored under a tag in `parse_dict`. Example call: `sort_structure(self.parse_dict['a'])`, with `self.parse_dict['a']` being the list of attributes found in every `a` tag, i.e.:

    {'a':
        [
            {'class': ['svg', 'btn'], 'href': '/modal', 'text': 'Text inside tag'},
            {'class': ['btn'], 'href': '#', 'text': ''}
        ]
    }
alpha returns data in the form:

    {
        <tag-name>: {
            <index-letter>: {
                <attrib-name>: [
                    <value>
                ]
            }
        }
    }

Here index-letter is the first letter of the attributes it encounters, attrib-name is the name of each attribute with index-letter as its first letter, and value is a sorted list of all values found for that attribute.
E.g. output using the example input above:

    {
        'a': {
            'c': {
                'class': ['btn', 'svg']
            },
            'h': {
                'href': ['#', '/modal']
            },
            't': {
                'text': ['Text inside tag', '']
            }
        }
    }
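The grouping described above can be sketched in a few lines. This is an illustration only, not the repository's code: `group_by_first_letter` is a hypothetical name, it returns only the inner mapping (without the enclosing tag key), it drops duplicate and valueless attributes, and it uses Python's plain lexicographic sort, so its ordering may differ from the project's output (an empty string sorts first here).

```python
def group_by_first_letter(entries):
    """Group attribute values by the first letter of each attribute name.

    `entries` is a list of attribute dicts, e.g. parse_dict['a'].
    Returns {index_letter: {attrib_name: sorted list of distinct values}}.
    """
    grouped = {}
    for entry in entries:
        for name, value in entry.items():
            # Treat list values (such as 'class') and scalars uniformly.
            values = value if isinstance(value, list) else [value]
            bucket = grouped.setdefault(name[0], {}).setdefault(name, [])
            for item in values:
                # Skip valueless attributes and values already recorded.
                if item is not None and item not in bucket:
                    bucket.append(item)
    # Sort every value list in place once all entries have been merged.
    for by_name in grouped.values():
        for bucket in by_name.values():
            bucket.sort()
    return grouped


example = [
    {'class': ['svg', 'btn'], 'href': '/modal', 'text': 'Text inside tag'},
    {'class': ['btn'], 'href': '#', 'text': ''},
]
print(group_by_first_letter(example))
```

Using `setdefault` for both levels keeps the function to a single pass over the input, with sorting deferred until every value has been collected.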