Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

https://github.com/absurdprofit/multiuse-spider

Host: GitHub
URL: https://github.com/absurdprofit/multiuse-spider
Owner: absurdprofit
Created: 2019-11-22T01:22:52.000Z (about 5 years ago)
Default Branch: master
Last Pushed: 2019-11-24T05:58:14.000Z (about 5 years ago)
Last Synced: 2024-07-25T04:56:17.383Z (5 months ago)
Language: Python
Size: 13.7 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# Multiuse-Spider

A scraping algorithm that extracts all tags from a website and organizes the tags and their attributes in a dictionary. The implementation is written in Python. It is long, repetitive, and inefficient, and could be done better, but it does the job quite well in a short amount of time and works with any website it can reach. Enjoy :)

The object is instantiated with any URL passed to it, e.g. spider = Spider("https://google.com"). Calling the scrape method scrapes the site and stores the data in the parse_dict attribute.

parse_dict is structured in the form

    {
      tag-name: [
        {
          attrib-name: value
        }
      ]
    }
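To make the shape of parse_dict concrete, here is a minimal sketch that builds a dictionary of that form from an HTML string using only the standard library. It is not the repository's implementation: the names TagCollector, VOID_TAGS, and build_parse_dict are hypothetical, and it assumes that class values are stored as lists and that each entry gets a 'text' key holding the text directly inside the tag.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they cannot contain text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagCollector(HTMLParser):
    """Hypothetical helper: maps each tag name to a list of attribute dicts."""

    def __init__(self):
        super().__init__()
        self.parse_dict = {}  # tag-name -> list of {attrib-name: value}
        self._open = []       # stack of (tag, entry) for tags still open

    def handle_starttag(self, tag, attrs):
        entry = {}
        for name, value in attrs:
            # Assumption: class is split into a list of class names.
            entry[name] = (value or "").split() if name == "class" else value
        entry["text"] = ""
        self.parse_dict.setdefault(tag, []).append(entry)
        if tag not in VOID_TAGS:
            self._open.append((tag, entry))

    def handle_endtag(self, tag):
        # Close the most recent matching open tag and anything nested in it.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Attach non-blank text to the innermost open tag.
        if self._open and data.strip():
            self._open[-1][1]["text"] += data.strip()


def build_parse_dict(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.parse_dict


sample = ('<div><a class="svg btn" href="/modal">Text inside tag</a>'
          '<a class="btn" href="#"></a></div>')
parse_dict = build_parse_dict(sample)
```

In this sketch, parse_dict['a'] holds one dictionary per a tag, in document order. Fetching the page over the network is left out so the example stays self-contained.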

The sort_structure method takes as input any list of attribute dictionaries stored under a tag in parse_dict. Example call: sort_structure(self.parse_dict['a']), with self.parse_dict['a'] being the list of attributes found in every a tag, i.e.

    {'a': [
        {'class': ['svg', 'btn'], 'href': '/modal', 'text': 'Text inside tag'},
        {'class': ['btn'], 'href': '#', 'text': ''}
    ]}

alpha returns data in the form

    {
      tag-name: {
        index-letter: {
          attrib-name: [
            value
          ]
        }
      }
    }

Here index-letter is the first letter of each attribute name encountered, attrib-name is the name of an attribute that starts with that letter, and value is a sorted list of all values found for that attribute.

Example output using the example input above:

    {
      'a': {
        'c': {'class': ['btn', 'svg']},
        'h': {'href': ['#', '/modal']},
        't': {'text': ['Text inside tag', '']}
      }
    }
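The grouping described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the function name sort_structure_sketch is hypothetical, values are assumed to be strings or lists of strings, duplicates are dropped, and Python's built-in sorted() is used, which places the empty string first and so may order the 'text' values differently from the example output. The sketch returns only the index-letter level; wrapping the result under the tag name is left to the caller.

```python
def sort_structure_sketch(attr_dicts):
    """Group attribute values by the first letter of the attribute name."""
    collected = {}  # attrib-name -> set of all values seen
    for attrs in attr_dicts:
        for name, value in attrs.items():
            if value is None:
                continue  # valueless attributes are skipped in this sketch
            # List values (such as class) are flattened into single values.
            values = value if isinstance(value, list) else [value]
            collected.setdefault(name, set()).update(values)

    result = {}  # index-letter -> {attrib-name: sorted list of values}
    for name, values in collected.items():
        if name:
            result.setdefault(name[0], {})[name] = sorted(values)
    return result


links = [
    {'class': ['svg', 'btn'], 'href': '/modal', 'text': 'Text inside tag'},
    {'class': ['btn'], 'href': '#', 'text': ''},
]
grouped = {'a': sort_structure_sketch(links)}
```

Collecting values in a set before sorting keeps each value once even when many tags share it, which is why 'btn' appears a single time under 'class'.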