https://github.com/dataglyder/data-sources-and-sql
This repo touches on data sources and relational databases
- Host: GitHub
- URL: https://github.com/dataglyder/data-sources-and-sql
- Owner: dataglyder
- Created: 2025-06-17T21:09:40.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-08-20T22:30:49.000Z (7 months ago)
- Last Synced: 2025-08-21T00:25:39.053Z (7 months ago)
- Topics: beautifulsoup4, csv, data-cleaning, data-collection, functions, json, python3, regex, sql, sqlite3
- Homepage:
- Size: 23.4 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Data Sources and SQL
Working with data is fun, but getting access to quality, readily available data can be challenging. This article touches on a few sources of data and on how to store structured data.
### Data Sources
Data scientists and analysts often work in environments where the organization has its own sources of data. For example, most retail companies store data about their customers and business transactions, and these can be made available to analysts when necessary. Where such data is absent, the responsibility may fall on the analyst to source data, usually via the internet.
### Data Storage
Structured (tabular) data can be stored in a spreadsheet or a relational database, depending on its size. A relational database can accommodate far more data than a spreadsheet, and it is called relational because its tables can be linked to one another through shared keys.
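As a minimal sketch of relational storage, Python's built-in `sqlite3` module can create a table, insert rows, and query them with SQL. The table and column names below are my own, chosen for illustration only:

```python
import sqlite3

# Create an in-memory database (pass a filename instead to persist to disk)
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# A hypothetical table of books; the names are for illustration only
cursor.execute("CREATE TABLE books (title TEXT, category TEXT, price REAL)")
cursor.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("A Light in the Attic", "Poetry", 51.77),
     ("Sharp Objects", "Mystery", 47.82)],
)
connection.commit()

# Tables can be queried (and joined to related tables) with SQL
rows = cursor.execute("SELECT title, price FROM books WHERE price > 50").fetchall()
print(rows)
connection.close()
```

A real schema would normally split categories into their own table and link them by key, which is where the "relational" part earns its name.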
### Web Scraping
Web scraping is the act of gathering or collecting data from websites. While some websites place heavy security around their data and prohibit unauthorized collection, others allow free scraping. It is responsible and ethical to check a website's rules before collecting its data. For data that is available in HTML (HyperText Markup Language), BeautifulSoup is a good way to extract it.
### Extracting HTML Tags with BeautifulSoup
Tags in websites built with HTML can be extracted with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc). Tags may be accompanied by unwanted text, which BeautifulSoup can also help separate out. Below is a demonstration of how this can be achieved. We'll be using BeautifulSoup to access data from [Books to Scrape](https://books.toscrape.com/), a website that allows free collection of its data.
### Connect to a website
First we need to import the library that will help us connect to the website: `urllib.request`.
***Ensure that all libraries have been downloaded before import***
```
import urllib.request

def open_webpage(url):
    # print(urllib.request.urlopen(url))  # To confirm the connection was successful
    return urllib.request.urlopen(url)

# Let's insert the URL to test our function
open_webpage("https://books.toscrape.com/")
```
### Access HTML Elements with BeautifulSoup
Now, let's view the HTML elements with BeautifulSoup.
```
from bs4 import BeautifulSoup as bee

def get_elements(elements):
    soup = bee(elements, "html.parser")
    print(soup)  # Optional: view the parsed page
    return soup

# Ready to test our function
get_elements(open_webpage("https://books.toscrape.com/"))
```
***Function execution: for `get_elements()` to work, it must first process `open_webpage()`; hence `get_elements(open_webpage("https://books.toscrape.com/"))`.***
***Let's combine both functions into one for readability***
```
import urllib.request
from bs4 import BeautifulSoup as bee

def open_webpg_get_elements(url):
    elements = bee(urllib.request.urlopen(url), "html.parser")
    print(elements)
    return elements

# Let's test our function and print some text
open_webpg_get_elements("https://books.toscrape.com/")
```
**A Glimpse of the printed page**:
The printed data has a lot of tags with both wanted and unwanted texts.
### Extracting HTML Tags
One of the tags that I would like to get is the anchor tag. It houses the catalogue links of books and their categories, but it also needs to be separated from unwanted text. Let's combine [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names) and Python [regular expressions](https://docs.python.org/3/library/re.html) to extract what we need.
```
import re

def anchor_tag(tags):
    """
    Get the anchor tags
    Extract the category links
    Extract the book categories
    """
    tag = tags("a")
    # Anchor tags 3-52 hold the book-category links on this page
    anchor_tag_ind = tag[3:53]
    # The exact patterns were lost in this copy; the ones below are a
    # reconstruction that pulls each href value and each link's text
    category_link = re.findall(r'href="(.*?)"', str(anchor_tag_ind))
    category_name = re.findall(r">\s*(.*?)\s*</a>", str(anchor_tag_ind))
    print(category_link, category_name)  # Optional: view the lists
    return category_link, category_name

# Ready to test our function
anchor_tag(open_webpg_get_elements("https://books.toscrape.com/"))
```
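The href-extraction idea can be checked offline on a small snippet. The snippet and patterns below are my own illustration (modelled on the link format Books to Scrape uses), not output copied from the site:

```python
import re

# A tiny stand-in for the anchor tags scraped from a page
snippet = (
    '<a href="catalogue/category/books/travel_2/index.html">Travel</a>'
    '<a href="catalogue/category/books/mystery_3/index.html">Mystery</a>'
)

# Pull out every href value and every link text
links = re.findall(r'href="(.*?)"', snippet)
names = re.findall(r">(.*?)</a>", snippet)
print(links)
print(names)
```

The lazy quantifier `.*?` matters here: a greedy `.*` would swallow everything up to the last quote or closing tag instead of stopping at the first one.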