https://github.com/dataglyder/data-sources-and-sql
This repo touches on data sources and relational databases
- Host: GitHub
- URL: https://github.com/dataglyder/data-sources-and-sql
- Owner: dataglyder
- Created: 2025-06-17T21:09:40.000Z (10 months ago)
- Default Branch: main
- Last Pushed: 2025-08-20T22:30:49.000Z (7 months ago)
- Last Synced: 2025-08-21T00:25:39.053Z (7 months ago)
- Topics: beautifulsoup4, csv, data-cleaning, data-collection, functions, json, python3, regex, sql, sqlite3
- Homepage:
- Size: 23.4 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Data Sources and SQL
Working with data is fun, but getting access to quality, readily available data can be challenging. This article touches on a few sources of data and on how to store structured data.
### Data Sources
Data scientists and analysts often work in environments where the organization has its own sources of data. For example, most retail companies store data about their customers and business transactions, and these can be made available to analysts when necessary. Where such data is absent, the responsibility may fall on the analyst to source data, usually via the internet.
### Data Storage
Structured (tabular) data can be stored in a spreadsheet or a relational database, depending on its size. A relational database can accommodate far more data than a spreadsheet, and it is called relational because its tables can be linked to one another through shared keys.
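As a minimal sketch of relational storage, Python's built-in `sqlite3` module can create a table, insert rows, and query them with SQL. The table and column names below are my own, chosen for illustration only:

```python
import sqlite3

# Create an in-memory database (pass a filename instead to persist to disk)
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

# A hypothetical table of books; the names are for illustration only
cursor.execute("CREATE TABLE books (title TEXT, category TEXT, price REAL)")
cursor.executemany(
    "INSERT INTO books VALUES (?, ?, ?)",
    [("A Light in the Attic", "Poetry", 51.77),
     ("Sharp Objects", "Mystery", 47.82)],
)
connection.commit()

# Tables can be queried (and joined to related tables) with SQL
rows = cursor.execute("SELECT title, price FROM books WHERE price > 50").fetchall()
print(rows)
connection.close()
```

A real schema would normally split categories into their own table and link them by key, which is where the "relational" part earns its name.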
### Web Scraping
Web scraping is the act of gathering or collecting data from websites. While some websites place heavy security around their data and prohibit unauthorized collection, others allow free scraping. It is responsible and ethical to check a website's rules before collecting its data. For data that is available in HTML (HyperText Markup Language), BeautifulSoup is a good way to extract it.
### Extracting HTML Tags with BeautifulSoup
Tags in websites built with HTML can be extracted with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc). Tags may be accompanied by unwanted text, which BeautifulSoup can also help separate out. Below is a demonstration of how this can be achieved. We'll be using BeautifulSoup to access data from [Books to Scrape](https://books.toscrape.com/), a website that allows free collection of its data.
### Connect to a website
First we need to import the library that will help us connect to the website: `urllib.request`.
***Ensure that all libraries have been downloaded before import***
```
import urllib.request

def open_webpage(url):
    # print(urllib.request.urlopen(url))  # To confirm the connection was successful
    return urllib.request.urlopen(url)

# Let's insert the URL to test our function
open_webpage("https://books.toscrape.com/")
```
### Access HTML Elements with BeautifulSoup
Now, let's view the HTML elements with BeautifulSoup.
```
from bs4 import BeautifulSoup as bee

def get_elements(elements):
    soup = bee(elements, "html.parser")
    print(soup)  # Optional: view the parsed page
    return soup

# Ready to test our function
get_elements(open_webpage("https://books.toscrape.com/"))
```
***Function execution: for `get_elements()` to work, it must first process `open_webpage()`; hence `get_elements(open_webpage("https://books.toscrape.com/"))`.***
***Let's combine both functions into one for readability***
```
import urllib.request
from bs4 import BeautifulSoup as bee

def open_webpg_get_elements(url):
    elements = bee(urllib.request.urlopen(url), "html.parser")
    print(elements)
    return elements

# Let's test our function and print some text
open_webpg_get_elements("https://books.toscrape.com/")
```
**A Glimpse of the printed page**:
The printed data has a lot of tags with both wanted and unwanted texts.
### Extracting HTML Tags
One of the tags that I would like to get is the anchor tag. It houses the catalogue links of books and their categories, but it also needs to be separated from unwanted text. Let's combine [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names) and Python [regular expressions](https://docs.python.org/3/library/re.html) to extract what we need.
```
import re

def anchor_tag(tags):
    """
    Get the anchor tags
    Extract the category links
    Extract the book categories
    """
    tag = tags("a")
    # Anchor tags 3-52 hold the book-category links on this page
    anchor_tag_ind = tag[3:53]
    # The exact patterns were lost in this copy; the ones below are a
    # reconstruction that pulls each href value and each link's text
    category_link = re.findall(r'href="(.*?)"', str(anchor_tag_ind))
    category_name = re.findall(r">\s*(.*?)\s*</a>", str(anchor_tag_ind))
    print(category_link, category_name)  # Optional: view the lists
    return category_link, category_name

# Ready to test our function
anchor_tag(open_webpg_get_elements("https://books.toscrape.com/"))
```
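The href-extraction idea can be checked offline on a small snippet. The snippet and patterns below are my own illustration (modelled on the link format Books to Scrape uses), not output copied from the site:

```python
import re

# A tiny stand-in for the anchor tags scraped from a page
snippet = (
    '<a href="catalogue/category/books/travel_2/index.html">Travel</a>'
    '<a href="catalogue/category/books/mystery_3/index.html">Mystery</a>'
)

# Pull out every href value and every link text
links = re.findall(r'href="(.*?)"', snippet)
names = re.findall(r">(.*?)</a>", snippet)
print(links)
print(names)
```

The lazy quantifier `.*?` matters here: a greedy `.*` would swallow everything up to the last quote or closing tag instead of stopping at the first one.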