{"id":30217166,"url":"https://github.com/dataglyder/data-sources-and-sql","last_synced_at":"2026-02-09T13:31:57.167Z","repository":{"id":299730528,"uuid":"1003925265","full_name":"dataglyder/Data-Sources-and-SQL","owner":"dataglyder","description":"This repo touched on data sources and the relational data base","archived":false,"fork":false,"pushed_at":"2025-08-20T22:30:49.000Z","size":24,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-21T00:25:39.053Z","etag":null,"topics":["beautifulsoup4","csv","data-cleaning","data-collection","functions","json","python3","regex","sql","sqlite3"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dataglyder.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-17T21:09:40.000Z","updated_at":"2025-08-20T22:58:36.000Z","dependencies_parsed_at":"2025-08-13T01:12:42.753Z","dependency_job_id":"b2a81a9b-b3ef-4033-83f9-9bf02d0acc23","html_url":"https://github.com/dataglyder/Data-Sources-and-SQL","commit_stats":null,"previous_names":["dataglyder/data-sources-and-sql.io","dataglyder/data-sources-and-sql"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dataglyder/Data-Sources-and-SQL","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataglyder%2FData-Sources-and-SQL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataglyder%2FData-Sources-and-SQL/tags","releases_url":"https://repos
.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataglyder%2FData-Sources-and-SQL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataglyder%2FData-Sources-and-SQL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dataglyder","download_url":"https://codeload.github.com/dataglyder/Data-Sources-and-SQL/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dataglyder%2FData-Sources-and-SQL/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29266937,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-09T12:53:16.161Z","status":"ssl_error","status_checked_at":"2026-02-09T12:52:30.244Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["beautifulsoup4","csv","data-cleaning","data-collection","functions","json","python3","regex","sql","sqlite3"],"created_at":"2025-08-14T04:42:15.368Z","updated_at":"2026-02-09T13:31:57.160Z","avatar_url":"https://github.com/dataglyder.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Sources and SQL\nWorking with data is fun, but having access to quality and readily available data might sometimes be challenging. 
This article touches on a few sources of data and on how to store structured data.\n\n### Data Sources\nData scientists and analysts often work in an environment where the organization has its own source of data. For example, most retail companies store data about their customers and business transactions, and this can be made available to analysts when necessary. Where no such source exists, the responsibility may fall on the analyst to source the data, usually via the internet.\n\n### Data Storage\nStructured (tabular) data can be stored in a spreadsheet or in a relational database, depending on its size. A relational database can generally accommodate more data than a spreadsheet. It is called relational because it organizes data into tables (relations) that can be linked to one another through shared key columns.\n\n### Web Scraping\nWeb scraping is the act of gathering or collecting data from websites. While some websites have heavy security around their data and prohibit unauthorized collection of it, others allow their data to be scraped freely. It is responsible and ethical to check a website's rules before collecting its data. For data that is available in HTML (Hyper Text Markup Language), BeautifulSoup can be a good way to extract it.\n\n### Extracting HTML Tags with BeautifulSoup\nTags in websites built with HTML can be extracted with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc). Tags are often accompanied by unwanted text, which can also be separated out with [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc). Below is a demonstration of how this can be achieved. 
We'll be using BeautifulSoup to access data from [Books to Scrape](https://books.toscrape.com/), a website that allows free collection of its data.\n\n\n### Connect to a website \nFirst, we need to import the library that will help us connect to the website: \"urllib.request\".\n***Ensure that all third-party libraries (such as beautifulsoup4) have been installed before import***\n\n```\nimport urllib.request\n\ndef open_webpage(url):\n    # Open the URL and return the HTTP response object\n    response = urllib.request.urlopen(url)\n    # print(response.status)    # optional: 200 means the connection was successful\n    return response\n\n# Let's insert the url to test our function\nopen_webpage(\"https://books.toscrape.com/\")\n```\n\n### Access HTML Elements with BeautifulSoup\nNow, let's view the HTML elements with BeautifulSoup.\n```\nfrom bs4 import BeautifulSoup as bee\n\ndef get_elements(elements):\n    # Parse once: the response can only be read a single time\n    soup = bee(elements, \"html.parser\")\n    print(soup)    # optional\n    return soup\n\n# Ready to test our function\nget_elements(open_webpage(\"https://books.toscrape.com/\"))\n```\n***Function Execution: For \"get_elements()\" to work, it needs the response returned by \"open_webpage()\"; hence, \"get_elements(open_webpage(\"https://books.toscrape.com/\"))\"***\n\n\n***Let's combine both functions into one for readability***\n```\nimport urllib.request\nfrom bs4 import BeautifulSoup as bee\n\ndef open_webpg_get_elements(url):\n    # Open the page and parse its HTML in one step\n    elements = bee(urllib.request.urlopen(url), \"html.parser\")\n    print(elements)    # optional\n    return elements\n\n# Let's test our function and print some texts\nopen_webpg_get_elements(\"https://books.toscrape.com/\")\n\n```\n\n**A Glimpse of the printed page**:\nThe printed data has a lot of tags with both wanted and unwanted text.\n\n### Extracting HTML Tags\nOne of the tags that I would like to get is the anchor tag. It houses the catalogue links of the books and their categories. However, it also needs to be separated from unwanted text. 
Let's combine [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names) and Python [Regular Expressions](https://docs.python.org/3/library/re.html) to extract what we need.\n\n```\nimport re\n\ndef anchor_tag(tags):\n    \"\"\"\n    Get the anchor tags\n    Extract the category links\n    Extract the book categories\n    Extract the title links, titles, prices and ratings\n    Return all extracted lists\n    \"\"\"\n    tag = tags(\"a\")    # calling the soup is shorthand for find_all()\n    anchor_tag_ind = tag[3:53]    # the anchors that hold the book categories\n    # print(anchor_tag_ind)    # optional: view the anchor tags\n    category_link = re.findall(r'href=\"(.+?category.+?)\"', str(anchor_tag_ind))\n\n    book_category = []\n\n    for anchor in anchor_tag_ind:\n        book_category.append(anchor.get_text(strip=True))\n\n    # Get other needed tags and return all extracted tags\n\n    book_price = re.findall(r'\u003cp.+(£\\d{2}\\.\\d{2})', str(tags))\n    ratings = re.findall(r'\u003cp.+(star.+?)\"', str(tags))\n    title_catalogue = tags(\"h3\")\n    title = re.findall(r'title=\"(.+?)\"', str(title_catalogue))\n\n    title_links = []\n\n    for links in title_catalogue:\n        title_links.extend(re.findall(r'href=\"(.+?\\.html)\"', str(links)))\n\n    return category_link, book_category, title_links, title, book_price, ratings\n\n# Test the function\nanchor_tag(open_webpg_get_elements(\"https://books.toscrape.com/\"))\n\n```\n\n\n### Save Data for Future Use\nNow that we have our data ready, it can be saved in different ways for future use.\n\n### Save Data as Comma Separated Values (CSV) \nFor the saving to be successful, all the lists must be of equal length.\n\n```\nimport csv\n\ndef csv_data(data):\n    # anchor_tag() returns six lists; the CSV uses the four per-book lists\n    category_link, book_category, title_links, title, book_price, ratings = data\n    data = (title_links, book_price, title, ratings)\n    data_heading = (\"catalogue_link\", \"price\", \"title\", \"ratings\")\n\n    title_list = []\n    for i in range(len(title_links)):\n        dict_data = {data_heading[0]:data[0][i], data_heading[1]:data[1][i], data_heading[2]:data[2][i], data_heading[3]:data[3][i]}\n        title_list.append(dict_data)\n\n    with 
open(\"catalogue.csv\", \"w\", newline=\"\", encoding=\"utf-8\") as csvfile:\n        writer = csv.DictWriter(csvfile, fieldnames=[\"catalogue_link\", \"price\", \"title\", \"ratings\"])\n        writer.writeheader()\n        writer.writerows(title_list)\n\ncsv_data(anchor_tag(open_webpg_get_elements(\"https://books.toscrape.com/\")))\n\n#print(pd.read_csv(\"catalogue.csv\"))  # i.e. to check the saved data\n\n\"\"\"\nUsing the easier method: one can convert the data to a DataFrame and then\nuse .to_csv() to save the data in CSV format\n\n\"\"\"\n\nimport pandas as pd\n\n# Unpack the lists returned by anchor_tag() before building the DataFrame\ncategory_link, book_category, title_links, title, book_price, ratings = anchor_tag(open_webpg_get_elements(\"https://books.toscrape.com/\"))\ncatalogue = pd.DataFrame({\"catalogue_links\":title_links, \"book_title\":title, \"book_price\":book_price, \"ratings\":ratings})\ncatalogue.to_csv(\"catalogue.csv\", index=False)\n# print(catalogue)\n\n```\n\n### Save Data in JSON Format\n\n```\nimport json\n\ndef json_file(data):\n    # Unpack the six lists returned by anchor_tag()\n    category_link, book_category, title_links, title, book_price, ratings = data\n    dict_json = {\"category_link\":category_link, \"category\":book_category, \"catalogue_link\":title_links, \"price\":book_price, \"title\":title, \"ratings\":ratings}\n\n    with open(\"catalogue.json\", \"w\", encoding=\"utf-8\") as json_catalogue:\n        json.dump(dict_json, json_catalogue, indent=3)\n\njson_file(anchor_tag(open_webpg_get_elements(\"https://books.toscrape.com/\")))\n\n# Check or open the file\nwith open(\"catalogue.json\", \"r\", encoding=\"utf-8\") as bk:\n    books = json.load(bk)\n\n```\n### Save Data in Database using SQL\n\nHere, I will be using SQLite.\n```\ndef save_to_database(data):\n    \"\"\"\n    Choose a database name and establish a connection\n    Ensure the database is empty\n    Create the tables and their fields - I will be creating two tables\n    Prepare the data, i.e. make it available as a list of tuples\n    Insert the prepared data into the table(s)\n    Don't forget to commit\n    \"\"\"\n\n    import sqlite3\n\n    connection = sqlite3.connect(\"dataBaseName.sqlite\")\n    corsor = connection.cursor()\n\n    corsor.execute(\"DROP TABLE IF EXISTS bookCatalogue;\")\n    corsor.execute(\"DROP TABLE IF EXISTS 
titleLink;\")\n\n\n    corsor.execute(\"CREATE TABLE bookCatalogue(catalogue_link TEXT, category TEXT)\")\n    corsor.execute(\"CREATE TABLE titleLink(title_link TEXT, price TEXT, title TEXT, ratings TEXT)\")\n\n    # Unpack the six lists returned by anchor_tag()\n    category_link, book_category, title_links, title, book_price, ratings = data\n\n    table1_data = []\n    for i in range(len(category_link)):\n        book_link = (category_link[i], book_category[i])\n        table1_data.append(book_link)\n\n    table2_data = []\n    for j in range(len(title_links)):\n        titles = (title_links[j], book_price[j], title[j], ratings[j])\n        table2_data.append(titles)\n\n    corsor.executemany(\"INSERT INTO bookCatalogue VALUES(?,?)\", table1_data)\n    corsor.executemany(\"INSERT INTO titleLink VALUES(?,?,?,?)\", table2_data)\n\n    connection.commit()\n\n    # Verify your data in the database\n\n    checking = corsor.execute(\"SELECT category FROM bookCatalogue\")\n    print(checking.fetchone())    # or fetchall() for all rows\n    return checking\n\nsave_to_database(anchor_tag(open_webpg_get_elements(\"https://books.toscrape.com/\")))\n\n```\n***To be continued***\n\n### Extract Data with API\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataglyder%2Fdata-sources-and-sql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdataglyder%2Fdata-sources-and-sql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdataglyder%2Fdata-sources-and-sql/lists"}