{"id":16355279,"url":"https://github.com/mikewolfd/fmcsadatascraper","last_synced_at":"2025-12-03T06:30:16.254Z","repository":{"id":203584416,"uuid":"296748099","full_name":"mikewolfd/fmcsaDataScraper","owner":"mikewolfd","description":null,"archived":false,"fork":false,"pushed_at":"2020-09-21T01:48:16.000Z","size":29,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-12-29T04:32:25.549Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mikewolfd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-09-18T23:18:05.000Z","updated_at":"2024-09-15T00:14:04.000Z","dependencies_parsed_at":null,"dependency_job_id":"cb6580b1-311f-417f-92be-191abe555206","html_url":"https://github.com/mikewolfd/fmcsaDataScraper","commit_stats":null,"previous_names":["mikewolfd/fmcsadatascraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikewolfd%2FfmcsaDataScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikewolfd%2FfmcsaDataScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikewolfd%2FfmcsaDataScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mikewolfd%2FfmcsaDataScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mikewolfd","download_url":"https://codeloa
d.github.com/mikewolfd/fmcsaDataScraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239652985,"owners_count":19675008,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-11T01:40:22.617Z","updated_at":"2025-12-03T06:30:16.224Z","avatar_url":"https://github.com/mikewolfd.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# fmcsaDataScraper\n\nThis is a simple scraping tool to gather CarrierRegistration data from the ai.fmcsa.dot.gov website.\n\nData is accessible from mongodb or exported as an sqlite3 db file. \n\nThe csv is loaded using a pandas dataframe, multiprocessing is handled by concurrent.futures, data is parsed by beautifulsoup (lxml), and is scraped into mongodb. Peewee is used as an ORM for sqlite3, although this can easily be modified to use postgresql or mysql. I use docker-compose to run a mongo instance with a locally mounted dir. \n\nAll values are stripped of non-alphanumeric characters, converted to snake_case, and numerical values are converted to ints.\n\nExample mongo object shape: \n\n    {'carrier_id': 1, 'cargo': {'general_freight': False, ...}, 'types': [{'vehicle_type': 'hazmat_cargo_tank_trailers',...}, ...], 'index': 0}\n\nThe SQL models are Carrier, CarrierFreight, and CarrierVehicles, indexed in type, with foreign keys to Carrier. 
Carrier's primary key is the carrier_id from the csv.\n\nThe two functions in main.py are 'scrape' and 'build_sql':\n\n    'scrape' has multiple optional arguments:\n        quantity: an indexed and sliced amount of data based on the .csv index, defaults to all\n        max_tries: number of retries in case of connection or db access failure \n        cooldown: minutes to wait before a retry\n        max_workers: the number of process workers accessing the site and loading data \n\n    'build_sql' accepts optional arguments such as max_workers.\n\nThe file layout:\n    \n    main.py - convenience functions \n    settings.py - env loading, processing\n    generate_data.py - multiprocessing, data management for mongodb loading, main entry points\n    mongo_storage.py - pymongo entry with convenience functions\n    sql_store.py - peewee/sqlite entry with models and object creation code\n    scrape.py - url request, text processing, object shaping code\n\n\nTo use:\n\n    Create a .env file using example.env with updated or modified settings.\n    Install packages from requirements.txt\n    Docker and docker-compose must be installed to use the bundled mongodb; to run it, use 'docker-compose up'\n\n    In a python terminal or notebook, launch the scrape function from main.py to load the mongo db.\n\n    If you believe there was a disruption or failed loading, run the fix_store function from generate_data.py; it will search mongodb for locked entries and attempt to reload them.\n\n    Run build_sql from main.py to generate a sqlite3 file in the datadir.\n\nTODO:\n\n    testing\n    \n    csv exporting\n    \n    finish dockerizing the package\n    \n    memory optimization for the 
multiprocessing\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmikewolfd%2Ffmcsadatascraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmikewolfd%2Ffmcsadatascraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmikewolfd%2Ffmcsadatascraper/lists"}