{"id":20354815,"url":"https://github.com/datasets/opented","last_synced_at":"2025-04-12T02:37:05.461Z","repository":{"id":5632589,"uuid":"6840791","full_name":"datasets/opented","owner":"datasets","description":"Tenders Electronic Daily (TED) - OpenTED","archived":false,"fork":false,"pushed_at":"2024-10-25T14:26:11.000Z","size":18,"stargazers_count":9,"open_issues_count":1,"forks_count":9,"subscribers_count":21,"default_branch":"main","last_synced_at":"2025-04-12T02:37:01.505Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://opented.org/","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/datasets.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2012-11-24T15:01:22.000Z","updated_at":"2024-10-25T14:26:15.000Z","dependencies_parsed_at":"2024-10-24T23:57:44.929Z","dependency_job_id":"5ecdc100-2487-4bf2-8195-209c6f5c13ff","html_url":"https://github.com/datasets/opented","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasets%2Fopented","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasets%2Fopented/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasets%2Fopented/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/datasets%2Fopented/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/datasets","download_url":"https://codeload.github.com/datasets/opented/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","reposi
tories_count":248507017,"owners_count":21115522,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T23:09:47.239Z","updated_at":"2025-04-12T02:37:05.427Z","avatar_url":"https://github.com/datasets.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003ca className=\"gh-badge\" href=\"https://datahub.io/core/opented\"\u003e\u003cimg src=\"https://badgen.net/badge/icon/View%20on%20datahub.io/orange?icon=https://datahub.io/datahub-cube-badge-icon.svg\u0026label\u0026scale=1.25\" alt=\"badge\" /\u003e\u003c/a\u003e\n\nProcessing code and information related to OpenTED (Tenders Electronic Daily).\n\n## Data Processing Pipeline\n\nStructured data is in a MongoDB database at opented.org/opented.\n\nUnstructured cached HTML pages are also in that DB, in a collection called `dumps` (in future this data should probably go directly to S3!).\n\n### 1. Get dumps onto S3\n\n#### Get data out of MongoDB\n\n    mongoexport --host opented.org --db opented --username iacc --password gohack --collection dumps --csv --fields \"zhtml,doc_id,timestamp\" | head -n 5000 \u003e cache/dumps.csv\n\n#### Decompress the HTML\n\n    python scripts/extract.py\n\nThis will produce a whole bunch of files in `cache/dumps`.\n\n#### Upload the decompressed HTML to S3\n\n    s3cmd sync --acl-public cache/dumps/ s3://files.opented.org/scraped/\n\nYou will find the index of files at: http://files.opented.org.s3.amazonaws.com/scraped/index.json\n\n\n### 2. Scraping content\n\nNow it's time to scrape some content!\n\nWe've written a Node.js scraper. 
You will need to install the dependencies first:\n\n    npm install cheerio request\n\nThen do:\n\n    node scripts/scrape.js\n\nData will be written to `cache/dumps/{docid}/extracted.json`.\n\n\n## Wishlist\n\nExtra fields to scrape:\n\n* VAT inclusion (string e.g. \"Including VAT\", \"Excluding VAT\", \"Including 10% VAT\", etc.)\n* Award criteria (string, often multi-line, outlining criteria for choosing this bidder)\n\nWe also need to cover the scenario of one contract having multiple winners.\n\nThis probably means we're aiming for three content tables in the end (possibly not with these names):\n\n* `contracts` (need a better name) - info about a specific contract\n* `companies` - info about a company\n* `wins` - the relation table, i.e. a record in this table has a contract_id and a company_id. Also we can add extra info here that applies to a contract–company relation, e.g. the proportion of the total contract fee won by this company.\n\n\n## UPDATE from Callum\n\nI attempted to get the entire database by running `mongoexport` overnight (piping through `pv` so I could see the progress), and this morning it's only at 43%, after running for 12.5 hours. I think it's stuck; it doesn't seem to be moving.\n\nI've cancelled this now in case it's DOSing the database. I could still run the Python decompression script on the dumps I've got and upload straight to S3 (could leave this running while I'm out today), but it might take ages. Let me know if you want me to try that, or something else - `callum.locke` at gmail.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatasets%2Fopented","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdatasets%2Fopented","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdatasets%2Fopented/lists"}