{"id":16370758,"url":"https://github.com/integralist/sainsbury-scraper","last_synced_at":"2026-03-08T22:30:16.282Z","repository":{"id":57563724,"uuid":"47841229","full_name":"Integralist/Sainsbury-Scraper","owner":"Integralist","description":"Golang based console app for scraping contents from Sainsbury web page","archived":false,"fork":false,"pushed_at":"2015-12-21T12:30:19.000Z","size":20,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-12-31T13:43:12.394Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Integralist.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-12-11T17:34:17.000Z","updated_at":"2016-01-01T23:34:11.000Z","dependencies_parsed_at":"2022-09-16T13:30:32.553Z","dependency_job_id":null,"html_url":"https://github.com/Integralist/Sainsbury-Scraper","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Integralist%2FSainsbury-Scraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Integralist%2FSainsbury-Scraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Integralist%2FSainsbury-Scraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Integralist%2FSainsbury-Scraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Integralist","download_url":"https://codeload.github.com/Integralist/Sainsbury-Scraper/tar.g
z/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239906645,"owners_count":19716581,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-11T03:05:53.273Z","updated_at":"2026-03-08T22:30:16.225Z","avatar_url":"https://github.com/Integralist.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"## How to build?\n\n```bash\ngo build -o ss\n```\n\nYou could use [Gox](http://github.com/mitchellh/gox) to more easily build the binary for multiple systems\n\n```bash\ngox -osarch=\"linux/amd64\" -osarch=\"darwin/amd64\" -osarch=\"windows/amd64\" -output=\"ss.{{.OS}}\"\n```\n\n## How to run compiled binary?\n\n```bash\nss\n```\n\n\u003e Note: there was no requirement at this stage to define any flag options\n\n## How to run the tests?\n\n```bash\ngo test -v ./...\n```\n\n## Architecture\n\n![Architecture](https://cloud.githubusercontent.com/assets/180050/11756388/72c1d13a-a051-11e5-860c-7a30bf3e3b49.png)\n\n## Dependencies\n\nThe main dependency is [goquery](https://github.com/PuerkitoBio/goquery/) which abstracts away a lot of the complexity of having to manually parse HTML content\n\nThe other dependency is [codegangsta/cli](https://github.com/codegangsta/cli) which abstracts away a lot of the boilerplate required for creating a console based application\n\n\u003e Note: I'm a big fan of Dave Cheney's [gb](https://getgb.io/) for managing vendored dependencies. Although the BBC prefers to use [Godep](https://godoc.org/github.com/tools/godep). 
I opted for neither as there were only two dependencies, and so it felt a little overkill for this small project. Once Go 1.6 is released hopefully we'll see an official/native implementation for vendored dependencies\n\n## Development\n\n1. Read-Me Driven Development\n2. Create CLI structure\n3. Define entry command\n4. Define 'retriever' package\n5. Define 'scraper' package\n\n### Retriever\n\nThe retriever should be handed a URL and return a Slice of sub page resource URLs, like so:\n\n```json\n[\n  \"http://hiring-tests.s3-website-eu-west-1.amazonaws.com/2015_Developer_Scrape/sainsburys-apricot-ripe---ready-320g.html\",\n  \"http://hiring-tests.s3-website-eu-west-1.amazonaws.com/2015_Developer_Scrape/sainsburys-avocado-xl-pinkerton-loose-300g.html\",\n  \"http://hiring-tests.s3-website-eu-west-1.amazonaws.com/2015_Developer_Scrape/sainsburys-avocado--ripe---ready-x2.html\",\n  \"http://hiring-tests.s3-website-eu-west-1.amazonaws.com/2015_Developer_Scrape/sainsburys-avocados--ripe---ready-x4.html\",\n  \"http://hiring-tests.s3-website-eu-west-1.amazonaws.com/2015_Developer_Scrape/sainsburys-conference-pears--ripe---ready-x4-%28minimum%29.html\",\n  \"http://hiring-tests.s3-website-eu-west-1.amazonaws.com/2015_Developer_Scrape/sainsburys-golden-kiwi--taste-the-difference-x4-685641-p-44.html\",\n  \"http://hiring-tests.s3-website-eu-west-1.amazonaws.com/2015_Developer_Scrape/sainsburys-kiwi-fruit--ripe---ready-x4.html\"\n]\n```\n\n\u003e Note: I use `.productInfo a` as my filter\n\n### Scraper\n\nThe scraper should be passed an Array of URLs (see above for example) so it can concurrently request each resource and parse it for the relevant information:\n\n- Resource size\n- Product title\n- Product unit size\n- Product description\n- Product size\n\nThe scraper should return a Struct with a field of `Items` which is assigned an Array of collated details and a field of `Total` which details the total cost. 
Once it's converted to JSON it'll look something like:\n\n\n```json\n{\n    \"results\": [\n        {\n            \"title\": \"Sainsbury's Apricot Ripe \\u0026 Ready x5\",\n            \"size\": \"39kb\",\n            \"unitPrice\": \"3.50\",\n            \"description\": \"Apricots\"\n        },\n        {\n            \"title\": \"Sainsbury's Avocado Ripe \\u0026 Ready XL Loose 300g\",\n            \"size\": \"39kb\",\n            \"unitPrice\": \"1.50\",\n            \"description\": \"Avocados\"\n        },\n        {\n            \"title\": \"Sainsbury's Avocado, Ripe \\u0026 Ready x2\",\n            \"size\": \"44kb\",\n            \"unitPrice\": \"1.80\",\n            \"description\": \"Avocados\"\n        },\n        {\n            \"title\": \"Sainsbury's Avocados, Ripe \\u0026 Ready x4\",\n            \"size\": \"39kb\",\n            \"unitPrice\": \"3.20\",\n            \"description\": \"Avocados\"\n        },\n        {\n            \"title\": \"Sainsbury's Conference Pears, Ripe \\u0026 Ready x4 (minimum)\",\n            \"size\": \"39kb\",\n            \"unitPrice\": \"1.50\",\n            \"description\": \"Conference\"\n        },\n        {\n            \"title\": \"Sainsbury's Golden Kiwi x4\",\n            \"size\": \"39kb\",\n            \"unitPrice\": \"1.80\",\n            \"description\": \"Gold Kiwi\"\n        },\n        {\n            \"title\": \"Sainsbury's Kiwi Fruit, Ripe \\u0026 Ready x4\",\n            \"size\": \"39kb\",\n            \"unitPrice\": \"1.80\",\n            \"description\": \"Kiwi\"\n        }\n    ],\n    \"total\": \"15.10\"\n}\n```\n\n\u003e Note:\n\u003e Changed `unit_price` to `unitPrice` as it's more idiomatic of JavaScript/JSON  \n\u003e snake_case is more a Ruby convention\n\nIf the code needs to be made more *reusable*, then we could also look to inject the 'filters' rather than hardcode them. 
This would allow the package to be reused on different page types.\n\n\u003e Note:\n\u003e I use a multitude of filters such as `h1`, `.pricePerUnit`, `productText` and `productDataItemHeader`.\n\n## Commit History\n\nFor the purposes of this quick test project I was committing straight to master (which in the real world is a big no-no). At the BBC we have a specific git workflow for how we merge our PRs. Effectively we squash/rebase before cherry picking, while referencing issues/PRs allows us to close them dynamically upon push to master. [I've documented the workflow here](http://www.integralist.co.uk/posts/github-workflow.html)\n\n## Additional comments\n\n- After submitting this code I took a look back over it a few days later and realised that I had made a mistake in the concurrency implementation, in that I had used a WaitGroup even though pulling from the channel was already blocking. I fixed this by removing the WaitGroup, but then put it *back in* again when I realised what my original intention was: to use goroutines in order to speed up the 'getting' of an item as quickly as possible and THEN range over the closed channel in order to pull out all the values (rather than block each iteration with a channel pull). I timed the results and it seems that the latter implementation is significantly faster (worst case being: `2.337` vs `0.388`), but that is traded off with CPU which doubles in the latter implementation (a trade-off I'm happy to make considering how cheap CPU is in a scalable IaaS/AWS world)\n- In order to fulfil the requirement for displaying the size of the HTML page being linked to, I utilised a solution that resembled the decorator pattern (in spirit) in that it mimicked an internal function from the goquery dependency but enhanced it with the missing functionality. This allowed me to incorporate the additional requirement of calculating the response body size while utilising a similar interface. 
But doing so introduced an issue whereby the response body needed to be read twice (once by my implementation and once again when delegating off to the goquery dependency). I mention this because I prefer to avoid code comments in favour of self-explanatory code, but in this instance I felt a code comment was justified in providing some extra clarity\n- I wrote tests for the Scraper package and tried (with what time I had left) to write a test for the Retriever package, but ran out of time. I've added some basic tests to the Retriever package in my spare time as I wanted to ensure I was able to verify the code did what was expected of it.\n- As part of the Retriever test I inlined a chunk of HTML rather than sticking it inside a fixture file. This was done to avoid possible performance concerns with I/O interaction within the tests (although any concerns I had were negligible)\n- I ended up spending a bit too much time trying to produce the price in the JSON object response as a float rather than a string. The issue I was having was with regard to floats rounding off the last zero (e.g. converting the string into a float would result in something like `15.10` being translated into `15.1`), which I felt was misleading output, and so after trying quite a few workarounds, I had to settle on implementing it as a string type instead\n- Spent a bit of time investigating the Unicode code points being placed into the JSON output instead of the actual rune character being rendered (e.g. the Struct would show `\u0026` but when marshaled into JSON it would be transformed into the code point `\\u0026`). It seems that this is expected behaviour according to the Go documentation. 
If you paste the JSON output into a browser console then you'll find the code point is translated back to the actual rune character\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fintegralist%2Fsainsbury-scraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fintegralist%2Fsainsbury-scraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fintegralist%2Fsainsbury-scraper/lists"}