{"id":16325491,"url":"https://github.com/tynes/document-enricher","last_synced_at":"2025-11-01T01:30:23.642Z","repository":{"id":104848465,"uuid":"71019520","full_name":"tynes/document-enricher","owner":"tynes","description":"Enrich documents with AlchemyAPI, store in elasticsearch then query for insights","archived":false,"fork":false,"pushed_at":"2016-10-17T22:23:09.000Z","size":3281,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-12-26T01:26:49.995Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tynes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-10-15T23:29:25.000Z","updated_at":"2016-10-15T23:29:54.000Z","dependencies_parsed_at":"2023-07-04T22:50:23.288Z","dependency_job_id":null,"html_url":"https://github.com/tynes/document-enricher","commit_stats":{"total_commits":21,"total_committers":1,"mean_commits":21.0,"dds":0.0,"last_synced_commit":"85643c703914e93fb05ca5f9eca06b36a19d836f"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tynes%2Fdocument-enricher","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tynes%2Fdocument-enricher/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tynes%2Fdocument-enricher/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tynes%2Fdo
cument-enricher/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tynes","download_url":"https://codeload.github.com/tynes/document-enricher/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239242801,"owners_count":19606100,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-10T23:05:08.333Z","updated_at":"2025-11-01T01:30:23.520Z","avatar_url":"https://github.com/tynes.png","language":"JavaScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Document Enricher\nNote: Requires Node v6+\n\n## Overview\nThere are a series of npm scripts that will set up the pipeline.\nThe first script turns the text files found in ```data/raw/text_data``` into JSON files.\nThe second script will send the JSON files to AlchemyAPI for enrichment.\nThe third script will populate elasticsearch with the enriched data.\nThe fourth script can be used to query elasticsearch.\n\n## Environmental Variables\nCreate a .env file in the root of the project that looks like this:\n```\nBLUEMIX_API_KEY = 'FILL_ME_IN'\n```\nYou can get the Bluemix API Key by creating an account \n[here](http://www.ibm.com/cloud-computing/bluemix/).\n\n## Installation\nStart by cloning the repository and then install the dependencies with:  \n```\n$ npm i\n```\nThe formatted data is in the ```.gitignore```, so you must build it yourself.\nTo build the pre-enriched data, run the command:  \n```\n$ npm run build:pre-enriched\n```\nThere will now be a file ```data/formatted/pre_enriched.json```.\nThe 
data key contains an array of JSON objects, one for each text file\nin ```data/raw/text_data```. There are over 2000 of them.  \n\nYou can only have 2000 API calls for free per day, so I recommend building\nonly a sample of data. To do so, run the script:  \n```\n$ npm run build:pre-enriched-dev\n```\nThis will build a file called ```data/formatted/pre_enriched_sample.json```.\nIt will have 250 documents in it. In the future, it could be possible to have\n__n__ documents using [npm script arguments](https://docs.npmjs.com/cli/run-script).\n\nNext, the JSON objects must be enriched.  \nRun the script:  \n```\n$ npm run build:enriched-dev\n```\nThis will enrich the documents from ```data/formatted/pre_enriched_sample.json```\nand write them into a file called ```data/formatted/enriched_sample.json```.\nTo enrich all of the documents, you can edit the ```package.json``` file and change the\nscript ```build:enriched``` to ```node scripts/enricher/index.js```.\nRunning a combined call on that many documents is a lot of API calls...  \n\nSo now it is time to put the data into elasticsearch. Make sure to have elasticsearch\ninstalled and running. The default URI is ```localhost:9200```. It is currently\nhard coded into ```scripts/db/connection.js``` but it could also be placed in the\n```.env``` file as an environment variable.  \n\nTo bulk add data to elasticsearch, run the script:  \n```\n$ npm run build:db\n```\n\nThis will read the file ```data/formatted/enriched_sample.json``` and bulk add it to\nelasticsearch.\n\nTo query elasticsearch, there is a query script.\nIt currently only supports simple searching through\nthe types of entities in an article. More functionality is planned\nfor the future. 
To search, run the script:  \n```\n$ npm run query -- QUERY_HERE\n```\n\nIt is important to place the query at the end of the script.\nThe script will print the docs that match.\n\nIt would also be possible to write the results of the query to a file.\n\n## Technologies\n- Node.js\n- AlchemyAPI combined call\n- Elasticsearch\n\n## Queries\nExample queries:\n- What are the top entities of type \"Person\" mentioned in the corpus?\nRun the command:\n```$ npm run query -- Person```  \nThis will print the full documents that are about people.\n\n## Dataset\nhttp://mlg.ucd.ie/datasets/bbc.html\n\n## Testing\n```\n$ npm test\n```\n\n## Issues\n- Sometimes a query will return too many documents","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftynes%2Fdocument-enricher","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftynes%2Fdocument-enricher","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftynes%2Fdocument-enricher/lists"}