{"id":21496587,"url":"https://github.com/mlibrary/text-assembler-django","last_synced_at":"2025-03-17T12:16:12.814Z","repository":{"id":41924485,"uuid":"221279879","full_name":"mlibrary/text-assembler-django","owner":"mlibrary","description":"LexisNexis API client; forked from https://gitlab.msu.edu/msu-libraries/public/text-assembler/","archived":false,"fork":false,"pushed_at":"2022-12-08T05:24:00.000Z","size":381,"stargazers_count":1,"open_issues_count":6,"forks_count":0,"subscribers_count":6,"default_branch":"cosign","last_synced_at":"2025-01-23T21:53:30.674Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mlibrary.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-11-12T18:01:42.000Z","updated_at":"2021-11-29T16:43:50.000Z","dependencies_parsed_at":"2023-01-24T12:15:46.065Z","dependency_job_id":null,"html_url":"https://github.com/mlibrary/text-assembler-django","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibrary%2Ftext-assembler-django","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibrary%2Ftext-assembler-django/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibrary%2Ftext-assembler-django/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mlibrary%2Ftext-assembler-django/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mlibrary","download_url":"https://codeload.github.com/mlibrar
y/text-assembler-django/tar.gz/refs/heads/cosign","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244031154,"owners_count":20386534,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-23T16:17:31.796Z","updated_at":"2025-03-17T12:16:12.793Z","avatar_url":"https://github.com/mlibrary.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"Text Assembler\n=============\nThis is a web-based application that makes use of the Lexis Nexis API for searching and downloading from their \ndata set.\n\nContents\n--------\n* [Assumptions](#assumptions)\n* [Install \u0026 Setup](#install-setup)\n* [Applying Updates](#applying-updates)\n* [WSK to API Transition](#wsk-to-api-transition)\n* [Technical Overview](#technical-overview)\n\nAssumptions\n-----------\n* This application uses OAuth2 for authentication. Some form of authentication is required to obtain a unique userid. 
\nThis project does not include documentation for setting up an OAuth provider, just the code for using it as an \nOAuth client.\n* This application was built on Ubuntu 18.04 and has not been tested on other versions or distributions.\n* This application was built using Python 3.6 and has not been tested on other versions.\n* This application was built using MariaDB 10.2 and has not been tested on other versions or another DBMS.\n* The [development](https://gitlab.msu.edu/msu-libraries/public/text-assembler/tree/development) and [deploy](https://gitlab.msu.edu/msu-libraries/public/text-assembler/tree/deploy) branches of code are configured to perform [CI/CD](https://gitlab.msu.edu/msu-libraries/public/text-assembler/blob/master/.gitlab-ci.yml) steps which validate the code and deploy it. Others using this code will want to use the [master](https://gitlab.msu.edu/msu-libraries/public/text-assembler/tree/master) branch, which will skip these steps. For more information on setting up CI/CD in your environment, see [this documentation](https://gitlab.msu.edu/msu-libraries/public/gitlab-ci-cd-guide).\n\nInstall \u0026 Setup\n---------------\nThese instructions have been automated as [an Ansible playbook](https://github.com/mlibrary/text-assembler-playbook).\n\n* Install the base software\n```\nsudo aptitude install python3\nsudo apt install python3-pip\nsudo apt-get install libmysqlclient-dev\nsudo apt-get install python3-venv\nsudo pip3 install Django\nsudo aptitude install libapache2-mod-wsgi-py3\n```\n\n* Check out the code\n```\nmkdir /var/www/text-assembler\ncd /var/www/text-assembler\ngit clone git@gitlab.msu.edu:msu-libraries/public/text-assembler.git .\n```\n\n* Install dependencies\n```\ncd /var/www/text-assembler\npython3.6 -m venv ta_env\nsource ta_env/bin/activate\npip install -r requirements.txt\ndeactivate\n```\n* Update environment variables in `/etc/apache2/envvars`\n```\nexport LANG='en_US.UTF-8'\nexport LC_ALL='en_US.UTF-8'\n```\n* Configure the application:  \nCopy all of the 
`.example` files (find . -name '*.example') and make necessary changes.   \nThis will also involve creating a database and an application user, which will be parameters in the main config file.\n\n* Run the database setup\n```\ncd /var/www/text-assembler\nta_env/bin/python manage.py migrate\n```\n\n* Configure Apache:   \nCreate a new Apache configuration file using the below as an example\n```\n\u003cVirtualHost *:80\u003e\n        ServerName textassembler.lib.msu.edu\n        Redirect \"/\" \"https://textassembler.lib.msu.edu/\"\n\u003c/VirtualHost\u003e\n\u003cVirtualHost *:443\u003e\n        ServerName textassembler.lib.msu.edu\n\n        DocumentRoot /var/www/text-assembler\n\n        SSLEngine on\n        SSLCertificateFile /etc/ssl/private/textassembler.crt\n        SSLCertificateKeyFile /etc/ssl/private/textassembler.key\n\n        WSGIDaemonProcess textassembler.lib.msu.edu processes=2 threads=15 display-name=textassembler python-home=/var/www/text-assembler/ta_env\n        WSGIProcessGroup textassembler.lib.msu.edu\n        WSGIScriptAlias / /var/www/text-assembler/textassembler/wsgi.py\n\n        \u003cDirectory /var/www/text-assembler/textassembler\u003e\n                SetHandler wsgi-script\n                DirectoryIndex wsgi.py\n                Options +ExecCGI\n                Require all granted\n        \u003c/Directory\u003e\n\n        Alias /static/ /var/www/text-assembler/textassembler_web/static/\n        \u003cDirectory /var/www/text-assembler/textassembler_web/static\u003e\n                Options -Indexes\n                Require all granted\n        \u003c/Directory\u003e\n\n        ErrorLog ${APACHE_LOG_DIR}/textassembler-error.log\n        CustomLog ${APACHE_LOG_DIR}/textassembler-access.log combined\n\u003c/VirtualHost\u003e\n```\n\n* Set up the services:  \nInstalling the Text Assembler service to process the queue, zip compression handler, and deletion handler.\n```\ncp etc/init.d/* /etc/init.d/\nsudo chmod +x 
/etc/init.d/tassembler*\nsystemctl daemon-reload\nsystemctl enable tassemblerd\nsystemctl start tassemblerd\nsystemctl enable tassemblerzipd\nsystemctl start tassemblerzipd\nsystemctl enable tassemblerdeld\nsystemctl start tassemblerdeld\n```\n\n* Set up cron job to update Lexis Nexis sources and API limits on a regular basis (`/etc/crontab`)\n```\n@monthly    root        /var/www/text-assembler/ta_env/bin/python /var/www/text-assembler/manage.py update_sources\n```\n\n* Create an initial admin user to use the admin interface\n```\nmysql -h [DB_HOST] -p [DB_NAME] -e \"INSERT INTO textassembler_web_administrative_users (userid) VALUES ('[USERID]');\"\n```\n\n* Run the limits update to initialize the values in the database (unless you want to override them in the configs)\n```\n/var/www/text-assembler/ta_env/bin/python /var/www/text-assembler/manage.py update_limits\n```\n\nApplying Updates\n----------------\nAs improvements are made to the application, they will be pushed to the primary branch of this Git repository.\nIn order to apply changes made, here are the steps you will need to follow:\n\n* Pull the latest code from GitLab\n* If any of the .example files changed, compare them with your files to determine if there are changes you need to add.\nFor example, if the `textassembler.cfg.example` file changes, you will want to compare them to add/remove/change the \nfields indicated so it is up-to-date.\n* Run the database migrations to check for any changes: `/var/www/text-assembler/ta_env/bin/python /var/www/text-assembler/manage.py migrate`.\n* Restart Apache and the Text Assembler daemons: \n```\nsystemctl restart apache2\nsystemctl restart tassemblerd\nsystemctl restart tassemblerzipd\nsystemctl restart tassemblerdeld\n```\n\n* Create an initial admin user to use the admin interface, if you have not already done so\n```\nmysql -h [DB_HOST] -p [DB_NAME] -e \"INSERT INTO textassembler_web_administrative_users (userid) VALUES ('[USERID]');\"\n```\n\n* If you have not 
already, run the limits update to initialize the values in the database (unless you want to override them in the configs)\n```\n/var/www/text-assembler/ta_env/bin/python /var/www/text-assembler/manage.py update_limits\n```\n\n* If you have not already, remove the cron rule to update API limits\n```\n# REMOVE\n@monthly    root        /var/www/text-assembler/ta_env/bin/python /var/www/text-assembler/manage.py update_limits\n```\n\nWSK to API Transition\n--------------------\nThis section includes steps for moving over in-progress searches running on a site \nusing the existing Lexis Nexis WSK. If this does not apply to you, please skip \nover this section.\n\n* In the existing system, mark the searches as completed so they stop processing.\n* Add a README.txt to the top-level directory of each of the searches stating \nsomething along these lines:\n```\nProcessing for this search has been terminated due to the migration away from the LexisNexis WSK system \non to the new API system. Due to differences in the interaction with this system, it was not possible \nfor the search to simply be restarted on the new system. As a result, these are the downloads that \nhave already been obtained from the old WSK system. 
If you require further information for your \nresearch, please restart a new search on the new site (https://textassembler.lib.msu.edu).\n\nFor any technical questions, contact Megan Schanz (schanzme@msu.edu).\n\nCompleted downloading results between: [Start Date Range - Date completed processing up to]\n```\n* Make sure the Text Assembler processor is not running\n```\nsystemctl stop tassemblerd\n```\n* Create a new record in the Text Assembler database for the search, providing as many of the filters as you can\n```\nINSERT INTO textassembler_web_searches\n() \nVALUES ();\nINSERT INTO textassembler_web_searches \n(userid, date_submitted, update_date, query, date_completed, \nnum_results_downloaded, num_results_in_search, skip_value, \ndate_started_compression, date_completed_compression, \nuser_notified, run_time_seconds, date_started, retry_count) \nVALUES\n('userid','2019-02-20', NOW(), 'search query', NOW(), \n33840, 33840, 0,\nNOW(), NOW(), \n1, 0, '2019-02-20',0);\n\n\n# Using the search_id created from the previous insert to fill in \n# the following queries\nINSERT INTO textassembler_web_filters\n(search_id_id, filter_name, filter_value) \nVALUES\n(1, 'Date', 'gt 2019-06-10');\n\nINSERT INTO textassembler_web_download_formats\nSELECT format_id_id, 1\nFROM textassembler_web_available_formats\nWHERE format_name = 'HTML';\n```\n\n* Compress any incomplete searches in the old system\n* Move the searches to the new location with the new naming convention: [STORAGE_LOCATION]/[search_id]/[search_id].zip\n* The Text Assembler processor can be restarted again\n\nTechnical Overview\n------------------\nThis section breaks down what Text Assembler does behind the scenes.\n\n### Web Application ([code](textassembler_web/views.py))\nThis is the interface that users interact with. It allows them to preview searches to refine them, queue them for full \ndownload, and then download their full text results once processing has completed.\n\nThe 
on-demand searches on the Search page are limited by the number of searches we are allowed to do per minute/hour/day \nwithin the Lexis Nexis API (the exact numbers depend on your license agreement). If the PREVIEW_FORMAT is set to\nFULL_TEXT, then it will also apply the minute/hour/day download limits (but not the time window limitations). When it \ndoes this, it will actually make 2 API calls because the first will be a regular search which will return the post filters \nfor further user refinement and the second call will be a download call to get the full text results.\n\nThe My Searches page shows the searches that users have saved. They are given the option to delete searches (which works \non in-progress searches to cancel them) and to download them once complete. Search results should be stored on a shared \ndrive with a mount point on the server as these can take a considerable amount of space until they are completed and compressed.\n\nEstimates are given on the Search and My Searches pages to give users an idea of how long searches will potentially take. \nThis is calculated based on the limitations we have on the API, given the assumption that we could always download faster than \nthe limitations we have. So if we are limited to 1,000 downloads a day and there are 3 other searches in the queue, \neach with 5,000 results, it would take a new search with 5,000 results 20 days to complete (5000 * 4 / 1000).\n\nWhen searches are deleted, it will delete the record from the database as well as removing the files for it on the \nserver. It will create a historical record in the `historical_searches` table of the database. This is used \nonly for reporting purposes (e.g. to get the number of searches run over the year, or the number of documents downloaded). \n\n### Queue Processor (tassemblerd, [code](textassembler_processor/management/commands/process_queue.py))\nThis is the daemon process that does the bulk of the work. 
It will continually run on the server checking if there are \nsearches in the queue that need results downloaded for them still, and if there are, it will verify that we have available\ndownloads remaining with the Lexis Nexis API (by counting the number of calls we've made in our log table of all our API calls). \n\nAssuming we're able to download, it will retrieve the next 10 results for the search, save them to the server, and update the \nsearch position in the database (using the skip field). If the search is complete (based on the number of results field \nreturned from the API), it will mark it as download completed so that the compression processor will pick it up to \nzip the results. \n\nIf downloads were not available, it will wait for a period of time and then re-check (using database \ncalls). \n\nIf there are no items in the queue, it will just wait until there are.\n\n\n### Compression Processor (tassemblerzipd, [code](textassembler_processor/management/commands/compress_searches.py))\nThis is the daemon process that will continually check for searches that have had all of their results already downloaded \nand are just waiting for their files to be compressed for the user. \n\nIt will loop continually checking for items to compress. When it finds one, it will compress the files into a single zip, \nand then it will remove the original un-compressed files. If configured, it will email the user who initiated the \nsearch to notify them that it has completed.\n\n\n### Deletion Processor (tassemblerdeld, [code](textassembler_processor/management/commands/delete_searches.py))\nThis is the daemon process that checks for searches that are old enough to be deleted. It bases this off of the date the \ncompression was completed or the date the search failed (if it is a failed search). The number of months it waits 
The number of months it waits \nbefore deleting items is set to 3 in the config file by default.\n\nIt will delete the files from the server and delete the search record from the database.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlibrary%2Ftext-assembler-django","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmlibrary%2Ftext-assembler-django","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmlibrary%2Ftext-assembler-django/lists"}