{"id":15018253,"url":"https://github.com/altomator/en-data_mining","last_synced_at":"2025-04-09T19:50:29.322Z","repository":{"id":154891456,"uuid":"48642878","full_name":"altomator/EN-data_mining","owner":"altomator","description":"Data Mining Historical Newspaper Metadata (METS/ALTO formats)","archived":false,"fork":false,"pushed_at":"2022-08-29T14:55:14.000Z","size":61783,"stargazers_count":25,"open_issues_count":2,"forks_count":4,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-03-23T21:45:49.695Z","etag":null,"topics":["alto","alto-xml","basex","data-mining","digital-humanities","digital-libraries","digital-library","metadata","mets-xml","ocr","perl-script","xml"],"latest_commit_sha":null,"homepage":"http://altomator.github.io/EN-data_mining/","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/altomator.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2015-12-27T11:39:37.000Z","updated_at":"2025-02-04T17:56:58.000Z","dependencies_parsed_at":"2024-01-15T04:12:34.869Z","dependency_job_id":null,"html_url":"https://github.com/altomator/EN-data_mining","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/altomator%2FEN-data_mining","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/altomator%2FEN-data_mining/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/altomator%2FEN-data_mining/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/altomator
%2FEN-data_mining/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/altomator","download_url":"https://codeload.github.com/altomator/EN-data_mining/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248103832,"owners_count":21048239,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["alto","alto-xml","basex","data-mining","digital-humanities","digital-libraries","digital-library","metadata","mets-xml","ocr","perl-script","xml"],"created_at":"2024-09-24T19:51:44.274Z","updated_at":"2025-04-09T19:50:29.317Z","avatar_url":"https://github.com/altomator.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"## EN-data_mining\n*Data Mining Historical Newspapers Metadata (Europeana Newspaper Project)*\n\n### Synopsis\nNewspapers from European digital library collections are part of the data set processed with OLR (Optical Layout Recognition) by the Europeana Newspapers project (www.europeana-newspapers.eu). The OLR refinement consists of describing the structure of each issue and its articles (spatial extent, title and subtitle, classification of content types) using the METS/ALTO formats.\n\nA set of bibliographical metadata (date of publication, title) and quantitative metadata related to content and layout (number of pages, articles, words, illustrations, etc.) is derived from each digital document. 
Shell scripts, together with either XSLT stylesheets or a Perl script, are used to extract metadata from the METS manifest or from the ALTO files.\n\n[Detailed presentation](http://altomator.github.io/EN-data_mining/)\n\n---\n \n### Installation\nYou can use an XSLT stylesheet (called with DOS scripts) or a Perl script (faster).\n\nSample documents are stored in the \"DOCS\" folder. The scripts have been designed for the [CCS](https://content-conversion.com/wp-content/uploads/2014/09/CCS-METS-ALTO-Info_basic_20140909.pdf) METS/ALTO profile, but they can easily be adapted to another profile.\n\nThe metadata are generated in a \"STATS\" folder.\n\n#### XSLT\nTwo DOS shell scripts:\n- batch-EN.bat\n- xslt.cmd\n\nTwo XSLT stylesheets:\n- analyseAltosCCS.xsl\n- calculeStatsMETS_CSV.xsl\n\nThe XSLT stylesheets are run with Xalan-Java. The path to the Java binary must be set in xslt.cmd.\n\nFor each document, the metadata are stored in the STATS folder in two formats:\n- XML (raw metadata, with detailed values for each page)\n- CSV (metadata at the issue level)\n\nAn aggregated file (metadata.csv) contains all the CSV metadata.\n\n\n##### Test\n1. Open a DOS terminal.\n2. Change directory to the batch folder.\n3. \u003ebatch-EN.bat \n\n#### Perl script\nFaster and richer (more metadata) than the XSLT scripts.\n\n- One Perl script: extractMD.pl \n- One shell script (Bash): batch.sh (runs the Perl script and packages the result files)\n\nFor each document, metadata are stored in the STATS folder (available formats: XML, JSON, CSV, txt)\n\n\n##### Test\n1. Open a shell terminal (Linux, Mac OS X).\n2. Change directory to the batch folder.\n3. 
\u003eperl extractMD.pl DOCS xml json csv\n\n\n\n### Charts\n\nSee [here](http://altomator.github.io/EN-data_mining/).\n\n(Made with [Highcharts](www.highcharts.com))\n\n![](http://altomator.github.io/EN-data_mining/Charts/Samples/words-JDPL.jpg)\n\n*Journal des débats politiques et littéraires* : [Number of words per page](http://altomator.github.io/EN-data_mining/Charts/Words/timeline-words-JDPL_complete_interactive.htm)  (complete dataset, interactive timeline)\n\n\n\n### Datasets\nThe complete set of derived data contains about 5,500,000 atomic metadata values from six national and regional French newspapers (1814-1945, 880,000 pages, 150,000 issues) from the Gallica (www.gallica.fr) press collections:\n- *Le Matin*\n- *Le Gaulois*\n- *Le Petit journal illustré*\n- *Le Journal des débats politiques et littéraires*\n- *Le Petit Parisien*\n- *Ouest-Eclair*\n\nThe datasets (XML, CSV or JSON formats) are publicly available [here](http://altomator.github.io/EN-data_mining)\n\n### API \nXQuery-based HTTP APIs to query [BaseX](http://basex.org/) XML databases:\n- findIllustratedPages: finds graphical pages (at least one illustration and a low word density)\n- findCaptionedIllustrations: searches the illustration captions (to be used on the \"captions\" dataset)\n\n##### Test\n1. Install BaseX.\n2. Import one (or all) of the datasets into a BaseX database.\n3. Launch the BaseX HTTP server (bin/basexhttp)\n4. Tell BaseX where your XQuery files are: in the .basex config file, edit RESTPATH, e.g. RESTPATH=$home/BaseXWeb\n5. Store your XQuery files (.xq) in the $RESTPATH folder\n6. Set the database name in the XQuery files (last lines of the scripts)\n7. Open a web browser and test the service: http://localhost:8984/rest lists the available databases and http://localhost:8984/rest/database_name gives the content of a database (first connection: ID=admin, passwd=admin)\n8. 
Test the API: http://localhost:8984/rest?run=findCaptionedIllustrations.xq\u0026fromDate=1886-01-01\u0026keyword=statue.*libert%C3%A9\n\n\n## License\nCC0\n\n\u003ca href=\"http://creativecommons.org/publicdomain/zero/1.0/\"\u003e\u003cimg src=\"https://camo.githubusercontent.com/4df6de8c11e31c357bf955b12ab8c55f55c48823/68747470733a2f2f6c6963656e7365627574746f6e732e6e65742f702f7a65726f2f312e302f38387833312e706e67\" alt=\"CC0\" data-canonical-src=\"https://licensebuttons.net/p/zero/1.0/88x31.png\" style=\"max-width:100%;\"\u003e\u003c/a\u003e\n\nThis work has been part-funded through the EU Competitiveness and Innovation Framework Programme grant Europeana Newspapers (Ref. 297380)\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faltomator%2Fen-data_mining","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faltomator%2Fen-data_mining","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faltomator%2Fen-data_mining/lists"}