{"id":26153349,"url":"https://github.com/kata-ai/indosum","last_synced_at":"2025-06-20T07:32:57.672Z","repository":{"id":49828963,"uuid":"148427383","full_name":"kata-ai/indosum","owner":"kata-ai","description":"A benchmark dataset for Indonesian text summarization.","archived":false,"fork":false,"pushed_at":"2019-03-20T02:33:59.000Z","size":104,"stargazers_count":77,"open_issues_count":0,"forks_count":15,"subscribers_count":6,"default_branch":"master","last_synced_at":"2025-04-14T06:46:15.914Z","etag":null,"topics":["indonesian","indonesian-language","natural-language-processing","text-summarization"],"latest_commit_sha":null,"homepage":"https://github.com/kata-ai/indosum","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kata-ai.png","metadata":{"files":{"readme":"README.rst","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2018-09-12T05:41:33.000Z","updated_at":"2024-12-07T05:17:24.000Z","dependencies_parsed_at":"2022-08-25T18:20:12.361Z","dependency_job_id":null,"html_url":"https://github.com/kata-ai/indosum","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/kata-ai/indosum","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kata-ai%2Findosum","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kata-ai%2Findosum/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kata-ai%2Findosum/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kata-ai%2Findosum/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kata-ai","download_url":"https://codeload
.github.com/kata-ai/indosum/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kata-ai%2Findosum/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260901071,"owners_count":23079701,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["indonesian","indonesian-language","natural-language-processing","text-summarization"],"created_at":"2025-03-11T07:56:46.121Z","updated_at":"2025-06-20T07:32:52.661Z","avatar_url":"https://github.com/kata-ai.png","language":"Python","funding_links":[],"categories":["Text Summarization","自然語言處理-印度尼西亞"],"sub_categories":["資料集"],"readme":"Text Summarization\n++++++++++++++++++\n\nThis repository contains the code for our work:\n\nKurniawan, K., \u0026 Louvan, S. (2018). IndoSum: A New Benchmark Dataset for Indonesian Text Summarization. In 2018 International Conference on Asian Language Processing (IALP) (pp. 215–220). Bandung, Indonesia: IEEE. 
https://doi.org/10.1109/IALP.2018.8629109\n\nRequirements\n============\n\nCreate a virtual environment from the ``environment.yml`` file using conda::\n\n    $ conda env create -f environment.yml\n\nTo run experiments with NeuralSum [CL16]_, TensorFlow is also required.\n\nDataset\n=======\n\nGet the dataset from https://drive.google.com/file/d/1OgYbPfXFAv3TbwP1Qcwt_CC9cVWSJaco/view.\n\nPreprocessing for NeuralSum\n---------------------------\n\nFor NeuralSum, the dataset should be further preprocessed using ``prep_oracle_neuralsum.py``::\n\n    $ ./prep_oracle_neuralsum.py -o neuralsum train.01.jsonl\n\nThe command will put the oracle files for NeuralSum under the ``neuralsum`` directory. Invoke the script with ``-h/--help`` to see its other options.\n\nRunning experiments\n===================\n\nThe scripts to run the experiments are named ``run_\u003cmodel\u003e.py``. For instance, to run an experiment using LEAD, the script to use is ``run_lead.py``. All scripts use `Sacred \u003chttps://sacred.readthedocs.io\u003e`_, so you can invoke each with the ``help`` command to see its usage. The experiment configurations are fully documented. 
Run ``./run_\u003cmodel\u003e.py print_config`` to print all the available configurations and their docs.\n\nTraining a model\n----------------\n\nTo train a model, for example the naive Bayes model, run the ``print_config`` command first to see the available configurations::\n\n    $ ./run_bayes.py print_config\n\nThis command produces output similar to the following::\n\n    INFO - summarization-bayes-testrun - Running command 'print_config'\n    INFO - summarization-bayes-testrun - Started\n    Configuration (modified, added, typechanged, doc):\n      cutoff = 0.1                       # proportion of words with highest TF-IDF score to be considered important words\n      idf_path = None                    # path to a pickle file containing the IDF dictionary\n      model_path = 'model'               # where to load or save the trained model\n      seed = 313680915                   # the random seed for this experiment\n      corpus:\n        dev = None                       # path to dev oracle JSONL file\n        encoding = 'utf-8'               # file encoding\n        lower = True                     # whether to lowercase words\n        remove_puncts = True             # whether to remove punctuations\n        replace_digits = True            # whether to replace digits\n        stopwords_path = None            # path to stopwords file, one per each line\n        test = 'test.jsonl'              # path to test oracle JSONL file\n        train = 'train.jsonl'            # path to train oracle JSONL file\n      eval:\n        delete_temps = True              # whether to delete temp files after finishes\n        on = 'test'                      # which corpus set the evaluation should be run on\n        size = 3                         # extract at most this number of sentences as summary\n      summ:\n        path = 'test.jsonl'              # path to the JSONL file to summarize\n        size = 3                         # extract at most this number of sentences as 
summary\n    INFO - summarization-bayes-testrun - Completed after 0:00:00\n\nSo, to train the model on a training corpus in ``/tmp/train.jsonl`` and save the model to ``/tmp/models/bayes.model``, simply run::\n\n    $ ./run_bayes.py train with corpus.train=/tmp/train.jsonl model_path=/tmp/models/bayes.model\n\nEvaluating a model\n------------------\n\nEvaluating an unsupervised model is simple. For example, to evaluate a LEAD-N summarizer::\n\n    $ ./run_lead.py evaluate with corpus.test=/tmp/test.jsonl\n\nThis command will print output like this::\n\n    INFO - run_experiment - Running command 'evaluate'\n    INFO - run_experiment - Started\n    INFO - read_jsonl - Reading test JSONL file from /tmp/test.jsonl\n    INFO - evaluate - References directory: /var/folders/p9/4pp5smf946q9xtdwyx792cn40000gn/T/tmp7jct3ede\n    INFO - evaluate - Hypotheses directory: /var/folders/p9/4pp5smf946q9xtdwyx792cn40000gn/T/tmpnaqoav4o\n    INFO - evaluate - ROUGE scores: {'ROUGE-1-R': 0.71752, 'ROUGE-1-F': 0.63514, 'ROUGE-2-R': 0.62384, 'ROUGE-2-F': 0.5502, 'ROUGE-L-R': 0.70998, 'ROUGE-L-F': 0.62853}\n    INFO - evaluate - Deleting temporary files and directories\n    INFO - run_experiment - Result: 0.63514\n    INFO - run_experiment - Completed after 0:00:11\n\nEvaluating a trained model is done similarly, with the ``model_path`` configuration set to the path of the saved model.\n\nSetting up MongoDB observer\n---------------------------\n\nSacred allows the experiments to be observed and saved to a MongoDB database. The experiment scripts above can readily be used for this: set the two environment variables ``SACRED_MONGO_URL`` and ``SACRED_DB_NAME`` to your MongoDB authentication string and the name of the database to save the experiments into, respectively. Once set, the experiments will be saved to the database. 
Use the ``-u`` flag when invoking the experiment script to disable saving.\n\nReproducing results\n-------------------\n\nAll best configurations obtained from tuning on the development set are saved as Sacred's named configurations. This makes it easy to reproduce our results. For instance, to reproduce our LexRank result on fold 1, simply run::\n\n    $ ./run_lexrank.py evaluate with tuned_on_fold1 corpus.test=test.01.jsonl\n\nSince the best configuration is named ``tuned_on_fold1``, the command above will use that configuration and evaluate the model on the test set. In general, all run scripts have a ``tuned_on_foldX`` named configuration, where ``X`` is the fold number. For ``run_neuralsum.py`` though, there are other named configurations, namely ``emb300_on_foldX`` and ``fasttext_on_foldX``, referring to the scenarios of using a word embedding size of 300 and fastText pretrained embeddings, respectively. Some run scripts do not have such named configurations because their hyperparameters were not tuned or they do not have any.\n\nLicense\n=======\n\nApache License, Version 2.0.\n\nCitation\n========\n\nIf you're using our code or dataset, please cite::\n\n    @inproceedings{kurniawan2018,\n      place={Bandung, Indonesia},\n      title={IndoSum: A New Benchmark Dataset for Indonesian Text Summarization},\n      url={https://ieeexplore.ieee.org/document/8629109},\n      DOI={10.1109/IALP.2018.8629109},\n      booktitle={2018 International Conference on Asian Language Processing (IALP)},\n      publisher={IEEE},\n      author={Kurniawan, Kemal and Louvan, Samuel},\n      year={2018},\n      month={Nov},\n      pages={215-220}\n    }\n\n\n.. [CL16] Cheng, J., \u0026 Lapata, M. (2016). Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (pp. 484–494). Berlin, Germany: Association for Computational Linguistics. 
Retrieved from http://www.aclweb.org/anthology/P16-1046\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkata-ai%2Findosum","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkata-ai%2Findosum","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkata-ai%2Findosum/lists"}