{"id":15571402,"url":"https://github.com/hatamiarash7/ir-system","last_synced_at":"2025-03-29T06:29:00.676Z","repository":{"id":73239716,"uuid":"117328904","full_name":"hatamiarash7/IR-System","owner":"hatamiarash7","description":"IR System for Reuters DB","archived":false,"fork":false,"pushed_at":"2023-12-15T20:30:08.000Z","size":8905,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-03T18:57:54.228Z","etag":null,"topics":["data-analysis","data-mining","ir","python"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hatamiarash7.png","metadata":{"files":{"readme":"README.txt","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-01-13T08:54:45.000Z","updated_at":"2018-01-13T09:00:19.000Z","dependencies_parsed_at":"2024-10-02T18:00:25.816Z","dependency_job_id":"3e4cac97-c420-4f6a-8dad-800a66d1eef0","html_url":"https://github.com/hatamiarash7/IR-System","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hatamiarash7%2FIR-System","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hatamiarash7%2FIR-System/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hatamiarash7%2FIR-System/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hatamiarash7%2FIR-System/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hatamiarash7"
,"download_url":"https://codeload.github.com/hatamiarash7/IR-System/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246149443,"owners_count":20731356,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-mining","ir","python"],"created_at":"2024-10-02T18:00:17.862Z","updated_at":"2025-03-29T06:29:00.630Z","avatar_url":"https://github.com/hatamiarash7.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n          Reuters-21578 text categorization test collection\n                        Distribution 1.0\n                       README file (v 1.2)\n                        26 September 1997\n\n                         David D. Lewis\n                      AT\u0026T Labs - Research     \n                     lewis@research.att.com\n\nI. Introduction\n\n   This README describes Distribution 1.0 of the Reuters-21578 text\ncategorization test collection, a resource for research in information\nretrieval, machine learning, and other corpus-based research.\n\n\nII. Copyright \u0026 Notification \n\n   The copyright for the text of newswire articles and Reuters\nannotations in the Reuters-21578 collection resides with Reuters Ltd.\nReuters Ltd. and Carnegie Group, Inc. have agreed to allow the free\ndistribution of this data *for research purposes only*.  
\n   If you publish results based on this data set, please acknowledge\nits use, refer to the data set by the name \"Reuters-21578,\nDistribution 1.0\", and inform your readers of the current location of\nthe data set (see \"Availability \u0026 Questions\").\n\n\nIII. Availability \u0026 Questions\n\n   The Reuters-21578, Distribution 1.0 test collection is available\nfrom David D. Lewis' professional home page, currently:\n             http://www.research.att.com/~lewis\n\nBesides this README file, the collection consists of 22 data files, an\nSGML DTD file describing the data file format, and six files\ndescribing the categories used to index the data.  (See Sections VI\nand VII for more details.)  Some additional files, which are not part\nof the collection but have been contributed by other researchers as\nuseful resources are also included.  All files are available\nuncompressed, and in addition a single gzipped Unix tar archive of the\nentire distribution is available as reuters21578.tar.gz.\n\n   The text categorization mailing list, DDLBETA, is a good place to\nsend questions about this collection and other text categorization\nissues. You may join the list by writing David Lewis at\nlewis@research.att.com.\n\n\nIV. History \u0026 Acknowledgements\n\n   The documents in the Reuters-21578 collection appeared on the\nReuters newswire in 1987.  The documents were assembled and indexed\nwith categories by personnel from Reuters Ltd. (Sam Dobbins, Mike\nTopliss, Steve Weinstein) and Carnegie Group, Inc. (Peggy Andersen,\nMonica Cellio, Phil Hayes, Laura Knecht, Irene Nirenburg) in 1987.  \n\nIn 1990, the documents were made available by Reuters and CGI for\nresearch purposes to the Information Retrieval Laboratory (W.  Bruce\nCroft, Director) of the Computer and Information Science Department at\nthe University of Massachusetts at Amherst.  Formatting of the\ndocuments and production of associated data files was done in 1990 by\nDavid D.  
Lewis and Stephen Harding at the Information Retrieval\nLaboratory.\n\nFurther formatting and data file production was done in 1991 and 1992\nby David D. Lewis and Peter Shoemaker at the Center for Information\nand Language Studies, University of Chicago.  This version of the data\nwas made available for anonymous FTP as \"Reuters-22173, Distribution\n1.0\" in January 1993. From 1993 through 1996, Distribution 1.0 was\nhosted at a succession of FTP sites maintained by the Center for\nIntelligent Information Retrieval (W. Bruce Croft, Director) of the\nComputer Science Department at the University of Massachusetts at\nAmherst.\n\nAt the ACM SIGIR '96 conference in August, 1996 a group of text\ncategorization researchers discussed how published results on\nReuters-22173 could be made more comparable across studies.  It was\ndecided that a new version of the collection should be produced with less\nambiguous formatting, and including documentation carefully spelling\nout standard methods of using the collection.  The opportunity would\nalso be used to correct a variety of typographical and other errors in\nthe categorization and formatting of the collection.\n\nSteve Finch and David D. Lewis did this cleanup of the collection\nSeptember through November of 1996, relying heavily on Finch's\nSGML-tagged version of the collection from an earlier study.  One\nresult of the re-examination of the collection was the removal of 595\ndocuments which were exact duplicates (based on identity of timestamps\ndown to the second) of other documents in the collection. The new\ncollection therefore has only 21,578 documents, and thus is called the\nReuters-21578 collection.  
This README describes version 1.0 of this\nnew collection, which we refer to as \"Reuters-21578, Distribution\n1.0\".\n\nIn preparing the collection and documentation we have benefited from\ndiscussions with Eric Brown, William Cohen, Fred Damerau, Yoram\nSinger, Amit Singhal, and Yiming Yang, among many others.\n\nWe thank all the people and organizations listed above for their\nefforts and support, without which this collection would not exist.\n\nA variety of other changes were also made in going from Reuters-22173\nto Reuters-21578:\n\n   1. Documents were marked up with SGML tags, and a corresponding\nSGML DTD was produced, so that the boundaries of important sections of\ndocuments (e.g. category fields) are unambiguous.\n   2. The set of categories that are legal for each of the five\ncontrolled vocabulary fields was specified. All category names not\nlegal for a field were corrected to a legal category, moved to their\nappropriate field, or removed, as appropriate.\n   3. Documents were given new ID numbers, in chronological order, and\nare collected 1000 to a file in order by ID (and therefore in order\nchronologically). \n\n\nV. What is a Text Categorization Test Collection and Who Cares? \n\n   *Text categorization* is the task of deciding whether a piece of\ntext belongs to any of a set of prespecified categories.  
It is a\ngeneric text processing task useful in indexing documents for later\nretrieval, as a stage in natural language processing systems, for\ncontent analysis, and in many other roles [LEWIS94d].\n\n   The use of standard, widely distributed test collections has been a\nconsiderable aid in the development of algorithms for the related task\nof *text retrieval* (finding documents that satisfy a particular\nuser's information need, usually expressed in a textual request).\nText retrieval test collections have allowed the comparison of\nalgorithms developed by a variety of researchers around the world.\n(For more on text retrieval test collections see SPARCKJONES76.)\n\n   Standard test collections have been lacking, however, for text\ncategorization. Few data sets have been used by more than one\nresearcher, making results hard to compare.  The Reuters-22173 test\ncollection has been used in a number of published studies since it was\nmade available, and we believe that the Reuters-21578 collection will\nbe even more valuable.\n\n   The collection may also be of interest to researchers in machine\nlearning, as it provides a classification task with challenging\nproperties. There are multiple categories, the categories are\noverlapping and nonexhaustive, and there are relationships among the\ncategories.  There are interesting possibilities for the use of domain\nknowledge.  There are many possible feature sets that can be extracted\nfrom the text, and most plausible feature/example matrices are large\nand sparse.  There is even some temporal structure to the data\n[LEWIS94b], though problems with the indexing and the uneven\ndistribution of stories within the timespan covered may make this\ncollection a poor one to explore temporal issues.\n\n\nVI. Formatting \n\n     The Reuters-21578 collection is distributed in 22 files. 
Each of\nthe first 21 files (reut2-000.sgm through reut2-020.sgm) contains 1000\ndocuments, while the last (reut2-021.sgm) contains 578 documents.  \n\n     The files are in SGML format.  Rather than going into the details\nof the SGML language, we describe here in an informal way how the SGML\ntags are used to divide each file, and each document, into sections.\nReaders interested in more detail on SGML are encouraged to pursue\none of the many books and web pages on the subject.\n\n     Each of the 22 files begins with a document type declaration line:\n               \u003c!DOCTYPE lewis SYSTEM \"lewis.dtd\"\u003e\n\nThe DTD file lewis.dtd is included in the distribution.  Following the\ndocument type declaration line are individual Reuters articles marked\nup with SGML tags, as described below.\n\n\n   VI.A. The REUTERS tag:\n\n    Each article starts with an \"open tag\" of the form\n\n    \u003cREUTERS TOPICS=?? LEWISSPLIT=?? CGISPLIT=?? OLDID=?? NEWID=??\u003e\n\nwhere the ?? are filled in an appropriate fashion.  Each article ends\nwith a \"close tag\" of the form:\n\n     \u003c/REUTERS\u003e\n\nIn all cases the \u003cREUTERS\u003e and \u003c/REUTERS\u003e tags are the only items\non their line.  \n\n     Each REUTERS tag contains explicit specifications of the values\nof five attributes, TOPICS, LEWISSPLIT, CGISPLIT, OLDID, and NEWID.\nThese attributes are meant to identify documents and groups of \ndocuments, and have the following meanings: \n\n     1. TOPICS : The possible values are YES, NO, and BYPASS:\n        a. YES indicates that *in the original data* there was at\nleast one entry in the TOPICS field.\n        b. NO indicates that *in the original data* the story had no\nentries in the TOPICS field.\n        c. BYPASS indicates that *in the original data* the story was\nmarked with the string \"bypass\" (or a typographical variant on that\nstring).\n     This poorly-named attribute unfortunately is the subject of much\nconfusion. 
It is meant to indicate whether or not the document had\nTOPICS categories *in the raw Reuters-22173 dataset*.  The sole use of\nthis attribute is to define training set splits similar to those\nused in previous research. (See the section on training set splits.)\nThe TOPICS attribute does **NOT** indicate anything about whether or\nnot the Reuters-21578 document has any TOPICS categories.  (Version\n1.0 of this document was in error on this point.)  That can be\ndetermined by actually looking at the TOPICS field. A story with\nTOPICS=\"YES\" can have no TOPICS categories, and a story with\nTOPICS=\"NO\" can have TOPICS categories.\n     Now, a reasonable (though not certain) assumption is that for all\nTOPICS=\"YES\" stories the indexer at least thought about whether the\nstory belonged to a valid TOPICS category.  Thus, the TOPICS=\"YES\"\nstories with no topics can reasonably be considered negative examples\nfor all 135 valid TOPICS categories.\n     TOPICS=\"NO\" stories are more problematic in their interpretation.\nSome of them presumably result because the indexer made an explicit\ndecision that they did not belong to any of the 135 valid TOPICS\ncategories.  However, there are many cases where it is clear that a\nstory should belong to one or more TOPICS categories, but for some\nreason the category was not assigned.  There appear to be certain time\nintervals where large numbers of such stories are concentrated,\nsuggesting that some parts of the data set were simply not indexed, or\nnot indexed for some categories or category sets.  Also, in a few\ncases, the indexer clearly meant to assign TOPICS categories, but put\nthem in the wrong field.  
These cases have been corrected in the\nReuters-21578 data, yielding stories that have TOPICS categories, but\nwhere TOPICS=\"NO\", because the category was not assigned in the\nraw version of the data.\n     \"BYPASS\" stories clearly were not indexed, and so are useful only\nfor general distributional information on the language used in the\ndocuments.\n\n     2. LEWISSPLIT : The possible values are TRAINING, TEST, and\nNOT-USED.  TRAINING indicates it was used in the training set in the\nexperiments reported in LEWIS91d (Chapters 9 and 10), LEWIS92b,\nLEWIS92e, and LEWIS94b.  TEST indicates it was used in the test set\nfor those experiments, and NOT-USED means it was not used in those\nexperiments.\n\n     3. CGISPLIT : The possible values are TRAINING-SET and\nPUBLISHED-TESTSET indicating whether the document was in the training\nset or the test set for the experiments reported in HAYES89 and\nHAYES90b.\n\n     4. OLDID : The identification number (ID) the story had in the\nReuters-22173 collection.\n\n     5. NEWID : The identification number (ID) the story has in the\nReuters-21578, Distribution 1.0 collection.  These IDs are assigned to\nthe stories in chronological order.\n\nIn addition, some REUTERS tags have a sixth attribute, CSECS, which\ncan be ignored.  \n\nThe use of these attributes is critical to allowing comparability\nbetween different studies with the collection, and is discussed\nfurther in Section VIII.\n\n\n  VI.B. Document-Internal Tags \n\n     Just as the \u003cREUTERS\u003e and \u003c/REUTERS\u003e tags serve to delimit\ndocuments within a file, other tags are used to delimit elements\nwithin a document.  We discuss these in the order in which they\ntypically appear, though the exact order should not be relied upon in\nprocessing. In some cases, additional tags occur within an element\ndelimited by these top level document-internal tags.  
These are\ndiscussed in this section as well.\n\n     We specify below whether each open/close tag pair is used exactly\nonce (ONCE) per story, or a variable (VARIABLE) number of times\n(possibly zero).  In many cases the start tag of a pair appears only\nat the beginning of a line, with the corresponding end tag always\nappearing at the end of the same line.  When this is the case, we\nindicate it with the notation \"SAMELINE\" below, as an aid to those\nprocessing the files without SGML tools.  \n\n     1. \u003cDATE\u003e, \u003c/DATE\u003e [ONCE, SAMELINE]: Encloses the date and time\nof the document, possibly followed by some non-date noise material.\n\n     2. \u003cMKNOTE\u003e, \u003c/MKNOTE\u003e [VARIABLE] : Notes on certain hand\ncorrections that were done to the original Reuters corpus by Steve\nFinch.\n\n     3. \u003cTOPICS\u003e, \u003c/TOPICS\u003e [ONCE, SAMELINE]: Encloses the list of\nTOPICS categories, if any, for the document. If TOPICS categories are\npresent, each will be delimited by the tags \u003cD\u003e and \u003c/D\u003e.\n     \n     4. \u003cPLACES\u003e, \u003c/PLACES\u003e [ONCE, SAMELINE]: Same as \u003cTOPICS\u003e\nbut for PLACES categories.\n\n     5. \u003cPEOPLE\u003e, \u003c/PEOPLE\u003e [ONCE, SAMELINE]: Same as \u003cTOPICS\u003e\nbut for PEOPLE categories.\n\n     6. \u003cORGS\u003e, \u003c/ORGS\u003e [ONCE, SAMELINE]: Same as \u003cTOPICS\u003e but\nfor ORGS categories.\n\n     7. \u003cEXCHANGES\u003e, \u003c/EXCHANGES\u003e [ONCE, SAMELINE]: Same as\n\u003cTOPICS\u003e but for EXCHANGES categories.\n\n     8. \u003cCOMPANIES\u003e, \u003c/COMPANIES\u003e [ONCE, SAMELINE]: These tags always\nappear adjacent to each other, since there are no COMPANIES categories\nassigned in the collection.\n    \n     9. \u003cUNKNOWN\u003e, \u003c/UNKNOWN\u003e [VARIABLE]: These tags bracket control\ncharacters and other noisy and/or somewhat mysterious material in the\nReuters stories.\n\n     10. 
\u003cTEXT\u003e, \u003c/TEXT\u003e [ONCE]: We have attempted to delimit all the\ntextual material of each story between a pair of these tags.  Some\ncontrol characters and other \"junk\" material may also be included.\nThe whitespace structure of the text has been preserved. The \u003cTEXT\u003e\ntag has the following attribute:\n\n        a. TYPE: This has one of three values: NORM, BRIEF, and\nUNPROC.  NORM is the default value and indicates that the text of the\nstory had a normal structure. In this case the TEXT tag appears simply\nas \u003cTEXT\u003e.  The tag appears as \u003cTEXT TYPE=\"BRIEF\"\u003e when the story is a\nshort one or two line note.  The tag appears as \u003cTEXT TYPE=\"UNPROC\"\u003e\nwhen the format of the story is unusual in some fashion that limited\nour ability to further structure it.\n\nThe following tags optionally delimit elements inside the TEXT\nelement. Not all stories will have these tags:\n\n        a. \u003cAUTHOR\u003e, \u003c/AUTHOR\u003e : Author of the story. \n        b. \u003cDATELINE\u003e, \u003c/DATELINE\u003e : Location the story\noriginated from, and day of the year. \n        c. \u003cTITLE\u003e, \u003c/TITLE\u003e : Title of the story. We have attempted\nto capture the text of stories with TYPE=\"BRIEF\" within a \u003cTITLE\u003e\nelement.\n        d. \u003cBODY\u003e, \u003c/BODY\u003e : The main text of the story.\n\n\nVII. Categories \n\n   A test collection for text categorization contains, at minimum, a\nset of texts and, for each text, a specification of what categories\nthat text belongs to.  For the Reuters-21578 collection the documents\nare Reuters newswire stories, and the categories are five different\nsets of content related categories.  For each document, a human\nindexer decided which categories from which sets that document\nbelonged to.  
The category sets are as follows:\n\n              Number of    Number of Categories   Number of Categories \nCategory Set  Categories     w/ 1+ Occurrences      w/ 20+ Occurrences  \n************  **********   ********************   ******************** \nEXCHANGES        39                32                       7\nORGS             56                32                       9\nPEOPLE          267               114                      15\nPLACES          175               147                      60\nTOPICS          135               120                      57\n\n\nThe TOPICS categories are economic subject categories.  Examples\ninclude \"coconut\", \"gold\", \"inventories\", and \"money-supply\".  This\nset of categories is the one that has been used in almost all previous\nresearch with the Reuters data. HAYES90b discusses some examples of\nthe policies (not always obvious) used by the human indexers in\ndeciding whether a document belonged to a particular TOPICS category.\n\nThe EXCHANGES, ORGS, PEOPLE, and PLACES categories correspond to named\nentities of the specified type.  Examples include \"nasdaq\"\n(EXCHANGES), \"gatt\" (ORGS), \"perez-de-cuellar\" (PEOPLE), and\n\"australia\" (PLACES). Typically a document assigned to a category from\none of these sets explicitly includes some form of the category name\nin the document's text. (Something which is usually not true for\nTOPICS categories.)  However, not all documents containing a named\nentity corresponding to the category name are assigned to that\ncategory, since the entity was required to be a focus of the news\nstory [HAYES90b]. Thus these proper name categories are not as simple\nto assign correctly as might be thought.\n\nReuters-21578, Distribution 1.0 includes five files\n(all-exchanges-strings.lc.txt, all-orgs-strings.lc.txt,\nall-people-strings.lc.txt, all-places-strings.lc.txt, and\nall-topics-strings.lc.txt) which list the names of *all* legal\ncategories in each set.  
A sixth file, cat-descriptions_120396.txt,\ngives some additional information on the category sets.\n\nNote that a sixth category field, COMPANIES, was present in the\noriginal Reuters materials distributed by Carnegie Group, but no\ncompany information was actually included in these fields. In the\nReuters-21578 collection this field is always empty.\n\nIn the table above we note how many categories appear in at least 1 of\nthe 21,578 documents in the collection, and how many appear in at least\n20 of the documents.  Many categories appear in no documents, but we\nencourage researchers to include these categories when evaluating the\neffectiveness of their categorization system. \n\nAdditional details of the documents, categories, and corpus\npreparation process appear in LEWIS92b, and at greater length in\nSection 8.1 of LEWIS91d.\n\nVIII. Using Reuters-21578 for Text Categorization Research\n\n     In testing a method for text categorization it is important that\nknowledge of the nature of the test data not unduly influence the\ndevelopment of the system, or the performance obtained will be\nunrealistically high.  One way of dealing with this is to divide a set\nof data into two subsets: a training set and a test set.  An\nexperimenter then develops a categorization system by automated\ntraining on the training set only, and/or by human knowledge\nengineering based on examination of the training set only.  The\ncategorization system is then tested on the previously unexamined test\nset.  A number of variations on this basic theme are possible---see\nWEISS91 for a good discussion.\n\n     Effectiveness results can only be compared between studies that\nuse the same training and test set (or that use cross-validation\nprocedures).  One problem with the Reuters-22173 collection was that\nthe ambiguity of formatting and annotation led different researchers\nto use different training/test divisions. 
This was particularly\nproblematic when researchers attempted to remove documents that \"had\nno TOPICS\", as there were several definitions of what this meant.\n\n     To eliminate these ambiguities from the Reuters-21578 collection\nwe specify exactly which articles are in each of the recommended\ntraining sets and test sets by specifying the values those articles\nwill have on the TOPICS, LEWISSPLIT, and CGISPLIT attributes of the\nREUTERS tags.  We strongly encourage that all studies on Reuters-21578\nuse one of the following training test divisions (or use multiple\nrandom splits, e.g. cross-validation):\n\nVIII.A. The Modified Lewis (\"ModLewis\") Split:\n\n Training Set (13,625 docs): LEWISSPLIT=\"TRAIN\";  TOPICS=\"YES\" or \"NO\"\n Test Set (6,188 docs):  LEWISSPLIT=\"TEST\"; TOPICS=\"YES\" or \"NO\"\n Unused (1,765): LEWISSPLIT=\"NOT-USED\" or TOPICS=\"BYPASS\"\n\nThis replaces the 14704/6746 split (723 unused) of the Reuters-22173\ncollection, which was used in LEWIS91d (Chapters 9 and 10), LEWIS92b,\nLEWIS92c, LEWIS92e, and LEWIS94b. Note the following:\n\n      1. The duplicate documents removed in forming Reuters-21578 are\nof course not present. \n      2. The documents with TOPICS=\"BYPASS\" are not used, since\nsubsequent analysis strongly indicates that they were not categorized\nby the indexers.  \n      3. The 1,765 unused documents should not be tested on and should\nnot be used for supervised learning.  
However, they may be useful as\nadditional information on the statistical distribution of words,\nphrases, and other features that might be used to predict categories.\n\nThis split assigns documents from April 7, 1987 and before to the\ntraining set, and documents from April 8, 1987 and after to the test\nset.\n\nWARNING: Given the many changes in going from Reuters-22173 to\nReuters-21578, including correction of many typographical errors in\ncategory labels, results on the ModLewis split cannot be compared\nwith any published results on the Reuters-22173 collection!\n\n\nVIII.B. The Modified Apte (\"ModApte\") Split :\n\n Training Set (9,603 docs): LEWISSPLIT=\"TRAIN\";  TOPICS=\"YES\"\n Test Set (3,299 docs): LEWISSPLIT=\"TEST\"; TOPICS=\"YES\"\n Unused (8,676 docs):   LEWISSPLIT=\"NOT-USED\"; TOPICS=\"YES\"\n                     or TOPICS=\"NO\" \n                     or TOPICS=\"BYPASS\"\n\nThis replaces the 10645/3672 split (7,856 not used) of the\nReuters-22173 collection.  These are our best approximation to the\ntraining and test splits used in APTE94 and APTE94b. Note the\nfollowing:\n\n      1. As with ModLewis, those documents removed in forming\nReuters-21578 are not present, and BYPASS documents are not used.  \n      2. The intent in APTE94 and APTE94b was to use the Lewis split,\nbut restrict it to documents with at least one TOPICS category.\nHowever, it was not clear exactly what Apte et al. meant by having\nat least one TOPICS category (e.g. how was \"bypass\" treated, whether\nthis was before or after any fixing of typographical errors, etc.). We\nhave encoded our interpretation in the TOPICS attribute.  ***Note\nthat, as discussed above, some TOPICS=\"YES\" stories have no TOPICS\ncategories, and a few TOPICS=\"NO\" stories have TOPICS\ncategories. 
These facts are irrelevant to the definition of the\nsplit.*** If you are using a learning algorithm that requires each\ntraining document to have at least one TOPICS category, you can screen out\nthe training documents with no TOPICS categories. Please do NOT screen\nout any of the 3,299 test documents - that will make your results\nincomparable with other studies.\n\n      3. As with ModLewis, it may be desirable to use the 8,676 Unused\ndocuments for gathering statistical information about feature\ndistribution.\n\nAs with ModLewis, this split assigns documents from April 7, 1987 and\nbefore to the training set, and documents from April 8, 1987 and after\nto the test set.  The difference is that only documents with at least\none TOPICS category are used.  The rationale for this restriction is\nthat while some documents lack TOPICS categories because no TOPICS\napply (i.e. the document is a true negative example for all TOPICS\ncategories), it appears that others simply were never assigned TOPICS\ncategories by the indexers. (Unfortunately, the amount of time that\nhas passed since the collection was created has made it difficult to\nestablish exactly what went on during the indexing.)\n\nWARNING: Given the many changes in going from Reuters-22173 to\nReuters-21578, including correction of many typographical errors in\ncategory labels, results on the ModApte split cannot be compared\nwith any published results on the Reuters-22173 collection!\n\n\nVIII.C. The Modified Hayes (\"ModHayes\") Split: \n Training Set (20856 docs): CGISPLIT=\"TRAINING-SET\"\n Test Set (722 docs): CGISPLIT=\"PUBLISHED-TESTSET\"\n Unused (0 docs)\n\nThis is the best approximation we have to the training and test splits\nused in HAYES89, HAYES90b, and Chapter 8 of LEWIS91d.  It replaces the\n21450/723 split of the Reuters-22173 collection.  Note the following:\n\n      1. As with the other splits, the duplicate documents removed in\nforming Reuters-21578 are not present. \n\n      2. 
\"Training\" in HAYES89 and HAYES90b was actually done by human\nbeings looking at the documents and writing categorization rules. \nWe can not be sure which of the document files were actually looked\nat.  \n\n      3. We specify that the BYPASS stories and the TOPICS=NO stories\nare part of the training set, since they were used during manual\nknowledge engineering in the original Hayes experiments. That does not\nmean researchers are obliged to give these stories to, for instance, a\nsupervised learning algorithm.  As mentioned in the other splits, they\nmay be more useful for getting distributional information about\nfeatures.\n \nThere are a number of problems with the ModHayes split that make it\nless than desirable for text categorization research, including\nunusual distribution of categories, pairs of near-duplicate documents,\nand chronological burstiness.  (See [LEWIS90b, Ch. 8] for more\ndetails.)\n\nDespite these problems, this split is of interest because it provides\nthe ability to compare results with those of the CONSTRUE system\n[HAYES89, HAYES90b].  Comparison of results on the ModHayes split with\npreviously published results on the original Hayes split in HAYES89\nand HAYES90b (and LEWIS90b, Ch. 8) is possible, though the following\npoints should be taken into account:\n\n   1. The testset we provide in the ModHayes split has one fewer\ndocument than the one Hayes used. The document that was removed\n(OLDID=\"22026\") was a timestamp duplicate of the document with\nOLDID=\"22027\" and NEWID=\"13234\". So in computing effectiveness\nmeasures for comparison with HAYES89/90b, the document with\nNEWID=\"13234\" should be counted twice.\n\n   2. The documents in the Hayes testset had relatively few errors and\nanomalies in their categorization. And the errors which we did find\nand correct appear unlikely to have affected the original Hayes\nresults. 
In particular, it appears that the only errors in the TOPICS\nfield were the addition of a few invalid categories that were not\nevaluated on.  However, for completeness we list the changes in the\nHayes testset documents made going from Reuters-22173 to Reuters-21578\n(all documents are referred to by their NEWID):\n\n   Removal of invalid TOPIC \"loan\" : 13234, 16946, 17111, 17112, 17207,\n17217, 17228, 17234, 17271, 17310\n\n   Removal of invalid TOPIC \"gbond\" : 17138, 17260\n\n   Removal of invalid TOPIC \"tbill\" : 17258\n\n   Removal of invalid TOPIC \"cbond\" : 17024\n\n   Removal of invalid TOPIC \"fbond\" : 17087\n\n   Correction of invalid PEOPLE mancera to mancera-aguayo: 17142,\n17149, 17154, 17177, 17187\n\n   Correction of invalid PEOPLE andriesssen to andriessen : 17366\n\n   Correction of invalid PLACES \"ivory\" and \"coast\" to single correct\nPLACE \"ivory-coast\": 18383\n\n    3. The effectiveness measures used in HAYES89 and HAYES90b were\nsomewhat nonstandard. See Ch. 8 of LEWIS91d for a discussion.\n\n\nVIII.D. Other Splits\n  \n     We strongly encourage researchers to use one (or more) of the\nabove splits for their experiments (or use cross-validation on one of\nthe sets of documents defined in the above splits).  We recommend the\nModified Apte (\"ModApte\") Split for research on predicting the TOPICS\nfield, since the evidence is that a significant number of documents\nthat should have TOPICS do not.  The ModLewis split can be used if the\nresearcher has a strong need to test the ability of a system to deal\nwith examples belonging to no category. While it is likely that some\nof these examples should indeed belong to a category, the ModLewis\nsplit is at least better than the corresponding split from\nReuters-22173, in that it eliminates the \"bypass\" stories.\n\n     We in particular encourage you to resist the following\ntemptations:\n     1. 
Defining new splits based on whether or not the documents\nactually have any TOPICS categories.  (See the discussion of the\nModApte split.) \n     2. Testing your system only on the \"easy\" categories.  This is a\ntemptation we have succumbed to in the past, but will resist in the\nfuture.  Yes, we know that some of the 135 TOPICS categories have few\nor no positive training examples or few or no positive test examples\nor both.  Yes, purely supervised learning systems will do very badly\non these categories.  Knowledge-based systems, on the other hand,\nmight do well on them, while doing poorly in comparison with\nsupervised learning on categories with lots of positive\nexamples. These comparisons are of great interest.  Of course, it's of\ngreat interest to *in addition* analyze subsets of categories\n(e.g. lots of positive examples vs. few positive examples, etc.). \n \n     Note that one strategy we considered and rejected is to assume\nthat documents which have no TOPICS but do have categories in other\nfields (PLACES, etc.) belong to no TOPICS categories. This does not\nappear to be a safe assumption - we have found a number of examples\nof documents with PLACES but no TOPICS when there are TOPICS that\nclearly apply.\n\nIX. Feature Sets in Text Categorization \n\n   For many text categorization methods, particularly those using\nstatistical classification techniques, it is convenient to represent\ndocuments not as a sequence of characters, but rather as a tuple of\nnumeric or binary feature values.  For instance, the value of feature\nFi for a document Dj might be 1 if the string of characters\n\"financial\" occurred in the document with whitespace on either side,\nand 0 otherwise.  Or the value of Fi for Dj might be the number of\noccurrences of \"financial\" in document Dj.  
In information retrieval\nsuch features are often called \"indexing terms\" and one often speaks\nof a term being \"present\" in a document, to mean that the feature\ntakes on a non-default value. (Usually, but not always, any value but\n0 is non-default.)\n\n  Comparisons between text categorization methods that represent\ndocuments as feature tuples are aided by ensuring that the same tuple\nrepresentation is used with all methods, thus avoiding conflating\ndifferences in feature extraction with differences in, say, machine\nlearning methods.  For that reason, the Reuters-22173 distribution\nincluded not only the formatted text of the Reuters stories, but also\nfeature tuple representations of the stories in each of two feature\nsets, one based on words and one based on noun phrases.  Surprisingly,\nalmost no use was made of these files by other researchers, so we have\nnot included files of this sort in the Reuters-21578 distribution.\n\n    However, we are willing to make available as part of the\ndistribution any tuple representations of this sort that researchers\nwant to contribute. (Contact lewis@research.att.com if you would like\nto do this.) Perhaps the ideal situation would be if someone with a\nstrong interest in feature set formation produced tuples based on a\nhigh quality set of features which other researchers interested only\nin learning algorithms could make use of.\n\n\nX. Bibliography\n\n[This needs to be updated.]\n\n@article{APTE94\n ,author = \"Chidanand Apt{\\'{e}} and Fred Damerau and Sholom M. Weiss\"\n ,title = \"Automated Learning of Decision Rules for Text Categorization\"\n ,journal = \"ACM Transactions on Information Systems\"\n ,year = 1994\n , note = \"To appear.\"\n }\n\n@inproceedings{APTE94b\n ,author = \"Chidanand Apt{\\'{e}} and Fred Damerau and Sholom M. 
Weiss\"\n ,title = \"Toward Language Independent Automated Learning of Text Categorization Models\"\n ,booktitle = sigir94\n ,year = 1994\n ,note = \"To appear.\"\n }\n\n@inproceedings{HAYES89\n,author = \"Philip J. Hayes and Peggy M. Anderson and Irene B. Nirenburg and \nLinda M. Schmandt\"\n,title = \"{TCS}: A Shell for Content-Based Text Categorization\"\n,booktitle = \"IEEE Conference on Artificial Intelligence Applications\"\n,year = 1990\n}\n\n@inproceedings{HAYES90b\n,author = \"Philip J. Hayes and Steven P. Weinstein\"\n,title = \"{CONSTRUE/TIS:} A System for Content-Based Indexing of a \nDatabase of News Stories\"\n,booktitle = \"Second Annual Conference on Innovative Applications of\nArtificial Intelligence\"\n,year = 1990\n}\n\n@incollection{HAYES92 \n ,author = \"Philip J. Hayes\"\n ,title = \"Intelligent High-Volume Text Processing using Shallow,\nDomain-Specific Techniques\" \n ,booktitle = \"Text-Based Intelligent Systems\"\n ,publisher = \"Lawrence Erlbaum\"\n ,address =  \"Hillsdale, NJ\"\n ,year = 1992\n ,editor = \"Paul S. Jacobs\"\n}\n\n@inproceedings{LEWIS91c \n  ,author = \"David D. Lewis\" \n  ,title = \"Evaluating Text Categorization\" \n  ,booktitle = \"Proceedings of Speech and Natural Language Workshop\" \n  ,year = 1991 \n  ,month = feb \n  ,organization = \"Defense Advanced Research Projects Agency\" \n  ,publisher = \"Morgan Kaufmann\" \n  ,pages = \"312--318\" \n\n}\n\n@phdthesis{LEWIS91d\n,author = \"David Dolan Lewis\"\n,title = \"Representation and Learning in Information Retrieval\"\n,school = \"Computer Science Dept.; Univ. of Massachusetts; Amherst, MA 01003\"\n,year = 1992\n,note = \"Technical Report 91--93.\"\n}\n\n@inproceedings{LEWIS91e\n,author = \"David D. 
Lewis\"\n,title = \"Data Extraction as Text Categorization: An Experiment with\nthe {MUC-3} Corpus\"\n,booktitle = \"Proceedings of the Third Message Understanding Evaluation\nand Conference\"\n,year = 1991\n,month = may\n,organization = \"Defense Advanced Research Projects Agency\"\n,publisher = \"Morgan Kaufmann\"\n,address = \"Los Altos, CA\"\n\n}\n\n@inproceedings{LEWIS92b\n ,author = \"David D. Lewis\"\n ,title = \"An Evaluation of Phrasal and Clustered Representations on a Text\nCategorization Task\"\n ,booktitle = \"Fifteenth Annual International ACM SIGIR Conference on\nResearch and Development in Information Retrieval\"\n ,year = 1992\n ,pages = \"37--50\"\n}\n\n@inproceedings{LEWIS92d \n,author = \"David D. Lewis and Richard M. Tong\"\n,title = \"Text Filtering in {MUC-3} and {MUC-4}\"\n,booktitle = \"Proceedings of the Fourth Message Understanding Conference ({MUC-4})\"\n,year = 1992\n,month = jun\n,organization = \"Defense Advanced Research Projects Agency\"\n,publisher = \"Morgan Kaufmann\"\n,address = \"Los Altos, CA\"\n}\n\n@inproceedings{LEWIS92e\n,author = \"David D. Lewis\" \n,title = \"Feature Selection and Feature Extraction for Text Categorization\"\n,booktitle = \"Proceedings of Speech and Natural Language Workshop\"\n,year = 1992\n,month = feb \n,organization = \"Defense Advanced Research Projects Agency\"\n,publisher = \"Morgan Kaufmann\"\n,pages = \"212--217\"\n}\n\n@inproceedings{LEWIS94b\n ,author = \"David D. Lewis and Marc Ringuette\"\n ,title = \"A Comparison of Two Learning Algorithms for Text Categorization\"\n ,booktitle = \"Symposium on Document Analysis and Information Retrieval\"\n ,year = 1994\n ,organization = \"ISRI; Univ. of Nevada, Las Vegas\"\n ,address = \"Las Vegas, NV\"\n ,month = apr\n ,pages = \"81--93\"\n}\n\n@article{LEWIS94d\n, author       = \"David D. Lewis and Philip J. 
Hayes\"\n, title        = \"Guest Editorial\"\n, journal      = \"ACM Transactions on Information Systems\"\n, year         = 1994 \n, volume       = 12\n, number       = 3\n, pages        = \"231\"\n, month        = jul\n}\n\n@article{SPARCKJONES76\n,author = \"K. {Sparck Jones} and  C. J. {van Rijsbergen}\"\n,title =  \"Information Retrieval Test Collections\"\n,journal = \"Journal of Documentation\"\n,year = 1976\n,volume = 32\n,number = 1\n,pages = \"59--75\"\n  }\n\n@book{WEISS91\n ,author = \"Sholom M. Weiss and Casimir A. Kulikowski\"\n ,title = \"Computer Systems That Learn\" \n ,publisher = \"Morgan Kaufmann\"\n ,year = 1991\n ,address = \"San Mateo, CA\"\n }\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhatamiarash7%2Fir-system","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhatamiarash7%2Fir-system","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhatamiarash7%2Fir-system/lists"}