{"id":19811323,"url":"https://github.com/diging/tethne-services","last_synced_at":"2026-06-06T22:32:06.699Z","repository":{"id":68005788,"uuid":"76480892","full_name":"diging/tethne-services","owner":"diging","description":"Tools to enhance metadata-based analysis in Tethne.","archived":false,"fork":false,"pushed_at":"2016-12-29T23:57:58.000Z","size":1849,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":4,"default_branch":"master","last_synced_at":"2025-01-11T07:13:18.059Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/diging.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":"authors/__init__.py","dei":null,"publiccode":null,"codemeta":null}},"created_at":"2016-12-14T17:15:09.000Z","updated_at":"2016-12-19T21:36:36.000Z","dependencies_parsed_at":null,"dependency_job_id":"a789dbe3-346b-4c5a-a76c-fb83b87b3157","html_url":"https://github.com/diging/tethne-services","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/diging%2Ftethne-services","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/diging%2Ftethne-services/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/diging%2Ftethne-services/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/diging%2Ftethne-services/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/diging","download_url":"https://codeload.github.
com/diging/tethne-services/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241156630,"owners_count":19919338,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-12T09:25:46.146Z","updated_at":"2026-06-06T22:32:06.626Z","avatar_url":"https://github.com/diging.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# tethne-services\nTools to enhance metadata-based analysis in Tethne.\n\n\n### Clone this Repo\nClone this repository into whatever directory you'd like to work on it from:\n\n```bash\ngit clone https://github.com/diging/tethne-services.git\n```\n\n### Install the following\n*   [Python v2.7.12](https://www.python.org/downloads/release/python-2712/)\n*   [Tethne](http://pythonhosted.org/tethne/)\n    *   `pip install tethne`\n*   [pandas v0.19.0](http://pandas.pydata.org/)\n    *   `pip install pandas`\n*   [scikit-learn v0.18.1](http://scikit-learn.org/stable/)\n    *   `pip install -U scikit-learn`\n*   [numpy v1.11.3](http://www.numpy.org/)\n    *   `pip install numpy`\n*   [fuzzywuzzy v0.14.0](https://pypi.python.org/pypi/fuzzywuzzy)\n    *   `pip install fuzzywuzzy`\n\n\n### APIs\n\nIn addition to the below examples, an example showing basic classification of 2 author-paper instances can be found [here](https://github.com/diging/tethne-services/tree/master/classificationmodels#serialized-models)\n\nThis package will expose the following APIs:\n\n* `CorpusParser.py`: \nThis module is responsible for parsing a `Tethne` corpus object and 
returning a pandas DataFrame with 14 columns. \nEach row in the DataFrame is an Author-Paper instance and is assigned an index.\nThe index is generated by concatenating the following:\n                    1. Author Last Name         \n                    2. Author First Name             \n                    3. WOS ID\n\n    * An example usage of this API is shown below.\n    \n        ```python\n        from authors.paperinstances import CorpusParser\n        from tethne.readers import wos\n        datapath = \"/Users/aosingh/tethne-services/tests/data/Boyer_Barbara.txt\"\n        corpus = wos.read(datapath)\n        corpus_parser = CorpusParser(tethne_corpus=corpus)\n        df = corpus_parser.parse() # final pandas DataFrame of Author-Paper instances.\n        \n        # This is what the index of each row looks like\n        u'BOYERBCWOS:000077556600009',\n        u'BOYERBCWOS:000086633600013',\n        u'BOYERBCWOS:000171953200027',\n        u'BOYERBCWOS:000186338800013',\n        u'BOYERBCWOS:A1971I198200009',\n        u'BOYERBCWOS:A1972L780300002',\n        u'BOYERBCWOS:A1981MP70900043',\n        u'BOYERBCWOS:A1982QN98300013',\n        u'BOYERBCWOS:A1983RR86600053',\n        u'BOYERBCWOS:A1984TR20700060',\n        \n        # The returned DataFrame has the following columns\n        [\"WOSID\", \"DATE\", \"TITLE\", \"LASTNAME\", \"FIRSTNAME\", \"JOURNAL\", \"EMAILADDRESS\",\n        \"PUBLISHER\", \"SUBJECT\", \"WC\", \"AUTHOR_KEYWORDS\", \"INSTITUTE\", \"AUTH_LITERAL\", \"CO-AUTHORS\"]\n        \n        ```\n     \n* `InitialCluster.py`: \nInitialCluster, as the name suggests, groups Author-Paper instances by similar author names. In this process, we\ndo not use any classification or machine learning approach. 
Initial clustering is done to limit the number of comparisons when we perform the actual classification.\nFor example: It is not efficient to compare papers by the authors 'BRUCE WAYNE' and 'CLARK KENT' using the classification model. We know they are 2 different people.\nWhile we build this initial cluster, we group together author_literals which are similar and have a higher probability of actually belonging to the same cluster.\n\n    * An example usage of this API is shown below \n    \n        ```python\n        from authors.Cluster import InitialCluster\n        from tethne.readers import wos\n        datapath = './data/Albertini_David.txt'\n        corpus = wos.read(datapath)\n        initial_cluster_instance = InitialCluster(corpus=corpus)\n        clusters = initial_cluster_instance.build()\n        \n        # This is what the dictionary of initial clusters looks like:\n        \n         {u'AALBERGJ': {u'AALBERGJ', u'VALBERGPA'},\n          u'ABOULGHARM': {u'ABOULGHARM'},\n          u'ABSEDANNIE': {u'ABSEDANNIE'},\n          u'AHUJAKAMAL': {u'AHUJAKAMAL'},\n          u'AINSLIEA': {u'AINSLIEA'},\n          u'AKKOYUNLUGOKHAN': {u'AKKOYUNLUGOKHAN'},\n          u'ALBERTINDF': {u'ALBERTINDF',\n                     u'ALBERTINID',\n                     u'ALBERTINID F',\n                     u'ALBERTINIDAVID',\n                     u'ALBERTINIDAVID F',\n                     u'ALBERTINIDF'},\n           u'ALECCIC':{u'ALECCIC'},\n           u'ALEXANDREHENRI': {u'ALEXANDREHENRI'},\n           u'ALIKANIMINA': {u'ALIKANIMINA', u'GALIANIDALIA'},\n           u'ALLWORTHAE': {u'ALLWORTHAE'},\n           u'ANDERSENCY': {u'ANDERSENCY', u'ANDERSONE', u'ANDERSONR'}}\n        \n        \n        ```\n\n* `IdentityCluster.py`: \nThe IdentityCluster class uses machine learning (RandomForestClassifier) to cluster paper instances belonging to the same\nauthor. We first build an initial cluster using the class `InitialCluster`. 
Please read the example below to understand its usage.\n\n    * An example usage of this API is shown below \n    \n        ```python\n        \"\"\"\n        Algorithm: \n        \n        STEP 1 : Parse the TETHNE corpus and return a pandas DataFrame of Author-paper instances. \n                 Please note that an index is assigned to each Author-Paper instance.\n                 The index is generated by concatenating the following:\n                    1. Author Last Name\n                    2. Author First Name\n                    3. WOS ID\n\n        STEP 2 : Use the `IdentityCluster` class to group instances belonging to the same class (basically, \n                 build a dictionary).   \n              \n        STEP 3 : Return the dictionary with LABEL as keys and a set of pandas DataFrame indexes as values. \n        These indices are the same ones created in STEP 1 of the algorithm.\n        \"\"\"\n        \n        from authors.cluster import IdentityCluster\n        from tethne.readers import wos\n        from authors.paperinstances import CorpusParser\n        datapath = \"/Users/aosingh/tethne-services/tests/data/Boyer_Barbara.txt\"\n        corpus = wos.read(datapath)\n\n        corpus_parser = CorpusParser(tethne_corpus=corpus)\n        df = corpus_parser.parse() # STEP 1 in the algorithm\n\n        identity_cluster_instance = IdentityCluster(corpus=corpus)\n        identity_clusters = identity_cluster_instance.build() # STEPS 2 and 3 in the algorithm\n        \n        # This is what the dictionary of final identity clusters looks like:\n        {u'ARNOLDJM': set([u'ARNOLDJMWOS:A1986E918400022',\n                               u'ARNOLDJMWOS:A1988N184500004']),\n             u'BOYERB': set([u'BOYERBCWOS:000076265300004',\n                             u'BOYERBCWOS:000077556600009',\n                             u'BOYERBCWOS:000086633600013',\n                             u'BOYERBCWOS:000171953200027',\n                 
            u'BOYERBCWOS:000186338800013',\n                             u'BOYERBCWOS:A1971I198200009',\n                             u'BOYERBCWOS:A1972L780300002',\n                             u'BOYERBCWOS:A1981MP70900043',\n                             u'BOYERBCWOS:A1982QN98300013',\n                             u'BOYERBCWOS:A1983RR86600053',\n                             u'BOYERBCWOS:A1984TR20700060',\n                             u'BOYERBCWOS:A1984TR20700065',\n                             u'BOYERBCWOS:A1985AUC5500038',\n                             u'BOYERBCWOS:A1985AUC5500046',\n                             u'BOYERBCWOS:A1986A349100017',\n                             u'BOYERBCWOS:A1986C019700010',\n                             u'BOYERBCWOS:A1986E918400022',\n                             u'BOYERBCWOS:A1986E918400023',\n                             u'BOYERBCWOS:A1987G340700002',\n                             u'BOYERBCWOS:A1988N184500004',\n                             u'BOYERBCWOS:A1988R225500053',\n                             u'BOYERBCWOS:A1989CH57500002',\n                             u'BOYERBCWOS:A1991GV28500052',\n                             u'BOYERBCWOS:A1992KC97700038',\n                             u'BOYERBCWOS:A1992KC97700042',\n                             u'BOYERBCWOS:A1995RP17800035',\n                             u'BOYERBCWOS:A1995TA77100017',\n                             u'BOYERBCWOS:A1996VQ71700035',\n                             u'BOYERBCWOS:A1996VT14600003',\n                             u'BOYERBWOS:A1996UQ10700011']),\n             u'HENRYJJ': set([u'HENRYJJWOS:000077556600009',\n                              u'HENRYJQWOS:000076265300004',\n                              u'HENRYJQWOS:000086633600013',\n                              u'HENRYJQWOS:A1995TA77100017',\n                              u'HENRYJQWOS:A1996VQ71700035',\n                              u'HENRYJQWOS:A1996VT14600003']),\n             u'HILLSD': 
set([u'HILLSDWOS:000171953200027', u'HILLSDWOS:000186338800013']),\n             u'KAPLANIM': set([u'KAPLANIMWOS:A1988R225500053',\n                               u'KAPLANIMWOS:A1992KC97700042']),\n             u'LADURNERP': set([u'LADURNERPWOS:A1996UQ10700011']),\n             u'LANDOLFAM': set([u'LANDOLFAMAWOS:A1985AUC5500046',\n                                u'LANDOLFAMAWOS:A1986E918400022',\n                                u'LANDOLFAMWOS:A1988N184500004']),\n             u'MAIRG': set([u'MAIRGWOS:A1996UQ10700011']),\n             u'MARTINDALEMQ': set([u'MARTINDALEMQWOS:000077556600009',\n                                   u'MARTINDALEMQWOS:000086633600013',\n                                   u'MARTINDALEMQWOS:A1995TA77100017',\n                                   u'MARTINDALEMQWOS:A1996VQ71700035',\n                                   u'MARTINDALEMQWOS:A1996VT14600003']),\n             u'PALASZEWSKIPP': set([u'PALASZEWSKIPPWOS:A1983RR86600053']),\n             u'REITERD': set([u'REITERDWOS:A1996UQ10700011']),\n             u'RIEGERR': set([u'RIEGERRWOS:A1996UQ10700011']),\n             u'ROONEYLM': set([u'ROONEYLMWOS:A1984TR20700065']),\n             u'SALVENMOSERW': set([u'SALVENMOSERWWOS:A1996UQ10700011']),\n             u'SANTOSKA': set([u'SANTOSKAWOS:A1988R225500053']),\n             u'SMITHGW': set([u'SMITHGWWOS:A1982QN98300013'])}\n        ```\n \n\n \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdiging%2Ftethne-services","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdiging%2Ftethne-services","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdiging%2Ftethne-services/lists"}