{"id":22147223,"url":"https://github.com/centerforopenscience/sharepa","last_synced_at":"2026-03-10T11:32:04.896Z","repository":{"id":57466650,"uuid":"36872524","full_name":"CenterForOpenScience/sharepa","owner":"CenterForOpenScience","description":"A python client for browsing and analyzing SHARE data (https://osf.io/share)","archived":false,"fork":false,"pushed_at":"2017-03-24T20:08:59.000Z","size":203,"stargazers_count":8,"open_issues_count":1,"forks_count":7,"subscribers_count":3,"default_branch":"develop","last_synced_at":"2024-11-15T18:39:49.636Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/CenterForOpenScience.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG","contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2015-06-04T13:41:00.000Z","updated_at":"2016-12-05T00:35:17.000Z","dependencies_parsed_at":"2022-09-19T07:52:20.678Z","dependency_job_id":null,"html_url":"https://github.com/CenterForOpenScience/sharepa","commit_stats":null,"previous_names":["fabianvf/sharepa"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CenterForOpenScience%2Fsharepa","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CenterForOpenScience%2Fsharepa/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CenterForOpenScience%2Fsharepa/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/CenterForOpenScience%2Fsharepa/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/CenterForOpenScience","download_url":"http
s://codeload.github.com/CenterForOpenScience/sharepa/tar.gz/refs/heads/develop","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227642219,"owners_count":17797850,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-12-01T23:14:48.391Z","updated_at":"2025-12-13T23:25:14.420Z","avatar_url":"https://github.com/CenterForOpenScience.png","language":"Python","readme":"# sharepa\n\n```master``` build status: [![Build Status](https://travis-ci.org/CenterForOpenScience/sharepa.svg?branch=master)](https://travis-ci.org/CenterForOpenScience/sharepa)\n\n\n```develop``` build status: [![Build Status](https://travis-ci.org/CenterForOpenScience/sharepa.svg?branch=develop)](https://travis-ci.org/CenterForOpenScience/sharepa)\n\n\n[![Coverage Status](https://coveralls.io/repos/CenterForOpenScience/sharepa/badge.svg?branch=develop)](https://coveralls.io/r/CenterForOpenScience/sharepa?branch=develop)\n[![Code Climate](https://codeclimate.com/github/fabianvf/sharepa/badges/gpa.svg)](https://codeclimate.com/github/fabianvf/sharepa)\n\nA python client for browsing and analyzing SHARE data (http://share-research.readthedocs.io/en/latest/), gathered with the SHARE Processing Pipeline (https://github.com/CenterForOpenScience/SHARE). It builds heavily (almost completely) on the [elasticsearch-dsl](https://github.com/elastic/elasticsearch-dsl-py) package for handling Elasticsearch querying and aggregations, and contains some additional utilities to help with graphing and analyzing the data.\n\nUse Binder to run some SHARE data tutorials online! 
Click here: [![Binder](http://mybinder.org/badge.svg)](http://mybinder.org:/repo/erinspace/share_tutorials)\n\n## Installation\nYou can install sharepa using pip (inside a virtualenv):\n\n    pip install git+https://github.com/CenterForOpenScience/sharepa@develop\n\n**Note:** The version above works with SHARE v2's elasticsearch API. To install the version that works with v1 of the SHARE API, run ```pip install sharepa``` instead.\n\n## Getting Started\nHere are some basic searches to get you started parsing through SHARE data.\n\n### Basic Search\nA basic search provides access to all documents in SHARE, in slices of 10 documents.\n\n#### Count\nYou can use sharepa and the basic search to get the total number of documents in SHARE:\n```\nfrom sharepa import basic_search\n\n\nprint(basic_search.count())\n```\n\n#### Iterating through results\nExecuting the basic search sends the actual basic query to the SHARE API and then lets you iterate through the results:\n\n```\nresults = basic_search.execute()\n\nfor hit in results:\n    print(hit.title)\n```\n\nIf we don't want 10 results, or we want to offset the results, we can use slices:\n```\nresults = basic_search[5:10].execute()\nfor hit in results:\n    print(hit.title)\n```\n\n## Advanced Search\nYou can make your own search object, which allows you to pass in custom queries for certain terms or SHARE fields. Queries are formed using [Lucene query syntax](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax).\n\n```\nfrom sharepa import ShareSearch\n\nmy_search = ShareSearch()\n\nmy_search = my_search.query(\n    'query_string',  # Type of query; this one takes a Lucene query string\n    query='NOT tags:*',  # This will find all documents that do not have tags\n    analyze_wildcard=True\n)\n```\n\nThis query uses the 'query_string' type. 
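Whichever query type you choose, the search object ultimately serializes to a plain dict that mirrors the raw Elasticsearch request body, so you can sanity-check a query without any network access. Here is a stdlib-only sketch of one such body (the 'NOT tags:*' query-string query used later in this README; values are illustrative):

```python
import json

# Hand-built Elasticsearch request body for a query_string query that
# matches documents with no content in the 'tags' field. This is the
# same shape elasticsearch-dsl produces via my_search.to_dict().
body = {
    "query": {
        "query_string": {
            "query": "NOT tags:*",
            "analyze_wildcard": True,
        }
    }
}

print(json.dumps(body, indent=4, sort_keys=True))
```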
Other options include a [match query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html), a [multi-match query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html), a [bool query](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html), and any other query structure available in the elasticsearch API.\n\nWe can see the query that we're about to send to elasticsearch by using the pretty_print helper function:\n\n```\nfrom sharepa.helpers import pretty_print\n\n\npretty_print(my_search.to_dict())\n```\n\n```\n{\n    \"query\": {\n        \"query_string\": {\n            \"analyze_wildcard\": true,\n            \"query\": \"NOT tags:*\"\n        }\n    }\n}\n```\n\n\nWhen you execute that query, you can iterate through the results the same way that you could with the basic search query.\n```\nnew_results = my_search.execute()\nfor hit in new_results:\n    print(hit.title)\n```\n\n\n## Aggregations for data analysis\nWhile searching for individual results is useful, sharepa also lets you make aggregation queries that give you results across the entirety of the SHARE dataset at once. This is useful if you're curious about the completeness of data sets. For example, we can find the number of documents per source that are missing titles.\n\nWe can add an aggregation to my_search that will give us the number of documents per source that meet the previously defined search query (in our case, items that don't have tags). 
Here's what adding that aggregation looks like:\n\n```\nmy_search.aggs.bucket(\n    'sources',  # Every aggregation needs a name\n    'terms',  # There are many kinds of aggregations; terms is a pretty useful one\n    field='sources',  # We store the source of a document in the sources field\n    size=0,  # These make sure we get numbers for all the sources, which makes it easier to combine graphs\n    min_doc_count=0\n)\n```\n\nWe can see which query is actually going to be sent to elasticsearch by printing out the query:\n\n```\npretty_print(my_search.to_dict())\n```\n\n```\n{\n    \"query\": {\n        \"query_string\": {\n            \"analyze_wildcard\": true,\n            \"query\": \"NOT tags:*\"\n        }\n    },\n    \"aggs\": {\n        \"sources\": {\n            \"terms\": {\n                \"field\": \"sources\",\n                \"min_doc_count\": 0,\n                \"size\": 0\n            }\n        }\n    }\n}\n```\n\nThis is the actual query that will be sent to the SHARE API. You can see that it added a section called \"aggs\" to the basic query that we made earlier.\n\nYou can access the aggregation data for basic plotting and analysis through the aggregation's buckets, as shown below.\n\n## Basic Plotting\nSharepa has some basic functions to get you started making plots using [matplotlib](http://matplotlib.org/) and [pandas](http://pandas.pydata.org/).\n\nRaw sharepa data is in the same format as elasticsearch results, represented as a nested structure. To convert the data into a format that pandas can recognize, we have to convert it into a dataframe.\n\n### Creating a dataframe from sharepa data\nWe can use the bucket_to_dataframe function to convert the elasticsearch-formatted data into a pandas dataframe. 
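Conceptually, the conversion is simple: each aggregation bucket contributes its 'key' as a row label and its 'doc_count' as the value. A pure-Python sketch of that flattening, with hypothetical bucket data (no pandas or network access needed; the source names are made up):

```python
# Hypothetical sample of elasticsearch 'terms' aggregation buckets,
# shaped like new_results.aggregations.sources.buckets.
sample_buckets = [
    {"key": "provider_a", "doc_count": 120},
    {"key": "provider_b", "doc_count": 45},
    {"key": "provider_c", "doc_count": 0},
]

# Flatten: each bucket's 'key' becomes the row label and its
# 'doc_count' the value -- the shape handed to pandas as a column.
rows = {bucket["key"]: bucket["doc_count"] for bucket in sample_buckets}
print(rows)  # {'provider_a': 120, 'provider_b': 45, 'provider_c': 0}
```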
To do this, we pass the title of the new column we want created, and the place to find the nested aggregation data.\n\nLet's re-execute my_search with the updated query and refresh the new_results variable:\n\n```\nnew_results = my_search.execute()\n```\n\nTo convert these results to a pandas dataframe, we'll look within the appropriate results bucket, in this case ```new_results.aggregations.sources.buckets```\n\n```\nfrom sharepa import bucket_to_dataframe\nfrom matplotlib import pyplot\n\nmy_data_frame = bucket_to_dataframe('# documents by source - No Tags', new_results.aggregations.sources.buckets)\nmy_data_frame.plot(kind='bar')\npyplot.show()\n```\n\nThis will create a bar graph showing, for each source, the count of documents matching our query for items that do not have tags.\n\nYou can also sort the data based on a certain column, in this case '# documents by source - No Tags':\n\n```\nmy_data_frame.sort(ascending=False, columns='# documents by source - No Tags').plot(kind='bar')\npyplot.show()\n```\n\n\n## Advanced Aggregations\n\nLet's make a more interesting aggregation: let's look at the documents that are missing titles, by source.\n\n```\nfrom elasticsearch_dsl import F, Q\n\nmy_search.aggs.bucket(\n    'missingTitle',  # Name of the aggregation\n    'filters',  # We'll filter for documents that are missing titles\n    filters={ \n        'missingTitle': F(  # F defines a filter\n            'fquery',  # This is a query filter, which takes a query and filters documents by it\n            query=Q(  # Q can define a query\n                'query_string',  # The type of query\n                query='NOT title:*',  # This will match all documents that don't have content in the title field\n                analyze_wildcard=True,\n            )\n        ) \n    }\n).metric(  # but wait, that's not enough! 
We need to break it down by source as well\n    'sourceAgg',\n    'terms',\n    field='sources',\n    size=0,\n    min_doc_count=0\n)\n```\n\nWe can check out what the query looks like now: \n```\npretty_print(my_search.to_dict()) \n```\n\n```\n{\n    \"query\": {\n        \"query_string\": {\n            \"analyze_wildcard\": true, \n            \"query\": \"NOT tags:*\"\n        }\n    }, \n    \"aggs\": {\n        \"sources\": {\n            \"terms\": {\n                \"field\": \"sources\", \n                \"min_doc_count\": 0, \n                \"size\": 0\n            }\n        }, \n        \"missingTitle\": {\n            \"aggs\": {\n                \"sourceAgg\": {\n                    \"terms\": {\n                        \"field\": \"sources\", \n                        \"min_doc_count\": 0, \n                        \"size\": 0\n                    }\n                }\n            }, \n            \"filters\": {\n                \"filters\": {\n                    \"missingTitle\": {\n                        \"fquery\": {\n                            \"query\": {\n                                \"query_string\": {\n                                    \"query\": \"NOT title:*\", \n                                    \"analyze_wildcard\": true\n                                }\n                            }\n                        }\n                    }\n                }\n            }\n        }\n    }\n}\n```\n\nWow this query has gotten big! 
Good thing we don't have to define it by hand.\n\nNow we just need to execute the search:\n```\nmy_results = my_search.execute()\n```\n\nLet's check out the results, and make sure that there are indeed no tags.\n\n```\nfor hit in my_results:\n    print(hit.title, hit.get('tags'))  # we can see there are no tags in our results\n```\n\nLet's pull out those buckets and turn them into dataframes for more analysis:\n\n```\nmissing_title = bucket_to_dataframe('missingTitle', my_results.aggregations.missingTitle.buckets.missingTitle.sourceAgg.buckets)\nmatches = bucket_to_dataframe('matches', my_results.aggregations.sources.buckets)\n```\n\nIt'd be great if we could merge this dataframe with another that has information about all of the documents. Luckily, we have a built-in function called source_counts that will give us that dataframe easily.\n\nWe can use that dataframe and merge it with our newly created one:\n\n```\nfrom sharepa.helpers import source_counts\nfrom sharepa.analysis import merge_dataframes\n\n\nmerged = merge_dataframes(source_counts(), matches, missing_title)\n```\n\nWe can also easily do computations on these columns, and add those to the dataframe. Here's a way to get a pandas dataframe with a column for the percentage of documents from each source that are missing both tags and a title:\n\n```\nmerged['percent_missing_tags_and_title'] = (merged.missingTitle / merged.total_source_counts) * 100\n```\n\n## Examples\n\nThe following examples cover some of the more common use cases of sharepa. They are by no means exhaustive; for more information, see the elasticsearch and elasticsearch-dsl documentation.\n\n### Query examples\nQueries and Filters are very similar, and have many overlapping search types (e.g. 
filter by range vs query by range).\nQueries sort returned hits by relevance (using the \\_score field), while filters ignore relevance and simply find documents that match the given search criteria.\n\nFrom the Elasticsearch docs:\n\n> As a general rule, queries should be used instead of filters:\n> - for full text search\n> - where the result depends on a relevance score\n\nEx: Let's get all the documents with titles containing the word 'cell', using a regex:\n```\nmy_search = ShareSearch()  # create a search object\nmy_search = my_search.query(\n    \"regexp\",  # the first arg in a query or filter is the type of filter/query to be employed\n    title='.*cell.*'  # then come the arguments; these differ by query type, but generally: name_of_the_field_to_operate_on='argument_value'\n)\n```\n\nEx: Or we can get all documents from MIT:\n\n```\nmy_search = ShareSearch()  # create a search object\nmy_search = my_search.query(\n    \"match\",  # the first arg in a query or filter is the type of filter/query to be employed\n    sources='mit'  # then come the arguments; these differ by query type, but generally: name_of_the_field_to_operate_on='argument_value'\n)\n```\n\nFor more information on query types, see the [elasticsearch docs](https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-queries.html)\n\n### Filters\nFrom the Elasticsearch docs:\n\n> As a general rule, filters should be used instead of queries:\n> - for binary yes/no searches\n> - for queries on exact values\n\nFor more filter types, see the [Elasticsearch Filter Docs](https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-filters.html)\n\nEx: Applying a filter to a search. 
Here, results will only contain hits between 2014-01-01 and 2015-01-01:\n\n```\nmy_search = ShareSearch()  # create a search object\nmy_search = my_search.filter(  # apply a filter to the search\n    \"range\",  # apply a range-type filter\n    providerUpdatedDateTime={  # the field in the data we compare against\n        'gte': '2014-01-01',  # hits must be greater than or equal to this date and...\n        'lte': '2015-01-01'  # hits must be less than or equal to this date\n    }\n)\n```\nEx: We can add a second filter to the first; now hits must match both filters (the date range and tags that start with 'ba').\nNote: there are many ways to write filters/queries, depending on the level of abstraction you want from elasticsearch.\n\n```\n# Here is a pure elasticsearch-dsl filter\nmy_search = my_search.filter(\n    \"prefix\",\n    tags=\"ba\"\n)\n\n# Here is the same search as a mix of elasticsearch-dsl and elasticsearch, where the args are input as a dictionary a la elasticsearch\nmy_search = my_search.filter(\n    \"prefix\",\n    **{\"tags\": \"ba\"}\n)\n\n# We can also match elasticsearch syntax exactly, and input the raw dictionary into the filter method\nmy_search = my_search.filter(\n    {\n        \"prefix\": {\"tags\": \"ba\"}\n    }\n)\n```\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcenterforopenscience%2Fsharepa","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcenterforopenscience%2Fsharepa","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcenterforopenscience%2Fsharepa/lists"}