{"id":25158294,"url":"https://github.com/gbburleigh/upworkscraper","last_synced_at":"2025-04-30T10:42:49.118Z","repository":{"id":245607131,"uuid":"412604879","full_name":"gbburleigh/UpworkScraper","owner":"gbburleigh","description":"A library for scraping and monitoring forum activity for Upwork.com, to observe patterns of censorship by moderators.","archived":false,"fork":false,"pushed_at":"2021-10-05T19:28:52.000Z","size":228,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-06-23T01:38:54.198Z","etag":null,"topics":["python","selenium","sql","sqlite","upwork","webscraping"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gbburleigh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-10-01T20:11:23.000Z","updated_at":"2024-06-23T01:38:57.998Z","dependencies_parsed_at":"2024-06-23T01:55:20.817Z","dependency_job_id":null,"html_url":"https://github.com/gbburleigh/UpworkScraper","commit_stats":null,"previous_names":["gbburleigh/upworkscraper"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gbburleigh%2FUpworkScraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gbburleigh%2FUpworkScraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gbburleigh%2FUpworkScraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gbburleigh%2F
UpworkScraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gbburleigh","download_url":"https://codeload.github.com/gbburleigh/UpworkScraper/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":237894308,"owners_count":19383167,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["python","selenium","sql","sqlite","upwork","webscraping"],"created_at":"2025-02-09T01:49:37.985Z","updated_at":"2025-02-09T01:49:38.503Z","avatar_url":"https://github.com/gbburleigh.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# UpworkScraper\n\nDataset of threads, posts, and users taken from Upwork's community forum, along with the scraper used to aggregate the data.\n\n\u003ch1\u003eAnalyzing Data\u003c/h1\u003e\n\nDataset is written to a sqlite database that can be queried directly for further analysis. Python's sqlite package offers a direct interface that we can leverage to pull specific queries with. \n\n\u003cul\u003e\n  \u003cli\u003eFirst, place upwork.db in the working directory. Import the sqlite3 package, and any desired data management packages (for this example, we'll write to .csv). Make a connection to the database after importing sqlite3.\n    '''\n    import sqlite3\n    \n    conn = sqlite3.connect('upwork.db')\n    cur = conn.cursor()\n    '''\n  \u003c/li\u003e\n  \u003cli\u003e\u003ci\u003ecur\u003c/i\u003e allows us to make queries to the database. 
We'll start by fetching posts from threads with the title 'New to Upwork'.\n  '''\n  import csv\n  \n  #Instantiate our writer and query\n  query = 'New to Upwork'\n  \n  #Track fetched tids to avoid double counting\n  done = []\n  with open(f'{query} Query Results.csv', 'w', newline='') as fp:\n        writer = csv.writer(fp)\n        writer.writerow([\"Thread Title\", \"URL\", \"Post Author\", \"Post Author Rank\", \"Post Date\", \"Post Text\", \"Edit Status\"])\n        \n        #Optionally wrap the query in wildcards to match titles that contain it\n        #query = '%' + query + '%'\n        \n        #Make our query\n        cur.execute(\"SELECT title, thread_url, tid FROM threads WHERE title LIKE ?\", (query, ))\n  '''\n  Note that the query can be wrapped in additional '%' wildcards, as in the commented-out line, so that titles containing the text also match (e.g. with wildcards, \"I'm New to Upwork\" would be a match). See \u003ca href=https://www.sqlitetutorial.net/sqlite-like/\u003ehere\u003c/a\u003e for further documentation of the LIKE clause. Also note that variables should not be formatted directly into a sqlite query string in Python; instead, each variable is marked with a ? placeholder, and the values are passed separately in a tuple. So in our \u003ci\u003ecur.execute\u003c/i\u003e call, the first argument is our query string and the second argument is the tuple containing the value to substitute, (query, ). \n  \u003c/li\u003e\n  \n  \u003cli\u003eAfter making the query, we can fetch the data from the cursor and put it in a variable. 
The columns returned, and their order, depend on the SELECT statement you write.\n    '''\n    #Every thread we found that matches our query\n    rows = cur.fetchall()\n  \n    #For each thread\n    for row in rows:\n        #Fetch the thread info so we can get its posts\n        thread_title = row[0]\n        thread_url = row[1]\n        tid = row[2]\n    '''\n\u003c/li\u003e\n\u003cli\u003eWe'll apply the same techniques to fetch all of the posts and users for each thread.\n  '''\n  if tid not in done:\n      cur.execute(\"SELECT message_text, author_id, post_date, edit_status FROM posts p WHERE p.tid=?\", (tid, ))\n      post_rows = cur.fetchall()\n      for post in post_rows:\n          text = post[0]\n          aid = post[1]\n          date = post[2]\n          edit_status = post[3]\n          cur.execute(\"SELECT user_name, user_rank FROM users WHERE uid=?\", (aid, ))\n          info = cur.fetchall()[0]\n          author = info[0]\n          rank = info[1]\n          writer.writerow([thread_title, thread_url, author, rank, date, text, edit_status])\n      done.append(tid)\n  '''\n\u003c/li\u003e\n \u003c/ul\u003e\n Other options include pandas or native Python dictionaries (not recommended) for data storage, which can be applied similarly to the csv writer in this example. \n \n \u003ch2\u003eDatabase Schema\u003c/h2\u003e\n \n In order to relate entries in each of the three tables (\u003ci\u003ethreads\u003c/i\u003e, \u003ci\u003eposts\u003c/i\u003e, \u003ci\u003eusers\u003c/i\u003e), the database uses a system of ids for all entries, each hashed from the information stored with that entry. As shown above, the id of a given element can be used to find associated entries in other tables. You can search for posts or threads made by a given user id, look for all posts under a given thread id, or count the number of threads made in a given category. 
\n \n \u003cimg src=dbSchema.png\u003e\n\n\u003ch3\u003eAdditional Queries\u003c/h3\u003e\n\nTimestamps can be compared to tally posts falling in some range.\n\nSELECT COUNT(*) FROM posts p WHERE post_date \u003e '2021-03-01' AND post_date \u003c '2021-03-30';\n\nSELECT COUNT(*) FROM posts p WHERE post_date \u003e '2021-03-01' AND post_date \u003c '2021-03-30' AND edit_status!='Unedited';\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgbburleigh%2Fupworkscraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgbburleigh%2Fupworkscraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgbburleigh%2Fupworkscraper/lists"}