Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.


https://github.com/shaikhsajid1111/twitter-scraper-selenium

Python's package to scrape Twitter's front-end easily

Topics: automation, contribution-welcome, csv, hacktoberfest, json, open-source, pypi, python, python3, selenium, social-media, tweets, twitter, twitter-api, twitter-bot, twitter-hashtag, twitter-profile, twitter-profiles, twitter-scraper, web-scraping


README

        

Twitter scraper selenium


Python's package to scrape Twitter's front-end easily with selenium.

[![PyPI license](https://img.shields.io/pypi/l/ansicolortags.svg)](https://opensource.org/licenses/MIT) [![Python >=3.8](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/release/python-360/)
[![Maintenance](https://img.shields.io/badge/Maintained-Yes-green.svg)](https://github.com/shaikhsajid1111/facebook_page_scraper/graphs/commit-activity)

Table of Contents

  1. Getting Started
  2. Usage
  3. Privacy
  4. License


Prerequisites


  • Internet Connection

  • Python 3.6+

  • Chrome or Firefox browser installed on your machine



Installation


    Installing from the source


    Download the source code or clone it with:

    ```
    git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium
    ```

    Open a terminal inside the downloaded folder and run:


    ```
    python3 setup.py install
    ```


    Installing with PyPI

    ```
    pip3 install twitter-scraper-selenium
    ```
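
    To verify the installation, you can check that the package imports and print the installed version (a minimal sketch; importlib.metadata needs Python 3.8+, and it assumes the installed distribution name matches the PyPI name above):

    ```python
    from importlib.metadata import version

    import twitter_scraper_selenium  # the import itself is the sanity check

    print(version("twitter-scraper-selenium"))
    ```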




    Usage


    Available Functions in This Package - Summary


    | Function Name | Function Description | Scraping Method | Scraping Speed |
    |---|---|---|---|
    | scrape_profile() | Scrapes a Twitter user's profile tweets | Browser Automation | Slow |
    | get_profile_details() | Scrapes a Twitter user's details | HTTP Request | Fast |
    | scrape_profile_with_api() | Scrapes tweets by Twitter profile username. It expects the username of the profile | Browser Automation & HTTP Request | Fast |


    Note: The HTTP Request method sends requests directly to Twitter's API to collect data, while Browser Automation visits the page and scrolls through it while collecting data.
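
    If speed matters, the table above suggests preferring the HTTP-based function where it fits the task. A minimal sketch contrasting the two approaches documented in this README (values are illustrative):

    ```python
    from twitter_scraper_selenium import get_profile_details, scrape_profile

    # Fast: a plain HTTP request for the profile details
    get_profile_details(twitter_username="TwitterAPI", filename="twitter_api_data")

    # Slower: browser automation that scrolls the profile page to collect tweets
    tweets_json = scrape_profile(twitter_username="TwitterAPI", browser="firefox",
                                 tweets_count=5, output_format="json")
    print(tweets_json)
    ```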







    To scrape Twitter profile details:


    ```python
    from twitter_scraper_selenium import get_profile_details

    twitter_username = "TwitterAPI"
    filename = "twitter_api_data"
    browser = "firefox"
    headless = True
    get_profile_details(twitter_username=twitter_username, filename=filename, browser=browser, headless=headless)

    ```
    Output:
    ```js
    {
    "id": 6253282,
    "id_str": "6253282",
    "name": "Twitter API",
    "screen_name": "TwitterAPI",
    "location": "San Francisco, CA",
    "profile_location": null,
    "description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
    "url": "https:\/\/t.co\/8IkCzCDr19",
    "entities": {
    "url": {
    "urls": [{
    "url": "https:\/\/t.co\/8IkCzCDr19",
    "expanded_url": "https:\/\/developer.twitter.com",
    "display_url": "developer.twitter.com",
    "indices": [
    0,
    23
    ]
    }]
    },
    "description": {
    "urls": []
    }
    },
    "protected": false,
    "followers_count": 6133636,
    "friends_count": 12,
    "listed_count": 12936,
    "created_at": "Wed May 23 06:01:13 +0000 2007",
    "favourites_count": 31,
    "utc_offset": null,
    "time_zone": null,
    "geo_enabled": null,
    "verified": true,
    "statuses_count": 3656,
    "lang": null,
    "contributors_enabled": null,
    "is_translator": null,
    "is_translation_enabled": null,
    "profile_background_color": null,
    "profile_background_image_url": null,
    "profile_background_image_url_https": null,
    "profile_background_tile": null,
    "profile_image_url": null,
    "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
    "profile_banner_url": null,
    "profile_link_color": null,
    "profile_sidebar_border_color": null,
    "profile_sidebar_fill_color": null,
    "profile_text_color": null,
    "profile_use_background_image": null,
    "has_extended_profile": null,
    "default_profile": false,
    "default_profile_image": false,
    "following": null,
    "follow_request_sent": null,
    "notifications": null,
    "translator_type": null
    }
    ```





    get_profile_details() arguments:



    | Argument | Argument Type | Description |
    |---|---|---|
    | twitter_username | String | Twitter username |
    | output_filename | String | Filename where the output will be stored. |
    | output_dir | String | Directory where the output file will be saved. |
    | proxy | String | Optional parameter, if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
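
    A call that also saves the output to a specific directory and routes traffic through a proxy might look like this (a sketch based on the argument names in the table above; note that the earlier example used filename, and the path and proxy values are placeholders):

    ```python
    from twitter_scraper_selenium import get_profile_details

    get_profile_details(
        twitter_username="TwitterAPI",
        output_filename="twitter_api_data",
        output_dir="/home/user/Downloads",   # placeholder directory
        proxy="66.115.38.247:5678",          # IP:PORT, as in the proxy example below
    )
    ```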







    Keys of the output:
    Details of each key can be found here.







    To scrape a profile's tweets:


    In JSON format:

    ```python
    from twitter_scraper_selenium import scrape_profile

    microsoft = scrape_profile(twitter_username="microsoft",output_format="json",browser="firefox",tweets_count=10)
    print(microsoft)
    ```
    Output:
    ```javascript
    {
    "1430938749840629773": {
    "tweet_id": "1430938749840629773",
    "username": "Microsoft",
    "name": "Microsoft",
    "profile_picture": "https://twitter.com/Microsoft/photo",
    "replies": 29,
    "retweets": 58,
    "likes": 453,
    "is_retweet": false,
    "retweet_link": "",
    "posted_time": "2021-08-26T17:02:38+00:00",
    "content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
    "hashtags": [],
    "mentions": [],
    "images": [],
    "videos": [],
    "tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
    "link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
    },...
    }
    ```
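
    The returned JSON is keyed by tweet id. A minimal sketch for working with it (this assumes scrape_profile returns the JSON shown above as a string; if it already returns a dict, drop the json.loads step):

    ```python
    import json

    from twitter_scraper_selenium import scrape_profile

    raw = scrape_profile(twitter_username="microsoft", output_format="json",
                         browser="firefox", tweets_count=10)
    tweets = json.loads(raw)  # dict keyed by tweet_id

    for tweet_id, tweet in tweets.items():
        print(tweet_id, tweet["posted_time"], tweet["likes"], tweet["tweet_url"])
    ```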



    In CSV format:

    ```python
    from twitter_scraper_selenium import scrape_profile

    scrape_profile(twitter_username="microsoft",output_format="csv",browser="firefox",tweets_count=10,filename="microsoft",directory="/home/user/Downloads")

    ```

    Output:

    | tweet_id | username | name | profile_picture | replies | retweets | likes | is_retweet | retweet_link | posted_time | content | hashtags | mentions | images | videos | post_url | link |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 1430938749840629773 | Microsoft | Microsoft | https://twitter.com/Microsoft/photo | 64 | 75 | 521 | False |  | 2021-08-26T17:02:38+00:00 | Easy to use and efficient for all – Windows 11 is committed to an accessible future.<br>Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW | [] | [] | [] | [] | https://twitter.com/Microsoft/status/1430938749840629773 | https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC |
    | ... |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
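
    The call above writes the CSV into the given directory. A quick way to inspect it (a sketch assuming the file is saved as <directory>/<filename>.csv, i.e. /home/user/Downloads/microsoft.csv for the call above, and that pandas is installed):

    ```python
    import pandas as pd

    df = pd.read_csv("/home/user/Downloads/microsoft.csv")
    print(df[["tweet_id", "posted_time", "likes", "content"]].head())
    ```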





    scrape_profile() arguments:



    | Argument | Argument Type | Description |
    |---|---|---|
    | twitter_username | String | Twitter username of the account |
    | browser | String | Which browser to use for scraping. Only Chrome and Firefox are supported. Default is Firefox. |
    | proxy | String | Optional parameter, if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
    | tweets_count | Integer | Number of posts to scrape. Default is 10. |
    | output_format | String | The output format, either JSON or CSV. Default is JSON. |
    | filename | String | If output_format is set to CSV, the filename parameter should be passed. If not passed, the filename will be the same as the username. |
    | directory | String | If output_format is set to CSV, the directory parameter may be passed. If not passed, the CSV file will be saved in the current working directory. |
    | headless | Boolean | Whether to run the crawler headlessly. Default is True. |
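
    Putting several of these arguments together (a sketch with illustrative values):

    ```python
    from twitter_scraper_selenium import scrape_profile

    data = scrape_profile(
        twitter_username="microsoft",
        browser="chrome",        # "chrome" or "firefox"
        tweets_count=25,
        output_format="json",
        headless=True,
    )
    print(data)
    ```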







    Keys of the output



    | Key | Type | Description |
    |---|---|---|
    | tweet_id | String | Post identifier (an integer cast to a string) |
    | username | String | Username of the profile |
    | name | String | Name of the profile |
    | profile_picture | String | Profile picture link |
    | replies | Integer | Number of replies to the tweet |
    | retweets | Integer | Number of retweets of the tweet |
    | likes | Integer | Number of likes of the tweet |
    | is_retweet | Boolean | Is the tweet a retweet? |
    | retweet_link | String | If it is a retweet, the retweet link; otherwise an empty string |
    | posted_time | String | Time when the tweet was posted, in ISO 8601 format |
    | content | String | Content of the tweet as text |
    | hashtags | Array | Hashtags present in the tweet, if any |
    | mentions | Array | Mentions present in the tweet, if any |
    | images | Array | Image links, if any are present in the tweet |
    | videos | Array | Video links, if any are present in the tweet |
    | tweet_url | String | URL of the tweet |
    | link | String | Any external website link present inside the tweet |
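
    These keys make simple post-processing straightforward, for example keeping only tweets that carry media and summing their likes (a sketch; as above, it assumes the JSON output is returned as a string):

    ```python
    import json

    from twitter_scraper_selenium import scrape_profile

    raw = scrape_profile(twitter_username="microsoft", output_format="json",
                         browser="firefox", tweets_count=10)
    tweets = json.loads(raw)

    media_tweets = {tid: t for tid, t in tweets.items() if t["images"] or t["videos"]}
    total_likes = sum(t["likes"] for t in media_tweets.values())
    print(f"{len(media_tweets)} tweets with media, {total_likes} likes in total")
    ```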






    To scrape a profile's tweets with the API:

    ```python
    from twitter_scraper_selenium import scrape_profile_with_api

    scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)
    ```





    scrape_profile_with_api() Arguments:






    | Argument | Argument Type | Description |
    |---|---|---|
    | username | String | Twitter profile username |
    | tweets_count | Integer | Number of tweets to scrape. |
    | output_filename | String | Filename where the output will be stored. |
    | output_dir | String | Directory where the output file will be saved. |
    | proxy | String | Optional parameter, if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
    | browser | String | Which browser to use for extracting the GraphQL key. Default is firefox. |
    | headless | Boolean | Whether to run the browser in headless mode. |




    Output:


    ```js
    {
    "1608939190548598784": {
    "tweet_url" : "https://twitter.com/elonmusk/status/1608939190548598784",
    "tweet_details":{
    ...
    },
    "user_details":{
    ...
    }
    }, ...
    }
    ```
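
    A call that also sets the output directory, browser, and headless mode (a sketch using the arguments documented above; the path is a placeholder):

    ```python
    from twitter_scraper_selenium import scrape_profile_with_api

    scrape_profile_with_api(
        "elonmusk",
        tweets_count=100,
        output_filename="musk",
        output_dir="/home/user/Downloads",   # placeholder directory
        browser="firefox",
        headless=True,
    )
    ```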






    Using the scraper with a proxy (HTTP proxy)


    Just pass the proxy argument to the function.

    ```python
    from twitter_scraper_selenium import scrape_profile

    scrape_profile("elonmusk", headless=False, proxy="66.115.38.247:5678", output_format="csv",filename="musk") #In IP:PORT format

    ```




    Proxy that requires authentication:

    ```python

    from twitter_scraper_selenium import scrape_profile

    microsoft_data = scrape_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
    proxy="sajid:[email protected]:5678")  # username:password@IP:PORT
    print(microsoft_data)

    ```
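
    If the proxy credentials should not live in the source file, the proxy string can be assembled at runtime, for example from environment variables (a sketch; the variable names are hypothetical and the format follows the username:password@host:port convention above):

    ```python
    import os

    from twitter_scraper_selenium import scrape_profile

    # PROXY_USER, PROXY_PASS, PROXY_HOST and PROXY_PORT are hypothetical env vars
    proxy = "{user}:{password}@{host}:{port}".format(
        user=os.environ["PROXY_USER"],
        password=os.environ["PROXY_PASS"],
        host=os.environ["PROXY_HOST"],
        port=os.environ["PROXY_PORT"],
    )

    data = scrape_profile(twitter_username="microsoft", tweets_count=10,
                          output_format="json", proxy=proxy)
    print(data)
    ```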







    Privacy


    This scraper only scrapes public data available to an unauthenticated user and does not have the capability to scrape anything private.








    LICENSE

    MIT