{"id":13611654,"url":"https://github.com/sokomishalov/skraper","last_synced_at":"2025-05-16T06:07:01.042Z","repository":{"id":36980593,"uuid":"234099582","full_name":"sokomishalov/skraper","owner":"sokomishalov","description":"Kotlin/Java library and cli tool for scraping posts and media from various sources with neither authorization nor full page rendering (Facebook, Instagram, Twitter, Youtube, Tiktok, Telegram, Twitch, Reddit, 9GAG, Pinterest, Flickr, Tumblr, Coub, Vimeo, IFunny, VK, Odnoklassniki, Pikabu)","archived":false,"fork":false,"pushed_at":"2025-03-28T03:12:33.000Z","size":1377,"stargazers_count":282,"open_issues_count":9,"forks_count":42,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-04-08T16:06:36.627Z","etag":null,"topics":["9gag","coub","facebook","flickr","ifunny","instagram","odnoklassniki","pikabu","pinterest","reddit","scraper","telegram","tiktok","tumblr","twitch","twitter","vimeo","vk","youtube"],"latest_commit_sha":null,"homepage":"","language":"Kotlin","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sokomishalov.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-15T14:35:23.000Z","updated_at":"2025-03-31T22:53:41.000Z","dependencies_parsed_at":"2024-06-12T09:56:31.807Z","dependency_job_id":"83c366ec-d75f-4912-8e8a-ed0fa8b174c8","html_url":"https://github.com/sokomishalov/skraper","commit_stats":{"total_commits":629,"total_committers":11,"mean_commits":57.18181818181818,"dds":0.65818759936407,"last_synced_commit":"05d318581f513dc7ed7d19eb1b0422ad992a1c9c"},"pr
evious_names":[],"tags_count":54,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sokomishalov%2Fskraper","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sokomishalov%2Fskraper/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sokomishalov%2Fskraper/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sokomishalov%2Fskraper/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sokomishalov","download_url":"https://codeload.github.com/sokomishalov/skraper/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254478190,"owners_count":22077676,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["9gag","coub","facebook","flickr","ifunny","instagram","odnoklassniki","pikabu","pinterest","reddit","scraper","telegram","tiktok","tumblr","twitch","twitter","vimeo","vk","youtube"],"created_at":"2024-08-01T19:01:59.661Z","updated_at":"2025-05-16T06:06:56.004Z","avatar_url":"https://github.com/sokomishalov.png","language":"Kotlin","funding_links":[],"categories":["Kotlin"],"sub_categories":[],"readme":"Skraper\n========\n~~Here should be some fancy logo~~\n\n[![Awesome Kotlin Badge](https://kotlin.link/awesome-kotlin.svg)](https://github.com/KotlinBy/awesome-kotlin)\n[![Apache License 
2](https://img.shields.io/badge/license-ASF2-blue.svg)](https://choosealicense.com/licenses/apache-2.0/)\n[![](https://img.shields.io/maven-central/v/ru.sokomishalov.skraper/skrapers)](https://mvnrepository.com/artifact/ru.sokomishalov.skraper/skrapers)\n[![](https://img.shields.io/jitpack/v/github/sokomishalov/skraper)](https://jitpack.io/#sokomishalov/skraper)\n\n# Overview\n\nKotlin/Java library and cli tool which allows scraping and downloading posts, attachments, other meta from more than 10\nsources without any authorization or full page rendering. Based on jsoup, jackson and kotlin-coroutines.\n\nRepository contains:\n\n- [Cli tool](#cli-tool)\n- [Kotlin library](#kotlin-library)\n- [Telegram bot](#telegram-bot)\n\nCurrent list of implemented sources:\n\n- [Facebook](https://facebook.com)\n- [Instagram](https://instagram.com)\n- [Twitter](https://twitter.com)\n- [Youtube](https://youtube.com)\n- [TikTok](https://tiktok.com)\n- [Telegram](https://t.me)\n- [Twitch](https://twitch.tv)\n- [Reddit](https://reddit.com)\n- [9GAG](https://9gag.com)\n- [Pinterest](https://pinterest.com)\n- [Flickr](https://flickr.com)\n- [Tumblr](https://tumblr.com)\n- [Vimeo](https://vimeo.com)\n- [IFunny](https://ifunny.co)\n- [Coub](https://coub.com)\n- [VK](https://vk.com)\n- [Odnoklassniki](https://ok.ru)\n- [Pikabu](https://pikabu.ru)\n\n# Bugs\n\nUnfortunately, each web-site is subject to change without any notice, so the tool may work incorrectly because of that.\nIf that happens, please let me know via an issue.\n\n# Cli tool\n\nCli tool allows to:\n\n- download media with flag `--media-only` from almost all presented sources.\n- scrape posts meta information\n\nRequirements:\n\n- Java: 1.8 +\n- Maven (optional)\n\nBuild tool\n\n```bash\n./mvnw clean package -DskipTests=true \n```\n\nUsage:\n\n```bash\n./skraper --help\n```\n\n```text\nusage: [-h] PROVIDER PATH [-n LIMIT] [-t TYPE] [-o OUTPUT] [-m]\n       [--parallel-downloads PARALLEL_DOWNLOADS]\n\noptional arguments:\n  
-h, --help                                show this help message and exit\n\n  -n LIMIT, --limit LIMIT                   posts limit (50 by default)\n\n  -t TYPE, --type TYPE                      output type, options: [log, csv, json, xml, yaml]\n\n  -o OUTPUT, --output OUTPUT                output path\n\n  -m, --media-only                          scrape media only\n\n  --parallel-downloads PARALLEL_DOWNLOADS   amount of parallel downloads for media items if\n                                            enabled flag --media-only (4 by default)\n\n\npositional arguments:\n  PROVIDER                                  skraper provider, options: facebook, instagram,\n                                            twitter, youtube, tiktok, telegram, twitch, reddit,\n                                            9gag, pinterest, flickr, tumblr, ifunny, vk, pikabu,\n                                            vimeo, odnoklassniki, coub\n\n  PATH                                      path to user/community/channel/topic/trend\n```\n\nExamples:\n\n```bash\n./skraper 9gag /hot \n./skraper reddit /r/memes -n 5 -t csv -o ./reddit/posts\n./skraper instagram /explore/tags/memes -t json\n./skraper flickr /photos/harrythehawk -t yaml\n./skraper pinterest /levato/meme -t xml\n./skraper youtube /user/JetBrainsTV/videos --media-only -n 2\n```\n\n# Kotlin Library\n\n## Distribution\n\nMaven:\n\n```xml\n\u003cdependency\u003e\n    \u003cgroupId\u003eru.sokomishalov.skraper\u003c/groupId\u003e\n    \u003cartifactId\u003eskrapers\u003c/artifactId\u003e\n    \u003cversion\u003ex.y.z\u003c/version\u003e\n\u003c/dependency\u003e\n```\n\nGradle kotlin dsl:\n\n```kotlin\nimplementation(\"ru.sokomishalov.skraper:skrapers:x.y.z\")\n```\n\n## Usage\n\n### Instantiate specific scraper\n\nAs mentioned before, the provider implementation list is:\n\n- [FacebookSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/facebook/FacebookSkraper.kt)\n- 
[InstagramSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/instagram/InstagramSkraper.kt)\n- [TwitterSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/twitter/TwitterSkraper.kt)\n- [YoutubeSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/youtube/YoutubeSkraper.kt)\n- [TikTokSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/tiktok/TikTokSkraper.kt)\n- [TelegramSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/telegram/TelegramSkraper.kt)\n- [TwitchSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/twitch/TwitchSkraper.kt)\n- [RedditSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/reddit/RedditSkraper.kt)\n- [NinegagSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/ninegag/NinegagSkraper.kt)\n- [PinterestSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/pinterest/PinterestSkraper.kt)\n- [FlickrSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/flickr/FlickrSkraper.kt)\n- [TumblrSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/tumblr/TumblrSkraper.kt)\n- [VimeoSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/vimeo/VimeoSkraper.kt)\n- [IFunnySkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/ifunny/IFunnySkraper.kt)\n- [CoubSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/coub/CoubSkraper.kt)\n- [VkSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/vk/VkSkraper.kt)\n- [OdnoklassnikiSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/odnoklassniki/OdnoklassnikiSkraper.kt)\n- [PikabuSkraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/provider/pikabu/PikabuSkraper.kt)\n\nAfter that usage as simple as is:\n\n```kotlin\nval skraper = InstagramSkraper(client = OkHttpSkraperClient())\n```\n\n**Important moment:** it is highly recommended to not\nuse 
[DefaultBlockingSkraperClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/jdk/DefaultBlockingSkraperClient.kt)\n. There are some more efficient, non-blocking and resource-friendly implementations\nfor [SkraperClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/SkraperClient.kt). To use them you just have to put\nrequired dependencies in the classpath.\n\nCurrent http-client implementation list:\n\n- [DefaultBlockingClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/jdk/DefaultBlockingSkraperClient.kt):\n  simple java.net.* blocking api implementation\n- [OkHttpSkraperClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/okhttp/OkHttpSkraperClient.kt): [okhttp3](https://mvnrepository.com/artifact/com.squareup.okhttp3/okhttp)\n  implementation\n- [SpringReactiveSkraperClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/spring/SpringReactiveSkraperClient.kt): [spring-webflux client](https://mvnrepository.com/artifact/org.springframework/spring-webflux)\n  implementation\n- [KtorSkraperClient](skrapers/src/main/kotlin/ru/sokomishalov/skraper/client/ktor/KtorSkraperClient.kt): [ktor-client-jvm](https://mvnrepository.com/artifact/io.ktor/ktor-client-core-jvm)\n  implementation\n\n### Available methods\n\nEach scraper is a class which implements [Skraper](skrapers/src/main/kotlin/ru/sokomishalov/skraper/Skraper.kt)\ninterface:\n\n```kotlin\ninterface Skraper {\n    val client: SkraperClient\n    fun getPosts(path: String): Flow\u003cPost\u003e\n    suspend fun getPageInfo(path: String): PageInfo?\n    fun supports(media: Media): Boolean\n    suspend fun resolve(media: Media): Media\n}\n```\n\nAlso, there are some provider-specific kotlin extensions for implementations. 
You can find them in the provider\nimplementation package.\n\n### Usage from plain Java\nThere is an out-of-the-box Java interop utility class `ru.sokomishalov.skraper.util.JavaInterop`:\n```java\nclass Example {\n    public static void main(String[] args) {\n      Skraper skraper = new InstagramSkraper();\n      List\u003cPost\u003e posts = JavaInterop.limitedFlow(skraper.getPosts(\"/memes.video\"), 10);\n      PageInfo info = JavaInterop.callBlocking(cont -\u003e skraper.getPageInfo(\"/memes.video\", cont));\n    }\n}\n```\n\n### Scrape user/community/channel/topic/trend posts\n\nTo scrape the latest posts for a specific user, channel or trend, use skraper like this:\n\n```kotlin\nsuspend fun main() {\n    val skraper = FacebookSkraper()\n    val posts = skraper.getUserPosts(username = \"memes\").take(2).toList() // extension for getPosts()\n    // or \n    val postsDetected = Skrapers.getPosts(url = \"https://facebook.com/memes\") // aggregating singleton\n    println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(posts))\n}\n```\n\nThe returned data structure is similar across providers. 
Output data example:\n\n```json5\n[\n  {\n    \"id\": \"5029851093699104\",\n    \"text\": \"gotta love em!\",\n    \"publishedAt\": 1580744400000,\n    \"statistics\": {\n      \"likes\": 79,\n      \"comments\": 3\n    },\n    \"media\": [\n      {\n        \"url\": \"https://facebook.com/memes/posts/5029851093699104?__xts__%5B0%5D=68.ARA2yRI2YnlXQRKX7Pdphh8ztgvnP11aYE_bZFPNmqLpJZLhwJaG24gDPUTiKDLv-J_E09u2vLjCXalpmEuGSmVR0BkVtcng_i6QV8x5e-aZUv0Mkn1wwKLlhp5NNH6zQWKlqDqRjZrwvcKeUi0unzzulRCHRvDIrbz2leM6PLescFySwMYbMmKFc7ctqaC_F7nJ09Ya0lz9Pqaq_Rh6UsNKom6fqdgHAuoHV894a3QRuyY0BC6fQuXZLOLbRIfEVK3cF9Z5UQiXUYruCySF-WpQEV0k72x6DIjT6B3iovYFnBGHaji9VAx2PByZ-MDs33D1Hz96Mk-O1Pj7zBwO6FvXGhkUJgepiwUOVd0q-pV83rS5EhjtPFDylNoNO2xkDUSIi483p49vumVPWtmab8LX1V6w2anf55kh6pedCXcH3D8rBjz8DaTBnv995u9kk5im-1-HdAGQHyKrCZpaA0QyC-I4oGsCoIJGck3RO8u_SoHcfe2tKjTgPe6j9p1D\u0026__tn__=-R\",\n        \"aspectRatio\": 0.864,\n        \"duration\": 10860.000000000\n      }\n    ]\n  },\n  {\n    \"id\": \"4990218157662398\",\n    \"text\": \"Interesting\",\n    \"publishedAt\": 1580742000000,\n    \"statistics\": {\n      \"likes\": 3092,\n      \"comments\": 514\n    },\n    \"media\": [\n      {\n        \"url\": \"https://scontent.fhrk1-1.fna.fbcdn.net/v/t1.0-0/p526x296/52333452_10157743612509879_529328953723191296_n.png?_nc_cat=1\u0026_nc_ohc=oNMb8_mCbD8AX-w9zeY\u0026_nc_ht=scontent.fhrk1-1.fna\u0026oh=ca8a719518ecfb1a24f871282b860124\u0026oe=5E910D0C\",\n        \"aspectRatio\": 0.8960573476702509\n      }\n    ]\n  }\n]\n```\n\nYou can see the full model structure for posts and others [here](skrapers/src/main/kotlin/ru/sokomishalov/skraper/model)\n\n### Scrape user/community/channel/topic/trend info\n\nIt is possible to scrape user/channel/trend info for some purposes:\n\n```kotlin\nsuspend fun main() {\n    val skraper = TwitterSkraper()\n    val pageInfo = skraper.getUserInfo(username = \"memes\") // extension for `getPageInfo()`\n    // or \n    val pageInfoDetected = Skrapers.getPageInfo(url 
= \"https://twitter.com/memes\") // aggregating singleton\n    println(JsonMapper().writerWithDefaultPrettyPrinter().writeValueAsString(pageInfo))\n}\n```\n\nOutput:\n\n```json5\n{\n  \"nick\": \"memes\",\n  \"name\": \"Memes.com\",\n  \"description\": \"http://memes.com is your number one website for the funniest content on the web. You will find funny pictures, funny memes and much more.\",\n  \"statistics\": {\n    \"posts\": 10848,\n    \"followers\": 154718\n  },\n  \"avatar\": {\n    \"url\": \"https://pbs.twimg.com/profile_images/824808708332941313/mJ4xM6PH_normal.jpg\"\n  },\n  \"cover\": {\n    \"url\": \"https://abs.twimg.com/images/themes/theme1/bg.png\"\n  }\n}\n```\n\n### Resolve provider relative url\n\nSometimes you need the direct media link:\n\n```kotlin\nsuspend fun main() {\n    val skraper = InstagramSkraper()\n    val info = skraper.resolve(Video(url = \"https://www.instagram.com/p/B-flad2F5o7/\"))\n    val serializer = JsonMapper().writerWithDefaultPrettyPrinter()\n    println(serializer.writeValueAsString(info))\n}\n```\n\nOutput:\n\n```json5\n{\n  \"url\": \"https://scontent-amt2-1.cdninstagram.com/v/t50.2886-16/91508191_213297693225472_2759719910220905597_n.mp4?_nc_ht=scontent-amt2-1.cdninstagram.com\u0026_nc_cat=104\u0026_nc_ohc=27bC52qar_oAX-7J2Zh\u0026oe=5EC0BC52\u0026oh=0aafee2860c540452b76e7b8e336147d\",\n  \"aspectRatio\": 0.8010012515644556,\n  \"thumbnail\": {\n    \"url\": \"https://scontent-amt2-1.cdninstagram.com/v/t51.2885-15/e35/91435498_533808773845524_5302421141680378393_n.jpg?_nc_ht=scontent-amt2-1.cdninstagram.com\u0026_nc_cat=100\u0026_nc_ohc=8gPAcByc6YAAX_kDBWm\u0026oh=5edf6b9d90d606f9c0e055b7dbcbfa45\u0026oe=5EC0DDE8\",\n    \"aspectRatio\": 0.8010012515644556\n  }\n}\n```\n\n### Download media\n\nThere is a \"static\" method that downloads media from any of the implemented sources:\n\n```kotlin\nsuspend fun main() {\n    val tmpDir = Files.createTempDirectory(\"skraper\").toFile()\n\n    val testVideo = 
Skrapers.download(\n        media = Video(\"https://youtu.be/fjUO7xaUHJQ\"),\n        destDir = tmpDir,\n        filename = \"Gandalf\"\n    )\n\n    val testImage = Skrapers.download(\n        media = Image(\"https://www.pinterest.ru/pin/89509111320495523/\"),\n        destDir = tmpDir,\n        filename = \"Do_no_harm\"\n    )\n\n    println(testVideo)\n    println(testImage)\n}\n```\n\nOutput:\n\n```log\n/var/folders/sf/hm2h5chx5fl4f70bj77xccsc0000gp/T/skraper8377953374796527777/Gandalf.mp4\n/var/folders/sf/hm2h5chx5fl4f70bj77xccsc0000gp/T/skraper8377953374796527777/Do_no_harm.jpg\n```\n\n# Telegram bot\n\nTo use the bot follow the [link](https://t.me/SkraperBot).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsokomishalov%2Fskraper","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsokomishalov%2Fskraper","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsokomishalov%2Fskraper/lists"}