{"id":15308592,"url":"https://github.com/rexshijaku/scrapepilot","last_synced_at":"2026-04-15T09:31:29.512Z","repository":{"id":225773193,"uuid":"763168485","full_name":"rexshijaku/ScrapePilot","owner":"rexshijaku","description":"This library helps to scrape web content in a guided and user-friendly way. It aims to simplify repetitive and complex scraping tasks by combining and abstracting the functionality offered by HTMLAgilityPack and Selenium with the extended C# I/O tasks.","archived":false,"fork":false,"pushed_at":"2024-03-18T20:10:56.000Z","size":8567,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-06-03T21:31:12.593Z","etag":null,"topics":["dotnet","htmlagilitypack","scraper","scraping","selenium","userfriendly"],"latest_commit_sha":null,"homepage":"","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/rexshijaku.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-25T18:23:26.000Z","updated_at":"2024-12-30T14:45:05.000Z","dependencies_parsed_at":"2025-02-01T18:42:59.545Z","dependency_job_id":"544a0f8e-00a4-439c-9910-332f41f98fee","html_url":"https://github.com/rexshijaku/ScrapePilot","commit_stats":null,"previous_names":["rexshijaku/scrapepilot"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/rexshijaku/ScrapePilot","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rexshijaku%2FScrapePilot","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rexshij
aku%2FScrapePilot/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rexshijaku%2FScrapePilot/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rexshijaku%2FScrapePilot/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/rexshijaku","download_url":"https://codeload.github.com/rexshijaku/ScrapePilot/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/rexshijaku%2FScrapePilot/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31834504,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-15T07:17:56.427Z","status":"ssl_error","status_checked_at":"2026-04-15T07:17:30.007Z","response_time":63,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dotnet","htmlagilitypack","scraper","scraping","selenium","userfriendly"],"created_at":"2024-10-01T08:17:09.508Z","updated_at":"2026-04-15T09:31:29.493Z","avatar_url":"https://github.com/rexshijaku.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ScrapePilot\n\nThis library helps to scrape web content in a guided and user-friendly way. 
It aims to simplify repetitive and complex scraping tasks by combining and abstracting the functionality offered by HTMLAgilityPack and Selenium with the extended C# I/O tasks.\n\n## How it works\nScrape Pilot uses a JSON file as an input, which contains a list of recipes that guides what Pilot should do, and each recipe has its instructions. An Instruction represents a single unit of task and is made of Arguments. You add multiple Recipes in case you need a \u003ca href='https://github.com/rexshijaku/ScrapePilot/blob/master/ScrapePilot/Constants/RecipeDriverType.cs'\u003eDriver\u003c/a\u003e change. The recipe examples can be found in the \u003ca href='https://github.com/rexshijaku/ScrapePilot/tree/master/ScrapePilot/Examples'\u003eExamples\u003c/a\u003e folder. Besides, this repo also provides a \u003ca href='http://152.70.176.144/'\u003etool\u003c/a\u003e (\u003ca href='https://github.com/rexshijaku/ScrapePilot/tree/master/ScrapePilot.Client'\u003ethe ScrapePilot.Client Project\u003c/a\u003e) which alleviates the recipe creation process.\n\n\u003ci\u003eThis documentation serves as a broad introduction to Recipes, offering guidance on their creation and outlining their key components. For a more in-depth understanding and to explore specific functionalities, we suggest delving into the code. 
\u003c/i\u003e\n\n### Usage\n\n##### Install by manual download\nDownload the repository and add ScrapePilot as a Project Reference to your project.\n\n##### NuGet\nYou can also install it from NuGet by running the following command:\n```shell\ndotnet add package ScrapePilot\n```\n\n##### Simple example\n```csharp\n using System.Collections.Generic; // List\u003cT\u003e\n using System.Text.Json; // JsonSerializer\n \n var app = new ScrapePilot.App(); // Create an instance of ScrapePilot\n \n string recipeJSON = \"\u003cjson string here\u003e\"; // Set the Recipe JSON here\n \n List\u003cList\u003cstring\u003e\u003e table; // The Output is expected to be a list of lists (matrix | table)\n \n ProcessResponse result = app.ProcessRecipe(recipeJSON); // Process the Recipe \n \n // Handle the output\n if (result.Type == ScrapePilot.Constants.OutputType.TABLE_DATA_JSON)\n {\n     // All as expected; now deserialize the result\n     table = JsonSerializer.Deserialize\u003cList\u003cList\u003cstring\u003e\u003e\u003e(result.Value);\n }\n else\n {\n     // Something went wrong\n }\n```\n### appsettings.json\nThe App constructor of Scrape Pilot may take an IConfigurationSection as an optional parameter. Please refer to \u003ca href='https://github.com/rexshijaku/ScrapePilot/blob/master/ScrapePilot/AppConfiguration.cs'\u003e AppConfiguration.cs \u003c/a\u003e to see which properties you may use in your appsettings.json for Scrape Pilot.\n\n### The Main Recipe Structure\n\nThe Main Recipe is the recipe that contains subrecipes. It has a global store and variables. \n\n#### The Store\nThe store is in-memory storage that holds the results of the storable instructions. In technical terms, it is an implementation of a key-value C# Dictionary\u003cString, String\u003e, where the key corresponds to the 'name' property of the 'store' object. Once a value is stored, it can be used in subsequent steps throughout the script. 
However, certain instructions, such as basic tab changes, are void and do not need their results stored.\n\n#### A Variable\nA variable is an element that can be declared and used within the main JSON file; its name is prefixed with a hash (#). Once a variable is introduced in a step, it can be used in subsequent steps. In the example below, '#a_variable_example' is first introduced to store an instruction result and is later used as part of the output value. Note that if a name is written without the hash (#), it won't be stored in The Store and therefore won't be available when referred to.\n\n#### The General Structure\nThe Main Recipe contains multiple Recipe items and gives a single Output at the end of the whole process. An example of the Main Recipe Structure is shown below:\n```js\n// An Example of the Main Recipe\n{\n  \"recipes\": [\n    { // #Recipe1 \n      \"use\": {\n       ... // #Recipe1 Options \n      },\n      \"instructions\": [\n       ... //  #Recipe1 Instructions\n       // Instruction 1.\n       {\n          // Instruction 1. Type\n          // Instruction 1. Arguments\n          \"store\": {\n            \"name\": \"#a_variable_example\"\n          }\n        }\n      ]\n    }\n    ... 
// The following Recipe (Should be Added Only If Driver Change is Needed)\n  ],\n  \"output\": {\n      \"type\": \"\", // An Output type from Constants.OutputType\n      \"value\":  [ \"#a_variable_example\" ] // The Specified Output\n      // Note that #a_variable_example is Created/Stored during the execution of Instructions\n  }\n}\n```\n\nYou can override appsettings.json properties inside the recipe file as shown below:\n\n```js\n{\n   \"recipes\": [],\n   \"configs\": { // Override appsettings.json Configuration\n      \"OutputPath\": \"some_path\",\n      \"Verbose\": false\n   },\n   \"output\": {}\n}\n```\n\n#### A Single Recipe Structure\nThe Main Components of a single Recipe are:\n- [Use] - the Options Specified for a Recipe\n- [Instructions] - a List of Instructions or Steps that the Recipe should follow\n\n#### [Use]\nThis component specifies the Driver and its Configuration. The possible Driver Types are located in \u003ca href='https://github.com/rexshijaku/ScrapePilot/blob/master/ScrapePilot/Constants/RecipeDriverType.cs'\u003eConstants.RecipeDriverType.cs\u003c/a\u003e, and configuration properties for each Driver type are separately specified in \u003ca href='https://github.com/rexshijaku/ScrapePilot/tree/master/ScrapePilot/Models/Configs'\u003e Models.Configs \u003c/a\u003e.\n\n```js\n\"use\": {\n   \"driver\": \"HTMLap\", // The Driver Type Used in the Recipe is HtmlAgilityPack\n   \"configs\": {  // The Driver Configuration\n    \"EncodingName\": \"utf-8\"\n   }\n }\n```\n\n#### [Instructions]\nA List of Instructions or Steps that each Recipe should follow.\n```js\n// An Example of a single Instruction, \n// Its result is stored in a variable called 'extracted_file_url' \n{\n  \"type\": \"extract_attr\",\n  \"arguments\": {\n    \"From\": \"//a[text() = 'The element which downloads a file.']\",\n    \"Attr\": \"href\",\n    \"Constraints\": [ \"must_be_csv_file\" ]\n  },\n  \"store\": {\n    \"name\": 
\"#extracted_file_url\"\n  }\n}\n```\n\n#### [Instruction] Type\nThere are different \u003ca href='https://github.com/rexshijaku/ScrapePilot/tree/master/ScrapePilot/Constants/InstructionType'\u003eInstruction Types\u003c/a\u003e for each Driver.\n\n#### [Instruction] Arguments\nEach Instruction Type has different arguments, in the code you can see the Instruction Properties (which are the arguments) inside the Type of Instruction that you want to use. The previous example used the arguments for the ExtractAttr Instruction which resides in the Models.Instruction.Selenium namespace.\n\nName | Description \n--- | --- \nFrom | The Element which contains the Attribute\nAttr | The Attribute from which the value is extracted\nConstraints | The Conditions that the resulting value of the Attribute must respect\n\n#### [Instruction] Value\nSome Instruction Types may have Arguments that use predefined constant values, which means that they can't accept variables or hand-written values, these instruction types can be found in Constants.InstructionValue.{DriverName} folders. An example of this can be the \u003ca href='https://github.com/rexshijaku/ScrapePilot/blob/master/ScrapePilot/Models/Instruction/Selenium/SwitchTab.cs'\u003eSwitch Tab Instruction\u003c/a\u003e in Selenium which only navigates to specified browser tabs such as \u003ca href='https://github.com/rexshijaku/ScrapePilot/blob/master/ScrapePilot/Constants/InstructionValue/Selenium/SwitchTabVal.cs'\u003efirst and last\u003c/a\u003e.\n\n### Functions\nSimilar to the variables, Functions can also be employed in constructing the result of an Instruction, the value of certain Arguments, or the Main Output value. The value of the Function can be utilized in the mentioned cases, and when a specific case is processed by the parser it will get what the function outputs. \n\n#### Independent Functions\nThese types of functions are not dependent on any other argument or variable inside the instruction or script. 
Examples of such functions are the Current DateTime and the Blank Space Generator. The full list is available in \u003ca href='https://github.com/rexshijaku/ScrapePilot/blob/master/ScrapePilot/Constants/IndependentFunctions.cs'\u003eConstants.IndependentFunctions\u003c/a\u003e.\n\n#### Dependent Functions\nThese types of functions depend on the value of some variable or another argument of the same Instruction. The full list of possible Dependent Functions is available in \u003ca href='https://github.com/rexshijaku/ScrapePilot/blob/master/ScrapePilot/Constants/DependentFunctions.cs'\u003eConstants.DependentFunctions\u003c/a\u003e. \n\nThe implementation of the Functions is located \u003ca href='https://github.com/rexshijaku/ScrapePilot/blob/master/ScrapePilot/Models/Functions.cs'\u003ehere\u003c/a\u003e. The following example shows how Functions are used:\n```js\n // An Example where Independent and Dependent functions are used\n {\n   \"type\": \"download_a_file\",\n   \"arguments\": {\n     \"From\": \"#extracted_file_url\",\n     \"To\": [ \"Test\", \"fn_space\", \"fn_dateTime-yyyyMMdd\", \"fn_fileName-fromFullName\" ]\n   }\n }\n \n // Here fn_space and fn_dateTime-yyyyMMdd are Independent: their output does not depend on any other parameter\n // However, fn_fileName-fromFullName gives the file name taken from the value of the \"From\" property\n\n // Let's assume that fn_space outputs ' ' and fn_dateTime-yyyyMMdd = '20231007' \n // #extracted_file_url = 'some_folder_path/file_name.csv'\n // and fn_fileName-fromFullName = 'file_name.csv'\n // The To Property would be 'Test 20231007file_name.csv'\n```\n\n\n#### The Output\nIt contains a type that indicates the expected result type of the recipe. 
See the possible Output Types \u003ca href='https://github.com/rexshijaku/ScrapePilot/blob/master/ScrapePilot/Constants/OutputType.cs'\u003ehere.\u003c/a\u003e The Output also has a value part, which contains a list of fragments that are concatenated to create the final result. This list can include handwritten (raw) values, defined variables, and independent functions.\n\n```js\n{\n  \"recipes\": [\n    ...\n  ],\n  \"output\": { \n    \"type\": \"table_data_json\",\n    \"value\": [ \"#a_variable_name\" ]\n  } \n}\n```\n\n### Support\nFor general questions about ScrapePilot, tweet at @rexshijaku or write me an email at rexhepshijaku@gmail.com.\n\n### Author\n##### Rexhep Shijaku\n - Email : rexhepshijaku@gmail.com\n - Twitter : https://twitter.com/rexshijaku\n\n### Acknowledgments\nSpecial thanks to \u003ca href='https://www.tm-tracking.org/'\u003eTrygg Mat Tracking (TMT)\u003c/a\u003e for supporting this project.\n\n### Contributing\nWe welcome contributions from everyone! Here are a few ways you can help improve this project:\n\nReporting Bugs: If you encounter any bugs or unexpected behavior, please open an issue on GitHub. Be sure to include as much detail as possible, including steps to reproduce the issue.\n\nSuggesting Enhancements: Have an idea for a new feature or improvement? Feel free to open an issue to discuss it, or even better, submit a pull request with your proposed changes.\n\nSubmitting Pull Requests: Found a bug and know how to fix it? Want to add a new feature? Pull requests are welcome! Please follow the guidelines below:\n\n- Fork the repository and create your branch from the default branch.\n- Make your changes, ensuring they follow our coding conventions and style guide.\n- Write tests for any new functionality and ensure all tests pass.\n- Update the documentation to reflect your changes if necessary.\n- Open a pull request, describing the changes you've made.\n\nDocumentation: Improving the documentation is always appreciated. 
If you notice areas where the documentation could be clearer or more comprehensive, please let us know or submit a pull request with your proposed changes.\n\nSpread the Word: If you find this project useful, consider sharing it with others who might benefit from it. You can also star the repository on GitHub to show your support.\n\nBy contributing to this project, you agree to abide by the Code of Conduct. Thank you for helping to make this project better!\n\n### License\nMIT License\n\nCopyright (c) 2024 | Rexhep Shijaku \u0026 Trygg Mat Tracking (TMT) \n\nPermission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the \"Software\"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frexshijaku%2Fscrapepilot","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frexshijaku%2Fscrapepilot","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frexshijaku%2Fscrapepilot/lists"}