{"id":15409916,"url":"https://github.com/a-gubskiy/x.web.metaextractor","last_synced_at":"2025-06-14T02:06:42.178Z","repository":{"id":58859008,"uuid":"98110242","full_name":"a-gubskiy/X.Web.MetaExtractor","owner":"a-gubskiy","description":"Powerful library that allows you to extract meta information from any web page URL.","archived":false,"fork":false,"pushed_at":"2025-06-09T12:38:23.000Z","size":448,"stargazers_count":8,"open_issues_count":0,"forks_count":5,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-06-14T02:06:40.796Z","etag":null,"topics":["dncuug","extract-meta-information","metadata","metadata-extraction","net-core","open-graph","web"],"latest_commit_sha":null,"homepage":"https://nuget.org/packages/X.Web.MetaExtractor","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/a-gubskiy.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":["a-gubskiy"],"buy_me_a_coffee":"g.andrew","custom":["http://andrew.gubskiy.com/donate"]}},"created_at":"2017-07-23T16:18:32.000Z","updated_at":"2025-06-09T12:38:20.000Z","dependencies_parsed_at":"2024-08-20T14:40:13.317Z","dependency_job_id":"68068b3f-ed67-4005-9862-de84e248335b","html_url":"https://github.com/a-gubskiy/X.Web.MetaExtractor","commit_stats":{"total_commits":194,"total_committers":6,"mean_commits":"32.333333333333336","dds":0.6288659793814433,"last_synced_commit":"888e1c8c85498319075acbf0436f01dd162fa3f6"},"previous_names":["a-gubskiy/x.web.metaextractor"],"tags_count":12,"template":false,"template_full_name":
null,"purl":"pkg:github/a-gubskiy/X.Web.MetaExtractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/a-gubskiy%2FX.Web.MetaExtractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/a-gubskiy%2FX.Web.MetaExtractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/a-gubskiy%2FX.Web.MetaExtractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/a-gubskiy%2FX.Web.MetaExtractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/a-gubskiy","download_url":"https://codeload.github.com/a-gubskiy/X.Web.MetaExtractor/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/a-gubskiy%2FX.Web.MetaExtractor/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":259747232,"owners_count":22905313,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dncuug","extract-meta-information","metadata","metadata-extraction","net-core","open-graph","web"],"created_at":"2024-10-01T16:41:55.252Z","updated_at":"2025-06-14T02:06:42.152Z","avatar_url":"https://github.com/a-gubskiy.png","language":"C#","funding_links":["https://github.com/sponsors/a-gubskiy","https://buymeacoffee.com/g.andrew","http://andrew.gubskiy.com/donate"],"categories":[],"sub_categories":[],"readme":"# X.Web.MetaExtractor\n[![NuGet version](https://badge.fury.io/nu/X.Web.MetaExtractor.svg)](https://badge.fury.io/nu/X.Web.MetaExtractor)\n[![Twitter 
URL](https://img.shields.io/twitter/url/https/twitter.com/andrew_gubskiy.svg?style=social\u0026label=Follow%20me!)](https://twitter.com/intent/user?screen_name=andrew_gubskiy)\n\n**X.Web.MetaExtractor** is a powerful library that allows you to extract meta information from any web page URL. It provides a variety of content loaders to handle HTTP requests using different libraries.\n\n## Breaking Changes\n\n- **Metadata class was changed**: The `Content` field has been removed from the `Metadata` class. If your code used the `Content` field, update it to reflect this change.\n- **Description Extraction Logic**: The `Extractor` class now extracts the description only from meta tags and no longer attempts to parse the content of the page.\n- **New WebPage Model**: The library now returns a `WebPage` model with comprehensive information, including links found on the page.\n- **Link Extraction**: Added support for extracting and processing all hyperlinks from web pages.\n\n## Features\n\n- Extract meta information from any web page URL.\n- Extract and process hyperlinks from web pages.\n- Support for multiple HTTP libraries:\n  - Flurl\n  - FsHttp\n  - HttpClient\n  - RestSharp\n- Detect the language of the page content.\n\n## Installation\n\nTo install the library, use the following command:\n\n```bash\ndotnet add package X.Web.MetaExtractor\n```\n\n## Usage\n\nHere is a basic example of how to use the `X.Web.MetaExtractor` library:\n\n```csharp\nusing X.Web.MetaExtractor;\nusing X.Web.MetaExtractor.ContentLoaders;\nusing X.Web.MetaExtractor.LanguageDetectors;\n\n// Create instances of the necessary components\nIContentLoader contentLoader = new FlurlContentLoader();\nILanguageDetector languageDetector = new LanguageDetector();\nstring defaultImage = \"https://example.com/example.jpg\";\n\n// Create an instance of the Extractor\nIExtractor extractor = new Extractor(defaultImage, contentLoader, languageDetector);\n\n// Extract information from a URL\nvar webPage = await 
extractor.Extract(new Uri(\"https://example.com\"), CancellationToken.None);\n\n// Display the extracted information\nConsole.WriteLine($\"Title: {webPage.Title}\");\nConsole.WriteLine($\"Description: {webPage.Description}\");\nConsole.WriteLine($\"Keywords: {webPage.Keywords}\");\nConsole.WriteLine($\"Language: {webPage.Language}\");\n\n// Process links\nif (webPage.Links != null)\n{\n    Console.WriteLine($\"Found {webPage.Links.Count} links:\");\n    foreach (var link in webPage.Links)\n    {\n        Console.WriteLine($\"- {link.Title}: {link.Value}\");\n    }\n}\n```\n\n## Interfaces and Classes\n\n### IExtractor\n\n`IExtractor` defines the interface for extracting web page information, returning a comprehensive `WebPage` model.\n\n### ILanguageDetector\n\n`ILanguageDetector` defines the interface for detecting the language of the page content.\n\n### IContentLoader\n\n`IContentLoader` defines the interface for loading the content of a web page asynchronously.\n\n### WebPage\n\n`WebPage` is the main model containing extracted information from a web page, including metadata, links, and source information.\n\n### Link\n\n`Link` is a record that represents a hyperlink extracted from HTML content, with `Title` and `Value` properties.\n\n### Source\n\n`Source` is a record that contains information about the origin of web content, including the original URL and raw page content.\n\n## Extractors\n\nThe library architecture supports multiple specialized extractors that work together to build a complete representation of a web page:\n\n* **MetaDocumentExtractor** - Extracts metadata from HTML \u003cmeta\u003e tags\n* **OpenGraphDocumentExtractor** - Extracts Open Graph protocol metadata\n* **TitleDocumentExtractor** - Extracts the page title\n* **ImageDocumentExtractor** - Extracts image URLs from the document\n* **LinksDocumentExtractor** - Extracts all hyperlinks from HTML documents, converting them to strongly typed `Link` objects\n\n## Content Loaders\n\n### 
Flurl\n\n`X.Web.MetaExtractor.ContentLoaders.Flurl` provides a content loader using the Flurl HTTP library.\n\n### FsHttp\n\n`X.Web.MetaExtractor.ContentLoaders.FsHttp` leverages the FsHttp library to load content.\n\n### HttpClient\n\n`X.Web.MetaExtractor.ContentLoaders.HttpClient` utilizes the HttpClient class to load content.\n\n### RestSharp\n\n`X.Web.MetaExtractor.ContentLoaders.RestSharp` uses the RestSharp library for content loading.\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## License\n\nThis project is licensed under the MIT License. See the [LICENSE](https://github.com/a-gubskiy/X.Web.MetaExtractor/blob/master/LICENSE) file for more details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fa-gubskiy%2Fx.web.metaextractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fa-gubskiy%2Fx.web.metaextractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fa-gubskiy%2Fx.web.metaextractor/lists"}