https://github.com/danny1113/html-parser-builder
A result builder that builds an HTML parser and transforms HTML elements into a strongly-typed result, inspired by RegexBuilder.
- Host: GitHub
- URL: https://github.com/danny1113/html-parser-builder
- Owner: danny1113
- License: mit
- Created: 2022-07-14T09:14:12.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-06-18T10:50:05.000Z (7 months ago)
- Last Synced: 2024-10-01T09:47:35.077Z (4 months ago)
- Topics: dsl, html-parser, swift
- Language: Swift
- Homepage:
- Size: 47.9 KB
- Stars: 9
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-result-builders - HTMLParserBuilder - Build your HTML parser with declarative syntax and strongly-typed result. (Parsing)
README
# HTMLParserBuilder
A result builder that builds an HTML parser and transforms HTML elements into a strongly-typed result, inspired by RegexBuilder.
> **Note**: `CaptureTransform.swift` and `TypeConstruction.swift` are copied from [apple/swift-experimental-string-processing](https://github.com/apple/swift-experimental-string-processing/).
- [HTMLParserBuilder](#htmlparserbuilder)
  - [Installation](#installation)
    - [Requirement](#requirement)
  - [Introduction](#introduction)
  - [Usage](#usage)
    - [Bring your own parser](#bring-your-own-parser)
    - [Parsing](#parsing)
    - [HTML](#html)
    - [Capture](#capture)
    - [TryCapture](#trycapture)
    - [CaptureAll](#captureall)
    - [Local](#local)
    - [LateInit](#lateinit)
    - [Wrap Up](#wrap-up)
  - [Advanced use case](#advanced-use-case)

## Installation
### Requirement
- Swift 5.9
- macOS 10.15
- iOS 13.0
- tvOS 13.0
- watchOS 6.0

```swift
dependencies: [
    // ...
    .package(name: "HTMLParserBuilder", url: "https://github.com/danny1113/html-parser-builder.git", from: "2.0.0"),
]
```

## Introduction
Parsing HTML can be complicated. For example, suppose you want to parse the simple HTML below:

```html
<h1 id="hello">hello, world</h1>

<div id="group">
    <h1>INSIDE GROUP h1</h1>
    <h2>INSIDE GROUP h2</h2>
</div>
```

Existing HTML parsing libraries have these downsides:

- You have to name every captured element
- The code becomes more complex as the number of elements you want to capture grows
- Error handling can be hard

```swift
let htmlString = "..."
let doc: any Document = HTMLDocument(string: htmlString)
let first = doc.querySelector("#hello")?.textContent

let group = doc.querySelector("#group")
let second = group?.querySelector("h1")?.textContent
let third = group?.querySelector("h2")?.textContent

if let first = first,
   let second = second,
   let third = third {
    // ...
} else {
    // ...
}
```

HTMLParserBuilder comes with some really great advantages:

- Strongly-typed capture result
- Structured syntax
- Composable API
- Support for async/await
- Built-in error handling

You can construct a parser that reflects your original HTML structure:
```swift
let capture = HTML {
    TryCapture("#hello") { (element: (any Element)?) -> String? in
        return element?.textContent
    } // => HTML<String?>

    Local("#group") {
        Capture("h1", transform: \.textContent) // => HTML<String>
        Capture("h2", transform: \.textContent) // => HTML<String>
    } // => HTML<(String, String)>
} // => HTML<(String?, (String, String))>

let htmlString = "..."
let doc: any Document = HTMLDocument(string: htmlString)

let output = try doc.parse(capture)
// => (String?, (String, String))
// output: (Optional("hello, world"), ("INSIDE GROUP h1", "INSIDE GROUP h2"))
```

> **Note**: You can currently compose up to 10 components inside the builder; if you need more, group your captures inside [`Local`](#local) as a workaround.
## Usage
### Bring your own parser
HTMLParserBuilder doesn't rely on any particular HTML parser, so you can choose whichever one you want, as long as it conforms to the `Document` and `Element` protocols.
For example, you can use SwiftSoup as the HTML parser. An example conformance to the `Document` and `Element` protocols is available in `Tests/HTMLParserBuilderTests/SwiftSoup+HTMLParserBuilder.swift`.
```swift
dependencies: [
    // ...
    .package(url: "https://github.com/scinfu/SwiftSoup.git", from: "2.6.0"),
    .package(name: "HTMLParserBuilder", url: "https://github.com/danny1113/html-parser-builder.git", from: "2.0.0"),
],
targets: [
    .target(name: "YourTarget", dependencies: ["SwiftSoup", "HTMLParserBuilder"]),
]
```

### Parsing
HTMLParserBuilder provides two functions for parsing:

```swift
public func parse<Output>(_ html: HTML<Output>) throws -> Output
public func parse<Output>(_ html: HTML<Output>) async throws -> Output
```

> **Note**: You can choose the async version for even better performance, since it uses structured concurrency to parallelize child tasks.
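
As a rough illustration of why an async variant can be faster, the sketch below runs two independent lookups as child tasks with `async let`. It is not the library's implementation: `Missing`, `lookup`, `parseGroup`, and the dictionary standing in for a document are hypothetical names chosen so the example is self-contained.

```swift
// Hypothetical error type for this sketch.
struct Missing: Error {
    let selector: String
}

// Hypothetical stand-in for one capture: resolve a selector against a toy
// "document" that is just a dictionary from selector to text.
func lookup(_ selector: String, in doc: [String: String]) async throws -> String {
    guard let text = doc[selector] else {
        throw Missing(selector: selector)
    }
    return text
}

// Both lookups start as child tasks before either result is awaited,
// so they may run in parallel. An error from either lookup propagates
// to the caller.
func parseGroup(in doc: [String: String]) async throws -> (String, String) {
    async let h1 = lookup("#group h1", in: doc)
    async let h2 = lookup("#group h2", in: doc)
    return try await (h1, h2)
}
```

With only two cheap lookups the gain is negligible; parallel child tasks pay off mainly when each capture does real work.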
### HTML
You can construct your parser inside `HTML`; it can also transform the captured output into another data type.

```swift
struct Group {
    let h1: String
    let h2: String
}

let capture = HTML {
    Capture("#group h1", transform: \.textContent) // => HTML<String>
    Capture("#group h2", transform: \.textContent) // => HTML<String>
} transform: { (output: (String, String)) -> Group in
    return Group(
        h1: output.0,
        h2: output.1
    )
} // => HTML<Group>
```

---
### Capture
`Capture` works like `querySelector`: you pass in a CSS selector to find an HTML element, and you can transform the element into any other type you want, using properties such as:

- innerHTML
- textContent
- attributes
- ...

> **Note**: If `Capture` can't find an HTML element that matches the selector, it throws an error that causes the whole parse to fail. For a failable capture, see [`TryCapture`](#trycapture).
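
The difference between a throwing capture and a failable one can be sketched with plain Swift. Everything here is a hypothetical stand-in rather than the library's API: `ToyDocument` maps selectors directly to text, and the free functions `capture` and `tryCapture` only mimic the behavior described in the note above.

```swift
// Hypothetical toy document: a map from selector to text content.
struct ToyDocument {
    var elements: [String: String]

    func querySelector(_ selector: String) -> String? {
        elements[selector]
    }
}

// Hypothetical error thrown when a selector matches nothing.
struct ElementNotFound: Error {
    let selector: String
}

// Capture-style lookup: a missing element fails the whole parse.
func capture(_ selector: String, in doc: ToyDocument) throws -> String {
    guard let text = doc.querySelector(selector) else {
        throw ElementNotFound(selector: selector)
    }
    return text
}

// TryCapture-style lookup: a missing element becomes nil.
func tryCapture(_ selector: String, in doc: ToyDocument) -> String? {
    doc.querySelector(selector)
}

let toyDoc = ToyDocument(elements: ["#hello": "hello, world"])
let present = try capture("#hello", in: toyDoc)   // "hello, world"
let absent = tryCapture("#missing", in: toyDoc)   // nil
```
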
You can declare it in whichever form suits you best:

```swift
Capture("#hello", transform: \.textContent)
Capture("#hello") { $0.textContent }
Capture("#hello") { (e: any Element) -> String in
    return e.textContent
}
```

### TryCapture
`TryCapture` is a little different from `Capture`: it also calls `querySelector` to find the HTML element, but it passes an **optional** HTML element to the transform.
In this example, the result type is `String?`, and the result is `nil` when the HTML element can't be found.
```swift
TryCapture("#hello") { (e: (any Element)?) -> String? in
    return e?.innerHTML
}
```

### CaptureAll
`CaptureAll` works like `querySelectorAll`: you pass in a CSS selector to find all HTML elements that match it, and you can transform them into any other type you want.
You can declare it in whichever form suits you best:

```swift
CaptureAll("h1") { $0.map(\.textContent) }
CaptureAll("h1") { (e: [any Element]) -> [String] in
    return e.map(\.textContent)
}
```

You can also capture other elements inside each match and transform them into another type:
```html
<div class="group">
    <h1>Group 1</h1>
</div>
<div class="group">
    <h1>Group 2</h1>
</div>
```

```swift
CaptureAll("div.group") { (elements: [any Element]) -> [String] in
    return elements.compactMap { e in
        return e.querySelector("h1")?.textContent
    }
}
// => [String]
// output: ["Group 1", "Group 2"]
```

---
### Local
`Local` finds an HTML element that matches the selector, and every capture inside it searches relative to that element. This is useful when you only want to capture elements inside the local group.
Just like `HTML`, `Local` can also transform the captured result into another data type by adding `transform`:
```swift
struct Group {
    let h1: String
    let h2: String
}

Local("#group") {
    Capture("h1", transform: \.textContent) // => HTML<String>
    Capture("h2", transform: \.textContent) // => HTML<String>
} transform: { (output: (String, String)) -> Group in
    return Group(
        h1: output.0,
        h2: output.1
    )
} // => Group
```

> **Note**: If `Local` can't find an HTML element that matches the selector, it throws an error that causes the whole parse to fail. You can use [`TryCapture`](#trycapture) as an alternative.
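
To see why scoping matters, here is a minimal sketch of a selector lookup that only searches the descendants of a given element. `Node` is a hypothetical toy tree, not the library's `Element` protocol, and its `querySelector` understands only bare tag names and `#id` selectors.

```swift
// Hypothetical toy element tree used only for this illustration.
struct Node {
    let tag: String
    var id: String? = nil
    var text: String = ""
    var children: [Node] = []

    // Supports only "tag" and "#id" selectors.
    func matches(_ selector: String) -> Bool {
        if selector.hasPrefix("#") {
            return id == String(selector.dropFirst())
        }
        return tag == selector
    }

    // Returns the first matching descendant in document order.
    func querySelector(_ selector: String) -> Node? {
        for child in children {
            if child.matches(selector) { return child }
            if let found = child.querySelector(selector) { return found }
        }
        return nil
    }
}

let page = Node(tag: "body", children: [
    Node(tag: "h1", id: "hello", text: "hello, world"),
    Node(tag: "div", id: "group", children: [
        Node(tag: "h1", text: "INSIDE GROUP h1"),
        Node(tag: "h2", text: "INSIDE GROUP h2"),
    ]),
])

// Searching from the root finds the first h1 in the whole page.
let global = page.querySelector("h1")?.text
// Searching from #group finds the h1 inside the group instead.
let scoped = page.querySelector("#group")?.querySelector("h1")?.text
```

Here `global` is `"hello, world"` while `scoped` is `"INSIDE GROUP h1"`: the same selector resolves differently depending on the element the search starts from.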
### LateInit
This library also comes with a handy property wrapper, `LateInit`, which delays initialization until the first time you access the property.

```swift
struct Container {
    @LateInit var capture = HTML {
        Capture("h1", transform: \.textContent)
    }
}

// it needs to be `var` to perform late initialization
var container = Container()
let output = try doc.parse(container.capture)
// ...
```

### Wrap Up

| API        | Use Case                                            |
| ---------- | --------------------------------------------------- |
| Capture    | Throws an error when the element can't be captured  |
| TryCapture | Returns `nil` when the element can't be captured    |
| CaptureAll | Captures all elements that match the selector       |
| Local      | Captures elements in the local scope                |
| LateInit   | Delays initialization until the first access        |

## Advanced use case

- Pass one `HTMLComponent` into another
- Transform to a custom data structure before parsing

```swift
struct Group {
    let h1: String
    let h2: String
}

let groupCapture = HTML {
    Local("#group") {
        Capture("h1", transform: \.textContent) // => HTML<String>
        Capture("h2", transform: \.textContent) // => HTML<String>
    } // => HTML<(String, String)>
} transform: { output -> Group in
    return Group(
        h1: output.0,
        h2: output.1
    )
} // => HTML<Group>

let capture = HTML {
    TryCapture("#hello") { (element: (any Element)?) -> String? in
        return element?.textContent
    } // => HTML<String?>

    groupCapture // => HTML<Group>, the component defined above
} // => HTML<(String?, Group)>

let htmlString = "..."
let doc: any Document = HTMLDocument(string: htmlString)

let output = try doc.parse(capture)
// => (String?, Group)
```
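
The composition idea can also be sketched without the library, as a small parser-combinator exercise. This is only an illustration under simplifying assumptions: `MiniParser`, `capture`, `tryCapture`, `both`, `Missing`, and `ParsedGroup` are hypothetical names, and the "document" is a plain dictionary from selector to text.

```swift
// Hypothetical error for a selector that matches nothing.
struct Missing: Error {
    let selector: String
}

// Hypothetical miniature parser: a function from a toy document to a typed output.
struct MiniParser<Output> {
    let run: ([String: String]) throws -> Output

    // Transform the output into another type before the parser is ever run.
    func map<T>(_ transform: @escaping (Output) -> T) -> MiniParser<T> {
        MiniParser<T> { doc in try transform(self.run(doc)) }
    }
}

// Throws when the selector is missing.
func capture(_ selector: String) -> MiniParser<String> {
    MiniParser { doc in
        guard let text = doc[selector] else {
            throw Missing(selector: selector)
        }
        return text
    }
}

// Produces nil when the selector is missing.
func tryCapture(_ selector: String) -> MiniParser<String?> {
    MiniParser { doc in doc[selector] }
}

// Combine two parsers into one that produces a tuple of both outputs.
func both<A, B>(_ a: MiniParser<A>, _ b: MiniParser<B>) -> MiniParser<(A, B)> {
    MiniParser { doc in try (a.run(doc), b.run(doc)) }
}

struct ParsedGroup: Equatable {
    let h1: String
    let h2: String
}

// A component that already yields a custom type...
let groupParser = both(capture("#group h1"), capture("#group h2"))
    .map { ParsedGroup(h1: $0.0, h2: $0.1) }
// ...can be passed into a larger parser like any other component.
let pageParser = both(tryCapture("#hello"), groupParser)

let toyPage = [
    "#hello": "hello, world",
    "#group h1": "INSIDE GROUP h1",
    "#group h2": "INSIDE GROUP h2",
]
let parsed = try pageParser.run(toyPage)
// parsed has type (String?, ParsedGroup)
```

Because each parser is just a value with a typed output, the transform to `ParsedGroup` happens once, where the group is defined, and every parser that embeds `groupParser` gets the custom type for free.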