https://github.com/searls/tagscraper
A straightforward way to perform XPath queries on HTML in a simple object model using the iPhone SDK
https://github.com/searls/tagscraper
Last synced: about 2 months ago
JSON representation
A straightforward way to perform XPath queries on HTML in a simple object model using the iPhone SDK
- Host: GitHub
- URL: https://github.com/searls/tagscraper
- Owner: searls
- License: other
- Created: 2009-08-08T20:11:25.000Z (almost 16 years ago)
- Default Branch: master
- Last Pushed: 2009-08-09T23:36:23.000Z (almost 16 years ago)
- Last Synced: 2025-02-13T15:37:42.413Z (4 months ago)
- Homepage:
- Size: 109 KB
- Stars: 4
- Watchers: 4
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.mdown
- License: LICENSE.txt
Awesome Lists containing this project
README
TagScraper
==========TagScraper is a very lightweight means to facilitate [Screen Scraping](http://en.wikipedia.org/wiki/Screen_Scraping#Screen_scraping) for iPhone applications. At the most basic level, this means performing an [XPath](http://en.wikipedia.org/wiki/XPath) query against an [XML](http://en.wikipedia.org/wiki/XML) document, usually loaded from the [Internet](http://en.wikipedia.org/wiki/Internet).
Adding TagScraper to your project
---------------------------------1. Clone the TagScraper repository to a permanent location on your hard drive:
$ git clone git://github.com/searls/TagScraper.git
2. Find `TagScraper.xcodeproj` in Finder, then drag-and-drop it into your project's "Groups & Files" pane in Xcode underneath your Project.
* Uncheck "Copy items"
* Set "Reference Type" to "Relative to Project"
* Check "Recursively create groups for any added folders"
3. Click "TagScraper.xcodeproj" in your "Groups & Files" pane, and in the upper right, you should see a file named "libTagScraper.a". Check the checkbox to the far right with a bullseye above it.
4. Set up the project's dependencies.
1. From the menu bar, select "Project" -> "Edit Project Settings" and click the "General" tab.
2. Click the `+` icon under "Direct Dependencies" and add `TagScraper`.
3. Click the `+` icon under "Linked Libraries" and add `libxml2.dylib`.
5. Set up the project settings
1. From the menu bar, select "Project" -> "Edit Project Settings" and click the "Build" tab.
2. Set Configuration to `All Configurations`
3. Under "Other Linker Flags" add both: `-all_load` and `-ObjC`
4. Under "Header Search Paths", add the relative path from your XCode project to the TagScraper source files. If your project and TagScraper shared the same root folder, this would be "../TagScraper/src"Usage
-----
Simply import the global header to access any of TagScraper's classes. For now, those are merely `Tag` and `XPathQuery`
`#import "TagScraper.h"`
Then to try it in your code, here's an example converting an NSString to an NSData, performing the XPath query `//p`, and produce a Tag object from it. It *should* log "Content" to console.
`NSString *html = @"Content
";`
`NSData *data = [html dataUsingEncoding:NSUTF8StringEncoding];`
`Tag *tag = [XPathQuery firstResultForXPathQuery:@"//p" onDocument:data];`
`NSLog([tag retrieveText]);`I encourage you to explore the (brief) source code of the static library to see what your options are. The best usage examples will be in the Tag Scraper's "Tests" group, which can be executed by opening the `TagScraper.xcodeproj` and switching the Active Target to `TagScraperTests`.
What TagScraper Does
--------------------
* Performs XPath queries on NSData objects representing XML/HTML text
* Returns XPath results as a custom Tag object or as NSDictionary objects
* Is well suited to messily scrape data from web sites, typically for the purpose of redisplaying that data in a native iPhone application.
* Converts Tag object hierarchy back to HTML for debugging purposes (i.e. `NSLog([tag toHTML])`)What TagScraper Doesn't Do
--------------------------
* Make any effort to understand the difference between XML, XHTML, HTML, or specific doctypes within.
* Handle HTML entities or XML special characters (yet)Limitations
-----------
* Currently, will load and then release an XML document between every XPath query, making it terribly inefficient to perform multiple queries against one document.
* XPath queries are currently only case sensitive. Adding some flags to control this and other contextual items would be nice.
* Very few HTML entities are properly escaped.
* In general, the Tag object model will strip whitespace aggressively without being clear in the API. This is because it was written to scrape pages that used abhorrent amounts of unnecessary whitespace within HTML content tags.To-dos
------
* Come up with a beginQueries/endQueries API in order to allow multiple queries to be performed on the same XML Document to save cycles.
* As stated before, any weird entities can totally make this thing explode. If you'd like to help, checkout NSStringAdditions and flesh out the entities that are replaced. One just needs to take the effort to look up what's necessary and in what cases.