# Page-Info

## Purpose
Gives information about a web page as plain text.

Takes a website URL as input and provides general information
about the contents of the page:

- [X] HTML Version
- [X] Page Title
- [X] Heading count by level
- [X] Number of internal and external links
- [X] Number of inaccessible links
- [X] Whether the page contains a login form

## Architecture
The application consists of a page controller, a page service, and page utils (helper functions).

The page controller has the following methods:

```go
func FindUrlInfo(w http.ResponseWriter, r *http.Request)
```
```go
func FindUrlInfoWithPost(w http.ResponseWriter, r *http.Request)
```
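
For illustration, here is a minimal sketch of how these two handlers might be wired up. It assumes Go 1.22-style pattern routing and a stubbed `getInfo`; the actual repository may use a third-party router for the `{url}` path parameter, and the real `getInfo` lives in the page service described below.

```go
package main

import (
	"fmt"
	"net/http"
)

// Stub standing in for the page service's getInfo; the real function
// assembles the full plain-text report.
func getInfo(url string) string {
	return "report for " + url
}

// FindUrlInfo handles GET /url/{url} and writes the report as plain text.
func FindUrlInfo(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, getInfo(r.PathValue("url")))
}

// FindUrlInfoWithPost handles POST /?url={urlWithPath} via a query parameter.
func FindUrlInfoWithPost(w http.ResponseWriter, r *http.Request) {
	fmt.Fprint(w, getInfo(r.URL.Query().Get("url")))
}

func main() {
	http.HandleFunc("GET /url/{url}", FindUrlInfo)
	http.HandleFunc("POST /", FindUrlInfoWithPost)
	http.ListenAndServe(":8080", nil)
}
```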

The page service has the following methods:

```go
func getInfo(url string) string

// collects the information about the given web page and returns the result
```
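
As a rough idea of how `getInfo` might assemble the plain-text report, here is a trimmed sketch that only covers the title and heading counts; the real function also reports the HTML version, link counts, and the login form check via the helpers listed below, and the exact report format shown here is an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// getInfo fetches the page and builds a plain-text report.
// Trimmed sketch: only the title and heading counts are shown here.
func getInfo(pageURL string) string {
	resp, err := http.Get(pageURL)
	if err != nil {
		return "error: " + err.Error()
	}
	defer resp.Body.Close()

	doc, err := goquery.NewDocumentFromReader(resp.Body)
	if err != nil {
		return "error: " + err.Error()
	}

	var report strings.Builder
	fmt.Fprintf(&report, "Page Title: %s\n", doc.Find("title").First().Text())
	for _, level := range []string{"h1", "h2", "h3", "h4", "h5", "h6"} {
		fmt.Fprintf(&report, "%s count: %d\n", level, doc.Find(level).Length())
	}
	return report.String()
}

func main() {
	fmt.Println(getInfo("https://example.com"))
}
```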

```go
func DetectHTMLTypeFromDoc(doc *goquery.Document) (string, error)

// detects the HTML version from an HTML document
```
```go
func checkDoctype(html string) string
```
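
These two functions work together: `DetectHTMLTypeFromDoc` extracts the doctype from the parsed document and `checkDoctype` maps it to a version name. The sketch below is an assumption about how that might look; the walk over goquery's underlying nodes and the version mapping table are guesses, not the repository's exact logic.

```go
package main

import (
	"errors"
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
	"golang.org/x/net/html"
)

// DetectHTMLTypeFromDoc walks the document's top-level nodes, rebuilds the
// doctype declaration as text, and hands it to checkDoctype.
func DetectHTMLTypeFromDoc(doc *goquery.Document) (string, error) {
	for node := doc.Nodes[0].FirstChild; node != nil; node = node.NextSibling {
		if node.Type != html.DoctypeNode {
			continue
		}
		doctype := "<!DOCTYPE " + node.Data
		for _, attr := range node.Attr { // public/system identifiers, if present
			doctype += " " + attr.Val
		}
		return checkDoctype(doctype + ">"), nil
	}
	return "", errors.New("no doctype declaration found")
}

// checkDoctype maps a doctype string to an HTML version name.
// The mapping table here is assumed, not the repository's exact list.
func checkDoctype(html string) string {
	switch {
	case strings.Contains(html, "XHTML 1.0"):
		return "XHTML 1.0"
	case strings.Contains(html, "HTML 4.01"):
		return "HTML 4.01"
	case strings.Contains(strings.ToLower(html), "<!doctype html"):
		return "HTML 5"
	default:
		return "Unknown"
	}
}

func main() {
	page := "<!DOCTYPE html><html><head><title>t</title></head><body></body></html>"
	doc, _ := goquery.NewDocumentFromReader(strings.NewReader(page))
	fmt.Println(DetectHTMLTypeFromDoc(doc))
}
```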
```go
func getTitle(doc *goquery.Document) string
```
```go
func getHeadingCount(doc *goquery.Document) map[string]int
```
```go
func extractLinks(doc *goquery.Document) []string

// finds all links on the web page by selecting 'a' elements with an 'href' attribute
```
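
A minimal sketch of that selection, using goquery's attribute check on each `a` element:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// extractLinks collects the href value of every 'a' element on the page.
func extractLinks(doc *goquery.Document) []string {
	var hrefs []string
	doc.Find("a").Each(func(_ int, s *goquery.Selection) {
		if href, exists := s.Attr("href"); exists {
			hrefs = append(hrefs, href)
		}
	})
	return hrefs
}

func main() {
	page := `<html><body><a href="/about">About</a><a href="https://example.org">Ext</a></body></html>`
	doc, _ := goquery.NewDocumentFromReader(strings.NewReader(page))
	fmt.Println(extractLinks(doc)) // [/about https://example.org]
}
```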
```go
func separateLinks(baseURL string, hrefs []string) ([]string, []string)

// separates links into internal and external links
// if a link starts with the given URL's host name (base URL) or with '/', it is an internal link
// otherwise it is an external link
```
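
A sketch of that rule; the exact prefix comparison is assumed from the description above:

```go
package main

import (
	"fmt"
	"strings"
)

// separateLinks splits hrefs into internal and external links. A link counts
// as internal when it starts with the page's base URL or with '/'.
func separateLinks(baseURL string, hrefs []string) ([]string, []string) {
	var internal, external []string
	for _, href := range hrefs {
		if strings.HasPrefix(href, baseURL) || strings.HasPrefix(href, "/") {
			internal = append(internal, href)
			continue
		}
		external = append(external, href)
	}
	return internal, external
}

func main() {
	hrefs := []string{"/login", "https://facebook.com/about", "https://example.org"}
	in, ext := separateLinks("https://facebook.com", hrefs)
	fmt.Println("internal:", in)
	fmt.Println("external:", ext)
}
```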
```go
func getInaccessibleLinks(urls []string, wg *sync.WaitGroup)

// finds inaccessible links
// a wait group worker is added before each URL is checked concurrently
// when a check finishes, it signals done to the wait group
// each URL is checked in its own goroutine, so the URLs are called concurrently instead of one by one
// a sync.WaitGroup counts the concurrent goroutines, with one worker added per goroutine
// the number of inaccessible links is increased with an atomic counter
// the wait group waits for all goroutines to finish, and the result is written to the response
```
```go
func checkUrl(url string, wg *sync.WaitGroup)

// checks whether the given URL is accessible
// if it is not accessible, the count is increased atomically because URLs are checked concurrently
// if the response code is not 200, the URL is assumed to be inaccessible
```
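
A sketch of the concurrency pattern described for `getInaccessibleLinks` and `checkUrl` together: one goroutine per URL, one WaitGroup worker per goroutine, and an atomic counter for the inaccessible total. The package-level counter and the place where `Wait` is called are assumptions; the real code may hold these differently.

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

// inaccessibleCount is increased atomically by the concurrent checks.
// Package-level here for the sketch only.
var inaccessibleCount int64

// getInaccessibleLinks checks every URL in its own goroutine, registering a
// WaitGroup worker per URL and waiting until all checks have signalled done.
func getInaccessibleLinks(urls []string, wg *sync.WaitGroup) {
	for _, u := range urls {
		wg.Add(1)          // one worker per URL
		go checkUrl(u, wg) // check concurrently instead of one by one
	}
	wg.Wait() // block until every goroutine has called Done
}

// checkUrl marks the URL inaccessible unless the request succeeds with 200 OK.
// The counter is incremented atomically because many goroutines run at once.
func checkUrl(url string, wg *sync.WaitGroup) {
	defer wg.Done()
	resp, err := http.Get(url)
	if err != nil {
		atomic.AddInt64(&inaccessibleCount, 1)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		atomic.AddInt64(&inaccessibleCount, 1)
	}
}

func main() {
	var wg sync.WaitGroup
	getInaccessibleLinks([]string{"https://example.com", "https://example.com/missing"}, &wg)
	fmt.Println("inaccessible links:", atomic.LoadInt64(&inaccessibleCount))
}
```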
```go
func checkLoginForm(doc *goquery.Document) bool

// checks whether a login form exists by looking for an input field of type 'password'
```
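
A minimal sketch of that check with goquery:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

// checkLoginForm reports whether the page contains a login form by looking
// for at least one input field of type 'password'.
func checkLoginForm(doc *goquery.Document) bool {
	return doc.Find("input[type='password']").Length() > 0
}

func main() {
	page := `<html><body><form><input type="text"><input type="password"></form></body></html>`
	doc, _ := goquery.NewDocumentFromReader(strings.NewReader(page))
	fmt.Println(checkLoginForm(doc)) // true
}
```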
The page utils have the following methods:

```go
func GetDocumentFromURL(url string) (*goquery.Document, error)

// fetches the given URL and parses it into a goquery Document
// the Document holds the root document node and can make selections on it
```
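
A minimal sketch, assuming the helper simply fetches the URL and feeds the response body to `goquery.NewDocumentFromReader`:

```go
package main

import (
	"fmt"
	"net/http"

	"github.com/PuerkitoBio/goquery"
)

// GetDocumentFromURL fetches the URL and parses the response body into a
// goquery Document. Sketch only; the real helper may add status checks.
func GetDocumentFromURL(url string) (*goquery.Document, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return goquery.NewDocumentFromReader(resp.Body)
}

func main() {
	doc, err := GetDocumentFromURL("https://example.com")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("title:", doc.Find("title").Text())
}
```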
```go
func responseBuilder(current *string, title, info string)

// appends the given information to the current response
```
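
A tiny sketch of what this helper might do; the exact line format is an assumption:

```go
package main

import "fmt"

// responseBuilder appends a "title : info" line to the running plain-text response.
func responseBuilder(current *string, title, info string) {
	*current += fmt.Sprintf("%s : %s\n", title, info)
}

func main() {
	var out string
	responseBuilder(&out, "Page Title", "Example Domain")
	responseBuilder(&out, "h1 count", "1")
	fmt.Print(out)
}
```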
```go
func verifyURL(myUrl string) string

// http.Get does not work if the URL does not contain http or https,
// so this function adds "http://" in front of the URL when the scheme is missing
```
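
A sketch of that prefix check:

```go
package main

import (
	"fmt"
	"strings"
)

// verifyURL prefixes the URL with "http://" when no scheme is present,
// since http.Get refuses URLs without http or https.
func verifyURL(myUrl string) string {
	if strings.HasPrefix(myUrl, "http://") || strings.HasPrefix(myUrl, "https://") {
		return myUrl
	}
	return "http://" + myUrl
}

func main() {
	fmt.Println(verifyURL("facebook.com"))         // http://facebook.com
	fmt.Println(verifyURL("https://facebook.com")) // https://facebook.com
}
```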
```go
func getBaseUrl(myurl string) string

// gets only the host part of the URL, used for checking internal links
```
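
A sketch using net/url. It returns the scheme plus host so absolute links can be prefix-matched; the real helper may return only the hostname.

```go
package main

import (
	"fmt"
	"net/url"
)

// getBaseUrl extracts the host part of the URL, which is what the
// internal/external link check compares against.
func getBaseUrl(myurl string) string {
	parsed, err := url.Parse(myurl)
	if err != nil {
		return myurl
	}
	return parsed.Scheme + "://" + parsed.Host
}

func main() {
	fmt.Println(getBaseUrl("https://facebook.com/login")) // https://facebook.com
}
```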
## How to Run

```shell
$ go run main.go
```

## Usage
- Get URL:
- request type: GET
- host: localhost:8080/url/{url}
- works only with a hostname without a path; if the URL has a path, the POST method should be used
- example: http://localhost:8080/url/facebook.com

![Get URL](/images/get.png)

- Get URL with POST:
- request type: POST
- host: localhost:8080/?url={urlWithPath}
- query parameters: url
- example: localhost:8080/?url=facebook.com/login

![Get URL with POST](/images/post.png)