https://github.com/novitae/njsparser

A NextJS data parser, to scrape peacefully 🦩

Host: GitHub
URL: https://github.com/novitae/njsparser
Owner: novitae
License: mit
Created: 2024-07-30T12:34:14.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2024-12-27T12:05:42.000Z (over 1 year ago)
Last Synced: 2024-12-27T12:15:22.648Z (over 1 year ago)
Topics: javascript, next, nextjs, parser, scraper, scraping
Language: HTML
Homepage:
Size: 2.9 MB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE


# NJSParser

A powerful **parser** and **explorer** for any website built with [NextJS](https://nextjs.org).

- Parses flight data (from the **`self.__next_f.push`** scripts).

- Parses next data from **`__NEXT_DATA__`** script.

- Parses **build manifests**.

- Searches for **build id**.

- Many other things ...

It relies only on **lxml**, **orjson**, and **pydantic** to guarantee fast and efficient data parsing and processing.

## Installation:

```
pip install njsparser
```

## Use

### CLI

The CLI is available through 3 equivalent commands:

- `njsp`

- `njsparser`

- `python3 -m njsparser.cli`

Its single feature is displaying information about a website, like this:

![](./src/Capture%20d’écran%202024-12-27%20à%2013.01.10.png)

For more information, run the command with the `--help` argument.

### Parsing `__next_f`

The data you find in `__next_f` is called flight data, and it contains data in React's format. You can parse it easily with `njsparser`, as follows.
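For intuition about what the library handles for you, here is a minimal standard-library sketch of how such a stream might be pulled out of a page. It is not how `njsparser` is implemented. It assumes each script pushes `[1, "<text>"]`, that the concatenated text is a series of `id:payload` lines, and it leaves every payload unparsed. The names `PUSH_RE`, `extract_flight_rows` and `sample_html` are hypothetical.

```python
import json
import re

# Captures the array passed to each `self.__next_f.push(...)` call.
PUSH_RE = re.compile(r"self\.__next_f\.push\((\[.*?\])\)</script>", re.DOTALL)


def extract_flight_rows(html: str) -> dict:
    """Return {row_id: raw_payload} for simple `id:payload` rows."""
    chunks = []
    for match in PUSH_RE.finditer(html):
        pushed = json.loads(match.group(1))
        # Assumption: `[1, "<text>"]` carries one piece of the stream.
        if len(pushed) == 2 and pushed[0] == 1 and isinstance(pushed[1], str):
            chunks.append(pushed[1])
    rows = {}
    # A row may be split across several pushes, so join before splitting.
    for line in "".join(chunks).splitlines():
        row_id, sep, payload = line.partition(":")
        if sep:
            rows[row_id] = payload
    return rows


# Hypothetical, hand-written page fragment (not taken from a real site).
sample_html = (
    r'<script>self.__next_f.push([0])</script>'
    r'<script>self.__next_f.push([1,"0:[\"$\",\"div\",null,"])</script>'
    r'<script>self.__next_f.push([1,"{\"tagline\":\"hi\"}]\n1:\"done\"\n"])</script>'
)

rows = extract_flight_rows(sample_html)
print(json.loads(rows["0"]))
```

Real pages also contain rows whose payload is not plain JSON, which this sketch does not handle; that is the kind of work `BeautifulFD` is meant to take off your hands.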

*We will build a parser for the [flight data example](examples/flight_data.py)*

1. In the website you want to parse, make sure you see `self.__next_f.push` at the beginning of the script that contains the data you are looking for. Here I am searching for the description `"I should really have a better hobby, but this is it..."` (in blue) in [my page](https://mediux.pro/user/r3draid3r04), and I can also see the `self.__next_f.push` (in green). ![](./src/Capture%20d’écran%202024-12-12%20à%2015.44.11.png)

2. Then I write this simple script to parse and dump the flight data of my website, so I can see which objects I am looking for:

   ```py
   import requests
   import njsparser
   import json

   # Here I get my page's html
   response = requests.get("https://mediux.pro/user/r3draid3r04").text

   # Then I parse it with njsparser
   fd = njsparser.BeautifulFD(response)

   # Then I write the content of the flight data to a json file
   with open("fd.json", "w") as write:
       # I use the njsparser.default function to support the dump of the flight data objects.
       json.dump(fd, write, indent=4, default=njsparser.default)
   ```

3. In my dumped flight data, I will search for the same string: ![](./src/Capture%20d’écran%202024-12-12%20à%2015.51.01.png)

4. Then I go to the closest `"value"` root above the string I found, and look at the value of `"cls"`. Here it is `"Data"`: ![](./src/Capture%20d’écran%202024-12-12%20à%2015.51.17.png)

5. Now that I know the `"cls"` (class) of the object my data is contained in, I can search for it in my `BeautifulFD` object:

   ```py
   import requests
   import njsparser
   import json

   # Here I get my page's html
   response = requests.get("https://mediux.pro/user/r3draid3r04").text

   # Then I parse it with njsparser
   fd = njsparser.BeautifulFD(response)

   # Then I iterate over the different `Data` objects in my flight data.
   for data in fd.find_iter([njsparser.T.Data]):
       # I make sure that the content of my data is not None, and
       # check if the key `"user"` is in the data's content. If it is,
       # I break out of the search loop.
       if data.content is not None and "user" in data.content:
           break
   else:
       # If I didn't find it, I raise an error
       raise ValueError

   # Now I have the data of my user
   user = data.content["user"]

   # And I can print the string I was searching for before
   print(user["tagline"])
   ```
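Steps 3 and 4 can also be done programmatically instead of by eye. The helper below is a sketch under assumptions, not part of `njsparser`: it assumes the dumped JSON represents each flight data object as a dict holding both `"cls"` and `"value"` keys, as in the screenshots above. `find_cls_for` and `sample_dump` are hypothetical names, and the sample is a hand-written stand-in for a real `fd.json`.

```python
def find_cls_for(node, needle, nearest_cls=None):
    """Yield the nearest enclosing "cls" for each string containing `needle`."""
    if isinstance(node, dict):
        # Assumption: a dumped object looks like {"cls": ..., "value": ...}.
        if "cls" in node and "value" in node:
            nearest_cls = node["cls"]
        for child in node.values():
            yield from find_cls_for(child, needle, nearest_cls)
    elif isinstance(node, list):
        for child in node:
            yield from find_cls_for(child, needle, nearest_cls)
    elif isinstance(node, str) and needle in node:
        yield nearest_cls


# Hypothetical, much-reduced stand-in for the content of fd.json.
sample_dump = {
    "5": {
        "cls": "DataContainer",
        "value": [
            {
                "cls": "Data",
                "value": ["$", "$L17", None, {"user": {"tagline": "a better hobby"}}],
            },
        ],
    },
}

found = list(find_cls_for(sample_dump, "better hobby"))
print(found)
```

On a real dump, you would load the file with `json.load(open("fd.json"))` and pass the result in place of `sample_dump`. The innermost class reported is the one to query with `find_iter`.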

More information:

- If your object is inside another object (e.g. `"Data"` in a `"DataParent"`, or in a `"DataContainer"`), `.find_iter` will also find it recursively (unless you set `recursive=False`).

- Make sure you use the correct attributes of the flight data classes when fetching their data. The class `"Data"` has a `.content` attribute. If you use `.value`, you will end up with the raw value and will have to parse it yourself. If you work with a `"DataParent"` object, instead of using `.value` (that will give you `["$", "$L16", None, {"children": ["$", "$L17", None, {"profile": {}}]}]`), use `.children` (that will give you a `"Data"` object with a `.content` of `{"profile": {}}`). Check the [type file](njsparser/parser/types.py) to see which classes you're interested in, and their attributes.

- You can also use `.find` on `BeautifulFD` to return only the first occurrence of your query, or `None` if not found.
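To show what "parse it yourself" would involve for the raw `.value` quoted above, here is a small sketch. It is not the library's code: it assumes an element is a 4-item list of the form `["$", reference, key, props]` and simply follows `children` down to the innermost props. `unwrap_children` is a hypothetical helper.

```python
def unwrap_children(raw_value):
    """Follow nested `children` elements and return the innermost props."""
    props = raw_value
    # Assumption: an element is a 4-item list ["$", reference, key, props].
    while isinstance(props, list) and len(props) == 4 and props[0] == "$":
        props = props[3]
        if isinstance(props, dict) and "children" in props:
            props = props["children"]
    return props


# The raw value from the bullet above.
raw = ["$", "$L16", None, {"children": ["$", "$L17", None, {"profile": {}}]}]
content = unwrap_children(raw)
print(content)
```

Using `.children` and `.content` avoids having to maintain this kind of unwrapping logic by hand.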

### Parsing `__NEXT_DATA__`

Just do:

```py
import njsparser

html_text = ...

data = njsparser.get_next_data(html_text)
```

If the page contains a `<script id='__NEXT_DATA__'>` script, it returns the JSON-loaded data; otherwise it returns `None`.
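For readers curious about what this lookup amounts to, the following standard-library sketch reproduces the described behavior on a hand-written page. It is an illustration only, since the library itself relies on lxml. `NextDataExtractor`, `get_next_data_sketch` and the sample HTML are hypothetical; the `props`, `pageProps` and `buildId` keys follow the usual layout of Next.js page data.

```python
import json
from html.parser import HTMLParser


class NextDataExtractor(HTMLParser):
    """Collects the text of the <script id="__NEXT_DATA__"> element."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.raw = None

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
            self._inside = True
            self.raw = ""

    def handle_data(self, data):
        if self._inside:
            self.raw += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def get_next_data_sketch(html_text):
    """Return the decoded JSON of the script, or None when it is absent."""
    parser = NextDataExtractor()
    parser.feed(html_text)
    parser.close()
    return json.loads(parser.raw) if parser.raw is not None else None


# Hypothetical page fragment, not taken from a real site.
sample_page = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"user": "novitae"}}, "buildId": "abc123"}'
    "</script></body></html>"
)

data = get_next_data_sketch(sample_page)
print(data["props"]["pageProps"])
```

Once loaded, the result is an ordinary `dict`, so page-specific data is usually reachable under `data["props"]["pageProps"]`.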
