{"id":23754012,"url":"https://github.com/novitae/njsparser","last_synced_at":"2025-09-05T02:32:46.752Z","repository":{"id":250883248,"uuid":"835735709","full_name":"novitae/njsparser","owner":"novitae","description":"A NextJS data parser, to scrape peacefully 🦩","archived":false,"fork":false,"pushed_at":"2024-12-27T12:05:42.000Z","size":3038,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-12-27T12:15:22.648Z","etag":null,"topics":["javascript","next","nextjs","parser","scraper","scraping"],"latest_commit_sha":null,"homepage":"","language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/novitae.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-07-30T12:34:14.000Z","updated_at":"2024-12-27T12:04:50.000Z","dependencies_parsed_at":"2024-12-02T22:28:05.555Z","dependency_job_id":"72ab76de-9279-4d51-8e11-19d8bef67891","html_url":"https://github.com/novitae/njsparser","commit_stats":null,"previous_names":["novitae/njsparser"],"tags_count":10,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/novitae%2Fnjsparser","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/novitae%2Fnjsparser/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/novitae%2Fnjsparser/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/novitae%2Fnjsparser/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/nov
itae","download_url":"https://codeload.github.com/novitae/njsparser/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":232020711,"owners_count":18461396,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["javascript","next","nextjs","parser","scraper","scraping"],"created_at":"2024-12-31T18:50:26.926Z","updated_at":"2024-12-31T18:50:27.597Z","avatar_url":"https://github.com/novitae.png","language":"HTML","funding_links":[],"categories":["HTML"],"sub_categories":[],"readme":"# NJSParser\nA powerful **parser** and **explorer** for any website built with [NextJS](https://nextjs.org).\n- Parses flight data (from the **`self.__next_f.push`** scripts).\n- Parses next data from the **`__NEXT_DATA__`** script.\n- Parses **build manifests**.\n- Searches for the **build id**.\n- Many other things ...\n\nIt uses only **lxml**, **orjson**, and **pydantic** to guarantee fast and efficient data parsing and processing.\n## Installation:\n```\npip install njsparser\n```\n## Use\n### CLI\nYou can run the CLI with any of 3 commands:\n- `njsp`\n- `njsparser`\n- `python3 -m njsparser.cli`\nIts only function is to display information about the website, like this:\n![](./src/Capture%20d’écran%202024-12-27%20à%2013.01.10.png)\nFor more information, use the `--help` argument with the command.\n### Parsing `__next_f`.\nThe data you find in `__next_f` is called flight data, and contains data in React format. 
You can parse it easily with `njsparser` as follows.\n\n*We will build a parser for the [flight data example](examples/flight_data.py)*\n\n1. In the website you want to parse, make sure you see `self.__next_f.push` at the beginning of the script that contains the data you are searching for. Here I am searching for the description `\"I should really have a better hobby, but this is it...\"` (in blue) in [my page](https://mediux.pro/user/r3draid3r04), and I can also see the `self.__next_f.push` (in green). ![](./src/Capture%20d’écran%202024-12-12%20à%2015.44.11.png)\n2. Then I write this simple script to parse and dump the flight data of my website, and see which objects I am searching for:\n   ```py\n   import requests\n   import njsparser\n   import json\n\n   # Here I get my page's html\n   response = requests.get(\"https://mediux.pro/user/r3draid3r04\").text\n   # Then I parse it with njsparser\n   fd = njsparser.BeautifulFD(response)\n   # Then I will write to json the content of the flight data\n   with open(\"fd.json\", \"w\") as write:\n       # I use the njsparser.default function to support the dump of the flight data objects.\n       json.dump(fd, write, indent=4, default=njsparser.default)\n   ```\n3. In my dumped flight data, I search for the same string: ![](./src/Capture%20d’écran%202024-12-12%20à%2015.51.01.png)\n4. Then I go to the closest `\"value\"` root of my found string, and look at the value of `\"cls\"`. Here it is `\"Data\"`: ![](./src/Capture%20d’écran%202024-12-12%20à%2015.51.17.png)\n5. 
Now that I know the `\"cls\"` (class) of the object my data is contained in, I can search for it in my `BeautifulFD` object:\n   ```py\n   import requests\n   import njsparser\n   import json\n\n   # Here I get my page's html\n   response = requests.get(\"https://mediux.pro/user/r3draid3r04\").text\n   # Then I parse it with njsparser\n   fd = njsparser.BeautifulFD(response)\n   # Then I iterate over the different classes `Data` in my flight data.\n   for data in fd.find_iter([njsparser.T.Data]):\n       # Then I make sure that the content of my data is not None, and\n       # check if the key `\"user\"` is in the data's content. If it is,\n       # then I break out of the search loop.\n       if data.content is not None and \"user\" in data.content:\n           break\n   else:\n       # If I didn't find it, I raise an error\n       raise ValueError\n\n   # Now I have the data of my user\n   user = data.content[\"user\"]\n   # And I can print the string I was searching for before\n   print(user[\"tagline\"])\n   ```\n\nMore information:\n- If your object is inside another object (e.g. `\"Data\"` in a `\"DataParent\"`, or in a `\"DataContainer\"`), `.find_iter` will also find it recursively (unless you set `recursive=False`).\n- Make sure you use the correct attributes of the flight data classes when fetching their data. The class `\"Data\"` has a `.content` attribute. If you use `.value`, you will end up with the raw value and will have to parse it yourself. If you work with a `\"DataParent\"` object, instead of using `.value` (that will give you `[\"$\", \"$L16\", None, {\"children\": [\"$\", \"$L17\", None, {\"profile\": {}}]}]`), use `.children` (that will give you a `\"Data\"` object with a `.content` of `{\"profile\": {}}`). 
Check the [type file](njsparser/parser/types.py) to see which classes you're interested in, and their attributes.\n- You can also use `.find` on `BeautifulFD` to return only the first occurrence of your query, or `None` if not found.\n\n### Parsing `\u003cscript id='__NEXT_DATA__'\u003e`\nJust do:\n```py\nimport njsparser\n\nhtml_text = ...\ndata = njsparser.get_next_data(html_text)\n```\nIf the page contains a `\u003cscript id='__NEXT_DATA__'\u003e` script, it returns the JSON-loaded data; otherwise it returns `None`.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnovitae%2Fnjsparser","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fnovitae%2Fnjsparser","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fnovitae%2Fnjsparser/lists"}