{"id":23412968,"url":"https://github.com/iofu728/spider","last_synced_at":"2025-04-12T04:03:45.760Z","repository":{"id":96784367,"uuid":"152982734","full_name":"iofu728/spider","owner":"iofu728","description":"🕷some website spider application base on proxy pool (support http \u0026 websocket)","archived":false,"fork":false,"pushed_at":"2021-12-11T13:08:31.000Z","size":3367,"stargazers_count":111,"open_issues_count":3,"forks_count":36,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-03-25T23:34:53.477Z","etag":null,"topics":["proxy-ip","spider","spider-press","stress-testing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iofu728.png","metadata":{"files":{"readme":"README.md","changelog":"news/news.py","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-10-14T14:28:20.000Z","updated_at":"2024-12-04T03:58:43.000Z","dependencies_parsed_at":null,"dependency_job_id":"5ebd8a45-97ac-48ce-97d8-106d117de33c","html_url":"https://github.com/iofu728/spider","commit_stats":null,"previous_names":[],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iofu728%2Fspider","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iofu728%2Fspider/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iofu728%2Fspider/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iofu728%2Fspider/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iofu728","download
_url":"https://codeload.github.com/iofu728/spider/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248514227,"owners_count":21116903,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["proxy-ip","spider","spider-press","stress-testing"],"created_at":"2024-12-22T18:27:13.583Z","updated_at":"2025-04-12T04:03:45.750Z","avatar_url":"https://github.com/iofu728.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n\u003ca href=\"https://wyydsb.xin\" target=\"_blank\" rel=\"noopener noreferrer\"\u003e\n\u003cimg width=\"100\" src=\"https://cdn.nlark.com/yuque/0/2018/jpeg/104214/1540358574166-46cbbfd2-69fa-4406-aba9-784bf65efdf9.jpeg\" alt=\"Spider logo\"\u003e\u003c/a\u003e\u003c/p\u003e\n\u003ch1 align=\"center\"\u003eSpider Man\u003c/h1\u003e\n\n[![GitHub](https://img.shields.io/github/license/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider/blob/master/LICENSE)\n[![GitHub tag](https://img.shields.io/github/tag/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider/releases)\n[![GitHub code size in bytes](https://img.shields.io/github/languages/code-size/iofu728/spider.svg?style=popout-square)](https://github.com/iofu728/spider)\n\n\u003cdiv align=\"center\"\u003e\u003cstrong\u003e高可用代理IP池 高并发生成器 一些实战经验\u003c/strong\u003e\u003c/div\u003e\n\u003cdiv align=\"center\"\u003e\u003cstrong\u003eHighly Available Proxy IP Pool, Highly Concurrent Request Builder, Some 
Application\u003c/strong\u003e\u003c/div\u003e\n\n## Navigation\n\n| site                 | document                                  | Last Modified time |\n| -------------------- | ----------------------------------------- | ------------------ |\n| some proxy site,etc. | [Proxy pool](#proxy-pool)                 | 20-06-01           |\n| music.163.com        | [Netease](#netease)                       | 18-10-21           |\n| -                    | [Press Test System](#press-test-system)   | 18-11-10           |\n| news.baidu.com       | [News](#news)                             | 19-01-25           |\n| note.youdao.com      | [Youdao Note](#youdao-note)               | 20-01-04           |\n| jianshu.com/csdn.net | [blog](#blog)                             | 20-01-04           |\n| elective.pku.edu.cn  | [Brush Class](#brush-class)               | 19-10-11           |\n| zimuzu.tv            | [zimuzu](#zimuzu)                         | 19-04-13           |\n| bilibili.com         | [Bilibili](#bilibili)                     | 20-06-06           |\n| exam.shaoq.com       | [shaoq](#shaoq)                           | 19-03-21           |\n| data.eastmoney.com   | [Eastmoney](#eastmoney)                   | 19-03-29           |\n| hotel.ctrip.com      | [Ctrip Hotel Detail](#ctrip-hotel-detail) | 19-10-11           |\n| douban.com           | [DouBan](#douban)                         | 19-05-07           |\n| 66ip.cn              | [66ip](#66ip)                             | 19-05-07           |\n\n## keyword\n\n- Big data store\n- High concurrency requests\n- Support WebSocket\n- method for font cheat\n- method for js compile\n- Some Application\n\n## Quick Start\n\n`docker` is on the road.\n\n```bash\n$ git clone https://github.com/iofu728/spider.git\n$ cd spider\n$ pip install -r requirement.txt\n\n# load proxy pool\n$ python proxy/getproxy.py                             # to load proxy resources\n```\n\n\u003e To use proxy pool\n\n```python\n''' using proxy 
requests '''\nfrom proxy.getproxy import GetFreeProxy                # to use proxy\nproxy_req = GetFreeProxy().proxy_req\nproxy_req(url:str, types:int, data=None, test_func=None, header=None)\n\n''' using basic requests '''\nfrom util.util import basic_req\nbasic_req(url: str, types: int, proxies=None, data=None, header=None, need_cookie: bool = False)\n```\n\n## Structure\n\n```bash\n.\n├── LICENSE\n├── README.md\n├── bilibili\n│   ├── analysis.py                // data analysis\n│   ├── bilibili.py                // bilibili basic\n│   └── bsocket.py                 // bilibili websocket\n├── blog\n│   └── titleviews.py              // Zhihu \u0026\u0026 CSDN \u0026\u0026 jianshu\n├── brushclass\n│   └── brushclass.py              // PKU elective\n├── buildmd\n│   └── buildmd.py                 // Youdao Note\n├── eastmoney\n│   └── eastmoney.py               // font analysis\n├── exam\n│   ├── shaoq.js                   // jsdom\n│   └── shaoq.py                   // compile js shaoq\n├── log\n├── netease\n│   ├── netease_music_base.py\n│   ├── netease_music_db.py        // Netease Music\n│   └── table.sql\n├── news\n│   └── news.py                    // Google \u0026\u0026 Baidu\n├── press\n│   └── press.py                   // Press test\n├── proxy\n│   ├── getproxy.py                // Proxy pool\n│   └── table.sql\n├── requirement.txt\n├── utils\n│   ├── db.py\n│   └── utils.py\n└── zimuzu\n    └── zimuzu.py                  // zimuzu\n```\n\n## Proxy pool\n\n\u003e The proxy pool is the heart of this project.\n\n- Highly Available Proxy IP Pool\n  - Obtains data from `Gatherproxy`, `Goubanjia`, `xici` etc. 
free proxy websites\n  - Analysis of the Goubanjia port data\n  - Quickly verify IP availability\n  - Works with Requests to assign proxy IPs automatically, with a retry mechanism and failures written to the DB\n  - two models for the proxy shell\n    - model 1: load the gather proxy list \u0026\u0026 update the proxy list file (requires getting past the GFW; put your personal http://gatherproxy.com credentials in `proxy/data/passage`, one line for the username, one line for the password)\n    - model 0: update the proxy pool DB \u0026\u0026 test availability\n  - one common proxy API\n    - `from proxy.getproxy import GetFreeProxy`\n    - `proxy_req = GetFreeProxy().proxy_req`\n    - `proxy_req(url: str, types: int, data=None, test_func=None, header=None)`\n  - also one common basic request API\n    - `from util import basic_req`\n    - `basic_req(url: str, types: int, proxies=None, data=None, header=None)`\n  - if you want to crawl through the proxy pool\n    - accessing the proxy website requires getting past the GFW, so you may not be able to use `model 1` to download the proxy file.\n    - download the proxy txt from 'http://gatherproxy.com'\n    - cp download_file proxy/data/gatherproxy\n    - python proxy/getproxy.py --model=0\n\n## Netease\n\n\u003e Netease Music song playlist crawl - [netease/netease_music_db.py](https://github.com/iofu728/spider/blob/master/netease/netease_music_db.py)\n\n- problem: `big data store`\n- classify -\u003e playlist id -\u003e song_detail\n- V1: writes to file, single-run version, no proxy, no progress recording\n- V1.5: small number of proxy IPs\n- V2: proxy IP pool, progress recording, writes to MySQL\n\n  - Optimize the DB write with `Load data/ Replace INTO`\n\n- [Netease Music Spider for DB](https://wyydsb.xin/other/neteasedb.html)\n- [Netease Music Spider](https://wyydsb.xin/other/netease.html)\n\n## Press Test System\n\n\u003e Press Test System - [press/press.py](https://github.com/iofu728/spider/blob/master/press/press.py)\n\n- problem: `high concurrency requests`\n- Uses the highly available proxy IP pool to imitate real users.\n- 
Puts uneven pressure on a web service\n- To do: uniform pressure\n\n## News\n\n\u003e google \u0026 baidu info crawl - [news/news.py](https://github.com/iofu728/spider/blob/master/news/news.py)\n\n- get news from search engines through the Proxy Engine\n- one model: careful analysis of the `DOM`\n- the other model: rough analysis of `Chinese words`\n\n## Youdao Note\n\n\u003e Youdao Note documents crawl - [buildmd/buildmd.py](https://github.com/iofu728/spider/blob/master/buildmd/buildmd.py)\n\n- load data from `youdaoyun`\n- convert the data to .md with a series of rules\n\n## blog\n\n\u003e csdn \u0026\u0026 zhihu \u0026\u0026 jianshu view info crawl - [blog/titleviews.py](https://github.com/iofu728/spider/blob/master/blog/titleviews.py)\n\n```bash\n$ python blog/titleviews.py --model=1 \u003e\u003e log 2\u003e\u00261 # model = 1: load gather model or python blog/titleviews.py --model=1 \u003e\u003e proxy.log 2\u003e\u00261\n$ python blog/titleviews.py --model=0 \u003e\u003e log 2\u003e\u00261 # model = 0: update gather model\n```\n\n## Brush Class\n\n\u003e PKU Class brush - [brushclass/brushclass.py](https://github.com/iofu728/spider/blob/master/brushclass/brushclass.py)\n\n- when the class you want has open places, it sends you an email.\n\n## zimuzu\n\n\u003e ZiMuZu download list crawl - [zimuzu/zimuzu.py](https://github.com/iofu728/spider/blob/master/zimuzu/zimuzu.py)\n\n- Useful when you want to download many seasons of a show, like Season 22 and Season 21.\n- Clicking the links one by one is tedious, so zimuzu.py is all you need.\n- All you need to do is wait for the program to finish.\n- Then copy the Thunder URLs to download the episodes.\n- Winter is coming, so you may want it to rewatch `\u003cGame of Thrones\u003e`.\n\n## Bilibili\n\n\u003e Get av data by http - [bilibili/bilibili.py](https://github.com/iofu728/spider/blob/master/bilibili/bilibili.py)\n\n- `homepage rank` -\u003e check `tids` -\u003e check data every 2min (while on the rank + one day)\n- monitor every ranked av -\u003e 
star num \u0026 basic data\n\n\u003e Get av data by websocket - [bilibili/bsocket.py](https://github.com/iofu728/spider/blob/master/bilibili/bsocket.py)\n\n- based on WebSocket\n- byte analysis\n- heartbeat\n\n\u003e Get comment data by http - [bilibili/bilibili.py](https://github.com/iofu728/spider/blob/master/bilibili/bilibili.py)\n\n- load comments from `/x/v2/reply`\n\n- UnicodeEncodeError: 'ascii' codec can't encode characters in position 7-10: ordinal not in range(128)\n\n  - read/write files in `utf-8`\n  - with codecs.open(filename, 'r/w', encoding='utf-8')\n\n- some `bilibili` URLs return 404, like `http://api.bilibili.com/x/relation/stat?jsonp=jsonp\u0026callback=__jp11\u0026vmid=`\n\n  `basic_req` automatically adds `host` to the headers, but this URL cannot be requested with that `Host` header\n\n## shaoq\n\n\u003e Get text data by compiling javascript - [exam/shaoq.py](https://github.com/iofu728/spider/blob/master/exam/shaoq.py)\n\n- Idea\n\n  1. get cookie\n  2. request image\n  3. send the request after 5.5s\n  4. compile javascript code -\u003e get css\n  5. 
analyze the CSS\n\n- Requirement\n\n  ```sh\n  pip3 install PyExecJS\n  yarn add jsdom # or: npm install jsdom (install locally, not globally)\n  ```\n\n- Can't get the real HTML\n\n  - The wait time must be 5.5s.\n  - So you can use `threading` or `await asyncio.gather` to request the image\n\n  - [Coroutines and Tasks](https://docs.python.org/3/library/asyncio-task.html)\n\n- Error: Cannot find module 'jsdom'\n\n  \u003e jsdom must be installed locally, not globally\n\n  - [Cannot find module 'jsdom'](https://github.com/scala-js/scala-js/issues/2642)\n\n- remove subtree \u0026 edit subtree \u0026 re.findall\n\n  ```py\n  import re\n\n  subtree.extract()                     # remove the subtree from the document\n  subtree.string = new_string           # replace the subtree's text\n  parent_tree.find_all(re.compile(''))  # match tag names by regex (fill in the pattern)\n  ```\n\n  - [extract()](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#extract)\n  - [NavigableString](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigablestring)\n  - [A regular expression](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#a-regular-expression)\n\n## Eastmoney\n\n\u003e Get stock info by analyzing the font - [eastmoney/eastmoney.py](https://github.com/iofu728/spider/blob/master/eastmoney/eastmoney.py)\n\n- font analysis\n\n- Idea\n\n  1. get data from HTML -\u003e json\n  2. get font map -\u003e transform num\n  3. 
or load the font and analyze it (compare with a base font)\n\n- error: unpack requires a buffer of 20 bytes\n\n  - requests `.text` -\u003e str,\n  - requests `.content` -\u003e bytes (parse the font from the bytes)\n\n  - [Struct.error: unpack requires a buffer of 16 bytes](https://stackoverflow.com/questions/51110525/struct-error-unpack-requires-a-buffer-of-16-bytes)\n\n- How to analyze the font\n\n  - use fonttools\n  - get TTFont().getBestCmap()\n  - compare with the base font\n\n- configuration file\n\n  - cfg = ConfigParser()\n  - cfg.read(assign_path, 'utf-8')\n  - [13.10 Reading configuration files](https://python3-cookbook.readthedocs.io/zh_CN/latest/c13/p10_read_configuration_files.html)\n\n## Ctrip Hotel Detail\n\n\u003e Get Ctrip Hotel True Detail - [ctrip/hotelDetail.py](https://github.com/iofu728/spider/blob/master/ctrip/hotelDetail.py)\n\n- int32\n\n  ```python\n  import numpy as np\n  np.int32()\n  ```\n\n- js charCodeAt() in py\n\n  [How do you implement JS's charCodeAt() method in Python?](https://www.zhihu.com/question/57108214)\n\n  ```python\n  ord(string[index])\n  ```\n\n- importing from another folder in Python\n\n  ```python\n  import os\n  import sys\n  sys.path.append(os.getcwd())\n  ```\n\n- generate char list\n\n  using ASCII\n\n  ```python\n  lower_char = [chr(i) for i in range(97,123)] # a-z\n  upper_char = [chr(i) for i in range(65,91)]  # A-Z\n  ```\n\n- Can't get cookie in `document.cookie`\n\n  The server uses `HttpOnly` in `Set-Cookie`\n\n  - [Why doesn't document.cookie show all the cookie for the site?](https://stackoverflow.com/questions/1022112/why-doesnt-document-cookie-show-all-the-cookie-for-the-site)\n  - [Secure and HttpOnly](https://en.wikipedia.org/wiki/HTTP_cookie#Secure_and_HttpOnly)\n\n  \u003e The Secure attribute is meant to keep cookie communication limited to encrypted transmission, directing browsers to use cookies only via secure/encrypted connections. However, if a web server sets a cookie with a secure attribute from a non-secure connection, the cookie can still be intercepted when it is sent to the user by **man-in-the-middle attacks**. 
Therefore, for maximum security, cookies with the Secure attribute should only be set over a secure connection.\n  \u003e\n  \u003e The HttpOnly attribute directs browsers not to expose cookies through channels other than HTTP (and HTTPS) requests. This means that the cookie cannot be accessed via client-side scripting languages (notably JavaScript), and therefore cannot be stolen easily via cross-site scripting (a pervasive attack technique).\n\n- ctrip cookie analysis\n\n| key                           | method | how                                                                                                 | constant | login | finish |\n| ----------------------------- | ------ | --------------------------------------------------------------------------------------------------- | -------- | ----- | ------ |\n| `magicid`                     | set    | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 1        | 0     | 1      |\n| `ASP.NET_SessionId`           | set    | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 1        | 0     | 1      |\n| `clientid`                    | set    | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 1        | 0     | 1      |\n| `_abtest_userid`              | set    | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 1        | 0     | 1      |\n| `hoteluuid`                   | js     | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 1        | 0     |\n| `fcerror`                     | js     | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 1        | 0     |\n| `_zQdjfing`                   | js     | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 1  
      | 0     |\n| `OID_ForOnlineHotel`          | js     | `https://webresource.c-ctrip.com/ResHotelOnline/R8/search/js.merge/showhotelinformation.js`         | 1        | 0     |\n| `_RSG`                        | req    | `https://cdid.c-ctrip.com/chloro-device/v2/d`                                                       | 1        | 0     |\n| `_RDG`                        | req    | `https://cdid.c-ctrip.com/chloro-device/v2/d`                                                       | 1        | 0     |\n| `_RGUID`                      | set    | `https://cdid.c-ctrip.com/chloro-device/v2/d`                                                       | 1        | 0     |\n| `_ga`                         | js     | for google analysis                                                                                 | 1        | 0     |\n| `_gid`                        | js     | for google analysis                                                                                 | 1        | 0     |\n| `MKT_Pagesource`              | js     | `https://webresource.c-ctrip.com/ResUnionOnline/R3/float/floating_normal.min.js`                    | 1        | 0     |\n| `_HGUID`                      | js     | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 1        | 0     |\n| `HotelDomesticVisitedHotels1` | set    | `https://hotels.ctrip.com/Domestic/tool/AjaxGetHotelAddtionalInfo.ashx`                             | 1        | 0     |\n| `_RF1`                        | req    | `https://cdid.c-ctrip.com/chloro-device/v2/d`                                                       | 1        | 0     |\n| `appFloatCnt`                 | js     | `https://webresource.c-ctrip.com/ResUnionOnline/R3/float/floating_normal.min.js?20190428`           | 1        | 0     |\n| `gad_city`                    | set    | `https://crm.ws.ctrip.com/Customer-Market-Proxy/AdCallProxyV2.aspx`                                 | 1        | 0     |\n| 
`login_uid`                   | set    | `https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`                                             | 1        | 1     |\n| `login_type`                  | set    | `https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`                                             | 1        | 1     |\n| `cticket`                     | set    | `https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`                                             | 1        | 1     |\n| `AHeadUserInfo`               | set    | `https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`                                             | 1        | 1     |\n| `ticket_ctrip`                | set    | `https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`                                             | 1        | 1     |\n| `DUID`                        | set    | `https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`                                             | 1        | 1     |\n| `IsNonUser`                   | set    | `https://accounts.ctrip.com/ssoproxy/ssoCrossSetCookie`                                             | 1        | 1     |\n| `UUID`                        | req    | `https://passport.ctrip.com/gateway/api/soa2/12770/setGuestData`                                    | 1        | 1     |\n| `IsPersonalizedLogin`         | js     | `https://webresource.c-ctrip.com/ares2/basebiz/cusersdk/~0.0.8/default/login/1.0.0/loginsdk.min.js` | 1        | 1     |\n| `_bfi`                        | js     | `https://webresource.c-ctrip.com/code/ubt/_bfa.min.js?v=20193_28.js`                                | 1        | 0     |\n| `_jzqco`                      | js     | `https://webresource.c-ctrip.com/ResUnionOnline/R1/remarketing/js/mba_ctrip.js`                     | 1        | 0     |\n| `__zpspc`                     | js     | `https://webresource.c-ctrip.com/ResUnionOnline/R1/remarketing/js/s.js`                             | 1        | 0     |\n| `_bfa`              
          | js     | `https://webresource.c-ctrip.com/code/ubt/_bfa.min.js?v=20193_28.js`                                | 1        | 0     |\n| `_bfs`                        | js     | `https://webresource.c-ctrip.com/code/ubt/_bfa.min.js?v=20193_28.js`                                | 1        | 0     |\n| `utc`                         | js     | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 0        | 0     | 1      |\n| `htltmp`                      | js     | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 0        | 0     | 1      |\n| `htlstm`                      | js     | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 0        | 0     | 1      |\n| `arp_scroll_position`         | js     | `https://hotels.ctrip.com/hotel/xxx.html`                                                           | 0        | 0     | 1      |\n\n- some fusion in ctrip\n\n  ```js\n  function a31(a233, a23, a94) {\n    var a120 = {\n      KWcVI: \"mMa\",\n      hqRkQ: function a272(a309, a20) {\n        return a309 + a20;\n      },\n      WILPP: function a69(a242, a488) {\n        return a242(a488);\n      },\n      ydraP: function a293(a338, a255) {\n        return a338 == a255;\n      },\n      ceIER: \";expires=\",\n      mDTlQ: function a221(a234, a225) {\n        return a234 + a225;\n      },\n      dnvrD: function a268(a61, a351) {\n        return a61 + a351;\n      },\n      DIGJw: function a368(a62, a223) {\n        return a62 == a223;\n      },\n      pIWEz: function a260(a256, a284) {\n        return a256 + a284;\n      },\n      jXvnT: \";path=/\",\n    };\n    if (a120[\"KWcVI\"] !== a120[\"KWcVI\"]) {\n      var a67 = new Date();\n      a67[a845(\"0x1a\", \"4Vqw\")](\n        a120[a845(\"0x1b\", \"RswF\")](a67[\"getDate\"](), a94)\n      );\n      document[a845(\"0x1c\", \"WjvM\")] =\n        
a120[a845(\"0x1d\", \"3082\")](a233, \"=\") +\n        a120[a845(\"0x1e\", \"TDHu\")](escape, a23) +\n        (a120[\"ydraP\"](a94, null)\n          ? \"\"\n          : a120[\"hqRkQ\"](a120[\"ceIER\"], a67[a845(\"0x1f\", \"IErH\")]())) +\n        a845(\"0x20\", \"eHIq\");\n    } else {\n      var a148 = a921(this, function() {\n        var a291 = function() {\n            return \"dev\";\n          },\n          a366 = function() {\n            return \"window\";\n          };\n        var a198 = function() {\n          var a168 = new RegExp(\"\\\\w+ *\\\\(\\\\) *{\\\\w+ *[' | '].+[' | '];? *}\");\n          return !a168[\"test\"](a291[\"toString\"]());\n        };\n        var a354 = function() {\n          var a29 = new RegExp(\"(\\\\[x|u](\\\\w){2,4})+\");\n          return a29[\"test\"](a366[\"toString\"]());\n        };\n        var a243 = function(a2) {\n          var a315 = ~-0x1 \u003e\u003e (0x1 + (0xff % 0x0));\n          if (a2[\"indexOf\"](\"i\" === a315)) {\n            a310(a2);\n          }\n        };\n        var a310 = function(a213) {\n          var a200 = ~-0x4 \u003e\u003e (0x1 + (0xff % 0x0));\n          if (a213[\"indexOf\"]((!![] + \"\")[0x3]) !== a200) {\n            a243(a213);\n          }\n        };\n        if (!a198()) {\n          if (!a354()) {\n            a243(\"indÐµxOf\");\n          } else {\n            a243(\"indexOf\");\n          }\n        } else {\n          a243(\"indÐµxOf\");\n        }\n      });\n      // a148();\n      var a169 = new Date();\n      a169[\"setDate\"](a169[\"getDate\"]() + a94);\n      document[\"cookie\"] = a120[\"mDTlQ\"](\n        a120[\"dnvrD\"](\n          a120[\"dnvrD\"](a120[\"dnvrD\"](a233, \"=\"), escape(a23)),\n          a120[\"DIGJw\"](a94, null)\n            ? 
\"\"\n            : a120[\"pIWEz\"](a120[\"ceIER\"], a169[\"toGMTString\"]())\n        ),\n        a120[\"jXvnT\"]\n      );\n    }\n  }\n  ```\n\n  equal to\n\n  ```js\n  document[\"cookie\"] =\n    a233 +\n    \"=\" +\n    escape(a23) +\n    (a94 == null ? \"\" : \";expires=\" + a169[\"toGMTString\"]()) +\n    \";path=/\";\n  ```\n\n  So, It is only a function to set cookie \u0026 expires.\n\n  And you can think `a31` is a entry point to judge where code about compiler cookie.\n\n- Get current timezone offset\n\n  ```python\n  import datetime, tzlocal\n  local_tz = tzlocal.get_localzone()\n  timezone_offset = -int(local_tz.utcoffset(datetime.datetime.today()).total_seconds() / 60)\n  ```\n\n- JSON.stringfy(e)\n\n  ```python\n  import json\n  json.dumps(e, separators=(',', ':'))\n  ```\n\n  - [JSON.stringify (Javascript) and json.dumps (Python) not equivalent on a list?](https://stackoverflow.com/questions/46227854/json-stringify-javascript-and-json-dumps-python-not-equivalent-on-a-list)\n\n- Element​.get​Bounding​Client​Rect()\n\n  return Element position\n\n  - [Element​.get​Bounding​Client​Rect()](https://developer.mozilla.org/en-US/docs/Web/API/Element/getBoundingClientRect)\n  - [​Event​Target​.add​Event​Listener()](https://developer.mozilla.org/en-US/docs/Web/API/EventTarget/addEventListener)\n\n## DouBan\n\n- RuntimeError: dictionary changed size during iteration (when user pickle)\n\n  - This situation maybe happen when your pickle params change in pickling.\n  - so copy of your params before pickle\n\n  ```python\n  comment_loader = comment.copy()\n  dump_bigger(comment_loader, '{}data.pkl'.format(data_dir))\n  ```\n\n  [How to avoid “RuntimeError: dictionary changed size during iteration” error?](https://stackoverflow.com/questions/11941817/how-to-avoid-runtimeerror-dictionary-changed-size-during-iteration-error)\n  [pickling SimpleLazyObject fails just after accessing related object of wrapped model 
instance.](https://code.djangoproject.com/ticket/25426)\n\n- RecursionError: maximum recursion depth exceeded while pickling an object\n\n  - the object's nesting depth exceeds the maximum recursion depth\n\n  ```python\n  import sys\n  sys.setrecursionlimit(10000)\n  ```\n\n## 66ip\n\n\u003e Q: @liu wong A piece of JS code gives a different result when run in the browser than when run in Python with execjs. What could be the reason? http://www.66ip.cn/\n\n\u003e A: Differences in eval results usually come from the runtime environment, the DOM, the different string-escaping rules of py and js, the context, and so on.\n\u003e A site like 66ip mainly exploits the different string rules of py and js plus the DOM; of course this may also be unintentional (after all, crawler engineers don't only use py).\n\u003e The first visit to 66ip returns a 521 response, with an HTTP-only cookie in the header and a script in the body.\n\n```js\nvar x = \"@...\".replace(/@*$/, \"\").split(\"@\"),\n  y = \"...\",\n  f = function(x, y) {\n    return num;\n  },\n  z = f(\n    y\n      .match(/\\w/g)\n      .sort(function(x, y) {\n        return f(x) - f(y);\n      })\n      .pop()\n  );\nwhile (z++)\n  try {\n    eval(\n      y.replace(/\\b\\w+\\b/g, function(y) {\n        return x[f(y, z) - 1] || \"_\" + y;\n      })\n    );\n    break;\n  } catch (_) {}\n```\n\n\u003e You can see that what gets passed to eval is the string y after substitution with the array x, so in principle it should not depend on the runtime environment. But after changing eval to aa, the result in py differs from the result in node and chrome.\n\u003e This is because in py the regex `\\b` is escaped to `\\x08`, so the regex matches nothing and no replacement happens, and the eval_script we get is actually garbage.\n\u003e Here r'{}'.format(eval_script) is used to keep special characters from being escaped.\n\u003e What remains is to apply the dom replacement to the resulting eval_script.\n\u003e Overall this is a nice beginner exercise in js reverse engineering: the code is short and the logic is clear.\n\u003e See [iofu728/spider](https://github.com/iofu728/spider/blob/master/proxy/ip66.py) for the full code.\n\n![image](https://cdn.nlark.com/yuque/0/2019/png/104214/1557240022438-bc891ec5-7bbc-412a-b4d4-f330608d21f0.png)\n\n## OceanBall V2\n\ncheck param list:\n\n| param        | Ctrip | Incognito | Node | !!import |\n| ------------ | ----- | --------- | ---- | -------- |\n| define       | ✔     | x         | x    |\n| \\_\\_filename | x     | x         | x    |\n| module       | x     | x         | ✔    | x        |\n| process      | ✔     | x         | ✔    |\n| \\_\\_dirname  | ✔     | x         | x    |\n| global       | x     | x         | 
✔    | x        |\n| INT_MAX      | ✔     | x         | x    |\n| require      | ✔     | x         | ✔    | ✔        |\n| History      | ✔     | x         |\n| Location     | ✔     | x         |\n| Window       | ✔     | x         |\n| Document     | ✔     | x         |\n| window       | ✔     | x         |\n| navigator    | ✔     | x         |\n| history      | ✔     | x         |\n\n**----To be continued----**\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiofu728%2Fspider","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiofu728%2Fspider","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiofu728%2Fspider/lists"}