https://github.com/ruichongliu/Crawler_pubg.op.gg

This is a web crawler for pubg.op.gg, written by Ruichong Liu. It scrapes PUBG (PlayerUnknown's Battlegrounds) game data.

beautifulsoup4 crawler pubg python3 scrape selenium

# Web Crawler for *pubg.op.gg*
In this file, I will walk you through the code I wrote for this web crawler. There is not much documentation inside the individual files, because most of the code is tedious but straightforward if you have done any scraping before. Instead, I will introduce the idea behind this project and clarify some of the code here. If the whole thing still doesn't make sense after reading this file, don't hesitate to leave a comment somewhere in this repo.

Before the detailed introduction, check that you have **Python 3.6** installed on your machine, because that is the version I used. If you do, move on to installing the dependencies.

```shell
pip install bs4
pip install lxml
pip install requests
pip install selenium
```

## Composition
This project consists of four separate code files: three do the actual work and the fourth is a simple wrapper.

###### main.py
This is a simple wrapper. Just remember to replace *zsda123* (in *Line 13*) with a **real** PUBG username.

###### finder.py
This file plays an important role in the project, because the user ID of each player on *[pubg.op.gg](https://pubg.op.gg)* is a seemingly random hexadecimal number. I did manage to find one pattern in the previous version of *finder.py*:

- User IDs have a cluster-like distribution. Each cluster contains 40 to 80 users.

I had my server run for a whole day, and it could not capture any cluster besides the starting one. Therefore, I gave up on this brute-force approach and went with a new one:

Find the other players who were in the same games as *user0*, in this case *zsda123*. If you want to capture a larger sample, change **line 73** to the following.
```python
q = nameToId(userMap(userMap(q)))
```
By calling *userMap* twice, I was able to capture 4970 distinct users, and it took me about an hour. The size of the data grows exponentially! Suppose there are 80 players in each game and you decide to call *userMap* 3 times. You should expect your server to do work proportional to 80 × 80 × 80 = **512,000** users, which could eat up all the memory of your server/PC. Be aware of that!
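
To make the growth concrete, here is a minimal sketch, not the code in *finder.py*. `players_met` is a hypothetical stand-in for one *userMap* step on a single user, and `max_users` is an assumed safety cap that the original code does not have.

```python
from typing import Callable, Iterable, Set


def worst_case_users(players_per_game: int, depth: int) -> int:
    """Upper bound on users reached after `depth` rounds of expansion."""
    return players_per_game ** depth


def expand_users(seeds: Iterable[str],
                 players_met: Callable[[str], Iterable[str]],
                 depth: int,
                 max_users: int) -> Set[str]:
    """Expand `seeds` up to `depth` times, stopping once `max_users` is reached.

    `players_met` is hypothetical: it should return the usernames that
    shared a game with the given user.
    """
    seen = set(seeds)
    frontier = set(seen)
    for _ in range(depth):
        next_frontier = set()
        for user in frontier:
            for other in players_met(user):
                if other in seen:
                    continue  # already captured, skip duplicates
                if len(seen) >= max_users:
                    return seen  # cap reached, stop before memory runs out
                seen.add(other)
                next_frontier.add(other)
        frontier = next_frontier
    return seen
```

Here `worst_case_users(80, 3)` gives the 512,000 figure from above. Real runs come out far smaller than the bound because many players overlap between games, which is why the set of seen users matters.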

Let's move on to the code itself. In *Line 19* and *Line 42*, the query variable server is set to *as*. You can change it to another server if you feel like it. Since the company I work for focuses on Greater China, most of the server settings are Asia, Southeast Asia, or Japan/Korea.

In *Line 28*, remember to change that value to a real *op.gg* player ID. The reason I have this line is that sometimes a player with the name **#unknown** appears ~~, due to their rubbish servers~~. Here is how to find a player ID:
- Go to *[pubg.op.gg](https://pubg.op.gg)*
- Search for a real PUBG player username
- Wait for the webpage to load fully
- Inspect the webpage and go to the Network section
- Scroll all the way to the bottom and click on More
- You should see a request that begins with *recent?*
- Check the Request URL; the player ID is the string after *https://pubg.op.gg/api/users/*

In *Line 40*, don't forget to change executable_path to the path of your driver. Since El Capitan, Mac users can no longer write to /usr/bin, so you can put the driver under /usr/local/bin and set executable_path the way I did. You also need **that browser installed** on your machine: if you use *geckodriver* without having Firefox, you will receive a bunch of warnings! If you do use Firefox, make the change like this.
```python
driver = webdriver.Firefox(executable_path = "/usr/local/bin/geckodriver")
```

In *Lines 44, 45 and 46*, you will see a big chunk of code. Don't worry about it; I did not type it up, and I would be mad if I had to. It was generated by *Xpath Finder*, a plugin for Chrome. Line 44 selects the games ranked outside the Top 10, Line 45 the Top 10 games, and Line 46 the Chicken Dinners. Clicking those buttons loads the game details into the webpage. Since the loading is asynchronous, I make the machine wait for a while (2 seconds proved unstable in testing, so I went with 3 seconds).
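
A fixed sleep is the simplest option, but polling is an alternative: keep checking until the content is there or a timeout passes. The helper below is a generic sketch of that idea and is not what *finder.py* does; `wait_until` and its parameters are names I made up. Selenium also ships its own explicit-wait helper, `WebDriverWait`, which serves the same purpose.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool],
               timeout: float = 10.0,
               interval: float = 0.25) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds pass.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With a driver, the condition could be something like a lambda that checks whether the game-detail elements have appeared on the page.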

The rest of the file is basically some beautiful soups, and I believe no further explanation is needed for that part.

###### scraper.py
I want to explain why I have a try statement in *Line 46*: some users simply don't play solo games, so there is no data to get for them. Also, you can change the parameters passed into *query()* to extract something different. Besides these, it is another bunch of beautiful soups.
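
To show what those parameters end up as, here is a hedged sketch that builds a request URL of the shape used in this README. `build_recent_url` is a hypothetical helper, not the signature of *query()*, and the parameter names are simply the query-string keys visible in the Request URL.

```python
from urllib.parse import urlencode

# Base path taken from the Request URL described in the finder section.
BASE = "https://pubg.op.gg/api/users"


def build_recent_url(player_id, season, server, mode="tpp",
                     queue_size="", after=None):
    """Build a recent-matches URL; an empty queue_size means no filter is sent."""
    params = [
        ("season", season),
        ("server", server),
        ("queue_size", queue_size),
        ("mode", mode),
    ]
    if after is not None:
        params.append(("after", after))  # offset used when loading more matches
    return "{}/{}/matches/recent?{}".format(BASE, player_id, urlencode(params))


if __name__ == "__main__":
    print(build_recent_url("59fe36049e49c400014a68eb", "2018-01", "na", after=100))
```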

###### reader.py
The data I collect is quite primitive, so I also have a simple reader to summarize my data.
```python
[userId, x['participant']['user']['nickname'], x['season'], x['server'],
x['queue_size'], x['mode'], x['participant']['stats']['combat']['kda']['kills']]
```
If you need something more complicated than what I did in my project, here is how to get it. Recall the Request URL from the finder section. If you visit a [Request URL](https://pubg.op.gg/api/users/59fe36049e49c400014a68eb/matches/recent?season=2018-01&server=na&queue_size=&mode=tpp&after=100), you will see a JSON object like this.
```JSON
{"params":{"server":"na","season":null,"queue_size":0,"mode":"tpp"},
"matches":{"summary":{"matches_cnt":20,"win_matches_cnt":1,"topten_matches_cnt":7,"ranks_avg":16.95,
"ranks_list":[6,24,18,43,42,14,11,16,3,1,29,7,19,18,2,12,9,2,16,47],
"kills_avg":2,"deaths_avg":0.95,"kills_max":6,"damage_avg":230.261890915,
"time_survived_avg":964.2993499999999,"modes":{"2":{"matches_cnt":5,"win_matches_cnt":0,"topten_matches_cnt":1,
"rating_delta_sum":4.185349880000004},"4":{"matches_cnt":13,"win_matches_cnt":0,"topten_matches_cnt":5,
"rating_delta_sum":10.264028235999994},"1":{"matches_cnt":2,"win_matches_cnt":1,"topten_matches_cnt":1,
"rating_delta_sum":151.97761028000002}}},"items":[{"season":"2017-pre6","server":"na","queue_size":2,"mode":"tpp",
"started_at":"2017-12-06T03:27:29+0000","total_rank":42,"offset":101,
"match_id":"2U4GBNA0YmnSRjFPiSEp6LaN-bpuG8kRbg6Rdt5PZpPKmHyludByUMHwbLTOzeEO",
"participant":{"_id":"5a276bd059e73b0001e5b828","user":{"nickname":"LexWynnZzWw",
"profile_url":"https:\/\/pubg.op.gg\/user\/LexWynnZzWw?server=na"},"stats":{"rank":6,
"rating_delta":40.009836480000004,"combat":{"time_survived":1802.317,"vehicle_destroys":0,
"win_place":6,"kill_place":4,"heals":5,"weapon_acquired":9,"boosts":4,"death_type":"byplayer",
"most_damage":0,"kda":{"kills":4,"assists":2,"kill_steaks":1,"road_kills":0,"team_kills":0,"headshot_kills":2,
"longest_kill":49.3916779},"distance_traveled":{"walk_distance":2522.22559,"ride_distance":3938.64038},
"damage":{"damage_dealt":482.832336},"dbno":{"knock_downs":2,"revives":0}}}},
"team":{"_id":24,"stats":{"rank":6},"participants":[]}}
...
]}}
```
You can parse the object and get something interesting to you from there.
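
As a starting point, this sketch turns one such JSON object into rows with the same fields the reader uses. It assumes the response keeps the shape shown above (`matches` → `items`); `rows_from_payload` is a name I made up, and items with missing fields are skipped rather than raising.

```python
import json


def rows_from_payload(user_id, payload):
    """Return one [id, nickname, season, server, queue_size, mode, kills] row per match."""
    rows = []
    for x in payload.get("matches", {}).get("items", []):
        try:
            rows.append([
                user_id,
                x["participant"]["user"]["nickname"],
                x["season"],
                x["server"],
                x["queue_size"],
                x["mode"],
                x["participant"]["stats"]["combat"]["kda"]["kills"],
            ])
        except (KeyError, TypeError):
            continue  # incomplete match record, skip it
    return rows


if __name__ == "__main__":
    # Trimmed-down version of the sample object above.
    text = ('{"matches": {"items": [{"season": "2017-pre6", "server": "na", '
            '"queue_size": 2, "mode": "tpp", "participant": {"user": '
            '{"nickname": "LexWynnZzWw"}, "stats": {"combat": {"kda": {"kills": 4}}}}}]}}')
    print(rows_from_payload("59fe36049e49c400014a68eb", json.loads(text)))
```

Swap the last element of the row for any other field in `stats`, for example `damage` or `distance_traveled`, to collect something different.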

###### userIdList.txt
This file and the ones that follow might look messy on Windows machines. If you have a \*nix machine, you should see something like this in this file.
```
5a3befa88676120001104e8d
5a307bafc284c1000169e7db
59feb54368c1ea00019c056b
5a0c5e93f0eb7800013cd191
...
```
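
If you want to read this file back in your own script, a sketch like the following should work on any platform. It assumes that each ID is a 24-character lowercase hexadecimal string, as in the sample above, and that the messiness on Windows comes from line endings, which is my guess. `read_user_ids` is a hypothetical helper and not part of the project.

```python
import re

# Assumption: IDs look like the samples above, 24 lowercase hex characters.
ID_PATTERN = re.compile(r"^[0-9a-f]{24}$")


def read_user_ids(path):
    """Return the valid-looking IDs in the file, one per line, in file order."""
    ids = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            candidate = line.strip()  # drops \n, \r\n and stray spaces
            if ID_PATTERN.match(candidate):
                ids.append(candidate)
    return ids
```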

###### log.txt
Do you remember Player #unknown from earlier? You can always find something interesting in the log file. The last Finder message I show here happens to be #unknown. The overall log should look like this.
```
Master: START--START--START--START--START
Finder: Starting with User YechenDetoxic...
Finder: Collecting Friends of User YechenDetoxic...
Finder: Collecting Friends of User ashingboomORZ...
Finder: Collecting Friends of User Thinktomuch...
Finder: Collecting Friends of User miaomiao-3-...
Finder: Collecting Friends of User dujun211...
Finder: Collecting Friends of User SUSHAOLEI...
...
Finder: Translating User kanchao_ge...
Finder: Translating User QingFeng141...
Finder: Translating User Clearloveccp...
Finder: Translating User 980010...
Finder: Translating User with164...
Finder: Translating User E-RomanA...
Finder: Translating User #unknown...
...
Finder: DONE!!
Time Used: 3866 seconds
User Captured: 4941
Scraper: Scraper Starts
Scraper: Working on User 5a0c5e93f0eb7800013cd191
Scraper: Working on User 5a0bed2905279f00011d10f5
Scraper: Working on User 5a2e49d4e358310001185431
Scraper: Working on User 59fd962cab1fff00019e0759
Scraper: Working on User 59fdb0a699392b0001608809
Scraper: Working on User 59fd958031e4c1000157b475
Scraper: Working on User 59fe352cb503ad0001f16526
...
Scraper: DONE!!
Time Used: 15435 seconds
Reader: Reader Starts
Reader: DONE!!
Time Used: 0 seconds
Master: DONE--DONE--DONE--DONE--DONE
```

###### data.csv
You should see the collected data in this format:
```
Player ID                 Username  Season     Server  Queue_Size  Mode  Kills
59fd96dddfa2830001fb24aa  Kev666--  2018-01    sea     1           tpp   9
59fd96dddfa2830001fb24aa  Kev666--  2018-01    sea     1           tpp   0
59fd96dddfa2830001fb24aa  Kev666--  2018-01    sea     1           tpp   0
59fd96dddfa2830001fb24aa  Kev666--  2018-01    sea     1           tpp   0
59fd96dddfa2830001fb24aa  Kev666--  2017-pre6  sea     1           tpp   1
59fd96dddfa2830001fb24aa  Kev666--  2017-pre6  sea     1           tpp   3
59fd96dddfa2830001fb24aa  Kev666--  2017-pre5  sea     1           tpp   0
...
```

###### result.csv
You should see a summary like this:
```
kills  Frequency  Relative Frequency
0      158768     0.574224839
1      62972      0.227754249
2      27958      0.101117215
3      13143      0.047535001
5      3330       0.012043792
6      1776       0.006423356
4      6399       0.02314361
8      515        0.001862628
7      945        0.003417833
10     175        0.000632932
12     45         0.000162754
9      285        0.001030775
11     91         0.000329125
14     26         9.40E-05
26     1          3.62E-06
15     9          3.26E-05
13     35         0.000126586
16     4          1.45E-05
18     3          1.09E-05
17     7          2.53E-05
29     1          3.62E-06
20     1          3.62E-06
19     2          7.23E-06
...
```
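
A summary of this kind can be computed from the Kills column in a few lines. This is a minimal sketch and not the code in *reader.py*; `kills_summary` is a made-up name, and unlike the output above it sorts the rows by kill count.

```python
from collections import Counter


def kills_summary(kills):
    """Return (kills, frequency, relative frequency) tuples sorted by kills."""
    counts = Counter(kills)
    total = sum(counts.values())
    return [(k, n, n / total) for k, n in sorted(counts.items())]


if __name__ == "__main__":
    for k, n, rel in kills_summary([9, 0, 0, 0, 1, 3, 0]):
        print(k, n, round(rel, 9))
```
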
## Acknowledgement
I thank my co-workers here at Bullup Inc. for their generous help. I thank *op.gg* for not banning my IP because, as you can see, I did not set up a proxy. All credit goes to *op.gg*: I am using their backdoor APIs and database, and I feel obligated to say so.

## Footnote
I know I said a lot. Today is the last business day of year 2017, and I finished this project on PUBG, my favorite game so far. To give something back to the world, I decided to make this repo public and write nice documentation for it XD.

*On AWS, it took me 3866 seconds to run finder with two degrees of userMap, 15435 seconds to run scraper, 0 seconds to run reader.*
```html
</2017>
<2018>
```