
Dynamic Web Page Scraping Example: How to Scrape the Questions from a Quiz Page


Dynamic Page Scraping Example

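This post offers a few quick-and-dirty scraping tricks, aimed squarely at people whose programming skills are terrible (this blogger included).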

Finding the Key Pieces

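Keep two things in mind: the Request URL and the Request Headers (more precisely, the User-Agent inside the Request Headers). Finding them is simple: right-click in Chrome and choose Inspect, then click Network -> XHR and look for any file that the dynamic page fetches by sending a request, for example xxxx.php. If the list is empty when you open it, refresh the page and look again. You will then see something like the screenshot below.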

You instantly realize this is no simple matter, so you copy the whole thing down:

General:
Request URL: https://www.say-huahuo.com/qa.php
Request Method: GET
Status Code: 200
Remote Address: 104.24.98.123:443
Referrer Policy: no-referrer-when-downgrade

Response Headers:
alt-svc: h3-27=":443"; ma=86400, h3-25=":443"; ma=86400, h3-24=":443"; ma=86400, h3-23=":443"; ma=86400
cache-control: no-store, no-cache, must-revalidate
cf-cache-status: DYNAMIC
cf-ray: 59602ebbbd56aa4e-SIN
cf-request-id: 02cffb89520000aa4ec1b86200000001
content-encoding: br
content-type: text/html; charset=UTF-8
date: Tue, 19 May 2020 19:21:43 GMT
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
expires: Thu, 19 Nov 1981 08:52:00 GMT
pragma: no-cache
server: cloudflare
status: 200
strict-transport-security: max-age=15552000; includeSubDomains
x-powered-by: PHP/7.3.17

Request Headers:
:authority: www.say-huahuo.com
:method: GET
:path: /qa.php
:scheme: https
accept: application/json, text/plain, */*
accept-encoding: gzip, deflate, br
accept-language: zh,en;q=0.9,en-US;q=0.8,zh-TW;q=0.7
cookie: __cfduid=df20cba4727396e1ed49ee026cabfda4d1589903621; FKX9_aeeb_saltkey=AZCORcaa; FKX9_aeeb_lastvisit=1589900021; FKX9_aeeb_sid=A10BEi; FKX9_aeeb_visitedfid=55; FKX9_aeeb_viewid=tid_38897; FKX9_aeeb_st_p=0%7C1589903803%7C0e03c5e65115dd1fde5dc1e8067bfa6a; FKX9_aeeb_secqaa=11333.adc5aad3849465fb52; FKX9_aeeb_lastact=1589903820%09misc.php%09seccode; FKX9_aeeb_seccode=11334.5d33298c3891337235; PHPSESSID=d6srkdmo0h4avtdfgrv91fios3
referer: https://www.say-huahuo.com/answer/
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: same-origin
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36
x-requested-with: XMLHttpRequest
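
If you ever need more than the user-agent, the block you copied can be turned into a headers dictionary instead of being retyped by hand. The following is only a minimal sketch: `parse_devtools_headers` is a made-up helper name, not part of any library, and it assumes each line has the `name: value` shape shown above. Lines starting with `:` (the HTTP/2 pseudo-headers such as `:authority`) are skipped, since they are not ordinary headers you set yourself.

```python
def parse_devtools_headers(raw: str) -> dict:
    """Turn 'name: value' lines copied from DevTools into a headers dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ':authority' or ':path'
        if not line or line.startswith(':'):
            continue
        name, sep, value = line.partition(':')
        if not sep:
            continue  # not a 'name: value' line, e.g. a section title without a value
        headers[name.strip()] = value.strip()
    return headers


# A few lines taken from the request headers above
raw = """
:authority: www.say-huahuo.com
:method: GET
accept: application/json, text/plain, */*
referer: https://www.say-huahuo.com/answer/
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36
x-requested-with: XMLHttpRequest
"""

headers = parse_devtools_headers(raw)
print(sorted(headers))
```

The resulting dictionary can be passed as the `headers` argument of `requests.get`. Leave the cookie line out unless you really need it, since it contains your own session.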

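You notice that this thing fetches its file with a GET request, which is exactly what requests.get does. Simple, then: just Ctrl+C the URL and the user-agent over. Feeling like you have it all figured out, you hurry to write a few lines of code: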

import requests

url = 'https://www.say-huahuo.com/qa.php'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/81.0.4044.138 Safari/537.36'}

response = requests.get(url=url, headers=headers, verify=False, timeout=30)
print(response.text)

You run it and suddenly wonder: huh, what is going on with this text? Is it garbled?

[{"title":"\u7231\u8863\u9171\u5927\u80dc\u5229\u51fa\u81ea\u54ea\u90e8\u756a\uff1f","code":"6239xt","options":["\u6211\u7684\u59b9\u59b9\u4e0d\u53ef\u80fd\u600e\u4e48\u53ef\u7231","\u6211\u7684\u5973\u53cb\u548c\u9752\u6885\u7af9\u9a6c\u7684\u60e8\u70c8\u4fee\u7f57\u573a","\u5200\u5251\u795e\u57df","\u6211\u7684\u9752\u6625\u604b\u7231\u7269\u8bed\u679c\u7136\u6709\u95ee\u9898"],"img":""},{"title":"\u8725\u8734\u306e\u5c3b\u5c3e\u5207\u308a\u4e2d\u7537\u4e3b\u4e3a\u4ec0\u4e48\u75f4\u8ff7\u4e8e\u4eba\u4f53\u518d\u751f\uff1f","code":"zfiuu7","options":["\u5973\u670b\u53cb\u5f97\u764c\u75c7\u4e86","\u5973\u670b\u53cb\u5f53\u7740\u4ed6\u7684\u9762\u8df3\u8f68\u81ea\u6740\u4e86","\u5973\u670b\u53cb\u4e0d\u7231\u4ed6\u4e86","\u60f3\u8981\u6551\u4eba"],"img":""},{"title":"\u8bf7\u95ee\u8bba\u575b\u7684\u5b66\u56ed\u957f\u662f\uff1f","code":"wjdddi","options":["say\u82b1\u706b","Say\u706b\u82b1","say\u706b\u82b1","Say\u82b1\u706b"],"img":""},{"title":"\u88ab\u8a89\u4e3a\u4e3a\u795e\u4ed9\u6c34\u7684\u662f\uff1f","code":"s8zlu8","options":["Albion\u5065\u5eb7\u6c34","\u9edb\u73c2\u7d2b\u82cf\u6c34","SK-II","\u5d02\u5c71\u767d\u82b1\u86c7\u8349\u6c34 "],"img":""},{"title":"\u8001\u865a\u662f____\u3002","code":"2iwjmt","options":["\u7231\u7684\u6218\u58eb","\u6696\u5165\u5fc3\u7530","\u50ac\u4eba\u6cea\u4e0b","\u5251\u9053\u72c2\u4eba "],"img":""},{"title":"\u56fd\u5185\u73a9\u5bb6\u5bf9\u4e8e\u300a\u590f\u5a03\u5e74\u4ee3\u7eaa\u300b\u91cc\u5927\u516c\u4e3b\u7684\u4e89\u8bae\u4e3b\u8981\u539f\u56e0\u4e3a\uff1f","code":"9vulnl","options":["\u4e0d\u53ef\u653b\u7565","ntr","\u8d2b\u4e73","\u4e11\u964b"],"img":""},{"title":"\u4ee5\u4e0b\u90a3\u4f5c\u4e0d\u662fKEY\u793e\u4f5c\u54c1\uff1f","code":"u7bqwi","options":["\u300aKanon\u300b","\u300aCLANNAD\u300b","\u300aLittle Busters!\u300b","\u300aClover 
Day\u2019s\u300b"],"img":""},{"title":"\u63d0\u8d77\u6cd5\u56fd\u5267\u4f5c\u5bb6\u7f57\u65af\u4e39\u7684\u4ee3\u8868\u4f5c\u300a\u897f\u54c8\u8bfa\u300b\uff0c\u4f60\u4f1a\u60f3\u5230\u4ee5\u4e0b\u54ea\u90e8galgame\uff1f","code":"vbxk2k","options":["\u521d\u96ea\u6a31","\u6a31\u4e4b\u8bd7","\u7535\u6ce2\u6d88\u901d\u4e4b\u65e5","\u7f8e\u597d\u7684\u6bcf\u4e00\u5929"],"img":""},{"title":"\u52a8\u753b\u300a\u7edd\u56ed\u7684\u66b4\u98ce\u96e8\u300b\u91cc\uff0c\u4e0d\u7834\u7231\u82b1\u7ecf\u5e38\u7231\u5f15\u7528____\u7684\u53e5\u5b50\u3002","code":"q76pmn","options":["\u6728\u6876","\u9e45\u5988\u5988\u7ae5\u8c23","\u54c8\u59c6\u96f7\u7279","\u5b64\u5c9b\u4e4b\u9b3c"],"img":""},{"title":"\u4e0b\u9762\u54ea\u4e00\u90e8\u4e0d\u662f\u949f\u8868\u793e\u7684\u4f5c\u54c1\uff1f","code":"mv4s5j","options":["\u624b\u57a2\u5857\u308c\u306e\u5929\u4f7f","\u53cb\u7231","maggot\u00a0biats","euphoria"],"img":""},{"title":"\u7eb8\u4e0a\u9b54\u6cd5\u4f7f\u4e2d\u54ea\u4e00\u4e2a\u662f\uff1f","code":"vak6cu","options":["\u56db\u6761\u7409\u7483","\u4f0f\u89c1\u7406\u592e","\u6708\u675c\u5983","\u6e38\u884c\u5bfa\u591c\u5b50"],"img":""},{"title":"\u52a8\u753b\u300a\u5e72\u7269\u59b9\uff01\u5c0f\u57cb\u300b\u4e2d\uff0c\u5c0f\u57cb\u7684\u6e38\u620f\u540d\u79f0\u662f\u4ec0\u4e48\uff1f","code":"788l26","options":["JNB","UMR","UZR","PDD"],"img":""},{"title":"\u52a8\u753b\u6e05\u604b\u8d70\u4e86\u51e0\u6761\u5973\u4e3b\u7ebf\uff1f","code":"9hsi3x","options":["1","2","3","4"],"img":""},{"title":"key\u793e\u7684\u6625\u590f\u51ac\u79cb\u56db\u5b63\u5206\u522b\u662f\uff0cCLANNAD\u3001AIR\u3001Kanon\u548c____\u3002 ","code":"hialhr","options":["Memories 
Off","\u79cb\u8272\u604b\u534e","\u79cb\u8272\u4e4b\u7a7a","ONE\uff5e\u8f89\u4e4b\u5b63\u8282\uff5e"],"img":""},{"title":"\u300a\u9ed1\u6267\u4e8b\u300b\u4e2d\uff0c\u4f0a\u4e3d\u838e\u767d\u662f\uff1f","code":"ie9pev","options":["\u5730\u4e3b\u5bb6\u7684\u50bb\u5973\u513f","\u6218\u6597\u5973\u4ec6","\u516c\u4e3b","\u5251\u672f\u5929\u624d"],"img":""},{"title":"\u65e5\u672c\u540c\u4eba\u6f2b\u753b\u5bb6\u4e8c\u9636\u5802\u307f\u3064\u304d\u548c\u54ea\u4f4d\u6f2b\u753b\u5bb6\u5173\u7cfb\u6700\u8fd1\uff1f","code":"eot6cy","options":["\u5927\u5c9b\u6c38\u8fdc","\u5d69\u4e43\u6714","\u306a\u3082\u308a","\u30b5\u30d6\u30ed\u30a6\u30bf"],"img":""},{"title":"\u4ee5\u4e0b\u54ea\u4e2a\u89d2\u8272\u51fa\u81ea\u7ea6\u4f1a\u5927\u4f5c\u6218\uff1f","code":"uw0p7t","options":["\u91ce\u9e70\u4e09\u56db","\u5929\u8349\u56db\u90ce","\u65f6\u5d0e\u72c2\u4e09","\u52a0\u85e4\u5609\u4e00"],"img":""},{"title":"\u4ee5\u4e0b\u54ea\u4e2a\u4eba\u7269\u548c\u4f50\u4f2f\u514b\u54c9\u51fa\u73b0\u5728\u540c\u4e00\u6b3e\u6e38\u620f\u91cc\uff1f","code":"4cu136","options":["\u74e6\u5c14\u4f0a\u8fbe","\u5fa1\u5802\u5b5d\u5178","\u4f50\u4f2f\u864e\u6b21\u90ce","rin"],"img":""},{"title":"\u661f\u4e4b\u5361\u6bd4\u4e2d\uff0c\u5361\u6bd4\u541e\u4e0b\u5c0f\u602a\u4e4b\u540e\u4f1a\u53d1\u751f\u4ec0\u4e48\uff1f","code":"0m5dna","options":["\u53d8\u8eab","\u4ec0\u4e48\u90fd\u4e0d\u53d1\u751f","\u6fc0\u6012\u5c0f\u602a","\u4ee5\u4e0a\u90fd\u6709\u53ef\u80fd "],"img":""},{"title":"\u4ee5\u4e0b\u56db\u4e2a\u89d2\u8272\u54ea\u4e2a\u559c\u6b22\u5973\u6027\uff1f","code":"fbrksh","options":["\u4f50\u6761\u5229\u4eba","\u5ddd\u795e\u767e\u4ee3","\u79cb\u6fd1\u6216","\u4f50\u85e4\u5723"],"img":""}]

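No! This is actually JSON data, with the non-ASCII characters written as \uXXXX escapes. JSON is wonderfully civilized: no HTML parser is needed, you just load it into a dictionary (here, a list of dictionaries) and the text decodes itself. As for filling out the question bank: since the questions are served at random, looping the request enough times should eventually collect most of them. And with that, the brute-force scraper is done.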
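
To see the decoding in action without hitting the site, here is a small sketch that feeds one question from the response above into `json.loads`. The variable names are only for illustration. The raw string keeps the `\uXXXX` escapes exactly as the server sent them, so it is `json.loads` that does the decoding.

```python
import json

# One question copied from the response above; r'...' keeps the \uXXXX escapes untouched
raw = r'[{"title":"\u52a8\u753b\u6e05\u604b\u8d70\u4e86\u51e0\u6761\u5973\u4e3b\u7ebf\uff1f","code":"9hsi3x","options":["1","2","3","4"],"img":""}]'

questions = json.loads(raw)  # a list of dictionaries
first = questions[0]
print(first['title'])    # readable text, no escapes left
print(first['options'])

# ensure_ascii=False keeps the characters readable when turning the data back into JSON
line = json.dumps(first, ensure_ascii=False)
print(line)
```

With `requests`, calling `response.json()` on the response performs the same `json.loads` step for you.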

Python Code Example for the 花火学园 Quiz Page

# Target: https://www.say-huahuo.com/answer/#/exam/
# The response is JSON, so no HTML parser is needed. Just load it into Python objects.

import requests
import json

for n in range(0, 20):
    # The real URL that delivers the data
    url = 'https://www.say-huahuo.com/qa.php'

    # Send a browser User-Agent so the request is not refused
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                             'Chrome/81.0.4044.138 Safari/537.36'}

    # Use requests to fetch the content
    response = requests.get(url=url, headers=headers, verify=False, timeout=30)
    print(response.text)

    content = response.content

    # Convert the JSON payload into a list of dictionaries
    result = json.loads(content)
    print(content, result)

    # Pick out the fields we want and store them as a list of dictionaries
    HuaHuo_List = []
    mos = result

    for i in range(0, len(mos)):
        mo = {}
        mo['title'] = mos[i]['title']
        mo['options'] = mos[i]['options']
        HuaHuo_List.append(mo)

    file = open('HuaHuo.txt', 'a+', encoding='utf-8')

    for line in HuaHuo_List:
        file.write(str(line))

        file.write('\n')  # explicitly write a newline after each question

    file.close()
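
One thing the loop above does not handle is duplicates: because the questions are random, the same question ends up in HuaHuo.txt again and again. Below is a rough sketch of one way to deal with that. It assumes the `code` field in each question is a stable, unique identifier, which the site does not promise; if it turns out to change between requests, keying on `title` would be the fallback. `merge_questions` and the sample batches are made up for illustration and stand in for the parsed responses from qa.php.

```python
import json


def merge_questions(bank: dict, batch: list) -> int:
    """Add unseen questions from batch into bank (keyed by 'code'); return how many were new."""
    added = 0
    for q in batch:
        key = q['code']  # assumption: 'code' identifies a question uniquely
        if key not in bank:
            bank[key] = {'title': q['title'], 'options': q['options']}
            added += 1
    return added


# Two made-up batches standing in for two parsed responses
batch_1 = [
    {'title': 'Q1', 'code': 'aaa111', 'options': ['1', '2'], 'img': ''},
    {'title': 'Q2', 'code': 'bbb222', 'options': ['3', '4'], 'img': ''},
]
batch_2 = [
    {'title': 'Q2', 'code': 'bbb222', 'options': ['3', '4'], 'img': ''},
    {'title': 'Q3', 'code': 'ccc333', 'options': ['5', '6'], 'img': ''},
]

bank = {}
new_1 = merge_questions(bank, batch_1)
new_2 = merge_questions(bank, batch_2)
print(new_1, new_2, len(bank))

# One JSON object per line, ready to be written to a file in one go
lines = [json.dumps(q, ensure_ascii=False) for q in bank.values()]
print('\n'.join(lines))
```

In the real loop you would call `merge_questions(bank, result)` after each request and write the file once at the end. Watching the returned count drop to zero for several rounds in a row is a reasonable hint that the bank is close to complete.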