Analyzing Lagou Job Postings with Spark (Part 2): Getting the Data

Source: Internet  Author: anonymous  Date: 2018-02-10 22:09

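What data do we want?

The data we want is data that is public and easy to obtain. A complete data set would be ideal, but it is generally hard to get one easily through reasonably legitimate means. For the real-time job postings studied in this series, the last month of postings is more than enough.

How do we get the data?

A crawler would work too, as a fallback. However, I noticed that Lagou's own data is refreshed through ajax requests, which makes bulk retrieval simpler. There are many ways to fetch data by replaying ajax requests; here I demonstrate the one I consider relatively simple and general: using curl to simulate the ajax request.

Note: all the steps below are demonstrated in Google Chrome on a Mac; some operations may differ slightly in other browsers. The final step gives the extracted, general-purpose curl script. If you are not interested in the individual steps, you can simply use that script directly.

1. Find the target city and target position, then sort by newest ("最新排序"). Reference link: http://www.lagou.com/jobs/list_iOS?px=new&city=北京#order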

0-0.png

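2. Two-finger tap or right-click the page to open the context menu and choose "Inspect" ("检查") to open the browser's developer tools, then switch to the Network -> XHR tab.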

0_1.png

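3. Press Cmd + R to refresh the page; the XHR requests sent by the page are now captured. Find the request whose URL starts with http://www.lagou.com/jobs/positionAjax.json, two-finger tap or right-click it, and choose Copy as cURL.

This curl command is very long. For this analysis, the key part is the pn=1&kd=iOS at the end: pn is the page number and kd is the position keyword. Setting these two values dynamically lets you fetch more data for more positions; later sections of this article cover that separately.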

curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data 'first=true&pn=1&kd=iOS' --compressed

0_2.png

4. Paste the curl command from the previous step into a terminal and press Enter to see the output.

{"success":true,"requestId":null,"msg":null,"resubmitToken":null,"content":{"pageNo":1,"pageSize":15,"positionResult":{"totalCount":974,"resultSize":15,"locationInfo":{"city":"北京","district":null,"queryByGisCode":false,"businessZone":null,"locationCode":null},"queryAnalysisInfo":{"positionName":"ios","companyName":null,"usefulCompany":false,"industryName":null},"strategyProperty":{"name":"dm-csearch-newSimScorer","id":1},"result":[{"companyId":129801,"companyShortName":"言之有物科技","createTime":"2016-08-30 19:28:12","positionId":1857486,"positionAdvantage":"一线公司,技术驱动,免费三餐,超期望回报","salary":"25k-50k","score":0,"workYear":"不限","education":"本科","city":"北京","positionName":"iOS高级研发工程师/Lead","companyLogo":"i/image/M00/43/4E/CgqKkVeDGsuAXz0gAAA4XeGAAHQ390.png","financeStage":"成长型(A轮)","industryField":"移动互联网,电子商务","jobNature":"全职","approve":1,"companySize":"15-50人","district":null,"companyLabelList":["股票期权","扁平管理","美女多","领导好"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:28发布","gradeDescription":null,"companyFullName":"北京言之有物科技有限公司","businessZones":null,"imState":"today","lastLogin":1472556472000,"publisherId":5092848,"explain":null,"plus":null,"pcShow":0},{"companyId":133,"companyShortName":"猎豹移动","createTime":"2016-08-30 19:09:34","positionId":2151896,"positionAdvantage":"明星产品 超赞年终奖 靠谱领导","salary":"15k-30k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"image1/M00/39/70/CgYXBlWo3nqABJTsAADJ3hn5gmE062.jpg","financeStage":"上市公司","industryField":"移动互联网,信息安全","jobNature":"全职","approve":1,"companySize":"500-2000人","district":"朝阳区","companyLabelList":["带薪年假","美女前台","超赞年终奖","一公里工作圈"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:09发布","gradeDescription":null,"companyFullName":"北京金山网络科技有限公司","businessZones":["姚家园","十里堡","高碑店"],"imState":"today","lastLogin":1472555392000,"publisherId":129969,"explain":null,"plus":null,"pcShow":0},{"companyId":107608,"companyShortName":"MUM计算机","createTime":"2016-08-30 
19:03:24","positionId":1963945,"positionAdvantage":"帮助程序员赴美做IT,享受高薪高品质生活","salary":"10k-20k","score":0,"workYear":"不限","education":"本科","city":"北京","positionName":"IOS程序员赴美项目推广员","companyLogo":"i/image/M00/00/C2/CgqKkVZVHmSAWPtRAASUg0iUVuI932.jpg","financeStage":"初创型(不需要融资)","industryField":"教育","jobNature":"全职","approve":0,"companySize":"少于15人","district":"昌平区","companyLabelList":["赴美工作","美元薪水","告别996","技术前沿"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"19:03发布","gradeDescription":null,"companyFullName":"北京玛赫西计算机教育咨询有限公司","businessZones":null,"imState":"disabled","lastLogin":1472558059000,"publisherId":5179699,"explain":null,"plus":null,"pcShow":0},{"companyId":67576,"companyShortName":"车满满","createTime":"2016-08-30 18:47:30","positionId":2307877,"positionAdvantage":"期权","salary":"20k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS高级开发工程师","companyLogo":"i/image/M00/01/47/Cgp3O1ZmYACABBpPAAGzVR5S-Ps906.png","financeStage":"成长型(A轮)","industryField":"移动互联网","jobNature":"全职","approve":1,"companySize":"50-150人","district":"朝阳区","companyLabelList":["股票期权","技能培训","弹性工作","定期体检"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:47发布","gradeDescription":null,"companyFullName":"车满满(北京)信息技术有限公司","businessZones":["建外大街","CBD","国贸"],"imState":"today","lastLogin":1472566873000,"publisherId":2116322,"explain":null,"plus":null,"pcShow":0},{"companyId":1575,"companyShortName":"百度","createTime":"2016-08-30 18:30:05","positionId":2307765,"positionAdvantage":"BAT 
薪酬福利好","salary":"15k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS移动开发","companyLogo":"image1/M00/00/06/CgYXBlTUWAWAOBXrAABGHHFb0q8748.jpg","financeStage":"上市公司","industryField":"移动互联网,数据服务","jobNature":"全职","approve":1,"companySize":"2000人以上","district":null,"companyLabelList":["股票期权","弹性工作","五险一金","免费班车"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:30发布","gradeDescription":null,"companyFullName":"百度在线网络技术(北京)有限公司","businessZones":null,"imState":"disabled","lastLogin":1472553001000,"publisherId":5705515,"explain":null,"plus":null,"pcShow":0},{"companyId":13321,"companyShortName":"FunPlus 趣加游戏","createTime":"2016-08-30 18:26:28","positionId":2240276,"positionAdvantage":"国际一线团队,无限的成长空间,任你发挥","salary":"18k-36k","score":0,"workYear":"5-10年","education":"本科","city":"北京","positionName":"iOS 视频处理工程师/高级工程师","companyLogo":"image1/M00/00/1A/Cgo8PFTUWFWAKE5aAABwJ1mgAYw423.png","financeStage":"成长型(B轮)","industryField":"游戏","jobNature":"全职","approve":0,"companySize":"150-500人","district":"海淀区","companyLabelList":["绩效奖金","股票期权","专项奖金","五险一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:26发布","gradeDescription":null,"companyFullName":"北京趣加科技有限公司","businessZones":["中关村","知春路","双榆树"],"imState":"today","lastLogin":1472552889000,"publisherId":285309,"explain":null,"plus":null,"pcShow":0},{"companyId":15111,"companyShortName":"联拓天际","createTime":"2016-08-30 
18:22:12","positionId":2307696,"positionAdvantage":"与其在别处仰望,不如在这里并肩","salary":"15k-25k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"image1/M00/00/1D/Cgo8PFTUWGGAZQdjAADRNZVO9fc470.jpg","financeStage":"成熟型(不需要融资)","industryField":"电子商务","jobNature":"全职","approve":1,"companySize":"500-2000人","district":null,"companyLabelList":["五险一金","午餐补助","定期体检","技能培训"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:22发布","gradeDescription":null,"companyFullName":"北京联拓天际电子商务有限公司","businessZones":null,"imState":"today","lastLogin":1472552392000,"publisherId":1595082,"explain":null,"plus":null,"pcShow":0},{"companyId":119049,"companyShortName":"优久科技","createTime":"2016-08-30 18:15:29","positionId":1853231,"positionAdvantage":"良好的工作环境、成长平台和工作伙伴","salary":"10k-18k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS","companyLogo":"i/image/M00/16/74/CgqKkVbvnVuAeC-YAAA_YSPyb5A166.jpg","financeStage":"初创型(天使轮)","industryField":"移动互联网","jobNature":"全职","approve":0,"companySize":"少于15人","district":"海淀区","companyLabelList":["交通补助","通讯津贴","午餐补助"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:15发布","gradeDescription":null,"companyFullName":"北京优久科技有限责任公司","businessZones":["中关村","知春路","人民大学"],"imState":"today","lastLogin":1472552013000,"publisherId":4427723,"explain":null,"plus":null,"pcShow":0},{"companyId":41878,"companyShortName":"商询科技","createTime":"2016-08-30 
18:14:06","positionId":2278393,"positionAdvantage":"微软创业团队,工程师文化!","salary":"10k-15k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"iOS开发","companyLogo":"i/image/M00/24/22/Cgp3O1cZmpWAGslpAAA9MdgVNWU645.jpg","financeStage":"成长型(A轮)","industryField":"企业服务,数据服务","jobNature":"全职","approve":1,"companySize":"15-50人","district":"朝阳区","companyLabelList":["股票期权","人脉资源","办公环境好","国际化团队"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:14发布","gradeDescription":null,"companyFullName":"北京商询科技有限公司","businessZones":["姚家园"],"imState":"today","lastLogin":1472554153000,"publisherId":803257,"explain":null,"plus":null,"pcShow":0},{"companyId":5832,"companyShortName":"新浪微博","createTime":"2016-08-30 18:02:30","positionId":254885,"positionAdvantage":"亿级别DAU,微博重点项目组","salary":"20k-40k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"新浪微博iOS客户端研发工程师","companyLogo":"image1/M00/00/0D/CgYXBlTUWCCAdkhOAABNgyvZQag818.jpg","financeStage":"上市公司","industryField":"移动互联网","jobNature":"全职","approve":0,"companySize":"2000人以上","district":"海淀区","companyLabelList":["年底双薪","专项奖金","股票期权","五险一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:02发布","gradeDescription":null,"companyFullName":"微梦创科网络科技(中国)有限公司","businessZones":["西北旺","马连洼","上地"],"imState":"disabled","lastLogin":1472556144000,"publisherId":561302,"explain":null,"plus":null,"pcShow":0},{"companyId":48321,"companyShortName":"合广众","createTime":"2016-08-30 
18:00:40","positionId":2263615,"positionAdvantage":"老板nice","salary":"10k-20k","score":0,"workYear":"3-5年","education":"本科","city":"北京","positionName":"iOS开发工程师","companyLogo":"i/image/M00/01/D6/CgqKkVZ496GAYypzAAAKATKLXuY379.png","financeStage":"初创型(天使轮)","industryField":"移动互联网","jobNature":"全职","approve":0,"companySize":"50-150人","district":"海淀区","companyLabelList":["节日礼物","带薪年假","绩效奖金","岗位晋升"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"18:00发布","gradeDescription":null,"companyFullName":"北京合广众文化发展有限公司","businessZones":["八里庄","定慧寺","四季青"],"imState":"today","lastLogin":1472550077000,"publisherId":3608518,"explain":null,"plus":null,"pcShow":0},{"companyId":38239,"companyShortName":"Keep","createTime":"2016-08-30 17:52:25","positionId":2076872,"positionAdvantage":"福利健全、北京工作居住证、C轮","salary":"25k-35k","score":0,"workYear":"5-10年","education":"本科","city":"北京","positionName":"iOS开发工程师","companyLogo":"image1/M00/0A/40/CgYXBlTun9KASqKdAAAs36QVurU409.png","financeStage":"成熟型(C轮)","industryField":"社交网络,文化娱乐","jobNature":"全职","approve":1,"companySize":"150-500人","district":null,"companyLabelList":["节日礼物","年度旅游","定期体检","五险一金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:52发布","gradeDescription":null,"companyFullName":"北京卡路里科技有限公司","businessZones":null,"imState":"today","lastLogin":1472550738000,"publisherId":3425178,"explain":null,"plus":null,"pcShow":0},{"companyId":179,"companyShortName":"她理财","createTime":"2016-08-30 17:52:02","positionId":982402,"positionAdvantage":"五险一金 绩效奖金 年底15薪 
带薪年假","salary":"15k-25k","score":0,"workYear":"1-3年","education":"本科","city":"北京","positionName":"高级iOS开发工程师","companyLogo":"image1/M00/0C/F2/CgYXBlT2mG2AOPevAAB_09mD2Ko247.png","financeStage":"成长型(A轮)","industryField":"电子商务,金融","jobNature":"全职","approve":1,"companySize":"50-150人","district":"朝阳区","companyLabelList":["年底双薪","节日礼物","技能培训","绩效奖金"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:52发布","gradeDescription":null,"companyFullName":"北京新工场投资顾问有限公司","businessZones":["大望路","华贸","百子湾"],"imState":"today","lastLogin":1472557005000,"publisherId":97147,"explain":null,"plus":null,"pcShow":0},{"companyId":11053,"companyShortName":"中科三方","createTime":"2016-08-30 17:33:13","positionId":2307276,"positionAdvantage":"留用机会,户口指标","salary":"2k-4k","score":0,"workYear":"应届毕业生","education":"本科","city":"北京","positionName":"iOS实习生","companyLogo":"image1/M00/00/16/CgYXBlTUWEWAXnWbAACvz96W4qA927.jpg","financeStage":"成长型(不需要融资)","industryField":"移动互联网","jobNature":"实习","approve":0,"companySize":"150-500人","district":"海淀区","companyLabelList":null,"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:33发布","gradeDescription":null,"companyFullName":"北京中科三方网络技术有限公司","businessZones":["中关村","知春路","双榆树"],"imState":"today","lastLogin":1472549621000,"publisherId":141237,"explain":null,"plus":null,"pcShow":0},{"companyId":116183,"companyShortName":"情非得已","createTime":"2016-08-30 
17:28:11","positionId":1786957,"positionAdvantage":"五险一金、无限小吃、Mac办公、定期体检","salary":"8k-15k","score":0,"workYear":"1-3年","education":"不限","city":"北京","positionName":"android&iOS测试工程师","companyLogo":"i/image/M00/1C/58/CgqKkVcB1QyAJM2-AAA4t6tVzs8439.jpg","financeStage":"初创型(天使轮)","industryField":"移动互联网,企业服务","jobNature":"全职","approve":0,"companySize":"15-50人","district":"朝阳区","companyLabelList":["定期体检","年度旅游","领导好","扁平管理"],"adWord":0,"appShow":0,"deliver":0,"formatCreateTime":"17:28发布","gradeDescription":null,"companyFullName":"情非得已(北京)科技有限公司","businessZones":["建外大街","国贸","CBD"],"imState":"today","lastLogin":1472553855000,"publisherId":4170237,"explain":null,"plus":null,"pcShow":0}]}},"code":0}

As you can see, the output corresponds exactly to the data actually shown on the first page of the site.
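Since only pn and kd change between requests, the copied command can be wrapped so that just those two values are passed in. The following is a minimal sketch, not the author's script: build_form_data and fetch_page are hypothetical helper names, LAGOU_COOKIE is an assumed environment variable holding the Cookie value copied from your own browser session, and the reduced set of headers is a guess. If the server rejects the request, fall back to the full command copied from Chrome.

```shell
# Build the POST body; only pn (page number) and kd (keyword) vary.
# $1 = keyword (kd), $2 = page number (pn)
build_form_data() {
  printf 'first=true&pn=%s&kd=%s\n' "$2" "$1"
}

# Hypothetical wrapper around the copied command (not called in this demo).
# LAGOU_COOKIE is an assumed variable; the header subset is an assumption.
fetch_page() {
  curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' \
    -H "Cookie: $LAGOU_COOKIE" \
    -H 'X-Requested-With: XMLHttpRequest' \
    -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' \
    --data "$(build_form_data "$1" "$2")" \
    --compressed
}

# Show the body that would be sent for page 1 of the iOS listing.
build_form_data iOS 1   # prints: first=true&pn=1&kd=iOS
```

Calling fetch_page iOS 1 would then issue the request for the first page of iOS postings.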

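How do we save the data to a file?

Saving curl's result directly to a file makes further processing easier. The way to do that is the redirection operator >. The following command writes curl's result to the file 1.json instead of printing it to the console: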

curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data 'first=true&pn=1&kd=iOS' --compressed > 1.json
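Before moving on, it can be worth checking that the saved file really holds a successful response rather than an error page. The sketch below assumes the compact JSON shape shown in the sample output above (a "success":true flag and one "positionId" key per posting); is_success and count_positions are hypothetical helper names, and the demo runs on a tiny hand-written stand-in file rather than real Lagou data.

```shell
# Succeeds only if the saved response reports success.
is_success() {
  grep -q '"success":true' "$1"
}

# Count the postings in a saved response by counting "positionId" keys.
count_positions() {
  grep -o '"positionId":' "$1" | wc -l | tr -d ' '
}

# Demo on a stand-in file (hand-written, not real data).
sample=$(mktemp)
printf '%s\n' '{"success":true,"content":{"positionResult":{"result":[{"positionId":1},{"positionId":2}]}},"code":0}' > "$sample"
if is_success "$sample"; then
  echo "postings: $(count_positions "$sample")"   # prints: postings: 2
fi
rm -f "$sample"
```

On a real download you would run the two helpers against 1.json instead of the temporary file.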

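How do we get data for other positions?

This needs slightly deeper shell syntax. In short, a for ... in loop iterates over a given list of positions, changes the value of the kd field at the end of the curl command, and writes each result to a file named after the position. Note that the single quotes around the argument after --data must be changed to double quotes; otherwise the variable is not expanded. The complete code follows; add positions to the list as needed: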

for kd in "Java" "PHP" "C" "C++" "Android" "iOS"
do 
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=1&kd=$kd" --compressed > $kd.json 
done  
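One detail to watch in the loop above is the keyword "C++". In a body sent as application/x-www-form-urlencoded, a literal + is normally decoded as a space, so kd=C++ would typically reach the server as "C" followed by two spaces. The sketch below percent-encodes a keyword first; urlencode is a hypothetical helper and handles ASCII input only. With curl itself, the existing option --data-urlencode "kd=$kd" achieves the same for the request body.

```shell
# Percent-encode an ASCII string: unreserved characters pass through,
# everything else becomes %XX. Runs in a subshell so variables stay local.
urlencode() (
  s=$1
  out=
  while [ -n "$s" ]; do
    rest=${s#?}            # string without its first character
    c=${s%"$rest"}         # the first character itself
    case $c in
      [a-zA-Z0-9.~_-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

for kd in "C" "C++" "iOS"; do
  printf '%s -> kd=%s\n' "$kd" "$(urlencode "$kd")"
done
# prints:
# C -> kd=C
# C++ -> kd=C%2B%2B
# iOS -> kd=iOS
```

The encoded value is only meant for the request body; the output file can still be named after the original keyword.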

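How do we fetch in bulk?

The curl script currently fetches a single page per run. To fetch multiple pages, just add another for loop. From observation, Lagou serves at most roughly 100 pages of valid data, so the script loops from 1 to 100 and saves each page as <kd>_<pn>.json. In the script this is written $kd\_$pn.json: the backslash stops the shell from reading $kd_ as a variable named kd_.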

for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do 
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=$pn&kd=$kd" --compressed > $kd\_$pn.json
done  
done
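The fixed upper bound of 100 comes from observation. As an alternative sketch, the number of pages for one keyword can be estimated from the totalCount and pageSize fields that appear in the sample response earlier (974 and 15 there). json_int and pages_needed are hypothetical helpers that assume compact JSON with those keys; whether the site actually serves every page up to that count is not something the article confirms.

```shell
# Print the first integer value of a JSON key in a compact response.
# $1 = key name, $2 = file
json_int() {
  grep -o "\"$1\":[0-9]*" "$2" | head -n 1 | cut -d: -f2
}

# Estimate how many pages a keyword has: ceil(totalCount / pageSize).
pages_needed() {
  total=$(json_int totalCount "$1")
  size=$(json_int pageSize "$1")
  # Give up if either field is missing or pageSize is zero.
  [ -n "$total" ] && [ -n "$size" ] && [ "$size" -gt 0 ] || return 1
  echo $(( (total + size - 1) / size ))
}

# Demo on a stand-in file using the counts from the sample output.
sample=$(mktemp)
printf '%s\n' '{"content":{"pageNo":1,"pageSize":15,"positionResult":{"totalCount":974}}}' > "$sample"
pages_needed "$sample"   # prints: 65
rm -f "$sample"
```

Run against the first saved page of a keyword, the result could replace the constant 100 in the outer loop for that keyword.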

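How do we speed up the download?

If you ran the script above, you will have seen that it is rather slow, because the curl requests run synchronously: one download must finish before the next line executes. Appending & runs each request in the background, so several are fetched at the same time. One more point to note: a single machine can only open a limited number of curl connections at once, so a very short sleep is added to avoid unnecessary interruptions. The improved code is shown below.

Note: this code may get your IP blocked by Lagou. Use it with caution unless you are really pressed for time; of course, you can also rotate through several IPs.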

for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do 
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=$pn&kd=$kd" --compressed > $kd\_$pn.json &
sleep 0.02
done  
done
wait  # block until every background download has finished

Note: if the script hangs, press Ctrl + C to exit. If it keeps getting interrupted abnormally, try increasing the value after sleep.
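Instead of tuning the sleep value, the number of simultaneous requests can be capped directly by starting jobs in batches and calling wait between batches. This is a sketch of that idea under stated assumptions, not the author's approach: fetch_one is a stand-in that only sleeps and writes a marker file so the example runs offline, fetch_all is a hypothetical name, and the batch size of 4 is arbitrary.

```shell
# Stand-in for the real curl command so the sketch runs offline.
# $1 = keyword, $2 = page number, $3 = output directory
fetch_one() {
  sleep 0.1
  printf 'kd=%s pn=%s\n' "$1" "$2" > "$3/${1}_${2}.json"
}

# Start jobs in batches of at most $2 and wait for each batch to finish.
# $1 = output directory, $2 = batch size
fetch_all() {
  outdir=$1
  max_jobs=$2
  running=0
  pn=1
  while [ "$pn" -le 3 ]; do
    for kd in "Java" "PHP" "iOS"; do
      fetch_one "$kd" "$pn" "$outdir" &
      running=$((running + 1))
      if [ "$running" -ge "$max_jobs" ]; then
        wait            # block until the whole batch has finished
        running=0
      fi
    done
    pn=$((pn + 1))
  done
  wait                  # collect the final, partially filled batch
}

demo_dir=$(mktemp -d)
fetch_all "$demo_dir" 4
echo "files written: $(ls "$demo_dir" | wc -l | tr -d ' ')"   # prints: files written: 9
rm -rf "$demo_dir"
```

To use it for real downloads, replace the body of fetch_one with the curl command and widen the page range and keyword list.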

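A more complete script

Here the data goes into a separate jobs directory to keep the directory structure organized (the complete data set can be downloaded from the GitHub project for this series):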

mkdir -p jobs
for (( pn=1; pn<=100; pn=pn+1 )); do
for kd in "Java" "PHP" "C" "C++" "Android" "iOS"; do 
curl 'http://www.lagou.com/jobs/positionAjax.json?px=new&city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false' -H 'Cookie: user_trace_token=20160522122749-8a4d6717-1fd5-11e6-963e-5254005c3644; LGUID=20160522122749-8a4d6cb3-1fd5-11e6-963e-5254005c3644; tencentSig=8513357824; LGMOID=20160818212815-33C56329AA2FB6D809D557FD6CC1DE3C; JSESSIONID=E0C5692414B160F3BEFFCFF8D240B693; _gat=1; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=http%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_iOS%3Fpx%3Dnew%26city%3D%25E5%258C%2597%25E4%25BA%25AC; ctk=1472587366; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1470032014,1470032429,1471526897,1471752875; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1472587368; LGSID=20160831035434-92fffed3-6eeb-11e6-8c02-5254005c3644; LGRID=20160831040247-b90762b6-6eec-11e6-a745-525400f775ce; _ga=GA1.2.1654849521.1463891270; SEARCH_ID=506c8097626343ab9b5a2c1139807a2c' -H 'Origin: http://www.lagou.com' -H 'X-Anit-Forge-Code: 0' -H 'Accept-Encoding: gzip, deflate' -H 'Accept-Language: zh-CN,zh;q=0.8,en;q=0.6' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'Accept: application/json, text/javascript, */*; q=0.01' -H 'Cache-Control: max-age=0' -H 'X-Requested-With: XMLHttpRequest' -H 'Connection: keep-alive' -H 'X-Anit-Forge-Token: None' -H 'Referer: http://www.lagou.com/jobs/list_iOS?px=new&city=%E5%8C%97%E4%BA%AC' --data "first=true&pn=$pn&kd=$kd" --compressed > jobs/$kd\_$pn.json &
sleep 0.02
done  
done
wait  # block until every background download has finished

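You may also notice that some positions do not have 100 pages of valid data. Do those files need extra handling? Of course not. A basic capability of big-data tools such as Spark is a reasonable tolerance for imperfect data sets; a few abnormal records generally do not prevent the data from being imported. After importing, you can analyze the data directly. That is a topic for later, and subsequent articles in this series cover it separately.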
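No cleanup is required, as noted above. If you are nevertheless curious which downloaded pages came back without any postings, a small check like the following can list them. It is only a sketch: list_empty_pages is a hypothetical helper, it assumes every posting carries a "positionId" key as in the sample output, and the demo uses hand-written stand-in files.

```shell
# Print every .json file in a directory that contains no "positionId" key.
# $1 = directory to scan
list_empty_pages() {
  for f in "$1"/*.json; do
    [ -e "$f" ] || continue          # directory has no .json files at all
    grep -q '"positionId":' "$f" || printf '%s\n' "$f"
  done
}

# Demo with two stand-in files: one with a posting, one without.
demo_dir=$(mktemp -d)
printf '%s\n' '{"success":true,"result":[{"positionId":1}]}' > "$demo_dir/iOS_1.json"
printf '%s\n' '{"success":true,"result":[]}' > "$demo_dir/iOS_99.json"
list_empty_pages "$demo_dir"         # prints the path ending in iOS_99.json
rm -rf "$demo_dir"
```

For the real data set, the call would be list_empty_pages jobs.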


GitHub repository for this series: https://github.com/ios122/spark_lagou
