As the title says.
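1. To crawl one or two specific websites you do not necessarily have to write a crawler in Python. In most cases a single wget command crawls most sites well enough. If you really do reach the point of writing your own crawler, the problems you end up with are simply how to scale it up and how to build a distributed crawler. Something like scrapy is worth close to nothing here: fetch with async I/O or multiple threads, pick a mature disk-backed queue library (kafka and the like), and what has scrapy actually helped with?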
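2. There are many HTTP libraries, and there is also the option of monkey-patching with gevent and working with coroutines. At a scale of tens of millions of pages, urllib3 works well.
3. Dealing with site specifics such as login and ajax is just manual grunt work, so I won't go into it.
4. Speed matters. Run on EC2 or on a domestic cloud. A key metric is how much it costs you per 100 million pages crawled: with, say, one 4-core virtual machine node, can you saturate 100 Mbps of inbound bandwidth?
5. beautifulsoup is too slow. For a whole-web crawl, encoding detection also has to be fast; a C implementation of chardet is acceptable.
The most important part is always extracting, analyzing and using the information once it has been crawled, but that is a separate topic.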
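The cost question in point 4 comes down to simple arithmetic. Here is a rough sketch, assuming an average page size of 100 KB and a node price of 5 currency units per day; both numbers are made up for illustration and are not taken from the answer.

```python
SECONDS_PER_DAY = 86_400


def pages_per_day(bandwidth_mbps: float, avg_page_kb: float,
                  utilization: float = 1.0) -> float:
    """Upper bound on pages fetched per day when bandwidth is the bottleneck."""
    bytes_per_second = bandwidth_mbps * 1_000_000 / 8 * utilization
    return bytes_per_second / (avg_page_kb * 1_000) * SECONDS_PER_DAY


def cost_per_100m_pages(node_cost_per_day: float, bandwidth_mbps: float,
                        avg_page_kb: float, utilization: float = 1.0) -> float:
    """Node cost of crawling 100 million pages at the given throughput."""
    days_needed = 100_000_000 / pages_per_day(bandwidth_mbps, avg_page_kb,
                                              utilization)
    return days_needed * node_cost_per_day


if __name__ == "__main__":
    # Hypothetical inputs: 100 Mbps fully used, 100 KB pages, 5 per node-day.
    print(pages_per_day(100, 100))
    print(round(cost_per_100m_pages(5.0, 100, 100), 2))
```

Under these assumptions a saturated 100 Mbps link moves 125 pages per second, about 10.8 million pages a day. At half utilization the same crawl takes twice as long and costs twice as much, which is why filling the link matters.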
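1. Learn to use the Chrome browser to inspect network traffic and element structure.
2. Add a User-Agent header; this gets you past the simplest anti-crawler check.
3. Write crawlers in IPython if you can. In an interactive environment you can see at any moment exactly where your problem is.
4. Use requests.
5. After downloading the HTML with get or post, confirm that what you need is actually in that HTML rather than loaded afterwards by ajax or javascript.
6. For parsing, BeautifulSoup is good. For the few very special cases, consider re.
7. If you need to collect large amounts of data, learn a framework such as scrapy.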
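Points 2, 4, 5 and 6 can be sketched together as follows. The User-Agent string and the helper names are only examples, and the title extraction uses re because a single tag is the kind of small special case point 6 mentions; for general parsing BeautifulSoup would be the usual choice.

```python
import re
from typing import Optional

# A browser-like User-Agent string; the exact value is only an example.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}


def fetch_html(url: str) -> str:
    """Download a page with requests, sending a User-Agent header."""
    # Third-party import kept local so the helpers below work without it.
    import requests

    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.text


def is_in_static_html(html: str, marker: str) -> bool:
    """True if the text you are after is already in the downloaded HTML.

    If this returns False for text you can see in the browser, the content
    is probably loaded later by ajax or javascript.
    """
    return marker in html


def extract_title(html: str) -> Optional[str]:
    """Pull the <title> text out with a regular expression."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None
```

In IPython you would call fetch_html once, keep the result in a variable, and then try is_in_static_html and extract_title on it interactively without hitting the site again.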
Advanced:
If the site requires a simulated login, makes heavy use of ajax or javascript, or has aggressive anti-crawler measures, use a requests session, and press F12 to see exactly what data is being sent.
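A minimal sketch of a session-based login. It assumes the login form carries hidden fields (a CSRF token, for instance) that must be sent back, and that the credential fields are named "username" and "password"; both are assumptions, so check the real request in the F12 network tab.

```python
from html.parser import HTMLParser
from typing import Dict


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of <input type="hidden"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: Dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_inputs(html: str) -> Dict[str, str]:
    """Return the hidden form fields found in a login page."""
    parser = HiddenInputParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def login(login_url: str, username: str, password: str):
    """Log in and return a requests.Session that keeps the cookies."""
    # Third-party import kept local so hidden_inputs works without it.
    import requests

    session = requests.Session()
    page = session.get(login_url, timeout=10)
    payload = hidden_inputs(page.text)
    # Hypothetical field names; use the ones the real form sends.
    payload.update({"username": username, "password": password})
    response = session.post(login_url, data=payload, timeout=10)
    response.raise_for_status()
    return session
```

Later requests made through the returned session reuse the cookies set during login, which is the main reason to use a session rather than bare get and post calls.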
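If you really cannot work it out, use browser automation; selenium is recommended. It is slower and uses more memory, but it saves a lot of effort and is mostly hard to detect.
Finally, do not crawl too fast: add time.sleep(1) and use multithreading as little as possible. Running a site is not easy for other people either (small sites especially). If you do not cause them much trouble they will turn a blind eye; otherwise, getting your IP banned is no fun.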
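The throttling advice can be sketched as a small single-threaded loop. The function name and the injected fetch and sleep parameters are illustrative choices rather than part of any library; passing them in keeps the loop easy to try out without touching a real site.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def crawl_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   delay_seconds: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep
                   ) -> Iterator[Tuple[str, str]]:
    """Fetch URLs one at a time, pausing between consecutive requests."""
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        yield url, fetch(url)


if __name__ == "__main__":
    # Stand-in fetch function so the example runs without network access.
    for url, body in crawl_politely(["page-1", "page-2"], fetch=str.upper,
                                    delay_seconds=0.1):
        print(url, body)
```

With a real fetch function the default delay_seconds=1.0 gives the same effect as calling time.sleep(1) between requests.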
Some pages like to use redirects, and the get and post methods in requests follow redirects by default! You may well end up following the redirect with the wrong cookies and headers, so be sure to set the allow_redirects parameter to False.
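One way to apply this is to turn automatic redirects off and walk the chain yourself, so you can inspect each hop and decide which cookies and headers to send. This is a sketch under the assumption that the site uses ordinary 3xx responses with a Location header; the function names are illustrative.

```python
from typing import Mapping, Optional
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(current_url: str, status_code: int,
                    headers: Mapping[str, str]) -> Optional[str]:
    """Return the absolute redirect URL, or None if this is not a redirect."""
    if status_code not in REDIRECT_CODES:
        return None
    location = headers.get("Location")
    # Location may be relative, so resolve it against the current URL.
    return urljoin(current_url, location) if location else None


def get_step_by_step(url: str, headers: Mapping[str, str], max_hops: int = 5):
    """GET a URL with allow_redirects=False, following each hop explicitly."""
    # Third-party import kept local so redirect_target works without it.
    import requests

    session = requests.Session()
    for _ in range(max_hops):
        response = session.get(url, headers=headers,
                               allow_redirects=False, timeout=10)
        target = redirect_target(response.url, response.status_code,
                                 response.headers)
        if target is None:
            return response
        # Inspect or adjust cookies and headers here before the next hop.
        print(f"{response.status_code} -> {target}")
        url = target
    raise RuntimeError(f"more than {max_hops} redirects")
```

Because every hop goes through the same session, cookies set by an intermediate response are carried to the next request, and you get a chance to look at them in between.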