Python static HTML scraping notes (Part 1)

Web scraping, Python

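With Python, web scraping is often the first thing many beginners try.

These are some notes I took while learning.

Python already has mature scraping frameworks, but to learn properly I still wanted to work through the basics by hand. That way, when I use a framework later, I can use it comfortably and also understand how it works underneath. Enough preamble.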

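Scraping the data you want from the web breaks down into a few steps:

1: Find the target site to scrape, or in other words the first entry page.

2: Work out how the target site loads its data. There are two cases: the data is loaded asynchronously by JS, or the backend renders it directly into the page, so it is already present in the HTML the server returns.

3: Parse that data with Python to get the result you want.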
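A quick way to tell the two cases in step 2 apart is to check whether some text you can see in the browser is already in the raw HTML the server returns. The sketch below is only an illustration: the helper name `is_server_rendered` and both HTML snippets are made up for the example. On a real site the HTML would come from something like `requests.get(url).text`.

```python
def is_server_rendered(html, visible_text):
    """Return True if text seen in the browser is already in the raw HTML."""
    return visible_text in html


# Hypothetical page rendered by the backend: the title is in the HTML itself.
static_html = '<html><body><h1 class="content_title">Two Tigers</h1></body></html>'

# Hypothetical page filled in by JS: the HTML only holds an empty container.
dynamic_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

print(is_server_rendered(static_html, "Two Tigers"))   # True
print(is_server_rendered(dynamic_html, "Two Tigers"))  # False
```

If the check fails for text that is clearly visible in the browser, the data is probably loaded by JS and needs a different approach from the one in this post.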

We start with scraping data that the backend renders directly.

1 Tools needed

from bs4 import BeautifulSoup  # parses the HTML structure of a page

import requests  # sends HTTP requests to fetch the page source

import re  # regular-expression matching
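To show roughly how these tools divide the work, here is a small sketch that runs on an inline HTML string instead of a live page. The HTML, the `example.com` address, and the variable names are all invented for illustration. In a real run the string would be the `text` of a `requests.get(url)` response.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML standing in for what requests.get(url).text would return.
html = r"""
<h1 class="content_title">Two Tigers</h1>
<ul id="s1">
  <li><a href="/erge/1.html"><img src="/img/1.jpg"></a></li>
  <li><a href="/erge/2.html"><img src="/img/2.jpg"></a></li>
</ul>
<script>var cfg = {"bk":"http:\/\/example.com\/v\/1.mp4","w":640};</script>
"""

soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup: look elements up by tag name, class, or id.
title = soup.find("h1", class_="content_title").string
links = [a["href"] for a in soup.find("ul", id="s1").find_all("a")]

# re: pull the escaped video URL out of the inline script, then unescape "\/".
match = re.search(r'http.*?\.mp4', html)
video_url = match.group().replace("\\/", "/") if match else None

print(title)      # Two Tigers
print(links)      # ['/erge/1.html', '/erge/2.html']
print(video_url)  # http://example.com/v/1.mp4
```

BeautifulSoup handles anything that is part of the tag structure, while the regular expression is used for data buried inside a script, which the HTML parser treats as plain text.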

Below is the code example. It scrapes the download URLs from a children's songs site.

from bs4 import BeautifulSoup
import requests
import re
import MyThread  # a thread wrapper I wrote myself; you don't have to use it
import threading

threadLock = threading.Lock()
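def getMp4(text=''):
    # Return the first http...mp4 URL found in text, or None if there is none.
    # The match is non-greedy so it stops at the first ".mp4".
    match = re.search(r'http.*?\.mp4', text, re.I)
    return match.group() if match else None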

def getTieleVideo(url='', img=''):
    # Fetch a detail page and return [title, image URL, video URL or None]
    res = requests.get(url)

    # The video address sits in inline JS after the "bk" key
    js_string = re.search(r'\"bk\":\"(.*)\}', res.text, re.I | re.M)
    soup = BeautifulSoup(res.text, "html.parser")

    video_url_string = None
    if js_string:
        video_url = re.search(r'http(.*)mp4', js_string.group().split(',')[0])
        if video_url:
            # Strip the backslashes that escape "/" inside the JS string
            video_url_string = re.sub(r'\\', '', video_url.group())
    title = soup.find('h1', class_='content_title')
    # Guard against pages that lack the expected <h1>
    return [title.string if title else None, img, video_url_string]

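def getLink(url=''):
    # Fetch one list page and print [title, image URL, video URL] per entry
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
    ul = soup.find('ul', id='s1')
    if ul:
        # Direct child tags only; this skips the whitespace text nodes between them
        for item in ul.find_all(True, recursive=False):
            a = item.find('a')
            img = item.find('img')
            if a is not None and img is not None:
                titleVideo_list = getTieleVideo('http://www.4399er.com' + a['href'], img['src'])
                if titleVideo_list[2]:
                    # Hold the lock so output from different threads does not interleave
                    with threadLock:
                        print(titleVideo_list)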

res = requests.get("http://www.4399er.com/erge/egty/")
soup = BeautifulSoup(res.text, 'html.parser')
zong_div = soup.find('div', class_='mod_pg')
zong_sapn = zong_div.find('span', class_='zong')
# The pagination text reads "共N页" (N pages in total)
zong_re = re.match(r'共(\d+)页', zong_sapn.string)
zong_num = 0
if zong_re:
    # zong_num = int(zong_re.group(1))  # crawl every page
    zong_num = 3  # capped at 3 pages here; use the line above to crawl them all


threadList = []
for i in range(1, zong_num + 1):
    if i == 1:
        url = 'http://www.4399er.com/erge/egty/'
    else:
        url = 'http://www.4399er.com/erge/egty/list-243-' + str(i) + '.html'
    # Start one thread per list page, page 1 included
    t = MyThread.MyThread(getLink, url)
    t.start()
    threadList.append(t)

# Wait for every thread to finish
for a in threadList:
    a.join()
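`MyThread` in the listing above is my own module and is not shown here. As a rough equivalent, the sketch below uses only the standard library's `threading.Thread`. The names `build_page_url`, `fake_get_link`, `results`, and `results_lock` are hypothetical, and `fake_get_link` merely stands in for `getLink` so the example runs without network access.

```python
import threading

BASE = 'http://www.4399er.com/erge/egty/'


def build_page_url(page):
    """Page 1 is the section index; later pages use the list-243-N.html pattern."""
    if page == 1:
        return BASE
    return BASE + 'list-243-' + str(page) + '.html'


results = []
results_lock = threading.Lock()


def fake_get_link(url):
    # Stand-in for getLink: record the URL instead of downloading it.
    with results_lock:  # only one thread appends at a time
        results.append(url)


# One thread per list page, pages 1 to 3.
threads = [threading.Thread(target=fake_get_link, args=(build_page_url(p),))
           for p in range(1, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every worker before reading results

print(sorted(results))
```

To use it for real, pass `getLink` as the `target` instead of `fake_get_link`. Note that `threading.Thread.join()` always returns `None`, so there is nothing useful to print from it.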

requests documentation http://docs.python-requests.org/zh_CN/latest/user/install.html

BeautifulSoup documentation https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html

re usage guide http://www.runoob.com/python3/python3-reg-expressions.html

Python scraping notes: scraping dynamic JS pages (Part 2) http://suiyidian.cn/post-167.html

This article was written by kevin and is licensed under the Creative Commons Attribution 4.0 International License.