A Practical Python Crawler

About Beautiful Soup: it is a toolkit that parses a document and gives you access to the data you want to scrape.

Beautiful Soup automatically converts input documents to Unicode and output documents to UTF-8.
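
As a small illustration of that encoding behaviour, the sketch below feeds GBK-encoded bytes to Beautiful Soup and reads the result back. It assumes Python 3 with the `bs4` package installed. The sample markup and the explicit `from_encoding='gbk'` hint are made up for the demo and are not part of the crawler in this post.

```python
from bs4 import BeautifulSoup

# Sample input: GBK bytes, as a legacy Chinese site might serve them (assumed for the demo).
raw = '<p>中文</p>'.encode('gbk')

# from_encoding tells Beautiful Soup how to decode the bytes;
# without it the library has to guess the encoding.
soup = BeautifulSoup(raw, 'html.parser', from_encoding='gbk')

text = soup.p.string      # Unicode text (a str subclass in Python 3)
encoded = soup.encode()   # bytes, UTF-8 unless another encoding is requested

print(repr(text))
print(encoded.decode('utf-8'))
```

Passing the encoding explicitly is only one option. When the page declares its charset or is already UTF-8, the hint can usually be left out.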

Installing Beautiful Soup

Install pip (if needed): sudo easy_install pip

Install Beautiful Soup: sudo pip install beautifulsoup4

1. Open Chrome DevTools and locate the <div> (the parent container) that wraps the items you want to scrape.

Imports:

import sys

import json

import urllib2 as HttpUtils

import urllib as UrlUtils

from bs4 import BeautifulSoup 

Fetching a page (with pagination):

def gethtml(page):

    'Fetch the HTML for the given page number'

    url = 'https://box.xxx.com/Project/List'

    values = {

        'category': '',

        'rate': '',

        'range': '',

        'page': page

    }

    data = UrlUtils.urlencode(values)

    # Enable debug logging for HTTP and HTTPS traffic

    httphandler = HttpUtils.HTTPHandler(debuglevel=1)

    httpshandler = HttpUtils.HTTPSHandler(debuglevel=1)

    opener = HttpUtils.build_opener(httphandler, httpshandler)

    HttpUtils.install_opener(opener)

    request = HttpUtils.Request(url + '?' + data)

    request.get_method = lambda: 'GET'

    try:

        response = HttpUtils.urlopen(request, timeout=10)

    except HttpUtils.URLError as err:

        if hasattr(err, 'code'):

            print err.code

        if hasattr(err, 'reason'):

            print err.reason

        return None

    else:

        print '====== Http request OK ======'

    return response.read().decode('utf-8')
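
The function above targets Python 2 (`urllib2`). For readers on Python 3, a rough equivalent might look like the sketch below. The names `BASE_URL`, `build_list_url`, and `fetch_html` are hypothetical, and the host is the same placeholder used in this post. Building the URL is split out so it can be checked without touching the network.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen
from urllib.error import URLError

BASE_URL = 'https://box.xxx.com/Project/List'  # placeholder host from this post


def build_list_url(page):
    """Return the list URL for the given page number."""
    values = {'category': '', 'rate': '', 'range': '', 'page': page}
    return BASE_URL + '?' + urlencode(values)


def fetch_html(page, timeout=10):
    """Fetch one page and return its text, or None if the request fails."""
    request = Request(build_list_url(page))  # no body, so this is a GET
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except URLError as err:
        # HTTPError (a URLError subclass) also carries .code
        print(getattr(err, 'code', None), getattr(err, 'reason', None))
        return None


print(build_list_url(2))
# https://box.xxx.com/Project/List?category=&rate=&range=&page=2
```

Calling `fetch_html(1)` would then play the role of `gethtml(1)`. The debug handlers are left out here to keep the sketch short.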

Parsing the fetched data

Create a BeautifulSoup object

soup = BeautifulSoup(html, 'html.parser')

Get the object to iterate over

# items is a <listiterator object at 0x10a4b9950>, not a list, but you can still loop over it to visit every child node.

items = soup.find(attrs={'class':'row'}).children

projectList = []

for item in items:

    if item == '\n': continue

    # Extract the fields we need

    title = item.find(attrs={'class': 'title'}).string.strip()

    projectId = item.find(attrs={'class': 'subtitle'}).string.strip()

    projectType = item.find(attrs={'class': 'invest-item-subtitle'}).span.string

    percent = item.find(attrs={'class': 'percent'})

    state = 'Open'

    if percent is None: # funding is complete

        percent = '100%'

        state = 'Finished'

        totalAmount = item.find(attrs={'class': 'project-info'}).span.string.strip()

        investedAmount = totalAmount

    else:

        percent = percent.string.strip()

        state = 'Open'

        decimalList = item.find(attrs={'class': 'decimal-wrap'}).find_all(attrs={'class': 'decimal'})

        totalAmount =  decimalList[0].string

        investedAmount = decimalList[1].string

    investState = item.find(attrs={'class': 'invest-item-type'})

    if investState is not None:

        state = investState.string

    profitSpan = item.find(attrs={'class': 'invest-item-rate'}).find(attrs={'class': 'invest-item-profit'})

    profit1 = profitSpan.next.strip()

    profit2 = profitSpan.em.string.strip()

    profit = profit1 + profit2

    term = item.find(attrs={'class': 'invest-item-maturity'}).find(attrs={'class': 'invest-item-profit'}).string.strip()

    project = {

        'title': title,

        'projectId': projectId,

        'type': projectType,

        'percent': percent,

        'totalAmount': totalAmount,

        'investedAmount': investedAmount,

        'profit': profit,

        'term': term,

        'state': state

    }

    projectList.append(project)
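
The loop above assumes every element it looks up is present; a missing class would raise an AttributeError on None. The sketch below shows one defensive way to handle that, using Python 3 and `bs4`. The `SAMPLE` markup is invented and only mimics a few of the class names used above, and `text_of` is a hypothetical helper, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup that only mimics the class names used in this post.
SAMPLE = """
<div class="row">
  <div class="item">
    <span class="title"> Project A </span>
    <span class="subtitle"> P-001 </span>
    <span class="percent"> 40% </span>
  </div>
  <div class="item">
    <span class="title"> Project B </span>
    <span class="subtitle"> P-002 </span>
  </div>
</div>
"""


def text_of(node, css_class):
    """Return the stripped text of the first descendant with css_class, or None."""
    found = node.find(attrs={'class': css_class})
    return found.get_text(strip=True) if found is not None else None


soup = BeautifulSoup(SAMPLE, 'html.parser')
projects = []
for item in soup.find(attrs={'class': 'row'}).children:
    if not isinstance(item, Tag):   # skip whitespace-only text nodes
        continue
    percent = text_of(item, 'percent')
    projects.append({
        'title': text_of(item, 'title'),
        'projectId': text_of(item, 'subtitle'),
        'percent': percent or '100%',   # no percent element means funding is complete
        'state': 'Open' if percent else 'Finished',
    })

print(projects)
```

Filtering on `isinstance(item, Tag)` skips any text node between items, not only a bare '\n', which makes the loop less sensitive to how the page is indented.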





木辛 2023-10-21