Scraping Sina Finance Data

I wrote a Python script that scrapes stock data from Sina Finance.

#!/usr/bin/python
# -*- coding: utf-8 -*-
############################################
# @brief    Sina Finance http://hq.sinajs.cn/list=hk  quote number, English code, Chinese name
# @author  dwk715@gmail.com
############################################
from urllib.request import urlopen
from bs4 import BeautifulSoup

count = 99999
# Write the output as UTF-8 so the Chinese names survive on any platform.
with open('hk.txt', 'w', encoding='utf-8') as f:
    # Try every five-digit code from 00001 to 99999.
    for x in range(1, count + 1):
        listID = str(x).zfill(5)
        url = "http://hq.sinajs.cn/list=hk" + listID
        html = urlopen(url).read()
        # Decode the response as GBK; bs4 names this argument from_encoding.
        bsObj = BeautifulSoup(html, "html.parser", from_encoding="GBK")
        content = bsObj.get_text()
        # A response without a comma carries no quote data for this code.
        if ',' in content:
            # The fields sit between the first pair of double quotes.
            fields = content.split('"')[1].split(',')
            print(fields)
            f.write("HK" + listID + ',' + fields[0] + ',' + fields[1] + "\n")
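
The parsing step of the script can be tried without touching the network. The following is a minimal sketch: the function name parse_quote and both sample strings are made up for illustration, and the assumption is that a response wraps its comma-separated fields in one pair of double quotes, as the script's split logic implies.

```python
def parse_quote(content):
    """Return the comma-separated fields found between the first pair of
    double quotes, or None when the response carries no data."""
    if ',' not in content:
        return None
    return content.split('"')[1].split(',')


# Hypothetical responses, not real market data.
sample = 'var hq_str_hk00001="EXAMPLE CO,示例公司,1.00";'
empty = 'var hq_str_hk00002="";'

print(parse_quote(sample))
print(parse_quote(empty))
```

Keeping the parsing in its own function makes it easy to check against saved responses before running the full loop.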

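The script is not very efficient: looping through all 99999 codes took 2000 s, roughly 20 ms per request.

Here are a few pitfalls I ran into while writing it: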

zfill()

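Python's zfill() method returns a string of the specified width, with the original string right-aligned and padded with zeros on the left. If the string is already at least that wide, it is returned unchanged.

Syntax:

str.zfill(width)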
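
A quick sketch of how zfill() behaves, including the edge cases. The helper name hk_code is made up for illustration; it mirrors the way the script builds five-digit codes.

```python
def hk_code(n):
    """Format an integer as a five-digit, zero-padded code."""
    return str(n).zfill(5)


print(hk_code(1))          # 00001
print(hk_code(700))        # 00700
print("123456".zfill(5))   # 123456 (already wider than 5, unchanged)
print("-42".zfill(5))      # -0042 (zeros go after the sign)
```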

bs4's get_text()

This method returns a str; I had always assumed it returned a list.

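Garbled Chinese characters

I solved this problem with the help of a blog post, which I quote directly here. Note that the post targets the older BeautifulSoup 3 on Python 2 (hence urllib2 and fromEncoding); bs4 names the argument from_encoding.

Beautiful Soup tries the following encodings, in order, to convert your document to Unicode:
- The encoding you pass to the soup constructor through the fromEncoding argument.
- An encoding found in the document itself, for example in an XML declaration or in the http-equiv META tag of an HTML document. If Beautiful Soup finds such an encoding in the document, it tries to convert the document with it. However, if you explicitly specify an encoding and that encoding works, it ignores any encoding it finds in the document.
- An encoding sniffed from the first few bytes of the file. If an encoding can be detected this way, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
- An encoding sniffed by the chardet library, if you have it installed.

In summary: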

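If a page declares GB2312 but some of its characters are only available in GBK, the fix is to pass fromEncoding="GBK" when calling BeautifulSoup:

page = urllib2.build_opener().open(req).read()
soup = BeautifulSoup(page, fromEncoding="GBK")

If a page declares GBK but some of its characters are only available in GB18030, the fix is to pass fromEncoding="GB18030" when calling BeautifulSoup:

page = urllib2.build_opener().open(req).read()
soup = BeautifulSoup(page, fromEncoding="GB18030")

In fact, because GB18030's character set covers both GB2312 and GBK, passing GB18030 works in either case. Whenever Chinese characters come out garbled, whether the page declares GB2312 but uses GBK characters or declares GBK but uses GB18030 characters, you can simply write:

soup = BeautifulSoup(page, fromEncoding="GB18030")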
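
The superset relationship can be checked with nothing but Python's built-in codecs. This is a minimal sketch: the helper name decode_chinese is made up for illustration, and the sample text uses the character 镕, which GBK contains but GB2312 does not.

```python
def decode_chinese(raw, declared="gb2312"):
    """Decode bytes with the declared encoding, falling back to GB18030
    when the declared encoding cannot represent some of the bytes."""
    try:
        return raw.decode(declared)
    except UnicodeDecodeError:
        return raw.decode("gb18030")


# Bytes as a page might send them: declared GB2312, actually GBK.
raw = "朱镕基".encode("gbk")

try:
    raw.decode("gb2312")
except UnicodeDecodeError as err:
    print("gb2312 failed:", err.reason)

print(raw.decode("gbk"))
print(raw.decode("gb18030"))
print(decode_chinese(raw, "gb2312"))
```

Decoding strictly and falling back on failure keeps the declared encoding when it works, while still handling pages that stray outside it.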

Oh, and next up I plan to learn multithreading (lol).