python – Page 2 – LeoKim's Blog

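Beautiful Soup: Frequently Used Notes

leokim / May 14, 2017 / Python

Notes on the parts of bs4 that I use most often.

The highlight is the CSS selectors covered near the end; they are very convenient.

Kinds of objects

1. Tag

2. NavigableString

3. BeautifulSoup

4. Comment

Tag

A Tag object corresponds to a tag in the original XML or HTML document. Its most important properties are name and attributes.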

tag.name
# u'b'

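A tag may have any number of attributes. The tag <b class="boldest"> has an attribute "class" whose value is "boldest". You can work with a tag's attributes the same way as with a dictionary: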

tag['class']
# ['boldest']  (class is a multi-valued attribute, so current bs4 returns a list)

You can also get all attributes at once as a dictionary through .attrs:

tag.attrs
# {'class': ['boldest']}

A tag's attributes can be added, removed, or modified, again in the same way as dictionary entries.

tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
 
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
 
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None
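To make the dictionary-style access above easy to try, here is a minimal sketch for Python 3 with a current bs4. The markup and the id value x1 are made up for illustration; note that class comes back as a list because it is a multi-valued attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document used only for this illustration
soup = BeautifulSoup('<b class="boldest" id="x1">Extremely bold</b>', 'html.parser')
tag = soup.b

name = tag.name              # tag name as a string
classes = tag['class']       # multi-valued attribute, returned as a list
tag_id = tag['id']           # single-valued attribute, returned as a string
missing = tag.get('title')   # .get() returns None instead of raising KeyError

tag['class'] = 'verybold'    # modify an attribute
del tag['id']                # remove an attribute
html = str(tag)

print(name, classes, tag_id, missing)
print(html)
```

Using .get() is the safer choice when an attribute may be absent, since tag['title'] would raise KeyError here.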

NavigableString

Beautiful Soup uses the NavigableString class to wrap the strings inside a tag.

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

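A NavigableString behaves like a Python Unicode string, and it also supports some of the features described under navigating and searching the tree. A NavigableString can be converted to a plain Unicode string with unicode() in Python 2, or with str() in Python 3.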
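A small sketch of that conversion in Python 3 (the markup is only an example): the NavigableString keeps a link to its parent tag, while the converted value is an ordinary str.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

soup = BeautifulSoup('<b>Extremely bold</b>', 'html.parser')

nav = soup.b.string      # NavigableString, still attached to the parse tree
plain = str(nav)         # plain Python 3 string, detached from the tree

print(type(nav).__name__, nav.parent.name)
print(type(plain).__name__, plain)
```

Converting to str is worthwhile when you keep the text around for a long time, because a NavigableString holds a reference to the whole tree.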

BeautifulSoup

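The BeautifulSoup object represents the entire content of a document. Most of the time you can treat it as a Tag object; it supports most of the methods described under navigating and searching the tree.

Because the BeautifulSoup object is not a real HTML or XML tag, it has no name or attributes. It is sometimes handy to look at its .name, though, so the BeautifulSoup object has a special .name attribute whose value is "[document]":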

soup.name
# u'[document]'

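Navigating the tree

A Tag may contain multiple strings or other Tags; these are all children of that Tag. Beautiful Soup provides many attributes for navigating and iterating over a tag's children.

Note: string nodes in Beautiful Soup do not support these attributes, because a string cannot have children.

.contents and .children

A tag's .contents attribute returns the tag's children as a list, and .children iterates over the same nodes.

I don't find it very handy.

I'd rather just call find_all and loop over the result with for.

.parent

Gets the parent node.

Sibling nodes: .next_sibling and .previous_sibling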
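The navigation attributes listed above can be sketched as follows. The markup is a made-up fragment with no whitespace between the tags, so every child is a Tag rather than a string.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; no whitespace between tags, so all children are tags
html = ('<p id="story"><a id="link1">Elsie</a>'
        '<a id="link2">Lacie</a><a id="link3">Tillie</a></p>')
soup = BeautifulSoup(html, 'html.parser')
p = soup.p

child_ids = [child['id'] for child in p.children]  # iterate over direct children
first = p.contents[0]                              # .contents is a plain list
parent_name = first.parent.name                    # back up to the parent tag
next_id = first.next_sibling['id']                 # the sibling to the right
prev_of_first = first.previous_sibling             # None: nothing to the left

print(child_ids, parent_name, next_id, prev_of_first)
```

With real pages the siblings are often whitespace strings, so check the node type before indexing into it.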

find_all()

find_all( name , attrs , recursive , string , **kwargs )

find_all() searches all tag descendants of the current tag and checks whether each one matches the filter conditions. Here are a few examples:

soup.find_all("title")
# [<title>The Dormouse's story</title>]
 
soup.find_all("p", "title")
# [<p class="title"><b>The Dormouse's story</b></p>]
 
soup.find_all("a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.find_all(id="link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
 
import re
soup.find(string=re.compile("sisters"))
# u'Once upon a time there were three little sisters; and their names were\n'

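Some of these look familiar, while others are new. What do the string and id arguments mean? Why does find_all("p", "title") return the <p> tag whose CSS class is "title"? Let's look at the arguments of find_all() more closely.

Any keyword argument that is not one of the built-in parameter names is treated as a filter on the tag attribute of that name. For example, passing id makes Beautiful Soup filter on each tag's "id" attribute:

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

If you pass href, Beautiful Soup filters on each tag's "href" attribute:

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

The value of a named attribute argument can be a string, a regular expression, a list, or True.

The following example finds every tag that has an id attribute, whatever its value: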

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Passing several named arguments filters on multiple attributes at once:

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

Some attributes cannot be used as keyword arguments in a search, such as HTML5 data-* attributes:

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression

However, you can search for such attributes by passing a dictionary as the attrs argument of find_all():

data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
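The attribute filters described above can be tried together in one short script. This is only a sketch: the three-line document is invented for the example and is not the test file used elsewhere in this post.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '''<div data-foo="value" id="box">foo!</div>
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie">Lacie</a>'''
soup = BeautifulSoup(html, 'html.parser')

# True matches any tag that has the attribute at all
with_id = [t.name for t in soup.find_all(id=True)]

# A compiled regular expression is searched against the attribute value
by_href = [t['id'] for t in soup.find_all(href=re.compile('elsie'))]

# Attribute names that are not valid Python identifiers go through attrs
by_data = soup.find_all(attrs={'data-foo': 'value'})

print(with_id, by_href, by_data)
```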

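Searching by CSS class

Searching for tags by CSS class is very useful, but the name of the CSS attribute, class, is a reserved word in Python, so using class as a keyword argument causes a syntax error. As of Beautiful Soup 4.1.2 you can search by CSS class with the class_ keyword argument: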

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Like any keyword argument, class_ accepts different kinds of filters: a string, a regular expression, a function, or True:

soup.find_all(class_=re.compile("itl"))
# [<p class="title"><b>The Dormouse's story</b></p>]
 
def has_six_characters(css_class):
    return css_class is not None and len(css_class) == 6
 
# Wait, this works too? I'm amazed
soup.find_all(class_=has_six_characters)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

A tag's class attribute is a multi-valued attribute. When you search by CSS class, each of the tag's class names is matched individually:

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]
 
css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

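You can also search for the exact string value of the class attribute:

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]

When matching the full class value, the class names must appear in the same order as in the document; a query such as class_="strikeout body" finds nothing here. Separately, class can also be searched through the attrs dictionary, which is equivalent to class_: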

soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
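The ordering rule for multi-class matches is easy to check with a small sketch. The last line uses a CSS selector, which matches both classes regardless of order; select() relies on the soupsieve package that ships with recent bs4 releases.

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')

exact = css_soup.find_all('p', class_='body strikeout')      # same order: matches
reordered = css_soup.find_all('p', class_='strikeout body')  # other order: no match
via_select = css_soup.select('p.strikeout.body')             # order does not matter

print(len(exact), len(reordered), len(via_select))
```

So when a tag must carry two or more classes, a selector such as p.strikeout.body is the more robust way to ask for it.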

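The string argument

After reading the documentation I found that this argument is actually quite useful: you can search strings by exact match or regular expression right inside find/find_all, which feels almost too powerful.

The string argument searches the string content of the document. Like the name argument, it accepts a string, a regular expression, a list, a function, or True. Examples: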

soup.find_all(string="Elsie")
# [u'Elsie']
 
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
 
soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
 
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)
 
soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']

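Although string is meant for finding strings, it can be combined with arguments that find tags: Beautiful Soup then returns the tags whose .string matches the string value. The code below finds the <a> tags whose .string is "Elsie":

soup.find_all("a", string="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]

The remaining parameters are limit and recursive.

limit is simple: pass limit=n to find_all to stop after n results.

recursive: when you call find_all() on a tag, Beautiful Soup examines all of that tag's descendants. To search only the tag's direct children, pass recursive=False.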
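Here is a short sketch of limit, recursive, and a combined name plus string search. The one-line document is a trimmed, made-up version of the "three sisters" example.

```python
from bs4 import BeautifulSoup

# Hypothetical compact document for illustration
html = ("<html><head><title>The Dormouse's story</title></head><body><p>"
        '<a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Tillie</a>'
        "</p></body></html>")
soup = BeautifulSoup(html, 'html.parser')

first_two = [a['id'] for a in soup.find_all('a', limit=2)]  # stop after 2 results
deep = soup.html.find_all('title')                          # searches all descendants
shallow = soup.html.find_all('title', recursive=False)      # direct children only
by_text = soup.find_all('a', string='Elsie')                # tag whose .string matches

print(first_two, len(deep), shallow, by_text)
```

The title sits under head, not directly under html, which is why the recursive=False search comes back empty.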

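I expect CSS selectors to be the most useful part. I have hardly used them so far, so I will make a point of using them in my next crawler to get familiar with them.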

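CSS selectors

Beautiful Soup supports most CSS selectors (http://www.w3.org/TR/CSS2/selector.html). Pass a selector string to the .select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax:

select returns a list of tags. Using select with CSS selectors is really convenient!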

soup.select("title")
# [<title>The Dormouse's story</title>]
 
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]

Find tags beneath other tags, level by level:

soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.select("html head title")
# [<title>The Dormouse's story</title>]

Find tags directly beneath another tag:

soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []

Find the siblings of a tag:

soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie"  id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags by CSS class:

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by id:

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
 
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Find tags that match any selector from a list of selectors:

soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

Test for the existence of an attribute:

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

Find tags by attribute value:

soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
 
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

These cover most of what I need; I will add anything else that turns out to be important later.

Test file:

#! /usr/bin/env python
# -*- coding:utf8 -*-
# __author__="leokim"
 
from bs4 import BeautifulSoup
 
html_doc = """
<html><head><title>The Dormouse's story</title></head>
    <body>
<p class="title"><b>The Dormouse's story</b></p>
 
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
 
<p class="story">...</p>
"""
 
soup = BeautifulSoup(html_doc, 'html.parser')
 
#print(type(soup))
#<class 'bs4.BeautifulSoup'>
 
#print(type(soup.head))
#<class 'bs4.element.Tag'>
 
#print(type(soup.title))
#<class 'bs4.element.Tag'>
 
 
# print(soup.title.string)
#The Dormouse's story
# So a Tag does have a string attribute
 
#print(type(soup.title.string))
#<class 'bs4.element.NavigableString'>
 
#print(soup.find_all('a'))
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a clas
#s="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="siste
#r" href="http://example.com/tillie" id="link3">Tillie</a>]
 
#print(type(soup.find_all('a')))
#<class 'bs4.element.ResultSet'>
 
#print(type(soup.find('a')))
#<class 'bs4.element.Tag'>
# find returns a Tag, while find_all returns a ResultSet
 
# for a in soup.find_all('a'):
#  print(type(a))
#  print(a.string)
#<class 'bs4.element.Tag'>
#<class 'bs4.element.Tag'>
#<class 'bs4.element.Tag'>
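# A ResultSet is a collection of Tags; loop over it with for to take out each Tag and then do tag operations on it
# So my earlier find_all("xxx").find_all('xxx') could not work; each Tag has to be handled separately inside the loop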
 
 
 
#print(type(soup.find('a').string))
#<class 'bs4.element.NavigableString'>
 
#title_tag = soup.title
#print(title_tag.parent)
#<head><title>The Dormouse's story</title></head>
 
 
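# data-foo is not a valid Python keyword argument name, so pass it through attrs
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')
data_soup.find_all(attrs={"data-foo": "value"})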
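The ResultSet observation in the test file's comments can be reproduced in a few lines. This is a sketch with invented markup: calling find_all on the ResultSet itself raises AttributeError, while looping over the individual tags works.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration
html = '<div><p><a>one</a></p><p><a>two</a><a>three</a></p></div>'
soup = BeautifulSoup(html, 'html.parser')

paragraphs = soup.find_all('p')   # a ResultSet (a list of Tag objects)

try:
    paragraphs.find_all('a')      # a ResultSet has no find_all of its own
    chained_ok = True
except AttributeError:
    chained_ok = False

# Search inside each Tag instead
links = [a.string for p in paragraphs for a in p.find_all('a')]

print(chained_ok, links)
```

A CSS selector such as soup.select('p a') gives the same tags in one call.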

Match by language code:

multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup, 'html.parser')
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]

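Return only the first matching element:

soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

This is a great convenience for anyone who knows CSS selector syntax. Everything here can also be done with the regular Beautiful Soup API. If CSS selectors are all you need, you might as well use lxml directly: it is faster and supports more CSS selectors. What Beautiful Soup offers is the ability to combine CSS selectors with its own convenient API.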
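As a sketch of that combination, the snippet below selects tags with CSS and then reads attributes through the normal Tag API. The markup is a shortened copy of the test document; select_one returns None when nothing matches, so check the result before using it.

```python
from bs4 import BeautifulSoup

html_doc = """
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

# CSS selector to find the tags, dictionary-style access to read them
hrefs = [a['href'] for a in soup.select('p.story > a.sister')]

first = soup.select_one('.sister')         # first match only
none_found = soup.select_one('.missing')   # None when there is no match
second = soup.select('a:nth-of-type(2)')   # the second <a> among its siblings

print(hrefs)
print(first['id'], none_found, second[0]['id'])
```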

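urlencode and urldecode in Python

leokim / April 30, 2017 / Python

URLs and query parameters that contain Chinese characters are common. When such a URL is itself passed as a parameter (most often as a callback), the Chinese characters, and sometimes even '/', must be percent-encoded first.

1. urlencode

The urllib library has a urlencode function that turns key-value pairs into a query string of the form a=1&b=2. For example (Python 2):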

>>> from urllib import urlencode 
>>> data = { 
... 'a': 'test', 
... 'name': '魔兽' 
... } 
>>> print urlencode(data) 
a=test&name=%C4%A7%CA%DE

To percent-encode just a single string, urllib provides another function, quote():

>>> from urllib import quote 
>>> quote('魔兽') 
'%C4%A7%CA%DE'

2. urldecode

Once the encoded string has been received, it has to be decoded again (urldecode). urllib provides unquote() for this; there is no function called urldecode()!

>>> from urllib import unquote
>>> unquote('%C4%A7%CA%DE')
'\xc4\xa7\xca\xde'
>>> print unquote('%C4%A7%CA%DE')
魔兽

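3. Discussion

The output of unquote() above is the GBK byte sequence of the Chinese text. Comparing it with the result of quote() shows what URL encoding does: it takes the bytes of the string (GBK here, because that was the terminal's encoding) and writes each byte \xNN as %NN. If your terminal uses UTF-8, the decoded GBK bytes must be converted to UTF-8 before printing, otherwise the output is garbled.

Depending on your needs, you can define your own urlencode() and urldecode() helpers.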
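The sessions above are Python 2. As a rough Python 3 equivalent, the same functions live in urllib.parse and take an explicit encoding argument, so the result no longer depends on the terminal. The callback URL below is a made-up example.

```python
from urllib.parse import quote, unquote, urlencode

data = {"a": "test", "name": "魔兽"}

gbk_query = urlencode(data, encoding="gbk")   # same bytes as the Python 2 session
utf8_query = urlencode(data)                  # UTF-8 is the default in Python 3

gbk_quoted = quote("魔兽", encoding="gbk")
restored = unquote(gbk_quoted, encoding="gbk")

# safe="" also encodes '/', ':' and '?', which matters when a whole URL
# is passed as a parameter value (hypothetical callback URL)
callback = quote("http://example.com/cb?x=1", safe="")

print(gbk_query)
print(utf8_query)
print(restored, callback)
```

The decoding side must use the same encoding as the encoding side; mixing GBK and UTF-8 is what produces garbled text.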

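Crawler practice in Python

leokim / April 30, 2017 (updated May 2, 2017) / Python

I was bored over the May Day holiday, so I wrote a crawler for fun.

Python really is convenient for small things like this.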

__author__='LeoKim'
from bs4 import BeautifulSoup
  
import re
import urllib.request, urllib.parse, http.cookiejar
import json
import time
import pymysql
 
conn=pymysql.connect(host='localhost',user='root',passwd='superhero',db='python_test',port=3306,charset='utf8')
cur=conn.cursor()  # get a cursor

# Fetch the HTML of one listing page; the community names are parsed from it later
def getVillage(url):
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'),
    ('Cookie', 'select_city=320100; lianjia_uuid=c73af582-9ed7-42ed-9738-bbd4688c67e0; UM_distinctid=15bb9f33ca387c-0ac15874ad5d0d-6a11157a-1fa400-15bb9f33ca4a02; _jzqckmp=1; all-lj=c28812af28ef34a41ba2474a2b5c52c2; _jzqx=1.1493473537.1493544561.2.jzqsr=nj%2Elianjia%2Ecom|jzqct=/ershoufang/gulou/.jzqsr=nj%2Elianjia%2Ecom|jzqct=/xiaoqu/pg1/; _gat=1; _gat_past=1; _gat_global=1; _gat_new_global=1; _gat_dianpu_agent=1; _smt_uid=59049861.8870a59; CNZZDATA1253492138=835595246-1493470448-null%7C1493541950; CNZZDATA1254525948=1922726511-1493470772-null%7C1493540995; CNZZDATA1255633284=630946367-1493469955-null%7C1493543402; CNZZDATA1255604082=270979082-1493468920-null%7C1493544528; _qzja=1.1520598967.1493473405458.1493480837509.1493544561423.1493544849473.1493544851953.0.0.0.29.3; _qzjb=1.1493544561423.10.0.0.0; _qzjc=1; _qzjto=10.1.0; _jzqa=1.2414222906473966000.1493473537.1493480838.1493544561.3; _jzqc=1; _jzqb=1.10.10.1493544561.1; _ga=GA1.2.1108117219.1493473408; _gid=GA1.2.2091828031.1493544853; lianjia_ssid=5c8ebd96-81f4-4430-bfda-6d941fcb8663')]
 
    urllib.request.install_opener(opener)
 
    html_bytes = urllib.request.urlopen(url).read()
    html_string = html_bytes.decode('utf-8')
    return html_string
 
 
 
def start(start_url):
    html_doc = getVillage(start_url)
    soup = BeautifulSoup(html_doc, 'html.parser')
 
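    # Read the total page count and the current page number
    totalPageNoDiv=soup.find("div","house-lst-page-box")
    # page-data is assumed to hold a JSON object; json.loads is safer than eval on scraped text
    Page = json.loads(totalPageNoDiv.attrs['page-data'])

    totalPageNo = Page['totalPage']
    curPage = Page['curPage']

    print('Fetching page '+str(curPage)+' of '+str(totalPageNo)+'.')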
 
 
    # Collect the residential community names on this page
    divs = soup.find_all("div","title")
    for div in divs:
        a_tag = div.find("a",target="_blank")
        if a_tag:
            # Insert into the database; pass the parameters as a one-element tuple
            sql = "INSERT INTO `village` (`name`) VALUES (%s)"
            cur.execute(sql, (a_tag.string,))

    curPage = curPage + 1
    if(totalPageNo == curPage-1):
        print('Done.')
    else:
        time.sleep(10)
        start_url = "http://nj.lianjia.com/xiaoqu/pg"+str(curPage)
        start(start_url)
 
 
totalPageNo=1
curPage=1
 
start_url = "http://nj.lianjia.com/xiaoqu/pg"+str(curPage)
start(start_url)
 
 
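conn.commit()  # pymysql does not autocommit by default, so save the inserts explicitly
cur.close()  # close the cursor
conn.close()  # release the database connection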
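The parsing step of the crawler above can be tried offline. The sketch below is not the original script: the sample HTML is invented to mimic the structure the crawler expects (a page-data attribute holding JSON, and div.title links with target="_blank"), and parse_page and page_urls are hypothetical helper names. Building the list of page URLs up front also avoids one recursive call per page.

```python
import json
from bs4 import BeautifulSoup

# Invented sample that imitates the structure the crawler relies on
sample = '''
<div class="title"><a target="_blank" href="#">Village A</a></div>
<div class="title"><a href="#">no target attribute</a></div>
<div class="title"><a target="_blank" href="#">Village B</a></div>
<div class="house-lst-page-box" page-data='{"totalPage":3,"curPage":1}'></div>
'''

def parse_page(html):
    """Return (names, current_page, total_pages) for one listing page."""
    soup = BeautifulSoup(html, 'html.parser')
    box = soup.find('div', 'house-lst-page-box')
    page = json.loads(box['page-data'])      # JSON text, so no eval needed
    names = []
    for div in soup.find_all('div', 'title'):
        a_tag = div.find('a', target='_blank')
        if a_tag:
            names.append(a_tag.get_text(strip=True))
    return names, page['curPage'], page['totalPage']

def page_urls(total_pages, base='http://nj.lianjia.com/xiaoqu/pg'):
    """Build every page URL up front instead of recursing page by page."""
    return [base + str(n) for n in range(1, total_pages + 1)]

names, cur_page, total_pages = parse_page(sample)
print(names, cur_page, total_pages)
print(page_urls(total_pages))
```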

__author__='LeoKim'
import json
import pymysql
import urllib.request, urllib.parse, http.cookiejar
from urllib import parse

conn=pymysql.connect(host='localhost',user='root',passwd='superhero',db='python_test',port=3306,charset='utf8')
cur=conn.cursor()  # get a cursor

# Look up the geohash and coordinates for a community name via the ele.me POI search
def getgeohash(keyword):
    key={
        'keyword':keyword
    }


    url='https://mainsite-restapi.ele.me/v2/pois?extras%5B%5D=count&geohash=wtsm0ss7yfj8&limit=20&type=nearby&'+parse.urlencode(key)

    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'),
    ('Cookie', 'ubt_ssid=but8xnmtkpfrbvypd9z3hxaa5i8ugmj0_2017-04-29; _utrace=edd9bb6de13caed667d2cf273d73fc0a_2017-04-29')]

    urllib.request.install_opener(opener)

    html_bytes = urllib.request.urlopen(url).read()
    html_string = html_bytes.decode('utf-8')

    try:
        # The response is expected to be a JSON list; parse it directly
        # (eval fails on JSON literals such as null, true and false)
        info = json.loads(html_string)
        if info:
            return info[0]
        return 'error'
    except Exception:
        # Malformed or unexpected responses are reported with the 'error' sentinel
        return 'error'
     
 
sql = "SELECT id,name FROM `village` where geohash is null"
cur.execute(sql)
data = cur.fetchall()
 
for d in data:
    print(d[0])
 
    geohash=''
    latitude=''
    longitude=''
 
    gh=getgeohash(d[1])
 
    if gh=='error':
        geohash='error'
        latitude=''
        longitude=''
    else:
        geohash = gh['geohash']
        latitude = gh['latitude']
        longitude = gh['longitude']
 
    print(geohash,latitude,longitude)
 
# gh['geohash'] is None
 
    sql = "UPDATE `village` SET geohash=%s,latitude=%s,longitude=%s where id=%s"
    cur.execute(sql, (geohash,latitude,longitude,d[0]))
     
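conn.commit()  # pymysql does not autocommit by default, so save the updates explicitly
cur.close()  # close the cursor
conn.close()  # release the database connection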

__author__='LeoKim'
from bs4 import BeautifulSoup
import pymysql
import urllib.request, urllib.parse, http.cookiejar
from urllib import parse
import json
import re

conn=pymysql.connect(host='localhost',user='root',passwd='superhero',db='python_test',port=3306,charset='utf8')
cur=conn.cursor()  # get a cursor
 
def getstore(village_id,geohash,latitude,longitude,limit):
    key={
        'geohash':geohash,
        'latitude':latitude,
        'longitude':longitude,
        'limit':limit
    }
 
 
    # Note the '&' before the encoded parameters; without it they run into terminal=web
    url='https://mainsite-restapi.ele.me/shopping/restaurants?extras%5B%5D=activities&offset=0&terminal=web&'+parse.urlencode(key)
 
    cj = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    opener.addheaders = [('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.101 Safari/537.36'),
    ('Cookie', 'ubt_ssid=but8xnmtkpfrbvypd9z3hxaa5i8ugmj0_2017-04-29; _utrace=edd9bb6de13caed667d2cf273d73fc0a_2017-04-29')]
 
    urllib.request.install_opener(opener)
 
    html_bytes = urllib.request.urlopen(url).read()
    html_string = html_bytes.decode('utf-8')
    # The response body is JSON, so parse it directly rather than round-tripping through BeautifulSoup
    jsonData = json.loads(html_string)
     
 
    for data in jsonData:
 
        print(data['id'])
        print(village_id)
        print(data['name'])
        print(data['recent_order_num'])
        print(data['address'])
        print(data['order_lead_time'])
        print(data['float_delivery_fee'])
 
        average_cost=0
        if 'average_cost' in data:
            cost = re.findall(r'\d+', data['average_cost'])
            average_cost=cost[0]
            print(average_cost)
 
        print(data['rating'])
        print('---------------------------------------------')
 
        shop_id = data['id']
        name = data['name']
        address = data['address']
        recent_order_num = data['recent_order_num']
        order_lead_time = data['order_lead_time']
        float_delivery_fee = data['float_delivery_fee']
        rating = data['rating']
 
        sql = "INSERT INTO `store` (`shop_id`,`village_id`,`name`,`address`,`recent_order_num`,`order_lead_time`,`float_delivery_fee`, `average_cost`, `rating`) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)"
        cur.execute(sql, (shop_id,village_id,name, address, recent_order_num, order_lead_time, float_delivery_fee, average_cost, rating))
 
 
# getstore('wtst84g4g0u','31.91988','118.83238',30)
 
sql = "SELECT id,name,geohash,latitude,longitude FROM `village` where id >482 and geohash is not null"
cur.execute(sql)
data = cur.fetchall()
 
for d in data:
    village_id=d[0]
    geohash = d[2]
    latitude = d[3]
    longitude = d[4]
 
    getstore(village_id,geohash,latitude,longitude,30)
 
 
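conn.commit()  # pymysql does not autocommit by default, so save the inserts explicitly
cur.close()  # close the cursor
conn.close()  # release the database connection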

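Installing lxml with pip on Python 3.5

leokim / July 27, 2016 (updated November 3, 2016) / Tech

pip could not install it directly.

I finally found the answer on Zhihu.

Author: 深海鱼
Link: http://www.zhihu.com/question/26857761/answer/69754633
Source: Zhihu
Copyright belongs to the author; contact the author for permission to repost.

1. Install wheel by running on the command line: pip install wheel
2. Download the matching .whl file from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml and do not rename the file! Press Ctrl + F, type lxml, and find this section:
Lxml, a binding for the libxml2 and libxslt libraries.
lxml-3.4.4-cp27-none-win32.whl
lxml-3.4.4-cp27-none-win_amd64.whl
lxml-3.4.4-cp33-none-win32.whl
lxml-3.4.4-cp33-none-win_amd64.whl
lxml-3.4.4-cp34-none-win32.whl
lxml-3.4.4-cp34-none-win_amd64.whl
lxml-3.4.4-cp35-none-win32.whl
lxml-3.4.4-cp35-none-win_amd64.whl
The number after cp is the Python version (27 means 2.7); pick the file that matches your Python version.
3. Change into the folder containing the .whl file and run this command to finish the installation: pip install <full file name including the .whl extension>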

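Python argument notes (what the asterisk before a parameter means)

leokim / June 15, 2015 (updated November 3, 2016) / Python

When writing a function, it is often impossible to know which arguments it will receive at run time. Another case is a function that must operate on many kinds of objects. In still other cases, the function call itself becomes a kind of API for applications to use.

Note that args and kwargs are only Python conventions. You can name these parameters however you like, but it is best to follow the standard Python idiom so that other programmers can read your code easily.

Putting a single asterisk before a parameter name lets the function accept any number of positional arguments.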

>>> def multiply(*args):
...     total = 1
...     for arg in args:
...         total *= arg
...     return total
...
>>> multiply(2, 3)
6
>>> multiply(2, 3, 4, 5, 6)
720

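Python collects the arguments into a tuple bound to the variable args. If no positional arguments remain beyond the explicitly declared parameters, args is an empty tuple.

Keyword arguments

Python supports any number of keyword arguments by putting two asterisks before the parameter name.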

>>> def accept(**kwargs):
...     for keyword, value in kwargs.items():
...         print("%s => %r" % (keyword, value))
...
>>> accept(foo='bar', spam='eggs')
foo => 'bar'
spam => 'eggs'

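Note: kwargs is an ordinary Python dictionary containing the argument names and values. If there are no extra keyword arguments, kwargs is an empty dictionary.

Mixing argument types

Arbitrary positional and keyword arguments can be combined with other standard parameter declarations. Take care when mixing them, because their order matters in Python. Parameters fall into four categories, not all of which are required. They must be declared in the order below; unused categories can be skipped.

1) required parameters
2) optional parameters
3) excess positional arguments
4) excess keyword arguments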

def complex_function(a, b=None, *c, **d):

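This order is required because *args and **kwargs only receive whatever arguments were not consumed by the other parameters. Without it, when you call a function with positional arguments, Python could not tell which values are meant for the declared parameters and which should be treated as excess arguments.

Note also that although a function can accept any number of required and optional parameters, it can define only one of each excess-argument type.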
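To see how the four categories are filled, here is a small sketch that gives the complex_function signature a body which simply returns what each parameter received. The body is my own addition for illustration.

```python
def complex_function(a, b=None, *c, **d):
    # Return every parameter so the binding is visible
    return a, b, c, d

only_required = complex_function(1)
with_excess = complex_function(1, 2, 3, 4, x=5)
skip_optional = complex_function(1, x=5)

print(only_required)   # (1, None, (), {})
print(with_excess)     # (1, 2, (3, 4), {'x': 5})
print(skip_optional)   # (1, None, (), {'x': 5})
```

Positional values fill a and b first; only what is left over ends up in c, and only unmatched keywords end up in d.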

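Passing argument collections

Besides letting functions receive arbitrary collections of arguments, Python code can also call functions with an arbitrary number of arguments, using the same asterisk notation described above. Arguments passed this way are expanded by Python into an ordinary argument list, so the called function does not need to declare excess parameters in order to be called like this. Any callable in Python can be called with this technique, and it follows the same ordering rules when combined with standard arguments.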

>>> def add(a, b, c):
...     return a + b + c
...
>>> add(1, 2, 3)
6
>>> add(a=4, b=5, c=6)
15
>>> args = (2, 3)
>>> add(1, *args)
6
>>> kwargs={'b': 8, 'c': 9}
>>> add(a=7, **kwargs)
24
>>> add(a=7, *args)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: add() got multiple values for keyword argument 'a'
>>> add(1, 2, a=7)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: add() got multiple values for keyword argument 'a'

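Note the last lines of this example: pay particular attention to whether you also pass keyword arguments explicitly when passing a tuple as excess positional arguments. Because Python expands excess arguments according to the ordering rules, positional arguments come first. In this example the last two calls are equivalent, and Python cannot decide which value is meant for a.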
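A common use of both forms together is a wrapper that forwards whatever it receives to another callable. The sketch below is written for Python 3; call_with_log is a hypothetical helper name, and the last part reproduces the conflict from the session above.

```python
def add(a, b, c):
    return a + b + c

def call_with_log(func, *args, **kwargs):
    """Forward any positional and keyword arguments to func unchanged."""
    result = func(*args, **kwargs)
    return func.__name__, result

name, total = call_with_log(add, 1, *(2, 3))               # add(1, 2, 3)
_, total_kw = call_with_log(add, a=7, **{'b': 8, 'c': 9})  # add(a=7, b=8, c=9)

# The unpacked tuple fills a positionally, so a=7 is a second value for a
try:
    add(*(2, 3), a=7)
    conflict = None
except TypeError as exc:
    conflict = type(exc).__name__

print(name, total, total_kw, conflict)
```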

Python error types

leokim / June 14, 2015 (updated November 3, 2016) / Python

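1. NameError: accessing a variable that has not been defined

>>> v
NameError: name 'v' is not defined

2. ZeroDivisionError: division by zero

>>> v = 1/0
ZeroDivisionError: division by zero

3. SyntaxError: invalid syntax

SyntaxError: invalid syntax (<pyshell#14>, line 1)

4. IndexError: index out of range

>>> List = [2]
>>> List[3]
Traceback (most recent call last):
    List[3]
IndexError: list index out of range

5. KeyError: the dictionary key does not exist

>>> Dic['3']
Traceback (most recent call last):
  File "<pyshell#20>", line 1, in <module>
    Dic['3']
KeyError: '3'

6. IOError: input/output error

>>> f = open('abc')
IOError: [Errno 2] No such file or directory: 'abc'

7. AttributeError: accessing an attribute the object does not have

>>> class Worker:
...     def Work():
...         pass
...
>>> w = Worker()
>>> w.a
Traceback (most recent call last):
    w.a
AttributeError: 'Worker' object has no attribute 'a'

8. ValueError: invalid value

>>> int('d')
Traceback (most recent call last):
    int('d')
ValueError: invalid literal for int() with base 10: 'd'

9. TypeError: wrong type

>>> iStr = '22'
>>> iVal = 22
>>> obj = iStr + iVal;
Traceback (most recent call last):
    obj = iStr + iVal;

10. AssertionError: an assertion failed

>>> assert 1 != 1
Traceback (most recent call last):
    assert 1 != 1
AssertionError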

http://blog.csdn.net/fcoolx/article/details/4202872
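The error types listed above can be confirmed in one go by running a small action for each and recording the name of the exception it raises. This is a sketch for Python 3; error_name is a hypothetical helper, and the sample values are arbitrary.

```python
def error_name(action):
    """Run action() and return the name of the exception it raises, or None."""
    try:
        action()
    except Exception as exc:
        return type(exc).__name__
    return None

def failed_assertion():
    assert 1 != 1

items = [2]
table = {'1': 'yes'}

names = [
    error_name(lambda: undefined_variable),  # name was never assigned
    error_name(lambda: 1 / 0),               # division by zero
    error_name(lambda: items[3]),            # index past the end of the list
    error_name(lambda: table['3']),          # key not in the dictionary
    error_name(lambda: object().a),          # attribute does not exist
    error_name(lambda: int('d')),            # not a valid integer literal
    error_name(lambda: '22' + 22),           # str + int
    error_name(failed_assertion),            # assertion is false
]

print(names)
```

SyntaxError is left out because it is raised when the code is compiled, before any try block can run.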

11. NotImplementedError: raised when a method has not been implemented

Example:
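The original example did not survive, so here is a minimal sketch of the usual pattern instead; the Shape and Square classes are invented for illustration. A base class raises NotImplementedError in a method that subclasses are expected to override.

```python
class Shape:
    def area(self):
        # Placeholder: concrete subclasses must provide the real calculation
        raise NotImplementedError("subclasses must implement area()")

class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2

square_area = Square(3).area()

try:
    Shape().area()
    base_error = None
except NotImplementedError as exc:
    base_error = str(exc)

print(square_area, base_error)
```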

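13. If you are not sure whether the data is a dictionary or a list, you can use

14. StandardError: the standard exception.

In Python 2, every exception other than StopIteration, GeneratorExit, KeyboardInterrupt and SystemExit is a subclass of StandardError. (StandardError was removed in Python 3.)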

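Python notes

leokim / June 8, 2015 (updated November 3, 2016) / Python

1. Using %s, %d and %f

In Python 2, the print statement combined with the string formatting operator (%) performs string substitution.

%s: replaced by a string

%d: replaced by an integer

%f: replaced by a floating-point number

e.g.: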

>>> print "%s is number %d!" % ("Python",1)

Python is number 1!
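The % operator works the same way in Python 3; only print changed from a statement to a function. A short sketch, with arbitrary sample values, that also shows precision and width modifiers:

```python
line = "%s is number %d!" % ("Python", 1)
price = "%.2f" % 3.14159              # two digits after the decimal point
padded = "%5d|%-5s|" % (42, "ab")     # right-aligned number, left-aligned string

print(line)     # Python is number 1!
print(price)    # 3.14
print(padded)
```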

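2. The raw_input() built-in function

raw_input reads a string from standard input and automatically strips the trailing newline character.

a) The value read can be assigned to a variable for later use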

>>> user=raw_input("enter your name:")

enter your name:jane

>>> print "your login is:", user

your login is: jane

b) You can also use int() to convert the input string to an integer

>>> num=raw_input("Now enter a number:")

Now enter a number:1023

>>> print "doubling your number: %d" %(int(num)*2)

doubling your number: 2046

>>> print "doubling your number: %d" %(num *2)

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

TypeError: %d format: a number is required, not str
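The failure above happens because num is still a string: num * 2 repeats the text instead of doubling the number. In Python 3 raw_input() is called input(), and the conversion step is the same. The sketch below uses a hypothetical helper, double_input, and takes the text as a parameter so it can run without a terminal.

```python
def double_input(text):
    """Convert user-entered text to an int and double it; None if it is not a number."""
    try:
        return int(text) * 2
    except ValueError:
        return None

repeated = '1023' * 2            # string repetition, not arithmetic
doubled = double_input('1023')
invalid = double_input('abc')

print(repeated, doubled, invalid)
```

In an interactive script you would call it as double_input(input("Now enter a number:")).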
