pandas Tutorial: Interacting with Web APIs and Databases

Contents

  • 6.3 Interacting with Web APIs
  • 6.4 Interacting with Databases

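6.3 Interacting with Web APIs

Many websites have public APIs that provide data feeds in JSON or some other format. There are several ways to access these APIs from Python; one easy-to-use option is the requests package.

To find the 30 most recent issues for pandas on GitHub, we can make a GET HTTP request with requests: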

import pandas as pd
import numpy as np
import requests

url = 'https://api.github.com/repos/pandas-dev/pandas/issues'
resp = requests.get(url)
resp

<Response [200]>
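The call above assumes the request succeeds. A minimal sketch of a more defensive version follows, assuming a hypothetical helper named fetch_json (it is not part of requests or of the original text) that adds a timeout and raises on HTTP error status codes:

```python
import requests


def fetch_json(url, params=None, timeout=10):
    """Hypothetical helper: GET a URL and return the parsed JSON body."""
    # timeout (in seconds) keeps the call from hanging indefinitely
    resp = requests.get(url, params=params, timeout=timeout)
    # Raises requests.HTTPError if the status code is 4xx or 5xx
    resp.raise_for_status()
    return resp.json()


# Example call (needs network access, so it is left commented out):
# issues = fetch_json('https://api.github.com/repos/pandas-dev/pandas/issues')
```

Checking the status before parsing means a rate-limit or error response fails loudly instead of being passed on to DataFrame as if it were issue data.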

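The Response object's json method returns the JSON content parsed into native Python objects. For this endpoint the result is a list of dicts, one per issue: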

data = resp.json()
data[0]['title']

'Optimize data type'

data[0]

{'assignee': None,
 'assignees': [],
 'author_association': 'NONE',
 'body': 'Hi guys, i\'m user of mysql\r\nwe have an "function" PROCEDURE ANALYSE\r\nhttps://dev.mysql.com/doc/refman/5.5/en/procedure-analyse.html\r\n\r\nit get all "dataframe" and show what\'s the best "dtype", could we do something like it in Pandas?\r\n\r\nthanks!',
 'closed_at': None,
 'comments': 1,
 'comments_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/comments',
 'created_at': '2017-11-13T22:51:32Z',
 'events_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/events',
 'html_url': 'https://github.com/pandas-dev/pandas/issues/18272',
 'id': 273606786,
 'labels': [],
 'labels_url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272/labels{/name}',
 'locked': False,
 'milestone': None,
 'number': 18272,
 'repository_url': 'https://api.github.com/repos/pandas-dev/pandas',
 'state': 'open',
 'title': 'Optimize data type',
 'updated_at': '2017-11-13T22:57:27Z',
 'url': 'https://api.github.com/repos/pandas-dev/pandas/issues/18272',
 'user': {'avatar_url': 'https://avatars0.githubusercontent.com/u/2468782?v=4',
  'events_url': 'https://api.github.com/users/rspadim/events{/privacy}',
  'followers_url': 'https://api.github.com/users/rspadim/followers',
  'following_url': 'https://api.github.com/users/rspadim/following{/other_user}',
  'gists_url': 'https://api.github.com/users/rspadim/gists{/gist_id}',
  'gravatar_id': '',
  'html_url': 'https://github.com/rspadim',
  'id': 2468782,
  'login': 'rspadim',
  'organizations_url': 'https://api.github.com/users/rspadim/orgs',
  'received_events_url': 'https://api.github.com/users/rspadim/received_events',
  'repos_url': 'https://api.github.com/users/rspadim/repos',
  'site_admin': False,
  'starred_url': 'https://api.github.com/users/rspadim/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/rspadim/subscriptions',
  'type': 'User',
  'url': 'https://api.github.com/users/rspadim'}}
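
For intuition, what the json method does can be approximated offline with the standard library's json module. The sketch below parses a hand-written JSON string (an abbreviated stand-in for the real response body, using values from the output above) into a list of dicts:

```python
import json

# Abbreviated stand-in for the response body; the real payload has many more fields
payload = '[{"number": 18272, "title": "Optimize data type", "state": "open", "labels": []}]'

data = json.loads(payload)   # JSON array -> list, JSON object -> dict
first = data[0]
print(type(data).__name__, type(first).__name__)
print(first['title'])
```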

Each element of data is a dict holding the information found on a GitHub issue page. We can pass data directly to DataFrame and extract the fields of interest:

issues = pd.DataFrame(data, columns=['number', 'title',
                                     'labels', 'state'])
issues

| | number | title | labels | state |
| --- | --- | --- | --- | --- |
| 0 | 18272 | Optimize data type | [] | open |
| 1 | 18271 | BUG: Series.rank(pct=True).max() != 1 for a la... | [] | open |
| 2 | 18270 | (Series\|DataFrame) datetimelike ops | [] | open |
| 3 | 18268 | DOC: update Series.combine/DataFrame.combine d... | [] | open |
| 4 | 18266 | DOC: updated .combine_first doc strings | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 5 | 18265 | Calling DataFrame.stack on an out-of-order col... | [] | open |
| 6 | 18264 | cleaned up imports | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 7 | 18263 | Tslibs offsets paramd | [] | open |
| 8 | 18262 | DEPR: let's deprecate | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 9 | 18258 | DEPR: deprecate (Sparse)Series.from_array | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 10 | 18255 | ENH/PERF: Add cache='infer' to to_datetime | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 11 | 18250 | Categorical.replace() unexpectedly returns non... | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 12 | 18246 | pandas.MultiIndex.reorder_levels has no inplac... | [] | open |
| 13 | 18245 | TST: test tz-aware DatetimeIndex as separate m... | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 14 | 18244 | RLS 0.21.1 | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 15 | 18243 | DEPR: deprecate .ftypes, get_ftype_counts | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 16 | 18242 | CLN: Remove days, seconds and microseconds pro... | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 17 | 18241 | DEPS: drop 2.7 support | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 18 | 18238 | BUG: Fix filter method so that accepts byte an... | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 19 | 18237 | Deprecate Series.asobject, Index.asobject, ren... | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 20 | 18236 | df.plot() very slow compared to explicit matpl... | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 21 | 18235 | Quarter.onOffset looks fishy | [] | open |
| 22 | 18231 | Reduce copying of input data on Series constru... | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 23 | 18226 | Patch init to prevent passing invalid kwds | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 24 | 18222 | DataFrame.plot() produces incorrect legend lab... | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 25 | 18220 | DataFrame.groupy renames columns when given a ... | [] | open |
| 26 | 18217 | Deprecate Index.summary | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 27 | 18216 | Pass kwargs from read_parquet() to the underly... | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 28 | 18215 | DOC/DEPR: ensure that @deprecated functions ha... | [{'url': 'https://api.github.com/repos/pandas-... | open |
| 29 | 18213 | Deprecate Series.from_array ? | [{'url': 'https://api.github.com/repos/pandas-... | open |
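
The labels column in the table holds lists of dicts, which are awkward to filter on. One possible way to tidy this up is sketched below, assuming a few hand-written records shaped like the issue dicts (the sample values and the label_names column are made up for illustration). It pulls out just the label names and uses pd.json_normalize to flatten the nested user dict:

```python
import pandas as pd

# Hypothetical records shaped like the GitHub issue dicts; values are invented
sample = [
    {'number': 1, 'title': 'First', 'state': 'open',
     'labels': [], 'user': {'login': 'alice'}},
    {'number': 2, 'title': 'Second', 'state': 'open',
     'labels': [{'name': 'Docs'}, {'name': 'Bug'}], 'user': {'login': 'bob'}},
]

issues = pd.DataFrame(sample, columns=['number', 'title', 'labels', 'state'])

# Keep only the 'name' of each label dict
issues['label_names'] = issues['labels'].apply(
    lambda labels: [label['name'] for label in labels])

# Nested dicts become dotted column names, e.g. 'user.login'
flat = pd.json_normalize(sample)

print(issues[['number', 'label_names']])
print(flat[['number', 'user.login']])
```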

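6.4 Interacting with Databases

In a business setting, most data is not stored in text or Excel files. SQL-based relational databases (such as SQL Server, PostgreSQL, and MySQL) are the most widely used. The choice of database usually depends on the performance, data integrity, and scalability needs of the application.

Loading data from SQL into a DataFrame is fairly straightforward, and pandas has some functions that simplify the process. As an example, we create a SQLite database using Python's built-in sqlite3 driver: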

import sqlite3
import pandas as pd

query = """
CREATE TABLE test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""

con = sqlite3.connect('../examples/mydata.sqlite')
con.execute(query)

<sqlite3.Cursor at 0x1049931f0>

con.commit()
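
Running the CREATE TABLE statement a second time against the same file raises sqlite3.OperationalError, because the table already exists. A minimal sketch of a rerunnable variant follows, assuming an in-memory database (':memory:') instead of the file used above and adding IF NOT EXISTS to the statement:

```python
import sqlite3

query = """
CREATE TABLE IF NOT EXISTS test
(a VARCHAR(20), b VARCHAR(20),
 c REAL,        d INTEGER
);"""

# ':memory:' creates a throwaway database that disappears when closed
con = sqlite3.connect(':memory:')
con.execute(query)
con.execute(query)   # second call is a no-op rather than an error
con.commit()

# List the tables that now exist
tables = [row[0] for row in
          con.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)
```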

Then, insert a few rows of data:

data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"
con.executemany(stmt, data)

<sqlite3.Cursor at 0x1049932d0>

con.commit()
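
As an alternative to calling commit by hand, a sqlite3 connection can be used as a context manager: the transaction is committed if the block succeeds and rolled back if it raises. A sketch, again assuming an in-memory database so that it runs on its own:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")

data = [('Atlanta', 'Georgia', 1.25, 6),
        ('Tallahassee', 'Florida', 2.6, 3),
        ('Sacramento', 'California', 1.7, 5)]
stmt = "INSERT INTO test VALUES(?, ?, ?, ?)"

# Commits automatically on success, rolls back on an exception
with con:
    con.executemany(stmt, data)

n_rows = con.execute("SELECT COUNT(*) FROM test").fetchone()[0]
print(n_rows)
```

Note that the with block only manages the transaction; it does not close the connection.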

Most Python SQL drivers (PyODBC, psycopg2, MySQLdb, pymssql, etc.) return a list of tuples when selecting data from a table:

cursor = con.execute('select * from test')
rows = cursor.fetchall()
rows

[('Atlanta', 'Georgia', 1.25, 6),
 ('Tallahassee', 'Florida', 2.6, 3),
 ('Sacramento', 'California', 1.7, 5)]

We can pass the list of tuples to the DataFrame constructor, but we also need the column names, which are contained in the cursor's description attribute:

cursor.description

(('a', None, None, None, None, None, None),
 ('b', None, None, None, None, None, None),
 ('c', None, None, None, None, None, None),
 ('d', None, None, None, None, None, None))

pd.DataFrame(rows, columns=[x[0] for x in cursor.description])

| | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | Atlanta | Georgia | 1.25 | 6 |
| 1 | Tallahassee | Florida | 2.60 | 3 |
| 2 | Sacramento | California | 1.70 | 5 |
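
The fetch-then-label steps above can be wrapped in a small function. This is only a sketch: query_to_frame is a hypothetical helper name, and the in-memory database with a single row is an assumption made so the example is self-contained:

```python
import sqlite3
import pandas as pd


def query_to_frame(con, sql, params=()):
    """Hypothetical helper: run a query and return the result as a DataFrame."""
    cursor = con.execute(sql, params)
    rows = cursor.fetchall()
    # The first item of each description entry is the column name
    columns = [col[0] for col in cursor.description]
    return pd.DataFrame(rows, columns=columns)


con = sqlite3.connect(':memory:')
con.execute("CREATE TABLE test (a VARCHAR(20), b VARCHAR(20), c REAL, d INTEGER)")
con.execute("INSERT INTO test VALUES ('Atlanta', 'Georgia', 1.25, 6)")

frame = query_to_frame(con, "SELECT * FROM test WHERE d > ?", (5,))
print(frame)
```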

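This is a fair amount of manual data wrangling that we would rather not repeat every time we query the database. SQLAlchemy is a popular Python SQL toolkit that abstracts away many of the common differences between SQL databases. pandas has a read_sql function that lets us read data from a SQLAlchemy connection. Here we connect to the same SQLite database with SQLAlchemy and read data from the table created earlier: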

import sqlalchemy as sqla

db = sqla.create_engine('sqlite:///../examples/mydata.sqlite')
pd.read_sql('select * from test', db)

| | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | Atlanta | Georgia | 1.25 | 6 |
| 1 | Tallahassee | Florida | 2.60 | 3 |
| 2 | Sacramento | California | 1.70 | 5 |
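
For SQLite specifically, pandas also accepts a plain sqlite3 connection, so SQLAlchemy is not strictly required. The sketch below assumes an in-memory database and rebuilds the same three rows with DataFrame.to_sql, then reads a filtered subset back with a parameterized query:

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')

cities = pd.DataFrame({'a': ['Atlanta', 'Tallahassee', 'Sacramento'],
                       'b': ['Georgia', 'Florida', 'California'],
                       'c': [1.25, 2.6, 1.7],
                       'd': [6, 3, 5]})

# Write the frame to a table named 'test' without the index column
cities.to_sql('test', con, index=False)

# '?' is sqlite3's placeholder style; params supplies the values
result = pd.read_sql('select * from test where d > ?', con, params=(4,))
print(result)
```

Passing values through params rather than formatting them into the SQL string lets the driver handle quoting.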