SQLAlchemy 奇怪的查数据库行为

Python 小记 2020-10-16 5564 字 1018 浏览 点赞

起步

进行数十亿数据压测时,测试环境发生了一个很奇怪的现象:在对 SQLAlchemy 查出的结果做遍历操作时,随着集合越大,遍历的时间跟着变长,是肉眼可见的变长。

根据现象,一位腾讯的程序员向我提出了疑问:会不会遍历过程中 SQLAlchemy 又去查询了数据库?

环境准备

-- DDL
CREATE TABLE `student` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(10) NOT NULL,
  `age` int(4) NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

-- DML
INSERT INTO `test`.`student`(`name`, `age`) VALUES ('小庭', 16), ('小英', 17), ('小慧', 10);

奇怪的查数据库行为

在使用 SQLAlchemy 时,为了更方便的使用,我自己对 session 做了一层封装:

from contextlib import contextmanager

session = sessionmaker(create_engine("mysql+pymysql://root:123456@192.168.111.136/test"))

@contextmanager
def MSession():
    """ session 管理器 """
    sess = session()
    yield sess
    sess.commit()

这样做的好处是,我不用每一次都显式地调用 session() 和 commit()。代码组织起来也好看许多。完整代码如下:

from contextlib import contextmanager

from sqlalchemy import Column, String, Integer, create_engine
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext import declarative

Base = declarative.declarative_base()

class Student(Base):
    __tablename__ = "student"
    id = Column(Integer, primary_key=True, autoincrement=True)
    name = Column(String, nullable=False)
    age = Column(Integer, nullable=False)

session = sessionmaker(create_engine("mysql+pymysql://root:123456@192.168.111.136/test"))

@contextmanager
def MSession():
    """ session 管理器 """
    sess = session()
    yield sess
    sess.commit()

def main():
    with MSession() as sess:
        # 查询所有学生信息
        students = sess.query(Student).all()
    # 遍历查询结果
    for stu in students:
        print(f"查询结果: {stu.id} {stu.name} {stu.age}")

if __name__ == "__main__":
    main()

如果就这样执行上述代码,可以看到输出合理、一切正常。但要是打开 SQLAlchemy 的 echo 行为,就会发现事情并没有那么简单。

设置 echo=True:

session = sessionmaker(
    create_engine("mysql+pymysql://root:123456@192.168.111.136/test", echo=True)
)

此时再运行代码会看到如下输出(输出的一部分信息):

1. 2020-10-16 17:12:04,147 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
2. 22020-10-16 17:12:04,147 INFO sqlalchemy.engine.base.Engine SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age 
FROM student
3. 2020-10-16 17:12:04,148 INFO sqlalchemy.engine.base.Engine {}
4. 2020-10-16 17:12:04,158 INFO sqlalchemy.engine.base.Engine COMMIT
5. 2020-10-16 17:12:04,171 INFO sqlalchemy.engine.base.Engine BEGIN (implicit)
6. 2020-10-16 17:12:04,173 INFO sqlalchemy.engine.base.Engine SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age 
FROM student 
WHERE student.id = %(param_1)s
7. 2020-10-16 17:12:04,173 INFO sqlalchemy.engine.base.Engine {'param_1': 1}
查询结果: 1 小庭 16
8. 2020-10-16 17:12:04,183 INFO sqlalchemy.engine.base.Engine SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age 
FROM student 
WHERE student.id = %(param_1)s
9. 2020-10-16 17:12:04,183 INFO sqlalchemy.engine.base.Engine {'param_1': 2}
查询结果: 2 小英 17
10. 2020-10-16 17:12:04,195 INFO sqlalchemy.engine.base.Engine SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age 
FROM student 
WHERE student.id = %(param_1)s
11. 2020-10-16 17:12:04,195 INFO sqlalchemy.engine.base.Engine {'param_1': 3}
查询结果: 3 小慧 10

程序的执行顺序对应到日志从上到下的输出顺序。可以看到,最开始 sqlalchemy 确实执行了查询语句:

SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age FROM student

对应到代码的:

with MSession() as sess:
    # 查询所有学生信息
    students = sess.query(Student).all()

可是后面的行为就很奇怪了,每 print 一次查询结果之前,就会有一次数据库查询行为:

6. 2020-10-16 17:12:04,173 INFO sqlalchemy.engine.base.Engine SELECT student.id AS student_id, student.name AS student_name, student.age AS student_age 
FROM student 
WHERE student.id = %(param_1)s
7. 2020-10-16 17:12:04,173 INFO sqlalchemy.engine.base.Engine {'param_1': 1}
查询结果: 1 小庭 16
...

假设 students 是一个包含 1w 个元素的集合,那么这里就会有 1w 次查询,对应 1w 次 io 操作。且不说这样查询多此一举,重要的是大量 io 操作会拖慢程序的运行速度。

expire_on_commit

起初我是在网上找原因,无果。继而到 SQLAlchemy 源码里来回逛几圈,终于在 Session 类的注释中找到了造成上述现象的原因。

# SQLAlchemy 源码,非关键内容省略
class Session(_SessionClassMethods):
    """
    Manages persistence operations for ORM-mapped objects.
    The Session's usage paradigm is described at :doc:`/orm/session`.
    """
    ...
    def __init__(
        self,
        bind=None,
        autoflush=True,
        expire_on_commit=True,
        ...
    ):
    """
        :param expire_on_commit:  Defaults to ``True``. When ``True``, all
           instances will be fully expired after each :meth:`~.commit`,
           so that all attribute/object access subsequent to a completed
           transaction will load from the most recent database state.
    """

大概意思就是,当 expire_on_commit=True 时,commit 之后所有实例都会过期,之后再访问这些过期实例的属性时,SQLAlchemy 会重新去数据库加载实例对应的数据记录。

如何解决

知道原因以后,解决方案就有了。目前我想到了两种:

  1. 把用到实例的逻辑移到 commit 操作之前,如:
with MSession() as sess:
    # 查询所有学生信息
    students = sess.query(Student).all()
    # 遍历查询结果
    for stu in students:
        print(f"查询结果: {stu.id} {stu.name} {stu.age}")

但这样的缺点是,当你需要处理实例的逻辑很耗时时,意味着会产生一个长事务。无论何种情况下,长事务都应该尽量避免。

  1. expire_on_commit 置为 False:
@contextmanager
def MSession():
    """ session 管理器 """
    sess = session(expire_on_commit=False)  # 关闭 expire_on_commit 行为
    yield sess
    sess.commit()


本文由 Guan 创作,采用 知识共享署名 3.0,可自由转载、引用,但需署名作者且注明文章出处。

还不快抢沙发

添加新评论