My goal: find Q such that Q = inv(chol(S)) * X, where chol(S) is the lower Cholesky factor of S.
Of course, a simple solution is

    cholS = scipy.linalg.cholesky(S, lower=True)
    scipy.linalg.solve(cholS, X)
My problem: this solution is noticeably slower (about 2x) in Python than the same operation in Matlab. Here are some timing experiments:

    timeit np.linalg.solve(cholS, X)
    1 loops, best of 3: 1.63 s per loop

    timeit scipy.linalg.solve_triangular(cholS, X, lower=True)
    1 loops, best of 3: 2.19 s per loop

    timeit scipy.linalg.solve(cholS, X)
    1 loops, best of 3: 2.81 s per loop

    [matlab]
    cholS \ X
    0.675 s

    [matlab using only one thread via -singleCompThread]
    cholS \ X
    1.26 s
Basically, I'd like to know: (1) can I reach Matlab-like speeds in Python? and (2) why is the scipy version so slow?
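The solver should be able to take advantage of the fact that chol(S) is triangular. However, numpy.linalg.solve() is faster than scipy.linalg.solve_triangular(), even though the numpy call doesn't use the triangular structure at all. What gives? The Matlab solver seems to auto-detect when my matrix is triangular, but Python cannot.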
I'd be happy to use a custom call to BLAS/LAPACK routines for solving triangular linear systems, but I really don't want to write that code myself.
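For what it's worth, SciPy exposes the LAPACK triangular solver directly, so a custom call does not have to be written from scratch. The sketch below is only an illustration under some assumptions: it uses the `dtrtrs` wrapper from `scipy.linalg.lapack` (available in reasonably recent SciPy releases, not necessarily the version mentioned here), and the test matrices `S` and `X` are made up for the example.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.linalg.lapack import dtrtrs

# Made-up test data: a well-conditioned symmetric positive definite S.
rng = np.random.RandomState(0)
m, n = 50, 8
B = rng.randn(m, m)
S = B.dot(B.T) + m * np.eye(m)
X = rng.randn(m, n)

cholS = cholesky(S, lower=True)

# Direct LAPACK call: solves cholS * Q = X using only the lower triangle.
# info == 0 signals success; info > 0 would mean a zero on the diagonal.
Q_lapack, info = dtrtrs(cholS, X, lower=1)

# Reference result from the high-level wrapper.
Q_ref = solve_triangular(cholS, X, lower=True)

print(info, np.allclose(Q_lapack, Q_ref))
```

The high-level `solve_triangular` wraps this same family of routines, so the direct call mainly saves Python-side overhead such as input validation.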
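For reference, I'm using scipy version 11.0 and the Enthought Python distribution (which uses Intel's MKL library for vectorization), so I think I should be able to reach Matlab-like speeds.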
Solution
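I found this thread after stumbling on some discrepancies between numpy.linalg.solve and scipy.linalg.solve (and scipy's lu_solve, etc.). I don't have Enthought's MKL-based Numpy/Scipy, but I hope my findings can help you in some way.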
With the prebuilt binaries for Numpy and Scipy (32-bit, running on Windows 7):
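> When solving for a vector X (i.e., X is 160 by 1), I see a significant difference between numpy.linalg.solve and scipy.linalg.solve. Scipy's runtime is 1.23x numpy's, which I consider substantial.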
> However, most of the difference appears to come from scipy's solve checking for invalid entries. When check_finite=False is passed to scipy.linalg.solve, scipy's solve runtime is 1.02x numpy's.
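> Scipy's solve with destructive updates, i.e., with overwrite_a=True, overwrite_b=True, is slightly faster than numpy's solve (which is non-destructive). Numpy's solve runtime is 1.021x that of destructive scipy.linalg.solve. Scipy with just check_finite=False has a runtime 1.04x the destructive case. In summary, destructive scipy.linalg.solve is very slightly faster than either of these cases.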
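> The above are for a vector X. If I make X a wide array, specifically 160 by 10000, scipy.linalg.solve with check_finite=False is essentially as fast as with check_finite=False, overwrite_a=True, overwrite_b=True. Scipy's solve without any special keywords has a runtime 1.09x this "unsafe" (check_finite=False) call. Numpy's solve has a runtime 1.03x scipy's fastest for this array X case.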
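> scipy.linalg.solve_triangular delivers significant speedups in both of these cases, but you have to turn off input checking, i.e., pass in check_finite=False. The runtime of the fastest solve was 5.68x and 1.76x that of solve_triangular with check_finite=False, for vector and array X respectively. solve_triangular with destructive computation (overwrite_b=True) gives no speedup on top of check_finite=False (and actually hurts a bit for the array X case).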
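> I, ignoramus, was previously unaware of solve_triangular and was using scipy.linalg.lu_solve as a triangular solver, i.e., instead of solve_triangular(cholS, X) doing lu_solve((cholS, numpy.arange(160)), X) (both produce the same answer). But I discovered that lu_solve used in this way has a runtime 1.07x that of unsafe solve_triangular for the vector X case, while its runtime was 1.76x for the array X case. I'm not sure why lu_solve is so much slower for array X than for vector X, but the lesson is to use solve_triangular (without infinite checks).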
> Copying the data to Fortran format didn't seem to matter at all. Neither does converting to numpy.matrix.
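To make the variants in the list above concrete, here is a minimal sketch (not the benchmark itself) that calls each solver once on the same upper-triangular system and checks that the answers agree. The strong diagonal added to the test matrix is my own assumption to keep it well conditioned, so that the comparison is numerically meaningful. Note that the lu_solve trick with identity pivots relies on the matrix being upper triangular, since lu_solve treats the strictly lower part as a unit-diagonal L factor.

```python
import numpy as np
import scipy.linalg as sla

rng = np.random.RandomState(69)
m, n = 160, 5

# Upper-triangular test matrix; the added diagonal (an assumption for
# this sketch) keeps the system well conditioned.
A = np.triu(rng.randn(m, m)) + m * np.eye(m)
b = rng.randn(m, n)

# General solver, skipping the finite-value check on the inputs.
x_solve = sla.solve(A, b, check_finite=False)

# Triangular solver (lower=False is the default), also unchecked.
x_tri = sla.solve_triangular(A, b, check_finite=False)

# lu_solve with identity pivots: piv[i] == i means "no row swap", and the
# zero strictly-lower part of A acts as L = I, so U = A.
piv = np.arange(m)
x_lu = sla.lu_solve((A, piv), b, check_finite=False)

print(np.allclose(x_solve, x_tri), np.allclose(x_lu, x_tri))
```

With check_finite=False, NaN or infinite entries in the inputs are passed straight to LAPACK, so this setting is only appropriate when the inputs are already known to be finite.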
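I might as well compare my non-MKL Python libraries against single-threaded (maxNumCompThreads=1) Matlab 2013a. The fastest Python implementations above had a 4.5x longer runtime for the vector X case and a 6.3x longer runtime for the fat matrix X case.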
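However, here's the Python script I used to benchmark these; perhaps someone with MKL-accelerated Numpy/Scipy can post their numbers. Note that I just comment out the line n = 10000 to disable the fat matrix X case and run the n = 1 vector case. (Sorry.)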
    import scipy.linalg as sla
    import numpy.linalg as nla
    from numpy.random import RandomState
    from timeit import timeit
    import numpy as np

    RNG = RandomState(69)

    m = 160
    n = 1
    #n = 10000
    Ac = RNG.randn(m, m)
    if 1:
        Ac = np.triu(Ac)  # make the system upper triangular

    bc = RNG.randn(m, n)
    Af = Ac.copy("F")  # Fortran-ordered copies
    bf = bc.copy("F")

    if 0:  # Save to Matlab format
        import scipy.io as io
        io.savemat("b_%d.mat" % (n,), dict(A=Ac, b=bc))
        import sys
        sys.exit(0)

    def lapper(fn, source, **kwargs):
        # Copy the inputs on every lap so destructive solvers see fresh data.
        Alocal = source[0].copy()
        blocal = source[1].copy()
        fn(Alocal, blocal, **kwargs)

    laps = (1000 if n <= 1 else 100)

    def printer(t, s=''):
        # Print and return the average time per lap.
        print(("%g seconds, %d laps, " % (t / float(laps), laps)) + s)
        return t / float(laps)

    t = []
    print("C")
    t.append(printer(timeit(lambda: lapper(sla.solve, (Ac, bc)), number=laps), "scipy.solve"))
    t.append(printer(timeit(lambda: lapper(sla.solve, (Ac, bc), check_finite=False), number=laps), "scipy.solve, infinite-ok"))
    t.append(printer(timeit(lambda: lapper(nla.solve, (Ac, bc)), number=laps), "numpy.solve"))

    #print("F")  # Doesn't seem to matter
    #printer(timeit(lambda: lapper(sla.solve, (Af, bf)), number=laps))
    #printer(timeit(lambda: lapper(nla.solve, (Af, bf)), number=laps))

    print("sla with tweaks")
    t.append(printer(timeit(lambda: lapper(sla.solve, (Ac, bc), overwrite_a=True, overwrite_b=True, check_finite=False), number=laps), "scipy.solve destructive"))

    print("Tri")
    t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac, bc)), number=laps), "scipy.solve_triangular"))
    t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac, bc), check_finite=False), number=laps), "scipy.solve_triangular, inf-ok"))
    t.append(printer(timeit(lambda: lapper(sla.solve_triangular, (Ac, bc), overwrite_b=True, check_finite=False), number=laps), "scipy.solve_triangular destructive"))

    print("LU")
    piv = np.arange(m)  # identity pivots: no row swaps
    t.append(printer(timeit(lambda: lapper(
        lambda X, b: sla.lu_solve((X, piv), b, check_finite=False),
        (Ac, bc)),
        number=laps), "LU"))

    print("all times:")
    print(t)
Output of the above script for the vector case, n = 1:

    C
    0.000739405 seconds, 1000 laps, scipy.solve
    0.000624746 seconds, 1000 laps, scipy.solve, infinite-ok
    0.000590003 seconds, 1000 laps, numpy.solve
    sla with tweaks
    0.000608365 seconds, 1000 laps, scipy.solve destructive
    Tri
    0.000208711 seconds, 1000 laps, scipy.solve_triangular
    9.38371e-05 seconds, 1000 laps, scipy.solve_triangular, inf-ok
    9.37682e-05 seconds, 1000 laps, scipy.solve_triangular destructive
    LU
    0.000100215 seconds, 1000 laps, LU
    all times:
    [0.0007394047886284343, 0.00062474593940593, 0.0005900030818282472, 0.0006083650710913095, 0.00020871054023307778, 9.383710445114923e-05, 9.37682389063692e-05, 0.00010021534750467032]
Output of the above script for the matrix case, n = 10000:

    C
    0.118985 seconds, 100 laps, scipy.solve
    0.113687 seconds, 100 laps, scipy.solve, infinite-ok
    0.115569 seconds, 100 laps, numpy.solve
    sla with tweaks
    0.113122 seconds, 100 laps, scipy.solve destructive
    Tri
    0.0725959 seconds, 100 laps, scipy.solve_triangular
    0.0634396 seconds, 100 laps, scipy.solve_triangular, inf-ok
    0.0638423 seconds, 100 laps, scipy.solve_triangular destructive
    LU
    0.1115 seconds, 100 laps, LU
    all times:
    [0.11898513112988955, 0.11368747217793944, 0.11556863916356903, 0.11312182352918797, 0.07259593807427585, 0.0634396208970783, 0.06384230931663318, 0.11150022257648459]
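Note that the above Python script can save its arrays as Matlab .MAT data files. This is currently disabled (`if 0`, sorry), but if enabled, it lets you test Matlab's speed on the exact same data. Here's a timing script for Matlab:

    clear
    q = load('b_10000.mat');
    A = q.A;
    b = q.b;
    clear q
    matrix_time = timeit(@() A\b)

    q = load('b_1.mat');
    A = q.A;
    b = q.b;
    clear q
    vector_time = timeit(@() A\b)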
You'll need the timeit function from the Mathworks File Exchange: http://www.mathworks.com/matlabcentral/fileexchange/18798-timeit-benchmarking-function. This produces the following output:

    matrix_time =
        0.0099989
    vector_time =
       2.2487e-05
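The upshot of this empirical analysis is that, in Python at least, you shouldn't use numpy's or scipy's solve when you have a triangular system; just use scipy.linalg.solve_triangular with at least the check_finite=False keyword argument for fast and non-destructive solutions.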
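Applied to the original question, that recommendation might look like the following sketch. The helper name `chol_whiten` and the test data are hypothetical, introduced only for illustration, and the check_finite=False arguments assume the caller already knows the inputs contain no NaN or infinite values.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def chol_whiten(S, X):
    """Return Q = inv(chol(S)) * X using a triangular solve.

    Hypothetical helper for illustration; S must be symmetric positive
    definite, and both inputs are assumed finite (no checks are made).
    """
    cholS = cholesky(S, lower=True, check_finite=False)
    # Non-destructive: overwrite_b is left at its default of False.
    return solve_triangular(cholS, X, lower=True, check_finite=False)

# Made-up, well-conditioned test data.
rng = np.random.RandomState(1)
m, n = 40, 6
B = rng.randn(m, m)
S = B.dot(B.T) + m * np.eye(m)
X = rng.randn(m, n)

Q = chol_whiten(S, X)

# Sanity check: Q'Q should equal X' inv(S) X.
print(np.allclose(Q.T.dot(Q), X.T.dot(np.linalg.solve(S, X))))
```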