Go implementation of BLAS (Basic Linear Algebra Subprograms)

Every function is implemented in generic Go and, where justified, is also optimized for AMD64 (using SSE2 instructions).

The AMD64 implementation uses MOVUPS/MOVUPD instructions when all strides are equal to 1, so it runs fast on Nehalem, Sandy Bridge and newer processors but relatively slowly on older processors.
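To illustrate what "all strides equal to 1" means, here is a minimal sketch of a strided dot product in generic Go. The name `ddotGeneric` and its signature are hypothetical and chosen only for this example; they are not taken from this package. The sketch assumes positive strides and slices long enough for `n` elements.

```go
package main

import "fmt"

// ddotGeneric is a hypothetical, illustrative strided dot product.
// It multiplies n elements of x (taken every incX positions) with
// n elements of y (taken every incY positions) and sums the products.
// Assumption: incX > 0, incY > 0, and both slices are long enough.
func ddotGeneric(n int, x []float64, incX int, y []float64, incY int) float64 {
	var sum float64
	if incX == 1 && incY == 1 {
		// Unit-stride path: the elements are contiguous in memory.
		// This is the case an SSE2 kernel can serve with unaligned
		// packed loads such as MOVUPD.
		x, y = x[:n], y[:n]
		for i, v := range x {
			sum += v * y[i]
		}
		return sum
	}
	// General path: step through each slice by its own stride.
	for i, xi, yi := 0, 0, 0; i < n; i, xi, yi = i+1, xi+incX, yi+incY {
		sum += x[xi] * y[yi]
	}
	return sum
}

func main() {
	x := []float64{1, 2, 3, 4}
	y := []float64{4, 3, 2, 1}
	fmt.Println(ddotGeneric(4, x, 1, y, 1)) // contiguous: 1*4 + 2*3 + 3*2 + 4*1 = 20
	fmt.Println(ddotGeneric(2, x, 2, y, 2)) // every second element: 1*4 + 3*2 = 10
}
```

Splitting the unit-stride case from the general case keeps the hot loop free of index arithmetic, which is also what makes it a natural candidate for a hand-written assembly kernel.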

Every implemented function has its own unit test and benchmark.

Implemented functions

Level 1

Sdsdot, Sdot, Ddot, Snrm2, Dnrm2, Sasum, Dasum, Isamax, Idamax, Sswap, Dswap, Scopy, Dcopy, Saxpy, Daxpy, Sscal, Dscal, Srotg, Drotg, Srot, Drot
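For readers unfamiliar with the BLAS naming scheme, the sketch below shows what three of the Level 1 routines compute, following the standard BLAS definitions. The lower-case function names and their signatures are assumptions made for this example and may differ from this package's actual API. The index returned by `idamax` is 0-based here, and `dnrm2` uses the naive formula without the scaling that guards against overflow.

```go
package main

import (
	"fmt"
	"math"
)

// daxpy computes y = alpha*x + y over n strided elements (illustrative only).
func daxpy(n int, alpha float64, x []float64, incX int, y []float64, incY int) {
	for i, xi, yi := 0, 0, 0; i < n; i, xi, yi = i+1, xi+incX, yi+incY {
		y[yi] += alpha * x[xi]
	}
}

// dnrm2 returns the Euclidean norm of n strided elements of x.
// Naive version: it does not rescale, so very large values can overflow.
func dnrm2(n int, x []float64, incX int) float64 {
	var sum float64
	for i, xi := 0, 0; i < n; i, xi = i+1, xi+incX {
		sum += x[xi] * x[xi]
	}
	return math.Sqrt(sum)
}

// idamax returns the 0-based position of the first element with the
// largest absolute value, or -1 if n <= 0.
func idamax(n int, x []float64, incX int) int {
	if n <= 0 {
		return -1
	}
	best, bestAbs := 0, math.Abs(x[0])
	for i, xi := 1, incX; i < n; i, xi = i+1, xi+incX {
		if a := math.Abs(x[xi]); a > bestAbs {
			best, bestAbs = i, a
		}
	}
	return best
}

func main() {
	x := []float64{3, -4}
	y := []float64{1, 1}
	daxpy(2, 2, x, 1, y, 1)
	fmt.Println(y)               // [7 -7]
	fmt.Println(dnrm2(2, x, 1))  // 5
	fmt.Println(idamax(2, x, 1)) // 1
}
```

The leading letter of each routine name gives the precision: `S` for float32, `D` for float64, with `Isamax`/`Idamax` returning an index.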

Level 2

not implemented

Level 3

not implemented

#### Example benchmarks

| Function | Generic Go | Optimized for AMD64 |
|----------|------------|---------------------|
| Ddot     | 2825 ns/op | 895 ns/op           |
| Dnrm2    | 2787 ns/op | 597 ns/op           |
| Dasum    | 3145 ns/op | 560 ns/op           |
| Sdsdot   | 3133 ns/op | 1733 ns/op          |
| Sdot     | 2832 ns/op | 508 ns/op           |
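Figures in ns/op are the unit reported by Go's benchmark harness. As a rough sketch of how such a number can be obtained, the program below times a plain dot product with `testing.Benchmark` from the standard library. The function `ddotRef` and the vector length of 1000 are assumptions for this example; the vector length behind the table above is not stated, so the result is not directly comparable.

```go
package main

import (
	"fmt"
	"testing"
)

// ddotRef is a hypothetical reference dot product used only for timing.
// It assumes len(y) >= len(x).
func ddotRef(x, y []float64) float64 {
	var sum float64
	for i, v := range x {
		sum += v * y[i]
	}
	return sum
}

// sink keeps the result alive so the compiler cannot discard the call.
var sink float64

func main() {
	// Assumed problem size; the README does not state the one it used.
	x := make([]float64, 1000)
	y := make([]float64, 1000)
	for i := range x {
		x[i] = float64(i)
		y[i] = 1
	}

	// testing.Benchmark picks b.N so that the loop runs long enough
	// to give a stable per-operation time.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = ddotRef(x, y)
		}
	})
	fmt.Printf("ddotRef %d ns/op\n", res.NsPerOp())
}
```

Inside a package, the same loop would normally live in a `BenchmarkXxx(b *testing.B)` function in a `_test.go` file and be run with `go test -bench .`.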