文件 distances.h

namespace faiss

实现具有多种变体的 k 均值聚类。

此源代码根据 MIT 许可证获得许可，该许可证位于此源代码树的根目录中的 LICENSE 文件中。

IDSelector 旨在定义要处理的向量子集（用于删除或作为搜索的子集）

PQ4 SIMD 打包和累加函数

基本内核使用 bbs = nb * 2 * 16 向量累加 nq 查询向量，并为此生成一个输出矩阵。对于 nq * nb <= 4 很有趣，否则寄存器溢出变得太大。

这些函数的实现分布在 3 个 cpp 文件中，以减少并行编译时间。模板被显式实例化。

此文件包含用于计算距离的内核的回调。

在整个库中，向量以 float * 指针的形式提供。当批量处理（添加/搜索）多个向量时，大多数算法可以得到优化。在这种情况下，它们作为矩阵传入。当 n 个大小为 d 的向量作为 float * x 提供时，向量 i 的分量 j 是

x[ i * d + j ]

其中 0 <= i < n 且 0 <= j < d。换句话说，矩阵始终是紧凑的。指定矩阵大小时，我们称其为 n*d 矩阵，这意味着行优先存储。

I/O 函数可以读取/写入到文件名、文件句柄或抽象介质的对象。

读取函数返回应该使用 delete 释放的对象。这些对象中的所有引用都归该对象所有。

倒排列表的定义 + 一些实现接口的常见类。

由于 IVF（倒排文件）索引对于大规模用例非常有用，因此我们将与它们相关的一些函数分组到这个小库中。大多数函数都适用于 IndexIVF 和嵌入在 IndexPreTransform 中的 IndexIVF。

在此文件中，实现了 L2 和内积之外的额外度量标准

实现了一些神经网络层，主要用于支持 QINCo

定义一些将转换应用于一组向量的对象。这些通常是预处理步骤。

函数

float fvec_L2sqr(const float *x, const float *y, size_t d): 两个向量之间的平方 L2 距离。

float fvec_inner_product(const float *x, const float *y, size_t d): 内积

float fvec_L1(const float *x, const float *y, size_t d): L1距离。

float fvec_Linf(const float *x, const float *y, size_t d): 无穷范数距离。

void fvec_inner_product_batch_4(const float *x, const float *y0, const float *y1, const float *y2, const float *y3, const size_t d, float &dis0, float &dis1, float &dis2, float &dis3): 内积的特殊版本，用于计算x和yi之间的4个距离，面向性能优化。

void fvec_L2sqr_batch_4(const float *x, const float *y0, const float *y1, const float *y2, const float *y3, const size_t d, float &dis0, float &dis1, float &dis2, float &dis3): L2sqr的特殊版本，用于计算x和yi之间的4个距离，面向性能优化。

void pairwise_L2sqr(int64_t d, int64_t nq, const float *xq, int64_t nb, const float *xb, float *dis, int64_t ldq = -1, int64_t ldb = -1, int64_t ldd = -1)

计算向量集合之间的两两距离

参数:

d – 向量的维度
nq – 查询向量的数量
nb – 数据库向量的数量
xq – 查询向量 (大小 nq * d)
xb – 数据库向量 (大小 nb * d)
dis – 输出距离 (大小 nq * nb)
ldq, ldb, ldd – 矩阵的步幅

void fvec_inner_products_ny(float *ip, const float *x, const float *y, size_t d, size_t ny)

void fvec_L2sqr_ny(float *dis, const float *x, const float *y, size_t d, size_t ny)

void fvec_L2sqr_ny_transposed(float *dis, const float *x, const float *y, const float *y_sqlen, size_t d, size_t d_offset, size_t ny)

size_t fvec_L2sqr_ny_nearest(float *distances_tmp_buffer, const float *x, const float *y, size_t d, size_t ny)

size_t fvec_L2sqr_ny_nearest_y_transposed(float *distances_tmp_buffer, const float *x, const float *y, const float *y_sqlen, size_t d, size_t d_offset, size_t ny)

float fvec_norm_L2sqr(const float *x, size_t d): 向量的平方范数

void fvec_norms_L2(float *norms, const float *x, size_t d, size_t nx)

计算一组向量的L2范数

参数:

norms – 输出范数，大小为 nx
x – 向量集合，大小为 nx * d

void fvec_norms_L2sqr(float *norms, const float *x, size_t d, size_t nx): 与 fvec_norms_L2 相同，但计算的是平方范数

void fvec_renorm_L2(size_t d, size_t nx, float *x)

void inner_product_to_L2sqr(float *dis, const float *nr1, const float *nr2, size_t n1, size_t n2)

void fvec_add(size_t d, const float *a, const float *b, float *c)

计算向量的 c := a + b

c 和 a 可以重叠，c 和 b 也可以重叠

参数:

a – 大小为 d
b – 大小为 d
c – 大小为 d

void fvec_add(size_t d, const float *a, float b, float *c)

计算 c := a + b，其中 a, c 为向量，b 为标量

c 和 a 可以重叠

参数:

a – 大小为 d
c – 大小为 d

void fvec_sub(size_t d, const float *a, const float *b, float *c)

计算向量的 c := a - b

c 和 a 可以重叠，c 和 b 也可以重叠

参数:

a – 大小为 d
b – 大小为 d
c – 大小为 d

void fvec_inner_products_by_idx(float *ip, const float *x, const float *y, const int64_t *ids, size_t d, size_t nx, size_t ny)

计算 x 和由 ids 定义的 ny 个向量的子集 y 之间的内积

ip(i, j) = inner_product(x(i, :), y(ids(i, j), :))

参数:

ip – 输出数组，大小为 nx * ny
x – 第一个向量，大小为 nx * d
y – 第二个向量，大小为 (max(ids) + 1) * d
ids – 从 y 中采样的 id，大小为 nx * ny

void fvec_L2sqr_by_idx(float *dis, const float *x, const float *y, const int64_t *ids, size_t d, size_t nx, size_t ny)

计算 x 和由 ids 定义的 ny 个向量的子集 y 之间的平方 L2 距离

dis(i, j) = inner_product(x(i, :), y(ids(i, j), :))

参数:

dis – 输出数组，大小为 nx * ny
x – 第一个向量，大小为 nx * d
y – 第二个向量，大小为 (max(ids) + 1) * d
ids – 从 y 中采样的 id，大小为 nx * ny

void pairwise_indexed_L2sqr(size_t d, size_t n, const float *x, const int64_t *ix, const float *y, const int64_t *iy, float *dis)

计算 dis[j] = L2sqr(x[ix[j]], y[iy[j]]) 对于所有 j=0..n-1

参数:

x – 大小为 (max(ix) + 1, d)
y – 大小为 (max(iy) + 1, d)
ix – 大小为 n
iy – 大小为 n
dis – 大小为 n

void pairwise_indexed_inner_product(size_t d, size_t n, const float *x, const int64_t *ix, const float *y, const int64_t *iy, float *dis)

计算 dis[j] = inner_product(x[ix[j]], y[iy[j]]) 对于所有 j=0..n-1

参数:

x – 大小为 (max(ix) + 1, d)
y – 大小为 (max(iy) + 1, d)
ix – 大小为 n
iy – 大小为 n
dis – 大小为 n

void knn_inner_product(const float *x, const float *y, size_t d, size_t nx, size_t ny, float_minheap_array_t *res, const IDSelector *sel = nullptr)

返回ny个向量y中，对于每个nx个向量x，以内积最大值为标准，返回k个最近邻。

参数:

x – 查询向量，大小为 nx * d
y – 数据库向量，大小为 ny * d
res – 结果堆结构，也提供k。输出时已排序

void knn_inner_product(const float *x, const float *y, size_t d, size_t nx, size_t ny, size_t k, float *distances, int64_t *indexes, const IDSelector *sel = nullptr)

返回ny个向量y中，对于每个nx个向量x，以点积度量为标准，返回k个最近邻。

参数:

x – 查询向量，大小为 nx * d
y – 数据库向量，大小为 ny * d
distances – 输出距离，大小为 nq * k
indexes – 输出向量id，大小为 nq * k

void knn_L2sqr(const float *x, const float *y, size_t d, size_t nx, size_t ny, float_maxheap_array_t *res, const float *y_norm2 = nullptr, const IDSelector *sel = nullptr)

返回ny个向量y中，对于每个nx个向量x，以L2距离为标准，返回k个最近邻。

参数:

x – 查询向量，大小为 nx * d
y – 数据库向量，大小为 ny * d
res – 结果堆结构，也提供k。输出时已排序
y_norm2 – (可选) y向量的范数 (nullptr 或者大小为 ny)
sel – 在向量的这个子集中搜索

void knn_L2sqr(const float *x, const float *y, size_t d, size_t nx, size_t ny, size_t k, float *distances, int64_t *indexes, const float *y_norm2 = nullptr, const IDSelector *sel = nullptr)

返回ny个向量y中，对于每个nx个向量x，以L2距离为标准，返回k个最近邻。

参数:

x – 查询向量，大小为 nx * d
y – 数据库向量，大小为 ny * d
distances – 输出距离，大小为 nq * k
indexes – 输出向量id，大小为 nq * k
y_norm2 – (可选) y向量的范数 (nullptr 或者大小为 ny)
sel – 在向量的这个子集中搜索

void knn_inner_products_by_idx(const float *x, const float *y, const int64_t *subset, size_t d, size_t nx, size_t ny, size_t nsubset, size_t k, float *vals, int64_t *ids, int64_t ld_ids = -1)

在由 ID 索引的 ny 个向量集中，为 nx 个查询查找最大内积邻居。可能对重新排序预先选择的向量列表有用

参数:

x – 查询向量，大小为 nx * d
y – 数据库向量，大小为 (max(ids) + 1) * d
ids – 要考虑的数据库向量子集，大小为 (nx, nsubset)
res – 结果结构
ld_ids – ids 数组的步幅。-1：使用 nsubset，0：所有查询处理相同的子集

void knn_L2sqr_by_idx(const float *x, const float *y, const int64_t *subset, size_t d, size_t nx, size_t ny, size_t nsubset, size_t k, float *vals, int64_t *ids, int64_t ld_subset = -1)

在由 ID 索引的 ny 个向量集中，为 nx 个查询查找最近邻。可能对重新排序预先选择的向量列表有用

参数:

x – 查询向量，大小为 nx * d
y – 数据库向量，大小为 (max(ids) + 1) * d
subset – 要考虑的数据库向量子集，大小为 (nx, nsubset)
res – 结果结构
ld_subset – subset 数组的步幅。-1：使用 nsubset，0：所有查询处理相同的子集

void range_search_L2sqr(const float *x, const float *y, size_t d, size_t nx, size_t ny, float radius, RangeSearchResult *result, const IDSelector *sel = nullptr)

返回 nx 个向量 x 中每一个向量在 ny 个向量 y 中的 k 个最近邻，关于最大内积

参数:

x – 查询向量，大小为 nx * d
y – 数据库向量，大小为 ny * d
radius – x 向量周围的搜索半径
result – 结果结构

void range_search_inner_product(const float *x, const float *y, size_t d, size_t nx, size_t ny, float radius, RangeSearchResult *result, const IDSelector *sel = nullptr): 与 range_search_L2sqr 相同，用于内积相似度

void compute_PQ_dis_tables_dsub2(size_t d, size_t ksub, const float *centroids, size_t nx, const float *x, bool is_inner_product, float *dis_tables): PQ2 的专用函数

void fvec_madd(size_t n, const float *a, float bf, const float *b, float *c)

计算 c := a + bf * b，对于表 a、b 和 c

参数:

n – 表的大小
a – 大小 n
b – 大小 n
c – 结果表，大小 n

int fvec_madd_and_argmin(size_t n, const float *a, float bf, const float *b, float *c)

与 fvec_madd 相同，也返回结果表最小值的索引

返回值:: 表 c 的最小值的索引

变量

FAISS_API int distance_compute_blas_threshold

FAISS_API int distance_compute_blas_query_bs

FAISS_API int distance_compute_blas_database_bs