File utils.h

namespace faiss

实现具有多种变体的 k-means 聚类。

此源代码根据 MIT 许可证获得许可，该许可证位于此源树的根目录中的 LICENSE 文件中。

IDSelector 旨在定义要处理的向量子集（用于移除或作为搜索子集）

PQ4 SIMD 打包和累积函数

基本内核使用 bbs = nb * 2 * 16 向量累积 nq 查询向量，并为此生成一个输出矩阵。对于 nq * nb <= 4 来说很有意思，否则寄存器溢出会变得太大。

这些函数的实现分布在 3 个 cpp 文件中，以减少并行编译时间。模板被显式实例化。

此文件包含计算距离的内核的回调。

在整个库中，向量以 float * 指针的形式提供。当批量处理（添加/搜索）多个向量时，大多数算法都可以优化。在这种情况下，它们作为矩阵传递。当大小为 d 的 n 个向量作为 float * x 提供时，向量 i 的分量 j 为

x[ i * d + j ]

其中 0 <= i < n 且 0 <= j < d。换句话说，矩阵始终是紧凑的。在指定矩阵大小时，我们称它为 n*d 矩阵，这意味着行优先存储。

I/O 函数可以读取/写入到文件名、文件句柄或抽象介质的对象。

读取函数返回应使用 delete 释放的对象。这些对象中的所有引用都由该对象拥有。

倒排列表的定义 + 实现接口的一些常见类。

由于 IVF（倒排文件）索引对于大规模用例非常有用，因此我们将与它们相关的一些函数分组到这个小库中。大多数函数都适用于 IndexIVF 和嵌入在 IndexPreTransform 中的 IndexIVF。

此文件中是 L2 和内积之外的额外指标的实现

实现一些神经网络层，主要用于支持 QINCo

定义一些对一组向量应用变换的对象通常这些是预处理步骤。

函数

std::string get_compile_options(): 获取编译选项

std::string get_version()

double getmillisecs(): 自某个任意时期以来经过的毫秒数

size_t get_mem_usage_kb(): 以 kB 为单位获取当前 RSS 使用情况

uint64_t get_cycles()

void reflection(const float *u, float *x, size_t n, size_t d, size_t nu)

void matrix_qr(int m, int n, float *a)

计算 m > n 的 QR 分解的 Q

参数:: a – 大小为 n * m：输入矩阵和输出 Q

void ranklist_handle_ties(int k, int64_t *idx, const float *dis): 距离应该已经排序。对具有相同距离的索引进行排序

size_t ranklist_intersection_size(size_t k1, const int64_t *v1, size_t k2, const int64_t *v2): 计算 v1 和 v2 之间的公共元素的数量。算法 = 排序 + 二分查找，以避免重复计数副本

size_t merge_result_table_with(size_t n, size_t k, int64_t *I0, float *D0, const int64_t *I1, const float *D1, bool keep_min = true, int64_t translation = 0)

将一个结果表合并到另一个结果表中

参数:

I0, D0 – 第一个结果表，大小为 (n, k)
I1, D1 – 第二个结果表，大小为 (n, k)
keep_min – 如果为 true，则保留最小值，否则保留最大值
translation – 将此值添加到所有 I1 的索引

返回值:

从第二个表中获取的值的数量

double imbalance_factor(int n, int k, const int64_t *assign): 平衡的分配的 IF 为 1

double imbalance_factor(int k, const int *hist): 相同，以直方图作为输入

int ivec_hist(size_t n, const int *v, int vmax, int *hist): 计算 v 上的直方图

void bincode_hist(size_t n, size_t nbits, const uint8_t *codes, int *hist)

计算代码数组上位的直方图

参数:

codes – 大小(n, nbits / 8)
hist – 大小(nbits)：代码数组中 1 的数量

uint64_t ivec_checksum(size_t n, const int32_t *a): 计算表上的校验和。

uint64_t bvec_checksum(size_t n, const uint8_t *a): 计算表上的校验和。

void bvecs_checksum(size_t n, size_t d, const uint8_t *a, uint64_t *cs)

计算矩阵行的校验和

参数:

n – 行数
d – 每行的大小
a – 要处理的矩阵，大小为 n * d
cs – 输出校验和，大小为 n

const float *fvecs_maybe_subsample(size_t d, size_t *n, size_t nmax, const float *x, bool verbose = false, int64_t seed = 1234)

如果向量数量过多，则随机子采样一组向量

参数:

d – 向量的维度
n – 输入时：输入向量的数量，输出：输出向量的数量
nmax – 要保留的最大向量数
x – 输入数组，大小为 *n-by-d
seed – 用于采样的随机种子

返回值:

x 或使用 new [] 分配的具有 *n 向量的数组

void binary_to_real(size_t d, const uint8_t *x_in, float *x_out)

将二进制向量转换为 +1/-1 值的浮点向量。

参数:

d – 向量的维度（8 的倍数）
x_in – 输入二进制向量（大小为 d / 8 的 uint8_t 表）
x_out – 输出浮点向量（大小为 d 的浮点表）

void real_to_binary(size_t d, const float *x_in, uint8_t *x_out)

将浮点向量转换为二进制向量。大于 0 的分量转换为 1，其他分量转换为 0。

参数:

d – 向量的维度（8 的倍数）
x_in – 输入浮点向量（大小为 d 的浮点表）
x_out – 输出二进制向量（大小为 d / 8 的 uint8_t 表）

uint64_t hash_bytes(const uint8_t *bytes, int64_t n): 一个合理的哈希函数

bool check_openmp(): 是否遵守了 OpenMP 注释。

template<typename T> struct CombinerRangeKNN

#include <utils.h>

此类用于在 contrib.exhaustive_search.range_search_gpu 中组合范围和 knn 搜索结果

公共函数

inline CombinerRangeKNN(int64_t nq, size_t k, T r2, bool keep_max): 是否保留最大值而不是最小值。

void compute_sizes(int64_t *L_res): 大小 nq + 1

void write_result(T *D_res, int64_t *I_res): 阶段 2：调用者分配 D_res 和 I_res（大小为 L_res[nq]）阶段 3：填充 D_res 和 I_res

公共成员

int64_t nq

size_t k: 查询数量

T r2: knn 搜索部分的邻居数

bool keep_max: 范围搜索半径

const int64_t *I = nullptr: Knn 搜索结果。

const T *D = nullptr: 大小 nq * k

const bool *mask = nullptr

大小 nq * k

可选：范围搜索结果（如果 mask 为 NULL 则忽略）

const int64_t *lim_remain = nullptr: knn 结果有效的掩码，大小为 nq

const T *D_remain = nullptr: 大小 nrange + 1

const int64_t *I_remain = nullptr: 大小 lim_remain[nrange]

const int64_t *L_res = nullptr: 大小 lim_remain[nrange]

struct CodeSet

公共函数

inline explicit CodeSet(size_t d)

void insert(size_t n, const uint8_t *codes, bool *inserted)

公共成员

size_t d

std::set<std::vector<uint8_t>> s