File DeviceUtils.h

Defines

CUDA_VERIFY(X): Wrapper to test the return status of CUDA functions.

CUDA_TEST_ERROR(): Wrapper to synchronously probe for CUDA errors.

namespace faiss

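Implementation of k-means clustering with many variants.

This source code is licensed under the MIT license found in the LICENSE file in the root directory of this source tree.

IDSelector is intended to define a subset of vectors to handle (for removal or as a subset to search).

PQ4 SIMD packing and accumulation functions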

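The basic kernel accumulates nq query vectors against bbs = nb * 2 * 16 vectors and produces an output matrix for that. It is useful for nq * nb <= 4; beyond that, register spilling becomes too large.

The implementation of these functions is spread over 3 cpp files to reduce parallel compile time. Templates are instantiated explicitly.

This file contains callbacks for kernels that compute distances.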

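Throughout the library, vectors are provided as float * pointers. Most algorithms can be optimized when several vectors are processed (added/searched) together in a batch. In this case, they are passed in as a matrix. When n vectors of size d are provided as float * x, component j of vector i is

x[ i * d + j ]

where 0 <= i < n and 0 <= j < d. In other words, matrices are always compact. When specifying the size of the matrix, we call it an n*d matrix, which implies row-major storage.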
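The indexing rule above can be illustrated with a minimal sketch. The helper names `component` and `demoRowMajor` are hypothetical and are not part of Faiss; they only show how a flat, compact float array is addressed as an n*d row-major matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Faiss): returns component j of vector i
// in a compact, row-major n*d matrix stored as a flat float array.
inline float component(const float* x, std::size_t d, std::size_t i, std::size_t j) {
    return x[i * d + j];
}

// Builds n = 2 vectors of dimension d = 3 and reads one component back.
inline float demoRowMajor() {
    const std::size_t d = 3;
    // vector 0 = (1, 2, 3) and vector 1 = (4, 5, 6), stored back to back
    std::vector<float> x = {1.f, 2.f, 3.f, 4.f, 5.f, 6.f};
    return component(x.data(), d, 1, 2); // vector 1, component 2
}
```

Because the storage is compact, vector i always starts at offset i * d, so a pointer to a single vector can be obtained as x + i * d without any stride parameter.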

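I/O functions can read/write to a filename, a file handle, or an object that abstracts the medium.

The read functions return objects that should be deallocated with delete. All references within these objects are owned by the object.

Definition of inverted lists + a few common classes that implement the interface.

Since IVF (inverted file) indexes are of so much use for large-scale use cases, we group a few functions related to them in this small library. Most functions work both on IndexIVFs and on IndexIVFs embedded within an IndexPreTransform.

This file contains the implementations of extra metrics beyond L2 and inner product.

Implements a few neural net layers, mainly to support QINCo.

Defines a few objects that apply transformations to a set of vectors. Often these are pre-processing steps.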

namespace gpu

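Functions

int getCurrentDevice(): Returns the current thread-local GPU device.

void setCurrentDevice(int device): Sets the current thread-local GPU device.

int getNumDevices(): Returns the number of available GPU devices.

void profilerStart(): Starts the CUDA profiler (exposed via SWIG).

void profilerStop(): Stops the CUDA profiler (exposed via SWIG).

void synchronizeAllDevices(): Synchronizes the CPU against all devices (equivalent to cudaDeviceSynchronize for each device).

const cudaDeviceProp &getDeviceProperties(int device): Returns the cached cudaDeviceProp for the given device.

const cudaDeviceProp &getCurrentDeviceProperties(): Returns the cached cudaDeviceProp for the current device.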

int getMaxThreads(int device): Returns the maximum number of threads available for the given GPU device.

int getMaxThreadsCurrentDevice(): Equivalent to getMaxThreads(getCurrentDevice())

dim3 getMaxGrid(int device): Returns the maximum grid size for the given GPU device.

dim3 getMaxGridCurrentDevice(): Equivalent to getMaxGrid(getCurrentDevice())

size_t getMaxSharedMemPerBlock(int device): Returns the maximum shared memory available per block for the given GPU device.

size_t getMaxSharedMemPerBlockCurrentDevice(): Equivalent to getMaxSharedMemPerBlock(getCurrentDevice())

int getDeviceForAddress(const void *p): For a given pointer, returns the ID of the device it resides on (deviceId >= 0), or -1 if it resides on the host.

bool getFullUnifiedMemSupport(int device): Does the given device support full unified memory sharing with host memory?

bool getFullUnifiedMemSupportCurrentDevice(): Equivalent to getFullUnifiedMemSupport(getCurrentDevice())

bool getTensorCoreSupport(int device): Does the given device support tensor core operations?

bool getTensorCoreSupportCurrentDevice(): Equivalent to getTensorCoreSupport(getCurrentDevice())

int getWarpSize(int device): Returns the warp size of the given GPU device.

int getWarpSizeCurrentDevice(): Equivalent to getWarpSize(getCurrentDevice())

size_t getFreeMemory(int device): Returns the amount of currently available memory on the given device.

size_t getFreeMemoryCurrentDevice(): Equivalent to getFreeMemory(getCurrentDevice())

template<typename L1, typename L2> void streamWaitBase(const L1 &listWaiting, const L2 &listWaitOn): Makes the streams in listWaiting wait on the streams in listWaitOn.

template<typename L1> void streamWait(const L1 &a, const std::initializer_list<cudaStream_t> &b): These versions allow an initializer_list to be used as an argument, since otherwise {…} has no type.

template<typename L2> void streamWait(const std::initializer_list<cudaStream_t> &a, const L2 &b)

inline void streamWait(const std::initializer_list<cudaStream_t> &a, const std::initializer_list<cudaStream_t> &b)
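The reason for the initializer_list overloads can be shown with a small sketch that does not depend on CUDA. Everything here is an assumption made for illustration: `FakeStream` stands in for cudaStream_t, and `waitBaseSketch` and `waitSketch` are hypothetical names that only count stream pairs instead of issuing real waits.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <vector>

// Stand-in for cudaStream_t so the sketch compiles without CUDA (assumption).
using FakeStream = int;

// Generic version: accepts any iterable container of streams.
// Returns the number of (waiting, waited-on) pairs it would synchronize.
template <typename L1, typename L2>
std::size_t waitBaseSketch(const L1& waiting, const L2& waitOn) {
    std::size_t pairs = 0;
    for (const auto& w : waiting) {
        for (const auto& o : waitOn) {
            (void)w;
            (void)o;
            ++pairs;
        }
    }
    return pairs;
}

// Overload taking a braced list as the second argument. Calling
// waitBaseSketch(v, {1, 2}) directly does not compile, because the
// template parameter L2 cannot be deduced from a braced list.
template <typename L1>
std::size_t waitSketch(const L1& a, const std::initializer_list<FakeStream>& b) {
    return waitBaseSketch(a, b);
}

inline std::size_t demoWait() {
    std::vector<FakeStream> waiting = {10, 11, 12};
    return waitSketch(waiting, {1, 2}); // 3 waiting streams x 2 waited-on streams
}
```

Fixing the parameter type to std::initializer_list<FakeStream> removes the need for deduction on that argument, which is why one overload per argument position (plus one for both) is declared.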

class DeviceScope

#include <DeviceUtils.h>

RAII object to set the current device, and restore the previous device upon destruction.

Public Functions

explicit DeviceScope(int device)

~DeviceScope()

Private Members

int prevDevice_
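The RAII pattern described for DeviceScope can be sketched without CUDA as follows. This is a simplified illustration, not the Faiss implementation: the `sketch` namespace, the thread-local `currentDevice` variable, and the `deviceInsideScope` helper are assumptions that stand in for the real device-selection calls.

```cpp
#include <cassert>

namespace sketch {

// Stand-in for the thread-local CUDA device state (assumption).
thread_local int currentDevice = 0;

inline int getCurrentDevice() { return currentDevice; }
inline void setCurrentDevice(int device) { currentDevice = device; }

class DeviceScope {
public:
    // Remembers the device that was active, then switches to the new one.
    explicit DeviceScope(int device) : prevDevice_(getCurrentDevice()) {
        setCurrentDevice(device);
    }

    // Restores the previously active device when the scope ends.
    ~DeviceScope() { setCurrentDevice(prevDevice_); }

    DeviceScope(const DeviceScope&) = delete;
    DeviceScope& operator=(const DeviceScope&) = delete;

private:
    int prevDevice_;
};

// Returns the device seen inside the scope; the previous device is
// restored automatically when `scope` is destroyed on return.
inline int deviceInsideScope(int device) {
    DeviceScope scope(device);
    return getCurrentDevice();
}

} // namespace sketch
```

Tying the restore step to the destructor means the previous device is reinstated on every exit path, including early returns and exceptions.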

class CublasHandleScope

#include <DeviceUtils.h>

RAII object to manage a cublasHandle_t.

Public Functions

CublasHandleScope()

~CublasHandleScope()

inline cublasHandle_t get()

Private Members

cublasHandle_t blasHandle_

class CudaEvent

Public Functions

explicit CudaEvent(cudaStream_t stream, bool timer = false): Creates an event and records it in this stream.

CudaEvent(const CudaEvent &event) = delete

CudaEvent(CudaEvent &&event) noexcept

~CudaEvent()

inline cudaEvent_t get()

void streamWaitOnEvent(cudaStream_t stream): Waits on this event in this stream.

void cpuWaitOnEvent(): Has the CPU wait for the completion of this event.

CudaEvent &operator=(CudaEvent &&event) noexcept

CudaEvent &operator=(CudaEvent &event) = delete

Private Members

cudaEvent_t event_
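CudaEvent declares its copy operations as deleted and its move operations as noexcept, which is the usual shape of a move-only resource handle. A minimal sketch of that pattern, assuming a counter in place of real event creation and destruction, might look like this. `EventHandle`, `liveEvents`, and `liveAfterMove` are hypothetical names and do not reflect how Faiss implements CudaEvent.

```cpp
#include <cassert>
#include <utility>

namespace sketch {

// Counts live handles; stands in for creating/destroying a real event (assumption).
inline int& liveEvents() {
    static int count = 0;
    return count;
}

class EventHandle {
public:
    EventHandle() : valid_(true) { ++liveEvents(); }

    // Copying is disabled: two owners would release the same event twice.
    EventHandle(const EventHandle&) = delete;
    EventHandle& operator=(const EventHandle&) = delete;

    // Moving transfers ownership and leaves the source empty.
    EventHandle(EventHandle&& other) noexcept : valid_(other.valid_) {
        other.valid_ = false;
    }

    EventHandle& operator=(EventHandle&& other) noexcept {
        if (this != &other) {
            release();
            valid_ = other.valid_;
            other.valid_ = false;
        }
        return *this;
    }

    ~EventHandle() { release(); }

    bool valid() const { return valid_; }

private:
    // Releases the resource at most once.
    void release() {
        if (valid_) {
            --liveEvents();
            valid_ = false;
        }
    }

    bool valid_;
};

// Moves a handle and reports how many events are alive at that point.
inline int liveAfterMove() {
    EventHandle a;
    EventHandle b(std::move(a));
    return liveEvents(); // b owns the event; a is empty
}

} // namespace sketch
```

Marking the move operations noexcept lets standard containers such as std::vector move elements during reallocation, which matters for a type that cannot be copied.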