
LangChain Vector Stores

Vector stores in LangChain are the core component for storing text embeddings and running semantic similarity search over them. They play a key role in building retrieval-augmented generation (RAG) systems.

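Main roles

  • Store embedding vectors: documents are converted into high-dimensional vectors by an embedding model and then stored.
  • Run similarity search: given the embedding of a query text, find the documents that are semantically closest to it.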
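
The two roles above can be illustrated with a toy in-memory store. This is a minimal sketch under stated assumptions, not LangChain's implementation: toy_embed is a hypothetical stand-in for a real embedding model (it only counts a few keywords), and ToyVectorStore is an illustrative name.

```python
import math

# Hypothetical stand-in for an embedding model: counts occurrences of a few
# keywords. A real model would return a dense, high-dimensional vector.
VOCAB = ["pancakes", "weather", "langchain", "breakfast", "cloudy"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Illustrative store: keeps (text, vector) pairs in a list."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        # Role 1: embed the document and store the resulting vector.
        self._items.append((text, toy_embed(text)))

    def search(self, query: str, k: int = 1) -> list[str]:
        # Role 2: embed the query and rank stored vectors by similarity.
        q = toy_embed(query)
        ranked = sorted(
            self._items,
            key=lambda item: cosine_similarity(q, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add("pancakes for breakfast")
store.add("cloudy weather tomorrow")
top = store.search("what is the weather", k=1)
print(top)
```

A production vector store replaces the keyword counter with a learned embedding model and the linear scan with an index, but the add-then-search flow is the same.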

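Unified interface

LangChain exposes a common abstract interface across all supported vector databases, so the underlying store can be swapped without changing business logic. The core methods are:

Method | Description
add_documents(documents, ids) | Adds documents, together with their metadata, to the vector store
delete(ids) | Deletes documents by ID
similarity_search(query, k=4, filter=None) | Runs a semantic similarity search; k sets the number of results and filter restricts them by metadata

Note: each of these methods has an asynchronous counterpart.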
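
In LangChain the asynchronous counterparts carry an "a" prefix, for example aadd_documents, adelete, and asimilarity_search. The sketch below shows only the general pattern of exposing a blocking method as a coroutine. TinyStore and its substring matching are hypothetical and do not reflect how LangChain or any vector database is implemented.

```python
import asyncio


class TinyStore:
    """Hypothetical store used only to illustrate sync/async method pairs."""

    def __init__(self) -> None:
        self._docs: dict[str, str] = {}

    def add_documents(self, documents: list[str], ids: list[str]) -> list[str]:
        self._docs.update(zip(ids, documents))
        return ids

    def similarity_search(self, query: str, k: int = 4) -> list[str]:
        # Placeholder matching: substring test instead of vector similarity.
        hits = [doc for doc in self._docs.values() if query.lower() in doc.lower()]
        return hits[:k]

    async def aadd_documents(self, documents: list[str], ids: list[str]) -> list[str]:
        # Run the blocking call in a worker thread so the event loop stays free.
        return await asyncio.to_thread(self.add_documents, documents, ids)

    async def asimilarity_search(self, query: str, k: int = 4) -> list[str]:
        return await asyncio.to_thread(self.similarity_search, query, k)


async def main() -> list[str]:
    store = TinyStore()
    await store.aadd_documents(["LangChain demo", "weather report"], ids=["1", "2"])
    return await store.asimilarity_search("langchain", k=1)


found = asyncio.run(main())
print(found)
```

The async variants are useful inside web servers or agents that already run an event loop, where a blocking database call would stall other requests.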

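Choosing a vector database

Comparison of popular vector databases

Tool | Mode | Ease of use | Performance | Cost | Best suited for
Milvus | Open source / managed | Medium (requires deployment) | | Controllable | Large-scale, custom deployments
Chroma | Open source / lightweight | High (local / managed) | | | Prototypes and small-scale applications
Pinecone | Managed SaaS | High (no operations burden) | High (low latency) | Medium to high | Fast launch, production-grade RAG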

Using Milvus

To install Milvus standalone with Docker, see "Milvus Installation and Usage".

Install dependencies

shell
pip install -qU langchain-milvus -i https://mirrors.aliyun.com/pypi/simple --trusted-host=mirrors.aliyun.com

Create the Milvus database

python
from pymilvus import Collection, MilvusException, connections, db, utility

conn = connections.connect(host="127.0.0.1", port=19530)
db_name = "milvus_demo"
try:
    existing_databases = db.list_database()
    if db_name in existing_databases:
        print(f"Database '{db_name}' already exists.")

        # Use the database context
        db.using_database(db_name)

        # Drop all collections in the database
        collections = utility.list_collections()
        for collection_name in collections:
            collection = Collection(name=collection_name)
            collection.drop()
            print(f"Collection '{collection_name}' has been dropped.")

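        # Switch back to the default database before dropping this one
        db.using_database("default")
        db.drop_database(db_name)
        print(f"Database '{db_name}' has been deleted.")
    else:
        print(f"Database '{db_name}' does not exist.")

    # Create the database in both cases so the next step can connect to it
    db.create_database(db_name)
    print(f"Database '{db_name}' created successfully.")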
except MilvusException as e:
    print(f"An error occurred: {e}")

Initialize the vector store instance

python
from langchain_milvus import Milvus

URI = "http://127.0.0.1:19530"
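# `embeddings` is assumed to be an embedding model instance created earlier
# (any LangChain Embeddings implementation); it is not defined in this snippet.
vector_store = Milvus(
    embedding_function=embeddings,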
    connection_args={"uri": URI, "token": "root:Milvus", "db_name": "milvus_demo"},
    index_params={"index_type": "FLAT", "metric_type": "L2"},
    consistency_level="Strong",
    drop_old=False,  # set to True if seeking to drop the collection with that name if it exists
)
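
The index_params above select a FLAT index with the L2 metric. As a rough illustration of what that combination means, the sketch below performs an exhaustive scan and ranks by Euclidean distance, so a smaller value means a closer match. The vectors are made up, and whether a given database reports the distance or its square is an implementation detail not covered here.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean (L2) distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query: list[float], vectors: dict[str, list[float]], k: int = 2):
    # Exhaustive scan: compare the query with every stored vector.
    scored = [(vec_id, l2_distance(query, vec)) for vec_id, vec in vectors.items()]
    # Smaller distance means more similar, so sort ascending.
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


vectors = {"a": [0.0, 0.0], "b": [3.0, 4.0], "c": [1.0, 0.0]}
nearest = flat_search([0.0, 0.0], vectors, k=2)
print(nearest)
```

An exhaustive scan gives exact results but its cost grows linearly with the number of vectors, which is why larger collections usually move to approximate indexes.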

Write documents to the vector store

python
from uuid import uuid4
from langchain_core.documents import Document

document_1 = Document(
    page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
    metadata={"source": "tweet"},
)
document_2 = Document(
    page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
    metadata={"source": "news"},
)
document_3 = Document(
    page_content="Building an exciting new project with LangChain - come check it out!",
    metadata={"source": "tweet"},
)
documents = [document_1, document_2, document_3]
uuids = [str(uuid4()) for _ in range(len(documents))]
# Add the documents, together with their metadata, to the vector store
vector_store.add_documents(documents=documents, ids=uuids)

Document retrieval example

Given a user query, the search returns a list of Document objects that are semantically similar to it.

python
# Direct similarity search, restricted to documents whose source is "tweet"
query = "LangChain provides abstractions to make working with LLMs easy"
results = vector_store.similarity_search(
    query=query,
    k=2,
    expr='source == "tweet"',
)

for res in results:
    print(f"* {res.page_content} [{res.metadata}]")
    
    
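# Search with scores (useful for debugging relevance).
# With the L2 metric configured above, a lower score means a closer match.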
results = vector_store.similarity_search_with_score("What is RAG?", k=1)
for doc, score in results:
    print(f"[Score: {score:.2f}] {doc.page_content}")
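
Conceptually, the expr argument in the first search narrows the candidate set by metadata, and only the remaining documents compete for the top k slots. The sketch below mimics that behavior with plain Python. The distance values are invented for illustration, and filtered_top_k is a hypothetical helper, not part of LangChain or Milvus.

```python
# Invented sample data: each entry carries a metadata field and a made-up
# distance to some query (smaller means more similar).
docs = [
    {"text": "pancakes for breakfast", "source": "tweet", "distance": 0.9},
    {"text": "cloudy tomorrow", "source": "news", "distance": 0.2},
    {"text": "new LangChain project", "source": "tweet", "distance": 0.4},
]


def filtered_top_k(items: list[dict], source: str, k: int) -> list[str]:
    # Keep only documents whose metadata matches the filter.
    candidates = [d for d in items if d["source"] == source]
    # Rank the survivors by distance, closest first.
    candidates.sort(key=lambda d: d["distance"])
    return [d["text"] for d in candidates[:k]]


tweets = filtered_top_k(docs, source="tweet", k=2)
print(tweets)
```

Note that the news item has the smallest distance overall yet never appears in the result, because the filter excludes it before ranking.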