RAG From Scratch: Indexing
Preface: Chunking
We don't explicitly cover document chunking / splitting.
For an excellent review of document chunking, see Greg Kamradt's video:
https://www.youtube.com/watch?v=8OJC21T2SL4
Environment
(1) Packages
! pip install langchain_community tiktoken langchain-openai langchainhub chromadb langchain youtube-transcript-api pytube
 
(2) LangSmith
https://docs.smith.langchain.com/ 
import os

os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
os.environ['LANGCHAIN_API_KEY'] = <your-api-key>
 
(3) API Key
os.environ['OPENAI_API_KEY'] = <your-api-key>
 
Part 12: Multi-representation Indexing
Flow:
Docs:
https://blog.langchain.dev/semi-structured-multi-modal-rag/ 
https://python.langchain.com/docs/modules/data_connection/retrievers/multi_vector 
Paper:
https://arxiv.org/abs/2312.06648 
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())
 
import uuid

from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Summarization chain: doc -> prompt -> LLM -> string
chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-3.5-turbo", max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})
 
from langchain.storage import InMemoryByteStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vectorstore to index the summaries
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=OpenAIEmbeddings())

# The storage layer for the full (parent) documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Summary docs linked back to their full documents via doc_id
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
 
query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query, k=1)
sub_docs[0]
 
retrieved_docs = retriever.get_relevant_documents(query, n_results=1)
retrieved_docs[0].page_content[0:500]
 
A related idea is the parent document retriever.
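The core of that pattern can be shown with a dependency-free toy: small child chunks are what gets searched, but the full parent document they came from is what gets returned. This is only a sketch with a word-overlap stand-in for similarity search; in LangChain the real implementation is `ParentDocumentRetriever`.

```python
# Parent documents, keyed by id.
parents = {
    "p1": "Agents use planning and tool use. Memory lets an agent recall past steps.",
    "p2": "Data quality matters for training. Human annotation is expensive but valuable.",
}

# Index of small child chunks, each pointing back to its parent id.
children = []
for pid, text in parents.items():
    for chunk in text.split(". "):
        children.append({"text": chunk, "parent_id": pid})

def retrieve_parent(query):
    # Toy "similarity": count overlapping words between query and chunk.
    def overlap(chunk):
        return len(set(query.lower().split()) & set(chunk["text"].lower().split()))
    best = max(children, key=overlap)
    # Search hit a small chunk, but return the whole parent document.
    return parents[best["parent_id"]]

retrieve_parent("memory")
```

The query matches a precise child chunk, yet the caller receives the full parent, preserving context for the downstream LLM.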
Part 13: RAPTOR
Flow:
Deep dive video:
https://www.youtube.com/watch?v=jbGchdTL7d0 
Paper:
https://arxiv.org/pdf/2401.18059.pdf 
Full code:
https://github.com/langchain-ai/langchain/blob/master/cookbook/RAPTOR.ipynb 
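The linked paper and cookbook build a tree by recursively clustering documents, summarizing each cluster, and indexing every level together. A toy sketch of that loop, where `embed`, `cluster`, and `summarize` are deliberately crude stubs standing in for the real embedding model, GMM clustering, and LLM summarizer:

```python
def embed(text):
    # Stub: a tiny "feature vector" (length, vowel count) instead of real embeddings.
    return (len(text), sum(text.count(v) for v in "aeiou"))

def cluster(texts, n_clusters=2):
    # Stub: bucket texts by sorted embedding instead of GMM clustering.
    ordered = sorted(texts, key=embed)
    size = max(1, len(ordered) // n_clusters)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def summarize(texts):
    # Stub: truncate the concatenation; RAPTOR uses an LLM summary here.
    return " ".join(texts)[:100]

def raptor_tree(leaf_texts, max_levels=3, min_docs=2):
    """Build the multi-level index: leaves plus summaries of recursive clusters."""
    levels = [leaf_texts]
    current = leaf_texts
    for _ in range(max_levels):
        if len(current) <= min_docs:
            break
        current = [summarize(c) for c in cluster(current)]
        levels.append(current)
    # All levels are indexed together for retrieval.
    return levels

leaves = ["doc one about agents", "doc two about memory",
          "doc three about planning", "doc four about tools"]
tree = raptor_tree(leaves)
```

Retrieval then searches across all levels at once, so broad questions can match high-level summaries while specific ones match leaves; see the cookbook above for the real implementation.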
Part 14: ColBERT
RAGatouille makes it easy to use ColBERT.
ColBERT generates a contextually influenced vector for each token in the passages.
ColBERT similarly generates vectors for each token in the query.
Then, the score of each document is the sum of the maximum similarity of each query embedding to any of the document embeddings.
See here and here and here.
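That MaxSim scoring rule can be sketched with NumPy on toy token vectors (assumed 2-dimensional here purely for illustration; the real model emits one contextualized embedding per token):

```python
import numpy as np

def maxsim_score(query_embs, doc_embs):
    # Normalize rows so the dot product is cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    sim = q @ d.T                 # shape: (num_query_tokens, num_doc_tokens)
    # Max over document tokens, then sum over query tokens.
    return sim.max(axis=1).sum()

query = np.array([[1.0, 0.0], [0.0, 1.0]])            # 2 query "tokens"
doc = np.array([[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]])  # 3 doc "tokens"
score = maxsim_score(query, doc)  # each query token finds an exact match -> 2.0
```

Because every query token independently finds its best-matching document token, ColBERT keeps fine-grained term-level interactions that single-vector embeddings average away.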
! pip install -U ragatouille
 
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
 
import requests

def get_wikipedia_page(title: str):
    """
    Retrieve the full text content of a Wikipedia page.

    :param title: str - Title of the Wikipedia page.
    :return: str - Full text content of the page as a raw string.
    """
    # Wikipedia API endpoint
    URL = "https://en.wikipedia.org/w/api.php"

    # Parameters for the API request
    params = {
        "action": "query",
        "format": "json",
        "titles": title,
        "prop": "extracts",
        "explaintext": True,
    }

    headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (ben@clavie.eu)"}
    response = requests.get(URL, params=params, headers=headers)
    data = response.json()

    # Extract the page content
    page = next(iter(data["query"]["pages"].values()))
    return page["extract"] if "extract" in page else None

full_document = get_wikipedia_page("Hayao_Miyazaki")
 
RAG.index(
    collection=[full_document],
    index_name="Miyazaki-123",
    max_document_length=180,
    split_documents=True,
)
 
results = RAG.search(query="What animation studio did Miyazaki found?", k=3)
results
 
retriever = RAG.as_langchain_retriever(k=3)
retriever.invoke("What animation studio did Miyazaki found?")