Trying a local deployment of the microsoft/autogen project

I was a bit surprised when I read the official overview. It feels more like a distributed cluster of LLMs: one agent hands tasks down to several other agents, but unlike in the past, the agents at every level can communicate in both directions. That actually makes the word "level" I just used inaccurate; it is more like a team of peers.

A quick introduction to the project

Project repository: microsoft/autogen

Overview from the project README:

This project is a spin-off of FLAML.

AutoGen has since graduated from FLAML and become a project of its own.

AutoGen is a framework for developing LLM applications using multiple agents that converse with each other to solve tasks. AutoGen agents are customizable, conversable, and seamlessly allow human participation. They can operate in various modes that employ combinations of LLMs, human input, and tools.

AutoGen makes it possible to build next-generation LLM applications with minimal effort. It simplifies the orchestration, automation, and optimization of complex LLM workflows, maximizing the performance of LLM models and overcoming their weaknesses.

It supports diverse conversation patterns for complex workflows. With customizable and conversable agents, developers can use AutoGen to build a wide range of conversation patterns, varying in conversation autonomy, number of agents, and agent conversation topology.

It provides a collection of working systems of varying complexity. These systems span a wide range of applications across domains and levels of difficulty, demonstrating how easily AutoGen supports diverse conversation patterns.

AutoGen provides an enhanced inference API as a drop-in replacement for openai.Completion or openai.ChatCompletion. It allows easy performance tuning, utilities such as API unification and caching, and advanced usage patterns such as error handling, multi-config inference, and context programming.

AutoGen is powered by collaborative research from Microsoft, Penn State University, and the University of Washington.

(screenshot 20231019125906)

I then followed the official documentation to complete the deployment with a local model (docs link).

Local deployment

Original text:

Clone FastChat

FastChat provides OpenAI-compatible APIs for its supported models, so you can use FastChat as a local drop-in replacement for OpenAI APIs. However, its code needs minor modification in order to function properly.

git clone https://github.com/lm-sys/FastChat.git
cd FastChat

Download checkpoint

ChatGLM-6B is an open bilingual language model based on General Language Model (GLM) framework, with 6.2 billion parameters. ChatGLM2-6B is its second-generation version.

Before downloading from HuggingFace Hub, you need to have Git LFS installed.

git clone https://huggingface.co/THUDM/chatglm2-6b
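
If Git LFS is not set up yet, a typical one-time setup looks like the following (the apt package name assumes a Debian/Ubuntu machine and is not part of the quoted guide; use your platform's package manager otherwise):

sudo apt-get install git-lfs
git lfs install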

Initiate server

First, launch the controller

python -m fastchat.serve.controller

Then, launch the model worker(s)

python -m fastchat.serve.model_worker --model-path chatglm2-6b

Finally, launch the RESTful API server

python -m fastchat.serve.openai_api_server --host localhost --port 8000

Normally this will work. However, if you encounter an error like this, commenting out all the lines containing finish_reason in fastchat/protocol/api_protocol.py and fastchat/protocol/openai_api_protocol.py will fix the problem. The modified code looks like:

class CompletionResponseChoice(BaseModel):
    index: int
    text: str
    logprobs: Optional[int] = None
    # finish_reason: Optional[Literal["stop", "length"]]

class CompletionResponseStreamChoice(BaseModel):
    index: int
    text: str
    logprobs: Optional[float] = None
    # finish_reason: Optional[Literal["stop", "length"]] = None
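
Before pointing AutoGen at the server, a quick sanity check (my own addition, not part of the quoted guide) is to query the OpenAI-compatible endpoints directly; the model list should include the worker registered above, e.g. chatglm2-6b:

curl http://localhost:8000/v1/models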

Interact with model using oai.Completion

Now the models can be directly accessed through the openai-python library as well as autogen.oai.Completion and autogen.oai.ChatCompletion.

from autogen import oai

# create a text completion request
response = oai.Completion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL", # just a placeholder
        }
    ],
    prompt="Hi",
)
print(response)

# create a chat completion request
response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        }
    ],
    messages=[{"role": "user", "content": "Hi"}]
)
print(response)

If you would like to switch to different models, download their checkpoints and specify model path when launching model worker(s).

Interacting with multiple local LLMs

If you would like to interact with multiple LLMs on your local machine, replace the model_worker step above with a multi model variant:

python -m fastchat.serve.multi_model_worker \
    --model-path lmsys/vicuna-7b-v1.3 \
    --model-names vicuna-7b-v1.3 \
    --model-path chatglm2-6b \
    --model-names chatglm2-6b

The inference code would be:

from autogen import oai

# create a chat completion request
response = oai.ChatCompletion.create(
    config_list=[
        {
            "model": "chatglm2-6b",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        },
        {
            "model": "vicuna-7b-v1.3",
            "api_base": "http://localhost:8000/v1",
            "api_type": "open_ai",
            "api_key": "NULL",
        }
    ],
    messages=[{"role": "user", "content": "Hi"}]
)
print(response)

To sum up, ChatGLM2 is simply exposed as an API endpoint that AutoGen can call; by the same reasoning, any model wrapped as an OpenAI-compatible API can be made available to AutoGen.
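
Beyond the raw oai.Completion / oai.ChatCompletion calls above, the same config_list can be handed to AutoGen's agent classes. Here is a minimal sketch of my own (assuming the pyautogen package of that era; the agent names and parameters are illustrative, and code execution is disabled to keep it simple):

from autogen import AssistantAgent, UserProxyAgent

# reuse the local ChatGLM2 endpoint exposed by FastChat
config_list = [
    {
        "model": "chatglm2-6b",
        "api_base": "http://localhost:8000/v1",
        "api_type": "open_ai",
        "api_key": "NULL",  # placeholder; the local server ignores it
    }
]

# assistant agent backed by the local model
assistant = AssistantAgent("assistant", llm_config={"config_list": config_list})

# user proxy that never asks for human input, does not execute code,
# and stops after the assistant's first reply
user_proxy = UserProxyAgent(
    "user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
    max_consecutive_auto_reply=0,
)

user_proxy.initiate_chat(assistant, message="Say hi and briefly introduce yourself.")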

Putting it into practice

Install FastChat

(screenshot 20231019131637)

Install serve

(screenshot 20231019131730)
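
The install commands are only shown in the screenshots; they roughly correspond to installing FastChat from the cloned source tree together with its serving extras (the extras names follow FastChat's README and are an assumption about what was actually run here):

pip3 install --upgrade pip
pip3 install -e ".[model_worker,webui]"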

Next, start the local LLM.

Since my model path is /home/kers/lab/transformers/models/THUDM/chatglm2-6b/, the launch command is:

python -m fastchat.serve.model_worker --model-path /home/kers/lab/transformers/models/THUDM/chatglm2-6b/

Running into a problem

(screenshot 20231019132416: the error, with nvidia-smi output)

An error was reported: RuntimeError: The NVIDIA driver on your system is too old (found version 11020). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.

In other words, my NVIDIA driver is too old for this PyTorch build. In the screenshot I printed the relevant information with nvidia-smi; it shows CUDA version 11.2, so let's update the driver.

(screenshot 20231019132554)

This driver version is 12.2, currently the latest.

(screenshot 20231019140738)

The update is complete.
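
After updating, a quick way to confirm that the installed PyTorch build and the new driver agree (my own sanity check, not output from the tools above) is:

import torch

# CUDA version this PyTorch wheel was built against
print(torch.__version__, torch.version.cuda)
# True once the driver is new enough for that build
print(torch.cuda.is_available())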