Update README.md
README.md (changed)

````diff
@@ -5,26 +5,38 @@ language:
 ---
 # CogAgent
 
-## Introduction
-
 **CogAgent** is an open-source visual language model improved based on **CogVLM**. 
 
 📖 Paper: https://arxiv.org/abs/2312.08914
 
-
-
-
-
-
-
-
-
-
-
+🚀 GitHub: For more information such as demo, fine-tuning, and query prompts, please refer to [Our GitHub](https://github.com/THUDM/CogVLM/)
+
+## Reminder
+
+**This is the ``cogagent-vqa`` version of CogAgent checkpoint.**
+
+We have open-sourced two versions of CogAgent checkpoints, and you can choose one based on your needs. 
+
+1. ``cogagent-chat``: This model has strong capabilities in **GUI Agent, visual multi-turn dialogue, visual grounding,** etc.
+
+   If you need GUI Agent and Visual Grounding functions, or need to conduct multi-turn dialogues with a given image, we recommend using this version of the model.
 
-
+3. ``cogagent-vqa``: This model has *stronger* capabilities in **single-turn visual dialogue**.
+
+   If you need to **work on VQA leaderboards** (such as MMVET, VQAv2), we recommend using this model.
+
 
-
+## Introduction
+
+CogAgent-18B has 11 billion visual and 7 billion language parameters.
+
+CogAgent demonstrates **strong performance** in image understanding and GUI agent:
+
+1. CogAgent-18B **achieves state-of-the-art generalist performance on 9 cross-modal benchmarks**, including: VQAv2, MM-Vet, POPE, ST-VQA, OK-VQA, TextVQA, ChartQA, InfoVQA, DocVQA. 
+
+2. CogAgent-18B significantly **surpasses existing models on GUI operation datasets**, including AITW and Mind2Web.
+
+In addition to all the **features** already present in **CogVLM** (visual multi-round dialogue, visual grounding), **CogAgent**:
 
 1. Supports higher resolution visual input and dialogue question-answering. It supports ultra-high-resolution image inputs of **1120x1120**.
 
@@ -124,7 +136,7 @@ Then run:
 ```bash
 python cli_demo_hf.py --bf16
 ```
-
+
 
 ## License
 
````
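The README's guidance on picking a checkpoint (``cogagent-chat`` for GUI Agent, visual grounding, and multi-turn dialogue; ``cogagent-vqa`` for single-turn visual dialogue such as VQA leaderboard work) can be summarized as a small lookup. This is a hypothetical sketch: the function `recommend_checkpoint` and the task labels are illustrative assumptions and are not part of the CogAgent repository; only the two version names come from the README.

```python
# Hypothetical helper summarizing the README's checkpoint recommendation.
# Task labels are illustrative assumptions; only the version names
# ("cogagent-chat", "cogagent-vqa") come from the README.

CHAT_TASKS = {"gui_agent", "visual_grounding", "multi_turn_dialogue"}
VQA_TASKS = {"single_turn_vqa", "vqa_leaderboard"}


def recommend_checkpoint(task: str) -> str:
    """Return the CogAgent version the README recommends for `task`."""
    if task in CHAT_TASKS:
        # GUI Agent, visual grounding, or multi-turn dialogue about one image.
        return "cogagent-chat"
    if task in VQA_TASKS:
        # Single-turn visual dialogue, e.g. MM-Vet or VQAv2 submissions.
        return "cogagent-vqa"
    raise ValueError(f"no recommendation for task {task!r}")


if __name__ == "__main__":
    for t in ("gui_agent", "vqa_leaderboard"):
        print(t, "->", recommend_checkpoint(t))
```

Running the script prints `gui_agent -> cogagent-chat` and `vqa_leaderboard -> cogagent-vqa`; unknown task labels raise `ValueError` rather than silently defaulting to either checkpoint.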
