FlowCE: First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models

As Multimodal Large Language Models (MLLMs) have developed, their general capabilities have grown increasingly powerful, and numerous evaluation systems have emerged to assess their various abilities. However, there is still no comprehensive method for evaluating MLLMs on flowchart-related tasks, which are very important in daily life and work. We propose FlowCE, the first comprehensive method for assessing MLLMs across multiple dimensions of flowchart-related tasks. It evaluates MLLMs’ abilities in Reasoning, Localization Recognition, Information Extraction, Logical Verification, and Summarization on flowcharts. We find that even GPT4o achieves a score of only 56.63, and that among open-source models Phi-3-Vision obtains the highest score, 49.97. We hope that FlowCE can contribute to future research on MLLMs for flowchart-based tasks.

Release

Release the evaluation script.
🔥🔥🔥 We release the dataset (https://github.com/360AILABNLP/FlowCE/tree/main/flowce_benchmark).

Usage and License Notices: The data, code, and checkpoint are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, GPT-4, Qwen, and LLaVA.

FlowCE Benchmark Dataset

FlowCE is built upon 500 real-world flowcharts, ensuring ample diversity among the charts and across five dimensions of real flowchart scenarios: reasoning, information extraction, localization recognition, summarization, and logical verification.

The full dataset is available at: https://github.com/360AILABNLP/FlowCE/tree/main/flowce_benchmark

Please note: The image data used in this work is solely for scientific research purposes, and only the source link for each image is made available. To obtain the images, download them from the links in the image URL files provided for the 5 tasks.
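Because only image links are distributed, fetching the images is left to the user. The following is a minimal sketch of one way to do that, not the project's own tooling. It assumes each URL file is plain text with one link per line, which is not confirmed by this README; the function names and the output naming scheme are hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse
from urllib.request import urlopen


def read_urls(url_file):
    """Return the non-empty lines of a URL list file (assumed: one URL per line)."""
    lines = Path(url_file).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def local_name(index, url):
    """Build a file name from a zero-padded index plus the URL's basename."""
    base = PurePosixPath(urlparse(url).path).name or "image"
    return f"{index:04d}_{base}"


def fetch(url, timeout=30):
    """Download one URL and return its raw bytes."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(url_file, out_dir, fetcher=fetch):
    """Download every URL in url_file into out_dir.

    Returns (saved, failed): saved is a list of written paths, failed is a
    list of (url, error message) pairs for links that could not be fetched.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved, failed = [], []
    for i, url in enumerate(read_urls(url_file)):
        try:
            target = out / local_name(i, url)
            target.write_bytes(fetcher(url))
            saved.append(target)
        except Exception as exc:  # broad on purpose: keep going past dead links
            failed.append((url, str(exc)))
    return saved, failed
```

A call such as `download_all("reasoning_urls.txt", "images/reasoning")` would then be repeated once per task; the file name here is a placeholder, so substitute the actual URL file names from the flowce_benchmark folder. Collecting failures instead of stopping is a deliberate choice, since source links to web images may go stale over time.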

Process of creating and evaluating FlowCE

Data samples of FlowCE

FlowCE covers 5 evaluation dimensions. Each evaluation dimension contains human-annotated question-answer pairs.
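For readers who want to work with the question-answer pairs programmatically, the sketch below shows one possible in-memory representation grouped by dimension. The dimension names come from this README; the `QAPair` fields and the `group_by_dimension` helper are hypothetical and do not reflect the actual file schema of the released dataset.

```python
from dataclasses import dataclass

# The five evaluation dimensions named in this README.
DIMENSIONS = (
    "reasoning",
    "localization recognition",
    "information extraction",
    "logical verification",
    "summarization",
)


@dataclass(frozen=True)
class QAPair:
    """One human-annotated question-answer pair (field names are assumed)."""
    image: str
    dimension: str
    question: str
    answer: str


def group_by_dimension(pairs):
    """Return a dict mapping every dimension to its list of QA pairs."""
    grouped = {d: [] for d in DIMENSIONS}
    for pair in pairs:
        if pair.dimension not in grouped:
            raise ValueError(f"unknown dimension: {pair.dimension!r}")
        grouped[pair.dimension].append(pair)
    return grouped
```

Starting from a dict that already holds all five keys means a dimension with no samples shows up as an empty list rather than a missing key, which makes it easier to spot an incomplete download.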

Result

1. Statistics of compared API-based and open-source MLLMs

2. Detailed evaluation results on FlowCE across different models

License

The content of this project itself is licensed under LICENSE.
