Paperpolisher: AI-Powered Academic Paper Translation and Polishing Assistant
Paperpolisher is an AI-powered tool that uses Baidu's ERNIE large model and Baidu Comate to translate and polish Chinese academic papers into high-quality English. It draws on a large paper dataset and retrieval augmentation, its development was streamlined by Comate's code generation, and it aims to improve acceptance chances for submissions to top conferences.
With the rapid iteration of technology, machine translation has removed the barrier to ordinary cross-language communication. However, in specialized fields, many needs remain unmet. For instance, academic paper translation has an extremely high professional threshold and requires special attention to terminology, cultural background, the sentence structures typical of papers, and grammar. Regular machine translation often struggles to meet these demands, which troubles many student users who hope to publish papers in overseas journals.
Is it possible to develop an application specifically for paper translation and polishing to improve the professionalism of paper translation and avoid the embarrassing situation of manuscript rejection by reviewers? Wang Rongsheng, a second-year graduate student at Macau Polytechnic University, and his laboratory teammates found the solution in Baidu's PaddlePaddle (飞桨) Community. The excellent Chinese understanding capability of the ERNIE large model provides a solid foundation for Chinese-English academic language translation. Additionally, Baidu Comate, an AI coding product under ERNIE, further improved the development efficiency of the application.
The idea for the application stems from Wang Rongsheng's own experience. In early May this year, Wang Rongsheng and his teammates finally received a response to their MICCAI 2024 submission. To their embarrassment, the email explicitly stated "Please note to proofread multiple times before paper submission" and listed multiple specific issues in wording, sentence structure, semantics, and even spelling. "Usually such papers would be asked for minor revisions. If there are many errors, reviewers will conclude that the authors lack professional competence," Wang said.
MICCAI 2024 is one of the most influential international academic conferences in the field of medical image analysis, and competition among submissions is extremely fierce. According to official statistics, only 11% of papers are accepted, and 35% are rejected at an early stage. Losing such a high-level publication opportunity over language and writing issues would be a huge regret for Wang and his teammates.
Based on this, Wang and his laboratory teammates decided to develop a paper translation and polishing tool. Two key factors affect paper translation quality: first, a high-quality paper dataset for the model to learn from, which directly affects output quality; second, accurate Chinese-English text pairs that help the model fully grasp the key points of paper translation. For both, they turned to Baidu's ERNIE large model and the intelligent coding assistant Baidu Comate.
In the paper dataset acquisition phase:
Wang and his teammates selected papers published over the years at multiple top conferences, including CVPR and ICML, as their base data, extracting text from paper titles, abstracts, introductions, methods, experiments, results, and discussion sections. The English in these papers is of a consistently high standard, and the collection totals more than 30,000 papers.
"We needed to write a large amount of simple, repetitive code to complete dataset collection. Baidu Comate helped us effectively improve work efficiency at this stage," Wang said. With Baidu Comate's real-time code completion, comment generation, and other functions, they wrote a large amount of code for crawling, processing, and cleaning paper data. This code automatically extracted text from the papers to complete the large-scale data collection, reducing the manual processing workload and improving overall work efficiency by about 50%.
In the Chinese-English text pair production phase:
Wang and his teammates chose the ERNIE large model, with its stronger Chinese capabilities, to create Chinese-English data pairs from the collected English materials. "As a native Chinese large model, ERNIE has higher accuracy and fluency in understanding Chinese questions and generating Chinese content. The quality of Chinese-English data pairs generated by ERNIE is also higher," Wang explained.
In this process, Baidu Comate's "Comate Open Platform" and "AutoWork" functions further reduced their workload. The "Comate Open Platform" opens Baidu Comate to third-party developer tools and online services for knowledge extension and capability extension. Development teams can connect their own or third-party capabilities and services to their coding environment and build customized R&D assistants better suited to their teams. AutoWork can deeply understand local codebases and internal private-domain knowledge. Developers only need to state their development "goals" and "intentions," and AutoWork automatically retrieves the necessary background knowledge, independently analyzes product requirements, matches the best solution, and generates code to meet development needs quickly.
By mounting the ERNIE large model API documentation in the knowledge center of the "Comate Open Platform," Wang and his teammates no longer needed to spend a lot of time reading through documentation and working out the technical logic. They only needed to issue instructions or ask questions in natural language through the "AutoWork" function, and AutoWork could quickly generate the corresponding code directly from the coding conventions and requirements in the API documentation, rapidly fulfilling the development need of calling the ERNIE large model to translate the data.
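The pair-production step can be sketched as follows. This is an illustrative outline only: the article does not show the generated code or the ERNIE API calls, so the model call is represented by a caller-supplied function, and `build_pairs`, `to_jsonl`, and `fake_translate` are hypothetical names. In real use, `translate_to_chinese` would wrap whatever ERNIE client the API documentation prescribes.

```python
import json
from typing import Callable, Iterable


def build_pairs(english_passages: Iterable[str],
                translate_to_chinese: Callable[[str], str]) -> list[dict]:
    """Pair each non-empty English passage with its Chinese rendering.

    `translate_to_chinese` stands in for a call to a large model; passages
    for which it returns an empty string are skipped.
    """
    pairs = []
    for en in english_passages:
        en = en.strip()
        if not en:
            continue
        zh = translate_to_chinese(en).strip()
        if zh:
            pairs.append({"zh": zh, "en": en})
    return pairs


def to_jsonl(pairs: list[dict]) -> str:
    """Serialise pairs one JSON object per line, keeping Chinese characters readable."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


# Stub translator used only so the sketch runs without network access.
def fake_translate(en: str) -> str:
    lookup = {"We propose a new method.": "我们提出了一种新方法。"}
    return lookup.get(en, "")


pairs = build_pairs(
    ["We propose a new method.", "   ", "Unknown sentence."],
    fake_translate,
)
print(to_jsonl(pairs))
```

Keeping the model call behind a plain function makes the pipeline easy to test with a stub and to rerun if the underlying model or API changes.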
Finally, the generated text pairs are used to build a knowledge base, and retrieval augmentation is applied on top of it to further improve translation and polishing quality, producing high-quality English papers.
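To make the idea of retrieval augmentation concrete, here is a toy sketch. It assumes the knowledge base is a list of Chinese-English pairs and that the most similar pairs are placed in the prompt as references. The similarity measure (cosine over character bigrams), the prompt wording, and all function names are assumptions for illustration. The article does not describe the retrieval method actually used, and a production system would more likely rely on embedding-based search.

```python
import math
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count character bigrams, ignoring whitespace (works without a Chinese tokenizer)."""
    chars = [c for c in text if not c.isspace()]
    return Counter("".join(chars[i:i + 2]) for i in range(len(chars) - 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, knowledge_base: list[dict], k: int = 2) -> list[dict]:
    """Return the k pairs whose Chinese side is most similar to the query."""
    q = bigrams(query)
    ranked = sorted(knowledge_base,
                    key=lambda pair: cosine(q, bigrams(pair["zh"])),
                    reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: list[dict]) -> str:
    """Assemble a prompt that shows retrieved pairs before the sentence to translate."""
    lines = ["Translate the Chinese sentence into academic English.",
             "Reference pairs:"]
    for pair in examples:
        lines.append(f"ZH: {pair['zh']}")
        lines.append(f"EN: {pair['en']}")
    lines.append(f"ZH: {query}")
    lines.append("EN:")
    return "\n".join(lines)


knowledge_base = [
    {"zh": "我们提出了一种新的图像分割方法。",
     "en": "We propose a novel image segmentation method."},
    {"zh": "实验结果表明该模型效果更好。",
     "en": "Experimental results show that the model performs better."},
]
query = "我们提出了一种新的分类方法。"
top = retrieve(query, knowledge_base, k=1)
prompt = build_prompt(query, top)
print(prompt)
```

The resulting prompt would then be sent to the large model, so that its translation is guided by phrasing drawn from real conference papers.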
Currently, "Paperpolisher 论文助手" has been officially launched in the PaddlePaddle Community Application Center. This application features Chinese-English translation, English long sentence simplification, and English polishing functions. Simply upload a paper with one click, and the large model will automatically generate high-quality content that conforms to English paper writing norms.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.