Publications

ViRED: Prediction of Visual Relations in Engineering Drawings

Chao Gu, Ke Lin, Yiyang Luo

Under Review

2D Vision Object Detection

To accurately understand engineering drawings, it is essential to establish the correspondence between images and their description tables within the drawings. Existing document understanding methods predominantly focus on text as the main modality, which is not suitable for documents containing substantial image information. In the field of visual relation detection, the structure of the task inherently limits its capacity to assess relationships among all entity pairs in the drawings. To address this issue, we propose a vision-based relation detection model, named ViRED, to identify the associations between tables and circuits in electrical engineering drawings. Our model mainly consists of three parts: a vision encoder, an object encoder, and a relation decoder. We implement ViRED using PyTorch to evaluate its performance. To validate the efficacy of ViRED, we conduct a series of experiments. The experimental results indicate that, within the engineering drawing dataset, our approach attained an accuracy of 96% in the task of relation prediction, marking a substantial improvement over existing methodologies. The results also show that ViRED can inference at a fast speed even when there are numerous objects in a single engineering drawing.

Geo-LLaVA: A Large Multi-Modal Model for Solving Geometry Math Problems with Meta In-Context Learning

Shihao Xu*, Yiyang Luo*, Wei Shi

ACM MM 2024 LGM3A Workshop

Multi-Modality RAG

Abstract

Context-Aware Indoor Point Cloud Object Generation through User Instructions

Yiyang Luo*, Ke Lin*, Chao Gu

ACM MM 2024

CCF A CORE A* 3D Vision Generation

Abstract | Code | Project Page

Indoor scene modification has emerged as a prominent area within computer vision, particularly for its applications in Augmented Reality (AR) and Virtual Reality (VR). Traditional methods often rely on pre-existing object databases and predetermined object positions, limiting their flexibility and adaptability to new scenarios. In response to this challenge, we present a novel end-to-end multi-modal deep neural network capable of generating point cloud objects seamlessly integrated with their surroundings, driven by textual instructions. Our work proposes a novel approach in scene modification by enabling the creation of new environments with previously unseen object layouts, eliminating the need for pre-stored CAD models. Leveraging Point-E as our generative model, we introduce innovative techniques such as quantized position prediction and Top-K estimation to address the issue of false negatives resulting from ambiguous language descriptions. Furthermore, we conduct comprehensive evaluations to showcase the diversity of generated objects, the efficacy of textual instructions, and the quantitative metrics, affirming the realism and versatility of our model in generating indoor objects. To provide a holistic assessment, we incorporate visual grounding as an additional metric, ensuring the quality and coherence of the scenes produced by our model. Through these advancements, our approach not only advances the state-of-the-art in indoor scene modification but also lays the foundation for future innovations in immersive computing and digital environment creation.

Lost in Overlap: Exploring Watermark Collision in LLMs.

Yiyang Luo*, Ke Lin*, Chao Gu*

Under Review

Preprint Security Watermarking

Abstract

The proliferation of large language models (LLMs) in generating content raises concerns about text copyright. Watermarking methods, particularly logit-based approaches, embed imperceptible identifiers into text to address these challenges. However, the widespread use of watermarking across diverse LLMs has led to an inevitable issue known as watermark collision during common tasks like question answering and paraphrasing. This study focuses on dual watermark collisions, where two watermarks are present simultaneously in the same text. The research demonstrates that watermark collision poses a threat to detection performance for detectors of both upstream and downstream watermark algorithms.

Zero-shot Generative Linguistic Steganography

Ke Lin, Yiyang Luo, Zijian Zhang, and Luo Ping

NAACL 2024

CCF B CORE A Security Steganography

Abstract | Project Page

Generative linguistic steganography attempts to hide secret messages into covertext. Previous studies have generally focused on the statistical differences between the covertext and stegotext, however, ill-formed stegotext can readily be identified by humans. In this paper, we propose a novel zero-shot approach based on in-context learning for linguistic steganography to achieve better perceptual and statistical imperceptibility. We also design several new metrics and reproducible language evaluations to measure the imperceptibility of the stegotext. Our experimental results indicate that our method produces 1.926× more innocent and intelligible stegotext than any other method.

Uniﬁed Feature Fusion Network with Path Router for Multi-task Image Restoration

Zhou Jingyuan, Leong Chaktou, Luo Yiyang, Lin Minyi, Liao Wantong, Li, Congduan

ICCT 2021

2D Vision MoE

Abstract

Image restoration is an important low-level vision task, which includes various sub-tasks such as deraining, dehazing, denoising, raindrop removal, etc. Although current researches have achieved significant results in various sub-tasks, only a few of them are designed for multiple degradation factors. However, in the actual natural environment, the weather is complex and changeable so the networks designed for a single task are usually inapplicable. In this paper, we propose an unified network that can effectively restore images in a variety of weather conditions (rain, haze, and raindrops). The network is mainly divided into three parts. The first part is the shared multi-expert feature extraction module. We introduce the multi-task learning with multi-gate mixture-of-experts(MMoE) architecture and propose the smooth dilated residual group to extract the low-level features. Furthermore, the gate fusion sub-network is proposed to weigh and sum the output of each expert, so the correlation and difference between tasks can be captured. The second part is the path router sub-network, which can select branches of different tasks, and only open one branch at a time. The third part is multi-gated feature fusion branch, which can extract high-level feature and fuse different levels of features. Finally, we add the output of the third part to the input image and get the clean image. Our model cam deal with a variety of weather conditions and the experiments show competitive results compared with state-of-the-art models for single tasks.

Yiyang Luo, Lawrence

Publications