Research | 3D Vision & Robotics Lab @ UNIST

Research

Research Interests

Our research group (3D Vision & Robotics Lab) focuses on 3D computer vision, with particular emphasis on its geometric and physical aspects. Specifically, our goal is to enable systems (e.g., robots and autonomous vehicles) to understand and interpret various data in a manner similar to the way humans use their senses to relate to the world around them.

To achieve this goal, we process and analyze various sensor data, such as images, videos, 3D point clouds, and other sensor modalities. Currently, we are working on the following research agenda:

3D Scene Understanding

3D Scene Understanding seeks to interpret the geometry, semantics, and spatial relationships that define a scene’s composition. This research aims to capture not only the shape and layout of scenes, but also the meaning and context of the objects within them, enabling richer and more accurate scene interpretation. By combining detailed structural analysis with semantic understanding, we work toward systems that can comprehend complex 3D environments in a way that is both precise and context-aware. To this end, we are actively researching tasks such as 3D structural understanding, language-grounded 3D scene understanding, and related topics.

3D Structural Understanding

3D structural understanding is the process of extracting various information (e.g., depth, surface normals, layout, object detections, semantics) from single or multiple images in various environments, such as indoor and outdoor scenes. This task is a fundamental basis for diverse applications (e.g., scene reconstruction, novel view synthesis, immersive media creation). Our research explores 3D scene understanding in diverse scenarios, including challenging settings like panoramic imagery, where unique geometric characteristics require careful representation and processing.
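Panoramic images are commonly stored in an equirectangular layout, where each pixel corresponds to a direction on the unit sphere rather than a point on a flat image plane. The sketch below illustrates that mapping. The function name and the axis convention (y up, z forward) are assumptions made for illustration and are not taken from any specific codebase.

```python
import math


def equirect_pixel_to_ray(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit direction (x, y, z).

    Assumed convention (hypothetical): longitude runs from -pi (left edge)
    to +pi (right edge), latitude from +pi/2 (top) to -pi/2 (bottom),
    with y pointing up and z pointing forward.
    """
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)


# The image center looks straight ahead along +z.
center_ray = equirect_pixel_to_ray(512, 256, 1024, 512)
```

Because equal pixel steps cover unequal solid angles near the poles, methods for panoramas often work with such spherical directions instead of raw pixel coordinates.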

[Publication]
– HUSH: Holistic Panoramic 3D Scene Understanding using Spherical Harmonics (CVPR 2025)

Language-Grounded 3D Scene Understanding

Language-Grounded 3D Scene Understanding integrates natural language with advanced 3D representations, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS), to create scene models that are both geometrically precise and semantically rich. By aligning 3D geometry and appearance with language-based concepts, this approach enables open-vocabulary recognition, semantic localization, and context-aware querying directly within a 3D environment. Our group explores this direction to bridge the gap between human communication and machine perception, allowing systems to interpret, search, and interact with complex scenes using natural language descriptions.
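As a minimal sketch of open-vocabulary querying, assume every 3D point (or Gaussian) already carries a feature vector that lives in the same embedding space as a text query. Selecting the matching region then reduces to a similarity threshold. The toy 3-dimensional vectors, the function names, and the threshold are all hypothetical; a real system would obtain the embeddings from a learned vision-language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def query_points(point_features, text_embedding, threshold=0.8):
    """Return indices of points whose feature matches the text embedding."""
    return [
        i
        for i, feature in enumerate(point_features)
        if cosine_similarity(feature, text_embedding) >= threshold
    ]


# Toy data: two points resemble the query direction, one does not.
features = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0]]
matches = query_points(features, [1.0, 0.0, 0.0])
```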

SLAM/Sensor Fusion

Simultaneous Localization and Mapping (SLAM) and sensor fusion are core technologies that enable intelligent robots to perceive and navigate complex environments. SLAM allows a robot to build a map of an unknown space while simultaneously determining its own position within that map. Sensor fusion enhances this capability by integrating data from multiple complementary sensors, such as LiDAR, cameras, and IMUs, to achieve more accurate, robust, and reliable state estimation. Leveraging SLAM and sensor fusion, our research aims to develop systems that can operate in dynamic, unstructured environments. To this end, we are actively researching tasks such as visual SLAM, collaborative SLAM, event camera-based perception, and related topics.
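To illustrate why fusing complementary sensors improves state estimation, the sketch below combines two independent measurements of the same scalar quantity by inverse-variance weighting, which is the one-dimensional core of a Kalman-style update. The function name and the example numbers are assumptions for illustration only.

```python
def fuse_measurements(z1, var1, z2, var2):
    """Fuse two independent estimates of one quantity.

    Each estimate is weighted by the inverse of its variance, so the more
    certain sensor has more influence. Returns (fused_value, fused_variance).
    """
    w1 = 1.0 / var1
    w2 = 1.0 / var2
    fused = (w1 * z1 + w2 * z2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)
    return fused, fused_var


# Hypothetical range readings in meters: a precise sensor and a noisier one.
value, variance = fuse_measurements(10.0, 1.0, 12.0, 3.0)
```

The fused variance is always smaller than either input variance, which is the formal sense in which fusion yields a more reliable estimate than any single sensor.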

Visual SLAM

Visual Simultaneous Localization and Mapping (Visual SLAM) is the task of estimating a system’s own state in real time by analyzing visual patterns in sequential multi-view images. By extracting and tracking visual features across consecutive images, it reconstructs the 3D structure of the environment while continuously refining the camera’s trajectory. Our research focuses on developing algorithms that enhance robustness and reliability, particularly in challenging scenarios such as texture-less surfaces and dynamic environments.

[Project]
– [Done] Development and Research of a Multi-Camera-Based SLAM System (CLOBOT)

Collaborative SLAM

Collaborative SLAM goes a step further by addressing SLAM in multi-agent (multi-robot) systems, where multiple agents work together to build a shared map of an environment and localize themselves within it. Through inter-robot interaction, agents can achieve more accurate and comprehensive mapping over larger or more complex areas. We explore algorithms and communication strategies that facilitate effective cooperation among agents in indoor environments (e.g., hospitals, offices), which are commonly encountered by home and service robots.
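One basic building block of a shared map is expressing a landmark observed by one robot in another robot's map frame. The sketch below does this in 2D, assuming the relative pose between the two frames is already known (in practice it has to be estimated, for example from places both robots have visited). All names here are hypothetical.

```python
import math


def transform_point(pose_b_in_a, point_in_b):
    """Express a 2D point given in frame B in frame A.

    pose_b_in_a = (tx, ty, theta): origin and heading of frame B as seen
    from frame A, with theta in radians.
    """
    tx, ty, theta = pose_b_in_a
    px, py = point_in_b
    x = tx + math.cos(theta) * px - math.sin(theta) * py
    y = ty + math.sin(theta) * px + math.cos(theta) * py
    return (x, y)


# Robot B's map origin sits at (1, 2) in robot A's map, rotated by 90 degrees.
landmark_in_a = transform_point((1.0, 2.0, math.pi / 2), (1.0, 0.0))
```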

[Publication]
– A Benchmark Dataset for Collaborative SLAM in Service Environments (RA-L 2024)

[Project]
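– [On-going] Development of an AI Bots Collaboration Platform and Self-Organizing AI Technology (Institute of Information & Communications Technology Planning & Evaluation, IITP)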

Event Camera-based Perception

Compared to conventional cameras that capture full images at fixed intervals, event cameras asynchronously detect only changes in brightness at each pixel. This unique characteristic enables them to sense reliably in high-speed or low-light environments. To fully leverage the low-latency and illumination-robust nature of event cameras, we are researching both standalone and fused use of event sensors with other sensory modalities, addressing diverse tasks such as depth estimation, sensor calibration, and occupancy prediction. Furthermore, our goal is to extend these capabilities into platform-agnostic perception systems that can be deployed across various platforms, including ground robots, quadruped robots, and aerial drones.
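The difference from frame-based capture can be sketched with the commonly used idealized event model: a pixel emits an event whenever its log intensity has changed by a fixed contrast threshold since the last event. The code below simulates this for a single pixel. The function name, the threshold value, and the sampled input are assumptions for illustration; real sensors add noise, latency, and refractory effects that are ignored here.

```python
import math


def generate_events(samples, threshold=0.2):
    """Simulate events for ONE pixel under the idealized contrast model.

    samples: list of (timestamp, intensity) pairs with intensity > 0.
    Returns a list of (timestamp, polarity) with polarity +1 or -1.
    """
    events = []
    reference = math.log(samples[0][1])  # log intensity at the last event
    for timestamp, intensity in samples[1:]:
        diff = math.log(intensity) - reference
        # A large change produces several events of the same polarity.
        while abs(diff) >= threshold:
            polarity = 1 if diff > 0 else -1
            events.append((timestamp, polarity))
            reference += polarity * threshold
            diff = math.log(intensity) - reference
    return events


# A static pixel produces no events; a brightening pixel produces positive ones.
static_events = generate_events([(0, 1.0), (1, 1.0), (2, 1.0)])
bright_events = generate_events([(0, 1.0), (1, 1.0), (2, math.exp(0.5))])
```

Because output is produced only where brightness changes, a static scene generates no data at all, which is what makes the sensor sparse and low-latency.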

[Project]
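– [On-going] Development of a Dynamic Event Camera-Based Fusion Sensor Pack for Adaptable 3D Spatial Perception across Heterogeneous Agents (National Research Foundation of Korea, NRF)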

3D Perception

3D perception plays a pivotal role in enabling intelligent agents, such as robots, autonomous vehicles, and AR/VR systems, to understand and interact with the physical world in a meaningful way. By perceiving, interpreting, and reasoning about their environments in three dimensions, these systems can make informed decisions, navigate complex scenes, and perform tasks with a high level of autonomy. Our group focuses on developing robust 3D perception systems that empower intelligent agents to operate reliably in real-world conditions. To this end, we are actively researching tasks such as depth estimation, 3D occupancy prediction, traversability estimation, and related topics.

Depth Estimation

Depth estimation is the task of predicting the distance from the camera to points in a scene using 2D image inputs. It is a fundamental problem in 3D vision that enables machines to understand scene geometry for critical tasks such as obstacle avoidance, navigation, and 3D reconstruction. Traditional approaches often rely on pinhole camera models and assume uniform depth distributions. However, real-world applications—especially in driving scenarios using wide-angle or fisheye cameras—face unique challenges such as nonlinear distortion and varying depth scales across the field of view. Our research develops robust depth estimation algorithms, improving accuracy and reliability in diverse and challenging environments.
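The nonlinear distortion mentioned above can be made concrete by comparing where a ray lands on the image under two textbook projection models. The sketch assumes an ideal pinhole camera and an ideal equidistant fisheye; real lenses need additional distortion coefficients, and the focal length below is a made-up value.

```python
import math


def pinhole_radius(theta, focal_length):
    """Image radius (pixels) of a ray at angle theta from the optical axis."""
    return focal_length * math.tan(theta)


def equidistant_fisheye_radius(theta, focal_length):
    """Image radius under the ideal equidistant fisheye model."""
    return focal_length * theta


focal = 300.0  # hypothetical focal length in pixels
for degrees in (10, 45, 80):
    theta = math.radians(degrees)
    r_pin = pinhole_radius(theta, focal)
    r_fish = equidistant_fisheye_radius(theta, focal)
    print(f"{degrees:>2} deg  pinhole={r_pin:8.1f}px  fisheye={r_fish:6.1f}px")
```

The pinhole radius grows without bound as the angle approaches 90 degrees, while the fisheye radius grows linearly, so a fisheye image compresses the periphery and the same pixel distance corresponds to very different scene geometry at the center and at the edge.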

[Publication]
– SlaBins: Fisheye Depth Estimation using Slanted Bins on Road Environments (ICCV 2024)

[Project]
– [Done] Development of Road Environment Perception Algorithms Using Vehicle Side Cameras (42dot)

3D Occupancy Prediction

3D Occupancy Prediction estimates a complete 3D semantic voxel map from RGB images, including occluded areas. In this task, it is crucial to accurately predict the complete 3D scene from 2D images, which lack explicit geometric information. To address the 2D–3D discrepancy caused by camera perspective projection, our research explores techniques that incorporate geometric cues, such as vanishing points, into learning frameworks to better understand the 3D scene from a single image.
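For readers unfamiliar with vanishing points: lines that are parallel in 3D (such as lane markings) meet at a single image point under perspective projection. A minimal geometric sketch, assuming two such image lines are already given by point pairs, computes that intersection with homogeneous coordinates. This is the classical construction, not the learning-based pipeline itself, and the pixel coordinates are made up.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p, q):
    """Homogeneous line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(line_a, line_b, eps=1e-12):
    """Intersection of two homogeneous lines, or None if parallel in the image."""
    x, y, w = cross(line_a, line_b)
    if abs(w) < eps:
        return None  # the lines meet at infinity
    return (x / w, y / w)


# Hypothetical left and right lane edges in a 640x480 image.
left = line_through((100.0, 400.0), (300.0, 200.0))
right = line_through((540.0, 400.0), (340.0, 200.0))
vp = vanishing_point(left, right)
```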

[Publication]
– VPOcc: Exploiting Vanishing Point for 3D Semantic Occupancy Prediction (IROS 2025)

Traversability Estimation

Traversability estimation aims to detect traversable areas in unstructured environments, enabling effective local path planning. This task is critical for autonomous navigation in complex environments, such as off-road terrain, urban streets, or indoor spaces. Unlike simple obstacle detection, traversability estimation requires a deeper understanding of the scene’s geometric structure (e.g., slopes, steps), semantic context (e.g., grass vs. pavement), and dynamic elements (e.g., moving objects, changing terrain), making it an essential capability for safe and adaptive robot behavior.
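The geometric part of this reasoning can be sketched with a simple slope test on an elevation grid. This example deliberately covers only the geometric cue (slopes and steps); semantic context and dynamic elements are out of scope. The function name, grid layout, and the 20-degree limit are assumptions chosen for illustration.

```python
import math


def traversable_mask(height_map, cell_size, max_slope_deg=20.0):
    """Mark grid cells whose local slope is at most max_slope_deg.

    height_map: 2D list of elevations in meters, indexed [row][col].
    cell_size: edge length of one grid cell in meters.
    """
    rows, cols = len(height_map), len(height_map[0])
    mask = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Central differences inside the grid, one-sided at the border.
            r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
            c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
            dz_dy = 0.0
            if r1 > r0:
                dz_dy = (height_map[r1][c] - height_map[r0][c]) / ((r1 - r0) * cell_size)
            dz_dx = 0.0
            if c1 > c0:
                dz_dx = (height_map[r][c1] - height_map[r][c0]) / ((c1 - c0) * cell_size)
            slope_deg = math.degrees(math.atan(math.hypot(dz_dx, dz_dy)))
            mask[r][c] = slope_deg <= max_slope_deg
    return mask


# A flat area with a 1 m step along the right edge (0.5 m cells).
terrain = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
mask = traversable_mask(terrain, cell_size=0.5)
```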

[Project]
– [On-going] Development of AI-Based Application Technologies for Multi-Mobility Operation (Electronics and Telecommunications Research Institute, ETRI)

3D Generation

3D Generation research develops algorithms that can create high-fidelity 3D shapes, objects, or entire scenes using generative models such as diffusion models or GANs. These methods aim to capture the diversity and realism of real-world geometry, enabling applications in computer graphics, content creation, simulation, and virtual reality. Our group focuses on advancing generative modeling techniques for high-quality and diverse 3D object and human creation. Building on off-the-shelf generative models, we design novel architectures and representations for synthesizing realistic 3D content in diverse contexts.

3D Object/Scene Generation

Object/scene generation refers to synthesizing 3D shapes (such as furniture, tools, or other artifacts) or scenes from learned data distributions. Simple methods and traditional representations like voxels and point clouds often face limitations, such as low resolution or the need for expensive post-processing for mesh extraction. Our research explores novel generation pipelines and alternative representations, such as triplanes, that can produce high-resolution results more efficiently.

[Publication]
– Diffusion-based Signed Distance Fields for 3D Shape Generation (CVPR 2023)

3D Human Generation

Human generation addresses the task of creating 3D human avatars or motions, often in interactive or dynamic contexts. It is a critical component for applications such as virtual reality, gaming and human-robot interaction. We aim to synthesize realistic 3D human poses and motions under diverse contextual constraints, e.g., interactions with static environments, dynamic interactions between multiple humans, and alignment with temporal cues.

[Publication]
– Pose-guided 3D Human Generation in Indoor Scene (AAAI 2023)
– ContactGen: Contact-Guided Interactive 3D Human Generation for Partners (AAAI 2024)

3D Reconstruction

3D Reconstruction focuses on recovering accurate and detailed 3D representations of objects or environments from partial, noisy real-world data, such as images, depth scans, or LiDAR measurements. The goal is to infer the underlying geometry and appearance, even with incomplete or ambiguous observations. This field supports a wide range of applications in robotics, autonomous systems, digital archiving, and AR/VR, and involves challenges like handling real-world noise, maintaining structural consistency, and generalizing to diverse scenes. We are actively researching tasks such as dynamic 3D reconstruction, novel view synthesis, and related topics.

Dynamic 3D Reconstruction

Dynamic 3D reconstruction aims to recover animatable 3D models of articulated subjects that change over time, such as humans and animals, from limited visual inputs like monocular images or videos. Unlike static reconstruction, it requires not only capturing shape and texture but also modeling motion-capable structures, making it significantly more challenging under sparse or occluded observations. Our research explores learning-based methods with advanced 3D representations, such as 3D Gaussian Splatting, to produce animatable, realistic reconstructions from sparse observations.

[Publication]
– DogRecon: Canine Prior-Guided Animatable 3D Gaussian Dog Reconstruction From A Single Image (IJCV 2025)

Novel View Synthesis

Novel View Synthesis (NVS) is the task of rendering photorealistic images of a scene from novel, unseen viewpoints, given one or more images captured from known camera poses. Although Neural Radiance Fields and 3D Gaussian Splatting have demonstrated photorealistic rendering, they often struggle with inaccurate initialization and inconsistent color appearance across different viewpoints. We address these limitations by developing robust scene representations and learning strategies that ensure geometric consistency and accurate appearance modeling, even in challenging real-world conditions.
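Both Neural Radiance Fields and 3D Gaussian Splatting form a pixel color by alpha-compositing contributions ordered from near to far along the viewing ray. The sketch below shows only that shared compositing step, assuming the per-sample opacities and colors are already available. How those values are produced (a neural field or projected Gaussians) is omitted, and the function name is hypothetical.

```python
def composite_ray(samples):
    """Front-to-back alpha compositing along one ray.

    samples: list of (alpha, (r, g, b)) sorted from near to far,
    with alpha in [0, 1]. Returns (color, remaining_transmittance).
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed
    for alpha, rgb in samples:
        weight = transmittance * alpha
        for k in range(3):
            color[k] += weight * rgb[k]
        transmittance *= 1.0 - alpha
    return tuple(color), transmittance


# A half-transparent red sample in front of an opaque blue one.
pixel, leftover = composite_ray([(0.5, (1.0, 0.0, 0.0)), (1.0, (0.0, 0.0, 1.0))])
```

Since each weight depends on everything in front of it, errors in geometry (where opacity is placed) directly show up as color errors, which is one reason geometric consistency and appearance modeling are tightly coupled.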

[Project]
– [Done] Industry-Academia Collaboration Project (오늘의 집)

Collaborative Robot

Collaborative Robots are designed to safely and intelligently operate alongside humans by perceiving and interacting with objects in complex 3D environments. By leveraging advanced 3D vision and perception, they can recognize, localize, and understand objects and their spatial relationships, even under clutter, occlusion, or ambiguous observations. The goal is to enable reliable and intelligent interaction with the physical world, supporting applications in manufacturing, logistics, and human-robot collaboration. This field involves challenges such as robust object detection in dynamic scenes, reasoning about occlusions, and adapting to diverse environments. We are actively researching perception-driven manipulation tasks, including vision-guided robotic grasping, multimodal policy learning, and related topics.

Multimodal Policy Learning

Multimodal policy learning focuses on integrating diverse sensory and symbolic inputs—such as vision, touch, and language—to enable robots to perceive, reason, and act in complex environments. By combining complementary modalities, robots can overcome the limitations of any single sensor, allowing for more robust perception under occlusion, ambiguity, or noise. The goal is to learn policies that generalize across tasks and environments, supporting applications in manipulation, human-robot interaction, and adaptive control. This field involves challenges such as aligning heterogeneous data, learning cross-modal representations, and ensuring real-time decision-making.

[Project]
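– [On-going] From Multimodal Perception to Intelligent Interaction: Development of Intelligent Agents Capable of Manipulation Based on Physical Exploration (National Research Foundation of Korea, NRF)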

Vision-guided Robotic Grasping

Vision-guided robotic grasping focuses on enabling robots to reliably detect, localize, and grasp objects using visual perception. By leveraging 3D vision techniques such as depth sensing, point clouds, and object segmentation, robots can reason about object geometry, pose, and spatial relationships, even under clutter, occlusion, or uncertain observations. The goal is to achieve robust and adaptive grasping across diverse environments and object types. This field faces challenges including grasping novel or deformable objects, operating in complex environments with occlusions or moving objects, and planning stable grasps under real-world noise.
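A typical first step when grasping from depth sensing is converting a depth pixel into a 3D point in the camera frame, from which point clouds and grasp candidates are built. The sketch assumes an ideal pinhole camera without lens distortion; the intrinsic values below are invented for the example.

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Convert pixel (u, v) with metric depth into a 3D camera-frame point.

    fx, fy: focal lengths in pixels; cx, cy: principal point in pixels.
    Assumes an ideal pinhole model (no lens distortion).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical intrinsics for a 640x480 depth camera.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0
point = backproject(920.0, 240.0, 2.0, FX, FY, CX, CY)
```

Applying this to every valid depth pixel yields the point cloud on which object pose and grasp geometry are then reasoned about.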

3D Vision

3D Vision is essential for truly understanding the world, capturing depth, structure, and spatial relationships that 2D images cannot convey. We develop methods to capture rich spatial information, recover detailed geometry, and understand complex structures and environments. By pushing the boundaries of 3D perception, our research lays the foundation for innovative applications across diverse fields, from immersive media to spatial intelligence.

Robot Vision

Robot Vision empowers machines to see, understand, and navigate the world with intelligence and autonomy. By combining advanced sensing, perception, and computation, we enable robots to understand complex scenes, localize themselves accurately, and make intelligent decisions in real time. Together, these capabilities form the foundation for autonomous systems that can operate reliably in dynamic, real-world environments.
