In InterLinK, we envision a novel AI-based framework that empowers smart agents and robotic systems to visually recognize and anticipate the high-level semantics of Human-Object(s) Interactions (HOI), building on deep neural networks, knowledge graphs and visual reasoning. The project pursues the following objectives:
The development of a novel, semantically enriched, hierarchical representation of HOI comprising three main components: (i) graph-based modelling of the semantics, appearance and dynamics of the human body, hand(s) and object(s) at multiple levels of abstraction, adaptive to the scale of the available observations, (ii) learning the temporal semantic structure of HOI as spatio-temporal scene graphs, and (iii) reasoning about action-object relationships using semantic information based on knowledge graphs (a minimal data-structure sketch follows this list of objectives).
The acquisition of a new dataset of visual and motion data demonstrating fine-grained HOI during activities of daily living (ADL) using household objects, captured at multiple observation scales. Rich annotations will be provided with respect to the semantics of HOI at multiple levels of abstraction, together with 2D and 3D ground-truth data on the geometry and pose of humans, hands and objects during manipulation tasks.
The development of a novel visual-semantic method for the recognition of fine-grained HOI and reasoning over long videos using Deep Learning (DL) and Knowledge Graphs (KGs).
The development of a novel vision-based method for short- and long-term prediction and anticipation of fine-grained HOI in long videos using DL and high-level semantics based on KGs.
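To make the representation described in objective (i)-(iii) above more concrete, the sketch below illustrates one possible way to organize per-frame scene graphs that carry semantic, appearance and dynamics attributes of hands and objects. It is a minimal illustration only; all class and field names are our own assumptions, not the project's actual implementation.

```python
# A minimal sketch of a spatio-temporal scene graph for HOI, assuming
# per-frame entity detections; all names here are illustrative, not
# the project's actual implementation.
from dataclasses import dataclass, field


@dataclass
class Node:
    entity: str        # e.g. "right_hand", "cup"
    semantics: str     # semantic role, e.g. "agent", "object"
    appearance: list   # appearance feature vector
    dynamics: list     # motion feature vector


@dataclass
class FrameGraph:
    t: int                                     # frame index
    nodes: dict = field(default_factory=dict)  # entity name -> Node
    edges: list = field(default_factory=list)  # (src, relation, dst) triples

    def add_relation(self, src: str, relation: str, dst: str) -> None:
        self.edges.append((src, relation, dst))


# A spatio-temporal scene graph is then the ordered sequence of frame
# graphs, with temporal links implied by shared entity names across frames.
g0 = FrameGraph(t=0)
g0.nodes["right_hand"] = Node("right_hand", "agent", [0.1], [0.0])
g0.nodes["cup"] = Node("cup", "object", [0.7], [0.0])
g0.add_relation("right_hand", "reaches_for", "cup")
video_graph = [g0]  # extend with g1, g2, ... as new observations arrive
```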
The research project is supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the
“3rd Call for H.F.R.I. Research Projects to support Post-Doctoral Researchers” (Project Number: 07678).
2023 – 2025
In collaboration with the Institute of Computer Science, Foundation for Research and Technology - Hellas (FORTH),
Computational Vision and Robotics Laboratory (CVRL) and the Information Systems Laboratory (ISL).
Konstantinos Papoutsakis is a postdoctoral researcher at the Department of Management Science and Technology of the Hellenic Mediterranean University (HMU) in Crete, and is also affiliated with the Computational Vision and Robotics Laboratory of the Institute of Computer Science at FORTH. He earned his Diploma in Computer Engineering and Informatics from the University of Patras, Greece, in 2007, and his M.Sc. and Ph.D. degrees in Computer Science from the University of Crete in 2010 and 2019, respectively. His main research interests fall in the areas of computer vision, machine learning and visual perception for robotics, with emphasis on human motion analysis, video segmentation, action segmentation and recognition, and human-computer and human-robot interaction. He has been actively involved in EU-funded projects as a Computer Vision Engineer and during his postgraduate studies. For more information: http://users.ics.forth.gr/~papoutsa/
Filippos Gouidis is a Ph.D. candidate at the University of Crete. He received a Diploma in Civil Engineering from the Aristotle University of Thessaloniki and a B.Sc. in Computer Science from the University of Crete, where he also earned M.Sc. degrees in Computer Science and Engineering and in Cognitive Sciences. His research interests focus on image classification and detection, knowledge representation and zero-shot learning.
Victoria Manousaki is a Postdoctoral Researcher at the Department of Management Science and Technology of HMU, also collaborating with the Computational Vision and Robotics Laboratory (CVRL) at FORTH. She received her Ph.D. from the Computer Science Department of the University of Crete. Her research interests lie in the topics of action/activity prediction, anticipation and recognition in human-object interactions.
Constantinos Panagiotakis is an Associate Professor and Head of the DMST at HMU, and Director of DataLab, which hosts the InterLinK project. His research interests span the areas of image and video analysis, data modelling, machine learning, 3D animation, signal processing, multimedia and pattern recognition. He has published more than 90 articles in international conferences and journals.
Dimitris Plexousakis is a Professor at the Computer Science Department (CSD) of the University of Crete (UoC), Director of FORTH-ICS and Head of the Information Systems Laboratory at FORTH-ICS. His research interests span the areas of Knowledge Representation, Knowledge Base Design, formal reasoning systems, the Semantic Web and more. He has published over 180 articles in international conferences and journals and has extensive experience in the scientific coordination of national and European research projects.
Theodore Patkos is a Principal Researcher (Grade B) at FORTH-ICS. His research activities revolve around the fields of knowledge representation and non-monotonic reasoning, contextual and common-sense reasoning, multi-agent and cognitive systems, argumentation and formal representation models for the Semantic Web. He has served as PI and as a researcher in national and European projects, has co-authored more than 55 scientific papers in peer-reviewed conferences and journals, including IJCAI, KR, TPLP, AAMAS and LPAR.
Presentation of the InterLinK project at the International Young Scientists Conference (YSC 2023) for young researchers and professionals in computational science, Artificial Intelligence, Big Data and Machine Learning, organized by ITMO University, Abu Dhabi, October 2023.
Link: https://github.com/itmo-ai/YSC-2023-Papers
V. Manousaki*, K. Bacharidis*, K. Papoutsakis and A. Argyros, "VLMAH: Visual-Linguistic Modeling of Action History for Effective Action Anticipation", In IEEE International Conference on Computer Vision Workshops (ACVR 2023), Paris, France, October 2023. (*Equal Contribution)
Abstract: Although existing methods for action anticipation have shown considerably improved performance on the predictability of future events in videos, the way they exploit information related to past actions is constrained by time duration and encoding complexity. This paper addresses the task of action anticipation by taking into consideration the history of all executed actions throughout long, procedural activities. A novel approach, termed Visual-Linguistic Modeling of Action History (VLMAH), is proposed that fuses the immediate past in the form of visual features as well as the distant past based on a cost-effective form of linguistic constructs (semantic labels of nouns, verbs, or actions). Our approach generates accurate near-future action predictions during procedural activities by leveraging information on the long- and short-term past. Extensive experimental evaluation was conducted on three challenging video datasets containing procedural activities, namely Meccano, Assembly-101 and 50Salads. The results confirm that using long-term action history improves action anticipation and enhances the state-of-the-art (SOTA) Top-1 accuracy.
Paper: Read online
Code: Visit Site
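The core fusion idea of VLMAH, combining short-term visual features with a linguistic encoding of the long-term action history, can be sketched roughly as follows. This is a minimal illustration under assumed feature dimensions and module names, not the authors' released implementation.

```python
# A minimal sketch of visual-linguistic fusion for action anticipation,
# in the spirit of VLMAH; dimensions and module names are assumptions.
import torch
import torch.nn as nn


class VisualLinguisticFusion(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=300, hid_dim=256, n_actions=50):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)    # recent visual features
        self.txt_proj = nn.Linear(txt_dim, hid_dim)    # embedded action history
        self.head = nn.Linear(2 * hid_dim, n_actions)  # next-action scores

    def forward(self, vis_feat, hist_feat):
        # Fuse short-term visual evidence with long-term linguistic history.
        fused = torch.cat([self.vis_proj(vis_feat).relu(),
                           self.txt_proj(hist_feat).relu()], dim=-1)
        return self.head(fused)


model = VisualLinguisticFusion()
vis = torch.randn(1, 512)    # e.g. pooled clip features of the immediate past
hist = torch.randn(1, 300)   # e.g. averaged word vectors of past action labels
next_action_logits = model(vis, hist)
```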
Filippos Gouidis, Konstantinos Papoutsakis, Theodore Patkos, Antonis Argyros and Dimitris Plexousakis, "Exploring the Impact of Knowledge Graphs on Zero-Shot Visual Object State Classification", In International Conference on Computer Vision Theory and Applications (VISAPP 2024), Rome, Italy.
Abstract: In this work, we explore the potential of Knowledge Graphs (KGs) towards an effective Zero-Shot Learning (ZSL) approach for Object State Classification (OSC) in images. For this problem, the performance of traditional supervised learning methods is hindered mainly by data scarcity, as they attempt to encode the highly varying visual features of a multitude of combinations of object state and object type classes (e.g. open bottle, folded newspaper). The ZSL paradigm indicates a promising alternative to enable the classification of object state classes by leveraging structured semantic descriptions acquired from external commonsense knowledge sources. We formulate an effective ZS-OSC scheme by employing a Transformer-based Graph Neural Network model and a pre-trained CNN classifier. We also investigate best practices for both the construction and integration of visually-grounded common-sense information based on KGs. An extensive experimental evaluation is reported using 4 related image datasets, 5 different knowledge repositories and 30 KGs that are constructed semi-automatically by querying known object state classes to retrieve contextual information at different node depths. The performance of vision-language models for ZS-OSC is also assessed. Overall, the obtained results suggest performance improvements for ZS-OSC models on all datasets, while both the size of a KG and the sources utilized for its construction are important for task performance.
Paper: Read online
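The central mechanism explored here, a graph network that maps semantic KG node embeddings to visual classifier weights so that unseen state classes can be scored, can be sketched as follows. Dimensions, the identity adjacency and all names are illustrative assumptions; the paper employs a Transformer-based Graph Neural Network rather than this simplified propagation.

```python
# A minimal sketch of the KG-driven zero-shot idea: a graph network maps
# semantic node embeddings to visual classifier weights; names/dims assumed.
import torch
import torch.nn as nn


class GraphProjector(nn.Module):
    """Two-step neighborhood propagation that projects KG node embeddings
    (e.g. word vectors of state classes) into CNN feature space."""
    def __init__(self, sem_dim=300, vis_dim=512):
        super().__init__()
        self.fc1 = nn.Linear(sem_dim, 512)
        self.fc2 = nn.Linear(512, vis_dim)

    def forward(self, x, adj):
        x = torch.relu(adj @ self.fc1(x))  # propagate over KG edges
        return adj @ self.fc2(x)           # predicted classifier weights


n_nodes, sem_dim = 30, 300
emb = torch.randn(n_nodes, sem_dim)   # semantic embedding per KG node
adj = torch.eye(n_nodes)              # normalized adjacency (here: self-loops)
weights = GraphProjector()(emb, adj)  # one weight vector per class node

img_feat = torch.randn(1, 512)        # feature from a pre-trained CNN
scores = img_feat @ weights.T         # zero-shot state-class scores
```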
F. Gouidis, K. Papantoniou, K. Papoutsakis, T. Patkos, A.A. Argyros and D. Plexousakis, "Fusing Domain-Specific Content from Large Language Models into Knowledge Graphs for Enhanced Zero Shot Object State Classification", In AAAI 2024 Spring Symposium on Empowering Machine Learning and Large Language Models with Domain and Commonsense Knowledge, (AAAI-MAKE), Stanford University, USA, March 2024.
Abstract: Domain-specific knowledge can significantly contribute to addressing a wide variety of vision tasks. However, the generation of such knowledge entails considerable human labor and time costs. This study investigates the potential of Large Language Models (LLMs) in generating and providing domain-specific information through semantic embeddings. To achieve this, an LLM is integrated into a pipeline that utilizes Knowledge Graphs and pre-trained semantic vectors in the context of the vision-based Zero-Shot Object State Classification task. We thoroughly examine the behavior of the LLM through an extensive ablation study. Our findings reveal that the integration of LLM-based embeddings, in combination with general-purpose pre-trained embeddings, leads to substantial performance improvements. Drawing insights from this ablation study, we conduct a comparative analysis against competing models, thereby highlighting the state-of-the-art performance achieved by the proposed approach.
Paper: Read online
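A rough sketch of the paper's key ingredient, combining LLM-derived domain-specific embeddings with general-purpose pre-trained embeddings into KG node features, is given below; shapes and names are assumptions for illustration only.

```python
# A minimal sketch of combining LLM-derived and general-purpose embeddings
# as KG node features, as the ablations suggest; all names are hypothetical.
import torch


def fuse_node_features(llm_emb: torch.Tensor, glove_emb: torch.Tensor) -> torch.Tensor:
    """Concatenate domain-specific (LLM) and general-purpose (e.g. GloVe)
    semantic vectors into a single node feature for the downstream GNN."""
    return torch.cat([llm_emb, glove_emb], dim=-1)


llm = torch.randn(30, 768)    # e.g. sentence embeddings of LLM descriptions
glove = torch.randn(30, 300)  # e.g. GloVe vectors of the class names
node_features = fuse_node_features(llm, glove)  # shape: (30, 1068)
```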
Victoria Manousaki*, Konstantinos Bacharidis*, Filippos Gouidis*, Konstantinos Papoutsakis, Dimitris Plexousakis and Antonis Argyros, "Anticipating Object State Changes", (Under Review) (*Equal Contribution)
Abstract: Anticipating object state changes in images and videos is a challenging problem whose solution has important implications in vision-based scene understanding, automated monitoring systems, and action planning. In this work, we propose the first method for solving this problem. The proposed method predicts object state changes that will occur in the near future as a result of yet unseen human actions. To address this new problem, we propose a novel framework that integrates learnt visual features, which represent recent visual information, with natural language (NLP) features that represent past object state changes and actions. Leveraging the extensive and challenging Ego4D dataset, which provides a large-scale collection of first-person perspective videos across numerous interaction scenarios, we introduce new curated annotation data for the object state change anticipation task. An extensive experimental evaluation demonstrates the efficacy of the proposed method in predicting object state changes in dynamic scenarios. The proposed work underscores the potential of integrating video and linguistic cues to enhance the predictive performance of video understanding systems and lays the groundwork for future research on the new task of object state change anticipation.
Paper: Read online
Code & Dataset: Visit Site
F. Gouidis, K. Papantoniou, K. Papoutsakis, T. Patkos, A. Argyros and D. Plexousakis, "Enabling Visual Intelligence by Leveraging Visual Object States in a Neurosymbolic Framework", In Australasian Joint Conference on Artificial Intelligence (AJCAI 2024), Melbourne, Australia, November 2024 (to appear).
Abstract: This paper investigates the potential of integrating visual object states for developing methods that address complex visual intelligence tasks, such as Long-Term Action Anticipation (LTAA), and proposes that this should be achieved with the aid of a Neurosymbolic (NeSy) framework. We consider that this approach could offer significant advancements in applications requiring nuanced understanding and anticipation of future scenarios, and could serve as an inspiration for the further development of NeSy methods exhibiting Visual Intelligence.
Paper: Read online
F. Gouidis, K. Papantoniou, K. Papoutsakis, T. Patkos, A.A. Argyros and D. Plexousakis, "LLM-aided Knowledge Graph Construction for Zero-Shot Visual Object State Classification", In IEEE International Conference on Pattern Recognition Systems (ICPRS), University of Westminster, London, UK, July 2024.
Abstract: The problem of classifying the states of objects using visual information holds great importance in both applied and theoretical contexts. This work focuses on the special case of Zero-Shot Object-Agnostic State Classification (ZS-OaSC). To tackle this problem, we introduce an innovative strategy that capitalizes on the capabilities of Graph Neural Networks to learn to project semantic embeddings into visual space, and on the potential of Large Language Models (LLMs) to provide rich content for constructing Knowledge Graphs (KGs). Through a comprehensive ablation study, we explore the synergies between LLMs and KGs, uncovering critical insights about their integration in the context of the ZS-OaSC problem. Our proposed methodology is rigorously evaluated against current state-of-the-art (SoA) methods, demonstrating superior performance on various image datasets.
Paper: Read online
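The LLM-aided KG construction step can be pictured roughly as below. The query_llm function is a hypothetical stand-in for an actual LLM prompt returning commonsense triples, and the triples it returns here are dummy placeholders, not real model output.

```python
# A minimal sketch of LLM-aided KG construction: query an LLM for triples
# about each state class and add them to a graph. `query_llm` is a
# hypothetical stand-in for a real chat-completion call.
import networkx as nx


def query_llm(state_class: str) -> list[tuple[str, str, str]]:
    # Placeholder: in practice, prompt an LLM, e.g.
    # "List commonsense relations of the object state '<state_class>' as triples."
    return [(state_class, "related_to", "container"),
            (state_class, "caused_by", "open action")]


def build_kg(state_classes: list[str]) -> nx.DiGraph:
    kg = nx.DiGraph()
    for cls in state_classes:
        for head, rel, tail in query_llm(cls):
            kg.add_edge(head, tail, relation=rel)  # one edge per triple
    return kg


kg = build_kg(["open", "closed", "folded"])
print(kg.number_of_nodes(), kg.number_of_edges())
```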