Welcome to MLDAS 2023!
October 23-24, 2023
MACHINE LEARNING AND DATA ANALYTICS SYMPOSIUM
MLDAS 2023 is dedicated to fostering connections between researchers, practitioners, students, and industry experts in the fields of machine learning and data science. Our aim is to bridge the gap between cutting-edge academic insights and the practical needs of industry.
This year’s program focuses on fundamental research in risk-informed decision making and time series modeling. We will also cover applications and challenges of AI in aviation, education, sports analytics, and recommender systems, as well as the technical hurdles associated with edge computing.
In addition to the technical talks, MLDAS 2023 will feature a panel discussion on responsible AI, where experts will tackle critical ethical and technical considerations in the field of machine learning and data science.
Location: Multipurpose Room, HBKU Research Complex
Organization: Qatar Computing Research Institute (QCRI), HBKU and Boeing Research & Technology.
The symposium is co-chaired by Dragos Margineantu (Boeing), Sanjay Chawla (QCRI, HBKU) and Safa Messaoud (QCRI, HBKU).
Local and Registration Chair: Keivin Isufaj (QCRI, HBKU)
Participation in MLDAS is free. Please fill out this form for in-person attendance.
Day 1
Sanjay Chawla | Session Chair
09:00 – 09:45
The good, the bad and the ugly truth about AI in education
Abstract : I will take you on a journey introducing one possible vision for the future of education, with practical examples of how it could be achieved and what role AI could play in achieving it.
09:45 – 10:30
Learning and Reasoning: A Soccer Analytics Story
Abstract : This talk will discuss our journey to develop novel ways to quantify the performance of professional soccer players and our struggles with how to evaluate the models that power our novel metrics. I will start from the motivation of data-driven scouting, where I will highlight why traditional statistics such as goals, assists, and pass completion percentage are insufficient for evaluating player performance. I will then present our approaches for assessing a player’s contributions to a match’s goal difference, where our conceptual framework has been adopted by most major data providers, and for measuring the creativity of their passing. A key challenge with these models is to ensure that practitioners will trust them. Concretely, we must ensure that the learned models do not display any unwanted or non-intuitive behavior. This talk will argue that the solution to this problem is to develop techniques that are able to reason about a learned model’s behavior. Moreover, I will advocate that using such approaches is a key part of evaluating learning pipelines, regardless of the problem domain, because it can help debug learned models and the data used to train them. I will present two generic approaches for gaining insight into how any tree ensemble will behave. First, I will discuss an approach for verifying whether a learned tree ensemble exhibits a wide range of behaviors. Second, I will describe an approach that identifies whether the tree ensemble is at a heightened risk of making a misprediction in a post-deployment setting.
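As a toy illustration of the kind of reasoning such techniques perform (a sketch under my own assumptions, not the speaker's actual method), the reachable outputs of a single decision tree over a box of inputs can be bounded by descending only the feasible branches:

```python
# Illustrative sketch: bound the outputs a decision tree can produce when
# its input is anywhere inside a box (hyper-rectangle) of feature values.
# The tuple encoding of the tree is a simplification of my own.
def output_range(tree, box):
    """tree: ('leaf', value) or ('split', feature, threshold, left, right);
    box: list of (lo, hi) intervals, one per feature."""
    if tree[0] == "leaf":
        return tree[1], tree[1]
    _, feat, thr, left, right = tree
    lo, hi = box[feat]
    ranges = []
    if lo <= thr:            # left branch (x[feat] <= thr) is reachable
        ranges.append(output_range(left, box))
    if hi > thr:             # right branch (x[feat] > thr) is reachable
        ranges.append(output_range(right, box))
    return min(r[0] for r in ranges), max(r[1] for r in ranges)
```

Extending this interval propagation to sums over ensemble members gives a (loose but sound) over-approximation of the ensemble's behavior on the whole box, which is the flavor of question a verifier answers.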
Dragos Margineantu | Session Chair
11:00 – 11:45
How to develop safe AI/ML for Aviation?
Abstract : Machine learning holds great promise for improving the performance and safety of aviation. However, the data-driven nature and large number of parameters of these systems make it difficult to guarantee their safety. In this talk we will present some recent successes in applying machine learning to several aviation tasks, discuss the challenges, and describe how to develop assurance for learning-enabled algorithms in aviation. We will demonstrate results from an “AI” pilot, detect-and-avoid, and intent prediction.
11:45 – 12:30
Recommendation systems: Challenges and solutions
Abstract : In this talk, I will present machine learning solutions for three specific challenges in recommendation systems:
• Node recommendations in directed graphs: Given a directed graph, the problem is to recommend the top-k nodes with the highest likelihood of a link from a query node. We enhance GNNs with dual embeddings and propose adaptive neighborhood sampling techniques to handle asymmetric recommendations.
• Delayed feedback: The problem is to train an ML model in the presence of target labels that may change over time due to delayed feedback of user actions. We employ an importance sampling strategy to deal with delayed feedback – the strategy corrects the bias in both target labels and feature computation, and leverages pre-conversion signals such as clicks.
• Uncertainty in model predictions: For binary classification problems, we show that we can leverage uncertainty estimates for model predictions to improve accuracy. Specifically, we propose algorithms to select decision boundaries with multiple threshold values on model scores, one per uncertainty level, to increase recall without hurting precision.
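The third idea can be sketched as follows. In this sketch (function and parameter names such as `select_thresholds` and `min_precision` are my own, not from the talk), we bucket examples by the model's uncertainty estimate and, per bucket, pick the lowest score threshold that still meets a precision target on held-out data; lower thresholds in the reliable buckets raise recall without sacrificing overall precision:

```python
# Hypothetical sketch of per-uncertainty-level decision thresholds for a
# binary classifier; not the speaker's actual algorithm.
import numpy as np

def select_thresholds(scores, uncertainties, labels, n_buckets=3, min_precision=0.9):
    """Split held-out examples into uncertainty buckets and, in each bucket,
    choose the lowest score threshold whose precision meets min_precision."""
    edges = np.quantile(uncertainties, np.linspace(0, 1, n_buckets + 1))
    thresholds = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (uncertainties >= lo) & (uncertainties <= hi)
        s, y = scores[mask], labels[mask]
        best = 1.0  # fall back to the most conservative threshold
        for t in np.unique(s):        # sorted ascending
            pred = s >= t
            if pred.sum() and y[pred].mean() >= min_precision:
                best = t              # first feasible t is the lowest
                break
        thresholds.append(best)
    return edges, thresholds
```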
Amin Sadeghi | Session Chair
14:00 – 14:45
CAMEL: Communicative Agents for “Mind” Exploration of Large Scale Language Model Society
Abstract : The rapid advancement of conversational and chat-based language models has led to remarkable progress in complex task-solving. However, their success heavily relies on human input to guide the conversation, which can be challenging and time-consuming. This paper explores the potential of building scalable techniques to facilitate autonomous cooperation among communicative agents and provide insight into their “cognitive” processes. To address the challenges of achieving autonomous cooperation, we propose a novel communicative agent framework named role-playing. Our approach involves using inception prompting to guide chat agents toward task completion while maintaining consistency with human intentions. We showcase how role-playing can be used to generate conversational data for studying the behaviors and capabilities of chat agents, providing a valuable resource for investigating conversational language models. Our contributions include introducing a novel communicative agent framework, offering a scalable approach for studying the cooperative behaviors and capabilities of multi-agent systems, and open-sourcing our library to support research on communicative agents and beyond. More details of the CAMEL project can be found here.
14:45 – 15:30
Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models
Abstract : I will discuss Jais and Jais-chat, two state-of-the-art Arabic-centric foundation and instruction-tuned open generative large language models (LLMs). The models are based on the GPT-3 decoder-only architecture and are pretrained on a mixture of Arabic and English texts, including source code in various programming languages. With 13 billion parameters, they demonstrate better knowledge and reasoning capabilities in Arabic than previous open Arabic and multilingual models by a sizable margin, based on extensive evaluation. Moreover, the models are competitive in English compared to English-centric open models of similar size, despite being trained on much less English data. I will discuss the training, the tuning, the safety alignment, and the evaluation, as well as the lessons we learned.
16:00 – 16:45
Uncertainty in Compositional Models
Abstract : Deep learning studies functions represented as compositions of other functions, 𝑓 = 𝑓L ○ … ○ 𝑓1. While there is ample evidence that this type of structure is beneficial for algorithmic design, there are significant questions about whether the same is true when it is used to build statistical models. In this talk I will try to highlight some of the issues that are inherent to compositional functions. I will talk about the identifiability issues that, while beneficial for predictive algorithms, become challenging when building models. Rather than a talk providing solutions, my aim is to highlight some issues related to compositional function modelling and to stimulate a discussion around these topics. I will, however, provide some initial results on compositional uncertainty to highlight some of the paths that we are currently exploring.
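The identifiability issue can be seen numerically in a few lines. This sketch uses linear layers of my own choosing (the talk is more general): inserting an invertible matrix and its inverse between the layers changes the parameters but leaves the composed function unchanged, so the data alone cannot identify the internal parameters.

```python
# Two different parameterisations of f = f2 ∘ f1 that compute the same
# function; a simplified linear illustration, not from the talk.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))   # parameters of f1
W2 = rng.standard_normal((1, 3))   # parameters of f2

def f(x, A=np.eye(3)):
    # Insert A and its inverse between the layers: f2 ∘ A⁻¹ ∘ A ∘ f1.
    h = A @ (W1 @ x)                    # reparameterised first layer
    return (W2 @ np.linalg.inv(A)) @ h  # compensating second layer

x = rng.standard_normal(2)
A = rng.standard_normal((3, 3))  # any invertible matrix works
assert np.allclose(f(x), f(x, A))  # same function, different internals
```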
16:45 – 17:30
Data-driven learning and control for operational resilience in large-scale networked cyberphysical systems
Abstract : Networked cyberphysical systems such as infrastructure networks, supply chains, and social networks are central to our lives. Yet, they often fail catastrophically when faced with unexpected disturbances or extreme events that push these networks far from equilibrium. Further, traditional control techniques do not scale well to such large-scale networks due to complexities arising from the network size, and multi-layered dynamical interactions between the physical networks, computing, communication, and human participants. On the other hand, purely data-driven and learning-based approaches to operating these networks do not provide guarantees on stability, safety, and robustness that are crucial in such safety-critical systems. In this talk, I will present frameworks that bridge data-driven models and learning-based control algorithms with domain-specific properties drawn from network physics, to guarantee operational resilience of large-scale networked dynamical systems under large disturbances. Specifically, I will discuss (i) physics-informed approaches to rapidly learn models of these systems that capture control-relevant properties like dissipativity, and (ii) scalable, compositional, and risk-tunable learning-based control designs that leverage these properties to provably guarantee operational resilience.
Day 2
Safa Messaoud | Session Chair
08:30 – 09:15
Risk Informed Decisions
Abstract : Scientific discovery is an interplay between observation and experimentation, and this talk looks at how machine learning can guide scientists towards better experiments. We discuss our experience in CSIRO, where we are researching, developing, and applying machine learning for scientific discovery. We consider the goal of designing an experiment such that the measured output is maximised, and illustrate it with an example from genome biology. Many approaches to adaptive experimental design trade off exploration and exploitation by considering the risk or uncertainty of predictive models, hence it is important to expand the class of efficient predictive distributions. We briefly cover some recent work on a flexible class of probability densities, called squared neural families, which have closed form normalization. We conclude by discussing opportunities and challenges in machine learning for scientific discovery.
09:15 – 10:00
Robust Learning Ideas for AI Engineering
Abstract : TBD
10:30 – 11:15
Towards Formal Verification and Robustification of Neural Systems in Aviation
Abstract : A major challenge in moving ML-based systems, such as ML-based computer vision, from R&D to production is the difficulty in understanding and ensuring their performance on the operational design domain. The standard ML approach is to extensively test models for various inputs. However, testing is inherently limited in coverage, and it is expensive in aviation. In this talk I will present novel verification technologies developed at Imperial College London as part of the recently concluded DARPA Assured Autonomy program and other UK- and EU-funded efforts.
Verification methods provide guarantees that a model meets its specifications in dense neighbourhoods of selected inputs. For example, by using verification methods we can establish whether a model is robust with respect to infinitely many noise patterns, or infinitely many lighting perturbations, applied to an input. Verification methods can also be tailored to specifications in the latent space, establishing the robustness of models against semantic perturbations not definable in the input space (3D pose changes, background changes, etc.). Additionally, verification methods can be paired with learning to obtain robust learning methods capable of generating models inherently more robust than those that may be derived with standard methods.
In the presentation I will succinctly cover the key theoretical results leading to some of the existing ML verification technology, illustrate the resulting toolsets and capabilities, and describe some of the use cases developed with our colleagues at Boeing, including centerline distance estimation, object detection, and runway detection.
I will argue that verification and robust learning can be used to obtain models that are inherently more robust, more performant, and better understood than those produced by present learning and testing approaches.
11:15 – 12:00
Interpretable AI for scientific discovery using symbolic regression
Abstract : We overview the emerging area of symbolic regression (SR) for discovering concise mathematical expressions directly from data. Mathematical expressions are directly interpretable and are not only good predictors but can also be used for inferring causal behavior. SR reduces to discovering a unary-binary tree of mathematical symbols that is compatible with the data. We overview the current state-of-the-art techniques, including the use of transformers for SR. We will conclude by highlighting the current shortcomings of SR and suggesting directions for future research.
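To make the unary-binary tree representation concrete, here is a minimal sketch (the node layout and the small operator set are my own illustrative choices) of building and evaluating such a tree for the expression sin(x) + 2x:

```python
# Minimal unary-binary expression tree, of the kind searched over in
# symbolic regression; a simplified illustration, not a full SR system.
import math
from dataclasses import dataclass

@dataclass
class Node:
    op: str                  # 'x', 'const', unary ('sin', 'exp'), binary ('+', '*')
    value: float = 0.0       # used only when op == 'const'
    left: "Node" = None
    right: "Node" = None     # used only by binary operators

def evaluate(node, x):
    if node.op == "x":
        return x
    if node.op == "const":
        return node.value
    if node.op in ("sin", "exp"):                 # unary: one child
        fn = {"sin": math.sin, "exp": math.exp}[node.op]
        return fn(evaluate(node.left, x))
    l, r = evaluate(node.left, x), evaluate(node.right, x)
    return {"+": l + r, "*": l * r}[node.op]      # binary: two children

# Tree for sin(x) + 2*x:
expr = Node("+",
            left=Node("sin", left=Node("x")),
            right=Node("*", left=Node("const", 2.0), right=Node("x")))
```

An SR system searches the space of such trees (by genetic programming, or by a transformer decoding the tree token by token) for one whose evaluations fit the data.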
Ferda Ofli | Session Chair
13:30 – 14:15
Frontiers of Foundation Models for Time Series Modeling and Analysis
Abstract : Recent developments in deep learning have spurred research advances in time series modeling and analysis. Practical applications of time series raise a series of new challenges, such as multi-resolution data, multimodality, missing values, distribution shift, and interpretability. In this talk, I will discuss possible paths to foundation models for time series data and future directions for time series research.
14:15 – 15:00
Securing edge workloads for mission critical applications
Abstract : Today, with its near-infinite resources, cloud computing can be used for a large number of use cases and scenarios. The value of the public cloud has been realized by many organizations and businesses; however, there are use cases that require cloud services to have near-real-time response with limited or intermittent dependency on the public internet. This introduces the notion of edge computing, where parts of cloud services (aka workloads) are moved to on-prem infrastructure so that these workloads can run with limited or intermittent connectivity to the cloud. One prominent use case for edge computing is the Industrial IoT (IIoT), which consists of internet-connected machinery and advanced analytics platforms executing AI and ML workloads and processing data close to where it is produced. Additionally, edge technologies hold a lot of promise for a diverse range of industries, including agriculture, healthcare, financial services, retail, and advertising. In this talk I will highlight some of the work Microsoft is doing across multiple verticals to achieve this goal. I will also present some of the hard technical and research challenges that remain to be solved to make edge computing for mission-critical applications a reality.
15:00 – 15:45
Beyond Traditional Threat Hunting: Leveraging Deep Learning for Log Analysis
Abstract : This presentation delves into the transformative potential of deep learning-based AI techniques in cybersecurity, particularly highlighting the complexities of threat hunting. Identifying threat behaviors within computer systems remains a crucial yet complex task, largely because the process is expert-driven, labor-intensive, and prone to errors. To address this, we will introduce a system designed to search for and pinpoint known threat behaviors within extensive system security logs. Our methodology involves converting security logs into a graph representation that captures the temporal and causal relations between different types of system entities, such as processes, files, and network sockets. To search for threat behaviors, our system harnesses graph neural networks, enabling efficient search over expansive graphs. Complementing this, our system draws upon the capabilities of advanced language models to convert textual descriptions of threat behaviors into query behavior graphs.
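The first step, turning log events into a graph over system entities, can be sketched as follows (the event fields and entity names here are invented for illustration; the actual system's log schema and graph encoding will differ):

```python
# Hypothetical sketch: convert security-log events into a provenance graph
# whose nodes are system entities (processes, files, sockets) and whose
# edges record the action and its timestamp.
events = [
    {"src": "proc:word.exe", "action": "write",   "dst": "file:payload.dll",    "t": 1},
    {"src": "proc:word.exe", "action": "connect", "dst": "socket:10.0.0.5:443", "t": 2},
    {"src": "proc:rundll32", "action": "read",    "dst": "file:payload.dll",    "t": 3},
]

def build_graph(events):
    graph = {}  # adjacency list: entity -> [(neighbor, action, timestamp)]
    for e in events:
        graph.setdefault(e["src"], []).append((e["dst"], e["action"], e["t"]))
        graph.setdefault(e["dst"], [])  # make sure every entity is a node
    return graph
```

A graph neural network would then embed the nodes of this provenance graph so that subgraphs matching a query behavior graph can be retrieved efficiently.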