Workshop on Risk and Robust Markov Decision Processes and Reinforcement Learning
May 26, 2025, Sun Yat-sen University, Guangzhou
Room M101, Shansi Building (善思堂), School of Business, Sun Yat-sen University South Campus (No. 135 Xingang West Road)
I. Agenda:
Time | Session | Speaker | Chair
9:00-9:10 | Opening remarks | Li Xia (Sun Yat-sen University, Professor) |
9:10-9:55 | Practicable robust Markov decision processes | Huan Xu (Shanghai Jiao Tong University, Professor) |
9:55-10:40 | Efficient algorithms for robust Markov decision processes with s-rectangular ambiguity sets | Clint Chin Pang Ho (City University of Hong Kong, Assistant Professor) |
10:40-11:00 | Tea break | |
11:00-11:45 | On the foundation and tractability of robust reinforcement learning | Nian Si (Hong Kong University of Science and Technology, Assistant Professor) | Junyu Zhang (Sun Yat-sen University, Associate Professor)
12:00-14:00 | Lunch | |
14:00-14:45 | Optimizing global performance metrics in deep reinforcement learning | Paul Weng (Duke Kunshan University, Associate Professor) | Yonghui Huang (Sun Yat-sen University, Associate Professor)
14:45-15:30 | The role of mixed discounting in risk-averse sequential decision-making | Wenjie Huang (The University of Hong Kong, Assistant Professor) | Haoran Wu (Sun Yat-sen University, Assistant Professor)
15:30-16:15 | Risk-sensitive Markov decision processes with mean-variance optimization | Li Xia (Sun Yat-sen University, Professor) | Shuai Ma (Qiyuan Lab, Researcher)
II. Talks:
Title: Practicable robust Markov decision processes
Abstract: The Markov decision process (MDP) is a standard modeling tool for sequential decision making in dynamic and stochastic environments. When the model parameters are subject to uncertainty, the "optimal strategy" obtained from an MDP can significantly underperform the model's prediction. To address this, robust MDPs, based on worst-case analysis, have been developed. However, several restrictions of the robust MDP model prevent its practical success, which I will address in this talk. The first restriction of the standard robust MDP is that its modeling of uncertainty is inflexible and can lead to conservative solutions. In particular, it requires the uncertainty set to be "rectangular", i.e., a Cartesian product of per-state uncertainty sets. To lift this assumption, we propose an uncertainty model, which we call "k-rectangular", that generalizes the concept of rectangularity, and we show that it can be solved efficiently via state augmentation. The second restriction is that it does not account for learning, i.e., how to adapt the model efficiently to reduce uncertainty. To address this, we devise an algorithm inspired by reinforcement learning that, without knowing the true uncertainty model, adapts its level of protection to the uncertainty and, in the long run, performs as well as the minimax policy computed with knowledge of the true uncertainty model. Indeed, the algorithm achieves regret bounds similar to those of standard MDPs in which no parameter is adversarial, showing that robust learning can handle uncertainty in MDPs at virtually no extra cost.
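To make the worst-case analysis concrete, here is a minimal, self-contained Python sketch of robust value iteration under the classical sa-rectangular (per state-action) assumption, using an L1 ball around a nominal kernel as the uncertainty set. The function names, the L1 set, and the tiny example are illustrative assumptions; the k-rectangular model and the learning algorithm of the talk go beyond this baseline.

```python
import numpy as np

def worst_case_value(p_hat, v, kappa):
    """Greedy solution of min_p p.v over {p in simplex : ||p - p_hat||_1 <= kappa}:
    shift mass onto the lowest-value state, taking it from the highest-value states."""
    p = np.asarray(p_hat, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    i_min = int(np.argmin(v))
    # An L1 budget of kappa allows moving kappa/2 of mass, capped by headroom.
    eps = min(kappa / 2.0, 1.0 - p[i_min])
    p[i_min] += eps
    for i in np.argsort(v)[::-1]:          # remove mass from the best states first
        if i == i_min:
            continue
        take = min(eps, p[i])
        p[i] -= take
        eps -= take
        if eps <= 1e-12:
            break
    return float(p @ v)

def robust_bellman_backup(V, P_hat, R, gamma, kappa):
    """One sweep of sa-rectangular robust value iteration:
    V(s) <- max_a [ R[s,a] + gamma * min_{p in U(s,a)} p.V ]."""
    n_states, n_actions = R.shape
    V_new = np.empty(n_states)
    for s in range(n_states):
        V_new[s] = max(R[s, a] + gamma * worst_case_value(P_hat[s, a], V, kappa)
                       for a in range(n_actions))
    return V_new

# Tiny synthetic example: 3 states, 2 actions.
rng = np.random.default_rng(0)
P_hat = rng.dirichlet(np.ones(3), size=(3, 2))   # nominal kernel, shape (S, A, S)
R = rng.uniform(size=(3, 2))                     # rewards, shape (S, A)
V = np.zeros(3)
for _ in range(200):                             # iterate the robust Bellman operator
    V = robust_bellman_backup(V, P_hat, R, gamma=0.9, kappa=0.2)
```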
Huan Xu is a tenured professor in the Department of Management Science, Antai College of Economics and Management, Shanghai Jiao Tong University. He received his bachelor's degree from the Department of Automation at Shanghai Jiao Tong University, his master's degree in electrical engineering from the National University of Singapore, and his Ph.D. in electrical engineering from McGill University, Canada. He previously held faculty positions in industrial and systems engineering at the National University of Singapore and at the Georgia Institute of Technology, and worked at Alibaba Group from 2018 to 2024. His research spans machine learning/artificial intelligence and operations research/management science, with robustness as a core tool, focusing in particular on the interface between the two fields. He has published more than 100 papers in leading international journals such as Operations Research, Mathematics of Operations Research, and the Journal of Machine Learning Research, and at top AI conferences such as ICML and NeurIPS, with over 10,000 citations in total. He has been included in Elsevier's list of the world's top 2% scientists for several consecutive years.
Title: Efficient algorithms for robust Markov decision processes with s-rectangular ambiguity sets
Abstract: Robust Markov decision processes (MDPs) have recently attracted significant interest due to their ability to protect MDPs from the poor out-of-sample performance caused by ambiguity. In contrast to classical MDPs, which account for stochasticity by modeling the dynamics through a stochastic process with a known transition kernel, a robust MDP optimizes in view of the most adverse transition kernel from an ambiguity set constructed from historical data. In this work, we develop a unified solution framework for a broad class of robust MDPs with s-rectangular ambiguity sets, where the most adverse transition probabilities are considered independently for each state. Using our algorithms, we show that s-rectangular robust MDPs with 1- and 2-norm or φ-divergence ambiguity sets can be solved several orders of magnitude faster than with state-of-the-art commercial solvers, and often only a logarithmic factor slower than classical MDPs. We demonstrate the favorable scaling properties of our algorithms on a range of synthetically generated as well as standard benchmark instances.
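For reference, the s-rectangular robust Bellman operator can be sketched in a standard form as below; the notation is generic and the particular norm or divergence defining the set U_s is left abstract rather than taken from the talk.

```latex
% s-rectangular robust Bellman operator: the adversary picks one joint
% perturbation (p_a)_{a \in A} per state s, so the worst case couples
% the actions and the optimal policy may need to randomize.
(TV)(s) \;=\; \max_{\pi_s \in \Delta(A)} \;\min_{(p_a)_{a \in A} \in \mathcal{U}_s}
\;\sum_{a \in A} \pi_s(a) \left[ r(s,a) + \gamma\, p_a^{\top} V \right]
```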
Clint Chin Pang Ho (何展鹏) is an Assistant Professor in the Department of Data Science at the City University of Hong Kong. Before that, he was a Junior Research Fellow at Imperial College Business School. Clint received a BS in Applied Mathematics from the University of California, Los Angeles (UCLA), an MSc in Mathematical Modeling and Scientific Computing from the University of Oxford, and a PhD in computational optimization from Imperial College London.
Clint's current research focuses on decision making under uncertainty. He studies optimization algorithms and computational methods for structured problems, as well as their applications in machine learning and operations research.
Title: On the foundation and tractability of robust reinforcement learning
Abstract: The main theme of this talk is to investigate the existence or absence of the dynamic programming principle (DPP).
In the first part, we focus on rectangular uncertainty sets and develop a comprehensive modeling framework for distributionally robust Markov decision processes (DRMDPs). This framework requires the decision maker to optimize against the worst-case distributional shift induced by an adversary. By unifying and extending existing formulations, we rigorously construct DRMDPs that accommodate various modeling attributes for both the decision maker and the adversary. These attributes include different levels of adaptability granularity, ranging from history-dependent to Markov and Markov time-homogeneous dynamics. We further explore the flexibility of adversarial shifts by examining SA- and S-rectangularity. Within this DRMDP framework, we analyze conditions under which the DPP holds or fails, systematically studying different combinations of decision-maker and adversary attributes.
In the second part, we extend our analysis beyond rectangular uncertainty sets and introduce the notion of tractability. Surprisingly, we show that, in full generality—without any assumptions on instantaneous rewards—rectangular uncertainty sets are the only tractable models. Our analysis further reveals that existing non-rectangular models, including R-rectangular uncertainty and its generalizations, are only weakly tractable. A key insight underlying our results is the novel simultaneous solvability property, which we identify as central to several fundamental properties of robust MDPs, including the existence of stationary optimal policies and dynamic programming principles. This property enables a unified approach to analyzing the tractability of all uncertainty models, whether rectangular or non-rectangular.
This talk is based on two papers:
https://arxiv.org/abs/2311.09018 and https://arxiv.org/abs/2411.08435.
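As background for the DPP discussion, the sa-rectangular case, where a state-wise robust Bellman recursion does hold, can be sketched as follows; the notation here is generic and is not taken from the two papers.

```latex
% sa-rectangular DRMDP: the adversary chooses p_{s,a} independently for
% each state-action pair, and the robust value function satisfies a DPP:
V^{\ast}(s) \;=\; \max_{a \in A} \;\min_{p_{s,a} \in \mathcal{P}_{s,a}}
\left[ r(s,a) + \gamma \sum_{s'} p_{s,a}(s')\, V^{\ast}(s') \right]
% Without rectangularity the inner minimization couples the states, and
% such a state-wise recursion generally fails.
```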
Nian Si (司念) is an assistant professor at HKUST IEDA. He was a postdoctoral principal researcher at the University of Chicago Booth School of Business, working with Professor Baris Ata. He obtained his Ph.D. from the Department of Management Science and Engineering (MS&E) at Stanford University, where he was advised by Professor Jose Blanchet and worked closely with Professor Ramesh Johari; he was a member of the Stanford Operations Research Group. Previously, he obtained a B.A. in Economics and a B.S. in Mathematics and Applied Mathematics, both from Peking University, in 2017.
His research lies at the interface of operations research, statistics, machine learning, and economics. He is also interested in real-world operational problems arising from online platforms, including A/B tests, recommendation systems, online advertising, cloud computing, AI, etc.
Title: Optimizing global performance metrics in deep reinforcement learning
Abstract: Deep reinforcement learning (DRL) is a generic and powerful machine learning approach that has delivered promising results in application domains ranging from robotics to combinatorial optimization. However, applying DRL to real-life problems is not straightforward, since it requires translating the task objective into a DRL objective, and the two objectives may not be well aligned. To tackle this issue, we present several novel techniques that extend DRL to directly optimize the global performance metrics describing the task objective.
Paul Weng is a tenured associate professor at Duke Kunshan University. Before joining Duke Kunshan, he was an associate professor at the University of Michigan-Shanghai Jiao Tong University Joint Institute. He has also held regular or visiting faculty positions at several universities (Sorbonne University, Carnegie Mellon University, Sun Yat-sen University, and the University of Nottingham Ningbo China).
Before joining academia, he was a financial quantitative analyst in London, UK. As a researcher, he regularly publishes in top AI and machine learning venues (e.g., IJCAI, AAAI, ICML…). He has served as an area chair at AAAI and ECAI. Several of his papers received a best paper award (e.g., MIWAI, ALA). His work has been funded both by public funding agencies (NSFC, Shanghai NSF) and private companies (Yahoo, Huawei, Netease).
His main research work lies in artificial intelligence (AI) and machine learning. Notably, it focuses on adaptive control (reinforcement learning, Markov decision process), multi-objective optimization (compromise programming, fair optimization), and preference handling (representation, elicitation, and learning).
Title: The role of mixed discounting in risk-averse sequential decision-making
Abstract: This work proposes a new, principled, constructive model for risk preference mapping in infinite-horizon cash flow analysis. The model prescribes actions that account for both traditional discounting, which scales future incomes, and a random interruption time for the cash flow. Data from an existing field experiment provides evidence supporting the use of the proposed mixed discounting model in place of the more traditional one for a significant proportion of participants, i.e., 30% of them; this proportion climbs above 80% when the use of more reasonable discount factors is enforced. On the theoretical side, we shed light on properties of the new preference model, establishing conditions under which the infinite-horizon risk is finite, as well as conditions under which the mixed discounting model is either equivalent to, or provides a bound on, the risk perceived under the traditional approach. Finally, an illustrative example on an optimal stopping problem shows the impact of employing the mixed discounting model on the optimal threshold policy.
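As a rough illustration of the idea, under assumptions of my own (an interruption time independent of the rewards, with the risk-neutral evaluation shown only as a special case), mixed discounting combines a discount factor gamma with a random stopping of the cash flow:

```latex
% Mixed discounting: scale future incomes by gamma^t and cut the cash
% flow off at a random interruption time tau (independent of the r_t).
\rho\!\left( \sum_{t=0}^{\tau-1} \gamma^{t} r_{t} \right),
\qquad
\mathbb{E}\!\left[ \sum_{t=0}^{\tau-1} \gamma^{t} r_{t} \right]
 \;=\; \sum_{t \ge 0} \gamma^{t}\, \Pr(\tau > t)\, \mathbb{E}[r_{t}]
% e.g., a geometric tau with survival probability lambda acts as an
% extra discount factor, gamma_eff = gamma * lambda.
```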
Dr. Wenjie Huang is an Assistant Professor (Research) in the Department of Data and Systems Engineering and the Institute of Data Science at The University of Hong Kong. He received his bachelor's degree in industrial engineering and management from Shanghai Jiao Tong University in 2014 and his Ph.D. in industrial systems engineering and management from the National University of Singapore in 2019, advised by Professors William B. Haskell and Tang Loon Ching. From 2019 to 2021, he was a joint postdoctoral researcher at the School of Data Science, The Chinese University of Hong Kong (Shenzhen), and the Group for Research in Decision Analysis (GERAD) in Montreal, Canada, working with Professors Erick Delage and Zizhuo Wang. His research covers decision-making under uncertainty, data-driven decision-making, and the theory of sequential decision-making, with applications in operations management and sustainability. His research has been funded by the Hong Kong Research Grants Council, the National Natural Science Foundation of China, and the National Research Foundation of Singapore.
Title: Risk-sensitive Markov decision processes with mean-variance optimization
Abstract: With the rapid development of AI technologies such as AlphaGo and large language models, reinforcement learning has attracted growing attention from both academia and industry. The theoretical foundation of reinforcement learning is the Markov decision process (MDP). Most existing optimization algorithms take the expectation of the cumulative discounted reward, a random variable, as the optimization objective. When optimizing higher-order statistics, or even the probability distribution, of this random reward, fundamental difficulties arise: the Bellman optimality equation no longer holds and the dynamic programming principle fails, so new methodologies are needed. The speaker has long studied the theory of risk-metric optimization in MDPs and reinforcement learning, developing a fairly systematic methodology from the perspective of sensitivity-based optimization, with dynamic optimization methods for a series of risk metrics including variance, mean absolute deviation, VaR, CVaR, and the Sharpe ratio. Variance can characterize risk, safety, stability, and fairness in stochastic dynamic systems. This talk focuses on the variance metric and presents the speaker's recent results on mean-variance optimization in MDPs. The main idea is to transform the variance optimization problem into a bilevel optimization form: for the long-run average variance of steady-state processes, a two-level policy-iteration-type algorithm is proposed; for the mean-variance optimization of finite-horizon cumulative rewards, an algorithm combining policy iteration and backward induction is developed via state augmentation. When applied to the multi-period portfolio mean-variance optimization problem in financial engineering, the method recovers the classical results of Li & Ng (2000), but it also applies to a much broader range of mean-variance MDP problems, such as controlling the output volatility of renewable energy combined with storage, risk-aware dynamic optimization in supply chain or revenue management, and dynamic optimization of fairness metrics in queueing systems.
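One way to see the bilevel structure mentioned above, sketched for the pure variance-minimization part and in notation of my own rather than the talk's: using the identity Var(X) = min_y E[(X - y)^2], variance minimization over policies becomes

```latex
% Bilevel form of variance minimization: for a fixed reference level y,
% the inner problem is a standard MDP with squared-deviation costs
% (after state augmentation when R is a cumulative reward).
\min_{\pi} \operatorname{Var}_{\pi}(R)
 \;=\; \min_{\pi} \min_{y \in \mathbb{R}} \mathbb{E}_{\pi}\!\left[(R - y)^{2}\right]
 \;=\; \min_{y \in \mathbb{R}} \, \min_{\pi} \mathbb{E}_{\pi}\!\left[(R - y)^{2}\right]
% which suggests alternating between setting y = E_pi[R] and improving
% pi in the fixed-y problem, in the spirit of the two-level iteration above.
```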
Li Xia is a professor at Sun Yat-sen University. He received his bachelor's degree in 2002 and his Ph.D. in 2007 from the Department of Automation at Tsinghua University, where he was a faculty member from 2011 to 2019 before joining Sun Yat-sen University in 2019. His research interests include the theory of Markov decision processes, reinforcement learning, queueing theory, and stochastic games, as well as applications in energy, finance, and related areas. He has published more than 100 papers, holds more than 10 Chinese and US invention patents, and has led 5 National Natural Science Foundation of China projects and a number of collaborative research projects with companies such as Huawei and Tencent. He serves as an associate editor of international SCI journals including IEEE Transactions on Automation Science and Engineering, Discrete Event Dynamic Systems, Journal of the Operations Research Society of China, and Journal of Systems Science and Systems Engineering.
