
Sudeepa Roy


                               Associate Professor
                               Department of Computer Science
                               Duke University
                               308 Research Drive
                               Campus Box 90129
                               Durham, NC 27708-0129

                               Office: D325 LSRC Building
                               Phone: (919) 660-6596
                               Fax: (919) 660-6519
                               E-mail: sudeepa AT cs DOT duke DOT edu      





News




Background

I joined the Department of Computer Science at Duke University in Fall 2015.
I am a member of the Duke Database Group (a.k.a. Duke Database Devils; more about Duke Blue Devils),
which is part of the Duke Systems Group.

Before joining Duke, I was a postdoctoral research associate in the Department of Computer Science and Engineering,
University of Washington, where I worked with Prof. Dan Suciu and the database group.

I graduated from the University of Pennsylvania with a Ph.D. in Computer and Information Science, where I was advised by
Prof. Susan Davidson and Prof. Sanjeev Khanna. During my Ph.D., I did two internships at IBM Research, Almaden.



Research

I am broadly interested in data and information management, with a focus on foundational aspects of databases and
big data analysis. My current research focuses on building tools and techniques that help users get the maximum benefit
from the available data. My ongoing work on causality and explanations in databases aims directly to help users gain
deep insights into data by providing causal analysis and rich explanations for their questions; it also often motivates the questions we studied in our recent work on data repair and query optimization. My earlier work in the areas of data and workflow provenance,
probabilistic databases, and crowd-sourcing
probed compelling, fundamental questions that need to be answered
to enable end-to-end processing and analysis of unstructured, noisy, and unreliable data in today's world while preserving its entire context.

See my publications.


Awards



Projects



Funding



Services

Organization / Advisory

Award Committee Member

Program Committee Member

External Reviewer



Teaching



Students

I am fortunate to work with a number of wonderful graduate/undergraduate students and postdocs at Duke!
(The list below does not include the great students/postdocs advised by my colleagues at Duke and other schools whom I work with.)

Current students / postdocs

Former students and postdocs


      Duke CS+ undergraduate summer internship mentoring:
      James Lim (2021), Allen Pan (2021), Zachary Zheng (2021), Alexander Bendeck (2020), Jeffrey Luo (2020)

Publications    

    Book Chapters

  1. Trends in Explanations: Understanding and Debugging Data-driven Systems [pdf].
        (with Boris Glavic and Alexandra Meliou)
        Foundations and Trends in Databases, Vol 11, No. 3, 2021

  2. Uncertain Data Lineage [pdf].
        Encyclopedia of Database Systems, 2nd edition, Springer, 2018.

  3. Provenance: Privacy and Security [pdf].
        (with Susan Davidson)
        Encyclopedia of Database Systems, 2nd edition, Springer, 2018.

    Tutorials


  5. Causality and Explanations in Databases [pdf] [slides].
        (with Alexandra Meliou and Dan Suciu)
        International Conference on Very Large Data Bases (VLDB) 2014.

    Journal Publications


  7. FLAME: A Fast Large-scale Almost Matching Exactly Approach to Causal Inference [pdf] [arxiv].
        (with Tianyu Wang, Marco Morucci, M. Usaid Awan, Yameng Liu, Cynthia Rudin, and Alexander Volfovsky)
        Journal of Machine Learning Research (JMLR), Vol. 22, No. 31, pages 1−41, 2021.

  8. Computing Optimal Repairs for Functional Dependencies [pdf].
        (with Ester Livshits and Benny Kimelfeld)
        ACM Transactions on Database Systems (TODS), Vol. 45, Issue 1, pages 4:1--4:46, 2020 (best paper special issue).

  9. Exact Model Counting of Query Expressions: Limitations of Propositional Methods [pdf].
        (with Paul Beame, Jerry Li, and Dan Suciu)
        ACM Transactions on Database Systems (TODS), Vol. 42, Issue 1, pages 1:1-1:46, 2017.
        (Preliminary versions in ICDT 2014 and UAI 2013)

  10. Answering Conjunctive Queries with Inequalities [pdf].
        (with Paraschos Koutris, Tova Milo, and Dan Suciu)
        Theory of Computing Systems (TOCS), Springer, Vol. 61, Number 1, pages 2-30, 2017.
        (A preliminary version appeared in ICDT 2015)

  11. Top-k and Clustering with Noisy Comparisons [pdf].
        (with Susan B. Davidson, Sanjeev Khanna, and Tova Milo)
        ACM Transactions on Database Systems (TODS), Vol. 39, Issue 4, pages 35:1--35:39, 2014 (best paper special issue).
        (A preliminary version appeared in ICDT 2013)

    Invited Articles


  13. Toward Interpretable and Actionable Data Analysis with Explanations and Causality [pdf].
        PVLDB, Vol 15(12), 2022 (Article for the VLDB Early Career Research Award)

  14. Making AI Machines Work for Humans in FoW [pdf].
        (with Sihem Amer-Yahia, Senjuti Basu Roy, Lei Chen, Atsuyuki Morishima, James Abello Monedero, Pierre Bourhis, François Charoy, Marina Danilevsky, Gautam Das, Gianluca Demartini, Shady Elbassuoni, David Gross-Amblard, Emilie Hoareau, Munenari Inoguchi, Jared B. Kenworthy, Itaru Kitahara, Dongwon Lee, Yunyao Li, Ria Mae Borromeo, Paolo Papotti, H. Raghav Rao, Pierre Senellart, Keishi Tajima, Saravanan Thirumuruganathan, Marion Tommasi, Kazutoshi Umemoto, Andrea Wiggins, and Koichiro Yoshida)
        SIGMOD Record 2020 (49(2), pages 30-35)

  15. On Benchmarking for Crowdsourcing and Future of Work Platforms [pdf].
        (with Ria Mae Borromeo, Lei Chen, Abhishek Dubey, and Saravanan Thirumuruganathan)
        IEEE Data Engineering Bulletin 2019 (42(4), pages 46-54)

  16. Query Perturbation Analysis: An Adventure of Database Researchers in Fact-Checking [pdf].
        (with Jun Yang, Pankaj K. Agarwal, Brett Walenz, You Wu, Cong Yu, and Chengkai Li)
        IEEE Data Engineering Bulletin 2018 (41(3), pages 28-42)

  17. On the Complexity of Evaluating Order Queries with the Crowd [pdf].
        (with Benoit Groz and Tova Milo)
        IEEE Data Engineering Bulletin 2015 (38(3), pages 44-58)

    Conference Publications


  19. The Cost of Representation by Subset Repairs.
        (with Yuxi Liu*, Fangzhu Shen*, Kushagra Ghosh, Amir Gilad, and Benny Kimelfeld)
        To Appear in Proceedings of the VLDB Endowment (PVLDB), 2025.

  20. Qr-Hint: Actionable Hints for Guided SQL Query Debugging.
        (with Yihao Hu, Amir Gilad, Kristin Stephens-Martinez, and Jun Yang)
        To Appear in ACM SIGMOD International Conference on Management of Data (SIGMOD), 2024.

  21. Summarized Causal Explanations For Aggregate Views.
        (with Brit Youngmann, Amir Gilad, and Michael Cafarella)
        To Appear in ACM SIGMOD International Conference on Management of Data (SIGMOD), 2024.

  22. Evaluating Datalog over Semirings: A Grounding-based Approach.
        (with Hangdong Zhao, Shaleen Deep, Paris Koutris, and Val Tannen)
        To Appear in ACM Principles of Database Systems (PODS), 2024.

  23. Evaluating Pre-Trial Programs Using Interpretable Machine Learning Matching Algorithms for Causal Inference.
        (with Travis Seale-Carlisle*, Saksham Jain*, Courtney Lee, Caroline Levenson, Swathi Ramprasad, Brandon Garrett, Cynthia Rudin, and Alexander Volfovsky)
        To Appear in AAAI Conference on Artificial Intelligence (AAAI), 2024, AI for Social Impact (AISI) special track.

  24. DP-PQD: Privately Detecting Per-Query Gaps In Synthetic Data Generated By Black-Box Mechanisms.
        (with Shweta Patwa, Danyu Sun, Amir Gilad, and Ashwin Machanavajjhala)
        Proceedings of the VLDB Endowment (PVLDB), Vol 17 (1), 2023.

  25. Explaining Differentially Private Query Results With DPXPlain.
        (with Tingyu Wang, Yuchao Tao, Amir Gilad, and Ashwin Machanavajjhala)
        Proceedings of the VLDB Endowment (PVLDB), 2023, Demonstration Track.

  26. Characterizing and Verifying Queries Via CINSGEN.
        (with Hanze Meng, Zhengjie Miao, Amir Gilad, and Jun Yang)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2023, Demonstration Track.

  27. Causal What-If and How-To Analysis Using HypeR.
        (with Fangzhu Shen, Kayvon Heravi, Oscar Gomez, Sainyam Galhotra, Amir Gilad, and Babak Salimi)
        International Conference on Data Engineering, Demonstration Track, 2023.

  28. DPXPlain: Privately Explaining Aggregate Query Answers [pdf].
        (with Yuchao Tao, Amir Gilad, and Ashwin Machanavajjhala)
        Proceedings of the VLDB Endowment (PVLDB), Vol 16 (1), 2022.

  29. HypeR: Hypothetical Reasoning With What-If and How-To Queries Using a Probabilistic Causal Approach.
        (with Sainyam Galhotra*, Amir Gilad*, and Babak Salimi)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2022.

  30. Selectivity Functions of Range Queries are Learnable.
        (with Xiao Hu, Yuxi Liu, Haibo Xiu, Pankaj Agarwal, Debmalya Panigrahi, and Jun Yang)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2022.

  31. Understanding Queries by Conditional Instances. [arxiv].
        (with Amir Gilad*, Zhengjie Miao*, and Jun Yang)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2022.

  32. CaJaDE: Explaining Query Results by Augmenting Provenance with Context.
        (with Chenjie Li, Juseung Lee, Zhengjie Miao, and Boris Glavic)
        Proceedings of the VLDB Endowment (PVLDB), Vol 15, demonstration track, 2022.

  33. Putting Things into Context: Rich Explanations for Query Answers using Join Graphs. [pdf] [arxiv].
        (with Chenjie Li, Zhengjie Miao, Qitian Zeng, and Boris Glavic)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2021.

  34. Properties of Inconsistency Measures for Databases [pdf].
        (with Ester Livshits, Rina Kochirgan, Segev Tsur, Ihab Ilyas, and Benny Kimelfeld)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2021.

  35. Aggregated Deletion Propagation for Counting Conjunctive Query Answers [pdf] [full version]
        (with Xiao Hu, Shouzhuo Sun, Shweta Patwa, and Debmalya Panigrahi)
        Proceedings of the VLDB Endowment (PVLDB), Vol 14, 2020.

  36. I-Rex: An Interactive Relational Query Explainer for SQL [pdf].
        (with Zhengjie Miao, Tiangang Chen, Alexander Bendeck, Kevin Day, and Jun Yang)
        Proceedings of the VLDB Endowment (PVLDB), Vol 13, demonstration track, 2020.

  37. MuSe: Multiple Deletion Semantics for Data Repair [pdf].
        (with Amir Gilad, Yihao Hu, and Daniel Deutch)
        Proceedings of the VLDB Endowment (PVLDB), Vol 13, demonstration track, 2020.

  38. On Multiple Semantics for Declarative Database Repairs [pdf] [arxiv].
        (with Amir Gilad and Daniel Deutch)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2020.

  39. Computing Local Sensitivities of Counting Queries with Joins [pdf] [arxiv].
        (with Yuchao Tao, Xi He, and Ashwin Machanavajjhala)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2020.

  40. Causal Relational Learning [pdf] [arxiv].
        (with Babak Salimi, Harsh Parikh, Moe Kayali, Lise Getoor, and Dan Suciu)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2020.

  41. Adaptive Hyper-box Matching for Interpretable Individualized Treatment Effect Estimation [arxiv].
        (with Marco Morucci*, Vittorio Orlandi*, Cynthia Rudin, and Alexander Volfovsky)
        To appear in Conference on Uncertainty in Artificial Intelligence (UAI), 2020.

  42. Almost-Matching-Exactly for Treatment Effect Estimation under Network Interference [arxiv].
        (with M. Usaid Awan*, Marco Morucci*, Vittorio Orlandi*, Cynthia Rudin, and Alexander Volfovsky)
        International Conference on Artificial Intelligence and Statistics (AISTATS), 2020.

  43. Learning to Sample: Counting with Complex Queries [arxiv].
        (with Brett Walenz, Stavros Sintos, and Jun Yang)
        Proceedings of the VLDB Endowment (PVLDB), Vol 13, 2019.

  44. Almost Matching Exactly With Instrumental Variables [arxiv].
        (with M. Usaid Awan*, Yameng Liu*, Marco Morucci*, Cynthia Rudin, and Alexander Volfovsky)
        Conference on Uncertainty in Artificial Intelligence (UAI) 2019.

  45. CAPE: Explaining Outliers by Counterbalancing [pdf].
        (with Zhengjie Miao*, Qitian Zeng*, Chenjie Li, Boris Glavic, and Oliver Kennedy)
        Proceedings of the VLDB Endowment (PVLDB), Vol 12, demonstration track, 2019.

  46. LensXPlain: Visualizing and Explaining Contributing Subsets for Aggregate Query Answers [pdf].
        (with Zhengjie Miao and Andrew Lee)
        Proceedings of the VLDB Endowment (PVLDB), Vol 12, demonstration track, 2019.

  47. Almost-Exact Matching with Replacement for Causal Inference [arxiv].
        (with Awa Dieng*, Yameng Liu*, Cynthia Rudin, and Alexander Volfovsky)
        International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

  48. RATest: Explaining Wrong Queries Using Small Examples [pdf].
        (with Zhengjie Miao and Jun Yang)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), demonstration track, 2019.

  49. Explaining Wrong Queries Using Small Examples [pdf].
        (with Zhengjie Miao and Jun Yang)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2019.

  50. Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances [pdf].
        (with Zhengjie Miao*, Qitian Zeng*, and Boris Glavic)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2019.

  51. iQCAR: inter-Query Contention Analyzer for Data Analytics Frameworks [pdf].
        (with Prajakta Kalmegh and Shivnath Babu)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), 2019.

  52. Interactive Summarization and Exploration of Top Aggregate Query Answers [pdf].
        (with Yuhao Wen, Xiaodan Zhu, and Jun Yang)
        Proceedings of the VLDB Endowment (PVLDB) 2018, Vol 11 Issue 13/VLDB 2019.

  53. Computing Optimal Repairs for Functional Dependencies [arxiv].
        (with Ester Livshits and Benny Kimelfeld)
        ACM Principles of Database Systems (PODS) 2018.

  54. iQCAR: A demonstration of an Inter-query Contention Analyzer for Cluster Computing Frameworks [pdf].
        (with Prajakta Kalmegh, Harrison Lundberg, Frederick Xu, and Shivnath Babu)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), demonstration track, 2018.

  55. QAGView: Interactively Summarizing High-Valued Aggregate Query Answers [pdf].
        (with Yuhao Wen, Xiaodan Zhu, and Jun Yang)
        ACM SIGMOD International Conference on Management of Data (SIGMOD), demonstration track, 2018.

  56. Optimizing Iceberg Queries with Complex Joins [pdf].
        (with Brett Walenz and Jun Yang)
        ACM SIGMOD International Conference on Management of Data (SIGMOD) 2017.

  57. Explaining Query Answers with Explanation-Ready Databases [pdf] [slides].
        (with Laurel Orr and Dan Suciu)
        Proceedings of the VLDB Endowment (PVLDB) Vol 9/VLDB 2016.

  58. Answering Conjunctive Queries with Inequalities [pdf].
        (with Paraschos Koutris, Tova Milo, and Dan Suciu)
        International Conference on Database Theory (ICDT) 2015

  59. A Formal Approach to Finding Explanations for Database Queries [pdf] [slides].
        (with Dan Suciu)
        ACM SIGMOD International Conference on Management of Data (SIGMOD) 2014.

  60. Circuits for Datalog Provenance [pdf] [slides].
        (with Daniel Deutch, Tova Milo, and Val Tannen)
        International Conference on Database Theory (ICDT) 2014.

  61. Model Counting of Query Expressions: Limitations of Propositional Methods [pdf].
        (with Paul Beame, Jerry Li, and Dan Suciu)
        International Conference on Database Theory (ICDT) 2014.
        Invited to ACM TODS as one of the best papers in ICDT 2014

  62. Lower Bounds for Exact Model Counting and Applications in Probabilistic Databases [pdf] [slides].
        (with Paul Beame, Jerry Li, and Dan Suciu)
        Conference on Uncertainty in Artificial Intelligence (UAI) 2013.

  63. Provenance-based Dictionary Refinement in Information Extraction [pdf] [slides].
        (with Laura Chiticariu, Vitaly Feldman, Frederick R Reiss and Huaiyu Zhu)
        ACM SIGMOD International Conference on Management of Data (SIGMOD) 2013.

  64. Using the Crowd for Top-k and Group-by Queries [pdf] [slides].
        (with Susan B. Davidson, Sanjeev Khanna and Tova Milo)
        International Conference on Database Theory (ICDT) 2013.
        Invited to ACM TODS as one of the best papers in ICDT 2013

  65. A Propagation Model for Provenance Views of Public/Private Workflows [pdf] [slides].
        (with Susan B. Davidson and Tova Milo)
        International Conference on Database Theory (ICDT) 2013.

  66. Queries with Difference on Probabilistic Databases [pdf] [slides].
        (with Sanjeev Khanna and Val Tannen)
        International Conference on Very Large Data Bases (VLDB) 2011.

  67. Provenance Views for Module Privacy [pdf] [slides].
        (with Susan B. Davidson, Sanjeev Khanna, Tova Milo, and Debmalya Panigrahi)
        ACM Principles of Database Systems (PODS) 2011.

  68. Faster Query Answering in Probabilistic Databases using Read-Once Functions [pdf] [slides].
        (with Vittorio Perduca and Val Tannen)
        International Conference on Database Theory (ICDT) 2011.

  69. Enabling Privacy in Provenance-Aware Workflow Systems [pdf].
        (with Susan Davidson, Sanjeev Khanna, Julia Stoyanovich, Val Tannen, Yi Chen and Tova Milo)
        Vision Track, Conference on Innovative Data Systems Research (CIDR) 2011.

  70. An Optimal Labeling Scheme for Workflow Provenance Using Skeleton Labels [pdf].
        (with Zhuowei Bao, Susan Davidson and Sanjeev Khanna)
        ACM SIGMOD International Conference on Management of Data (SIGMOD) 2010.

  71. Optimizing User Views for Workflows [pdf] [slides].
        (with Olivier Biton, Susan Davidson and Sanjeev Khanna)
        International Conference on Database Theory (ICDT) 2009.

  72. STCON in Directed Unique-Path Graphs [pdf] [slides].
        (with Sampath Kannan and Sanjeev Khanna)
        Foundations of Software Technology and Theoretical Computer Science (FSTTCS) 2008.

  73. Automatic Translation of Simulink Models into Input Language of a Model Checker [pdf].
        (with Meenakshi B. and Abhishek Bhatnagar)
        International Conference on Formal Engineering Methods (ICFEM) 2006.

    Workshop, Poster, and Other Publications


  75. I-Rex: An Interactive Relational Query Debugger for SQL [link].
        (with Yihao Hu, Zhengjie Miao, James Leong, James Lim, Zachary Zheng, Kristin Stephens-Martinez, and Jun Yang)
        ACM Technical Symposium on Computer Science Education (SIGCSE), Demonstration, 2022.

  76. AME: Interpretable Almost Exact Matching for Causal Inference [link].
        (with Haoning Jiang, Tommy Howell, Neha Gupta, Vittorio Perduca, Marco Morucci, Harsh Parikh, Cynthia Rudin, and Alexander Volfovsky)
        Conference on Neural Information Processing Systems (NeurIPS), Demonstration, 2021.

  77. iQCAR: Inter-Query Contention Analyzer [pdf].
        (with Prajakta Kalmegh and Shivnath Babu)
        Symposium on Cloud Computing (SOCC), Poster, 2018.

  78. Hiding Data and Structure in Workflow Provenance [pdf].
        (with Susan B. Davidson and Zhuowei Bao)
        Invited paper, International Workshop on Databases in Networked Information Systems (DNIS) 2011.

  79. Privacy Issues in Scientific Workflow Provenance [pdf] [slides].
        (with Susan Davidson, Sanjeev Khanna and Sarah Cohen Boulakia)
        International Workshop on Workflow Approaches to New Data-centric Science (WANDS) 2010.


    * = equal contributions

Ph.D. Dissertation

    Provenance and Uncertainty [pdf].
    Sudeepa Roy
    University of Pennsylvania, August 2012




Patents

  1. Refining a dictionary for information extraction.
        (with Laura Chiticariu, Vitaly Feldman, Frederick Reiss, and Huaiyu Zhu)
        Assignee: International Business Machines Corporation (IBM)
        Publication Number: US 8775419 B2,  2014

  2. Automatic Translation of Simulink Models into Input Language of a Model Checker.
        (with Meenakshi B. and Abhishek Bhatnagar)
        Assignee: Honeywell International Inc.
        Publication Number: US 7698668 B2,  2010



Miscellaneous

Reports

On "Go With the Winners" Algorithm [pdf].
    Sudeepa Roy
    M. Tech. Thesis, IIT Kanpur, 2006.
    Advisors: Prof. Manindra Agrawal and Prof. Somenath Biswas