The objective of DAPSA – Data Access with Privacy, Security, and Accountability: Towards Virtual Research Rooms is to analyze and address hard problems related to privacy, security, and accountability in data science.

Data about people is key to making informed decisions and driving research in medical and social sciences. But how can such data be used without sacrificing privacy? Here we form a unique collaboration of core researchers and researchers from social science and law, together with the Israeli Central Bureau of Statistics (ICBS). The research team will develop new theoretically grounded methods and algorithms for private, secure, and accountable data access and deploy these in virtual research rooms used by researchers at TAU and other Israeli research institutions.

The project is funded by the Data Science Excellence program at the Israel Council for Higher Education through TAD – Center for Artificial Intelligence & Data Science at Tel Aviv University.

Open-source code created as part of DAPSA is available in our GitHub repository:

  • synthesize_mobility_chains: The repository contains code that analyzes repositories of long mobility chains (such as the location traces of an app user over several weeks) and synthesizes location traces using long short-term memory networks (LSTMs), Markov chains (MCs), and variable-order Markov models (VMMs).
  • evaluate_synthetic_time_series: The repository contains code that computes evaluation measures for synthetic time series data, such as synthesized sequences of the locations that a set of people visit over a couple of weeks. The scripts compare the synthetic data to the original data and analyze how well the synthesis preserves the privacy of the original data subjects, the statistical similarity, the per-instance similarity, and the diversity of the synthetic data.
  • GPTalyze: An open-source library that uses ChatGPT’s API to analyze short textual snippets, such as tweets. It employs ChatGPT’s zero-shot-like abilities to summarize the topics discussed in a textual corpus and to perform other Natural Language Processing (NLP) tasks, such as sentiment analysis and emotion detection.
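
To give a feel for the simplest of the synthesis approaches named above, here is a minimal sketch of a first-order Markov chain fitted to location traces and then sampled to produce a synthetic trace. It is not the code from the synthesize_mobility_chains repository: the function names, the toy traces, and the place labels are all hypothetical, and a plain Markov chain like this offers no privacy guarantee on its own.

```python
import random
from collections import Counter, defaultdict


def fit_markov_chain(sequences):
    """Count start locations and first-order transitions in the traces."""
    starts = Counter()
    transitions = defaultdict(Counter)
    for seq in sequences:
        if not seq:
            continue
        starts[seq[0]] += 1
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return starts, transitions


def sample_chain(starts, transitions, length, rng):
    """Sample one synthetic trace with at most `length` locations."""
    locations, weights = zip(*starts.items())
    state = rng.choices(locations, weights=weights)[0]
    chain = [state]
    while len(chain) < length:
        options = transitions.get(state)
        if not options:
            # No successor was ever observed for this location: stop early.
            break
        candidates, counts = zip(*options.items())
        state = rng.choices(candidates, weights=counts)[0]
        chain.append(state)
    return chain


# Hypothetical toy traces; real input would be weeks of visits per person.
traces = [
    ["home", "work", "gym", "home"],
    ["home", "work", "home"],
    ["home", "cafe", "work", "home"],
]

starts, transitions = fit_markov_chain(traces)
rng = random.Random(0)  # fixed seed so the sketch is reproducible
synthetic = sample_chain(starts, transitions, length=6, rng=rng)
print(synthetic)
```

Every step in the sampled trace is a transition that occurred somewhere in the input, which is why a model of this kind can preserve short-range statistics while missing longer-range habits that LSTMs or variable-order models are meant to capture.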

The first paper in this project, “Synthesis of Longitudinal Human Location Sequences: Balancing Utility and Privacy”, was recently accepted for publication in ACM Transactions on Knowledge Discovery from Data (TKDD).