Maya Benarous’ paper, “Synthesis of Longitudinal Human Location Sequences: Balancing Utility and Privacy”, was just published in ACM Transactions on Knowledge Discovery from Data (TKDD). The paper, written with Maya’s two co-advisors, Eran Toch and Irad Ben-Gal, looks at synthesizing long sequences of people’s whereabouts. People’s location data is continuously tracked by many devices and sensors, enabling ongoing analysis that can re-identify individuals or reveal sensitive information about them.

The paper analyzes the use of different synthetic data generation models for long location sequences, including long short-term memory networks (LSTMs), Markov chains, and variable-order Markov models (VMMs). It evaluates the models on performance measures such as data similarity and privacy, and introduces further measurements to quantify each of them. Maya’s experiments, based on the anonymous data of 300 thousand users, show that different models can serve different data analysis applications, such as traffic prediction or lifestyle analysis.
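To give a feel for the simplest of these model families, here is a minimal sketch of a first-order Markov chain synthesizer. It is not the paper’s implementation or the API of the released libraries: the function names (`fit_markov_chain`, `synthesize`) and the toy location labels are made up for illustration, and the paper’s models are considerably richer.

```python
import random
from collections import Counter, defaultdict


def fit_markov_chain(sequences):
    """Count first-order transitions between consecutive locations."""
    transitions = defaultdict(Counter)
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            transitions[current][nxt] += 1
    return transitions


def synthesize(transitions, start, length, rng):
    """Sample a synthetic sequence of up to `length` locations."""
    sequence = [start]
    while len(sequence) < length:
        counts = transitions.get(sequence[-1])
        if not counts:  # no observed transition out of this location
            break
        locations = list(counts)
        weights = [counts[loc] for loc in locations]
        sequence.append(rng.choices(locations, weights=weights, k=1)[0])
    return sequence


# Toy, made-up sequences (hypothetical labels, not data from the paper).
real_sequences = [
    ["home", "work", "gym", "home"],
    ["home", "work", "home"],
    ["home", "cafe", "work", "home"],
]

model = fit_markov_chain(real_sequences)
synthetic = synthesize(model, start="home", length=6, rng=random.Random(0))
print(synthetic)
```

Each synthetic step is drawn in proportion to how often that transition appeared in the real sequences. The output therefore keeps aggregate movement patterns without copying any single person’s sequence verbatim. A sketch like this offers no formal privacy guarantee on its own.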

We also released two libraries on GitHub: one for synthesizing mobility chains and the other for evaluating synthesizers.

Here is the abstract:

People’s location data are continuously tracked from various devices and sensors, enabling an ongoing analysis of sensitive information that can violate people’s privacy and reveal confidential information. Synthetic data have been used to generate representative location sequences yet to maintain the users’ privacy. Nonetheless, the privacy-accuracy tradeoff between these two measures has not been addressed systematically. In this article, we analyze the use of different synthetic data generation models for long location sequences, including long short-term memory networks (LSTMs), Markov Chains (MC), and variable-order Markov models (VMMs). We employ different performance measures, such as data similarity and privacy, and discuss the inherent tradeoff. Furthermore, we introduce other measurements to quantify each of these measures. Based on the anonymous data of 300 thousand cellular-phone users, our work offers a road map for developing policies for synthetic data generation processes. We propose a framework for building data generation models and evaluating their effectiveness regarding those accuracy and privacy measures.