Overview | Coloring Privacy | |

Color is one of the important features in visualization. For example, colors are used to represent the population of a region on a map. The darker the color uses, the more people a region could have. The coded information with colors or other visual attributes seems safer towards privacy breach compared to the raw data, but there could still be potential privacy attacks based on the visual information. This project is motivated by the possible privacy breach issue in the visualization process via colors. When applying common privacy-preserving techniques including Laplacian Mechanism and Exponential Mechanism for ε-differential privacy, we have addressed their corresponding challenges and possible solutions. Moreover, information visualization techniques have their own set of criteria for evaluation, such as traditional Nielsen's set of heuristic evaluation. Hence, we would like to explore the special utilities of visualization, such as the contrasting effect of colors, to minimize any loss in the effectiveness and efficiency of visualization techniques, or in other words, any loss in the utilities of visualization due to privacy-preserving mechanism.

Visualization aims at enhancing data analysts' understanding on data given. We generalize this process in the following model, as shown in Figure 1. Intuitively, the model is an interaction between a data user and a database. The user will have a graphic interface to interact with the database. This interface will generate queries to the database, where raw information or semi-processed information is stored. A further query computation will be processed on the data and generate a query result. In the end, data will be mapped onto a view, which will be presented to the user.

Figure 1: Visualization Query Model |
Figure 2: Coloring conversion for the query result |

**Example:** Given a fact table D of n tuples with column attributes: individual ID and their corresponding location. The user request to create a view on the number of individuals at each geographic grid on the map with corresponding colors. For instance, in Figure 2, given the fact table D, we could derive a frequency matrix FreqM. If we have nine colors available, denoted by the color set C = {0,1,..., 8}, we map color 1 to frequency 0, color 3 to frequency 1, color 5 to frequency 2 and color 7 to frequency 3. Based on this coloring assignment, we could get a coloring matrix ColorM.

**Possible Privacy Breach:** Simply using those colors to represent the frequency on the map could breach privacy. If a taxi changes its location from one grid cell to another, the frequency for these two grid cells will change and hence the coloring. For example, in Figure 4, the first individual's location changes from (0, 1) to (2, 1), it could result in dierent coloring matrix as shown in Figure 3(a). Then the adversary could see the dierence highlight in the red box of the two coloring matrix ColorM and ColorM', due to the change of one tuple in the fact table D. Common mechanism such as Laplacian Mechanism or Exponential Mechanism could be used to preserve ε-differential privacy, but there are challenges to each of them.

Figure 3: Potential privacy breach due to one tuple change in the fact table

**Coloring Utility: **

Accuracy: measured by the difference between the output coloring matrix ColorM and the optimal coloring matrix ColorM*.Contrast and Analogy: measured by the difference in the contrast between adjacent grid cells from its optimal contrast.

Perturbed Coloring via Laplacian Mechanism: Noises could be added to the colors as shown in Figure 3(b), such that there will be a high probability for the two fact table to have the same output coloring matrix. There are two challenges here. Firstly, the coloring values are integers and secondly, colors hav maximum and minimum values, i.e. bounded values. Hence, we have developed Bounded Discrete Laplacian Mechanism and proved that it satisfies 2ε-differential privacy.Sampling via Exponential Mechanism: A scoring function has been constructed and the probability for each possible output is computed based on the exponential function of corresponding score. The challenges here are no trivial c.d.f form and large sample space. We have come out with an indexing and histogram method for sampling. A simplified version for the scoring function have been computed for possible faster sampling.

- Exploring on the visual attributes, for example, colors have three attributes: hue, saturation and lightness.
- Merging analous nodes to reduce storage and computation cost.
- Exploring privacy on range coloring.

Details and experiment results could refer to the projec presentation slide

Some testing maps can be found here