{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "gkv0-iHdQR7N" }, "source": [ "# Simple table test\n", "[![View On GitHub](https://img.shields.io/badge/View_in_Github-grey?logo=github)](https://github.com/Qrlew/docs/blob/main/tutorials/postgres_simple_table.ipynb)\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Qrlew/pyqrlew/blob/main/examples/postgres_simple_table.ipynb)\n", "\n", "In this notebook, we demonstrate the application of `qrlew` to an existing database, showcasing its capability to compile SQL queries into differential privacy.\n", "\n", "For the purpose of this demonstration, we focus on a single table, while more intricate use cases are explored in other notebooks." ] }, { "cell_type": "markdown", "metadata": { "id": "ZUq42jHlQR7P" }, "source": [ "## Install the database" ] }, { "cell_type": "markdown", "metadata": { "id": "eO8s7I5KQR7Q" }, "source": [ "Initially, we establish a PostgreSQL database.\n", "\n", "This section maybe skipped if you have already access to a Postgres database." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "id": "rdN3zDBeQR7Q" }, "outputs": [], "source": [ "%%capture\n", "# Load the database\n", "# Inspired by https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/postgresql.ipynb#scrollTo=YUj0878jPyz7\n", "!sudo apt-get -y -qq update\n", "!sudo apt-get -y -qq install postgresql-14 graphviz\n", "# Start postgresql server\n", "!sudo service postgresql start\n", "# Set password\n", "!sudo -u postgres psql -U postgres -c \"ALTER USER postgres PASSWORD 'pyqrlew-db'\"\n", "!sudo -u postgres psql -U postgres -c \"CREATE DATABASE my_database\"" ] }, { "cell_type": "markdown", "metadata": { "id": "6HSIusfpQR7R" }, "source": [ "We create the `census_table` table with its synthetic version `synthetic census_table`.\n", "\n", "Note that the synthetic tables are not mandatory and can be used when `qrlew`` cannot compile into DP the query." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "DlPlTqPAQR7R" }, "outputs": [], "source": [ "import psycopg2\n", "\n", "connect_pars= {\n", " 'dbname':'my_database',\n", " 'user':'postgres',\n", " 'port':5432,\n", " 'password':'pyqrlew-db',\n", " 'host':'localhost',\n", "}\n", "conn = psycopg2.connect(**connect_pars)\n", "\n", "# Create a cursor\n", "cursor = conn.cursor()\n", "\n", "sql_script = \"\"\"\n", "CREATE SCHEMA census_schema;\n", "\n", "CREATE TABLE census_schema.census_table (\n", " id SERIAL PRIMARY KEY,\n", " first_name VARCHAR(50),\n", " last_name VARCHAR(50),\n", " age INT,\n", " gender VARCHAR(10),\n", " occupation VARCHAR(50),\n", " city VARCHAR(50),\n", " state VARCHAR(50)\n", ");\n", "\n", "-- Inserting sample data\n", "INSERT INTO census_schema.census_table (first_name, last_name, age, gender, occupation, city, state)\n", "VALUES\n", " ('John', 'Doe', 30, 'Male', 'Engineer', 'New York', 'NY'),\n", " ('Jane', 'Doe', 28, 'Female', 'Engineer', 'Los Angeles', 'CA'),\n", " ('Bob', 'Smith', 45, 'Male', 'Analyst', 'Chicago', 'IL'),\n", " ('Alice', 'Johnson', 35, 'Female', 'Data Scientist', 'San Francisco', 'CA'),\n", " ('Tom', 'Brown', 50, 'Male', 'Lawyer', 'Miami', 'FL'),\n", " ('Emily', 'Davis', 32, 'Female', 'Software Developer', 'Seattle', 'WA'),\n", " ('Charlie', 'Miller', 40, 'Male', 'Accountant', 'Boston', 'MA'),\n", " ('Eva', 'Williams', 29, 'Female', 'Marketing Manager', 'Austin', 'TX'),\n", " ('Mike', 'Jones', 38, 'Male', 'Analyst', 'Denver', 'CO'),\n", " ('Sophia', 'Taylor', 42, 'Female', 'Analyst', 'Atlanta', 'GA'),\n", " ('John', 'Doe', 30, 'Male', 'Engineer', 'New York', 'NY'),\n", " ('Jane', 'Doe', 28, 'Female', 'Engineer', 'Los Angeles', 'CA'),\n", " ('Bob', 'Smith', 45, 'Male', 'Analyst', 'Chicago', 'IL'),\n", " ('Alice', 'Johnson', 35, 'Female', 'Data Scientist', 'San Francisco', 'CA'),\n", " ('Tom', 'Brown', 50, 'Male', 'Lawyer', 'Miami', 'FL'),\n", " ('Emily', 'Davis', 32, 'Female', 'Software Developer', 'Seattle', 'WA'),\n", " ('Charlie', 'Miller', 40, 'Male', 'Accountant', 'Boston', 'MA'),\n", " ('Eva', 'Williams', 29, 'Female', 'Marketing Manager', 'Austin', 'TX'),\n", " ('Mike', 'Jones', 38, 'Male', 'Analyst', 'Denver', 'CO'),\n", " ('Sophia', 'Taylor', 42, 'Female', 'Analyst', 'Atlanta', 'GA'),\n", " ('Alex', 'Johnson', 33, 'Male', 'Analyst', 'Portland', 'OR'),\n", " ('Megan', 'White', 27, 'Female', 'Data Scientist', 'San Diego', 'CA'),\n", " ('Daniel', 'Clark', 36, 'Male', 'Analyst', 'Houston', 'TX'),\n", " ('Olivia', 'Martinez', 31, 'Female', 'Data Scientist', 'Phoenix', 'AZ'),\n", " ('William', 'Lee', 48, 'Male', 'Professor', 'Philadelphia', 'PA'),\n", " ('Emma', 'Garcia', 29, 'Female', 'Data Scientist', 'Minneapolis', 'MN'),\n", " ('Ryan', 'Harris', 41, 'Male', 'Data Scientist', 'Dallas', 'TX'),\n", " ('Ava', 'Taylor', 26, 'Female', 'Product Manager', 'San Jose', 'CA'),\n", " ('Christopher', 'Anderson', 37, 'Male', 'Engineer', 'Detroit', 'MI'),\n", " ('Sophie', 'Wilson', 34, 'Female', 'HR Specialist', 'Raleigh', 'NC'),\n", " ('David', 'Miller', 29, 'Male', 'Engineer', 'New York', 'NY'),\n", " ('Sophia', 'Taylor', 34, 'Female', 'Engineer', 'Los Angeles', 'CA'),\n", " ('Bob', 'Smith', 42, 'Male', 'Analyst', 'Chicago', 'IL'),\n", " ('Alice', 'Johnson', 39, 'Female', 'Data Scientist', 'San Francisco', 'CA'),\n", " ('Tom', 'Brown', 37, 'Male', 'Lawyer', 'Miami', 'FL'),\n", " ('Emily', 'Davis', 28, 'Female', 'Software Developer', 'Seattle', 'WA'),\n", " ('Charlie', 'Miller', 45, 'Male', 'Accountant', 'Boston', 'MA'),\n", " ('Eva', 'Williams', 31, 'Female', 'Marketing Manager', 'Austin', 'TX'),\n", " ('Mike', 'Jones', 36, 'Male', 'Analyst', 'Denver', 'CO'),\n", " ('Sophia', 'Taylor', 49, 'Female', 'Analyst', 'Atlanta', 'GA'),\n", " ('Alex', 'Johnson', 37, 'Male', 'Analyst', 'Portland', 'OR'),\n", " ('Megan', 'White', 27, 'Female', 'Data Scientist', 'San Diego', 'CA'),\n", " ('Daniel', 'Clark', 41, 'Male', 'Analyst', 'Houston', 'TX'),\n", " ('Olivia', 'Martinez', 32, 'Female', 'Data Scientist', 'Phoenix', 'AZ'),\n", " ('William', 'Lee', 47, 'Male', 'Professor', 'Philadelphia', 'PA'),\n", " ('Emma', 'Garcia', 30, 'Female', 'Data Scientist', 'Minneapolis', 'MN'),\n", " ('Ryan', 'Harris', 40, 'Male', 'Entrepreneur', 'Dallas', 'TX'),\n", " ('Ava', 'Taylor', 26, 'Female', 'Product Manager', 'San Jose', 'CA'),\n", " ('Christopher', 'Anderson', 35, 'Male', 'Engineer', 'Detroit', 'MI'),\n", " ('Sophie', 'Wilson', 33, 'Female', 'HR Specialist', 'Raleigh', 'NC'),\n", " ('Grace', 'Brown', 55, 'Female', 'Analyst', 'New York', 'NY'),\n", " ('Brian', 'Taylor', 50, 'Male', 'Professor', 'Los Angeles', 'CA'),\n", " ('Lily', 'Harrison', 42, 'Female', 'Engineer', 'Chicago', 'IL'),\n", " ('Kevin', 'Garcia', 55, 'Male', 'Analyst', 'San Francisco', 'CA'),\n", " ('Sophie', 'Smithson', 48, 'Female', 'Lawyer', 'Miami', 'FL'),\n", " ('John', 'Clarkston', 38, 'Male', 'Software Developer', 'Seattle', 'WA'),\n", " ('Emma', 'Millerson', 40, 'Female', 'Accountant', 'Boston', 'MA'),\n", " ('David', 'Williamson', 45, 'Male', 'Marketing Manager', 'Austin', 'TX'),\n", " ('Ava', 'Johnsonson', 38, 'Female', 'Analyst', 'Denver', 'CO'),\n", " ('Daniel', 'Anderson', 42, 'Male', 'Data Scientist', 'Atlanta', 'GA'),\n", " ('Michael', 'Martinson', 33, 'Male', 'Engineer', 'Portland', 'OR'),\n", " ('Sophia', 'Whiterson', 27, 'Female', 'Data Scientist', 'San Diego', 'CA'),\n", " ('Robert', 'Davidson', 41, 'Male', 'Analyst', 'Houston', 'TX'),\n", " ('Grace', 'Wilson', 45, 'Female', 'Data Scientist', 'Phoenix', 'AZ'),\n", " ('Andrew', 'Leeson', 48, 'Male', 'Professor', 'Philadelphia', 'PA'),\n", " ('Emily', 'Garcia', 39, 'Female', 'Data Scientist', 'Minneapolis', 'MN'),\n", " ('Christopher', 'Sanders', 41, 'Male', 'Entrepreneur', 'Dallas', 'TX'),\n", " ('Olivia', 'Martinez', 36, 'Female', 'Product Manager', 'San Jose', 'CA'),\n", " ('Sophie', 'Brownson', 43, 'Female', 'Engineer', 'Detroit', 'MI'),\n", " ('Matthew', 'Taylors', 46, 'Male', 'HR Specialist', 'Raleigh', 'NC');\n", "\n", "CREATE TABLE census_schema.synthetic_census_table (\n", " id SERIAL PRIMARY KEY,\n", " first_name VARCHAR(50),\n", " last_name VARCHAR(50),\n", " age INT,\n", " gender VARCHAR(10),\n", " occupation VARCHAR(50),\n", " city VARCHAR(50),\n", " state VARCHAR(50)\n", ");\n", "\n", "-- Inserting sample data\n", "INSERT INTO census_schema.synthetic_census_table (first_name, last_name, age, gender, occupation, city, state)\n", "VALUES\n", " ('Adam', 'Johnson', 31, 'Male', 'Engineer', 'New York', 'NY'),\n", " ('Sophie', 'Clark', 29, 'Female', 'Analyst', 'Los Angeles', 'CA'),\n", " ('David', 'Smith', 46, 'Male', 'Data Scientist', 'Chicago', 'IL'),\n", " ('Emma', 'Brown', 36, 'Female', 'Software Developer', 'San Francisco', 'CA'),\n", " ('James', 'Taylor', 51, 'Male', 'Lawyer', 'Miami', 'FL'),\n", " ('Olivia', 'Miller', 33, 'Female', 'Accountant', 'Seattle', 'WA'),\n", " ('Lucas', 'Williams', 41, 'Male', 'Marketing Manager', 'Boston', 'MA'),\n", " ('Ava', 'Davis', 30, 'Female', 'Analyst', 'Austin', 'TX'),\n", " ('Adam', 'Jones', 39, 'Male', 'Engineer', 'Denver', 'CO'),\n", " ('Sophie', 'Johnson', 43, 'Female', 'Analyst', 'Atlanta', 'GA'),\n", " ('Alex', 'Smith', 34, 'Male', 'Software Developer', 'Portland', 'OR'),\n", " ('Mia', 'White', 28, 'Female', 'Data Scientist', 'San Diego', 'CA'),\n", " ('Daniel', 'Clarkson', 37, 'Male', 'Lawyer', 'Houston', 'TX'),\n", " ('Oliver', 'Martinez', 32, 'Male', 'Data Scientist', 'Phoenix', 'AZ'),\n", " ('Sophia', 'Lee', 49, 'Female', 'Professor', 'Philadelphia', 'PA'),\n", " ('Lily', 'Garcia', 30, 'Female', 'Data Scientist', 'Minneapolis', 'MN'),\n", " ('Matthew', 'Harris', 42, 'Male', 'Entrepreneur', 'Dallas', 'TX'),\n", " ('Emily', 'Taylor', 27, 'Female', 'Product Manager', 'San Jose', 'CA'),\n", " ('William', 'Anderson', 36, 'Male', 'Engineer', 'Detroit', 'MI'),\n", " ('Sophia', 'Wilson', 34, 'Female', 'HR Specialist', 'Raleigh', 'NC'),\n", " ('Ethan', 'Miller', 29, 'Male', 'Engineer', 'New York', 'NY'),\n", " ('Mia', 'Clark', 35, 'Female', 'Software Developer', 'Los Angeles', 'CA'),\n", " ('Jack', 'Smithson', 43, 'Male', 'Analyst', 'Chicago', 'IL'),\n", " ('Ava', 'Johnsonson', 38, 'Female', 'Data Scientist', 'San Francisco', 'CA'),\n", " ('Henry', 'Brown', 36, 'Male', 'Lawyer', 'Miami', 'FL'),\n", " ('Sophie', 'Davis', 27, 'Female', 'Accountant', 'Seattle', 'WA'),\n", " ('Noah', 'Williamson', 45, 'Male', 'Marketing Manager', 'Boston', 'MA'),\n", " ('Mia', 'Williams', 29, 'Female', 'Analyst', 'Austin', 'TX'),\n", " ('Ethan', 'Jones', 38, 'Male', 'Engineer', 'Denver', 'CO'),\n", " ('Ava', 'Taylor', 42, 'Female', 'Analyst', 'Atlanta', 'GA'),\n", " ('Alex', 'Johnson', 33, 'Male', 'Software Developer', 'Portland', 'OR'),\n", " ('Mia', 'White', 27, 'Female', 'Data Scientist', 'San Diego', 'CA'),\n", " ('Daniel', 'Clark', 41, 'Male', 'Lawyer', 'Houston', 'TX'),\n", " ('Olivia', 'Martinez', 32, 'Female', 'Data Scientist', 'Phoenix', 'AZ'),\n", " ('William', 'Lee', 47, 'Male', 'Professor', 'Philadelphia', 'PA'),\n", " ('Emma', 'Garcia', 30, 'Female', 'Data Scientist', 'Minneapolis', 'MN'),\n", " ('Ryan', 'Harris', 40, 'Male', 'Entrepreneur', 'Dallas', 'TX'),\n", " ('Ava', 'Taylor', 26, 'Female', 'Product Manager', 'San Jose', 'CA'),\n", " ('Christopher', 'Anderson', 35, 'Male', 'Engineer', 'Detroit', 'MI'),\n", " ('Sophie', 'Wilson', 33, 'Female', 'HR Specialist', 'Raleigh', 'NC'),\n", " ('Grace', 'Brown', 55, 'Female', 'Analyst', 'New York', 'NY'),\n", " ('Brian', 'Taylor', 50, 'Male', 'Professor', 'Los Angeles', 'CA'),\n", " ('Lily', 'Harrison', 42, 'Female', 'Engineer', 'Chicago', 'IL'),\n", " ('Kevin', 'Garcia', 55, 'Male', 'Analyst', 'San Francisco', 'CA'),\n", " ('Sophie', 'Smithson', 48, 'Female', 'Lawyer', 'Miami', 'FL'),\n", " ('John', 'Clarkston', 38, 'Male', 'Software Developer', 'Seattle', 'WA'),\n", " ('Emma', 'Millerson', 40, 'Female', 'Accountant', 'Boston', 'MA'),\n", " ('David', 'Williamson', 45, 'Male', 'Marketing Manager', 'Austin', 'TX'),\n", " ('Ava', 'Johnsonson', 38, 'Female', 'Analyst', 'Denver', 'CO'),\n", " ('Daniel', 'Anderson', 42, 'Male', 'Data Scientist', 'Atlanta', 'GA'),\n", " ('Michael', 'Martinson', 33, 'Male', 'Engineer', 'Portland', 'OR'),\n", " ('Sophia', 'Whiterson', 27, 'Female', 'Data Scientist', 'San Diego', 'CA'),\n", " ('Robert', 'Davidson', 41, 'Male', 'Analyst', 'Houston', 'TX'),\n", " ('Grace', 'Wilson', 45, 'Female', 'Data Scientist', 'Phoenix', 'AZ'),\n", " ('Andrew', 'Leeson', 48, 'Male', 'Professor', 'Philadelphia', 'PA'),\n", " ('Emily', 'Garcia', 39, 'Female', 'Data Scientist', 'Minneapolis', 'MN'),\n", " ('Christopher', 'Sanders', 41, 'Male', 'Entrepreneur', 'Dallas', 'TX'),\n", " ('Olivia', 'Martinez', 36, 'Female', 'Product Manager', 'San Jose', 'CA'),\n", " ('Sophie', 'Brownson', 43, 'Female', 'Engineer', 'Detroit', 'MI'),\n", " ('Matthew', 'Taylors', 46, 'Male', 'HR Specialist', 'Raleigh', 'NC')\n", "\"\"\"\n", "cursor.execute(sql_script)\n", "\n", "# Commit the changes\n", "conn.commit()\n", "\n", "# Close the cursor and connection\n", "cursor.close()\n", "conn.close()" ] }, { "cell_type": "markdown", "metadata": { "id": "2XTYdjWhQR7S" }, "source": [ "# Use qrlew on the database" ] }, { "cell_type": "markdown", "metadata": { "id": "Xs69AaXOQR7S" }, "source": [ "At this stage, we have a postgres dataset that can be connected with `qrlew`." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "j8RWjR4cQR7S" }, "outputs": [], "source": [ "%%capture\n", "!pip install -U pyqrlew matplotlib graphviz" ] }, { "cell_type": "markdown", "metadata": { "id": "E1S7u-GMQR7S" }, "source": [ "The `dataset_from_database` function, located in `pyqrlew.io.dataset`, enables the creation of a Dataset object from a SQLAlchemy engine.\n", "It takes the following parameters:\n", "- `name`: the dataset's name,\n", "- `engine`: the SQLAlchemy engine establishing a connection to the database,\n", "- `schema_name`: the optional name of the schema,\n", "- `ranges`: if set to True, we use the the minimum and maximum values of numeric and datetime columns for setting their bounds.\n", " **This may violates differential privacy** and should be use if these values are public.\n", " Otherwise, use `False`. In that case, the bounds have to be provided in the SQL query,\n", "- `possible_values_threshold`: an optional integer that serves as a threshold for identifying a column as categorical. If a column surpasses this threshold, its categories are published. Similar to the ranges parameter, exercising this option may compromise differential privacy." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "7-4WwTHAQR7T" }, "outputs": [], "source": [ "from pyqrlew.io.dataset import dataset_from_database\n", "from sqlalchemy import create_engine\n", "\n", "connect_pars= {\n", " 'dbname':'my_database',\n", " 'user':'postgres',\n", " 'port':5432,\n", " 'password':'pyqrlew-db',\n", " 'host':'localhost',\n", "}\n", "engine = create_engine(f'postgresql+psycopg2://{connect_pars[\"user\"]}:{connect_pars[\"password\"]}@{connect_pars[\"host\"]}:{connect_pars[\"port\"]}/{connect_pars[\"dbname\"]}')\n", "ds = dataset_from_database(\n", " name=\"census\",\n", " engine=engine,\n", " schema_name=\"census_schema\",\n", " ranges=True,\n", " possible_values_threshold=5\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "20zBpJEmQR7T" }, "source": [ "`qrlew` connects to the database via the SQLAlchemy engine and fetch the tables." ] }, { "cell_type": "markdown", "metadata": { "id": "9SvsAaZhQR7U" }, "source": [ "The table `\"census_schema\".\"census_table\"` is represented as a `Table` variant of a `Relation`.\n", "This specific representation encapsulates all columns, their respective types, and optionally includes information about bounds and possible values.\n", "\n", "In the provided example, we observe the bounds specification for the integer column `age` and the declaration of possible values for the string column `gender`." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 201 }, "id": "oUZ-hdwRQR7U", "outputId": "63a44077-b5e3-4833-ff4e-299aec48d5d3" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_px87\n", "\n", "\n", "\n", "graph_px87\n", "\n", "CENSUS_SCHEMA_CENSUS_TABLE size ∈ int{70}\n", "id = id ∈ int[1 70]\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "age = age ∈ int[26 55]\n", "gender = gender ∈ str{Female, Male}\n", "occupation = occupation ∈ str\n", "city = city ∈ str\n", "state = state ∈ str\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import graphviz\n", "display(graphviz.Source(ds.census_schema.census_table.relation().dot()))" ] }, { "cell_type": "markdown", "metadata": { "id": "QpW4KTS_QR7U" }, "source": [ "Any SQL query can be transformed into a `Relation`, a `qrlew` object that allow to more easily the compilation into differential privacy." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 441 }, "id": "BEETjOfYQR7U", "outputId": "2f989cfc-18c3-4365-e284-ab920dde1444" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_c_nh\n", "\n", "\n", "\n", "graph_c_nh\n", "\n", "REDUCE_RTS_ size ∈ int[0 70]\n", "field_1wy3 = sum(field_z_jj) ∈ int[0 3850]\n", "GROUP BY (field_d4_o)\n", "\n", "\n", "\n", "graph_sqxd\n", "\n", "MAP_WTB7 size ∈ int[0 70]\n", "field_z_jj = age ∈ int[26 55]\n", "field_d4_o = gender ∈ str{Female, Male}\n", "\n", "\n", "\n", "graph_c_nh->graph_sqxd\n", "\n", "\n", "\n", "\n", "\n", "graph_px87\n", "\n", "CENSUS_SCHEMA_CENSUS_TABLE size ∈ int{70}\n", "id = id ∈ int[1 70]\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "age = age ∈ int[26 55]\n", "gender = gender ∈ str{Female, Male}\n", "occupation = occupation ∈ str\n", "city = city ∈ str\n", "state = state ∈ str\n", "\n", "\n", "\n", "graph_sqxd->graph_px87\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "import pandas as pd\n", "query = \"SELECT sum(age) FROM census_schema.census_table GROUP BY gender\"\n", "relation = ds.relation(query)\n", "display(graphviz.Source(relation.dot()))" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 590 }, "id": "NboNwp_pQR7U", "outputId": "672386f9-90b1-492d-a539-12f755e1ad4c" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_lkbb\n", "\n", "\n", "\n", "graph_lkbb\n", "\n", "MAP_VDVC size ∈ int[0 70]\n", "gender = field_d4_o ∈ str{Female, Male} UNIQUE\n", "count_all = field_i4xy ∈ int[0 70]\n", "\n", "\n", "\n", "graph_fz6f\n", "\n", "REDUCE_Z0RD size ∈ int[0 70]\n", "field_d4_o = first(field_d4_o) ∈ str{Female, Male} UNIQUE\n", "field_i4xy = count(field_n9r3) ∈ int[0 70]\n", "GROUP BY (field_d4_o)\n", "\n", "\n", "\n", "graph_lkbb->graph_fz6f\n", "\n", "\n", "\n", "\n", "\n", "graph_gfjv\n", "\n", "MAP_W7KV size ∈ int[0 70]\n", "field_n9r3 = 1 ∈ int{1}\n", "field_d4_o = gender ∈ str{Female, Male}\n", "WHERE (age = 10)\n", "\n", "\n", "\n", "graph_fz6f->graph_gfjv\n", "\n", "\n", "\n", "\n", "\n", "graph_px87\n", "\n", "CENSUS_SCHEMA_CENSUS_TABLE size ∈ int{70}\n", "id = id ∈ int[1 70]\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "age = age ∈ int[26 55]\n", "gender = gender ∈ str{Female, Male}\n", "occupation = occupation ∈ str\n", "city = city ∈ str\n", "state = state ∈ str\n", "\n", "\n", "\n", "graph_gfjv->graph_px87\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "query = \"SELECT gender as gender, COUNT(*) AS count_all FROM census_schema.census_table WHERE age = 10 GROUP BY gender\"\n", "relation = ds.relation(query)\n", "display(graphviz.Source(relation.dot()))" ] }, { "cell_type": "markdown", "metadata": { "id": "vsKdOqcJQR7U" }, "source": [ "Let's convert this `Relation` into differential privacy with the `rewrite_with_differential_privacy` method that outputs:\n", "- `privacy_unit`: the privacy unit may be defined as a column within the current table or even another table. Alternatively, each row can be protected.\n", "- `synthetic_data`: this optional list associates each original table to its synthetic version. If these tables are provided, any queries that can't be compiled into DP will be executed on the synthetic data\n", "- `epsilon_delta`: a dictionnary with the $\\varepsilon$ and $\\delta$ privacy parameters." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "mRTKmSDfQR7U" }, "outputs": [], "source": [ "privacy_unit = [\n", " (\"census_table\", [], \"_PRIVACY_UNIT_ROW_\"), # each row is protected\n", "]\n", "synthetic_data = None\n", "\n", "dp_relation = relation.rewrite_with_differential_privacy(\n", " dataset=ds,\n", " synthetic_data=synthetic_data,\n", " privacy_unit=privacy_unit,\n", " epsilon_delta={\"epsilon\": 1.0, \"delta\": 1e-3}\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "ZksZxGQqQR7V" }, "source": [ "The result of `rewrite_with_differential_privacy` is an object that holds both the differentially private `Relation` and a `PrivateQuery` containing the invoked mechanisms during the compilation." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 1000 }, "id": "zN_4fJt2QR7V", "outputId": "c82d26d7-bcce-4531-b906-445b69dc72b4" }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "graph_rhz9\n", "\n", "\n", "\n", "graph_rhz9\n", "\n", "MAP_VDVC size ∈ int[0 140]\n", "gender = field_d4_o ∈ str{Female, Male} UNIQUE\n", "count_all = field_i4xy ∈ int[0 140]\n", "\n", "\n", "\n", "graph_i4m8\n", "\n", "MAP_09T5 size ∈ int[0 140]\n", "field_d4_o = field_d4_o ∈ str{Female, Male} UNIQUE\n", "field_i4xy = field_i4xy ∈ int[0 140]\n", "\n", "\n", "\n", "graph_rhz9->graph_i4m8\n", "\n", "\n", "\n", "\n", "\n", "graph_a1i4\n", "\n", "MAP_ITYA size ∈ int[0 140]\n", "field_i4xy = cast_as_integer(_COUNT_field_n9r3) ∈ int[0 140]\n", "field_d4_o = field_d4_o ∈ str{Female, Male} UNIQUE\n", "\n", "\n", "\n", "graph_i4m8->graph_a1i4\n", "\n", "\n", "\n", "\n", "\n", "graph_rteh\n", "\n", "MAP_M6IS size ∈ int[0 140]\n", "field_d4_o = field_d4_o ∈ str{Female, Male} UNIQUE\n", "_COUNT_field_n9r3 = least(140, greatest(0, (coalesce(_COUNT_field_n9r3, 0) + (3.776479532659047 * (sqrt((-2 * ln(random()))) ...\n", "\n", "\n", "\n", "graph_a1i4->graph_rteh\n", "\n", "\n", "\n", "\n", "\n", "graph_aluv\n", "\n", "REDUCE_TDCW size ∈ int[0 140]\n", "field_d4_o = first(field_zyse) ∈ str{Female, Male} UNIQUE\n", "_COUNT_field_n9r3 = sum(field_8d51) ∈ option(float[0 140])\n", "GROUP BY (field_zyse)\n", "\n", "\n", "\n", "graph_rteh->graph_aluv\n", "\n", "\n", "\n", "\n", "\n", "graph_j9ct\n", "\n", "MAP_ORR7 size ∈ int[0 140]\n", "field_8d51 = _CLIPPED__ONE_field_n9r3 ∈ option(float[0.000603681610520369 1])\n", "field_zyse = field_d4_o ∈ str{Female, Male}\n", "\n", "\n", "\n", "graph_aluv->graph_j9ct\n", "\n", "\n", "\n", "\n", "\n", "graph_ms3c\n", "\n", "MAP_M3CM size ∈ int[0 140]\n", "_PRIVACY_UNIT_ = _PRIVACY_UNIT_ ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = _PRIVACY_UNIT_WEIGHT_ ∈ float{0, 1}\n", "field_d4_o = field_d4_o ∈ str{Female, Male}\n", "_ONE_field_n9r3 = _ONE_field_n9r3 ∈ option(float{1})\n", "_SCALE_FACTOR__PRIVACY_UNIT_ = _SCALE_FACTOR__PRIVACY_UNIT_ ∈ str\n", "_SCALE_FACTOR__ONE_field_n9r3 = _SCALE_FACTOR__ONE_field_n9r3 ∈ option(float[0.000603681610520369 1])\n", "_CLIPPED__ONE_field_n9r3 = (_ONE_field_n9r3 * _SCALE_FACTOR__ONE_field_n9r3) ∈ option(float[0.000603681610520369 1])\n", "\n", "\n", "\n", "graph_j9ct->graph_ms3c\n", "\n", "\n", "\n", "\n", "\n", "graph_n6pl\n", "\n", "JOIN_0GI9 size ∈ int[0 140]\n", "_PRIVACY_UNIT_ = _LEFT_._PRIVACY_UNIT_ ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = _LEFT_._PRIVACY_UNIT_WEIGHT_ ∈ float{0, 1}\n", "field_d4_o = _LEFT_.field_d4_o ∈ str{Female, Male}\n", "_ONE_field_n9r3 = _LEFT_._ONE_field_n9r3 ∈ option(float{1})\n", "_SCALE_FACTOR__PRIVACY_UNIT_ = _RIGHT_._PRIVACY_UNIT_ ∈ str\n", "_SCALE_FACTOR__ONE_field_n9r3 = _RIGHT_._ONE_field_n9r3 ∈ option(float[0.000603681610520369 1])\n", "INNER ON (_LEFT_._PRIVACY_UNIT_ = _RIGHT_._PRIVACY_UNIT_)\n", "\n", "\n", "\n", "graph_ms3c->graph_n6pl\n", "\n", "\n", "\n", "\n", "\n", "graph_0uw9\n", "\n", "MAP_FRO8 size ∈ int[0 140]\n", "_PRIVACY_UNIT_ = _PRIVACY_UNIT_ ∈ str UNIQUE\n", "_ONE_field_n9r3 = (1 / greatest(1, (sqrt(_ONE_field_n9r3) / 1))) ∈ option(float[0.000603681610520369 1])\n", "\n", "\n", "\n", "graph_n6pl->graph_0uw9\n", "\n", "\n", "\n", "\n", "\n", "graph_v35s\n", "\n", "MAP_66HX size ∈ int[0 140]\n", "_PRIVACY_UNIT_ = coalesce(cast_as_text(_PRIVACY_UNIT_), _PRIVACY_UNIT_DEFAULT_) ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = coalesce(_PRIVACY_UNIT_WEIGHT_, 0) ∈ float{0, 1}\n", "field_d4_o = field_d4_o ∈ str{Female, Male}\n", "_ONE_field_n9r3 = case(is_null(field_n9r3), 0, 1) ∈ option(float{1})\n", "\n", "\n", "\n", "graph_n6pl->graph_v35s\n", "\n", "\n", "\n", "\n", "\n", "graph_sfnz\n", "\n", "REDUCE_SMED size ∈ int[0 140]\n", "_PRIVACY_UNIT_ = first(field_grep) ∈ str UNIQUE\n", "_ONE_field_n9r3 = sum(field_7ra5) ∈ option(float[0 2744000])\n", "GROUP BY (field_grep)\n", "\n", "\n", "\n", "graph_0uw9->graph_sfnz\n", "\n", "\n", "\n", "\n", "\n", "graph_z5mx\n", "\n", "MAP_HFZQ size ∈ int[0 140]\n", "field_7ra5 = _NORM__ONE_field_n9r3 ∈ option(float[0 19600])\n", "field_grep = _PRIVACY_UNIT_ ∈ str\n", "\n", "\n", "\n", "graph_sfnz->graph_z5mx\n", "\n", "\n", "\n", "\n", "\n", "graph_248a\n", "\n", "MAP_2N4O size ∈ int[0 140]\n", "_PRIVACY_UNIT_ = _PRIVACY_UNIT_ ∈ str\n", "field_d4_o = field_d4_o ∈ str{Female, Male}\n", "_NORM__ONE_field_n9r3 = abs((_NORM__ONE_field_n9r3 * _NORM__ONE_field_n9r3)) ∈ option(float[0 19600])\n", "\n", "\n", "\n", "graph_z5mx->graph_248a\n", "\n", "\n", "\n", "\n", "\n", "graph_k4sy\n", "\n", "REDUCE_3SQD size ∈ int[0 140]\n", "_PRIVACY_UNIT_ = first(field_grep) ∈ str\n", "field_d4_o = first(field_zyse) ∈ str{Female, Male}\n", "_NORM__ONE_field_n9r3 = sum(field_1iro) ∈ option(int[0 140])\n", "GROUP BY (field_grep, field_zyse)\n", "\n", "\n", "\n", "graph_248a->graph_k4sy\n", "\n", "\n", "\n", "\n", "\n", "graph_1cob\n", "\n", "MAP_2PCL size ∈ int[0 140]\n", "field_1iro = _ONE_field_n9r3 ∈ option(float{1})\n", "field_grep = _PRIVACY_UNIT_ ∈ str\n", "field_zyse = field_d4_o ∈ str{Female, Male}\n", "\n", "\n", "\n", "graph_k4sy->graph_1cob\n", "\n", "\n", "\n", "\n", "\n", "graph_1cob->graph_v35s\n", "\n", "\n", "\n", "\n", "\n", "graph_c6wu\n", "\n", "MAP_LPGC size ∈ int[0 140]\n", "field_d4_o = field_d4_o ∈ str{Female, Male}\n", "_PRIVACY_UNIT_ = _PRIVACY_UNIT_ ∈ option(str)\n", "_PRIVACY_UNIT_WEIGHT_ = _PRIVACY_UNIT_WEIGHT_ ∈ option(int{1})\n", "field_n9r3 = field_n9r3 ∈ option(int{1})\n", "\n", "\n", "\n", "graph_v35s->graph_c6wu\n", "\n", "\n", "\n", "\n", "\n", "graph_nr52\n", "\n", "JOIN_FD1M size ∈ int[0 140]\n", "field_d4_o = _LEFT_.field_d4_o ∈ str{Female, Male}\n", "_PRIVACY_UNIT_ = _RIGHT_._PRIVACY_UNIT_ ∈ option(str)\n", "_PRIVACY_UNIT_WEIGHT_ = _RIGHT_._PRIVACY_UNIT_WEIGHT_ ∈ option(int{1})\n", "field_n9r3 = _RIGHT_.field_n9r3 ∈ option(int{1})\n", "left__kq_o = _RIGHT_.field_d4_o ∈ option(str{Female, Male})\n", "LEFT ON (_LEFT_.field_d4_o = _RIGHT_.field_d4_o)\n", "\n", "\n", "\n", "graph_c6wu->graph_nr52\n", "\n", "\n", "\n", "\n", "\n", "graph_s4t9\n", "\n", "FIELD_D4_O size ∈ int{2}\n", "[Female, Male]\n", "\n", "\n", "\n", "graph_nr52->graph_s4t9\n", "\n", "\n", "\n", "\n", "\n", "graph_v4pl\n", "\n", "MAP_W7KV size ∈ int[0 70]\n", "_PRIVACY_UNIT_ = _PRIVACY_UNIT_ ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = _PRIVACY_UNIT_WEIGHT_ ∈ int{1}\n", "field_n9r3 = 1 ∈ int{1}\n", "field_d4_o = gender ∈ str{Female, Male}\n", "WHERE (age = 10)\n", "\n", "\n", "\n", "graph_nr52->graph_v4pl\n", "\n", "\n", "\n", "\n", "\n", "graph_e7tl\n", "\n", "MAP_LG9F size ∈ int[0 70]\n", "_PRIVACY_UNIT_ = md5(cast_as_text(_PRIVACY_UNIT_ROW_)) ∈ str\n", "_PRIVACY_UNIT_WEIGHT_ = 1 ∈ int{1}\n", "_PRIVACY_UNIT_ROW_ = _PRIVACY_UNIT_ROW_ ∈ float[0 1]\n", "id = id ∈ int[1 70]\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "age = age ∈ int[26 55]\n", "gender = gender ∈ str{Female, Male}\n", "occupation = occupation ∈ str\n", "city = city ∈ str\n", "state = state ∈ str\n", "\n", "\n", "\n", "graph_v4pl->graph_e7tl\n", "\n", "\n", "\n", "\n", "\n", "graph_zcpn\n", "\n", "MAP_XT3G size ∈ int[0 70]\n", "_PRIVACY_UNIT_ROW_ = random() ∈ float[0 1]\n", "id = id ∈ int[1 70]\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "age = age ∈ int[26 55]\n", "gender = gender ∈ str{Female, Male}\n", "occupation = occupation ∈ str\n", "city = city ∈ str\n", "state = state ∈ str\n", "\n", "\n", "\n", "graph_e7tl->graph_zcpn\n", "\n", "\n", "\n", "\n", "\n", "graph_px87\n", "\n", "CENSUS_SCHEMA_CENSUS_TABLE size ∈ int{70}\n", "id = id ∈ int[1 70]\n", "first_name = first_name ∈ str\n", "last_name = last_name ∈ str\n", "age = age ∈ int[26 55]\n", "gender = gender ∈ str{Female, Male}\n", "occupation = occupation ∈ str\n", "city = city ∈ str\n", "state = state ∈ str\n", "\n", "\n", "\n", "graph_zcpn->graph_px87\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display(graphviz.Source(dp_relation.relation().dot()))" ] }, { "cell_type": "markdown", "metadata": { "id": "wskO1NjaQR7V" }, "source": [ "In this example, the only invocked mechanism is a gaussain mechanism for adding noise to the count." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "sV2b9HpOQR7V", "outputId": "9aead7c1-e28b-4485-86a7-2ff8bd3bf874" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'module_name': 'dp_accounting.dp_event', 'class_name': 'GaussianDpEvent', 'noise_multiplier': 3.776479532659047, '_fields': ['module_name', 'class_name', 'noise_multiplier']}\n" ] } ], "source": [ "print(dp_relation.dp_event().to_dict())" ] }, { "cell_type": "markdown", "metadata": { "id": "fU4CPTtnQR7V" }, "source": [ "Any `Relation` can be transformed into an SQL query:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "hbQCKX5HQR7V", "outputId": "b7107ded-52b2-416d-83fb-41c0b5ae5aec" }, "outputs": [ { "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "\"WITH field_d4_o (field_d4_o) AS (SELECT * FROM (VALUES ('Female'), ('Male')) AS field_d4_o (field_d4_o)), map_xt3g (_PRIVACY_UNIT_ROW_, id, first_name, last_name, age, gender, occupation, city, state) AS (SELECT RANDOM() AS _PRIVACY_UNIT_ROW_, id AS id, first_name AS first_name, last_name AS last_name, age AS age, gender AS gender, occupation AS occupation, city AS city, state AS state FROM census_schema.census_table), map_lg9f (_PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, _PRIVACY_UNIT_ROW_, id, first_name, last_name, age, gender, occupation, city, state) AS (SELECT MD5(CAST(_PRIVACY_UNIT_ROW_ AS TEXT)) AS _PRIVACY_UNIT_, 1 AS _PRIVACY_UNIT_WEIGHT_, _PRIVACY_UNIT_ROW_ AS _PRIVACY_UNIT_ROW_, id AS id, first_name AS first_name, last_name AS last_name, age AS age, gender AS gender, occupation AS occupation, city AS city, state AS state FROM map_xt3g), map_w7kv (_PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_n9r3, field_d4_o) AS (SELECT _PRIVACY_UNIT_ AS _PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_ AS _PRIVACY_UNIT_WEIGHT_, 1 AS field_n9r3, gender AS field_d4_o FROM map_lg9f WHERE (age) = (10)), join_fd1m (field_d4_o, _PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_n9r3, left__kq_o) AS (SELECT * FROM field_d4_o AS _LEFT_ LEFT JOIN map_w7kv AS _RIGHT_ ON (_LEFT_.field_d4_o) = (_RIGHT_.field_d4_o)), map_lpgc (field_d4_o, _PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_n9r3) AS (SELECT field_d4_o AS field_d4_o, _PRIVACY_UNIT_ AS _PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_ AS _PRIVACY_UNIT_WEIGHT_, field_n9r3 AS field_n9r3 FROM join_fd1m), map_66hx (_PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_d4_o, _ONE_field_n9r3) AS (SELECT COALESCE(CAST(_PRIVACY_UNIT_ AS TEXT), '_PRIVACY_UNIT_DEFAULT_') AS _PRIVACY_UNIT_, COALESCE(_PRIVACY_UNIT_WEIGHT_, 0) AS _PRIVACY_UNIT_WEIGHT_, field_d4_o AS field_d4_o, CASE WHEN field_n9r3 IS NULL THEN 0 ELSE 1 END AS _ONE_field_n9r3 FROM map_lpgc), map_2pcl (field_1iro, field_grep, field_zyse) AS (SELECT _ONE_field_n9r3 AS field_1iro, _PRIVACY_UNIT_ AS field_grep, field_d4_o AS field_zyse FROM map_66hx), reduce_3sqd (_PRIVACY_UNIT_, field_d4_o, _NORM__ONE_field_n9r3) AS (SELECT field_grep AS _PRIVACY_UNIT_, field_zyse AS field_d4_o, SUM(field_1iro) AS _NORM__ONE_field_n9r3 FROM map_2pcl GROUP BY field_grep, field_zyse), map_2n4o (_PRIVACY_UNIT_, field_d4_o, _NORM__ONE_field_n9r3) AS (SELECT _PRIVACY_UNIT_ AS _PRIVACY_UNIT_, field_d4_o AS field_d4_o, ABS((_NORM__ONE_field_n9r3) * (_NORM__ONE_field_n9r3)) AS _NORM__ONE_field_n9r3 FROM reduce_3sqd), map_hfzq (field_7ra5, field_grep) AS (SELECT _NORM__ONE_field_n9r3 AS field_7ra5, _PRIVACY_UNIT_ AS field_grep FROM map_2n4o), reduce_smed (_PRIVACY_UNIT_, _ONE_field_n9r3) AS (SELECT field_grep AS _PRIVACY_UNIT_, SUM(field_7ra5) AS _ONE_field_n9r3 FROM map_hfzq GROUP BY field_grep), map_fro8 (_PRIVACY_UNIT_, _ONE_field_n9r3) AS (SELECT _PRIVACY_UNIT_ AS _PRIVACY_UNIT_, (1) / (GREATEST(1, (SQRT(_ONE_field_n9r3)) / (1))) AS _ONE_field_n9r3 FROM reduce_smed), join_0gi9 (_PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_d4_o, _ONE_field_n9r3, _SCALE_FACTOR__PRIVACY_UNIT_, _SCALE_FACTOR__ONE_field_n9r3) AS (SELECT * FROM map_66hx AS _LEFT_ JOIN map_fro8 AS _RIGHT_ ON (_LEFT_._PRIVACY_UNIT_) = (_RIGHT_._PRIVACY_UNIT_)), map_m3cm (_PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_, field_d4_o, _ONE_field_n9r3, _SCALE_FACTOR__PRIVACY_UNIT_, _SCALE_FACTOR__ONE_field_n9r3, _CLIPPED__ONE_field_n9r3) AS (SELECT _PRIVACY_UNIT_ AS _PRIVACY_UNIT_, _PRIVACY_UNIT_WEIGHT_ AS _PRIVACY_UNIT_WEIGHT_, field_d4_o AS field_d4_o, _ONE_field_n9r3 AS _ONE_field_n9r3, _SCALE_FACTOR__PRIVACY_UNIT_ AS _SCALE_FACTOR__PRIVACY_UNIT_, _SCALE_FACTOR__ONE_field_n9r3 AS _SCALE_FACTOR__ONE_field_n9r3, (_ONE_field_n9r3) * (_SCALE_FACTOR__ONE_field_n9r3) AS _CLIPPED__ONE_field_n9r3 FROM join_0gi9), map_orr7 (field_8d51, field_zyse) AS (SELECT _CLIPPED__ONE_field_n9r3 AS field_8d51, field_d4_o AS field_zyse FROM map_m3cm), reduce_tdcw (field_d4_o, _COUNT_field_n9r3) AS (SELECT field_zyse AS field_d4_o, SUM(field_8d51) AS _COUNT_field_n9r3 FROM map_orr7 GROUP BY field_zyse), map_m6is (field_d4_o, _COUNT_field_n9r3) AS (SELECT field_d4_o AS field_d4_o, LEAST(140, GREATEST(0, (COALESCE(_COUNT_field_n9r3, 0)) + ((3.776479532659047) * ((SQRT((-2) * (LN(RANDOM())))) * (COS((6.283185307179586) * (RANDOM()))))))) AS _COUNT_field_n9r3 FROM reduce_tdcw), map_itya (field_i4xy, field_d4_o) AS (SELECT CAST(_COUNT_field_n9r3 AS INTEGER) AS field_i4xy, field_d4_o AS field_d4_o FROM map_m6is), map_09t5 (field_d4_o, field_i4xy) AS (SELECT field_d4_o AS field_d4_o, field_i4xy AS field_i4xy FROM map_itya), map_vdvc (gender, count_all) AS (SELECT field_d4_o AS gender, field_i4xy AS count_all FROM map_09t5) SELECT * FROM map_vdvc\"" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dp_sql_query = dp_relation.relation().to_query()\n", "dp_sql_query" ] }, { "cell_type": "markdown", "metadata": { "id": "gbS3OMG7QR7V" }, "source": [ "The DP query can then be sent to the database:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "NzEc4hjmQR7V", "outputId": "711723cd-6def-40d4-cde6-b3384ab97c63" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendercount_all
0Female1
1Male3
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " gender count_all\n", "0 Female 1\n", "1 Male 3" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dp_results = pd.read_sql(dp_sql_query, engine)\n", "dp_results" ] }, { "cell_type": "markdown", "metadata": { "id": "2ZykqVBrQR7V" }, "source": [ "This can be compared to the result of the original query:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 53 }, "id": "lybqRq7zQR7V", "outputId": "6d0b465d-ae50-4691-a1b6-b37fcb70fbd2" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
gendercount_all
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "
\n", "
\n" ], "text/plain": [ "Empty DataFrame\n", "Columns: [gender, count_all]\n", "Index: []" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "exact_results = pd.read_sql(query, engine)\n", "exact_results" ] }, { "cell_type": "markdown", "metadata": { "id": "9WhEScfWQR7V" }, "source": [ "## Queries with `GROUP BY` clause\n", "\n", "f the grouping keys are not public, we utilize the $\\tau$-thresholding mechanism, which filters out keys with noisy counts below a specific threshold determined by the privacy parameters." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "6JGVFIZ6QR7V", "outputId": "7bdfcb5d-9a07-426d-9691-38e3ffa839c0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'module_name': 'dp_accounting.dp_event', 'class_name': 'ComposedDpEvent', 'events': [{'module_name': 'dp_accounting.dp_event', 'class_name': 'EpsilonDeltaDpEvent', 'epsilon': 2.5, 'delta': 0.00025, '_fields': ['module_name', 'class_name', 'epsilon', 'delta']}, {'module_name': 'dp_accounting.dp_event', 'class_name': 'GaussianDpEvent', 'noise_multiplier': 5.2624632808831, '_fields': ['module_name', 'class_name', 'noise_multiplier']}, {'module_name': 'dp_accounting.dp_event', 'class_name': 'GaussianDpEvent', 'noise_multiplier': 5.2624632808831, '_fields': ['module_name', 'class_name', 'noise_multiplier']}, {'module_name': 'dp_accounting.dp_event', 'class_name': 'GaussianDpEvent', 'noise_multiplier': 289.4354804485705, '_fields': ['module_name', 'class_name', 'noise_multiplier']}], '_fields': ['module_name', 'class_name', 'events']}\n" ] } ], "source": [ "query = \"\"\"\n", " SELECT occupation as occupation, COUNT(*) AS count_occ, AVG(age) AS avg_age\n", " FROM census_schema.census_table\n", " GROUP BY occupation\n", "\"\"\"\n", "relation = ds.relation(query)\n", "privacy_unit = [\n", " (\"census_table\", [], \"_PRIVACY_UNIT_ROW_\"),\n", "]\n", "synthetic_data = None\n", "\n", "dp_relation = relation.rewrite_with_differential_privacy(\n", " dataset=ds,\n", " synthetic_data=synthetic_data,\n", " privacy_unit=privacy_unit,\n", " epsilon_delta={\"epsilon\": 5.0, \"delta\": 5e-4}\n", ")\n", "print(dp_relation.dp_event().to_dict())" ] }, { "cell_type": "markdown", "metadata": { "id": "mJe8uSbYQR7W" }, "source": [ "In that case, 3 gaussian mechanisms have been used for adding noise to the aggregations (1 for the count, 2 for the average) and a $(\\varepsilon, \\delta)$ mechanism for the $\\tau$-thresholding.\n", "\n", "If you execute the next cell multiple times, you will notice that the grouping keys may vary due to the randomness introduced by the $\\tau$-thresholding mechanism." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 143 }, "id": "HfuMVURhQR7W", "outputId": "955c8a1e-e372-4c29-9194-106027ec74e0" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
occupationcount_occavg_age
0Analyst1774.085976
1Engineer2122.989924
2Data Scientist1931.201097
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " occupation count_occ avg_age\n", "0 Analyst 17 74.085976\n", "1 Engineer 21 22.989924\n", "2 Data Scientist 19 31.201097" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dp_results = pd.read_sql(dp_relation.relation().to_query(), engine)\n", "dp_results" ] }, { "cell_type": "markdown", "metadata": { "id": "WxsBKBH0QR7W" }, "source": [ "We can also include the grouping keys in the query's `WHERE` clause. In this case, the grouping keys are treated as public, and they all appear in the result and the $\\tau$-thresholding mechnaism is not used." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "W9nia6LUQR7W", "outputId": "8c7890f3-3da0-4420-b4fb-747742e60dfe" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'module_name': 'dp_accounting.dp_event', 'class_name': 'ComposedDpEvent', 'events': [{'module_name': 'dp_accounting.dp_event', 'class_name': 'GaussianDpEvent', 'noise_multiplier': 2.5346230440950324, '_fields': ['module_name', 'class_name', 'noise_multiplier']}, {'module_name': 'dp_accounting.dp_event', 'class_name': 'GaussianDpEvent', 'noise_multiplier': 2.5346230440950324, '_fields': ['module_name', 'class_name', 'noise_multiplier']}, {'module_name': 'dp_accounting.dp_event', 'class_name': 'GaussianDpEvent', 'noise_multiplier': 139.40426742522678, '_fields': ['module_name', 'class_name', 'noise_multiplier']}], '_fields': ['module_name', 'class_name', 'events']}\n" ] } ], "source": [ "query = \"\"\"\n", " SELECT occupation as occupation, COUNT(*) AS count_occ, AVG(age) AS avg_age\n", " FROM census_schema.census_table\n", " WHERE occupation in ('Analyst', 'Professor')\n", " GROUP BY occupation\n", "\"\"\"\n", "relation = ds.relation(query)\n", "privacy_unit = [\n", " (\"census_table\", [], \"_PRIVACY_UNIT_ROW_\"),\n", "]\n", "synthetic_data = None\n", "\n", "dp_relation = relation.rewrite_with_differential_privacy(\n", " dataset=ds,\n", " synthetic_data=synthetic_data,\n", " privacy_unit=privacy_unit,\n", " epsilon_delta={\"epsilon\": 5.0, \"delta\": 5e-4}\n", ")\n", "print(dp_relation.dp_event().to_dict())" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 112 }, "id": "wjBMd8Y3QR7W", "outputId": "17d335d1-c646-421b-8ddb-d70a74c10056" }, "outputs": [ { "data": { "text/html": [ "\n", "
\n", "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
occupationcount_occavg_age
0Professor038.499150
1Analyst1836.381657
\n", "
\n", "
\n", "\n", "
\n", " \n", "\n", " \n", "\n", " \n", "
\n", "\n", "\n", "
\n", " \n", "\n", "\n", "\n", " \n", "
\n", "
\n", "
\n" ], "text/plain": [ " occupation count_occ avg_age\n", "0 Professor 0 38.499150\n", "1 Analyst 18 36.381657" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dp_results = pd.read_sql(dp_relation.relation().to_query(), engine)\n", "dp_results" ] }, { "cell_type": "markdown", "metadata": { "id": "RsjAEzRSQR7W" }, "source": [ "## Not DP-compilable queries\n", "\n", "If the query cannot by DP-compiled by `qrlew` and no synthetic data have been provided, then `qrlew` returns an error." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "Qxv3WYxsQR7W", "outputId": "93f2f2df-a7a5-448c-c5b7-86ed547e6176" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The error is: UnreachableProperty: differential_privacy is unreachable\n", "\n" ] } ], "source": [ "query = \"\"\"\n", " SELECT occupation as occupation, MAX(age) AS max_age\n", " FROM census_schema.census_table\n", " GROUP BY occupation\n", "\"\"\"\n", "relation = ds.relation(query)\n", "privacy_unit = [\n", " (\"census_table\", [], \"_PRIVACY_UNIT_ROW_\"),\n", "]\n", "synthetic_data = None\n", "try:\n", " dp_relation = relation.rewrite_with_differential_privacy(\n", " dataset=ds,\n", " synthetic_data=synthetic_data,\n", " privacy_unit=privacy_unit,\n", " epsilon_delta={\"epsilon\": 5.0, \"delta\": 5e-4}\n", " ).relation()\n", " dp_results = pd.read_sql(dp_relation.to_query(), engine)\n", " dp_results\n", "except RuntimeError as e:\n", " print(f\"The error is: {e}\")" ] }, { "cell_type": "markdown", "metadata": { "id": "QgALx6T4QR7j" }, "source": [ "If the user provides the list of the original and synthetic tables, then the query is sent to the synthetic tables and no budget has been spent." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "HuHyYsvAQR7j", "outputId": "880087de-6aa9-43b5-bebf-90f91f427b38" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'module_name': 'dp_accounting.dp_event', 'class_name': 'NoOpDpEvent', '_fields': ['module_name', 'class_name']}\n" ] } ], "source": [ "query = \"\"\"\n", " SELECT occupation as occupation, MAX(age) AS max_age\n", " FROM census_schema.census_table\n", " GROUP BY occupation\n", "\"\"\"\n", "relation = ds.relation(query)\n", "privacy_unit = [(\"census_table\", [], \"_PRIVACY_UNIT_ROW_\")]\n", "synthetic_data = [\n", " ([\"census_schema\", \"census_table\"], [\"census_schema\", \"synthetic_census_table\"])\n", "]\n", "\n", "dp_relation = relation.rewrite_with_differential_privacy(\n", " dataset=ds,\n", " synthetic_data=synthetic_data,\n", " privacy_unit=privacy_unit,\n", " epsilon_delta={\"epsilon\": 5.0, \"delta\": 5e-4}\n", ")\n", "print(dp_relation.dp_event().to_dict())" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "id": "RRcXF2qrQR7j" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "myenv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.7" } }, "nbformat": 4, "nbformat_minor": 0 }