Pyqrlew core#

class pyqrlew.Dataset[source]#

A Dataset is a set of SQL Tables.

Examples

Creating a Dataset from an existing database with an sqlalchemy engine

>>> import pyqrlew as qrl
>>> from sqlalchemy import create_engine
>>> engine = create_engine("postgresql+psycopg2://****/mydatabase")
>>> dataset = Dataset.from_database(name='extract', engine=DB.engine(), schema_name='extract', range=False, possible_values_threshold=None)

Creating a Dataset from queries and a previous dataset. Here with WHERE statement we also determine the bounds on the age column.

>>> queries = [
>>>    (("ds_name", "new_schema", "tab1"), 'SELECT * FROM extract.census WHERE age < 18 AND age > 0'),
>>>    (("ds_name", "new_schema", "tab2"), 'SELECT * FROM extract.census WHERE age >= 18  AND age < 120'),
>>> ]
>>> new_dataset = dataset.from_queries(queries)

Parameters:: dataset (_Dataset) –

static from_str(dataset: str, schema: str, size: str) → Dataset[source]#

Factory method to create a Dataset from string representations compatible with protocol buffers defined here.

Parameters:

dataset (str) – string representation of a dataset
schema (str) – string representation of the dataset’s schema
size (str) – string representation of the dataset’s size

Return type:

Dataset

static from_database(name: str, engine: Engine, schema_name: str | None = None, ranges: bool = False, possible_values_threshold: int | None = None) → Dataset[source]#

Builds a Dataset from a sqlalchemy Engine.

Parameters:

name (str) – Name of the Dataset
engine (Engine) – The sqlalchemy Engine to use
schema_name (Optional[str], optional) – The DB schema to use. Defaults to None.
ranges (bool, optional) – Use the actual min and max of the data as ranges. This is unsafe from a privacy perspective. Defaults to False. If False numeric values ranges will be considered between -2.0^50 to 2.0^50.
possible_values_threshold (Optional[int], optional) – Use the actual observed values as range. This is unsafe from a privacy perspective. Defaults to None.

Return type:

Dataset

with_range(schema_name: str | None, table_name: str, field_name: str, min: float, max: float) → Dataset[source]#

Returns a new Dataset with a defined range for a given numeric column. Check out more here!

Parameters:

schema_name (str) – schema
table_name (str) – table
field_name (str) – column
min (float) – min range
max (float) – max range

Return type:

Dataset

with_possible_values(schema_name: str | None, table_name: str, field_name: str, possible_values: Iterable[str]) → Dataset[source]#

Returns a new Dataset with a defined possible values for a given text column. Check out more here!

Parameters:

schema_name (str) – schema
table_name (str) – table
field_name (str) – column
possible_values (Iterable[str]) – a sequence with wanted possible values

Return type:

Dataset

with_constraint(schema_name: str | None, table_name: str, field_name: str, constraint: str | None) → Dataset[source]#

Returns a new Dataset with a constraint on given column. Check out more here!

Parameters:

schema_name (str) – schema
table_name (str) – table
field_name (str) – column
constraint (Optional[str]) – Unique or PrimaryKey

Return type:

Dataset

relations() → Iterable[Tuple[List[str], Relation]][source]#

Returns the Dataset’s Relations and their corresponding path

Return type:: Iterable[Tuple[List[str], Relation]]

relation(query: str, dialect: Dialect | None = None) → Relation[source]#

Returns a Relation from am SQL query.

Parameters:

query (str) – SQL query used to build the Relation.
dialect (Optional[Dialect]) – query’s dialect. If not provided, it is assumed to be PostgreSql.

Return type:

Relation

from_queries(queries: Iterable[Tuple[Iterable[str], str]], dialect: Dialect | None = None) → Dataset[source]#

Returns a dataset from queries.

Parameters:

queries (Iterable[Tuple[Iterable[str], str]]) – A sequence of (path, SQL query). The resulting Dataset will have a Relation for each query identified in the dataset by the corresponding path.
dialect (Optional[Dialect]) – queries dialect. If not provided, it is assumed to be PostgreSql.

Return type:

Dataset

class pyqrlew.Relation[source]#

A Relation is a Dataset transformed by a SQL query.

Example

Create a relation from a dataset

>>> from pyqrlew import Dialect
>>> query = "SELECT AVG(age), sex FROM extract.census GROUP BY sex"
>>> relation = dataset.relation(query, Dialect.PostgreSql)

Or alternatively

>>> from pyqrlew import Relation
>>> relation = Relation.from_query(query=query, dataset=dataset, dialect=Dialect.PostgreSql)

Parameters:: relation (_Relation) –

static from_query(query: str, dataset: Dataset, dialect: Dialect | None = None) → Relation[source]#

Builds a Relation from a query and a dataset

Parameters:

query (str) – sql query.
dataset (Dataset) – the Dataset.
dialect (Optional[Dialect]) – query’s dialect. If not provided, it is assumed to be PostgreSql

Return type:

Relation

to_query(dialect: Dialect | None = None) → str[source]#

Returns an SQL representation of the Relation.

Parameters:: dialect (Optional[Dialect]) – dialect of generated sql query. If no dialect is provided, the query will be in PostgreSql.
Return type:: str

rewrite_as_privacy_unit_preserving(dataset: Dataset, privacy_unit: Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str]] | Tuple[Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str]], bool] | Tuple[Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str, str]], bool], epsilon_delta: Dict[str, float], max_multiplicity: float | None = None, max_multiplicity_share: float | None = None, synthetic_data: Sequence[Tuple[Sequence[str], Sequence[str]]] | None = None, strategy: Strategy | None = None) → RelationWithDpEvent[source]#

Returns as RelationWithDpEvent where it’s relation propagates the privacy unit through the query. Check out more here!

Parameters:

dataset (Dataset) – Dataset with needed relations
privacy_unit (PrivacyUnit) – Definition of privacy unit to be protected. Check out more here
epsilon_delta (Mapping[str, float]) – epsilon and delta budget
max_multiplicity (Optional[float]) – maximum number of rows per privacy unit in absolute terms
max_multiplicity_share (Optional[float]) – maximum number of rows per privacy unit in relative terms w.r.t. the dataset size. The actual max_multiplicity used to bound the PU contribution will be minimum(max_multiplicity, max_multiplicity_share*dataset.size).
synthetic_data (Optional[Sequence[Tuple[Sequence[str],Sequence[str]]]]) – Sequence of pairs of original table path and its corresponding synthetic version. Each table must be specified. (e.g.: ([“retail_schema”, “features”], [“retail_schema”, “features_synthetic”])).
strategy (Optional[Strategy]) – Strategy to follow during privacy tracking. If not provided the Hard Strategy will be followed

Return type:

RelationWithDpEvent

rewrite_with_differential_privacy(dataset: Dataset, privacy_unit: Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str]] | Tuple[Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str]], bool] | Tuple[Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str, str]], bool], epsilon_delta: Dict[str, float], max_multiplicity: float | None = None, max_multiplicity_share: float | None = None, synthetic_data: Sequence[Tuple[Sequence[str], Sequence[str]]] | None = None) → RelationWithDpEvent[source]#

It transforms a Relation into its differentially private equivalent. Check out more here!

Parameters:

dataset (Dataset) – Dataset with needed relations
privacy_unit (PrivacyUnit) –
Definition of privacy unit to be protected. Check out more here
epsilon_delta (Mapping[str, float]) – epsilon and delta budget
max_multiplicity (Optional[float]) – maximum number of rows per privacy unit in absolute terms
max_multiplicity_share (Optional[float]) – maximum number of rows per privacy unit in relative terms w.r.t. the dataset size. The actual max_multiplicity used to bound the PU contribution will be minimum(max_multiplicity, max_multiplicity_share*dataset.size).
synthetic_data (Optional[Sequence[Tuple[Sequence[str],Sequence[str]]]]) – Sequence of pairs of original table path and its corresponding synthetic version. Each table must be specified. (e.g.: ([“retail_schema”, “features”], [“retail_schema”, “features_synthetic”])).

Return type:

RelationWithDpEvent

compose(relations: Iterable[Tuple[Iterable[str], Relation]]) → Relation[source]#

It composes itself with other relations. It substitute its Tables with the corresponding relation in relations: with the same path. Schemas in the relations to be composed should be compatible with the schema of the corresponding table otherwise an error is raised.

Parameters:: relations (t.Iterable[t.Tuple[t.Iterable[str], 'Relation']]) – An iterable of pairs of a path and a Relation Tables in self will be substituted by the relation with the matching path.
Return type:: Relation

rename_fields(fields: Iterable[Tuple[str, str]]) → Relation[source]#

It renames fields in the Relation

Parameters:: fields (t.Iterable[t.Tuple[str, str]]) – An iterable of pairs with field name to be renamed and the new name
Return type:: Relation

with_field(name: str, expr: str) → Relation[source]#

It creates a new Relation from self with a prepended new field

Parameters:

name (str) – The field name
expr (str) – An SQL expression following the Generic dialect

Return type:

Relation

schema() → str[source]#

Returns a string representation of the Relation’s schema.

Return type:: str

type() → str[source]#

Returns a protobuf compatible string representation of the Relation’s data type.

Return type:: str

dot() → str[source]#

GraphViz representation of the Relation

Return type:: str

class pyqrlew.DpEvent(*args, **kwargs)[source]#

Internal object containing a description of differentially private mechanisms such as Laplace and Gaussian, etc, and their composition. This is compatible with dp-accounting, part of the Google differential privacy library, containing tools for tracking differential privacy budgets.

to_dict() → Mapping[str, str | float | Sequence[Mapping[str, str | float | Sequence[DPEvent]]]][source]#

Returns a Dict representation of DP mechanisms.

Return type:: Mapping[str, str | float | Sequence[Mapping[str, str | float | Sequence[DPEvent]]]]

to_named_tuple() → NamedTuple[source]#

Returns NamedTuple of DP mechanisms compatible with dp-accounting

Return type:: NamedTuple

class pyqrlew.Dialect#

An Enum with supported SQL Datasets

BigQuery = Dialect.BigQuery#

Databricks = Dialect.Databricks#

Hive = Dialect.Hive#

MsSql = Dialect.MsSql#

MySql = Dialect.MySql#

PostgreSql = Dialect.PostgreSql#

RedshiftSql = Dialect.RedshiftSql#

class pyqrlew.Strategy#

An Enum for the privacy unit tracking propagation Soft will protect only when it does not affect the meaning of the original query and fail otherwise. Hard will protect at all cost. It will succeed most of the time.

Hard = Strategy.Hard#

Soft = Strategy.Soft#