Pyqrlew core
- class pyqrlew.Dataset
A Dataset is a set of SQL Tables.
Examples
Creating a Dataset from an existing database with an sqlalchemy engine
>>> import pyqrlew as qrl
>>> from sqlalchemy import create_engine
>>> engine = create_engine("postgresql+psycopg2://****/mydatabase")
>>> dataset = qrl.Dataset.from_database(name='extract', engine=engine, schema_name='extract', ranges=False, possible_values_threshold=None)
Creating a Dataset from queries and a previous dataset. Here the WHERE clauses also determine the bounds on the age column.
>>> queries = [
...     (("ds_name", "new_schema", "tab1"), 'SELECT * FROM extract.census WHERE age < 18 AND age > 0'),
...     (("ds_name", "new_schema", "tab2"), 'SELECT * FROM extract.census WHERE age >= 18 AND age < 120'),
... ]
>>> new_dataset = dataset.from_queries(queries)
- Parameters:
dataset (_Dataset) –
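The queries list in the example above pairs a destination path with an SQL string. As a minimal sketch, assuming the same extract.census table and age column, such a list could be generated from age brackets. The helper age_bracket_queries is hypothetical and not part of pyqrlew.

```python
from typing import List, Sequence, Tuple

# A destination path: (dataset name, schema name, table name).
Path = Tuple[str, str, str]


def age_bracket_queries(
    brackets: Sequence[Tuple[str, int, int]],
    dataset_name: str = "ds_name",
    schema_name: str = "new_schema",
    source_table: str = "extract.census",
) -> List[Tuple[Path, str]]:
    """Build (path, SQL) pairs, one per (table_name, low, high) bracket.

    Each query keeps rows with low <= age < high, so the WHERE clause
    states explicit bounds on the age column.
    """
    queries = []
    for table_name, low, high in brackets:
        sql = f"SELECT * FROM {source_table} WHERE age >= {low} AND age < {high}"
        queries.append(((dataset_name, schema_name, table_name), sql))
    return queries


# Hypothetical brackets: minors and adults.
queries = age_bracket_queries([("tab1", 1, 18), ("tab2", 18, 120)])
```

The resulting list has the same shape as the one passed to dataset.from_queries(queries) in the example.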
- static from_str(dataset: str, schema: str, size: str) -> Dataset
Factory method to create a Dataset from string representations compatible with protocol buffers defined here.
- static from_database(name: str, engine: Engine, schema_name: str | None = None, ranges: bool = False, possible_values_threshold: int | None = None) -> Dataset
Builds a Dataset from a sqlalchemy Engine.
- Parameters:
name (str) – Name of the Dataset
engine (Engine) – The sqlalchemy Engine to use
schema_name (Optional[str], optional) – The DB schema to use. Defaults to None.
ranges (bool, optional) – Use the actual min and max of the data as ranges. This is unsafe from a privacy perspective. Defaults to False. If False, numeric values are assumed to range from -2.0^50 to 2.0^50.
possible_values_threshold (Optional[int], optional) – Use the actual observed values as range. This is unsafe from a privacy perspective. Defaults to None.
- Return type:
Dataset
- with_range(schema_name: str | None, table_name: str, field_name: str, min: float, max: float) -> Dataset
Returns a new Dataset with a defined range for a given numeric column. Check out more here!
- with_possible_values(schema_name: str | None, table_name: str, field_name: str, possible_values: Iterable[str]) -> Dataset
Returns a new Dataset with defined possible values for a given text column. Check out more here!
- with_constraint(schema_name: str | None, table_name: str, field_name: str, constraint: str | None) -> Dataset
Returns a new Dataset with a constraint on a given column. Check out more here!
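Because each of these methods returns a new Dataset, the calls can be chained. The following is a minimal sketch that relies only on the signatures listed above. The table and column names come from the census examples, while the bounds 0 to 120 and the values "F" and "M" are assumptions made for illustration. The helper declare_census_bounds is hypothetical.

```python
from typing import Any


def declare_census_bounds(dataset: Any) -> Any:
    """Declare bounds on extract.census without reading the data.

    `dataset` is expected to be a pyqrlew Dataset; it is typed as Any so
    that this sketch does not need pyqrlew to be importable.
    """
    # Arguments: schema_name, table_name, field_name, min, max
    bounded = dataset.with_range("extract", "census", "age", 0.0, 120.0)
    # Arguments: schema_name, table_name, field_name, possible_values
    # ASSUMPTION: "F" and "M" are the values of the sex column.
    return bounded.with_possible_values("extract", "census", "sex", ["F", "M"])
```

Declaring bounds this way is an alternative to ranges=True in from_database, which the parameter description marks as unsafe from a privacy perspective.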
- relations() -> Iterable[Tuple[List[str], Relation]]
Returns the Dataset’s Relations and their corresponding paths.
- relation(query: str, dialect: Dialect | None = None) -> Relation
Returns a Relation from an SQL query.
- class pyqrlew.Relation
A Relation is a Dataset transformed by a SQL query.
Example
Create a relation from a dataset
>>> from pyqrlew import Dialect
>>> query = "SELECT AVG(age), sex FROM extract.census GROUP BY sex"
>>> relation = dataset.relation(query, Dialect.PostgreSql)
Or alternatively
>>> from pyqrlew import Relation
>>> relation = Relation.from_query(query=query, dataset=dataset, dialect=Dialect.PostgreSql)
- Parameters:
relation (_Relation) –
- static from_query(query: str, dataset: Dataset, dialect: Dialect | None = None) -> Relation
Builds a Relation from a query and a dataset
- to_query(dialect: Dialect | None = None) -> str
Returns an SQL representation of the Relation.
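Since Dataset.relation and to_query each accept a dialect, a query can be parsed and then rendered back to SQL in two steps. This is a minimal sketch that assumes the dialect given to relation describes the input SQL and the one given to to_query selects the output SQL. The signatures suggest this reading, but the descriptions above do not state it. The helper render_query is hypothetical.

```python
from typing import Any, Optional


def render_query(
    dataset: Any,
    query: str,
    source_dialect: Optional[Any] = None,
    target_dialect: Optional[Any] = None,
) -> str:
    """Parse `query` against `dataset`, then render the Relation as SQL.

    `dataset` is expected to be a pyqrlew Dataset and the dialects to be
    pyqrlew Dialect values (or None); they are typed as Any so that this
    sketch does not need pyqrlew to be importable.
    """
    relation = dataset.relation(query, source_dialect)
    return relation.to_query(target_dialect)
```

With the objects from the examples, a call could look like render_query(dataset, query, Dialect.PostgreSql, Dialect.PostgreSql).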
- rewrite_as_privacy_unit_preserving(dataset: Dataset, privacy_unit: Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str]] | Tuple[Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str]], bool] | Tuple[Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str, str]], bool], epsilon_delta: Dict[str, float], max_multiplicity: float | None = None, max_multiplicity_share: float | None = None, synthetic_data: Sequence[Tuple[Sequence[str], Sequence[str]]] | None = None, strategy: Strategy | None = None) -> RelationWithDpEvent
Returns a RelationWithDpEvent whose relation propagates the privacy unit through the query. Check out more here!
- Parameters:
dataset (Dataset) – Dataset with needed relations
privacy_unit (PrivacyUnit) – Definition of privacy unit to be protected. Check out more here
epsilon_delta (Mapping[str, float]) – epsilon and delta budget
max_multiplicity (Optional[float]) – maximum number of rows per privacy unit in absolute terms
max_multiplicity_share (Optional[float]) – maximum number of rows per privacy unit in relative terms w.r.t. the dataset size. The actual max_multiplicity used to bound the PU contribution will be minimum(max_multiplicity, max_multiplicity_share*dataset.size).
synthetic_data (Optional[Sequence[Tuple[Sequence[str],Sequence[str]]]]) – Sequence of pairs of original table path and its corresponding synthetic version. Each table must be specified (e.g. (["retail_schema", "features"], ["retail_schema", "features_synthetic"])).
strategy (Optional[Strategy]) – Strategy to follow during privacy tracking. If not provided the Hard Strategy will be followed
- Return type:
RelationWithDpEvent
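The rule given for max_multiplicity and max_multiplicity_share can be written out as a small function. This is a sketch of the stated formula only, not the library's implementation. How a missing (None) value is handled is an assumption, because the description covers only the case where both values are given. The helper effective_max_multiplicity is hypothetical.

```python
from typing import Optional


def effective_max_multiplicity(
    max_multiplicity: Optional[float],
    max_multiplicity_share: Optional[float],
    dataset_size: int,
) -> Optional[float]:
    """Return min(max_multiplicity, max_multiplicity_share * dataset_size).

    ASSUMPTION: when one bound is None the other is used alone, and when
    both are None no bound is returned.
    """
    candidates = []
    if max_multiplicity is not None:
        candidates.append(max_multiplicity)
    if max_multiplicity_share is not None:
        candidates.append(max_multiplicity_share * dataset_size)
    return min(candidates) if candidates else None


# The absolute bound wins: min(10.0, 0.5 * 100) is 10.0
large_dataset_bound = effective_max_multiplicity(10.0, 0.5, 100)
# The relative bound wins: min(10.0, 0.5 * 8) is 4.0
small_dataset_bound = effective_max_multiplicity(10.0, 0.5, 8)
```

The relative bound matters for small datasets, where an absolute limit alone would let one privacy unit contribute a large share of the rows.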
- rewrite_with_differential_privacy(dataset: Dataset, privacy_unit: Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str]] | Tuple[Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str]], bool] | Tuple[Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str, str]], bool], epsilon_delta: Dict[str, float], max_multiplicity: float | None = None, max_multiplicity_share: float | None = None, synthetic_data: Sequence[Tuple[Sequence[str], Sequence[str]]] | None = None) -> RelationWithDpEvent
It transforms a Relation into its differentially private equivalent. Check out more here!
- Parameters:
dataset (Dataset) – Dataset with needed relations
privacy_unit (PrivacyUnit) – Definition of privacy unit to be protected. Check out more here
epsilon_delta (Mapping[str, float]) – epsilon and delta budget
max_multiplicity (Optional[float]) – maximum number of rows per privacy unit in absolute terms
max_multiplicity_share (Optional[float]) – maximum number of rows per privacy unit in relative terms w.r.t. the dataset size. The actual max_multiplicity used to bound the PU contribution will be minimum(max_multiplicity, max_multiplicity_share*dataset.size).
synthetic_data (Optional[Sequence[Tuple[Sequence[str],Sequence[str]]]]) – Sequence of pairs of original table path and its corresponding synthetic version. Each table must be specified (e.g. (["retail_schema", "features"], ["retail_schema", "features_synthetic"])).
- Return type:
RelationWithDpEvent
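The privacy_unit and epsilon_delta arguments are plain Python data, so they can be prepared and checked before any rewriting call. The sketch below follows the simplest form in the signature, Sequence[Tuple[str, Sequence[Tuple[str, str, str]], str]]. The meaning attached to each tuple position, the "epsilon" and "delta" keys, and all table and column names are assumptions made for illustration; none of them is stated in the text above.

```python
from typing import Dict, List, Sequence, Tuple

# ASSUMED meaning: (foreign key column, referenced table, referenced column)
ForeignKeyLink = Tuple[str, str, str]
# ASSUMED meaning: (table name, links leading to the table that holds the
# privacy unit, privacy unit column)
PrivacyUnitEntry = Tuple[str, Sequence[ForeignKeyLink], str]

# Hypothetical tables and columns, for illustration only.
privacy_unit: List[PrivacyUnitEntry] = [
    ("census", [], "person_id"),
    ("visits", [("person_id", "census", "person_id")], "person_id"),
]

# ASSUMPTION: the budget mapping uses the keys "epsilon" and "delta".
epsilon_delta: Dict[str, float] = {"epsilon": 1.0, "delta": 1e-3}


def has_documented_shape(entries: Sequence[PrivacyUnitEntry]) -> bool:
    """Check entries against the tuple shape given in the signature."""
    for entry in entries:
        if len(entry) != 3:
            return False
        table, links, column = entry
        if not (isinstance(table, str) and isinstance(column, str)):
            return False
        for link in links:
            if len(link) != 3 or not all(isinstance(part, str) for part in link):
                return False
    return True


shape_ok = has_documented_shape(privacy_unit)
```

Values built this way would then be passed as relation.rewrite_with_differential_privacy(dataset=dataset, privacy_unit=privacy_unit, epsilon_delta=epsilon_delta), following the signature above.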
- compose(relations: Iterable[Tuple[Iterable[str], Relation]]) -> Relation
Composes this Relation with other relations: each of its Tables is substituted with the relation in relations that has the same path. The schema of each relation to be composed should be compatible with the schema of the corresponding table; otherwise an error is raised.
- rename_fields(fields: Iterable[Tuple[str, str]]) -> Relation
It renames fields in the Relation
- with_field(name: str, expr: str) -> Relation
It creates a new Relation from self with a prepended new field
- class pyqrlew.DpEvent(*args, **kwargs)
Internal object describing differentially private mechanisms, such as Laplace and Gaussian, and their composition. It is compatible with dp-accounting, the part of the Google differential privacy library that provides tools for tracking differential privacy budgets.
- to_dict() -> Mapping[str, str | float | Sequence[Mapping[str, str | float | Sequence[DPEvent]]]]
Returns a Dict representation of DP mechanisms.
- to_named_tuple() -> NamedTuple
Returns NamedTuple of DP mechanisms compatible with dp-accounting
- Return type: