[AE-782] Build dap collector job for incrementality experiments #387
@@ -0,0 +1,13 @@
.cache/
ci_job.yaml
ci_workflow.yaml
public_key_to_hpke_config.py
dev_run_docker.sh
dev_runbook.md
.DS_Store
example_config.json
*.pyc
.pytest_cache/
.python-version
__pycache__/
venv/
@@ -0,0 +1,2 @@
[flake8]
max-line-length = 120
@@ -0,0 +1,5 @@
.DS_Store
*.pyc
__pycache__/
venv/
.python-version
@@ -0,0 +1,38 @@
FROM python:3.12
LABEL maintainer="Glenda Leonard <[email protected]>"
ARG HOME="/janus_build"
WORKDIR ${HOME}

RUN apt update && apt --yes install curl

RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH=$HOME/.cargo/bin:$PATH

# build the CLI tool
RUN git clone --depth 1 https://github.com/divviup/janus.git --branch '0.7.69'
RUN cd janus && cargo build -r -p janus_tools --bin collect

######### next stage

FROM python:3.12
LABEL maintainer="Glenda Leonard <[email protected]>"
# https://github.com/mozilla-services/Dockerflow/blob/master/docs/building-container.md
ARG USER_ID="10001"
ARG GROUP_ID="app"
ARG HOME="/app"
WORKDIR ${HOME}

RUN groupadd --gid ${USER_ID} ${GROUP_ID} && \
    useradd --create-home --uid ${USER_ID} --gid ${GROUP_ID} --home-dir ${HOME} ${GROUP_ID}
##################### from other Dockerfile
COPY --from=0 /janus_build/janus/target/release/collect ./
###################

# Drop root and change ownership of the application folder to the user
RUN chown -R ${USER_ID}:${GROUP_ID} ${HOME}
USER ${USER_ID}
ADD ./requirements.txt .
RUN pip install --upgrade pip
RUN pip install -r requirements.txt

ADD . .
@@ -0,0 +1,104 @@
# Ads Incrementality DAP collector

## Background

Incrementality is a way to measure the effectiveness of our ads in a general, aggregated, privacy-preserving way --
without knowing anything about specific users.

Incrementality works by dividing clients into Nimbus experiment branches that vary how, or whether, an ad is shown.
Separately, a [DAP](https://docs.divviup.org/) task is configured to store the metrics for each experiment branch in a
different DAP bucket.

Firefox is instrumented with [DAP telemetry functionality](https://github.com/mozilla-firefox/firefox/tree/main/toolkit/components/telemetry/dap), which allows it to send metrics and reports into the correct DAP buckets as configured in the experiment.

This job can then collect metrics from DAP (using bucket info from the experiment's data) and write them
to BQ.

An example of an existing metric is
- "url visit counting", which increments counters in DAP when a Firefox client visits an ad landing page.

Great care is taken to preserve the privacy and anonymity of these metrics. The DAP technology itself aggregates counts
in separate systems and adds noise. The DAP telemetry feature will only submit a count to DAP once per week per client.
All DAP reports are deleted after 2 weeks.

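The privacy claim above rests on the fact that no single party ever sees an individual client's value. A toy sketch of additive secret sharing illustrates the core idea (this is a deliberate simplification, not the actual Prio3 VDAF machinery DAP uses in production):

```python
import secrets

MODULUS = 2**64  # toy field size; real VDAFs operate in a prime field


def split_report(value: int) -> tuple[int, int]:
    """Split a client's count into two shares, one per aggregator.

    Each share on its own is uniformly random, so neither aggregator
    learns anything about the individual value.
    """
    share_a = secrets.randbelow(MODULUS)
    share_b = (value - share_a) % MODULUS
    return share_a, share_b


# Three clients each report whether they visited the ad landing page.
reports = [split_report(v) for v in (1, 0, 1)]

# Each aggregator sums only its own shares.
sum_a = sum(a for a, _ in reports) % MODULUS
sum_b = sum(b for _, b in reports) % MODULUS

# Combining the two aggregate sums reveals only the total, never
# any individual client's report.
total = (sum_a + sum_b) % MODULUS
print(total)  # 2
```
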
## Overview

This job is driven by a config file in a GCS bucket. Point the job at the config file by passing the
`gcp_project` and `gcs_config_bucket` parameters. See `example_config.json` for how to structure this file.

The config file specifies the incrementality experiments that are currently running, some configuration and credentials for DAP,
and where in BQ to write the incrementality results.

The job reads data for each of the experiments from Nimbus, reads the experiment branch results from DAP,
then assembles result rows and writes the metrics to BQ.

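As a purely hypothetical illustration of what such a config might contain (every field name below is invented for this sketch; `example_config.json` in the repo is the authoritative reference), the pieces described above could be laid out like this:

```json
{
  "experiments": [
    {
      "slug": "example-incrementality-experiment",
      "advertiser": "example-advertiser",
      "metric": "url_visit_count"
    }
  ],
  "dap": {
    "leader": "https://dap-09-3.api.divviup.org",
    "hpke_config": "<base64-encoded HPKE config>"
  },
  "bigquery": {
    "dataset": "example_dataset",
    "table": "example_results_table"
  }
}
```
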
## Usage

This script is intended to be run in a Docker container.

It requires some environment variables that hold DAP credentials, and the job will look for those when it
starts up. A dev script, `dev_run_docker.sh`, is included for convenience to build and run the job locally, and it
also documents those variables.

There is also a `dev_runbook.md` doc that walks through setting up a DAP account, creating some DAP
tasks for testing, and setting up and managing the DAP credentials. The `public_key_to_hpke_config.py` utility helps
with encoding the DAP credentials for consumption by this job.

Once the environment variables are set up, run the job with:

```sh
./dev_run_docker.sh
```

To just build the Docker image, use:

```sh
docker build -t ads_incrementality_dap_collector .
```

To run outside of Docker, install dependencies with:

```sh
pip install -r requirements.txt
```

and run the script with:

```sh
python3 -m python_template_job.main
```

## Testing

Run tests with:

```sh
python3 -m pytest
```

## Linting and formatting

`flake8` and `black` are included for code linting and formatting. Run them through pytest with:

```sh
pytest --black --flake8
```

or invoke them directly:

```sh
flake8 .
```

```sh
black .
```

To preview formatting changes without applying them:

```sh
black --diff .
```
@@ -0,0 +1,72 @@
from datetime import datetime

from google.cloud import bigquery

DAP_LEADER = "https://dap-09-3.api.divviup.org"
VDAF = "histogram"
PROCESS_TIMEOUT = 1200  # 20 mins

CONFIG_FILE_NAME = "config.json"  # See example_config.json for the contents and structure of the job config file.
LOG_FILE_NAME = f"{datetime.now()}-ads-incrementality-dap-collector.log"

DEFAULT_BATCH_DURATION = 604800  # 7 days, in seconds

COLLECTOR_RESULTS_SCHEMA = [
    bigquery.SchemaField(
        "collection_start",
        "DATE",
        mode="REQUIRED",
        description="Start date of the collected time window, inclusive.",
    ),
    bigquery.SchemaField(
        "collection_end",
        "DATE",
        mode="REQUIRED",
        description="End date of the collected time window, inclusive.",
    ),
    bigquery.SchemaField(
        "country_codes",
        "JSON",
        mode="NULLABLE",
        description="List of 2-char country codes for the experiment",
    ),
    bigquery.SchemaField(
        "experiment_slug",
        "STRING",
        mode="REQUIRED",
        description="Slug indicating the experiment.",
    ),
    bigquery.SchemaField(
        "experiment_branch",
        "STRING",
        mode="REQUIRED",
        description="The experiment branch this data is associated with.",
    ),
    bigquery.SchemaField(
        "advertiser",
        "STRING",
        mode="REQUIRED",
        description="Advertiser associated with this experiment.",
    ),
    bigquery.SchemaField(
        "metric",
        "STRING",
        mode="REQUIRED",
        description="Metric collected for this experiment.",
    ),
    bigquery.SchemaField(
        name="value",
        field_type="RECORD",
        mode="REQUIRED",
        fields=[
            bigquery.SchemaField("count", "INT64", mode="NULLABLE"),
            bigquery.SchemaField("histogram", "JSON", mode="NULLABLE"),
        ],
    ),
    bigquery.SchemaField(
        "created_at",
        "TIMESTAMP",
        mode="REQUIRED",
        description="Timestamp for when this row was written.",
    ),
]
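
To illustrate the shape of a row this schema accepts (a hypothetical sketch; the slugs and values below are made up, and the job's actual row-building code lives elsewhere), a result row can be assembled as a plain dict before being streamed to BigQuery, e.g. via `Client.insert_rows_json`:

```python
import json
from datetime import datetime, timezone

# A result row matching COLLECTOR_RESULTS_SCHEMA, assembled as a plain dict.
# All values here are invented for illustration.
row = {
    "collection_start": "2024-06-03",
    "collection_end": "2024-06-09",
    "country_codes": json.dumps(["US", "CA"]),  # JSON columns take serialized strings
    "experiment_slug": "example-incrementality-experiment",
    "experiment_branch": "treatment-a",
    "advertiser": "example-advertiser",
    "metric": "url_visit_count",
    # "value" is a RECORD; only one of count/histogram is populated,
    # depending on the VDAF type of the DAP task.
    "value": {"count": None, "histogram": json.dumps([12, 34, 56])},
    "created_at": datetime.now(timezone.utc).isoformat(),
}

# e.g. bigquery.Client().insert_rows_json(table_ref, [row])
print(sorted(row["value"].keys()))  # ['count', 'histogram']
```
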