This repo contains research and production tooling for using machine learning to automate document annotation on ToS;DR.
The process before Docbot:
- We crawl documents
- We wait for volunteers to submit Points: privacy policy quotations that provide evidence that a Case (a privacy-related statement) is true
- We wait for curators (trusted volunteers) to approve or reject Points
- We score the service A-F based on the approved Points
The process with Docbot is the same, except that Docbot now performs the initial document analysis and submits Points, along with confidence scores, to curators. We achieve this by fine-tuning large language models into binary classifiers, one per Case.
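For intuition, here is a minimal sketch of the general approach: fine-tuning a pretrained transformer as a binary classifier with Hugging Face `transformers`. The model name and the toy sentences/labels are illustrative assumptions, not the repo's actual configuration (see `train.py` for that).

```python
# Illustrative sketch only -- see train.py for the real training setup.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "distilbert-base-uncased"  # assumption: any HF encoder works here
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

# Toy data: does a sentence support one specific Case? (1 = yes, 0 = no)
sentences = ["We may share your data with third parties.",
             "You can contact support by email."]
labels = [1, 0]
enc = tokenizer(sentences, truncation=True, padding=True, return_tensors="pt")

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

Trainer(model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=1),
        train_dataset=ToyDataset()).train()
```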
Each week, an automated job runs all 123 case models, analyzing any new documents added to ToS;DR's database since the last run.
If you would like to have privacy policies or T&Cs analyzed with Docbot, the best way is to add them to the ToS;DR platform at edit.tosdr.org and wait for our automated system to pick them up.
If you have a particular need to run the models on standalone documents, or have any other questions, please get in touch with us at [email protected].
We welcome engineering and research contributions that improve our models. We also plan to release the datasets used for training and evaluation, which could be of value to NLP researchers.
In a pip or conda environment, run:

```
pip install -r requirements.txt
python -m spacy download en_core_web_md
```
We create training corpora from database dumps. The first step is to convert from SQL to pandas:

- Start PostgreSQL (on a Mac: `brew services restart postgresql`)
- Run `createdb tosdr`
- In the interactive REPL (`psql -d tosdr`), run `create user phoenix with superuser; ALTER ROLE "phoenix" WITH LOGIN;`
- Comment out any foreign key setups, found at the bottom of the SQL files
- Log into psql with `psql -d tosdr -U phoenix`, then run `\i services.sql`, `\i topics.sql`, `\i documents.sql`, `\i cases.sql`, `\i points.sql`
- Confirm all 5 tables exist with `\dt public.*` (you might have to log out and back in). Tables can be inspected with e.g. `\d+ public.topics`
- Run `sql_to_pandas.py` to load the tables and save them as pickled pandas DataFrames in `data`
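The resulting pickles can then be loaded back as ordinary DataFrames. A minimal sketch, assuming filenames that mirror the table names (check the `data` directory for the actual names):

```python
import pandas as pd

# Assumed filenames -- inspect data/ for what sql_to_pandas.py really writes.
documents = pd.read_pickle("data/documents.pkl")
points = pd.read_pickle("data/points.pkl")

print(documents.shape, points.shape)
print(points.columns.tolist())
```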
Run `explore.ipynb` on the output of `sql_to_pandas.py` to clean the data, then see `make_classification_datasets.py` to turn the cleaned data into a classification dataset suitable for training or evaluation.
Run `train.py --help` to see options. `Dockerfile.train` is available for convenience; `train.py` args can be appended to the end of `docker run`. CUDA will be used if available.

There are two modes: by default, all case models are trained serially on a single host; with `--parallel_key`, training is parallelized across several containers using AWS SQS (a sketch of the pattern follows).
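The sketch below illustrates the general work-queue pattern that `--parallel_key` suggests, where each container claims case IDs from a shared queue so no two workers train the same model. The queue name, message format, and `train_case_model` helper are all hypothetical, not the repo's actual protocol.

```python
import boto3

def train_case_model(case_id: str) -> None:
    """Hypothetical stand-in for invoking train.py on one case model."""
    print(f"training case {case_id}")

# Assumed queue name; requires AWS credentials with SQS permissions.
sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="docbot-train-jobs")["QueueUrl"]

while True:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
    messages = resp.get("Messages", [])
    if not messages:
        break  # queue drained: all case models are claimed
    msg = messages[0]
    case_id = msg["Body"]  # assumption: one case ID per message
    train_case_model(case_id)
    # Delete only after training succeeds, so a crashed worker's job
    # becomes visible to other containers again.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```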
`explore.ipynb` performs exploratory data analysis and data cleaning, and saves cleaned versions of the data as `data/{DATASET_VERSION}_clean.pkl`.
Data that was removed:
- Services and Documents marked as `deleted`, and associated Points
- Services that lack any Documents, and associated Points
- Points marked as `disputed`, `changes-requested`, `pending`, or `draft` (only ~60)
- Documents without text (~1.5k), and associated Points
- A handful of Points that have a `quoteStart` but no `quoteText`
- Points whose `quoteText` no longer matches `document.text[point.quoteStart:point.quoteEnd]`, likely because re-crawled text changed. About 2k, down from 3k (1k were saved by re-searching for the quote and updating `quoteStart`/`quoteEnd`; see the sketch after this list)
- Service 502, Case 235, Topic 53, Document 1378, and all associated Points
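The re-search that rescued those ~1k Points can be done with a plain substring search. A minimal sketch, assuming the column names described above (the repo's actual cleaning logic lives in `explore.ipynb`):

```python
def realign_quote(doc_text: str, quote: str):
    """Sketch: relocate a quote whose stored offsets no longer match,
    e.g. after a document was re-crawled. Returns (start, end) or None."""
    start = doc_text.find(quote)
    if start == -1:
        return None  # quote is truly gone; the Point gets dropped
    return start, start + len(quote)

# Usage on one Point row (attribute names as described above):
# offsets = realign_quote(document.text, point.quoteText)
```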
Data that was kept for now in case it's useful:
- Points that don't refer to Documents
- Documents without any Points (there are many)
- Non-English Documents and Points (these can be filtered with `documents = documents[documents.lang == 'en']`)
The highlights of the EDA from `explore.ipynb`, such as graphs and dataset sizes.
An early notebook used to inspect Points for brainstorming, and to help decide whether sentence classification or sentence spans is the right paradigm. It finds that many Points span multiple sentences, which favors spans, though Points longer than 5 sentences are rare.
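As a rough illustration of that sentence-count analysis, spaCy (the `en_core_web_md` model installed during setup) can count how many sentences a Point's quote spans; the quote below is made up.

```python
import spacy

nlp = spacy.load("en_core_web_md")  # installed during setup

quote = ("We collect your location data. We may share it with partners. "
         "You can opt out at any time.")
n_sentences = sum(1 for _ in nlp(quote).sents)
print(n_sentences)  # 3 -- most Points fall well under 5 sentences
```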
Tests whether positive predictions from our inference strategy of sentence expansion (`inference.apply_sent_span_model()`) yield spans of roughly the same length as human-submitted Points.
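The following is a minimal sketch of what sentence expansion could look like, assuming a per-sentence binary classifier whose positive predictions on consecutive sentences are merged into one span. This illustrates the general idea only; see `inference.apply_sent_span_model()` for the actual strategy.

```python
from typing import Callable, List, Tuple

def expand_positive_sents(sents: List[str],
                          is_positive: Callable[[str], bool]) -> List[Tuple[int, int]]:
    """Sketch: merge runs of consecutive positive sentences into spans
    (start, end) over sentence indices, end exclusive."""
    spans, start = [], None
    for i, sent in enumerate(sents):
        if is_positive(sent):
            if start is None:
                start = i
        elif start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(sents)))
    return spans

# Toy usage with a fake classifier:
print(expand_positive_sents(["a", "b", "c", "d"],
                            lambda s: s in {"b", "c"}))  # [(1, 3)]
```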
Plots ROC curves and precision-recall curves, and helps find optimal thresholds.
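For reference, a standard way to pick thresholds from model scores with scikit-learn; the scores and labels below are toy values standing in for one case model's validation output, and the selection criteria (Youden's J, best F1) are common choices rather than necessarily the repo's.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

# Toy validation scores/labels.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Threshold maximizing Youden's J statistic (tpr - fpr) on the ROC curve.
fpr, tpr, roc_thresh = roc_curve(y_true, y_score)
best_roc = roc_thresh[np.argmax(tpr - fpr)]

# Threshold maximizing F1 on the precision-recall curve.
prec, rec, pr_thresh = precision_recall_curve(y_true, y_score)
f1 = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-9, None)
best_f1 = pr_thresh[np.argmax(f1)]

print(best_roc, best_f1)
```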
