Commit 172962b

Merge pull request #65 from sfu-discourse-lab/V7.0: Update code base to V7.0

2 parents d822b0c + 1396702

File tree

89 files changed

+3415
-1835
lines changed


LICENSE

Lines changed: 21 additions & 674 deletions
(Large diff not rendered.)

README.md

Lines changed: 5 additions & 4 deletions
```diff
@@ -1,4 +1,4 @@
-__Status: V6.1__ (Code provided as-is; only sporadic updates expected).
+__Status: V7.0__ (Code provided as-is; only sporadic updates expected).
 
 # The Gender Gap Tracker
 
@@ -18,8 +18,9 @@ See [CONTRIBUTORS.md](CONTRIBUTORS.md)
 * `scraper`: Modules for scraping English and French news articles from various Canadian news organizations' websites and RSS feeds.
 * `nlp`: NLP modules for performing quote extraction and entity gender annotation on both English and French news articles.
-* `statistics`: Example scripts for running batch queries on our MongoDB database to retrieve source/gender statistics.
-* `dashboard_for_research`: [Research dashboard and apps](https://gendergaptracker.research.sfu.ca/) that allow us to explore the GGT data in more detail.
+* `api`: FastAPI code base exposing endpoints that serve our daily statistics to the public-facing dashboards: the [Gender Gap Tracker](https://gendergaptracker.informedopinions.org) and the [Radar de Parité](https://radardeparite.femmesexpertes.ca).
+* `research_dashboard`: [A multi-page, extensible dashboard](https://gendergaptracker.research.sfu.ca/) built in Plotly Dash that allows us to explore the GGT data in more detail.
+* `statistics`: Scripts for running batch queries on our MongoDB database to retrieve source/gender statistics.
 
 ## Data
 
@@ -31,7 +32,7 @@ In future versions of the software, we are planning to visualize more fine-grain
 
 From a research perspective, questions of salience and space arise, i.e., whether quotes by men are presented more prominently in an article, and whether men are given more space on average (perhaps counted in number of words). More nuanced questions that involve language analysis include whether the quotes are presented differently in terms of endorsement or distance from the content of the quote (*stated* vs. *claimed*). Analyses of transitivity structure in clauses can yield further insights about the types of roles women are portrayed in, complementing some of our studies' findings via dependency analyses.
 
-We are mindful of and acknowledge the relative lack of work in NLP, topic modelling and gender equality for corpora in languages other than English. Our hope is that we are at least playing a small role here, through our analyses of Canadian French-language news whose code we share in this repo. We believe that such work will yield not only interesting methodological insights (for example, the relative benefits of stemming vs. lemmatization on topic keyword interpretability for non-English corpora), but also reveal whether the same gender disparities we observed in our English corpus are present in French. While we are actively pursuing such additional areas of inquiry, we also invite other researchers to join in this effort!
+We are mindful of and acknowledge the relative lack of work in NLP, topic modelling and gender equality for corpora in languages other than English. Our hope is that we are at least playing a small role here, through our analyses of Canadian French-language news whose code we share in this repo. We believe that such work will yield not only interesting methodological insights, but also reveal whether the same gender disparities we observed in our English corpus are present in French. While we are actively pursuing such additional areas of inquiry, we also invite other researchers to join in this effort!
 
 
 ## Contact
```

api/README.md

Lines changed: 35 additions & 0 deletions
```diff
@@ -0,0 +1,35 @@
+# APIs for public-facing dashboards
+
+This section hosts the code for the backend APIs that serve the public-facing dashboards of our partner organization, Informed Opinions.
+
+We have two APIs, one each serving the English and French dashboards (the Gender Gap Tracker and the Radar de Parité, respectively).
+
+## Dashboards
+* English: https://gendergaptracker.informedopinions.org
+* French: https://radardeparite.femmesexpertes.ca
+
+### Front end code
+
+For a clearer separation of roles and responsibilities, the front end code base is hosted elsewhere, in private repos. Access to these repos is restricted, so please reach out to [email protected] if you require access to the code.
+
+## Setup
+
+Both APIs are written using [FastAPI](https://fastapi.tiangolo.com/), a high-performance web framework for building APIs in Python.
+
+This code base has been tested on Python 3.9, but higher Python versions should work with few, if any, problems.
+
+Install the required dependencies via `requirements.txt` as follows.
+
+Create a new virtual environment if one does not already exist, activate it, and install the dependencies into it:
+```sh
+$ python3.9 -m venv api_venv
+$ source api_venv/bin/activate
+$ python -m pip install -r requirements.txt
+```
+
+For subsequent use, activate the existing virtual environment:
+
+```sh
+$ source api_venv/bin/activate
+```
```

api/english/README.md

Lines changed: 35 additions & 0 deletions
```diff
@@ -0,0 +1,35 @@
+# Gender Gap Tracker: API
+
+This section contains the code for the API that serves the [Gender Gap Tracker public dashboard](https://gendergaptracker.informedopinions.org/). The dashboard itself is hosted externally, and its front end code lives in this [GitLab repo](https://gitlab.com/client-transfer-group/gender-gap-tracker).
+
+## API docs
+
+The docs can be accessed in one of two ways:
+
+* Swagger: https://gendergaptracker.informedopinions.org/docs
+  * Useful for testing out the API interactively in the browser
+* Redoc: https://gendergaptracker.informedopinions.org/redoc
+  * Clean, modern UI for viewing the API structure in a responsive format
+
+## Extensibility
+
+The code base has been written so that future developers can add endpoints for other functionality that could potentially serve other dashboards.
+
+* `db`: MongoDB-specific code (config and queries) that helps us interact with the GGT data on our MongoDB database
+* `endpoints`: Add new functionality here to process and serve results via RESTful API endpoints
+* `schemas`: Response data validation so that the JSON results from each endpoint are formatted properly in the docs
+* `utils`: Utility functions that support data manipulation within the routers
+* `gunicorn_conf.py`: Deployment-specific instructions for the web server, explained below
+
+## Deployment
+
+We perform a standard deployment of FastAPI in production, as per the best practices [shown in this blog post](https://www.vultr.com/docs/how-to-deploy-fastapi-applications-with-gunicorn-and-nginx-on-ubuntu-20-04/).
+
+* `uvicorn` is used as an async web server (compatible with the `gunicorn` web server for production apps)
+* `gunicorn` works as a process manager that starts multiple `uvicorn` processes via the `uvicorn.workers.UvicornWorker` class
+* `nginx` is used as a reverse proxy
+
+The deployment and maintenance of the web server is carried out by SFU's Research Computing Group (RCG).
```
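The nginx reverse-proxy piece of the deployment described above might look like the fragment below. This is an illustrative sketch, not the actual RCG-managed configuration; the socket path is taken from the repo's `gunicorn_conf.py`, while the `listen`/TLS details are assumptions.

```nginx
server {
    listen 443 ssl;
    server_name gendergaptracker.informedopinions.org;

    location / {
        # Forward requests to the gunicorn unix socket defined in gunicorn_conf.py
        proxy_pass http://unix:/path_to_code/GenderGapTracker/api/english/g-tracker.sock;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```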

api/english/db/config.py

Lines changed: 18 additions & 0 deletions
```diff
@@ -0,0 +1,18 @@
+# Hosts for the production replica set; switch to "localhost" for local testing
+host = ["mongo0", "mongo1", "mongo2"]
+# host = "localhost"
+is_direct_connection = host == "localhost"
+
+config = {
+    "MONGO_HOST": host,
+    "MONGO_PORT": 27017,
+    "MONGO_ARGS": {
+        "authSource": "admin",
+        "readPreference": "primaryPreferred",
+        "username": "username",  # placeholder
+        "password": "password",  # placeholder
+        "directConnection": is_direct_connection,
+    },
+    "DB_NAME": "mediaTracker",
+    "LOGS_DIR": "logs/",
+}
```
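A config dict like the one above would typically be flattened into keyword arguments for `pymongo.MongoClient`. The helper below is a hypothetical illustration of that pattern (pymongo itself is not imported, so the sketch stops at building the kwargs):

```python
# Hypothetical helper showing how a config dict in this shape might be
# consumed; `make_client_kwargs` is an illustration, not part of the repo.
config = {
    "MONGO_HOST": ["mongo0", "mongo1", "mongo2"],
    "MONGO_PORT": 27017,
    "MONGO_ARGS": {
        "authSource": "admin",
        "readPreference": "primaryPreferred",
        "username": "username",
        "password": "password",
        "directConnection": False,
    },
    "DB_NAME": "mediaTracker",
}


def make_client_kwargs(config: dict) -> dict:
    """Flatten the config into keyword arguments for pymongo.MongoClient."""
    return {
        "host": config["MONGO_HOST"],
        "port": config["MONGO_PORT"],
        **config["MONGO_ARGS"],
    }


kwargs = make_client_kwargs(config)
# A client would then be created as MongoClient(**kwargs) and the database
# accessed as client[config["DB_NAME"]]
print(kwargs["host"])  # ['mongo0', 'mongo1', 'mongo2']
```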

api/english/db/mongoqueries.py

Lines changed: 33 additions & 0 deletions
```diff
@@ -0,0 +1,33 @@
+def agg_total_per_outlet(begin_date: str, end_date: str):
+    """Pipeline that sums article and gender counts per outlet in a date range."""
+    query = [
+        {"$match": {"publishedAt": {"$gte": begin_date, "$lte": end_date}}},
+        {
+            "$group": {
+                "_id": "$outlet",
+                "totalArticles": {"$sum": "$totalArticles"},
+                "totalFemales": {"$sum": "$totalFemales"},
+                "totalMales": {"$sum": "$totalMales"},
+                "totalUnknowns": {"$sum": "$totalUnknowns"},
+            }
+        },
+    ]
+    return query
+
+
+def agg_total_by_week(begin_date: str, end_date: str):
+    """Pipeline that sums gender counts per outlet, per (week, year) bucket."""
+    query = [
+        {"$match": {"publishedAt": {"$gte": begin_date, "$lte": end_date}}},
+        {
+            "$group": {
+                "_id": {
+                    "outlet": "$outlet",
+                    "week": {"$week": "$publishedAt"},
+                    "year": {"$year": "$publishedAt"},
+                },
+                "totalFemales": {"$sum": "$totalFemales"},
+                "totalMales": {"$sum": "$totalMales"},
+                "totalUnknowns": {"$sum": "$totalUnknowns"},
+            }
+        },
+    ]
+    return query
```
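To make the shape of these aggregations concrete, the snippet below reproduces `agg_total_per_outlet` and inspects the pipeline it builds; with pymongo, the returned list would be passed straight to `collection.aggregate(...)`:

```python
from datetime import datetime


# Reproduction of agg_total_per_outlet from above, for illustration only
def agg_total_per_outlet(begin_date, end_date):
    return [
        {"$match": {"publishedAt": {"$gte": begin_date, "$lte": end_date}}},
        {
            "$group": {
                "_id": "$outlet",
                "totalArticles": {"$sum": "$totalArticles"},
                "totalFemales": {"$sum": "$totalFemales"},
                "totalMales": {"$sum": "$totalMales"},
                "totalUnknowns": {"$sum": "$totalUnknowns"},
            }
        },
    ]


pipeline = agg_total_per_outlet(datetime(2021, 1, 1), datetime(2021, 12, 31))
# Two stages: a $match on the date range, then a $group keyed on outlet
print([list(stage)[0] for stage in pipeline])  # ['$match', '$group']
```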
Lines changed: 120 additions & 0 deletions
```diff
@@ -0,0 +1,120 @@
+import pandas as pd
+import utils.dateutils as dateutils
+from db.mongoqueries import agg_total_by_week, agg_total_per_outlet
+from fastapi import APIRouter, HTTPException, Request, Query
+from schemas.stats_by_date import TotalStatsByDate
+from schemas.stats_weekly import TotalStatsByWeek
+
+outlet_router = APIRouter()
+COLLECTION_NAME = "mediaDaily"
+LOWER_BOUND_START_DATE = "2018-10-01"
+ID_MAPPING = {"Huffington Post": "HuffPost Canada"}
+
+
+@outlet_router.get(
+    "/info_by_date",
+    response_model=TotalStatsByDate,
+    response_description="Get total and per-outlet gender statistics for English outlets between two dates",
+)
+def expertwomen_info_by_date(
+    request: Request,
+    begin: str = Query(description="Start date in yyyy-mm-dd format"),
+    end: str = Query(description="End date in yyyy-mm-dd format"),
+) -> TotalStatsByDate:
+    if not dateutils.is_valid_date_range(begin, end, LOWER_BOUND_START_DATE):
+        raise HTTPException(
+            status_code=416,
+            detail=f"Date range error: Should be between {LOWER_BOUND_START_DATE} and tomorrow's date",
+        )
+    begin = dateutils.convert_date(begin)
+    end = dateutils.convert_date(end)
+
+    query = agg_total_per_outlet(begin, end)
+    response = request.app.connection[COLLECTION_NAME].aggregate(query)
+    # Work with the data in pandas
+    source_stats = list(response)
+    df = pd.DataFrame.from_dict(source_stats)
+    df["totalGenders"] = df["totalFemales"] + df["totalMales"] + df["totalUnknowns"]
+    # Replace outlet names if necessary
+    df["_id"] = df["_id"].replace(ID_MAPPING)
+    # Take sums of total males, females, unknowns and articles and convert to dict
+    result = df.drop("_id", axis=1).sum().to_dict()
+    # Compute per-outlet stats
+    df["perFemales"] = df["totalFemales"] / df["totalGenders"]
+    df["perMales"] = df["totalMales"] / df["totalGenders"]
+    df["perUnknowns"] = df["totalUnknowns"] / df["totalGenders"]
+    df["perArticles"] = df["totalArticles"] / result["totalArticles"]
+    # Convert dataframe to dict prior to JSON serialization
+    result["sources"] = df.to_dict("records")
+    result["perFemales"] = result["totalFemales"] / result["totalGenders"]
+    result["perMales"] = result["totalMales"] / result["totalGenders"]
+    result["perUnknowns"] = result["totalUnknowns"] / result["totalGenders"]
+    return result
+
+
+@outlet_router.get(
+    "/weekly_info",
+    response_model=TotalStatsByWeek,
+    response_description="Get gender statistics per English outlet aggregated WEEKLY between two dates",
+)
+def expertwomen_weekly_info(
+    request: Request,
+    begin: str = Query(description="Start date in yyyy-mm-dd format"),
+    end: str = Query(description="End date in yyyy-mm-dd format"),
+) -> TotalStatsByWeek:
+    if not dateutils.is_valid_date_range(begin, end, LOWER_BOUND_START_DATE):
+        raise HTTPException(
+            status_code=416,
+            detail=f"Date range error: Should be between {LOWER_BOUND_START_DATE} and tomorrow's date",
+        )
+    begin = dateutils.convert_date(begin)
+    end = dateutils.convert_date(end)
+
+    query = agg_total_by_week(begin, end)
+    response = request.app.connection[COLLECTION_NAME].aggregate(query)
+    # Work with the data in pandas
+    df = (
+        pd.json_normalize(list(response), max_level=1)
+        .sort_values(by="_id.outlet")
+        .reset_index(drop=True)
+    )
+    df.rename(
+        columns={
+            "_id.outlet": "outlet",
+            "_id.week": "week",
+            "_id.year": "year",
+        },
+        inplace=True,
+    )
+    # Replace outlet names if necessary
+    df["outlet"] = df["outlet"].replace(ID_MAPPING)
+    # Compute week begin/end dates as datetimes for summing by week
+    df["w_begin"] = df.apply(lambda row: dateutils.get_week_bound(row["year"], row["week"], 0), axis=1)
+    df["w_end"] = df.apply(lambda row: dateutils.get_week_bound(row["year"], row["week"], 6), axis=1)
+    df["w_begin"] = pd.to_datetime(df["w_begin"])
+    df["w_end"] = pd.to_datetime(df["w_end"])
+    df = df.drop(columns=["week", "year"]).sort_values(by=["outlet", "w_begin"])
+    # In earlier versions, a bug caused weekly information for the same week begin
+    # date to be returned twice whenever the last week of one year spanned into the
+    # next (a partial week across a year boundary). To address this, we sum stats
+    # by week so that duplicate week begin dates are not passed to the front end.
+    df = df.groupby(["outlet", "w_begin", "w_end"]).sum().reset_index()
+    df["totalGenders"] = df["totalFemales"] + df["totalMales"] + df["totalUnknowns"]
+    df["perFemales"] = df["totalFemales"] / df["totalGenders"]
+    df["perMales"] = df["totalMales"] / df["totalGenders"]
+    df["perUnknowns"] = df["totalUnknowns"] / df["totalGenders"]
+    # Convert datetimes back to strings for JSON serialization
+    df["w_begin"] = df["w_begin"].dt.strftime("%Y-%m-%d")
+    df["w_end"] = df["w_end"].dt.strftime("%Y-%m-%d")
+    df = df.drop(columns=["totalGenders", "totalFemales", "totalMales", "totalUnknowns"])
+    # Convert dataframe to a per-outlet dict prior to JSON serialization
+    weekly_data = dict()
+    for outlet in df["outlet"].unique():
+        per_outlet_data = df[df["outlet"] == outlet].to_dict(orient="records")
+        # Remove the redundant outlet key from each weekly record
+        [item.pop("outlet") for item in per_outlet_data]
+        weekly_data[outlet] = per_outlet_data
+    output = {"outlets": weekly_data}
+    return output
```

api/english/gunicorn_conf.py

Lines changed: 14 additions & 0 deletions
```diff
@@ -0,0 +1,14 @@
+# gunicorn_conf.py: point gunicorn at the uvicorn workers
+from multiprocessing import cpu_count
+
+# Socket path
+bind = 'unix:/path_to_code/GenderGapTracker/api/english/g-tracker.sock'
+
+# Worker options
+workers = cpu_count() + 1
+worker_class = 'uvicorn.workers.UvicornWorker'
+
+# Logging options
+loglevel = 'debug'
+accesslog = '/path_to_code/GenderGapTracker/api/english/access_log'
+errorlog = '/path_to_code/GenderGapTracker/api/english/error_log'
```

api/english/logging.conf

Lines changed: 48 additions & 0 deletions
```diff
@@ -0,0 +1,48 @@
+[loggers]
+keys=root, gunicorn.error, gunicorn.access
+
+[handlers]
+keys=console, error_file, access_file
+
+[formatters]
+keys=generic, access
+
+[logger_root]
+level=INFO
+handlers=console
+
+[logger_gunicorn.error]
+level=INFO
+handlers=error_file
+propagate=1
+qualname=gunicorn.error
+
+[logger_gunicorn.access]
+level=INFO
+handlers=access_file
+propagate=0
+qualname=gunicorn.access
+
+[handler_console]
+class=StreamHandler
+formatter=generic
+args=(sys.stdout, )
+
+[handler_error_file]
+class=logging.FileHandler
+formatter=generic
+args=('/var/log/gunicorn/error.log',)
+
+[handler_access_file]
+class=logging.FileHandler
+formatter=access
+args=('/var/log/gunicorn/access.log',)
+
+[formatter_generic]
+format=%(asctime)s [%(process)d] [%(levelname)s] %(message)s
+datefmt=%Y-%m-%d %H:%M:%S
+class=logging.Formatter
+
+[formatter_access]
+format=%(message)s
+class=logging.Formatter
```
