yeswanth120-gif
Collaborator

Metric 1: Number of automated edits
Metric 2: Number of deleted pages and edits
Metric 3: Number of edits deleted, reverted or rolled back

@yeswanth120-gif yeswanth120-gif self-assigned this Mar 21, 2025
Member

@kcvelaga kcvelaga left a comment

left comments

JOIN user_groups
ON actor.actor_user = user_groups.ug_user -- Join user_groups and actor tables
WHERE user_groups.ug_group = 'bot' -- Filter for bot user group
AND revision.rev_timestamp BETWEEN '20230101' AND '20240301'; -- Filter by specific date range
Member

Don't filter by date in the WHERE clause; with a hard-coded range you won't be able to filter by date on the dashboard.
Add the date to the SELECT statement instead. Also, count distinct revision IDs.
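
A minimal sketch of the query shape this comment asks for, run through Python's built-in `sqlite3` as a stand-in for the wiki replica database (an assumption, made so the example runs anywhere). Table and column names follow the snippet above; the join on `rev_actor` is assumed because that part of the query is not shown, the sample rows are invented, and SQLite's `substr` takes the place of MariaDB's `LEFT`.

```python
import sqlite3

# In-memory stand-in for the replica, with only the columns used here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE actor (actor_id INTEGER, actor_user INTEGER);
CREATE TABLE user_groups (ug_user INTEGER, ug_group TEXT);
CREATE TABLE revision (rev_id INTEGER, rev_actor INTEGER, rev_timestamp TEXT);
""")
# Invented sample data: actor 1 is a bot, actor 2 is not.
conn.executemany("INSERT INTO actor VALUES (?, ?)", [(1, 10), (2, 20)])
conn.executemany("INSERT INTO user_groups VALUES (?, ?)",
                 [(10, "bot"), (20, "sysop")])
conn.executemany("INSERT INTO revision VALUES (?, ?, ?)", [
    (100, 1, "20240101120000"),
    (101, 1, "20240101130000"),
    (102, 1, "20240102090000"),
    (103, 2, "20240102100000"),  # not a bot, so excluded
])

# The date is a selected column rather than a WHERE filter, so a dashboard
# can apply its own date range on top of the result.
BOT_EDITS_BY_DATE = """
SELECT substr(r.rev_timestamp, 1, 8) AS edit_date,
       COUNT(DISTINCT r.rev_id)      AS automated_edits
FROM revision r
JOIN actor a        ON r.rev_actor = a.actor_id
JOIN user_groups ug ON a.actor_user = ug.ug_user
WHERE ug.ug_group = 'bot'
GROUP BY edit_date
ORDER BY edit_date
"""
bot_edits_by_date = conn.execute(BOT_EDITS_BY_DATE).fetchall()
print(bot_edits_by_date)
```

The bot filter stays in WHERE because it defines the metric; only the date range moves out.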

SELECT
(SELECT COUNT(*)
FROM revision -- For edits we use revision table
WHERE LEFT(rev_timestamp, 8) BETWEEN '20240101' AND '20240301') AS total_edits, -- Filtered between specific dates
Member

We only need deleted pages, not revisions. This can be removed.


(SELECT COUNT(*)
FROM archive -- For deleted pages we use archive table
WHERE LEFT(ar_timestamp, 8) BETWEEN '20240101' AND '20240301') AS deleted_pages; -- Filtered between specific dates
Member

As above, don't filter by date in the WHERE clause. Add the date to the SELECT statement instead.
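
One possible reading of this for the deleted-pages metric, again sketched with Python's `sqlite3` as a stand-in database and invented rows. It assumes pages can be identified by `ar_page_id` in the `archive` table, which holds one row per archived revision. Note that `ar_timestamp` is the time of the archived revision, not the time of the deletion, so whether it is the right date for this metric is left open here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE archive (ar_id INTEGER, ar_page_id INTEGER, ar_timestamp TEXT)"
)
# Invented sample data: page 500 has two archived revisions on the same day.
conn.executemany("INSERT INTO archive VALUES (?, ?, ?)", [
    (1, 500, "20240105080000"),
    (2, 500, "20240105090000"),
    (3, 501, "20240105100000"),
    (4, 502, "20240106110000"),
])

# Counting distinct page IDs gives pages rather than revisions, and the
# date is selected and grouped on instead of being filtered in WHERE.
DELETED_PAGES_BY_DATE = """
SELECT substr(ar_timestamp, 1, 8)   AS ar_date,
       COUNT(DISTINCT ar_page_id)   AS deleted_pages
FROM archive
GROUP BY ar_date
ORDER BY ar_date
"""
deleted_pages_by_date = conn.execute(DELETED_PAGES_BY_DATE).fetchall()
print(deleted_pages_by_date)
```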

SELECT
(SELECT COUNT(*)
FROM archive -- For deleted edits we use archive table
WHERE LEFT(ar_timestamp, 8) BETWEEN '20230301' AND '20231212') AS deleted_edits, -- Filtering between specific dates
Member

Deleted edits are not required, only deleted pages, so this can be removed.

(SELECT COUNT(*)
FROM revision r -- For reverted or rollback edits we use comment table
JOIN comment c ON r.rev_comment_id = c.comment_id
WHERE LEFT(r.rev_timestamp, 8) BETWEEN '20230301' AND '20231212' -- Filtering between specific dates
Member

Remove the timestamp filter from the WHERE clause and add the timestamp to the SELECT statement as a date.

Comment on lines 14 to 16
AND (c.comment_text LIKE '%revert%'
OR c.comment_text LIKE '%rollback%'
OR c.comment_text LIKE '%undid%')) AS reverted_edits; -- Comparing strings, i.e. revert or rollback, with column comment_text
Member

This is not a good way to get reverted edits. You have to join the change_tag and change_tag_def tables and check whether the tags are mw-reverted, mw-rollback or mw-undo.
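
A minimal sketch of the tag-based approach, assuming the MediaWiki `change_tag` and `change_tag_def` tables (columns prefixed `ct_` and `ctd_`). It uses Python's `sqlite3` as a stand-in database with invented rows, and SQLite's `substr` in place of MariaDB's `LEFT`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE revision (rev_id INTEGER, rev_timestamp TEXT);
CREATE TABLE change_tag (ct_rev_id INTEGER, ct_tag_id INTEGER);
CREATE TABLE change_tag_def (ctd_id INTEGER, ctd_name TEXT);
""")
conn.executemany("INSERT INTO change_tag_def VALUES (?, ?)", [
    (1, "mw-reverted"), (2, "mw-rollback"), (3, "mw-undo"), (4, "mobile edit"),
])
conn.executemany("INSERT INTO revision VALUES (?, ?)", [
    (200, "20230301100000"),
    (201, "20230301110000"),
    (202, "20230302120000"),
    (203, "20230302130000"),  # untagged
])
conn.executemany("INSERT INTO change_tag VALUES (?, ?)", [
    (200, 1), (200, 3),  # two matching tags on one revision: counted once
    (201, 4),            # unrelated tag: not counted
    (202, 2),
])

# Match on tag names instead of edit-summary text; DISTINCT guards against
# a revision that carries more than one of the three tags.
REVERTED_EDITS_BY_DATE = """
SELECT substr(r.rev_timestamp, 1, 8) AS edit_date,
       COUNT(DISTINCT r.rev_id)      AS reverted_edits
FROM revision r
JOIN change_tag ct      ON ct.ct_rev_id = r.rev_id
JOIN change_tag_def ctd ON ctd.ctd_id = ct.ct_tag_id
WHERE ctd.ctd_name IN ('mw-reverted', 'mw-rollback', 'mw-undo')
GROUP BY edit_date
ORDER BY edit_date
"""
reverted_edits_by_date = conn.execute(REVERTED_EDITS_BY_DATE).fetchall()
print(reverted_edits_by_date)
```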

@yeswanth120-gif yeswanth120-gif requested a review from kcvelaga July 22, 2025 14:58
@yeswanth120-gif yeswanth120-gif changed the title from "Added Three Metrics" to "Updated Metrics and Notebook" Jul 23, 2025
@yeswanth120-gif yeswanth120-gif changed the title from "Updated Metrics and Notebook" to "Updated Metrics and Notebook Analysis" Jul 23, 2025
@@ -0,0 +1,464 @@
{
@Kalli-navya Kalli-navya Jul 26, 2025

This graph looks cluttered. For continuous data, a line plot is cleaner than a bar plot. With a line plot, this graph would have just three lines, each in its own color, and would show the rises and falls over time without the clutter, which makes the insights easier to read.
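
A rough sketch of what the suggested line plot could look like. The `(wiki, month, count)` rows, the helper names `build_series` and `plot_lines`, and the output file name are all hypothetical; the notebook's actual dataframes are not shown in this thread. Plotting assumes matplotlib is installed, and it is imported inside the function so that the data preparation runs without it.

```python
from collections import defaultdict

# Hypothetical (wiki, month, count) rows standing in for the query output.
rows = [
    ("tewiki", "2024-01", 120), ("tewiki", "2024-02", 95),
    ("hiwiki", "2024-01", 300), ("hiwiki", "2024-02", 340),
    ("mlwiki", "2024-01", 80), ("mlwiki", "2024-02", 110),
]


def build_series(rows):
    """Group rows into {wiki: ([months], [counts])}, sorted by month."""
    grouped = defaultdict(list)
    for wiki, month, count in rows:
        grouped[wiki].append((month, count))
    series = {}
    for wiki, points in grouped.items():
        points.sort()
        series[wiki] = ([m for m, _ in points], [c for _, c in points])
    return series


def plot_lines(series, path="edits_over_time.png"):
    """Draw one line per wiki and save the figure to `path`."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    for wiki, (months, counts) in series.items():
        ax.plot(months, counts, marker="o", label=wiki)
    ax.set_xlabel("Month")
    ax.set_ylabel("Count")
    ax.legend()
    fig.savefig(path)
    plt.close(fig)


series = build_series(rows)
```

Calling `plot_lines(series)` would then write a chart with one line per wiki.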



@@ -0,0 +1,464 @@
{
@Kalli-navya Kalli-navya Jul 26, 2025

Line #22.    if all_dfs:

Box plots are not ideal here. Histograms, bar plots (including count plots) and line plots are more effective when a count is the key metric. Histograms suit the distribution of numerical data, while bar plots, especially count plots, suit the frequency of categorical variables.



@yeswanth120-gif
Collaborator Author

Please review my updated visualizations notebook in ReviewNB.

@yeswanth120-gif yeswanth120-gif changed the title from "Updated Metrics and Notebook Analysis" to "Visualizations for the Queries Deleted pages , Automated Edits , Edits revert,rollback,undo in Wikimedia from Multiple wikipedia language sets (tewiki , hiwiki , mlwiki)" Jul 31, 2025