Commit c677c32

docs(blog): data quality (#2675)
1 parent c0f458f commit c677c32

File tree

1 file changed: +88 -0 lines changed

  • gcp/appengine/blog/content/posts/announcing-data-quality-initiatives

Lines changed: 88 additions & 0 deletions

@@ -0,0 +1,88 @@
---
title: "OSV's approach to data quality"
date: 2023-09-30T09:00:00Z
draft: false
author: Andrew Pollock and Charl de Nysschen
---
OSV's mission is to enable developers to reduce security risk arising from known
vulnerabilities in open source components they use.

Part of the strategy to accomplish that mission is to provide a comprehensive,
accurate and timely database of known vulnerabilities covering both language
ecosystems and OS package distributions.

Today, OSV.dev's coverage is fast approaching 30 ecosystems, while also
importing records from almost as many disparate "[home databases](https://ossf.github.io/osv-schema/#id-modified-fields)".
As this number of federated data sources continues to grow, so does the prospect
of OSV records being expressed in ways that undermine their effective use in
aggregate.

To ensure the accuracy and usability of OSV.dev's data at scale, we have
initiated a program of work to prevent future regression in data quality as the
ecosystem of data contributions continues to grow.
<!--more-->

In our
[experiences](https://www.first.org/conference/vulncon2024/program#pThe-Trials-and-Tribulations-of-Bulk-Converting-CVEs-to-OSV)
from [interacting with the CVE Program and broader
ecosystem](https://osv.dev/blog/posts/introducing-broad-c-c++-support/), we've
found that the term "data quality" means different things to different people.

For OSV.dev, the primary objective is to enable awareness and remediation of
known vulnerabilities in open source components. To this end, "data quality"
means being able to reason about and act upon vulnerability records at scale.
This is why the OSV format was designed with machine-readability as its primary
use case. To programmatically reason about OSV records at scale, a degree of
consistency in how fields are used, beyond what JSON Schema validation alone
can check, is necessary.

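To make this concrete, here is a minimal sketch of what an OSV record looks
like, expressed as a Python dictionary. The record is illustrative only (a
hypothetical advisory for a hypothetical PyPI package), and the comments point
out which aspects JSON Schema validation can check and which require context
beyond the record itself.

```python
import json

# A minimal, illustrative OSV record (hypothetical values, not a real advisory).
record = {
    "schema_version": "1.6.0",
    "id": "EXAMPLE-2023-0001",           # Presence and format are schema-checkable.
    "modified": "2023-09-30T09:00:00Z",  # Timestamp format is schema-checkable.
    "summary": "Example vulnerability in an example package.",
    "affected": [
        {
            # JSON Schema can require these fields to exist and be strings, but
            # it cannot tell whether the package actually exists in the named
            # ecosystem, or whether the listed versions were ever released.
            "package": {"ecosystem": "PyPI", "name": "example-package"},
            "ranges": [
                {
                    "type": "ECOSYSTEM",
                    "events": [{"introduced": "0"}, {"fixed": "1.2.3"}],
                }
            ],
        }
    ],
    "references": [{"type": "ADVISORY", "url": "https://example.com/advisory"}],
}

print(json.dumps(record, indent=2))
```

Schema validation confirms the shape of a record like this; the consistency we
care about at scale is mostly about the values themselves.
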
Problems that the OSV Data Quality Program seeks to address include:

- No way for record providers to know there are problems with records they have
  already published
- OSV.dev accepts records that are not schema-compliant
- OSV.dev accepts records with other validity issues, such as invalid package
  names or non-existent package versions (see the sketch after this list)
- No turnkey way for an OSV record provider to catch data quality problems
  earlier in the record publication lifecycle
- No best-practice tooling for a new OSV record provider to create OSV records
- [Downstream data consumers often mistake OSV.dev as the originator of the data
  and provide feedback about it to us, rather than to the record's
  originator](https://google.github.io/osv.dev/faq/#ive-found-something-wrong-with-the-data)
- Git repository owners may not follow best-practice release processes (such as
  not using tags, or using unusual tag naming conventions), confounding
  OSV.dev's ability to resolve fix commits to fix versions; this isn't known
  until the first time a vulnerability referencing the repository is published

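As a sketch of what catching the package and version validity issues mentioned
above could look like, the snippet below checks whether a package name and
version referenced by a record actually exist, using PyPI's JSON API. It
assumes the PyPI ecosystem purely for illustration; other ecosystems have their
own registries, and this is not how OSV.dev itself performs these checks.

```python
import urllib.error
import urllib.request


def _pypi_ok(url: str) -> bool:
    """Return True if PyPI answers the URL with a successful response."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:  # Unknown package or version.
            return False
        raise


def pypi_package_exists(name: str) -> bool:
    """Return True if the package name is known to PyPI."""
    return _pypi_ok(f"https://pypi.org/pypi/{name}/json")


def pypi_version_exists(name: str, version: str) -> bool:
    """Return True if the specific release exists on PyPI."""
    return _pypi_ok(f"https://pypi.org/pypi/{name}/{version}/json")


# Hypothetical values taken from a record's "affected" entry.
print(pypi_package_exists("requests"))
print(pypi_version_exists("requests", "2.31.0"))
print(pypi_package_exists("a-package-name-that-should-not-exist"))
```
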
We have published our current opinion on the [Properties of a High Quality OSV
Record](https://google.github.io/osv.dev/data_quality.html), which goes above
and beyond JSON Schema compliance, and are working on an open source [OSV record
linting tool](https://github.com/ossf/osv-schema/tree/main/tools/osv-linter) to
programmatically validate records against these properties.

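The sketch below is not the linter itself, just a hand-rolled illustration of
the kind of checks that go beyond schema compliance: for example, that a record
explains itself in `details`, and that affected entries actually pin down a
package and an introduced event. The authoritative set of checks is the linked
Properties document and the linter that implements it.

```python
def check_record_quality(record: dict) -> list[str]:
    """Return human-readable findings for a record (empty means none found).

    Illustrative checks only; the real properties are defined in OSV.dev's
    "Properties of a High Quality OSV Record" document.
    """
    findings = []

    if not record.get("details"):
        findings.append("record has no human-readable details")

    affected = record.get("affected", [])
    if not affected:
        findings.append("record does not identify any affected packages")

    for i, entry in enumerate(affected):
        ranges = entry.get("ranges", [])
        for r in ranges:
            if r.get("type") != "GIT" and "package" not in entry:
                findings.append(f"affected[{i}] has a non-GIT range but no package")
            if not any("introduced" in event for event in r.get("events", [])):
                findings.append(f"affected[{i}] has a range with no introduced event")
        if not ranges and not entry.get("versions"):
            findings.append(f"affected[{i}] lists neither ranges nor versions")

    return findings


# Example: a record that may be schema-shaped, but is of low quality.
for finding in check_record_quality({"id": "EXAMPLE-2023-0002", "affected": [{}]}):
    print(finding)
```
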
Thereafter, we will begin gating record imports to records that meet the quality
requirements.

So that the operators of the home databases that OSV.dev imports from can
reason about the acceptability of the records they publish, they will be able
to:

- run the OSV linter against their records as part of their publication
  workflow
- review OSV.dev's import findings about their records

You can follow our [progress on this journey on
GitHub](https://github.com/orgs/google/projects/62). Input and contributions
are, as always, appreciated.

If you're responsible for an existing home database that OSV.dev imports records
from, we will contact you directly before there are any changes to the record
import process that may impact you. You can also consider proactively running
our OSV record linter on your existing records to see how they rate.

If you'd like to experiment with or help expand the capabilities of the OSV
record linter, it's [currently residing in the OpenSSF OSV Schema GitHub
repository](https://github.com/ossf/osv-schema/tree/main/tools/osv-linter).

If you're an end consumer of OSV.dev's data, we hope this blog post encourages
you to continue to have confidence in the capabilities enabled by that data
into the future.
