---
title: "OSV's approach to data quality"
date: 2023-09-30T09:00:00Z
draft: false
author: Andrew Pollock and Charl de Nysschen
---
OSV's mission is to enable developers to reduce security risk arising from known
vulnerabilities in open source components they use.

Part of the strategy to accomplish that mission is to provide a comprehensive,
accurate and timely database of known vulnerabilities covering both language
ecosystems and OS package distributions.

Today, OSV.dev's coverage is fast approaching 30 ecosystems, while also
importing records from almost as many disparate "[home databases](https://ossf.github.io/osv-schema/#id-modified-fields)".
As this number of federated data sources continues to grow, so does the risk of
OSV records being expressed in ways that undermine their effective use in
aggregate.

To ensure the accuracy and usability of OSV.dev's data at scale, we have
initiated a program of work to prevent future regressions in data quality as the
ecosystem of data contributions continues to grow.
<!--more-->

In our
[experiences](https://www.first.org/conference/vulncon2024/program#pThe-Trials-and-Tribulations-of-Bulk-Converting-CVEs-to-OSV)
from [interacting with the CVE Program and broader
ecosystem](https://osv.dev/blog/posts/introducing-broad-c-c++-support/), we've
found that the term "data quality" means different things to different people.

For OSV.dev, the primary objective is to enable awareness and remediation of
known vulnerabilities in open source components. To this end, "data quality"
means being able to reason about and act upon vulnerability records at scale.
This is why the OSV format was designed with machine readability as its primary
use case. Programmatically reasoning about OSV records at scale requires a
degree of consistency in field usage beyond what JSON Schema validation alone
can enforce.
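
To illustrate the kind of semantic check that sits beyond schema validation,
here is a minimal Go sketch (not OSV.dev's actual implementation) that verifies
each range in a record's `affected` entries contains an `introduced` event, a
property a purely type-level schema check can't easily express:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Record models only the subset of the OSV schema this check needs.
type Record struct {
	ID       string `json:"id"`
	Affected []struct {
		Ranges []struct {
			Type   string              `json:"type"`
			Events []map[string]string `json:"events"`
		} `json:"ranges"`
	} `json:"affected"`
}

func main() {
	raw, err := os.ReadFile(os.Args[1]) // path to an OSV JSON record
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	var rec Record
	if err := json.Unmarshal(raw, &rec); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// JSON Schema validation confirms "events" is an array of objects;
	// this semantic check confirms each range actually has an
	// "introduced" event, without which version matching is ambiguous.
	for i, aff := range rec.Affected {
		for j, rng := range aff.Ranges {
			found := false
			for _, ev := range rng.Events {
				if _, ok := ev["introduced"]; ok {
					found = true
					break
				}
			}
			if !found {
				fmt.Printf("%s: affected[%d].ranges[%d] (%s) has no introduced event\n",
					rec.ID, i, j, rng.Type)
			}
		}
	}
}
```

A schema validator would happily pass a record whose ranges contain only
`fixed` events; it takes a check like this to surface that gap.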

Problems that the OSV Data Quality Program seeks to address include:

- No way for record providers to know there are problems with records they have
already published
- OSV.dev accepts non-schema-compliant records
- OSV.dev accepts records with other validity issues (such as invalid package
names or non-existent package versions)
- No turnkey way for an OSV record provider to shift data quality checks earlier
in the record publication lifecycle
- No best-practice tooling for a new OSV record provider to create OSV records
- [Downstream data consumers often mistake OSV.dev as the originator of the data
and provide feedback about it to us, rather than to the record's originator](https://google.github.io/osv.dev/faq/#ive-found-something-wrong-with-the-data)
- Git repository owners may not be following best-practice release processes
(such as not using tags, or using unusual tag naming conventions), confounding
OSV.dev's ability to resolve fix commits for fix versions; this isn't known
until the first time a vulnerability referencing the repository is published

We have published our current opinion on the [Properties of a High Quality OSV
Record](https://google.github.io/osv.dev/data_quality.html), which goes above
and beyond JSON Schema compliance, and are working on an open source [OSV record
linting tool](https://github.com/ossf/osv-schema/tree/main/tools/osv-linter) to
programmatically validate records against these properties.
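
One such property is that an affected package name actually exists in its
ecosystem. As an illustration of that idea (not the linter's actual
implementation), a check for the npm ecosystem could ask the public registry
whether the name resolves:

```go
package main

import (
	"fmt"
	"net/http"
)

// npmPackageExists is an illustrative "package exists" check for the npm
// ecosystem: the public registry returns 404 for unknown package names.
// The real linter's implementation and ecosystem coverage may differ.
func npmPackageExists(name string) (bool, error) {
	resp, err := http.Get("https://registry.npmjs.org/" + name)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	ok, err := npmPackageExists("left-pad")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("package exists:", ok)
}
```

Checks in this spirit are what catch the invalid package names and non-existent
versions mentioned above before a record propagates downstream.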

Thereafter, we will begin gating record imports, accepting only records that
meet the quality requirements.

To enable the operators of the home databases that OSV.dev imports from to
reason about the acceptability of the records they publish, they will be able
to:

- run the OSV linter against their records as part of their publication
workflow (a minimal sketch of such a gate follows below)
- review OSV.dev's import findings about their records
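
Here is a minimal sketch of what that publication-workflow gate could look
like, assuming a hypothetical `lintRecord` helper standing in for an invocation
of the actual linter (consult the linter's README for its real interface):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// lintRecord is a hypothetical stand-in for running the OSV record linter
// on a single record; a real workflow would invoke the linter itself and
// collect its findings.
func lintRecord(path string) []string {
	// ... run checks, return human-readable findings ...
	return nil
}

// Gate the publication step: exit non-zero if any record has findings,
// so a CI pipeline refuses to publish the batch.
func main() {
	paths, err := filepath.Glob("records/*.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	failed := false
	for _, p := range paths {
		for _, finding := range lintRecord(p) {
			failed = true
			fmt.Fprintf(os.Stderr, "%s: %s\n", p, finding)
		}
	}
	if failed {
		os.Exit(1)
	}
}
```

Failing closed like this keeps problem records from ever reaching a home
database's published feed, rather than being flagged after import.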

You can follow our [progress on this journey on
GitHub](https://github.com/orgs/google/projects/62). Input and contributions
are, as always, appreciated.

If you're responsible for an existing home database that OSV.dev imports records
from, we will contact you directly before making any changes to the record
import process that may impact you. You can also proactively run our OSV record
linter on your existing records to see how they fare.

If you'd like to experiment with or help expand the capabilities of the OSV
record linter, it [currently resides in the OpenSSF OSV Schema GitHub
repository](https://github.com/ossf/osv-schema/tree/main/tools/osv-linter).

If you're an end consumer of OSV.dev's data, we hope this blog post encourages
you to continue to have confidence in the capabilities enabled by that data into
the future.