Commit 945e30c

fedorov and gitbook-bot authored and committed
GITBOOK-404: Data versioning updates
1 parent 506ef89 commit 945e30c

6 files changed: 35 additions and 7 deletions

.gitbook/assets/image (36).png

518 KB

.gitbook/assets/image (37).png

45.4 KB

.gitbook/assets/image (38).png

43.6 KB

.gitbook/assets/image (39).png

87.7 KB

cookbook/data-studio/README.md

Lines changed: 4 additions & 5 deletions
@@ -1,10 +1,9 @@
1-
# Data Studio
1+
# Looker dashboards
22

3-
[Google Data Studio](https://support.google.com/datastudio/answer/6283323?hl=en) is a free tool that turns your data into informative, easy to read, easy to share, and fully customizable dashboards and reports.
3+
[Google Looker Studio](https://support.google.com/datastudio/answer/6283323?hl=en) is a free tool that turns your data into informative, easy to read, easy to share, and fully customizable dashboards and reports.
44

55
{% hint style="info" %}
6-
If you would like to share an interesting Data Studio dashboard that uses IDC/cloud for imaging research, please let us know and we would be happy to review and reference it from the IDC documentation!
6+
If you would like to share an interesting Looker Studio dashboard that uses IDC/cloud for imaging research, please let us know and we would be happy to review and reference it from the IDC documentation!
77
{% endhint %}
88

9-
In this section you can learn how to very quickly make a custom DataStudio dashboard to explore the content of your cohort, and find some additional examples of using DataStudio for analyzing content of IDC.
10-
9+
In this section you can learn how to very quickly make a custom Looker Studio dashboard to explore the content of your cohort, and find some additional examples of using Looker Studio for analyzing content of IDC.

data/data-versioning.md

Lines changed: 31 additions & 2 deletions
@@ -1,11 +1,40 @@
11
# Data versioning
22

3+
## Summary
4+
5+
IDC updates its data offering every 2-4 months, with release timing driven by the availability of new data, updates to existing data, the introduction of new capabilities, and other priorities. You can see the historical summary of IDC releases on [this page](data-release-notes.md#idc-releases-summary-view).
6+
7+
Whenever you work with IDC data, you should be aware of the data release version. If you build cohorts using filters or queries, the results of those queries will change as IDC content evolves. Building queries that refer to a specific data release version ensures that the result stays the same.
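
To illustrate pinning a query to a release, here is a minimal sketch that rewrites references to the `idc_current` alias into a fixed `idc_v<number>` dataset name. The helper `pin_to_release` and the release number 17 are hypothetical and used only for illustration; the dataset naming follows the `idc_current` / `idc_v<number>` pattern used in the `bigquery-public-data` project.

```python
import re


def pin_to_release(sql: str, version: int) -> str:
    """Return `sql` with the idc_current alias replaced by idc_v<version>.

    Hypothetical helper for illustration only; it is not part of any IDC package.
    """
    if version < 1:
        raise ValueError("version must be a positive integer")
    # Word boundaries leave longer identifiers that merely start with
    # "idc_current" untouched.
    return re.sub(r"\bidc_current\b", f"idc_v{version}", sql)


# Query against the alias: its result changes with every new IDC release.
query = (
    "SELECT COUNT(DISTINCT SeriesInstanceUID) "
    "FROM `bigquery-public-data.idc_current.dicom_all`"
)

# Same query pinned to one release (17 is an assumed example number).
pinned_query = pin_to_release(query, 17)
print(pinned_query)
```

The pinned query can then be stored alongside analysis results so that the same cohort can be selected again later.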
8+
9+
How you find the version of IDC data you are interacting with depends on the interface you are using:
10+
11+
* **IDC Portal**: data version and release date are displayed in the summary strip
12+
13+
<figure><img src="../.gitbook/assets/image (36).png" alt="" width="375"><figcaption></figcaption></figure>
14+
* **idc-index**: use the `get_idc_version()` function
15+
16+
```python
17+
from idc_index import IDCClient
18+
19+
idc_version = IDCClient.get_idc_version()
20+
```
21+
22+
* **BigQuery**: within the `bigquery-public-data` project, the `idc_current` dataset contains table views that effectively provide an alias for the latest IDC data release. To find the actual IDC data release number, expand the list of datasets under the `bigquery-public-data` project and search for the ones that follow the pattern `idc_v<number>`. The one with the largest number corresponds to the latest released version and will match the content in `idc_current` (related Google bug [here](https://issuetracker.google.com/issues/324112186)).
23+
24+
<figure><img src="../.gitbook/assets/image (38).png" alt="" width="408"><figcaption></figcaption></figure>
25+
26+
* **3D Slicer / SlicerIDCBrowser**: version information is provided in the SlicerIDCBrowser module top panel, and in the pop-up window title.
27+
28+
<figure><img src="../.gitbook/assets/image (39).png" alt="" width="563"><figcaption></figcaption></figure>
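
The BigQuery lookup described above (pick the `idc_v<number>` dataset with the largest number) can also be done programmatically. The sketch below assumes you already have a list of dataset names; the helper `latest_release_dataset` and the sample names are hypothetical. In practice the names could come from the `google-cloud-bigquery` client, for example via `Client.list_datasets(project="bigquery-public-data")`, which requires Google Cloud credentials.

```python
import re

# Matches release datasets only, e.g. "idc_v9", and captures the number.
_RELEASE_PATTERN = re.compile(r"idc_v(\d+)")


def latest_release_dataset(dataset_ids):
    """Return the idc_v<number> dataset id with the largest number, or None.

    Hypothetical helper for illustration only.
    """
    releases = []
    for dataset_id in dataset_ids:
        match = _RELEASE_PATTERN.fullmatch(dataset_id)
        if match:
            # Compare numerically so that idc_v10 ranks above idc_v9.
            releases.append((int(match.group(1)), dataset_id))
    if not releases:
        return None
    return max(releases)[1]


# Sample dataset names (assumed, not an actual listing of the project).
names = ["idc_current", "idc_v9", "idc_v10", "some_other_dataset"]
print(latest_release_dataset(names))
```

Comparing the captured numbers as integers matters here: a plain string sort would rank `idc_v9` above `idc_v10`.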
29+
30+
## Implementation details
31+
332
The IDC obtains curated DICOM radiology, pathology and microscopy image and analysis data from The Cancer Imaging Archive (TCIA) and additional sources. Data from all these sources evolves over time as new data is added (common), existing files are corrected (rare), or data is removed (extremely rare).
433

534
Users interact with IDC using one of the following interfaces to define cohorts, and then perform analyses on these cohorts:
635

736
* [IDC Portal](https://portal.imaging.datacommons.cancer.gov/explore/) directly or using [IDC API](https://learn.canceridc.dev/api/getting-started): while this approach is most convenient, it allows searching using a small subset of attributes, defines cohorts only in terms of cases that meet the defined criteria, and has very limited options for combining multiple search criteria
8-
* [IDC BigQuery](https://console.cloud.google.com/bigquery?p=bigquery-public-data\&d=idc\_current\&t=dicom\_all\&page=table) tables via [SQL interface](https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction): this approach is most powerful, as it allows the use of [any of the DICOM metadata attributes](https://cloud.google.com/healthcare-api/docs/how-tos/dicom-bigquery-schema) to define the cohort, while leveraging the expressiveness of SQL in defining the selection logic, and allows to define cohort at any level of the data model hierarchy (i.e., instances, series, studies or cases)
37+
* [IDC BigQuery](https://console.cloud.google.com/bigquery?p=bigquery-public-data\&d=idc_current\&t=dicom_all\&page=table) tables via [SQL interface](https://cloud.google.com/bigquery/docs/reference/standard-sql/introduction): this approach is most powerful, as it allows the use of [any of the DICOM metadata attributes](https://cloud.google.com/healthcare-api/docs/how-tos/dicom-bigquery-schema) to define the cohort, while leveraging the expressiveness of SQL in defining the selection logic, and allows to define cohort at any level of the data model hierarchy (i.e., instances, series, studies or cases)
938

1039
The goal of IDC versioning is to create a series of “snapshots” over time of the entirety of the evolving IDC imaging dataset, such that searching an IDC version according to some criteria (creating a cohort) will always identify exactly the same set of objects. Here “identify” particularly means providing URLs or other access methods to the corresponding physical data objects.
1140

@@ -24,7 +53,7 @@ There are various reasons that can cause modification of the existing collection
2453

2554
These and other possible changes mean that DICOM instances, series and studies can change from one IDC data version to the next, while their DICOM UIDs remain unchanged. This motivates the need for maintaining versioning of the DICOM entities.
2655

27-
Because DICOM `SOPInstanceUIDs`, `SeriesInstanceUIDs` or `StudyInstanceUIDs` can remain invariant even when the composition of an instance, series or study changes, IDC assigns each version of each instance, series or study a [_UUID_](https://en.wikipedia.org/wiki/Universally\_unique\_identifier) to uniquely identify it and differentiate it from other versions of the same DICOM object.
56+
Because DICOM `SOPInstanceUIDs`, `SeriesInstanceUIDs` or `StudyInstanceUIDs` can remain invariant even when the composition of an instance, series or study changes, IDC assigns each version of each instance, series or study a [_UUID_](https://en.wikipedia.org/wiki/Universally_unique_identifier) to uniquely identify it and differentiate it from other versions of the same DICOM object.
2857

2958
{% hint style="info" %}
3059
It is very important to appreciate the difference between DICOM Unique Identifiers (UIDs) and CRDC Universally Unique Identifiers (UUIDs) assigned at the various levels of the DICOM hierarchy:
