Reading only metadata or only specific attributes will reduce the amount of data that needs to be pulled down under some circumstances and therefore make the loading process faster. This depends on the size of the attributes being retrieved, the `chunk_size` (a parameter of the `open()` method that controls how much data is pulled in each HTTP request to the server), and the position of the requested element within the file (since it is necessary to seek through the file until the requested attributes are found, but any data after the requested attributes need not be pulled). If you are not retrieving entire images, we strongly recommend specifying a `chunk_size` (in bytes) because the default value is around 40MB, which is typically far larger than the optimal value for accessing metadata attributes or individual frames (see later).
This works because running the [open](https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.blob.Blob#google_cloud_storage_blob_Blob_open) method on a Blob object returns a [BlobReader](https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.fileio.BlobReader) object, which has a "file-like" interface (specifically the `seek`, `read`, and `tell` methods).
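This "file-like" duck typing is easy to see with a toy object. The sketch below is illustrative only (it is not part of `google-cloud-storage` or `pydicom`): it implements just `read`, `seek`, and `tell` over an in-memory buffer and counts the bytes actually read, which is the quantity a remote blob reader translates into HTTP range requests.

```python
import io

class CountingReader:
    """Toy file-like object exposing only seek, read, and tell,
    which is all a DICOM reader needs; it also counts the bytes
    actually read (the part that costs network traffic for a blob)."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._buf.read(size)
        self.bytes_read += len(chunk)
        return chunk

    def seek(self, offset, whence=0):
        return self._buf.seek(offset, whence)

    def tell(self):
        return self._buf.tell()

# Seeking is free; only read() pulls bytes, mirroring how a remote
# blob reader only issues requests for the ranges actually read
reader = CountingReader(b"\x00" * 1000)
reader.seek(500)
reader.read(100)
print(reader.tell(), reader.bytes_read)  # 600 100
```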
**From AWS S3 blobs**
The `s3fs` [package](https://s3fs.readthedocs.io/en/latest/) provides a "file-like" interface for accessing S3 blobs. It can be installed with `pip install s3fs`. The following example repeats the above example using the counterpart of the same blob on AWS S3.
```python
import s3fs
from pydicom import dcmread

from idc_index import IDCClient

# Create IDCClient for looking up bucket URLs
idc_client = IDCClient()

# Get the list of file URLs in the AWS bucket from the SeriesInstanceUID
# (series_instance_uid as used in the GCS examples above)
file_urls = idc_client.get_series_file_URLs(series_instance_uid)

# Configure an anonymous client (no credentials are needed for public data)
# with a small block size suited to reading metadata
s3_client = s3fs.S3FileSystem(anon=True, default_block_size=50_000)

# Read an entire file
with s3_client.open(file_urls[0], 'rb') as reader:
    dcm = dcmread(reader)

# Read metadata only (no pixel data)
with s3_client.open(file_urls[0], 'rb') as reader:
    dcm = dcmread(reader, stop_before_pixels=True)

# Read only specific attributes, identified by their tag
# (here the Manufacturer and ManufacturerModelName attributes)
with s3_client.open(file_urls[0], 'rb') as reader:
    dcm = dcmread(
        reader,
        specific_tags=["Manufacturer", "ManufacturerModelName"],
        stop_before_pixels=True,
    )
```
In the remainder of the examples, we will use only the GCS access method for brevity. However, you should be able to straightforwardly swap out the opened GCS blob for the opened AWS S3 blob to achieve the same effect with Amazon S3.
Similar to the `chunk_size` parameter in GCS, the `default_block_size` parameter is crucially important in determining how efficient this is. Its default value is around 50MB, which would result in orders of magnitude more data being pulled than is needed to retrieve the metadata. In the above example, we set it to 50kB.
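The effect of the block size is simple arithmetic. As a rough sketch (with illustrative numbers, assuming each request pulls one full block):

```python
import math

def bytes_pulled(bytes_needed, block_size):
    """Total bytes transferred when each request pulls a whole block
    (a simplification of how buffered blob readers behave)."""
    n_requests = math.ceil(bytes_needed / block_size)
    return n_requests * block_size

metadata_size = 20_000  # ~20kB of DICOM metadata (illustrative)

# With the ~50MB default block size, one request pulls the whole block
print(bytes_pulled(metadata_size, 50_000_000))  # 50000000

# With a 50kB block size, the transfer is close to what is needed
print(bytes_pulled(metadata_size, 50_000))  # 50000
```

Reading ~20kB of metadata with the default block size therefore transfers the full 50MB block, while a 50kB block size keeps the transfer close to what is actually needed.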
### Frame-level access with Highdicom
[Highdicom](https://highdicom.readthedocs.io) is a higher-level library providing several features to work with images and image-derived DICOM objects. As of the release 0.25.1, its various reading methods (including [imread](https://highdicom.readthedocs.io/en/latest/package.html#highdicom.imread), [segread](https://highdicom.readthedocs.io/en/latest/package.html#highdicom.seg.segread), [annread](https://highdicom.readthedocs.io/en/latest/package.html#highdicom.ann.annread), and [srread](https://highdicom.readthedocs.io/en/latest/package.html#highdicom.sr.srread)) can read any file-like object, including Google Cloud blobs and S3 blobs opened with `s3fs`.
A particularly useful feature when working with blobs is ["lazy" frame retrieval](https://highdicom.readthedocs.io/en/latest/image.html#lazy) for images and segmentations. This downloads only the image metadata when the file is initially loaded, uses it to create a frame-level index, and downloads specific frames as and when they are requested by the user. This is especially useful for large multiframe files (such as those found in slide microscopy or multi-segment binary or fractional segmentations) as it can significantly reduce the amount of data that needs to be downloaded to access a subset of the frames.
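The strategy can be sketched in a few lines. This is a toy illustration of the idea, not highdicom's actual implementation: build an index of frame offsets from the metadata, then seek and read only the frames that are requested.

```python
import io

class LazyFrameReader:
    """Toy version of lazy frame retrieval: record each frame's (offset,
    length) up front, then seek and read a frame only when requested."""

    def __init__(self, fp, frame_index):
        self._fp = fp              # any object with seek/read
        self._index = frame_index  # list of (offset, length) pairs

    def get_frame(self, i):
        offset, length = self._index[i]
        self._fp.seek(offset)
        return self._fp.read(length)

# Fake "file" containing three 4-byte frames after an 8-byte header
data = b"HDRHDR__" + b"AAAA" + b"BBBB" + b"CCCC"
index = [(8, 4), (12, 4), (16, 4)]

reader = LazyFrameReader(io.BytesIO(data), index)
# Only the bytes of frame 2 are read; frames 0 and 1 are never pulled
print(reader.get_frame(2))  # b'CCCC'
```

With a remote blob in place of the `BytesIO` object, each `get_frame` call translates into a small ranged download rather than a full-file transfer.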
In this first example, we use lazy frame retrieval to load only a specific spatial patch from a large whole slide image from the IDC using GCS.
```python
import numpy as np
import highdicom as hd
import matplotlib.pyplot as plt
from google.cloud import storage
from pydicom import dcmread

from idc_index import IDCClient

# Create IDCClient for looking up bucket URLs
idc_client = IDCClient()

# Install additional component of idc-index to resolve SM instances to file URLs
idc_client.fetch_index("sm_instance_index")

# Given SeriesInstanceUID of an SM series, find the instance that corresponds to the
# highest resolution base layer of the image pyramid
query = """
SELECT SOPInstanceUID, TotalPixelMatrixColumns
FROM sm_instance_index
WHERE SeriesInstanceUID = '1.3.6.1.4.1.5962.99.1.1900325859.924065538.1719887277027.4.0'
ORDER BY TotalPixelMatrixColumns DESC
LIMIT 1
"""
result = idc_client.sql_query(query)

# Get URL corresponding to the base layer instance in the Google Storage bucket
# ...

# Read directly from the blob object using lazy frame retrieval
with base_layer_blob.open(mode="rb", chunk_size=500_000) as reader:
    im = hd.imread(reader, lazy_frame_retrieval=True)

# Grab an arbitrary region of the full pixel matrix
region = im.get_total_pixel_matrix(
    row_start=15000,
    row_end=15512,
    column_start=17000,
    column_end=17512,
    dtype=np.uint8
)

# Show the region
plt.imshow(region)
plt.show()
```

Running this code should produce an output that looks like this:
<div align="center"><img src="../../.gitbook/assets/slide_screenshot.png" alt="Screenshot of slide region" height="454" width="524"></div>
The next example repeats this on the same image in AWS S3:
```python
import numpy as np
import highdicom as hd
import matplotlib.pyplot as plt
from pydicom import dcmread
import s3fs

from idc_index import IDCClient

# Create IDCClient for looking up bucket URLs
idc_client = IDCClient()

# Install additional component of idc-index to resolve SM instances to file URLs
idc_client.fetch_index("sm_instance_index")

# Given SeriesInstanceUID of an SM series, find the instance that corresponds to the
# highest resolution base layer of the image pyramid
query = """
SELECT SOPInstanceUID, TotalPixelMatrixColumns
FROM sm_instance_index
WHERE SeriesInstanceUID = '1.3.6.1.4.1.5962.99.1.1900325859.924065538.1719887277027.4.0'
ORDER BY TotalPixelMatrixColumns DESC
LIMIT 1
"""
result = idc_client.sql_query(query)

# Get URL corresponding to the base layer instance in the AWS S3 bucket
# ...

# Configure a client to avoid the need for AWS credentials
s3_client = s3fs.S3FileSystem(
    anon=True,  # no credentials needed to access public data
    default_block_size=500_000,  # ~500kB of data pulled in each request
    use_ssl=False,  # disable encryption for a speed boost
)

# Read directly from the blob object using lazy frame retrieval
with s3_client.open(base_layer_file_url, 'rb') as reader:
    im = hd.imread(reader, lazy_frame_retrieval=True)

# Grab an arbitrary region of the full pixel matrix
region = im.get_total_pixel_matrix(
    row_start=15000,
    row_end=15512,
    column_start=17000,
    column_end=17512,
    dtype=np.uint8
)

# Show the region
plt.imshow(region)
plt.show()
```
In both cases, we set the `chunk_size`/`default_block_size` to around 500kB, which should be enough to ensure each frame can be retrieved in a single request while minimizing further unnecessary data retrieval.
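As a rough sanity check on that figure (using assumed, illustrative numbers: 512×512 RGB tiles and roughly 10:1 JPEG compression), a single stored frame comes out well under 500kB:

```python
# Back-of-envelope check on the ~500kB figure (illustrative assumptions:
# 512x512 RGB tiles and roughly 10:1 JPEG compression)
rows, cols, samples = 512, 512, 3
uncompressed = rows * cols * samples  # bytes per uncompressed frame
compressed = uncompressed // 10       # approximate bytes per stored frame

block_size = 500_000
frames_per_block = block_size // compressed
print(uncompressed, compressed, frames_per_block)  # 786432 78643 6
```

Under these assumptions each ~500kB request comfortably covers a whole frame; the right value for your data depends on the actual tile size and compression ratio.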
As a further example, we use lazy frame retrieval to load only a specific set of segments from a large multi-organ segmentation of a CT image in the IDC stored in binary format (in binary segmentations, each segment is stored using a separate set of frames) using GCS.