
Commit e8c9b44

Enhance Foundry usability, testability, and documentation
This commit introduces several improvements to the Foundry library aimed at making it easier to use, more robust to test, and better documented. Key changes include:

- **Authentication:**
  - Clarified Globus authentication parameters (`no_browser`, `no_local_server`) in `Foundry.__init__` docstrings.
  - Improved error handling during authentication setup by raising a `RuntimeError` with a more user-friendly message.
- **Testability:**
  - Significantly refactored `tests/test_foundry.py` to use `unittest.mock` for external service interactions (Globus, Forge, MDF Connect).
  - Introduced `tests/conftest.py` for shared mocking fixtures, enabling offline and more reliable test execution.
- **Usage Patterns & API:**
  - `Foundry.list()` now consistently returns a `List[FoundryDataset]`.
  - Introduced `DatasetNotFoundError` for more specific error handling in `Foundry.search()`.
  - Refined Pydantic `ValidationError` handling in `FoundryDataset` and `foundry/models.py` to provide clearer error feedback and avoid unnecessary re-validation.
  - Updated `FoundryDataset.clean_dc_dict()` for Pydantic V2 compatibility and replaced a print statement with logging.
- **Documentation:**
  - Updated `README.md` and `docs/` to reflect all code changes.
  - Provided detailed explanations of the Globus authentication flow, the `no_browser`/`no_local_server` options, and `use_globus=False` for data transfers.
  - Added a comprehensive "Common Foundry Errors" section to `docs/support/troubleshooting.md`.
  - Overhauled `docs/publishing/publishing-datasets.md` with current API examples.
  - Added guidance for contributors on testing with mocks in `docs/how-to-contribute/contributing.md`.
  - Updated examples in `README.md` and `docs/examples.md` to use current API patterns.
1 parent 0c028b8 commit e8c9b44
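The `DatasetNotFoundError` mentioned in the commit message can be illustrated with a minimal sketch. This is a hypothetical reconstruction of the pattern, not code from the diff: the real class in `foundry` may carry different attributes, and the real `search()` queries a live index rather than a list.

```python
# Hypothetical sketch of the DatasetNotFoundError pattern described in the
# commit message; the actual implementation in foundry/ may differ.

class DatasetNotFoundError(Exception):
    """Raised when a search query matches no datasets."""

    def __init__(self, query):
        self.query = query
        super().__init__(f"No datasets found matching query: {query!r}")


def search(query, raw_results):
    """Return results whose name contains the query, or raise if none match."""
    matches = [r for r in raw_results if query in r.get("name", "")]
    if not matches:
        raise DatasetNotFoundError(query)
    return matches
```

A dedicated exception type lets callers distinguish "no match" from transport or validation failures with a targeted `except` clause, instead of inspecting an empty result container.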

File tree

12 files changed: +1223 −369 lines

README.md

Lines changed: 82 additions & 14 deletions
````diff
@@ -30,29 +30,97 @@ DLHub documentation for model publication and running information can be found [
 Install Foundry-ML via command line with:
 `pip install foundry_ml`
 
-You can use the following code to import and instantiate Foundry-ML, then load a dataset.
+You can use the following code to import and instantiate Foundry-ML, then find and load a dataset.
 
 ```python
 from foundry import Foundry
-f = Foundry(index="mdf")
+import pandas as pd  # Optional: for handling search results as a DataFrame
+
+# Initialize Foundry.
+# For remote environments (e.g., Google Colab, Binder), use:
+#   f = Foundry(no_browser=True, no_local_server=True)
+# By default, Foundry uses the "mdf" index and Globus for data transfers.
+# To disable Globus transfers (e.g., if Globus Connect Personal is not set up), use:
+#   f = Foundry(use_globus=False)
+f = Foundry()
+
+# Search for a dataset by DOI or query string.
+# This example uses a DOI. Searching by query (e.g., "elwood") also works.
+dataset_doi = "10.18126/e73h-3w6n"  # Example: Elwood Monomer Properties
+results_df = f.search(dataset_doi)
+
+if results_df.empty:
+    print(f"Dataset with DOI {dataset_doi} not found.")
+else:
+    # Access the FoundryDataset object from the search results DataFrame.
+    # The DataFrame might contain multiple results if searching by query string.
+    dataset = results_df.iloc[0].FoundryDataset
+
+    # Display dataset metadata (in a Jupyter environment, it renders as HTML)
+    print(f"Dataset Name: {dataset.dataset_name}")
+    # In Jupyter, just 'dataset' on a line would render its HTML representation:
+    # dataset
+
+    # Load the actual data from the dataset.
+    # This might download files if not already cached.
+    # 'load()' is an alias for 'get_as_dict()'.
+    # Specify splits if the dataset has them (e.g., "train", "test").
+    # The structure of 'data_splits' depends on the dataset's specific schema.
+    try:
+        data_splits = dataset.load()  # Loads all available splits if `split` param is None
+
+        # Example: Accessing data from a 'train' split (structure is dataset-dependent).
+        # This part of the example assumes a specific dataset structure for demonstration.
+        # You'll need to inspect 'data_splits.keys()' and the dataset's metadata
+        # to understand how to access its specific contents.
+        if "train" in data_splits:
+            train_data = data_splits["train"]
+            # Suppose train_data is a dict with 'input' and 'target' keys,
+            # and 'input' itself is a dict containing 'imgs', 'metadata', etc.
+            if isinstance(train_data, dict) and "input" in train_data and "target" in train_data:
+                # The following lines are highly specific to the original example's dataset structure
+                # and may not apply to the dataset "10.18126/e73h-3w6n".
+                # Adapt based on the actual structure of the loaded dataset.
+                # imgs = train_data['input'].get('imgs', {})
+                # desc = train_data['input'].get('metadata', {})
+                # coords = train_data['target'].get('coords', {})
+                print("Train data loaded successfully. Explore its structure.")
+            else:
+                print(f"Train data structure: {type(train_data)}")
+        else:
+            print(f"No 'train' split found. Available splits: {list(data_splits.keys())}")
+
+    except Exception as e:
+        print(f"Error loading data: {e}")
 
-
-f = f.load("10.18126/e73h-3w6n", globus=True)
 ```
-*NOTE*: If you run locally and don't want to install the [Globus Connect Personal endpoint](https://www.globus.org/globus-connect-personal), just set the `globus=False`.
-
-If running this code in a notebook, a table of metadata for the dataset will appear:
+*NOTE*: Foundry uses Globus for authentication and (by default) for efficient data transfers. If you run locally and don't want to install [Globus Connect Personal](https://www.globus.org/globus-connect-personal), you can initialize Foundry with `f = Foundry(use_globus=False)`. This will use HTTPS for downloads, which may be slower for large datasets. For cloud environments, initializing with `f = Foundry(no_browser=True, no_local_server=True)` enables a compatible authentication flow.
 
+If running this code in a notebook and a `FoundryDataset` object is the last item in a cell, its metadata will be displayed as an HTML table:
 <img width="903" alt="metadata" src="https://user-images.githubusercontent.com/16869564/197038472-0b6ae559-4a6b-4b20-88e5-679bb6eb4f5c.png">
+*(Image shows an example of metadata display)*
 
-We can use the data with `f.load_data()` and specifying splits such as `train` for different segments of the dataset, then use matplotlib to visualize it.
-
+Visualizing data (example assumes a specific image dataset structure):
 ```python
-res = f.load_data()
-
-imgs = res['train']['input']['imgs']
-desc = res['train']['input']['metadata']
-coords = res['train']['target']['coords']
+# This visualization example is illustrative and depends heavily on the dataset's structure.
+# After loading data with `data_splits = dataset.load()`,
+# you would adapt the following based on the actual content of `data_splits`.
+
+# For instance, if you loaded the original example's atomic position dataset:
+# imgs = data_splits['train']['input']['imgs']
+# coords = data_splits['train']['target']['coords']
+
+# key_list = list(imgs.keys())[offset:n_images+offset]  # Choose some images
+
+# import matplotlib.pyplot as plt  # Ensure plt is imported
+# fig, axs = plt.subplots(1, n_images, figsize=(20,20))
+# for i in range(n_images):
+#     axs[i].imshow(imgs[key_list[i]])
+#     axs[i].scatter(coords[key_list[i]][:,0], coords[key_list[i]][:,1], s=20, c='r', alpha=0.5)
+# plt.show()  # Display the plot
 ```
+<img width="595" alt="Screen Shot 2022-10-20 at 2 22 43 PM" src="https://user-images.githubusercontent.com/16869564/197039252-6d9c78ba-dc09-4037-aac2-d6f7e8b46851.png">
+*(Image shows an example of data visualization)*
 
 n_images = 3
 offset = 150
````

docs/README.md

Lines changed: 23 additions & 4 deletions
````diff
@@ -16,10 +16,29 @@ pip install foundry-ml
 
 ### Globus
 
-Foundry uses the Globus platform for authentication, search, and to optimize some data transfer operations. Follow the steps below to get set up.
-
-* [Create a free account.](https://app.globus.org) You can create a free account here with your institutional credentials or with free IDs \(GlobusID, Google, ORCID, etc\).
-* [Set up a Globus Connect Personal endpoint ](https://www.globus.org/globus-connect-personal)_**\(optional\)**_. While this step is optional, some Foundry capabilities will work more efficiently when using GCP.
+Foundry uses the Globus platform for authentication, search, and (optionally) to optimize data transfer operations. Here’s how it works and how to get set up:
+
+1. **Globus Account:**
+   * You'll need a Globus account. You can [create a free account](https://app.globus.org) using your institutional credentials, GlobusID, Google ID, ORCID iD, etc.
+
+2. **Authentication Process:**
+   * When you first initialize `Foundry()` (e.g., `f = Foundry()`), the library attempts to authenticate with Globus.
+   * **Interactive Environments (Default):** By default, Foundry will try to open a web browser, taking you to a Globus authentication page. After you successfully log in and grant consent, Globus will redirect you back to a local web server that Foundry starts temporarily on your machine (usually on `localhost` at a specific port) to capture an authentication code. Once the code is captured, the local server shuts down, and authentication is complete.
+   * **Tokens:** Successful authentication results in securing tokens from Globus, which are then used for accessing Foundry services (like search and data transfer). These tokens are typically long-lived but can expire. Foundry will attempt to refresh them automatically when needed. If a refresh fails (e.g., after a very long period of inactivity or if consents are revoked), you might need to re-authenticate.
+
+3. **Headless or Remote Environments:**
+   * If you are using Foundry on a remote server, a Jupyter Hub, Google Colab, or any environment where a browser cannot be automatically opened or a local redirect server cannot be reliably used, you need to modify the authentication flow:
+     `f = Foundry(no_browser=True, no_local_server=True)`
+   * **`no_browser=True`**: Prevents Foundry from trying to automatically open a web browser. Instead, it will print a Globus authentication URL to your console. You must copy this URL and paste it into a browser on your local machine (or any machine where you can access a browser).
+   * **`no_local_server=True`**: After you authenticate in the browser via the URL provided, Globus will redirect you to a page displaying an authentication code. You must copy this code from your browser and paste it back into your terminal/notebook where Foundry is waiting for it.
+   * This two-parameter setup enables authentication in almost any environment.
+
+4. **Globus Connect Personal (GCP) _(Optional)_:**
+   * For the most efficient data transfers, especially for large datasets, Foundry can use Globus transfers. To enable your local machine (like a laptop or workstation) or a server to be a source or destination for Globus transfers, you can install [Globus Connect Personal](https://www.globus.org/globus-connect-personal).
+   * If you don't have GCP set up or prefer not to use Globus for transfers, you can initialize Foundry to use HTTPS for downloads:
+     `f = Foundry(use_globus=False)`
+   * HTTPS downloads are universally compatible but may be slower than Globus transfers for large datasets or datasets with many files.
+   * Note: Even if `use_globus=False` is set for data transfers, Globus is still used for the initial authentication and for searching datasets.
 
 ## Project Support
````

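The commit message notes that authentication setup now raises a `RuntimeError` with a more user-friendly message. A hedged sketch of that wrap-and-re-raise pattern follows; the `setup_auth` helper and the message text are illustrative stand-ins, not Foundry's actual internals.

```python
# Illustrative sketch of the "raise RuntimeError with a friendlier message"
# pattern described in the commit; real Foundry internals may differ.

def setup_auth(authenticate):
    """Run an authentication callable, converting low-level failures
    into a single user-facing RuntimeError that preserves the cause."""
    try:
        return authenticate()
    except Exception as exc:
        raise RuntimeError(
            "Globus authentication failed. If you are in a headless "
            "environment, try Foundry(no_browser=True, no_local_server=True). "
            f"Original error: {exc}"
        ) from exc
```

Chaining with `from exc` keeps the original traceback available for debugging while callers only need to handle one predictable exception type.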
docs/concepts/overview.md

Lines changed: 4 additions & 0 deletions
````diff
@@ -7,5 +7,9 @@ TODO:
 
 ![](../.gitbook/assets/foundry-overview.png)
 
+{% hint style="info" %}
+**Note:** The code snippet in the image above uses a simplified, older version of the API. For current usage, please refer to the examples in our Quickstart guide in the main [Foundry documentation](https://ai-materials-and-chemistry.gitbook.io/foundry/getting-started-1/examples). Key differences include using `f.search()` to find datasets and `dataset_object.get_as_dict()` (or its alias `dataset_object.load()`) to load data.
+{% endhint %}
+
 
 
````
docs/examples.md

Lines changed: 45 additions & 12 deletions
````diff
@@ -29,24 +29,54 @@ f.list()
 
 ### Loading Datasets
 
-The Foundry client can be used to access datasets using a `source_id`, e.g. here `"_test_foundry_fashion_mnist_v1.1"`_._ You can retrieve the `source_id` from the [`list()` method](examples.md#listing-datasets).
+The Foundry client can be used to access datasets using their unique identifier (often a DOI or a specific `source_id`). You can find these identifiers by using the `f.list()` or `f.search("your query")` methods.
 
+Let's load the metadata for a dataset. This example uses the `source_id` for the Fashion MNIST test dataset.
 ```python
 from foundry import Foundry
-f = Foundry()
-f = f.load("_test_foundry_fashion_mnist_v1.1")
+f = Foundry()  # Assumes default interactive authentication
+
+# Search for the dataset.
+# Replace with a known DOI or source_id for a dataset you want to access.
+dataset_identifier = "_test_foundry_fashion_mnist_v1.1"
+results_df = f.search(dataset_identifier)
+
+if results_df.empty:
+    print(f"Dataset '{dataset_identifier}' not found.")
+    dataset = None
+else:
+    # Get the FoundryDataset object
+    dataset = results_df.iloc[0].FoundryDataset
+    print(f"Found dataset: {dataset.dataset_name}")
+    # In a Jupyter notebook, simply typing 'dataset' on its own line would display its metadata.
 ```
 
-This will remotely load the metadata \(e.g., data location, data keys, etc.\) and download the data to local storage if it is not already cached. Data can be downloaded via HTTPS without additional setup or more optimally with a Globus endpoint [set up](https://www.globus.org/globus-connect-personal) on your machine.
-
-Once the data are accessible locally, access the data with the `load_data()` method. Load data allows you to load data from a specific split that is defined for the dataset, here we use `train`.
+This will remotely load the dataset's metadata (e.g., data location, data keys, etc.). The actual data files are downloaded by the `FoundryCache` when you request the data if they are not already cached locally. By default, data is downloaded via Globus Transfer if `use_globus=True` (the default) and you have Globus Connect Personal set up. Otherwise, or if `use_globus=False`, it will use HTTPS.
 
+Once you have the `dataset` object, you can load its data:
 ```python
-res = f.load_data()
-X,y = res['train']
+if dataset:
+    try:
+        # Load the data. dataset.load() is an alias for dataset.get_as_dict().
+        # This loads all splits by default. You can specify splits, e.g., dataset.load(split="train")
+        data_splits = dataset.load()
+
+        # The structure of data_splits depends on the dataset.
+        # For Fashion MNIST, it might look like:
+        # {'train': {'input': <data>, 'output': <data>}, 'test': {'input': <data>, 'output': <data>}}
+
+        if "train" in data_splits:
+            X_train = data_splits['train']['input']
+            y_train = data_splits['train']['output']
+            print(f"Successfully loaded train data. X_train type: {type(X_train)}, y_train type: {type(y_train)}")
+        else:
+            print(f"No 'train' split found. Available splits: {list(data_splits.keys())}")
+
+    except Exception as e:
+        print(f"Error loading data for {dataset.dataset_name}: {e}")
 ```
 
-The data are then usable within the `X` and `y` variables. This full example can be found in [`/examples/fashion-mnist/`](https://github.com/MLMI2-CSSI/foundry/tree/master/examples/fashion-mnist).
+The data are then usable within variables like `X_train` and `y_train`. This full example can be found in [`/examples/fashion-mnist/`](https://github.com/MLMI2-CSSI/foundry/tree/master/examples/fashion-mnist) (note: the example notebook there might also need updates to align with the current API).
 
 ## Using Foundry on Cloud Computing Resources
 
@@ -56,14 +86,17 @@ Foundry works with common cloud computing providers \(e.g., the NSF sponsored Je
 f = Foundry(no_browser=True, no_local_server=True)
 ```
 
-When downloading data, add the following argument to download via HTTPS.
+When downloading data (which happens when `dataset.load()` or similar methods are called), Foundry uses Globus by default. To use HTTPS instead (e.g., if Globus Connect Personal is not available or desired for transfers):
 
 {% hint style="info" %}
 This method may be slow for large datasets and datasets with many files
 {% endhint %}
 
 ```python
-f.load(globus=False)
-X, y = f.load_data()
+# Initialize Foundry to use HTTPS for all data transfers
+f = Foundry(use_globus=False, no_browser=True, no_local_server=True)
+# Then proceed with f.search(...) and dataset.load() as above.
+# The dataset object created from this Foundry instance will inherit the use_globus=False setting.
 ```
+Alternatively, if you have an existing `Foundry` instance `f` that was initialized with `use_globus=True`, creating a new one with `use_globus=False` is the way to switch to HTTPS for subsequent operations with datasets derived from that new instance. The `use_globus` preference is tied to the `FoundryCache` object, which is set up when `Foundry` is initialized.
 
````

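The `use_globus` behavior documented in the examples changes above boils down to selecting a transfer backend at initialization time. A toy sketch of that decision follows; the function name and the backend labels (`"globus"`, `"https"`) are illustrative, not Foundry's real internals.

```python
# Toy sketch of the transfer-backend choice the docs describe: Globus when
# enabled and a local endpoint exists, HTTPS otherwise. Hypothetical names.

def choose_transfer(use_globus: bool, gcp_available: bool) -> str:
    """Pick a download method per the documented behavior."""
    if use_globus and gcp_available:
        return "globus"
    # HTTPS is universally compatible but may be slower for large datasets.
    return "https"
```

Tying this preference to one object created at initialization (as the docs say `FoundryCache` does) means every dataset derived from that instance downloads the same way, which avoids per-call surprises.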
docs/how-to-contribute/contributing.md

Lines changed: 17 additions & 0 deletions
````diff
@@ -35,6 +35,23 @@ If you want to contribute, start working through the Foundry codebase, navigate
 
 guide.
 
+### Testing with Mocks for External Services
+Much of Foundry's functionality involves interacting with external services like Globus (for authentication, search, and transfer) and the Materials Data Facility (MDF Connect for publishing). To ensure our tests are reliable, fast, and can run in offline environments (like CI), we extensively use mocking.
+
+* **Core Idea:** Instead of making real network calls in tests, we replace parts of our code (or the external libraries Foundry uses) with "mock" objects. These mocks simulate the behavior of the real services.
+* **Tools:** We primarily use Python's built-in `unittest.mock` library, often through the `pytest-mock` plugin, which provides convenient fixtures (e.g., the `mocker` fixture).
+* **Where to Find Mocks:**
+  * Shared, reusable mocks for core components (like a mocked `Foundry` client instance or mock authorizers) are often defined as fixtures in `tests/conftest.py`. For example, `mock_foundry` provides a `Foundry` instance where authentication and other clients are already mocked.
+  * For specific tests, you might apply mocks directly using `mocker.patch(...)` or `@patch(...)` decorators.
+* **How It Works:**
+  * When testing a function that, for example, searches for datasets, we would mock the method on the `ForgeClient` that actually performs the search (e.g., `foundry_instance.forge_client.search`).
+  * We configure this mock to return a predefined, sample response (like a list of dataset metadata dictionaries).
+  * This allows us to test the logic of our function (how it processes the search results, how it handles errors, etc.) without relying on a live search backend or specific data existing there.
+* **Contributing Tests:** If you're adding a feature that interacts with an external service:
+  * Please include unit tests that use mocks to cover its behavior.
+  * Check `tests/conftest.py` and existing tests in `tests/` for examples of how to set up and use mocks for the services your feature touches.
+  * The goal is to make your tests deterministic and independent of external factors.
+
 ## Pull Request Process
 
 1. Ensure any install or build dependencies are removed before the end of the layer when doing a
````