source attribute of file and dir entities with special characters #226

@LauLauThom

Dear all,

I am opening this issue to discuss the source attribute of data entities.
It relates to the escaping of special characters, which was the topic of some past issues and PRs (#217, #225).
The fixes are not yet part of the current release at the time of writing (0.13.0), so please use the latest version "from source" to test the example below.

I noticed that the source attribute of File and Dataset entities currently returns a Path object.
This Path is valid (i.e. it resolves to the actual file or directory on disk) as long as the filepath contains no special characters (such as spaces or accented characters).
With special characters, however, the returned Path contains the percent-escaped characters (as in a URL), so source.exists() returns False.
The workaround to retrieve the actual filepath as a "valid" Path is to unquote it with Path(unquote(str(file.source))), which is quite heavy notation.
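
To make the escaping and the workaround concrete, here is a minimal standard-library sketch. The helper name `unquoted_path` is hypothetical and not part of ro-crate-py; it only wraps the unquote call mentioned above.

```python
from pathlib import Path
from urllib.parse import quote, unquote


def unquoted_path(source):
    """Hypothetical helper: decode percent-escapes and return a Path."""
    return Path(unquote(str(source)))


# Percent-encoding turns each space into "%20", as in a URL
escaped = Path(quote("test_crate/file with space.txt"))
print(escaped.as_posix())                 # test_crate/file%20with%20space.txt
print(unquoted_path(escaped).as_posix())  # test_crate/file with space.txt
```

A path without special characters passes through unchanged, so such a helper could be applied to every local source.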

I know that the source can also be a URL, in which case it makes sense to keep the % characters.
Still, I see a potential for hard-to-debug issues: someone could write code expecting source to be a valid path, only to find out later that it breaks as soon as a filepath contains special characters.

I was wondering whether source could return a valid path when it points to a local entity, and a URL when it points to a remote entity.
If source were a property of the class FileOrDir, the getter could look something like this:

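# in rocrate/model/file_or_dir.py
from pathlib import Path
from urllib.parse import unquote

import validators  # third-party package

class FileOrDir(DataEntity):

    def get_source(self):
        source = str(self.source)  # validators.url expects a string
        if validators.url(source):  # a URL to remote data: return it as such
            return self.source
        # a local entity: decode the percent-escapes to get a "valid" Path
        return Path(unquote(source))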
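
For comparison, the same idea can be sketched with the standard library only, without the validators dependency. This is an illustration under stated assumptions, not the library's behaviour: the function name `resolve_source` is hypothetical, and only the schemes listed in `REMOTE_SCHEMES` are treated as remote.

```python
from pathlib import Path
from urllib.parse import unquote, urlsplit

# Assumption: only these schemes count as "remote" in this sketch
REMOTE_SCHEMES = {"http", "https", "ftp"}


def resolve_source(source):
    """Hypothetical helper: URL string for remote sources, decoded Path otherwise."""
    text = str(source)
    if urlsplit(text).scheme in REMOTE_SCHEMES:
        return text  # remote entity: keep the percent-escapes
    return Path(unquote(text))  # local entity: decode to a usable Path


print(resolve_source("https://example.org/my%20file.txt"))
print(resolve_source("test_crate/file%20with%20space.txt"))
```

Checking the scheme against an explicit allowlist avoids mistaking a Windows drive letter such as `C:` for a URL scheme.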

Below is a short example demonstrating the current behaviour.

from rocrate.rocrate import ROCrate
import os
from urllib.parse import unquote
from pathlib import Path

new_crate_root = "./test_crate"

# create the crate root directory if it does not exist yet
os.makedirs(new_crate_root, exist_ok=True)

# create a dummy file, illustrating "usual" behaviour
with open(os.path.join(new_crate_root, "file.txt"), "w") as f:
    f.write("empty")

# create a file with spaces in the name
with open(os.path.join(new_crate_root, "file with space.txt"), "w") as f:
    f.write("empty")

# Initialize a crate from the directory
crate = ROCrate(new_crate_root, init=True)
crate.write(new_crate_root) 

# then reload it, this time parsing it
crate = ROCrate(new_crate_root)
list_files = crate.get_by_type("File")

for file_entity in list_files:
    print(f"{file_entity.source=}")
    print("source exists :", file_entity.source.exists())

    # decode the percent-escapes to get back the path as it is on disk
    source_unquoted = Path(unquote(str(file_entity.source)))
    print("source (unquoted) exists :", source_unquoted.exists())
    print("")

Output:

file_entity.source=PosixPath('test_crate/file.txt')
source exists : True
source (unquoted) exists : True

file_entity.source=PosixPath('test_crate/file%20with%20space.txt')
source exists : False
source (unquoted) exists : True
