The Tutte Institute for Mathematics and Computing, as part of its data science research programs, publishes multiple Python libraries for exploratory data analysis and complex network science. While we compose these tools together all the time, their joint setup had not been documented anywhere. This repository provides that documentation, as well as a set of tools for addressing some deployment edge cases.
This README describes various methods for installing, in particular, the Institute's unstructured data exploration and mapping tools, including HDBSCAN and UMAP. These tools are collectively referred to as the TIMC vector toolkit.
Shortcuts to the various installation and deployment procedures:
- Installing from PyPI (or a mirror)
- Using a Docker image
- Deploying on modern UNIX hosts in an air-gapped network
The main release channel for Tutte Institute libraries is the Python Package Index.
The simplest and best-supported approach to deploying these tools is thus to use pip install
(or the tools that supersede it, such as uv or Poetry).

```sh
pip install timc-vector-toolkit
```

This package includes the Institute libraries without upper-bounding their versions.
As such, newer versions of the package are mainly produced when new libraries are added
to the toolkit.
Please do not mistake the age of the timc-vector-toolkit package for abandonment.
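Once installed, a quick sanity check is to query the versions of the toolkit's core libraries; this sketch assumes only that the umap-learn and hdbscan distributions are among those installed:

```sh
python - <<'EOF'
# Print the installed versions of two of the toolkit's libraries.
from importlib.metadata import version
for dist in ("umap-learn", "hdbscan"):
    print(dist, version(dist))
EOF
```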
Requirements:
- Access to the Internet (or to a PyPI mirror that `pip` is duly configured to use)
- A C/C++ compilation toolchain
Using a Python virtual environment
on a modern UNIX (GNU/Linux, *BSD, macOS and so on) host with a Bourne-compatible interactive shell (Bash or Zsh).
In this case,
the user has already set up a C/C++ compilation toolchain using their operating system's package manager
(e.g. on Debian/Ubuntu, sudo apt-get install build-essential).
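On other systems, the equivalent setup might look like the following (package names are the commonly used ones, but consult your operating system's documentation):

```sh
sudo dnf install gcc gcc-c++ make   # Fedora and derivatives
xcode-select --install              # macOS command-line developer tools
```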
```sh
python -m venv timc-tools
. timc-tools/bin/activate
python -m pip install timc-vector-toolkit
```

Using a Conda environment. Remark that the following example includes a Conda package that brings in a generic C/C++ toolchain.

```sh
conda create -n timc-tools python=3.13 pip conda-forge::compilers
conda activate timc-tools
pip install timc-vector-toolkit
```

Using uv to start Jupyter Lab with a Python kernel that includes Tutte Institute tools.

```sh
uv run --with timc-vector-toolkit --with jupyterlab jupyter lab
```

uv's excellent package and environment caching avoids having to manage an environment explicitly alongside the code under development.
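For longer-lived work, one might prefer a uv-managed project over one-off uv run invocations; a minimal sketch (the project name my-analysis is illustrative):

```sh
uv init my-analysis
cd my-analysis
uv add timc-vector-toolkit   # records the dependency and syncs the environment
uv run python                # runs Python within the project's environment
```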
This repository includes a Dockerfile
to generate a Docker image published on Docker Hub.
The image, named tutteinstitute/vector-toolkit,
is based on the latest Ubuntu LTS release
and its native Python 3 distribution.
We set up an environment that includes package timc-vector-toolkit.
Tags on this image reflect the time at which they were produced.
The image is intended to be used as a base for further application packaging.
For example, one may augment this image to build an image that hosts Jupyter Lab
running as an unprivileged user:
```dockerfile
FROM tutteinstitute/vector-toolkit:latest
RUN pip install jupyterlab ipywidgets matplotlib ipympl seaborn
RUN adduser --disabled-password --comment "" user
WORKDIR /home/user
USER user
EXPOSE 8888
ENTRYPOINT ["/timc/bin/jupyter", "lab", "--port", "8888", "--ip", "0.0.0.0", "--notebook-dir", "/notebooks"]
```

When running this image in a container,
mind forwarding a port to 8888
and mounting a universally writable volume to /notebooks.
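For instance, assuming the image above was built and tagged my-jupyter (an illustrative name), the container could be started along these lines:

```sh
docker build --tag my-jupyter .
mkdir -p notebooks && chmod a+rwx notebooks   # universally writable volume
docker run --rm \
    --publish 8888:8888 \
    --volume "$(pwd)/notebooks:/notebooks" \
    my-jupyter
```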
Requirements:
- Ability to run Docker
- Either access to Docker Hub on the Internet, or configuration granting access to an image registry that mirrors the `tutteinstitute/vector-toolkit` tags
Run Marimo on a notebook directory mounted into the container.

```sh
docker run --rm \
    --volume $(pwd)/my_notebooks:/notebooks \
    --workdir /notebooks \
    --publish 2718:2718 \
    --user $(id -u) \
    tutteinstitute/vector-toolkit:latest \
    marimo edit --host 0.0.0.0
```

Customize the vector-toolkit image to run a Streamlit app that accesses a PostgreSQL database.
The Dockerfile:

```dockerfile
FROM tutteinstitute/vector-toolkit:latest
RUN pip install psycopg[binary] streamlit
ADD myapp.py /home/user/myapp.py
ENTRYPOINT ["streamlit", "run", "myapp.py", "--server.address", "0.0.0.0", "--server.port", "5000"]
```

Build the new image and run the container:

```sh
docker build --tag myapp .
docker run --rm --publish 5000:5000 myapp
```

This repository comprises tools to build a self-contained Bash script that deploys a fully-equipped Python distribution.
This distribution is designed to include timc-vector-toolkit and its dependencies,
but it can be customized to one's specific needs.
It can also include additional non-Python artifacts,
such as model files or web resources.
Requirements, both for building the installer and for deploying the distribution:
- Either a GLibC-based GNU/Linux system OR a macOS system
  - If you don't know whether your GNU/Linux system is based on GLibC, it likely is (a quick check is sketched after this list). The requirement enables using Conda. GNU/Linux distributions known not to work are those based on musl libc, including Alpine Linux.
  - A macOS host system will build a distribution that can deploy on a macOS target system; a GNU/Linux host system will build a distribution that can deploy on most GLibC-based GNU/Linux systems. Perform target tests ahead of committing much work into building the perfect installer.
- Common UNIX utilities (such as those included in GNU Coreutils)
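A quick, non-authoritative way to check which libc a GNU/Linux system is based on:

```sh
# GLibC-based systems typically report GNU libc with a version number;
# musl-based systems such as Alpine report musl instead.
ldd --version 2>&1 | head -n 1
```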
For the installer build step, these extra requirements should also be met:
- Cookiecutter
- GNU Make
- A C/C++ compilation toolchain
- Full Internet access

The installer build and deployment tools have only been tested on Ubuntu Linux, macOS and WSL2/Ubuntu systems running on Intel x86-64 hardware. Other GLibC-based Linux systems are supported on Intel x86-64 hardware; the alternative hardware platforms ARM64 (aarch64), IBM S390 and PowerPC 64 LE are likely to work, but are not supported. The same goes for the 32-bit hardware platforms x86 and ARMv7. Finally, it may be possible to make the tools work on non-WSL Windows, but this has not been tested by the author and is not supported. *BSD platforms are also excluded from support, as no Conda binaries are distributed for them.
This repository is organized to host multiple distribution installer projects. Each such project is composed in its own subdirectory. An example distribution project is provided for examination and experimentation.
One's own project is initiated by running

```sh
cookiecutter template
```

The first question sets the name of the installer to be produced through this project,
to which .sh will be appended.
For instance, using the default value my-installer would,
as output of building the installer,
yield a file named out/my-installer.sh.
The second question sets the minor Python version that will be distributed through the installer.
Choose a version that can run all the package dependencies you want deployed with your distribution.
Once all questions are answered, a project named after your answer to the first question is created. It contains the following files:
python_version
Specifies the minor Python version to base the distribution on. Change it here rather than having to edit it throughout other files.
bootstrap.yaml
Specifies a Conda environment that will be used to put together the Python installer. This rarely needs to be edited.
construct.yaml
This is a Conda Constructor specification
used to put together the Python installer out of a set of Conda packages.
Under the specs section,
add any further Conda package you would like installed as part of your target distribution.
Other packages will be sourced out of the Python Package Index.
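For illustration, and assuming the standard Conda Constructor syntax, a specs section might end up looking like this (the extra entries are examples, not requirements of the toolkit):

```yaml
specs:
  - python 3.13.*
  - pip
  - conda-forge::compilers
```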
requirements.txt
This is the main file for specifying the essential contents of the distribution you want deployed on your target systems. As per the pip requirements file format, you may specify version bounds for each package you include.
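For instance, a minimal requirements.txt keeping the toolkit alongside a version-bounded extra package might read (the bound on pandas is purely illustrative):

```
timc-vector-toolkit
pandas>=2.2,<3
```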
extras.mk and tasks
extras.mk is a small bit of Makefile
that specifies rules for gathering extras,
resources that should be bundled into the installers.
These can be model weights to download,
datasets,
web resources,
scripts — anything one can put in files.
At deployment time,
these extras are deployed on the target host by running the scripts under the tasks subdirectory in alphanumerical order.
Remark that the installer produced by this project is expected to be run with a user's own level of privileges,
which may or may not be root.
Determine how the installer will be used on the target network when setting up the list of install tasks,
so as to ensure success.
A header comment in extras.mk provides further Makefile variable definitions and details to guide and facilitate the implementation of extras-gathering tasks.
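As a sketch of such a deployment task, the following hypothetical tasks/10-install-model.sh copies a bundled model file into place on the target host; the layout of the gathered extras and the working directory at deployment time are assumptions to validate against the header comment of extras.mk:

```sh
#!/bin/sh
# Hypothetical task script: copy a model file bundled as an extra at build
# time to a conventional location on the target host.
set -eu
mkdir -p "${HOME}/.local/share/myapp"
cp extras/model.bin "${HOME}/.local/share/myapp/model.bin"
```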
To build the installer, we use GNU Make.
Given project my-installer,
the installer script is produced by running

```sh
make out/my-installer.sh
```

You may still edit the constituent files of the project afterwards. Running GNU Make again will rebuild the installer incrementally. Remark, once again, that the host system and the target system are expected to match; the closer they are, the better the odds that the installer will work. No cross-building of macOS installers can be done from a GNU/Linux host, or vice versa.
Bring the installer script over to each host of the air-gapped network where you mean to run it
(a shared network filesystem such as NFS works as well).
It can be run by any unprivileged user or as root.
Using the -h flag shows some terse documentation:
```
This self-contained script sets up a Python computing environment that includes
common data science and data engineering tools, as well as the libraries developed
by the Tutte Institute for unsupervised learning and data exploration.

Usage:

    $0 [-h|--help] [-n name] [-p path] [-q]

Options:

    -h, --help
        Prints out this help and exits.

    -n name
        If the system has Conda set up, these tools will be installed as the named
        Conda environment. Do not use -p if you use -n.

    -p path
        Sets the path to the directory where to set up the computing environment,
        or where such an environment has been previously set up. Do not use -n if you
        use -p.

    -q
        Skips any interactive confirmation, proceeding with the default answer.

Without any of the -n or -p options, the installer simply deploys the Tutte Institute
tools in the current Conda environment or Python virtual environment (venv).
```
There are three ways in which the installation can be run.
- The most complete approach deploys the full Python distribution, over which it lays out the packages from requirements.txt, in a named directory. For this, use option `-p PATH-WHERE-TO-INSTALL`.
- If Conda is also deployed on the host, the path can be chosen so that the distribution is deployed in a named Conda environment. For this, use option `-n DESIRED-ENVIRONMENT-NAME`.
  - Remark that if the host has Conda, it will still see the Python distribution deployed using `-p` as an environment, but not an environment with a name, unless the path is a child of a directory listed under `envs_dirs` (check `conda config --show envs_dirs`).
- If one would like to use a Python distribution that is already deployed on the system, one can create and activate a virtual environment, then run the installer without either of the options `-n` or `-p`. This will `pip`-install the wheels corresponding to the Tutte Institute tools and their dependencies in the currently active environment.
Finally, installation tasks are, by default, confirmed interactively in the shell.
To bypass this confirmation and just carry on in any case, use option -q.
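For instance, a fully unattended deployment into a fresh directory could combine these options (the installer name and path are illustrative):

```sh
./my-installer.sh -q -p /opt/timc-tools
```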
Depending on the installation type, one either gets a Python virtual environment or a Conda environment. In the former case, one uses it by activating the environment as usual. For the usual Bash/Zsh shell,

```sh
source path/to/environment/bin/activate
```

There are alternative activation scripts for PowerShell, Tcsh and Fish.
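For instance, assuming the standard venv layout, the Fish equivalent is:

```sh
source path/to/environment/bin/activate.fish
```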
If instead the installation was a complete one,
as performed using either the -n or -p flags of the installer,
then the distribution works as a Conda environment.
If Conda is also deployed
(and set up for the user's shell)
on this air-gapped host,
one can use

```sh
conda activate name-of-environment   # Installed with -n name-of-environment
conda activate path/to/environment   # Installed with -p path/to/environment
```

Short of using Conda,
activating such an environment merely involves tweaking the value of a set of environment variables.
For a basic Python distribution,
the only variable that strictly needs tweaking is PATH.
Thus, running in one's shell the Bash/Zsh equivalent to

```sh
export PATH="$(realpath path/to/environment/bin):$PATH"
```

should suffice to make the distribution's python, pip and other executables pre-eminent,
and thus the environment active.
For the sake of convenience,
the distribution comes with a utility script named startshell.
This tool starts a subshell
(using the user's $SHELL, falling back on /bin/bash)
where this PATH tweaking is done.
startshell takes and relays all its parameters to the shell it starts,
so it can also be used as a launcher for the tools in the environment.
For instance:
```sh
path/to/environment/bin/startshell -c 'echo $PATH'
```

yields

```
/root/to/working/directory/path/to/environment/bin:<rest of what was in PATH>
```

More typically, one simply runs

```sh
path/to/environment/bin/startshell
```

so as to have a shell duly set up for using the installed Python distribution and tools.