8 changes: 8 additions & 0 deletions README.md
@@ -11,6 +11,14 @@ This repo was created for Sourcegraph Implementation Engineering deployments, an
- Other version control systems are left up to the customer to convert to Git
- This project builds a framework to convert repos from other VCSes to Git

## Versioning

- This project uses semantic versioning, where x.y.z is major.minor.patch
- The minor version is incremented when a breaking change has been made, usually to one of:
- Configuration files
- Environment variables
- Storage locations on the host

## Deployment

Sourcegraph Cloud customers need to run the repo-converter, src serve-git, and the Sourcegraph Cloud Private Access Agent on a container platform with connectivity to both their Sourcegraph Cloud instance and their code hosts. This can be done quite securely, as the src serve-git API endpoint does not need any ports exposed outside of the container network. Running src serve-git and the agent together on the same container network allows the agent to use the container platform's local DNS service to reach src serve-git, and keeps src serve-git's unauthenticated HTTP endpoint from being exposed outside of the container network.
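
A quick way to sanity-check that isolation is to confirm that src serve-git answers from inside the Compose network but not from the host. This is a hedged sketch: the `src-serve-git` service name and port 3434 are assumptions, and curl may need to be present in the repo-converter image.

```
# From inside the Compose network: expect an HTTP status code, not a connection error
docker compose exec repo-converter \
  curl -s -o /dev/null -w '%{http_code}\n' http://src-serve-git:3434/

# From the host: expect this to fail, since no port should be published
curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 http://localhost:3434/ \
  || echo "src serve-git is not exposed on the host (expected)"
```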
@@ -6,7 +6,7 @@
services:

repo-converter:
# image: ghcr.io/sourcegraph/repo-converter:v0.5.1
image: ghcr.io/sourcegraph/repo-converter:${DOCKER_TAG}
environment:
- LOG_LEVEL=DEBUG
- REPO_CONVERTER_INTERVAL_SECONDS=600
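
With the image tag now read from the `DOCKER_TAG` environment variable, Compose needs that variable set whenever it is invoked directly; pull-start.sh below normally exports it. A hedged example of a manual run, using the previously pinned tag:

```
# Illustrative manual invocation; pull-start.sh below normally sets these for you
export DOCKER_TAG=v0.5.1                      # or "stable" / "latest"
export CURRENT_UID_GID="$(id -u):$(id -g)"    # assumption: mirrors how pull-start.sh derives it
docker compose pull
docker compose up -d --remove-orphans
```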
81 changes: 68 additions & 13 deletions deploy/docker-compose/customer1/pull-start.sh
@@ -10,13 +10,50 @@

## Get script args

# If an f is passed into the script args, then try to fix the ownership and permissions of files in the src-serve-git directory
if [[ "$1" == *"f"* ]]
then
fix_perms="true"
else
fix_perms="false"
fi
# If -f or --fix-perms is passed in the script args, then try to fix the ownership and permissions of files in the src-serve-git directory
fix_perms="false"

# If a -dt or --docker-tag is passed in, then use it in the Docker Compose up command for the repo-converter
# DOCKER_TAG="latest"
DOCKER_TAG="stable"

# Create the arg to allow disabling git reset and pull
NO_GIT=""

POSITIONAL_ARGS=()

while [[ $# -gt 0 ]]; do
    case $1 in
        -f|--fix-perms)
            fix_perms="true"
            shift # past argument
            ;;
        -l|--latest)
            DOCKER_TAG="latest"
            shift # past argument
            ;;
        -s|--stable)
            DOCKER_TAG="stable"
            shift # past argument
            ;;
        -n|--no-git)
            NO_GIT="true"
            shift # past argument
            ;;
        -dt|--docker-tag)
            DOCKER_TAG="$2"
            shift # past argument
            shift # past value
            ;;
        -*)
            echo "Unknown option $1"
            exit 1
            ;;
        *)
            # Collect positional args so they can be restored after the loop;
            # without this default case, an unrecognized positional arg would loop forever
            POSITIONAL_ARGS+=("$1")
            shift # past argument
            ;;
    esac
done

set -- "${POSITIONAL_ARGS[@]}" # restore positional parameters
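
The flags above can be passed in any order, and any positional args are restored afterwards. A few hypothetical invocations, assuming the script is run from its own directory:

```
./pull-start.sh                         # defaults: stable tag, git reset + pull
./pull-start.sh -f --latest             # fix permissions, use the "latest" image tag
./pull-start.sh --no-git -dt v0.5.1     # skip git reset/pull, pin a specific image tag
```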


## Setup
# Define file paths
@@ -111,16 +148,34 @@ fi
log "On branch before git pull:"
$git_branch_cmd

## Formulate Git and Docker commands
git_commands="\
$git_cmd reset --hard &&\
$git_cmd pull --force &&\
"

if [[ -n "$NO_GIT" ]]
then
git_commands=""
fi


export DOCKER_TAG=$DOCKER_TAG
docker_commands="\
$docker_cmd pull &&\
DOCKER_TAG=$DOCKER_TAG CURRENT_UID_GID=$CURRENT_UID_GID $docker_cmd up -d --remove-orphans
"

command="\
$git_commands \
$docker_commands \
"


log "Docker compose file: $docker_compose_full_file_path"
log "docker ps before:"
$docker_cmd ps

command="\
$git_cmd reset --hard &&\
$git_cmd pull --force &&\
$docker_cmd pull &&\
CURRENT_UID_GID=$CURRENT_UID_GID $docker_cmd up -d --remove-orphans \
"

log "Running command in a sub shell:"
# awk command to print the command nicely with newlines
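
The hunk ends before the assembled string is actually executed. A minimal sketch of that final step, assuming the command is handed to a subshell as the log message above suggests (the real script pretty-prints it with awk first):

```
# Illustrative only: run the assembled git + docker steps in one subshell.
# Because the pieces are joined with &&, the docker steps only run if the git
# steps succeed (unless --no-git emptied $git_commands).
bash -c "$command"
```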
53 changes: 35 additions & 18 deletions dev/TODO.md
@@ -181,31 +181,46 @@
- Add a timeout in run_subprocess() for hanging svn info ~~and svn log~~ commands, if data isn't transferring
- Does the svn cli not have a timeout built in for this command?

- PID layers, from `docker exec -it repo-converter top`
- This output was captured 14 hours into converting a repo that's up to 2 GB on disk so far, with 6 years of history left to catch up on
- This is after removing our batch processing bubble-wrap, and just lettin'er buck
- Process tree
- Copied from the output of `docker exec -it repo-converter top`
```
PID PPID nTH S CODE USED SWAP RES %MEM nMaj nMin nDRT OOMa OOMs %CPU TIME+ COMMAND
1 0 2 S 2.7m 37.4m 1.5m 35.9m 0.5 991 8.7m 0 0 668 0.0 2:44.22 /usr/bin/python3 /sg/repo-converter/src/main.py
85 1 1 S 2.7m 40.8m 11.6m 29.2m 0.4 0 20k 0 0 669 0.0 0:05.82 `- /usr/bin/python3 /sg/repo-converter/src/main.py
330 85 1 S 2.7m 1.4m 0.2m 1.2m 0.0 0 364 0 0 666 0.0 0:00.00 `- git -C /sg/src-serve-root/org/repo svn fetch --quiet --username user --log-window-size 100
331 330 1 S 1.6m 115.6m 17.8m 97.8m 1.2 56 534m 0 0 674 13.6 66:17.92 `- /usr/bin/perl /usr/lib/git-core/git-svn fetch --quiet --username user --log-window-size 100
376 331 1 S 2.7m 1.1g 0.1m 1.1g 14.6 18k 1.4m 0 0 744 0.3 1:18.22 `- git cat-file --batch
34015 331 1 S 2.7m 10.0m 0.0m 10.0m 0.1 17 889k 0 0 667 2.0 4:36.38 `- git hash-object -w --stdin-paths --no-filters
1850259 331 1 S 2.7m 5.1m 0.0m 5.1m 0.1 0 499 0 0 666 0.0 0:00.00 `- git update-index -z --index-info
top - 03:39:24 up 6 days, 4:37, 0 users, load average: 0.89, 1.60, 1.83
Tasks: 14 total, 1 running, 13 sleeping, 0 stopped, 0 zombie
%Cpu(s): 8.7 us, 10.7 sy, 0.0 ni, 80.2 id, 0.2 wa, 0.0 hi, 0.3 si, 0.0 st
GiB Mem : 7.8 total, 2.0 free, 1.2 used, 4.5 buff/cache
GiB Swap: 2.0 total, 1.9 free, 0.1 used. 6.3 avail Mem

SID PGRP PID PPID VIRT RES SHR %MEM OOMs %CPU TIME+ COMMAND
1 1 1 0 0.1g 0.0g 0.0g 0.4 668 0.0 0:02.41 /usr/bin/python3 /sg/repo-converter/src/main.py
1 1 7 1 0.5g 0.0g 0.0g 0.2 668 0.0 0:00.45 `- /usr/bin/python3 /sg/repo-converter/src/main.py
81 81 81 1 0.1g 0.0g 0.0g 0.3 668 0.0 0:00.40 `- /usr/bin/python3 /sg/repo-converter/src/main.py
81 81 527 81 0.0g 0.0g 0.0g 0.0 666 0.0 0:00.00 `- git -C /sg/src-serve-root/repo1 svn fetch --quiet --username user --log-window-size 100
81 81 529 527 0.0g 0.0g 0.0g 0.5 668 0.3 2:52.51 `- /usr/bin/perl /usr/lib/git-core/git-svn fetch --quiet --username user --log-window-size 100
81 81 880 529 2.1g 0.2g 0.1g 2.6 680 0.0 0:02.23 `- git cat-file --batch
81 81 6267 529 0.0g 0.0g 0.0g 0.1 667 0.0 0:09.38 `- git hash-object -w --stdin-paths --no-filters
81 81 305238 529 0.0g 0.0g 0.0g 0.1 666 0.0 0:00.00 `- git update-index -z --index-info
144 144 144 1 0.1g 0.0g 0.0g 0.4 668 0.0 0:00.92 `- /usr/bin/python3 /sg/repo-converter/src/main.py
144 144 478 144 0.0g 0.0g 0.0g 0.0 666 0.0 0:00.00 `- git -C /sg/src-serve-root/repo2 -c http.sslVerify=false svn fetch --quiet --username user --log-window-size 100
144 144 479 478 0.2g 0.1g 0.0g 1.8 676 12.3 7:21.52 `- /usr/bin/perl /usr/lib/git-core/git-svn fetch --quiet --username user --log-window-size 100
144 144 709 479 0.0g 0.0g 0.0g 0.4 668 0.0 0:02.79 `- git cat-file --batch
144 144 1015 479 0.0g 0.0g 0.0g 0.1 666 0.0 0:08.89 `- git hash-object -w --stdin-paths --no-filters
```
- PID 1
- Docker container entrypoint
- PID 85
- Spawned by `multiprocessing.Process().start()` in `convert_repos.start()`
- PID 330
- Spawned by `psutil.Popen()` in `cmd.run_subprocess()`
- PID 7
- Probably the `status_monitor.start` function
- PIDs 81 and 144
- Notice that the SID (Session ID) and PGRP (Process Group) columns match the PIDs 81 and 144, i.e. each of these processes is its own session and group leader, as a result of the `os.setsid()` call in `fork_conversion_processes.py`. This makes it much easier to find PGRP values in the container's logs and track which process groups are getting cleaned up as they finish (see the `ps` sketch after this list)
- Spawned by `multiprocessing.Process().start()` in `fork_conversion_processes.start()`
- PIDs 527 and 478
- `git svn fetch` command, called from `_git_svn_fetch()` in `svn.convert()`
- PID 331
- Spawned by `psutil.Popen()` in `cmd.run_subprocess()`
- PIDs 529 and 479
- `git-svn` perl script, which runs the `git svn fetch` workload in [sub fetch, in SVN.pm](https://github.com/git/git/blob/v2.50.1/perl/Git/SVN.pm#L2052)
- This script is quite naive, no retries, always exits 0, even on failures
- PID 376
- This perl script is quite naive, no retries, always exits 0, even on failures
- PIDs 880 and 709
- Long-running `git cat-file` process, which stores converted content in memory
- This process usually has a higher-than-average OOM score (the OOMs column above)
- It seems quite likely that this process doesn't free memory after each commit, so the memory requirement for this process alone would be some large fraction of the repo's size
- At minimum, this process needs enough memory to hold the contents of the largest commit in the repo's history; otherwise the conversion would never progress beyond that commit
- This process's CPU state is usually Sleeping, because it spends almost all of its time waiting on content from the Subversion server
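
A hedged way to inspect the same session/group structure without top, and to signal a whole conversion's process group if one needs to be cleaned up by hand. The PGID value is illustrative, and ps/kill availability in the image is assumed:

```
# Show session ID, process group ID, and the process tree inside the container
docker exec -it repo-converter ps -eo sid,pgid,pid,ppid,stat,etime,args --forest

# Signal an entire conversion's process group; 81 is the illustrative PGID from the output above,
# and the leading "--" plus the minus sign mean "send the signal to the whole group"
docker exec -it repo-converter sh -c 'kill -- -81'
```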
@@ -386,6 +401,8 @@

## Old Doc

- Need to clean this up, and put it somewhere

```yaml
xmlbeans:
# Usage: This key is used as the converted Git repo's name
23 changes: 8 additions & 15 deletions src/config/load_repos.py
@@ -4,10 +4,9 @@
# Import repo-converter modules
from utils import secret
from utils.context import Context
from utils.log import log
from utils.logging import log

# Import Python standard modules
from sys import exit
from urllib.parse import urlparse

# Import third party modules
@@ -29,20 +28,14 @@ def load_from_file(ctx: Context) -> None:
# This should return a dict
repos = yaml.safe_load(repos_to_convert_file)

except IsADirectoryError:
except IsADirectoryError as e:
log(ctx, f"File not found at {repos_to_convert_file_path}, but found a directory, likely created by the Docker mount. Please stop the container, delete the directory, and create the yaml file.", "critical", exception=e)

log(ctx, f"File not found at {repos_to_convert_file_path}, but found a directory, likely created by the Docker mount. Please stop the container, delete the directory, and create the yaml file.", "critical")
exit(1)
except FileNotFoundError as e:
log(ctx, f"File not found at {repos_to_convert_file_path}", "critical", exception=e)

except FileNotFoundError:

log(ctx, f"File not found at {repos_to_convert_file_path}", "critical")
exit(2)

except (AttributeError, yaml.scanner.ScannerError) as exception: # type: ignore

log(ctx, f"YAML syntax error in {repos_to_convert_file_path}, please lint it. Exception: {type(exception)}, {exception.args}, {exception}", "critical")
exit(3)
except (AttributeError, yaml.scanner.ScannerError) as e: # type: ignore
log(ctx, f"YAML syntax error in {repos_to_convert_file_path}, please lint it", "critical", exception=e)

repos = check_types(ctx, repos)
repos = reformat_repos_dict(ctx, repos)
Expand Down Expand Up @@ -439,7 +432,7 @@ def validate_inputs(ctx: Context, repos_input: dict) -> dict:
break

except Exception as e:
log(ctx, f"urlparse failed to parse URL {url}: {e}", "warning")
log(ctx, f"urlparse failed to parse URL {url}", "warning", exception=e)

# Fallback to code-host-name if provided
if not server_name:
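
The critical log above asks the operator to lint the YAML file. One quick hedged check from the host, which only needs PyYAML; the file path is a placeholder, not the project's actual default:

```
# Substitute the real path of the repos-to-convert file
python3 -c 'import sys, yaml; yaml.safe_load(open(sys.argv[1])); print("YAML OK")' \
  ./config/repos-to-convert.yaml
```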
7 changes: 3 additions & 4 deletions src/config/validate_env.py
@@ -5,8 +5,7 @@

# Import repo-converter modules
from utils.context import Context
from utils.log import log

from utils.logging import log

def validate_env_vars(ctx: Context) -> None:
"""Validate inputs here, now that the logger is instantiated, instead of throughout the code"""
@@ -15,10 +14,10 @@ def validate_env_vars(ctx: Context) -> None:

# Validate concurrency limits
if ctx.env_vars["MAX_CONCURRENT_CONVERSIONS_PER_SERVER"] <= 0:
raise ValueError("MAX_CONCURRENT_CONVERSIONS_PER_SERVER must be greater than 0")
log(ctx, "MAX_CONCURRENT_CONVERSIONS_PER_SERVER must be greater than 0", "critical")

if ctx.env_vars["MAX_CONCURRENT_CONVERSIONS_GLOBAL"] <= 0:
raise ValueError("MAX_CONCURRENT_CONVERSIONS_GLOBAL must be greater than 0")
log(ctx, "MAX_CONCURRENT_CONVERSIONS_GLOBAL must be greater than 0", "critical")

if ctx.env_vars["MAX_CONCURRENT_CONVERSIONS_PER_SERVER"] > ctx.env_vars["MAX_CONCURRENT_CONVERSIONS_GLOBAL"]:

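
Since invalid limits now produce a critical log at startup rather than a raised exception, it may be worth sanity-checking the values before deployment. A hedged shell equivalent of the same rules; the fallback values here are placeholders, not the project's defaults:

```
per_server="${MAX_CONCURRENT_CONVERSIONS_PER_SERVER:-1}"
global="${MAX_CONCURRENT_CONVERSIONS_GLOBAL:-1}"

# Both limits must be positive, and the per-server limit must not exceed the global limit
if (( per_server <= 0 )) || (( global <= 0 )) || (( per_server > global )); then
    echo "Invalid concurrency limits: per-server=$per_server, global=$global" >&2
    exit 1
fi
```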
4 changes: 2 additions & 2 deletions src/main.py
@@ -5,7 +5,7 @@
from config import load_env, load_repos, validate_env
from utils import concurrency_manager, fork_conversion_processes, git, logger, signal_handler, status_monitor
from utils.context import Context
from utils.log import log
from utils.logging import log

# Import Python standard modules
# import sysconfig
@@ -24,7 +24,7 @@ def main():
)

# Configure logging
logger.configure_logger(ctx.env_vars["LOG_LEVEL"])
logger.configure(ctx.env_vars["LOG_LEVEL"])

# Validate env vars, now that we have logging available
validate_env.validate_env_vars(ctx)