
Major corruption while moving LXC storage to MooseFS. Errors not cancelling 'move storage' tasks. #50

@JoaGamo


Hello, I'm facing major corruption; luckily I had backups for most things. The scenario is moving every mountpoint of all my LXCs to MooseFS. At the exact moment the problem happened I was safely restarting the MooseFS master (systemctl restart moosefs-master). I'm on MooseFS Community Edition.

I mounted MooseFS with the "Use Block Device" feature enabled.

The first problem I'm facing: the plugin isn't cancelling operations on failure. Here is an example where I did a move with "delete source" enabled, and this happened:

started block device: (/dev/mfs/mfsmaster.main_9421__images_178_vm-178-disk-0->/dev/nbd2 : MFS://images/178/vm-178-disk-0 : 120.000GiB)
Creating filesystem with 31457280 4k blocks and 7864320 inodes
Filesystem UUID: cebb78ff-31d1-4478-899c-213db950b394
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872
/dev/rbd1

Number of files: 123,836 (reg: 110,154, dir: 13,680, link: 2)
Number of created files: 123,834 (reg: 110,154, dir: 13,678, link: 2)
Number of deleted files: 0
Number of regular files transferred: 110,154
Total file size: 36,692,106,197 bytes
Total transferred file size: 36,692,106,080 bytes
Literal data: 36,692,106,080 bytes
Matched data: 0 bytes
File list size: 3,997,413
File list generation time: 0.052 seconds
File list transfer time: 0.000 seconds
Total bytes sent: 36,709,001,584
Total bytes received: 2,206,455

sent 36,709,001,584 bytes  received 2,206,455 bytes  17,296,211.09 bytes/sec
total size is 36,692,106,197  speedup is 1.00
can't unmap MooseFS device '/images/178/vm-178-disk-0': error receiving data from '/dev/mfs/nbdsock': Connection timed out
volume deactivation failed: mfsBlock:178/vm-178-disk-0
Removing image: 1% complete...
Removing image: 2% complete...
Removing image: 3% complete...
Removing image: 4% complete...
Removing image: 5% complete...
Removing image: 6% complete...
Removing image: 7% complete...
// The source image got completely deleted; I'm truncating the log.
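To illustrate the behaviour I would expect, here is a minimal sketch in Python. It is not the plugin's actual code, and every name in it (`move_volume`, the step callables, `delete_source`) is hypothetical. The idea is only that the source is deleted when every earlier step, including deactivating the target volume, has succeeded.

```python
from typing import Callable, List, Tuple

# A step is a (name, callable) pair; the callable raises on failure.
# All names here are hypothetical and only illustrate the expected control flow.
Step = Tuple[str, Callable[[], None]]


def move_volume(steps: List[Step], delete_source: Callable[[], None]) -> bool:
    """Run the steps in order and delete the source only if all of them succeed."""
    for name, step in steps:
        try:
            step()
        except Exception as err:
            # Any failure aborts the task and leaves the source image untouched.
            print(f"step '{name}' failed: {err}; keeping source image")
            return False
    delete_source()
    return True


def demo() -> Tuple[bool, List[str]]:
    """Simulate a move where deactivating the target times out."""
    deleted: List[str] = []

    def copy_data() -> None:
        pass  # stands in for the rsync copy, which succeeded in my log

    def deactivate_target() -> None:
        raise TimeoutError("Connection timed out")

    ok = move_volume(
        [("copy", copy_data), ("deactivate", deactivate_target)],
        lambda: deleted.append("source"),
    )
    return ok, deleted
```

In `demo()` the deactivation step raises, so `move_volume` returns `False` and `delete_source` is never called, which is the opposite of what happened in the log above.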

Why is this important?
Not only because it completely nuked the source image (!), but because when I moved a second mountpoint from the same LXC (ID 178) 15 minutes later, it overwrote the 'vm-178-disk-0' image on the target. To put it more clearly:

Both move operations failed with the same log shown above. It could be due to the MooseFS master restart in the middle of the move, or simply that the server lost connection.

Before the migration, my rootfs: was vm-178-disk-0 and my mp0: was vm-178-disk-1. First I started moving the rootfs from my source storage to MooseFS. After some time the "Connection timed out" error appeared, but the source image was deleted anyway. As I did not notice the error, I started moving mp0:, which overwrote the previous 'vm-178-disk-0' in /mnt/pve/Block/images/178/. I will never know whether the rootfs image was copied successfully before MooseFS overwrote it.
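I don't know what actually caused the overwrite, but one guard that would have prevented it is never reusing a volume name that already exists on the target storage. A minimal sketch in Python, assuming the list of existing names comes from the target storage itself and not from the container config; the function name `next_free_disk_name` is hypothetical and not taken from the plugin:

```python
import re
from typing import Iterable


def next_free_disk_name(vmid: int, existing: Iterable[str]) -> str:
    """Return the lowest vm-<vmid>-disk-<n> name that is not already in use.

    `existing` is assumed to be the list of volume names currently present
    on the target storage (hypothetical input for this sketch).
    """
    pattern = re.compile(rf"^vm-{vmid}-disk-(\d+)$")
    used = set()
    for name in existing:
        match = pattern.match(name)
        if match:
            used.add(int(match.group(1)))
    index = 0
    while index in used:
        index += 1
    return f"vm-{vmid}-disk-{index}"
```

With 'vm-178-disk-0' already on the target, `next_free_disk_name(178, ["vm-178-disk-0"])` returns 'vm-178-disk-1', so the second move would not have written over the first image.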
