csvlink hangs after a few seconds with 0.0% CPU
- Python version: 3.7.3
- Environment: CentOS
CSV Files to Match
$ wc -l train-*
494 train-left.csv
481 train-right.csv
Config file
Attempting to match on 8 fields.
{
"field_names": [
"state",
"email",
"address_2",
"address_1",
"county",
"postal_code",
"city",
"name"
],
"field_definition": [
{
"field": "state",
"type": "String",
"Has Missing": true
},
{
"field": "email",
"type": "String",
"Has Missing": true
},
{
"field": "address_2",
"type": "String",
"Has Missing": true
},
{
"field": "address_1",
"type": "String",
"Has Missing": true
},
{
"field": "county",
"type": "String",
"Has Missing": true
},
{
"field": "postal_code",
"type": "String",
"Has Missing": true
},
{
"field": "city",
"type": "String",
"Has Missing": true
},
{
"field": "name",
"type": "String",
"Has Missing": true
}
],
"output_file": "deduped.csv",
"skip_training": false,
"training_file": false,
"sample_size": 150000,
"recall_weight": 2
}
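To double-check the field count, here is a minimal sketch that counts the entries in `field_names` and `field_definition` and reports any name that appears in only one of the two lists. It assumes the config is plain JSON with the keys shown above; `summarize_config` is a hypothetical helper of mine, not part of csvlink.

```python
import json
import sys


def summarize_config(config):
    """Return (names, defined, mismatched) for a config dict shaped like the one above.

    Hypothetical helper for illustration; not part of csvlink.
    """
    names = config.get("field_names", [])
    defined = [d["field"] for d in config.get("field_definition", [])]
    # Names present in one list but missing from the other
    mismatched = sorted(set(names) ^ set(defined))
    return names, defined, mismatched


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_config.py config.json
    with open(sys.argv[1]) as f:
        config = json.load(f)
    names, defined, mismatched = summarize_config(config)
    print(f"{len(names)} field_names, {len(defined)} field_definition entries")
    if mismatched:
        print("fields listed in only one place:", mismatched)
```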
Command
Running csvlink with the following:
csvlink train-left.csv train-right.csv --config_file=config.json --inner_join
After an initial burst of high CPU usage, the process goes idle (state `S`, 0.0% CPU) and stays there:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14191 somebody+ 20 0 558660 143092 10144 S 0.0 0.9 0:52.45 csvlink
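The `S` in the state column means the process is in an interruptible sleep, i.e. waiting on something rather than computing. To narrow down what it is waiting on, a small sketch like the following could read the process state and wait channel from `/proc`. It assumes a Linux `/proc` filesystem (as on CentOS); `parse_state` is a hypothetical helper name, and the PID is passed on the command line.

```python
import sys


def parse_state(status_text):
    """Extract the 'State:' value from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("State:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    pid = sys.argv[1]  # e.g. the PID shown by top
    with open(f"/proc/{pid}/status") as f:
        print("state:", parse_state(f.read()))
    # wchan names the kernel function the process is sleeping in, if exposed
    try:
        with open(f"/proc/{pid}/wchan") as f:
            print("wchan:", f.read().strip() or "(none)")
    except OSError:
        print("wchan: unavailable")
```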
Am I doing something wrong?