This project designs and implements a "General Multi-modal Video Classification Framework", covering the full training and prediction pipeline. The source code is based on TensorFlow 1.14.
The supported features are as follows:
- Video category classification (multi-class)
- Video tag classification (multi-label)
- Text feature aggregation: TextCNN and Bi-LSTM
- Video frame and audio frame aggregation: NeXtVLAD and TRN
- Adversarial perturbation, which improves the robustness and generalization of the model (see the sketch after this list)
- Multi-GPU training on a single machine
- Evaluation metrics: Precision, Recall, F1, GAP, and mAP (a GAP sketch also follows this list)
- Multi-task learning: joint category & tag classification
- Multi-modal video embedding generation, which can be used for similar-video recall and other downstream tasks
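
Adversarial perturbation of the kind listed above is commonly realized in the FGSM style: perturb an input embedding along the gradient of the clean loss and add a loss term on the perturbed copy. The sketch below is a minimal illustration in TensorFlow 1.x, not the repository's actual API; the names `build_logits`, `clean_loss`, and `epsilon` are hypothetical.

```python
import tensorflow as tf  # tensorflow==1.14

def adversarial_loss(embedding, clean_loss, build_logits, labels, epsilon=1.0):
    """FGSM-style adversarial term: perturb `embedding` along the loss
    gradient and re-score it with the classifier.

    All names are illustrative; `build_logits` is assumed to rebuild the
    classification head with shared (reused) variables.
    """
    grad = tf.stop_gradient(tf.gradients(clean_loss, embedding)[0])
    perturbation = epsilon * tf.nn.l2_normalize(grad, axis=-1)
    adv_logits = build_logits(embedding + perturbation)
    # Multi-label (tag) head, hence a sigmoid cross-entropy loss.
    return tf.losses.sigmoid_cross_entropy(labels, adv_logits)
```

The adversarial term is typically added to the clean loss with a small weight, so the model learns to be stable under worst-case local perturbations of its inputs.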
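Of the evaluation metrics, GAP (Global Average Precision, popularized by the YouTube-8M benchmark) is the least standard. Below is a minimal NumPy sketch, assuming score and binary-label matrices of shape (num_examples, num_classes); the function name and the `top_k=20` default are illustrative assumptions, not taken from this repository.

```python
import numpy as np

def global_average_precision(predictions, labels, top_k=20):
    """GAP: pool each example's top-k predictions, re-rank them globally,
    and compute average precision over the pooled list."""
    scores, hits = [], []
    for preds, gts in zip(predictions, labels):
        top = np.argsort(preds)[::-1][:top_k]   # top-k classes per example
        scores.extend(preds[top])
        hits.extend(gts[top])
    order = np.argsort(scores)[::-1]            # global re-ranking
    hits = np.asarray(hits, dtype=np.float64)[order]
    total_positives = max(float(np.sum(labels)), 1.0)
    precisions = np.cumsum(hits) / (np.arange(len(hits)) + 1.0)
    return float(np.sum(precisions * hits) / total_positives)
```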
The architecture of the "General Multi-modal Video Classification Framework" consists of two stages:
Stage I: multi-modal feature representation
Stage II: multi-modal feature fusion and classification (see the sketch below)
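
To make the two stages concrete, the sketch below shows, in TensorFlow 1.x (matching the project's platform), how Stage II might fuse the per-modality vectors produced in Stage I and branch into the two task heads used for multi-task learning. All function and tensor names are illustrative assumptions, not the repository's actual API.

```python
import tensorflow as tf  # tensorflow==1.14

def fuse_and_classify(text_vec, video_vec, audio_vec,
                      num_cates, num_tags, hidden_size=1024):
    """Stage II sketch: concatenate Stage-I modality vectors, project them
    into a shared multi-modal embedding, then branch into two heads."""
    fused = tf.concat([text_vec, video_vec, audio_vec], axis=-1)
    embedding = tf.layers.dense(fused, hidden_size, activation=tf.nn.relu)
    cate_logits = tf.layers.dense(embedding, num_cates)  # multi-class (softmax) head
    tag_logits = tf.layers.dense(embedding, num_tags)    # multi-label (sigmoid) head
    # `embedding` doubles as the multi-modal video embedding for recall tasks.
    return embedding, cate_logits, tag_logits
```

Sharing the fused embedding across both heads is what enables the joint category & tag training listed above, and the same embedding can be exported for similar-video recall.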

├── README.md            --> documentation
├── requirements.txt     --> environment dependencies
├── scripts
│   ├── infer.sh         --> prediction pipeline
│   └── train.sh         --> training pipeline
└── src
    ├── data.py          --> data processing
    ├── eval_metrics.py  --> evaluation metrics
    ├── models.py        --> implementation of each model
    ├── train.py         --> entry point for training and prediction
    ├── utils.py         --> multi-GPU utilities
    └── video_model.py   --> the overall model framework
Install dependencies:
pip install -r requirements.txt

Train:
cd ~
sh scripts/train.sh

Predict:
cd ~
sh scripts/infer.sh
Tip: For any other inquiries, please contact me at [email protected].