{"id":13547979,"url":"https://github.com/NVIDIA/nvvl","last_synced_at":"2025-04-02T20:31:22.308Z","repository":{"id":66147225,"uuid":"127057781","full_name":"NVIDIA/nvvl","owner":"NVIDIA","description":"A library that uses hardware acceleration to load sequences of video frames to facilitate machine learning training","archived":false,"fork":false,"pushed_at":"2019-04-29T16:40:10.000Z","size":2575,"stargazers_count":683,"open_issues_count":33,"forks_count":86,"subscribers_count":28,"default_branch":"master","last_synced_at":"2025-03-29T07:09:10.747Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/NVIDIA.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2018-03-27T23:33:53.000Z","updated_at":"2025-03-09T07:04:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"ac1dbe4d-c3ae-4f99-b76a-ce9e5d5f887f","html_url":"https://github.com/NVIDIA/nvvl","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fnvvl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fnvvl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fnvvl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/NVIDIA%2Fnvvl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/NVIDIA","download_url":"https://codeload.github.com/NVIDIA/nvvl/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com",
"kind":"github","repositories_count":246887991,"owners_count":20850181,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T12:01:04.082Z","updated_at":"2025-04-02T20:31:17.300Z","avatar_url":"https://github.com/NVIDIA.png","language":"C++","readme":"# NVVL is part of DALI!\n[DALI (Nvidia Data Loading Library)](https://developer.nvidia.com/dali) incorporates NVVL functionality and offers much more than that, so it is recommended to switch to it.\nDALI source code is also open source and available on the [GitHub](https://github.com/NVIDIA/DALI).\nUp to date documentation can be found [here](https://docs.nvidia.com/deeplearning/sdk/dali-developer-guide/docs/index.html).\nNVVL project will still be available on the GitHub but it won't be maintained. All issues and request for the future please [submit in the DALI repository](https://github.com/NVIDIA/DALI/issues).\n\n# NVVL\nNVVL (**NV**IDIA **V**ideo **L**oader) is a library to load random\nsequences of video frames from compressed video files to facilitate\nmachine learning training. It uses FFmpeg's libraries to parse and\nread the compressed packets from video files and the video decoding\nhardware available on NVIDIA GPUs to off-load and accelerate the\ndecoding of those packets, providing a ready-for-training tensor in\nGPU device memory. NVVL can additionally perform data augmentation\nwhile loading the frames. Frames can be scaled, cropped, and flipped\nhorizontally using the GPUs dedicated texture mapping units. 
Output\ncan be in RGB or YCbCr color space, normalized to [0, 1] or [0, 255],\nand in `float`, `half`, or `uint8` tensors.\n\n**Note that, while we hope you find NVVL useful, it is example code\nfrom a research project performed by a small group of NVIDIA researchers.\nWe will do our best to answer questions and fix small bugs as they come\nup, but it is not a supported NVIDIA product and is for the most part\nprovided as-is.**\n\nUsing compressed video files instead of individual frame image files\nsignificantly reduces the demands on the storage and I/O systems\nduring training. Storing video datasets as video files consumes an\norder of magnitude less disk space, allowing larger datasets to\nfit both in system RAM and on local SSDs for fast access. During\nloading, fewer bytes must be read from disk. Fitting on smaller, faster\nstorage and reading fewer bytes at load time alleviates the bottleneck\nof retrieving data from disks, which will only get worse as GPUs get\nfaster. For the dataset used in our example project, H.264 compressed\n`.mp4` files were nearly 40x smaller than the same frames stored as `.png`\nfiles.\n\nUsing the hardware decoder on NVIDIA GPUs to decode images\nsignificantly reduces the demands on the host CPU. This means fewer\nCPU cores need to be dedicated to data loading during training. This\nis especially important in servers with a large number of GPUs per\nCPU, such as the NVIDIA DGX-2 server, but also provides\nbenefits for other platforms. When training our example project on an\nNVIDIA DGX-1, the CPU load when using NVVL was 50-60% of the load seen\nwhen using a normal dataloader for `.png` files.\n\nMeasurements that quantify the performance advantages of using NVVL\nare detailed in our [super resolution example\nproject](/examples/pytorch_superres).\n\nMost users will want to use the deep learning framework wrappers\nprovided rather than using the library directly. 
Currently a wrapper\nfor PyTorch is provided (PRs for other frameworks are welcome). See\nthe [PyTorch wrapper README](/pytorch/README.md) for documentation on\nusing the PyTorch wrapper. Note that it is not required to build or\ninstall the C++ library before building the PyTorch wrapper (its\nsetup scripts will do so for you).\n\n# Building and Installing\n\nNVVL depends on the following:\n- CUDA Toolkit. We have tested versions 8.0 and later, but earlier\n  versions may work. NVVL will perform better with CUDA 9.0 or\n  later\u003csup id=\"a1\"\u003e[1](#f1)\u003c/sup\u003e.\n- FFmpeg's libavformat, libavcodec, libavfilter, and libavutil. These\n  can be installed from source as in the [example\n  Dockerfiles](/docker) or from the Ubuntu 16.04 packages\n  `libavcodec-dev libavfilter-dev libavformat-dev\n  libavutil-dev`. Other distributions should have similar packages.\n\nAdditionally, building from source requires CMake version 3.8 or above,\nand some examples optionally use libraries from OpenCV if\nthey are installed.\n\nThe [docker](docker) directory contains Dockerfiles that can be used\nas a starting point for creating an image to build or use the NVVL\nlibrary. The [example's docker directory](examples/pytorch/docker) has\nan example Dockerfile that actually builds and installs the NVVL\nlibrary.\n\nCMake 3.8 and above provides builtin CUDA language support that NVVL's\nbuild system uses. Since CMake 3.8 is relatively new and not yet packaged in\nwidely used Linux distributions, you may need to install a newer\nversion of CMake.  
The easiest way to do so is to use the CMake\npackage on PyPI:\n\n```\npip install cmake\n```\n\nAlternatively, or if `pip` isn't available, you can install to\n`/usr/local` from a binary distribution:\n\n```sh\nwget https://cmake.org/files/v3.10/cmake-3.10.2-Linux-x86_64.sh\n/bin/sh cmake-3.10.2-Linux-x86_64.sh --prefix=/usr/local\n```\n\nSee https://cmake.org/download/ for more options.\n\nBuilding and installing NVVL follows the typical CMake pattern:\n\n```sh\nmkdir build \u0026\u0026 cd build\ncmake ..\nmake -j\nsudo make install\n```\n\nThis will install `libnvvl.so` and development headers into\nappropriate subdirectories under `/usr/local`. CMake can be passed the\nfollowing options using `cmake .. -DOPTION=Value`:\n\n- `CUDA_ARCH` - Names of the CUDA architectures to generate device code\n  for, separated by semicolons. Valid options are `Kepler`,\n  `Maxwell`, `Pascal`, and `Volta`. You can also use specific\n  architecture names such as `sm_61`. Default is\n  `Maxwell;Pascal;Volta`.\n\n- `CMAKE_CUDA_FLAGS` - A string of arguments to pass to `nvcc`. In\n  particular, you can decide to link against the shared or static\n  runtime library using `-cudart shared` or `-cudart static`. You can\n  also use this for finer control of code generation than `CUDA_ARCH`;\n  see the `nvcc` documentation. Default is `-cudart shared`.\n\n- `WITH_OPENCV` - Set this to 1 to build the examples with the\n  optional OpenCV functionality.\n\n- `CMAKE_INSTALL_PREFIX` - Install directory. 
Default is\n  `/usr/local`.\n\n- `CMAKE_BUILD_TYPE` - `Debug` or `Release` build.\n\nSee the [CMake documentation](https://cmake.org/cmake/help/v3.8/) for\nmore options.\n\nThe examples in `doc/examples` can be built using the `examples` target:\n```\nmake examples\n```\n\nFinally, if Doxygen is installed, API documentation can be built using\nthe `doc` target:\n```\nmake doc\n```\nThis will build HTML files in `doc/html`.\n\n# Preparing Data\n\nNVVL supports the H.264 and HEVC (H.265) video codecs in any container\nformat that FFmpeg is able to parse.  Video codecs only store certain\nframes, called keyframes or intra-frames, as a complete image in the\ndata stream. All other frames require data from other frames, either\nbefore or after them in time, to be decoded. In order to decode a\nsequence of frames, it is necessary to start decoding at the keyframe\nbefore the sequence, and continue past the sequence to the next\nkeyframe after it. This isn't a problem when streaming sequentially\nthrough a video; however, when decoding small sequences of frames\nrandomly throughout the video, a large gap between keyframes results in\nreading and decoding a large number of frames that are never used.\n\nThus, to get good performance when randomly reading short sequences\nfrom a video file, it is necessary to encode the file with frequent\nkeyframes. We've found that setting the keyframe interval to the length of\nthe sequences you will be reading provides a good compromise between\nfile size and loading performance. Also, NVVL's seeking logic doesn't\nsupport open GOPs in HEVC streams. To set the keyframe interval to `X`\nwhen using `ffmpeg`:\n\n- For `libx264` use `-g X`\n- For `libx265` use `-x265-params \"keyint=X:no-open-gop=1\"`\n\nThe pixel format of the video must also be yuv420p to be supported by\nthe hardware decoder. This is done by passing `-pix_fmt yuv420p` to\n`ffmpeg`. 
You should also remove any extra audio or video streams from\nthe video file by passing `-map v:0` to `ffmpeg` after the input but\nbefore the output.\n\nFor example, to transcode to H.264:\n```\nffmpeg -i original.mp4 -map v:0 -c:v libx264 -crf 18 -pix_fmt yuv420p -g 5 -profile:v high prepared.mp4\n```\n\n# Basic Usage\nThis section describes the usage of the base C/C++ library. For usage\nof the PyTorch wrapper, see the [README](/pytorch/README.md) in the\npytorch directory.\n\nThe library provides both a C++ and a C interface. See the examples in\n[doc/examples](doc/examples) for brief example code on how to use the\nlibrary. [extract_frames.cpp](doc/examples/extract_frames.cpp)\ndemonstrates the C++ interface and\n[extract_frames_c.c](doc/examples/extract_frames_c.c) the C\ninterface. The API documentation built with `make doc` is the\ncanonical reference for the API.\n\nThe basic flow is to create a `VideoLoader` object, tell it which\nframe sequences to read, and then give it buffers in device memory to\nput the decoded sequences into. In C++, creating a video loader is\nstraightforward:\n\n```C++\nauto loader = NVVL::VideoLoader{device_id};\n```\n\nYou can then tell it which sequences to read via `read_sequence`:\n\n```C++\nloader.read_sequence(filename, frame_num, sequence_length);\n```\n\nTo receive the frames from the decoder, it is necessary to create a\n`PictureSequence` to tell it how and where you want the decoded frames\nprovided. First, create a `PictureSequence`, providing a count of the\nnumber of frames to receive from the decoder. Note that the count here\ndoes not need to be the same as the `sequence_length` provided to\n`read_sequence`; you can read a large sequence of frames and receive\nthem as multiple tensors, or read multiple smaller sequences and\nreceive them concatenated as a single tensor.\n\n```C++\nauto seq = PictureSequence{sequence_count};\n```\n\nYou now create \"Layers\" in the sequence to provide the destination for\nthe frames. 
Each layer can be a different type, have different\nprocessing, and contain different frames from the received\nsequence. First, create a `PictureSequence::Layer` of the desired\ntype:\n\n```C++\nauto pixels = PictureSequence::Layer\u003cfloat\u003e{};\n```\n\nNext, fill in the pointer to the data and other details. See the\ndocumentation in [PictureSequence.h](include/PictureSequence.h) for a\ndescription of all the available options.\n\n```C++\nfloat* data = nullptr;\nsize_t pitch = 0;\ncudaMallocPitch(\u0026data, \u0026pitch,\n                crop_width * sizeof(float),\n                crop_height * sequence_count * 3);\npixels.data = data;\npixels.desc.count = sequence_count;\npixels.desc.channels = 3;\npixels.desc.width = crop_width;\npixels.desc.height = crop_height;\npixels.desc.scale_width = scale_width;\npixels.desc.scale_height = scale_height;\npixels.desc.horiz_flip = false;\npixels.desc.normalized = true;\npixels.desc.color_space = ColorSpace_RGB;\npixels.desc.stride.x = 1;\npixels.desc.stride.y = pitch / sizeof(float);\npixels.desc.stride.c = pixels.desc.stride.y * crop_height;\npixels.desc.stride.n = pixels.desc.stride.c * 3;\n```\n\nNote that here we have set the strides such that the dimensions are\n\"nchw\"; we could have used \"nhwc\" or any other dimension order by\nsetting the strides appropriately. Also note that the strides in the\nlayer description are in number of elements, not number of bytes.\n\nWe now add this layer to our `PictureSequence`, and send it to the loader:\n\n```C++\nseq.set_layer(\"pixels\", pixels);\nloader.receive_frames(seq);\n```\n\nThis call to `receive_frames` will be\nasynchronous. `receive_frames_sync` can be used if synchronous reading\nis desired. 
When we are ready to use the frames, we can insert a wait\nevent into the CUDA stream we are using for our computation:\n\n```C++\nseq.wait(stream);\n```\n\nThis will insert a wait event into the stream `stream`, causing any\nfurther kernels launched on `stream` to wait until the data is\nready.\n\nThe C interface follows a very similar pattern; see\n[doc/examples/extract_frames_c.c](doc/examples/extract_frames_c.c)\nfor an example.\n\n# Reference\nIf you find this library useful in your work, please cite it in your\npublications using the following BibTeX entry:\n\n```\n@misc{nvvl,\n  author = {Jared Casper and Jon Barker and Bryan Catanzaro},\n  title = {NVVL: NVIDIA Video Loader},\n  year = {2018},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/NVIDIA/nvvl}}\n}\n```\n\n# Footnotes\n\n\u003cb id=\"f1\"\u003e[1]\u003c/b\u003e Specifically, with NVIDIA kernel modules version\n384 and later, which come with CUDA 9.0+, CUDA kernels launched by\nNVVL will run asynchronously on a separate stream. With earlier kernel\nmodules, all CUDA kernels are launched on the default stream. [↩](#a1)\n","funding_links":[],"categories":["C++","AI"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2Fnvvl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FNVIDIA%2Fnvvl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FNVIDIA%2Fnvvl/lists"}