Efficient way to store large datasets

I’m collecting trajectories for imitation learning, each with around 1500 time steps and consisting of 4 image streams at 600x600 resolution. The dataset grows rapidly with the number of trajectories.

Are there any good libraries for storing such data efficiently in terms of disk space? I tried h5py with gzip compression at level 9, but the files are still too large. Are there any better alternatives?

Saving/loading speed isn’t a concern. Most resources I found focus on efficient data loading or memory handling, which doesn’t apply here. I’m already using uint8 for RGB streams.
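
For a sense of scale, here is a quick estimate of the raw size of one trajectory from the numbers above. It assumes every stream is 3-channel RGB at 1 byte per value; the variable names are only for illustration.

```python
# Raw (uncompressed) size of one trajectory, using the numbers from the post.
# Assumption: each of the 4 streams is 3-channel RGB stored as uint8.
steps = 1500
streams = 4
height, width, channels = 600, 600, 3

bytes_per_frame = height * width * channels            # 1 byte per uint8 value
bytes_per_trajectory = steps * streams * bytes_per_frame

print(f"{bytes_per_frame / 1e6:.2f} MB per frame")            # 1.08 MB per frame
print(f"{bytes_per_trajectory / 1e9:.2f} GB per trajectory")  # 6.48 GB per trajectory
```

So every trajectory is roughly 6.5 GB before any compression.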

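You might want to try a video encoding algorithm. These are made to reduce file size by removing redundant data both within a frame and between consecutive frames, which suits trajectories where little changes from one time step to the next. Popular options are H.264 and H.265, which both offer lossy and lossless modes.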

Lossy compression is a good first choice since it’s usually faster and uses much less space, though you lose some image detail.

If you can’t lose any data, lossless encoding is still better than storing raw data, but it won’t save as much space as lossy compression.
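
A minimal sketch of what this could look like, assuming the `ffmpeg` command-line tool is installed with libx264 support and that each frame is available as raw RGB bytes (for example `frame.tobytes()` on a uint8 array of shape height x width x 3). The helper names `build_ffmpeg_cmd` and `encode_stream` are made up for this example.

```python
import subprocess


def build_ffmpeg_cmd(out_path, width, height, fps=30, lossless=False, crf=18):
    """Build an ffmpeg command that reads raw RGB frames from stdin."""
    cmd = [
        "ffmpeg", "-y",
        "-f", "rawvideo",                    # input is headerless raw video
        "-pixel_format", "rgb24",            # 3 bytes per pixel, uint8 RGB
        "-video_size", f"{width}x{height}",
        "-framerate", str(fps),
        "-i", "-",                           # read the frames from stdin
    ]
    if lossless:
        # libx264rgb encodes RGB directly; CRF 0 selects lossless mode.
        cmd += ["-c:v", "libx264rgb", "-crf", "0"]
    else:
        # Lossy H.264; a lower CRF means higher quality and larger files.
        cmd += ["-c:v", "libx264", "-crf", str(crf), "-pix_fmt", "yuv420p"]
    cmd.append(str(out_path))
    return cmd


def encode_stream(frames, out_path, width, height, **kwargs):
    """Pipe an iterable of raw RGB frames (bytes-like) into ffmpeg."""
    cmd = build_ffmpeg_cmd(out_path, width, height, **kwargs)
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    for frame in frames:
        proc.stdin.write(frame)
    proc.stdin.close()
    if proc.wait() != 0:
        raise RuntimeError(f"ffmpeg failed for {out_path}")
```

Usage would be one file per camera stream, e.g. `encode_stream((f.tobytes() for f in stream), "cam0.mp4", 600, 600)`, and `lossless=True` if no data may be lost.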

uint8 is essentially uncompressed, and gzip isn’t ideal for image compression. A simple solution would be to encode each image in a lossy format (like JPEG) or a lossless one (like PNG), store the encoded bytes in the h5 file using BytesIO as the in-memory buffer, and decode them again when reading back.
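
Roughly like this, as a sketch that assumes numpy, Pillow and h5py are installed and that frames are uint8 arrays of shape height x width x 3. The function names and the dataset layout (one variable-length byte array per frame) are just one way to do it.

```python
import io

import h5py
import numpy as np
from PIL import Image


def encode_png(frame):
    """Encode one uint8 RGB frame as PNG and return the raw bytes."""
    buf = io.BytesIO()
    Image.fromarray(frame).save(buf, format="PNG", optimize=True)
    return buf.getvalue()


def decode_image(data):
    """Decode PNG (or JPEG) bytes back into a numpy array."""
    return np.array(Image.open(io.BytesIO(data)))


def save_stream(path, name, frames):
    """Store each encoded frame as a variable-length uint8 array."""
    dt = h5py.vlen_dtype(np.dtype("uint8"))
    with h5py.File(path, "a") as f:
        ds = f.create_dataset(name, shape=(len(frames),), dtype=dt)
        for i, frame in enumerate(frames):
            ds[i] = np.frombuffer(encode_png(frame), dtype=np.uint8)


def load_frame(path, name, index):
    """Read one frame back and decode it."""
    with h5py.File(path, "r") as f:
        return decode_image(f[name][index].tobytes())
```

Since the images are already compressed, there is no point in also enabling gzip on that dataset. Swapping `format="PNG"` for `format="JPEG"` gives the lossy variant.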

I agree with the suggestions made by others regarding compression for images and videos, but I’m also wondering why you prioritize storage over load speed. When compared to computation and human labor, storage is typically VERY inexpensive.

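If the images have a lot of empty areas (most pixels are zero or a single background value), you could consider using the COO (Coordinate) format to save space. This format is designed for sparse data: it stores only the non-zero values together with their coordinates, so it won’t help with areas that are repetitive but non-zero.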
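
A small numpy-only sketch of the idea, with made-up helper names `to_coo` and `from_coo`. It assumes uint16 is enough for the indices (true for 600x600 frames). With three uint16 indices plus one uint8 value, each non-zero entry costs 7 bytes instead of 1, so this only pays off when fewer than about 1 in 7 values are non-zero.

```python
import numpy as np


def to_coo(frame):
    """Return (shape, indices, values) for the non-zero entries of a frame."""
    coords = np.nonzero(frame)                  # one index array per axis
    values = frame[coords]                      # the non-zero values themselves
    idx = np.stack(coords).astype(np.uint16)    # assumes every axis < 65536
    return frame.shape, idx, values


def from_coo(shape, idx, values):
    """Rebuild the dense frame from its COO representation."""
    frame = np.zeros(shape, dtype=values.dtype)
    frame[tuple(idx)] = values
    return frame


# Example: a mostly black 600x600 RGB frame with one small bright patch.
frame = np.zeros((600, 600, 3), dtype=np.uint8)
frame[100:150, 200:250] = 255

shape, idx, values = to_coo(frame)
coo_bytes = idx.nbytes + values.nbytes
print(coo_bytes, "bytes as COO vs", frame.nbytes, "bytes dense")
```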

You can use JPEG 2000 with the glymur library, which offers a high compression ratio for images. It also allows you to choose between lossless or lossy compression, depending on your needs.

Look at how Diffusion Policy saved its video streams: as MP4 files.

Also, look at LeRobotDataset. They use AV1 and say they get good efficiency gains.