I’m working on pruning a neural network to improve its efficiency by setting some connections (weights) to zero. However, I’m having trouble figuring out how to efficiently store these pruned weight matrices.

I know PyTorch supports sparse matrices, which track the non-zero values and their indices. But I’m concerned that storing these indices might cancel out the space savings. For example, if half of the matrix is non-zero, would storing the indices offset the savings from dropping the zeros?

Am I misunderstanding how pruning should work, especially with around 50% non-zero values in a matrix? How do you typically implement pruning to save storage space effectively? Any advice or suggestions on efficient storage methods would be greatly appreciated.

I’m sorry to say that I’ve also found this issue confusing. My guess is that after sparsifying the model, the weights are permuted to group the zeros together, creating blocks of zeros that can be skipped during processing.

Even a matrix with 50% zeros can be stored more efficiently than a dense matrix. For example, using a Huffman-like compression could save space by assigning shorter codes to zeros and slightly longer codes to non-zero values. This approach saves storage space overall, but the matrix might need to be decompressed for operations like multiplication, which adds complexity. Standard sparse formats, like Compressed Sparse Row (CSR), and additional compression techniques, like Delta encoding, can also be used but come with their own trade-offs.
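To make the CSR trade-off concrete, here is a minimal PyTorch sketch (the matrix is made up for illustration) that converts a half-dense matrix with `to_sparse_csr()` and inspects what actually gets stored:

```python
import torch

# A 4x4 matrix where roughly half the entries have been pruned to zero.
w = torch.tensor([[0.5, 0.0, 0.0, 1.2],
                  [0.0, 0.0, 2.0, 0.0],
                  [0.3, 0.0, 0.0, 0.0],
                  [0.0, 0.7, 0.0, 0.9]])

csr = w.to_sparse_csr()

# CSR stores three arrays: row pointers, column indices, and values.
print(csr.crow_indices())  # tensor([0, 2, 3, 4, 6]) -> cumulative nonzeros per row
print(csr.col_indices())   # one column index per nonzero value
print(csr.values())        # the 6 nonzero values themselves
```

Note that every nonzero carries a column index on top of its value, which is exactly why the index overhead the original question worries about is real at ~50% density.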

As @m_guru mentioned, pruning 50% isn’t very substantial. Some models can be pruned up to 99% and still perform well, especially with magnitude-based unstructured pruning. The main aim of pruning is not just compression but possibly better generalization or faster training (as suggested by the lottery ticket hypothesis). Although pruning can reduce model size via sparse matrices, unstructured sparsity is hard to accelerate on current hardware.

For effective model compression, structured pruning is better: removing entire neurons, convolutional channels, or layers. Also note that PyTorch’s `torch.nn.utils.prune` utilities only apply masks (the original dense weights are kept alongside the mask), so to actually shrink the model you still have to remove the pruned structures manually.
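A minimal sketch of what "removing entire neurons" can look like for a Linear layer (the helper name and the keep-ratio are mine, not a library API): keep the output rows with the largest L1 norms and build a genuinely smaller layer.

```python
import torch
import torch.nn as nn

def prune_linear_rows(layer: nn.Linear, keep_ratio: float = 0.5) -> nn.Linear:
    """Keep the output rows with the largest L1 norms, drop the rest.

    Unlike mask-based pruning, this actually shrinks the weight tensor,
    so the saved model is smaller and dense matmul stays fast.
    """
    n_keep = max(1, int(layer.out_features * keep_ratio))
    norms = layer.weight.abs().sum(dim=1)                 # L1 norm of each output row
    keep = torch.topk(norms, n_keep).indices.sort().values
    new = nn.Linear(layer.in_features, n_keep, bias=layer.bias is not None)
    with torch.no_grad():
        new.weight.copy_(layer.weight[keep])
        if layer.bias is not None:
            new.bias.copy_(layer.bias[keep])
    return new

layer = nn.Linear(8, 6)
small = prune_linear_rows(layer, keep_ratio=0.5)
print(small.weight.shape)  # torch.Size([3, 8])
```

Keep in mind that the next layer downstream must drop the matching input columns, which is the bookkeeping that libraries for structured pruning automate.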

Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, and Alexandra Peste. 2021. Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. JMLR 22, 1 (2021), 10882–11005.

You first need to decide whether you want structured or unstructured pruning. For structured pruning, you can use libraries such as NNI to compress the models. For unstructured pruning, I generally store the weights in sparse CSR format (more efficient to access than COO).
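To see why CSR is cheaper than COO on the index side, here is a quick byte count (assuming float32 values and PyTorch's default int64 indices; the 50% threshold is illustrative):

```python
import torch

w = torch.rand(1000, 1000)
w[w < 0.5] = 0.0                    # prune roughly half the entries

coo = w.to_sparse_coo().coalesce()
csr = w.to_sparse_csr()

def nbytes(t):
    return t.numel() * t.element_size()

dense_bytes = nbytes(w)
# COO: two int64 indices (row, col) per nonzero, plus the values.
coo_bytes = nbytes(coo.indices()) + nbytes(coo.values())
# CSR: one int64 column index per nonzero, plus rows+1 row pointers.
csr_bytes = nbytes(csr.crow_indices()) + nbytes(csr.col_indices()) + nbytes(csr.values())

print(dense_bytes, coo_bytes, csr_bytes)
# At ~50% density both sparse formats come out LARGER than the dense tensor,
# with COO roughly twice as bad as CSR on the index overhead.
```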

Pruning by itself won’t lead to any savings: as others have mentioned, it’s just a mask over your tensor. To actually save memory, you must convert the tensor to a sparse format.
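You can see this directly with `torch.nn.utils.prune`: after pruning, the module holds the full dense weight plus a dense mask (so it is momentarily slightly bigger, not smaller), and switching to a sparse layout is a separate, explicit step:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(100, 100)
prune.l1_unstructured(layer, name="weight", amount=0.5)  # zero out 50% by magnitude

# Pruning reparametrizes the module: it now stores weight_orig AND weight_mask,
# both dense, so nothing has been saved yet.
print(sorted(name for name, _ in layer.named_buffers()))     # ['weight_mask']
print(sorted(name for name, _ in layer.named_parameters()))  # ['bias', 'weight_orig']

prune.remove(layer, "weight")                   # bake the mask into the weight
sparse_weight = layer.weight.detach().to_sparse_csr()  # explicit conversion for storage
```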

That leads to the next challenge: sparse tensor formats are quite size-inefficient. For most formats you need sparsity well above 50% before you see any savings, and it gets worse as you use fewer bits per weight.

Don’t forget that sparse matmul is slow AF.

One avenue worth trying is NVIDIA’s 2:4 structured sparsity (supported on Ampere and newer GPUs), which is hardware accelerated and does reduce memory utilization.