Fast Tucker Factorization for Large-scale Tensor Completion

Dongha Lee, Jaehyung Lee, Hwanjo Yu*

Department of Computer Science and Engineering
Pohang University of Science and Technology (POSTECH), Pohang, South Korea
{dongha.lee, anthi7, hwanjoyu}@postech.ac.kr

Abstract


Tensor completion is the task of completing multi-aspect data represented as a tensor by accurately predicting its missing entries. It is mainly solved by tensor factorization methods, and among them, Tucker factorization has attracted considerable interest due to its powerful ability to learn latent factors and even their interactions. Although several Tucker methods have been developed to reduce the memory and computational complexity, the state-of-the-art method still 1) generates redundant computations and 2) cannot factorize a large tensor that exceeds the memory capacity. This paper proposes FTcom, a fast and scalable Tucker factorization method for tensor completion. FTcom performs element-wise updates of the factor matrices based on coordinate descent, and adopts a novel caching algorithm that stores frequently-required intermediate data. It also uses a tensor file for disk-based data processing and loads only a small part of the tensor into memory at a time. Experimental results show that FTcom is much faster and more scalable than all competitors. It significantly shortens the training time of Tucker factorization, especially on real-world tensors, and it can run on a billion-scale tensor that is larger than the memory capacity of a single machine.

Overview


Figure 1. Intermediate data caching algorithm: FTcom stores already-computed delta vectors in the cache table T and reuses them whenever they are required.
Figure 2. Disk-based data processing: FTcom updates each factor matrix using GTF while holding only a small part of the entire tensor in memory.
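As a concrete illustration of the disk-based processing sketched in Figure 2, the snippet below streams observed entries from disk in fixed-size blocks so that only one block resides in memory at a time. The plain COO text layout (one line per entry: indices followed by the value), the function name, and the block size are assumptions made for this example only; FTcom's actual tensor-file structure (GTF) and scheduling algorithm are defined in the paper.

```python
import numpy as np

def stream_tensor_blocks(path, block_entries=1_000_000):
    """Yield (indices, values) blocks of observed entries from a tensor file,
    keeping at most `block_entries` entries in memory at once."""
    idx, val = [], []
    with open(path) as f:
        for line in f:
            *ids, v = line.split()          # i_1 ... i_N value
            idx.append([int(x) for x in ids])
            val.append(float(v))
            if len(val) == block_entries:   # block is full: hand it over
                yield np.asarray(idx), np.asarray(val)
                idx, val = [], []
    if val:                                 # flush the final partial block
        yield np.asarray(idx), np.asarray(val)
```

A factor-matrix update can then accumulate its statistics block by block, e.g. `for block_idx, block_val in stream_tensor_blocks("tensor.coo"): ...`, discarding each block before the next one is loaded.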

In this paper, we propose FTcom, a Fast and scalable Tucker factorization method for tensor completion. First, we derive a coordinate descent-based update rule that is computationally efficient and can easily enforce a non-negativity constraint. Then, we propose an effective caching scheme that stores the intermediate vectors frequently required during training, exploiting a characteristic of real-world sparse tensors. Furthermore, for better scalability, we design a file structure for disk-based tensor processing and present a scheduling algorithm based on this file structure.
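To make the element-wise update and the caching scheme concrete, below is a minimal dense-loop sketch under the standard Tucker model (a core tensor and one factor matrix per mode) with squared loss and L2 regularization. The function names, signatures, cache-key layout, and regularization constant are assumptions for illustration; this is not the paper's implementation.

```python
import numpy as np
from itertools import product

def delta_vector(entry_idx, mode, core, factors, cache):
    """Delta vector of one observed entry w.r.t. `mode`: it depends only on the
    entry's indices in the other modes, so those indices form the cache key."""
    key = (mode,) + tuple(i for m, i in enumerate(entry_idx) if m != mode)
    if key in cache:                                      # cache hit: reuse it
        return cache[key]
    delta = np.zeros(core.shape[mode])
    for j in product(*(range(r) for r in core.shape)):    # all core positions
        w = core[j]
        for m, i in enumerate(entry_idx):
            if m != mode:
                w = w * factors[m][i, j[m]]
        delta[j[mode]] += w
    cache[key] = delta
    return delta

def cd_update_row(i, mode, entries, values, core, factors, cache,
                  lam=0.01, nonneg=False):
    """One coordinate-descent sweep over row i of factor matrix `mode`, given
    the observed entries (and their values) whose mode-`mode` index equals i."""
    row = factors[mode][i].copy()
    deltas = np.array([delta_vector(idx, mode, core, factors, cache)
                       for idx in entries])       # |entries| x rank
    resid = np.asarray(values, dtype=float) - deltas @ row
    for j in range(row.size):                     # element-wise updates
        dj = deltas[:, j]
        resid += row[j] * dj                      # remove j's old contribution
        row[j] = (dj @ resid) / (lam + dj @ dj)   # 1-D least-squares solution
        if nonneg:
            row[j] = max(row[j], 0.0)             # non-negativity constraint
        resid -= row[j] * dj                      # add back the new contribution
    factors[mode][i] = row
```

Because many entries of a real-world sparse tensor share their indices in the non-updated modes, most calls to delta_vector hit the cache and skip the expensive sum over the core tensor (the cache is assumed to be cleared whenever the core or the other factor matrices change, e.g., once per mode per iteration). The non-negativity constraint reduces to clamping each coordinate at zero after its closed-form update.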

Experiments


Datasets

Dataset      Order   Dimensionality           Non-zeros        Download
CamVid       3       360 × 480 × 30           467,374          [Download] [Raw data]
MovieLens    4       138K × 27K × 21 × 24     19,800,237       [Link]
YahooMusic   4       1M × 625K × 133 × 24     227,520,248      [Link]
HugeWiki     2       50M × 40K                3,411,259,583

Competitors

Model                        Description
FTcomALS, FTcomCD, FTcomNN   Our proposed methods. All of them adopt the proposed caching algorithm and disk-based tensor processing, with 1) an ALS-based, 2) a CD-based, and 3) a non-negativity (NN) constrained update rule, respectively. [Down]
P-Tucker, P-Tucker-Cache     The state-of-the-art Tucker factorization methods for tensor completion. P-Tucker-Cache accelerates the process by caching the values of all sub-components used for computing delta vectors. [Link]
Tucker-wOpt                  A tensor factorization method mainly composed of tensor operations. It uses a binary mask to consider only observed entries and minimizes its loss function with nonlinear conjugate gradient. [Link]

Results

1. The effect of the caching algorithm

Figure: (a) Dataset: CamVid, (b) Dataset: MovieLens

FTcom considerably shortens the execution time compared to the other methods: it is up to 7.25x faster than P-Tucker on CamVid and up to 1.88x faster on MovieLens.

Figure: (a) Dataset: CamVid, (b) Dataset: MovieLens, (c) Dataset: YahooMusic, (d) Dataset: HugeWiki

FTcom converges significantly faster than P-Tucker on all datasets. Tucker-wOpt and P-Tucker-Cache can handle only CamVid among the four datasets, because they require too much memory to factorize the others.

2. The effect of the disk-based tensor processing

Figure: Dataset: Synthetic

FTcom is much more scalable than the other competitors. Tucker-wOpt and P-Tucker-Cache fail with out-of-memory errors even on the smallest synthetic tensor, and P-Tucker also halts when the number of observed entries exceeds 500 million. In contrast, FTcom successfully factorizes billion-scale tensors that exceed the size of our memory, and it scales linearly with the number of observed entries.

3. The effect of the coordinate descent approach

Figure: (a) Dataset: CamVid, (b) Dataset: MovieLens, (c) Dataset: YahooMusic, (d) Dataset: HugeWiki

There is a trade-off between computational cost and convergence rate depending on the solver. Specifically, a single iteration of FTcomCD takes less time than one of FTcomALS but makes less progress toward convergence, and vice versa. As a result, FTcomALS generally reaches lower test RMSE values than FTcomCD, but on MovieLens and HugeWiki, FTcomCD achieves test RMSE values comparable to those of FTcomALS. Moreover, FTcomNN factorizes the input tensor under the non-negativity constraint without compromising accuracy compared to FTcomCD.