Recently, there have been increasing reports that a majority of datasets can actually be stored on the disks of a single off-the-shelf workstation, and that utilizing out-of-core methods is much cheaper than, and often even faster than, using a distributed system. For these reasons, out-of-core methods have been actively developed for machine learning and graph processing. The goal of this paper is to develop an efficient out-of-core matrix completion method based on the coordinate descent approach. Coordinate descent based matrix completion (CD-MC) has two strong benefits over the other approaches: 1) it does not involve heavy computation such as matrix inversion, and 2) it does not have step-size hyper-parameters, which greatly reduces the effort for hyper-parameter tuning. Existing solutions for CD-MC were developed and analyzed for the in-memory setting, and they do not take disk I/O into account. Thus, we propose OCAM, a novel out-of-core coordinate descent algorithm for matrix completion. Our evaluation results and cost analyses provide sound evidence for the following benefits of OCAM: (1) Scalability - OCAM is a truly scalable out-of-core method and thus can decompose a matrix larger than the size of main memory; (2) Efficiency - OCAM is very fast: up to 10x faster than the state-of-the-art out-of-core method, and up to 4.1x faster than a competing distributed method using eight machines.
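To see why no step size is needed, recall the standard one-variable coordinate descent update used by CCD/CCD++ (assuming the usual regularized squared-error objective): each coordinate has an exact closed-form minimizer,

$$ w_{if} \leftarrow \frac{\sum_{j \in \Omega_i} \left( R_{ij} + w_{if} h_{jf} \right) h_{jf}}{\lambda + \sum_{j \in \Omega_i} h_{jf}^{2}}, \qquad R_{ij} = A_{ij} - \mathbf{w}_i^{\top} \mathbf{h}_j, $$

where $\Omega_i$ is the set of columns observed in row $i$ and $\lambda$ is the regularization parameter, so no gradient step scaled by a tuned step size ever appears.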
In this paper, we propose OCAM, a novel out-of-core coordinate descent algorithm for matrix completion. The key idea of OCAM is to load only a small part of the entire matrix at a time so that it can handle a large-scale matrix. To this end, we design an efficient file structure, GCSC, that reduces the space requirement by half while preserving efficient access to the matrix. OCAM accesses the disk-resident matrix in the GCSC file one grid row at a time when updating the factor matrix W, and one grid column at a time when updating H.
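To illustrate this access pattern, here is a minimal, self-contained Python sketch (not the actual OCAM implementation): the GCSC file is simulated by an in-memory g×g grid of COO blocks, and `ccd_update_rows` is a hypothetical stand-in for OCAM's update routine.

```python
import numpy as np
from collections import defaultdict

def ccd_update_rows(F, G, entries, lam):
    """Element-wise CCD update of the rows of F touched by `entries`,
    a list of observations (i, j, a) meaning A[i, j] = a. Each
    coordinate has a closed-form minimizer, so no step size is used."""
    by_row = defaultdict(list)
    for i, j, a in entries:
        by_row[i].append((j, a))
    for i, obs in by_row.items():
        res = {j: a - F[i] @ G[j] for j, a in obs}  # residuals for row i
        for f in range(F.shape[1]):
            num = sum((res[j] + F[i, f] * G[j, f]) * G[j, f] for j, _ in obs)
            den = lam + sum(G[j, f] ** 2 for j, _ in obs)
            new = num / den
            for j, _ in obs:                         # maintain residuals
                res[j] -= (new - F[i, f]) * G[j, f]
            F[i, f] = new

def ocam_iteration(grid, W, H, g, lam=0.05):
    # Update W: stream one grid row at a time, so only that row stripe's
    # observed entries need to be in memory (read from disk in real OCAM).
    for gi in range(g):
        stripe = [e for gj in range(g) for e in grid[gi][gj]]
        ccd_update_rows(W, H, stripe, lam)
    # Update H: stream one grid column at a time, with the roles of
    # rows and columns swapped.
    for gj in range(g):
        stripe = [(j, i, a) for gi in range(g) for (i, j, a) in grid[gi][gj]]
        ccd_update_rows(H, W, stripe, lam)

# Toy demo: a 4x4 matrix partitioned into a 2x2 grid of COO blocks.
rng = np.random.default_rng(0)
W, H = rng.standard_normal((4, 2)), rng.standard_normal((4, 2))
grid = [[[(0, 0, 5.0)], [(1, 3, 3.0)]],
        [[(2, 1, 4.0)], [(3, 2, 1.0)]]]
ocam_iteration(grid, W, H, g=2)
```

A full pass over the grid rows (updating W) followed by a pass over the grid columns (updating H) corresponds to one outer iteration; in the real system each stripe is read sequentially from disk, so memory holds only one stripe of observations plus the factor matrices.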
| | Netflix | Yahoo! Music | HugeWiki |
|---|---|---|---|
| # rows | 480,189 | 1,000,990 | 50,082,603 |
| # cols | 17,770 | 624,961 | 39,780 |
| # training | 99,072,112 | 252,800,275 | 3,411,259,583 |
| # test | 1,408,395 | 4,003,960 | 34,458,060 |
| Model | Description |
|---|---|
| CCD and CCD++ | The state-of-the-art in-memory competitors. [Download] |
| GraphChi | The state-of-the-art out-of-core graph processing engine; CCD is implemented as an application on it. [Download] |
| OOC-CCD++ | A straightforward out-of-core extension of CCD++. |
| MLGF-MF | The out-of-core SGD-MC method. [Download] |
| NOMAD, CCD++, and DSGD++ | The distributed competitors. [Download] |
| OCAM and OCAMopt | Our proposed methods. [Download] |
1. The efficiency of OCAM
OCAM significantly outperforms the competitors: it is up to 10x faster than the state-of-the-art out-of-core method. In contrast, OOC-CCD++ shows the slowest convergence due to its large amount of disk I/O.
2. The effect of two-phase optimization
Our proposed optimization scheme makes OCAM converge significantly faster at an early stage of training. Two-phase optimization boosts the convergence of OCAM by reducing the time per iteration during the first phase (i.e., iter < Tboost), as sketched below.
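A minimal sketch of the schedule described above (the exact boost-phase update of OCAMopt is beyond this summary; `boost_update` and `full_update` are hypothetical placeholders for a cheaper and a full per-iteration update, respectively):

```python
T_BOOST = 5        # hypothetical boundary of the first phase (Tboost)
NUM_ITERS = 10     # hypothetical total number of iterations

def boost_update():  # placeholder: cheaper update used while iter < Tboost
    pass

def full_update():   # placeholder: standard full OCAM iteration
    pass

for it in range(NUM_ITERS):
    if it < T_BOOST:
        boost_update()  # phase 1: lower time per iteration
    else:
        full_update()   # phase 2: standard updates
```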
3. The scalability of OCAM
OCAM significantly outperforms both CCD and CCD++ in terms of scalability: it can decompose a very large matrix, whereas the in-memory CD-MC methods fail with out-of-memory errors. Moreover, the elapsed time per iteration increases in proportion to the number of observed entries, which demonstrates the linear scalability of OCAM.
4. Comparison with a SGD-MC method
OCAMopt consistently shows the best results, while the convergence of MLGF-MF largely depends on the step size. With a poorly chosen step size, MLGF-MF either takes more iterations to converge to the fine-tuned solution (s1) or cannot reach the same level of RMSE (s2). OCAMopt also shows slightly better results than MLGF-MF with the best step-size value, without an extensive and costly search over the η parameter space.
5. Comparison with distributed methods
In this experiment, eight machines are used for the distributed methods, and each machine utilizes four cores. The y-axis is test RMSE, and the x-axis is elapsed seconds multiplied by the total number of cores used (i.e., core-seconds; for example, 100 seconds on 32 cores is 3,200 core-seconds). The figures show that OCAMopt is more cost-efficient than the other competitors. The distributed methods are not always faster than an out-of-core method, because heavy communication costs degrade the efficiency per core.
OCAMopt also shows linear scaling with respect to the number of machines under distributed validation, where each machine trains an independent model with a different hyper-parameter value for model validation. It significantly outperforms all competitors when the same number of machines is used, and is up to 4.1x faster than distributed CCD++, which shows the best performance on Yahoo! Music. Moreover, OCAMopt can decompose the HugeWiki matrix more efficiently with fewer resources than the distributed methods.
OCAM running on a single machine is more scalable and faster than Spark-ALS running on four machines, even though it uses fewer resources. Specifically, OCAM is up to 9.95x faster than Spark-ALS on Netflix and Yahoo! Music, and Spark-ALS cannot handle HugeWiki due to its memory limitations.
The following two executable files are built: (1) build-gcsc creates a GCSC file from a raw text file, and (2) train-ocam performs CCD using the GCSC file. A toy dataset is additionally included for a simple test. In the target directory, type the following command to build a GCSC file. The default grid granularity (g) is set to 16, and you can change it using the `-g` argument. In addition, the option `-d 1` builds the GCSC file on disk, which requires only a small amount of memory.
```
./build-gcsc ./toy-example
./build-gcsc -g 8 ./toy-example
./build-gcsc -g 8 -d 1 ./toy-example
```
Then, run OCAM or OCAMopt using the following commands. The default mode is OCAM, and you can run OCAMopt with the argument `-o 1`. Likewise, `-n`, `-k`, and `-t` determine the number of threads, latent dimensions, and iterations, respectively. The regularization parameter can be specified with the `-l` argument. You must set the grid granularity (g) to the same value as that of the GCSC file you built.
```
./train-ocam -n 4 -k 20 -t 10 -l 0.05 ./toy-example
./train-ocam -n 4 -k 20 -t 10 -l 0.05 -g 8 ./toy-example
./train-ocam -o 1 -n 4 -k 20 -t 10 -l 0.05 ./toy-example
./train-ocam -o 1 -n 4 -k 20 -t 10 -l 0.05 -g 8 ./toy-example
```