Recently, there have been increasing reports that a majority of datasets can actually be stored on the disks of a single off-the-shelf workstation, and that utilizing out-of-core methods is much cheaper than, and often even faster than, using a distributed system. For these reasons, out-of-core methods have been actively developed for machine learning and graph processing. The goal of this paper is to develop an efficient out-of-core matrix completion method based on the coordinate descent approach. Coordinate descent based matrix completion (CD-MC) has two strong benefits over the other approaches: 1) it does not involve heavy computation such as matrix inversion, and 2) it does not have step-size hyper-parameters, which greatly reduces the effort for hyper-parameter tuning. Existing solutions for CD-MC were developed and analyzed for the in-memory setting, and they do not take disk I/O into account. Thus, we propose OCAM, a novel out-of-core coordinate descent algorithm for matrix completion. Our evaluation results and cost analyses provide sound evidence for the following benefits of OCAM: (1) Scalability - OCAM is a truly scalable out-of-core method and thus can decompose a matrix larger than the size of main memory; (2) Efficiency - OCAM is very fast: up to 10x faster than the state-of-the-art out-of-core method, and up to 4.1x faster than a competing distributed method using eight machines.
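To see why no step size is needed, recall the standard one-variable coordinate descent update used by CCD/CCD++ (assuming the usual regularized squared-error objective): each coordinate has an exact closed-form minimizer,

$$ w_{if} \leftarrow \frac{\sum_{j \in \Omega_i} \left( R_{ij} + w_{if} h_{jf} \right) h_{jf}}{\lambda + \sum_{j \in \Omega_i} h_{jf}^{2}}, \qquad R_{ij} = A_{ij} - \mathbf{w}_i^{\top} \mathbf{h}_j, $$

where $\Omega_i$ is the set of columns observed in row $i$ and $\lambda$ is the regularization parameter, so no gradient step scaled by a tuned step size ever appears.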
In this paper, we propose OCAM, a novel out-of-core coordinate descent algorithm for matrix completion. The key idea of OCAM is to load only a small part of the entire matrix at a time so that it can handle a large-scale matrix. To this end, we design an efficient file structure, GCSC, that reduces the space requirement by half while preserving efficient access to the matrix. OCAM accesses the disk-resident matrix in the GCSC file one grid row at a time when updating the factor matrix W, and one grid column at a time when updating H.
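To illustrate this access pattern, here is a minimal, self-contained Python sketch (not the actual OCAM implementation): the GCSC file is simulated by an in-memory g×g grid of COO blocks, and `ccd_update_rows` is a hypothetical stand-in for OCAM's update routine.

```python
import numpy as np
from collections import defaultdict

def ccd_update_rows(F, G, entries, lam):
    """Element-wise CCD update of the rows of F touched by `entries`,
    a list of observations (i, j, a) meaning A[i, j] = a. Each
    coordinate has a closed-form minimizer, so no step size is used."""
    by_row = defaultdict(list)
    for i, j, a in entries:
        by_row[i].append((j, a))
    for i, obs in by_row.items():
        res = {j: a - F[i] @ G[j] for j, a in obs}  # residuals for row i
        for f in range(F.shape[1]):
            num = sum((res[j] + F[i, f] * G[j, f]) * G[j, f] for j, _ in obs)
            den = lam + sum(G[j, f] ** 2 for j, _ in obs)
            new = num / den
            for j, _ in obs:                         # maintain residuals
                res[j] -= (new - F[i, f]) * G[j, f]
            F[i, f] = new

def ocam_iteration(grid, W, H, g, lam=0.05):
    # Update W: stream one grid row at a time, so only that row stripe's
    # observed entries need to be in memory (read from disk in real OCAM).
    for gi in range(g):
        stripe = [e for gj in range(g) for e in grid[gi][gj]]
        ccd_update_rows(W, H, stripe, lam)
    # Update H: stream one grid column at a time, with the roles of
    # rows and columns swapped.
    for gj in range(g):
        stripe = [(j, i, a) for gi in range(g) for (i, j, a) in grid[gi][gj]]
        ccd_update_rows(H, W, stripe, lam)

# Toy demo: a 4x4 matrix partitioned into a 2x2 grid of COO blocks.
rng = np.random.default_rng(0)
W, H = rng.standard_normal((4, 2)), rng.standard_normal((4, 2))
grid = [[[(0, 0, 5.0)], [(1, 3, 3.0)]],
        [[(2, 1, 4.0)], [(3, 2, 1.0)]]]
ocam_iteration(grid, W, H, g=2)
```

A full pass over the grid rows (updating W) followed by a pass over the grid columns (updating H) corresponds to one outer iteration; in the real system each stripe is read sequentially from disk, so memory holds only one stripe of observations plus the factor matrices.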
| | Netflix | Yahoo! Music | HugeWiki |
|---|---|---|---|
| # rows | 480,189 | 1,000,990 | 50,082,603 |
| # cols | 17,770 | 624,961 | 39,780 |
| # training | 99,072,112 | 252,800,275 | 3,411,259,583 |
| # test | 1,408,395 | 4,003,960 | 34,458,060 |
| Model | Description |
|---|---|
| CCD and CCD++ | The state-of-the-art in-memory competitors. [Download] |
| GraphChi | The state-of-the-art out-of-core graph processing engine; CCD is implemented as an application on it. [Download] |
| OOC-CCD++ | A straightforward out-of-core extension of CCD++. |
| MLGF-MF | The out-of-core SGD-MC method. [Download] |
| NOMAD, CCD++, and DSGD++ | The distributed competitors. [Download] |
| OCAM and OCAMopt | Our proposed methods. [Download] |
1. The efficiency of OCAM
OCAM significantly outperforms the competitors: it is up to 10x faster than the state-of-the-art out-of-core method. In contrast, OOC-CCD++ shows the slowest convergence due to its large amount of disk I/O.
2. The effect of two-phase optimization
Our proposed optimization scheme makes OCAM converge significantly faster at an early stage of training. Two-phase optimization boosts the convergence of OCAM by reducing the time per iteration during the first phase (i.e., iter < Tboost), as sketched below.
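A minimal sketch of the schedule described above (the exact boost-phase update of OCAMopt is beyond this summary; `boost_update` and `full_update` are hypothetical placeholders for a cheaper and a full per-iteration update, respectively):

```python
T_BOOST = 5        # hypothetical boundary of the first phase (Tboost)
NUM_ITERS = 10     # hypothetical total number of iterations

def boost_update():  # placeholder: cheaper update used while iter < Tboost
    pass

def full_update():   # placeholder: standard full OCAM iteration
    pass

for it in range(NUM_ITERS):
    if it < T_BOOST:
        boost_update()  # phase 1: lower time per iteration
    else:
        full_update()   # phase 2: standard updates
```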
3. The scalability of OCAM
OCAM significantly outperforms both CCD and CCD++ in terms of scalability: it can decompose a very large matrix, whereas the in-memory CD-MC methods fail with out-of-memory errors. Moreover, the elapsed time per iteration increases in proportion to the number of observed entries, which demonstrates the linear scalability of OCAM.
4. Comparison with a SGD-MC method
OCAMopt consistently shows the best results, while the convergence of MLGF-MF largely depends on the step size. With a poorly chosen step size, MLGF-MF either takes more iterations to converge to the fine-tuned solution (s1) or cannot reach the same level of RMSE (s2). OCAMopt also shows slightly better results than MLGF-MF with the best step-size value, without an extensive and costly search over the η parameter space.
5. Comparison with distributed methods
In this experiment, eight machines are used for the distributed methods, and each machine utilizes four cores. The y-axis is test RMSE, and the x-axis is elapsed seconds multiplied by the total number of cores used (i.e., core-seconds; for example, 100 seconds on 32 cores is 3,200 core-seconds). The figures show that OCAMopt is more cost-efficient than the other competitors. The distributed methods are not always faster than an out-of-core method, because heavy communication costs degrade the efficiency per core.
OCAMopt also shows linear scaling with respect to the number of machines under distributed validation, where each machine trains an independent model with a different hyper-parameter value for model validation. It significantly outperforms all competitors when the same number of machines is used, and is up to 4.1x faster than distributed CCD++, which shows the best performance on Yahoo! Music. Moreover, OCAMopt can decompose the HugeWiki matrix more efficiently with fewer resources than the distributed methods.
OCAM running on a single machine is more scalable and faster than Spark-ALS running on four machines, even though it uses fewer resources. Specifically, OCAM is up to 9.95x faster than Spark-ALS on Netflix and Yahoo! Music, and Spark-ALS cannot handle HugeWiki due to its memory limitations.
The following two executable files are built: (1) build-gcsc creates a GCSC file from a raw text file, and (2) train-ocam performs CCD using the GCSC file. A toy dataset is additionally included for a simple test. In the target directory, type the following command to build a GCSC file. The default grid granularity (g) is set to 16, and you can change it using the `-g` argument. In addition, the option `-d 1` builds the GCSC file on disk, which requires only a small amount of memory.
```
./build-gcsc ./toy-example
./build-gcsc -g 8 ./toy-example
./build-gcsc -g 8 -d 1 ./toy-example
```
Then, run OCAM or OCAMopt using the following commands. The default mode is OCAM, and you can run OCAMopt with the argument `-o 1`. Likewise, `-n`, `-k`, and `-t` determine the number of threads, latent dimensions, and iterations, respectively. The regularization parameter can be specified with the `-l` argument. You must set the grid granularity (g) to the same value as that of the GCSC file you built.
```
./train-ocam -n 4 -k 20 -t 10 -l 0.05 ./toy-example
./train-ocam -n 4 -k 20 -t 10 -l 0.05 -g 8 ./toy-example
./train-ocam -o 1 -n 4 -k 20 -t 10 -l 0.05 ./toy-example
./train-ocam -o 1 -n 4 -k 20 -t 10 -l 0.05 -g 8 ./toy-example
```