OCAM: Out-of-core Coordinate Descent Algorithm for Matrix Completion

Dongha Lee1, Jinoh Oh2, Hwanjo Yu*1

1Pohang University of Science and Technology (POSTECH), Pohang, South Korea
2Adobe Systems Inc., San Jose, CA, USA
{dongha.lee, hwanjoyu}@postech.ac.kr1, joh@adobe.com2

Abstract


Recent reports increasingly show that a majority of datasets can actually be stored on the disks of a single off-the-shelf workstation, and that utilizing out-of-core methods is much cheaper, and often even faster, than using a distributed system. For these reasons, out-of-core methods have been actively developed for machine learning and graph processing. The goal of this paper is to develop an efficient out-of-core matrix completion method based on the coordinate descent approach. Coordinate descent based matrix completion (CD-MC) has two strong benefits over the other approaches: 1) it does not involve heavy computation such as matrix inversion, and 2) it has no step-size hyper-parameters, which greatly reduces the effort of hyper-parameter tuning. Existing solutions for CD-MC are developed and analyzed for the in-memory setting and do not take disk I/O into account. Thus, we propose OCAM, a novel out-of-core coordinate descent algorithm for matrix completion. Our evaluation results and cost analyses provide sound evidence supporting the following benefits of OCAM: (1) Scalability - OCAM is a truly scalable out-of-core method and thus can decompose a matrix larger than the size of memory, (2) Efficiency - OCAM is up to 10x faster than the state-of-the-art out-of-core method, and up to 4.1x faster than a competing distributed method using eight machines.

Overview


Figure: A novel file structure designed for OCAM, named grid-based compressed sparse column (GCSC) (left), and the disk access patterns of OCAM based on GCSC (right).

In this paper, we propose OCAM, a novel out-of-core coordinate descent algorithm for matrix completion. The key idea of OCAM is to load only a small part of the entire matrix at a time, which makes it possible to handle a large-scale matrix. To this end, we design an efficient file structure that reduces the space requirement by half while preserving efficient access to the matrix. OCAM accesses a disk-resident matrix stored in a GCSC file grid-row by grid-row and grid-column by grid-column when updating the factor matrices W and H, respectively.
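The grid-wise update pattern can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it uses a dense 0/1 observation mask in place of the actual GCSC file layout and a simplified update schedule, but it shows the closed-form CCD coordinate update, which needs no step-size hyper-parameter, applied one grid row of W at a time:

```python
import numpy as np

def ccd_update_W(R, mask, W, H, lam, n_grid=2):
    """One CCD sweep over W, processed grid-row by grid-row.

    R    : (m, n) rating matrix (only masked entries are used)
    mask : (m, n) 0/1 array marking observed entries
    W, H : (m, k) and (n, k) factor matrices; W is updated in place
    lam  : L2 regularization coefficient
    """
    m, k = W.shape
    rows_per_grid = (m + n_grid - 1) // n_grid
    for g in range(n_grid):                      # one grid row at a time
        lo, hi = g * rows_per_grid, min((g + 1) * rows_per_grid, m)
        for i in range(lo, hi):
            obs = np.flatnonzero(mask[i])        # observed columns in row i
            if obs.size == 0:
                continue
            r = R[i, obs] - W[i] @ H[obs].T      # residual on observed entries
            for t in range(k):
                r += W[i, t] * H[obs, t]         # remove old contribution of coordinate t
                # closed-form coordinate update: no step size needed
                W[i, t] = (r @ H[obs, t]) / (lam + H[obs, t] @ H[obs, t])
                r -= W[i, t] * H[obs, t]         # add back new contribution
    return W
```

Updating H uses the same routine on the transposed problem, e.g. `ccd_update_W(R.T, mask.T, H, W, lam)`, which corresponds to scanning the matrix grid-column by grid-column.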

Experiments


Datasets

            Netflix      Yahoo! Music   HugeWiki
# rows      480,189      1,000,990      50,082,603
# cols      17,770       624,961        39,780
# training  99,072,112   252,800,275    3,411,259,583
# test      1,408,395    4,003,960      34,458,060

Competitors

Model                      Description
CCD and CCD++              The state-of-the-art in-memory competitors. [Download]
GraphChi                   The state-of-the-art out-of-core graph processing engine; CCD is implemented as an application. [Download]
OOC-CCD++                  A straightforward out-of-core extension of CCD++.
MLGF-MF                    An out-of-core SGD-MC method. [Download]
NOMAD, CCD++, and DSGD++   The distributed competitors. [Download]
OCAM and OCAMopt           Our proposed methods. [Download]

Results

1. The efficiency of OCAM

(a) Dataset: Netflix
(b) Dataset: Yahoo! Music
(c) Dataset: HugeWiki

OCAM significantly outperforms the competitors. It is up to 10x faster than the state-of-the-art out-of-core method. On the other hand, OOC-CCD++ shows the slowest convergence due to its large amount of disk I/O.

2. The effect of two-phase optimization

(a) Dataset: Netflix
(b) Dataset: Yahoo! Music
(c) Dataset: HugeWiki
Solid line: RevuAhn SSD, Dashed line: S850Pro SSD.

Our proposed optimization scheme enables OCAM to converge significantly faster at an early stage of training. Two-phase optimization boosts the convergence of OCAM by reducing the time per iteration during the first phase (i.e., iter < Tboost).

3. The scalability of OCAM

Dataset: synthetic

OCAM significantly outperforms both CCD and CCD++ in terms of scalability. OCAM can decompose a very large matrix, whereas the in-memory CD-MC methods fail due to out-of-memory errors. Moreover, the elapsed time per iteration increases in proportion to the number of observed entries, which demonstrates the linear scalability of OCAM.

4. Comparison with a SGD-MC method

(a) Dataset: Netflix
(b) Dataset: Yahoo! Music
(c) Dataset: HugeWiki

OCAMopt consistently shows the best results, while the convergence of MLGF-MF depends largely on the step-size. With a poorly chosen step-size, MLGF-MF sometimes takes more iterations to converge to the fine-tuned solution (s1), or cannot achieve the same level of RMSE (s2). OCAMopt also shows slightly better results than MLGF-MF with the best step-size value, without an extensive and costly search over the step-size (η) parameter space.

5. Comparison with distributed methods

(a) Dataset: Netflix
(b) Dataset: Yahoo! Music
(c) Dataset: HugeWiki

In this experiment, eight machines are used for the distributed methods, and each machine utilizes four cores. The y-axis is the test RMSE, and the x-axis is the number of seconds multiplied by the total number of cores used. The figures show that OCAMopt is more cost-efficient than the other competitors. The distributed methods are not always faster than an out-of-core method, because heavy communication costs degrade the efficiency per core.

(a) Dataset: Netflix
(b) Dataset: Yahoo! Music
(c) Dataset: HugeWiki

OCAMopt also shows linear scaling with respect to the number of machines under distributed validation, where each machine trains an independent model with a different hyper-parameter value. It significantly outperforms all competitors when the same number of machines is used, and is up to 4.1x faster than distributed CCD++, which shows the best performance on Yahoo! Music. Moreover, OCAMopt can decompose the HugeWiki matrix more efficiently with fewer resources than the distributed methods.

OCAM running on a single machine is more scalable and faster than Spark-ALS running on four machines, even though it uses fewer resources. Specifically, OCAM is up to 9.95x faster than Spark-ALS on Netflix and Yahoo! Music, and Spark-ALS cannot handle HugeWiki due to its memory limitation.

Download


Building the project produces two executables: (1) build-gcsc creates a GCSC file from a raw text file, and (2) train-ocam performs CCD using the GCSC file. A toy dataset is also included for a simple test. In the target directory, type the following command to build a GCSC file. The default grid granularity (g) is 16, and you can change it with the -g argument. In addition, the option -d 1 reduces memory usage by building the GCSC file on disk.

./build-gcsc ./toy-example 
./build-gcsc -g 8 ./toy-example
./build-gcsc -g 8 -d 1 ./toy-example

Then, run OCAM or OCAMopt using the following commands. The default mode is OCAM; you can run OCAMopt with the argument -o 1. Likewise, -n, -k, and -t determine the number of threads, latent dimensions, and iterations, respectively. The regularization coefficient can be specified with the -l argument. The grid granularity (-g) must match that of the GCSC file you built.

./train-ocam -n 4 -k 20 -t 10 -l 0.05 ./toy-example 
./train-ocam -n 4 -k 20 -t 10 -l 0.05 -g 8 ./toy-example
./train-ocam -o 1 -n 4 -k 20 -t 10 -l 0.05 ./toy-example 
./train-ocam -o 1 -n 4 -k 20 -t 10 -l 0.05 -g 8 ./toy-example