More and more data needs to be processed or analyzed on mobile devices for efficiency or privacy reasons, but performing machine learning tasks on large data within such devices is challenging because of their limited memory resources. For this reason, disk-based machine learning methods, which utilize storage resources instead of holding all the data in memory, have been actively researched. This paper proposes D-MC2, a novel disk-based matrix completion method that (1) supports incremental data updates (i.e., data insertion and deletion) and (2) spills both data and model to disk when necessary; neither functionality is supported by existing methods. First, D-MC2 builds a two-layered index to efficiently support incremental data updates; there is a trade-off between the model learning and data update costs, and our two-layered index optimizes the two costs simultaneously. Second, we develop a window-based SGD scheduler to efficiently support the dual spilling; heavy disk I/O is incurred when the size of the model exceeds that of memory, and our new scheduler substantially reduces it. Our evaluation shows that D-MC2 is significantly more scalable and faster than other disk-based competitors under limited memory. In terms of the co-optimization, D-MC2 outperforms baselines that optimize only one of the two costs by up to 48x. Furthermore, the window-based scheduler makes training up to 10.7x faster than a naive scheduler.
This paper proposes D-MC2, a new disk-based matrix completion method for memory-limited devices, which efficiently supports (1) incremental data updates (i.e., insertion and deletion) and (2) dual spilling, i.e., spilling both data and model to disk when necessary. We first introduce a two-layered index to efficiently support incremental data updates. The two-layered index co-optimizes the model learning and data update costs by separating the unit of model learning from the unit of data update. Second, we devise a window-based SGD scheduling algorithm for matrix completion to efficiently support the dual spilling. The proposed scheduler restricts the region of blocks to be selected using a sliding-window scheme, which improves the temporal locality of model accesses and substantially reduces the amount of disk I/O.
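For background, SGD-based matrix completion factorizes a partially observed rating matrix into factor matrices W and H by repeatedly applying a per-entry update. The sketch below shows this core update in plain NumPy; it is illustrative background only, not D-MC2's disk-based implementation, and the function name and signature are our own:

```python
import numpy as np

def sgd_step(W, H, i, j, r_ij, step, row_lambda, col_lambda):
    """One SGD update on observed entry (i, j) with rating r_ij.

    W: (num_rows, k) and H: (num_cols, k) factor matrices.
    Locally minimizes (r_ij - W[i] @ H[j])**2
                      + row_lambda * ||W[i]||^2 + col_lambda * ||H[j]||^2.
    """
    err = r_ij - W[i] @ H[j]   # prediction error on this entry
    w_i = W[i].copy()          # keep the old row for H's update
    W[i] += step * (err * H[j] - row_lambda * W[i])
    H[j] += step * (err * w_i - col_lambda * H[j])
```

The CLI flags listed later on this page (--stepsize, --row-lambda, --column-lambda, --latent-dim-size) correspond to step, row_lambda, col_lambda, and k in this update.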
| | Netflix | Yahoo! Music | HugeWiki |
|---|---|---|---|
| # rows | 480,189 | 1,000,990 | 50,082,603 |
| # cols | 17,770 | 624,961 | 39,780 |
| # training | 99,072,112 | 252,800,275 | 3,411,259,583 |
| # test | 1,408,395 | 4,003,960 | 34,458,060 |
| Model | Description |
|---|---|
| GraphChi | A disk-based graph processing engine; SGD for matrix completion is implemented as an application. [Download] |
| FPSGD | The disk-based version of the FPSGD method. [Download] |
| MLGF-MF | The state-of-the-art disk-based matrix completion method. [Download] |
| D-MCnaive | A baseline that uses a naive scheduler instead of the window-based scheduler |
| D-MCM and D-MCD | Baselines that support dual spilling via the conventional MLGF rather than the two-layered index |
| D-MC2 | Our proposed method [Download] |
1. D-MC2 efficiently manages the data file for incremental data update
We first measure the elapsed time to update the data file after inserting 100K new records into the current dataset. As the table shows, D-MC2 takes significantly less time to update the data file than FPSGD and MLGF-MF. We do not report the result of GraphChi because it supports the processing of dynamic graphs and shows results similar to those of D-MC2. Second, we grow the dataset steadily and retrain the model after every 100K-record insertion; that is, we measure the total cumulative time to update the data file and train the model for each insertion. D-MC2 takes the least total time among the disk-based competitors. Each bar in the result consists of two parts: the upper part represents the time for data update, and the lower part represents the time for model learning.
2. D-MC2 decomposes a matrix even when the size of the model grows beyond that of memory
We measure the elapsed time per iteration on a large-scale dataset, HugeWiki. HugeWiki has more than 3.4 billion observed entries, and its size in CSV text format is about 87.19GB. We vary the rank k from 20 to 80, which corresponds to model sizes from 4.02GB to 16.08GB. Thus, a large amount of memory is needed to hold both the data and the model, and the memory requirement increases linearly with k.
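These model sizes follow directly from the factor-matrix dimensions: W is (#rows × k) and H is (#cols × k). Assuming 4-byte floating-point entries (our assumption; the element type is not stated here), a quick check roughly reproduces the reported numbers:

```python
rows, cols = 50_082_603, 39_780        # HugeWiki dimensions from the table above
for k in (20, 80):                     # rank range used in the experiment
    gb = (rows + cols) * k * 4 / 1e9   # |W| + |H| entries, 4 bytes each
    print(f"k={k}: ~{gb:.2f} GB")      # ~4.01 GB and ~16.04 GB
```

This closely matches the reported 4.02GB to 16.08GB range.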
D-MC2 successfully decomposes a matrix even if the size of the model exceeds the size of memory. Specifically, once the model grows beyond 6.5GB, D-MC2 starts spilling it to disk. In contrast, GraphChi and MLGF-MF halt with out-of-memory errors when the model grows larger than 7.5GB. FPSGD cannot perform matrix completion at all under our experimental setting, even though it utilizes the disk for better scalability.
3. Window-based scheduler optimizes the model learning cost
The window-based scheduler significantly reduces the model learning cost. The elapsed time for model learning of D-MCnaive increases rapidly as the memory budget shrinks, whereas that of D-MC2 is hardly affected by the memory budget. In addition, the figure in Exp. 2 shows that D-MC2 takes almost the same time as when it does not spill the model. Specifically, D-MC2 is up to 10.7x faster than D-MCnaive on HugeWiki, and the gap widens as the model grows larger.
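To illustrate the idea, the sketch below contrasts a naive scheduler, which shuffles all blocks of the block-partitioned matrix globally, with a window-based one that shuffles only within a sliding window of row blocks. This is a hypothetical simplification of D-MC2's scheduler, and all names are our own:

```python
import random

def naive_schedule(n_blocks):
    """Shuffle all n_blocks x n_blocks blocks globally: consecutive blocks
    rarely share factor rows/cols, so cached model pages are wasted."""
    blocks = [(r, c) for r in range(n_blocks) for c in range(n_blocks)]
    random.shuffle(blocks)
    return blocks

def window_schedule(n_blocks, window):
    """Shuffle only within a sliding window of row blocks: the same slice
    of W is reused while it is still in memory (better temporal locality)."""
    order = []
    for start in range(0, n_blocks, window):       # slide the window forward
        rows = range(start, min(start + window, n_blocks))
        group = [(r, c) for r in rows for c in range(n_blocks)]
        random.shuffle(group)                      # randomness kept, but local
        order.extend(group)
    return order
```

With a window of w row blocks, the working set of W shrinks from the whole factor matrix to roughly w/n_blocks of it, which is consistent with the learning time staying flat as the memory budget shrinks.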
4. Two-layered index simultaneously optimizes both model learning and data update costs
We evaluate the efficiency of D-MC2 in terms of model learning. D-MC2 shows nearly the best performance: it achieves as low a learning cost as D-MCM with a 1024KB page size even though D-MC2 uses 16KB physical pages in the Yahoo! Music dataset ((c) and (d)), and it even outperforms D-MCM in the Netflix dataset ((a) and (b)). In contrast, D-MCD with a 16KB page size shows a far higher learning cost than D-MC2.
This performance enhancement is mainly due to the reduction of cache misses. D-MC2 shows almost the same I/O cost as D-MCM and D-MCD, and their I/O costs increase as the memory budget decreases ((b) and (d)). However, the elapsed time for model learning is not affected by the memory budget ((a) and (c)). These results strongly indicate that the performance difference among D-MCM, D-MCD, and D-MC2 is caused not by the disk I/O cost but by cache misses.
We evaluate the efficiency of D-MC2 in terms of data update. Overall, D-MC2 shows the best performance. D-MC2 is as efficient as D-MCD with a 16KB page size ((a) and (c)); the two methods take almost identical time to insert one million records. However, D-MCM with a 1024KB page size is far less efficient than D-MC2 and D-MCD, mainly because it incurs a higher I/O cost ((b) and (d)).
We plot the performance of D-MC2 in terms of both model learning and data update. For the efficiency of model learning, we use the result at a 50% memory budget. D-MC2 simultaneously optimizes both the model learning and data update costs, whereas D-MCM and D-MCD each optimize only one of the two. For example, D-MCM is very efficient in terms of model learning but terribly inefficient in terms of data update; conversely, D-MCD is very inefficient in terms of model learning but efficient in terms of data update. D-MC2 matches the best of the others in model learning cost while also showing the best performance in data update.
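As a rough illustration of why separating the two units helps, the sketch below stores ratings in small physical pages (the unit of data update) grouped under large logical blocks (the unit of model learning). It is a loose, hypothetical analogy, not the MLGF-based structure described in the paper; all sizes and names are assumptions:

```python
class TwoLayeredIndexSketch:
    """Small pages for cheap updates, large logical blocks for cheap scans."""

    ENTRIES_PER_PAGE = 16 * 1024 // 12   # ~16KB page, ~12B per (i, j, r) entry
    PAGES_PER_BLOCK = 64                 # 64 x 16KB = 1024KB logical block

    def __init__(self):
        self.blocks = [[[]]]             # blocks -> pages -> rating tuples

    def insert(self, entry):
        """Data update: dirties a single small page, so one insertion costs
        one 16KB page write instead of a 1024KB block write."""
        pages = self.blocks[-1]
        if len(pages[-1]) >= self.ENTRIES_PER_PAGE:
            if len(pages) >= self.PAGES_PER_BLOCK:
                self.blocks.append([[]])   # logical block full: open a new one
                pages = self.blocks[-1]
            else:
                pages.append([])           # just open a new page in the block
        pages[-1].append(entry)

    def scan_block(self, b):
        """Model learning: streams one large logical block sequentially,
        amortizing seek cost over 1024KB of contiguous data."""
        for page in self.blocks[b]:
            yield from page
```

Updates touch only one 16KB page while learning streams 1024KB logical blocks, which is how a single structure can serve both cost profiles at once.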
--update-index
: Build (or incrementally update) a two-layered index and a db file from a plain dataset.

./dmcc --update-index <input_datafile> <target_db_file>

--train
: Train a model (i.e., the factor matrices W and H), and output the factor matrices to a model file.

./dmcc --train <input_dbfile> <output_model_file>

--predict
: Predict the scores of the ratings in the test dataset, and report the validation RMSE.

./dmcc --predict <input_modelfile> <input_testfile>
--thread-size
: The number of concurrent threads

--dbufsize
: The size of the "data" memory buffer (MB)

--mbufsize
: The size of the "model" memory buffer (MB)

--validate-iteratively
: Enable reporting the validation RMSE for each iteration

--iteration-size
: The number of iterations

--row-lambda
: The regularizer for the factor matrix W

--column-lambda
: The regularizer for the factor matrix H

--stepsize
: The step size η

--latent-dim-size
: The size of the latent dimensions (the k value in the paper)

./dmcc --update-index ./toy_train ./toy_db
./dmcc --update-index ./toy_insert ./toy_db
./dmcc --train --validate ./toy_test --validate-iteratively --iteration-size 10 --latent-dim-size 20 ./toy_db ./toy_model