More and more data needs to be processed or analyzed on mobile devices for efficiency or privacy reasons, but performing machine learning tasks on large data within such devices is challenging because of their limited memory resources. For this reason, disk-based machine learning methods, which utilize storage resources instead of holding all the data in memory, have been actively researched. This paper proposes D-MC2, a novel disk-based matrix completion method that (1) supports incremental data updates (i.e., data insertion and deletion) and (2) spills both data and model to disk when necessary; neither functionality is supported by existing methods. First, D-MC2 builds a two-layered index to efficiently support incremental data updates; there is a trade-off between the model learning and data update costs, and our two-layered index optimizes the two costs simultaneously. Second, we develop a window-based SGD scheduler to efficiently support the dual spilling; heavy disk I/O is incurred when the size of the model exceeds that of memory, and our new scheduler substantially reduces it. Our evaluation shows that D-MC2 is significantly more scalable and faster than other disk-based competitors under limited memory. In terms of the co-optimization, D-MC2 outperforms baselines that optimize only one of the two costs by up to 48x. Furthermore, the window-based scheduler makes training up to 10.7x faster than a naive scheduler.
This paper proposes D-MC2, a new disk-based matrix completion method for memory-limited devices, which efficiently supports (1) incremental data updates (i.e., insertion and deletion) and (2) dual spilling, i.e., spilling both data and model to disk when necessary. We first introduce a two-layered index to efficiently support incremental data updates. The two-layered index co-optimizes the model learning and data update costs by separating the unit of model learning from the unit of data update. Second, we devise a window-based SGD scheduling algorithm for matrix completion to efficiently support the dual spilling. The proposed scheduler restricts the region of blocks to be selected using a sliding-window scheme, which improves the temporal locality of model accesses and substantially reduces the amount of disk I/O.
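For background, SGD-based matrix completion factorizes a partially observed rating matrix into factor matrices W and H by repeatedly applying a per-entry update. The sketch below shows this core update in plain NumPy; it is illustrative background only, not D-MC2's disk-based implementation, and the function name and signature are our own:

```python
import numpy as np

def sgd_step(W, H, i, j, r_ij, step, row_lambda, col_lambda):
    """One SGD update on observed entry (i, j) with rating r_ij.

    W: (num_rows, k) and H: (num_cols, k) factor matrices.
    Locally minimizes (r_ij - W[i] @ H[j])**2
                      + row_lambda * ||W[i]||^2 + col_lambda * ||H[j]||^2.
    """
    err = r_ij - W[i] @ H[j]   # prediction error on this entry
    w_i = W[i].copy()          # keep the old row for H's update
    W[i] += step * (err * H[j] - row_lambda * W[i])
    H[j] += step * (err * w_i - col_lambda * H[j])
```

The CLI flags listed later on this page (--stepsize, --row-lambda, --column-lambda, --latent-dim-size) correspond to step, row_lambda, col_lambda, and k in this update.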
| | Netflix | Yahoo! Music | HugeWiki |
|---|---|---|---|
| # rows | 480,189 | 1,000,990 | 50,082,603 |
| # cols | 17,770 | 624,961 | 39,780 |
| # training | 99,072,112 | 252,800,275 | 3,411,259,583 |
| # test | 1,408,395 | 4,003,960 | 34,458,060 |
| Model | Description |
|---|---|
| GraphChi | A disk-based graph processing engine; SGD for matrix completion is implemented as an application. [Download] |
| FPSGD | The disk-based version of the FPSGD method. [Download] |
| MLGF-MF | The state-of-the-art disk-based matrix completion method. [Download] |
| D-MCnaive | A baseline that uses a naive scheduler instead of the window-based scheduler |
| D-MCM and D-MCD | Baselines that support dual spilling via the conventional MLGF rather than the two-layered index |
| D-MC2 | Our proposed method [Download] |
1. D-MC2 efficiently manages the data file for incremental data update
We first measure the elapsed time to update the data file after inserting 100K new records into the current dataset. As the table shows, D-MC2 takes significantly less time to update the data file than FPSGD and MLGF-MF. We do not report the result of GraphChi because it supports the processing of dynamic graphs and shows results similar to those of D-MC2. Second, we grow the dataset steadily and retrain the model after every 100K-record insertion; that is, we measure the total cumulative time to update the data file and train the model for each insertion. D-MC2 takes the least total time among the disk-based competitors. Each bar in the result consists of two parts: the upper part represents the time for data update, and the lower part represents the time for model learning.
2. D-MC2 decomposes a matrix even when the size of the model grows beyond that of memory
We measure the elapsed time per iteration on a large-scale dataset, HugeWiki. HugeWiki has more than 3.4 billion observed entries, and its size in CSV text format is about 87.19GB. We vary the rank k from 20 to 80, which corresponds to model sizes from 4.02GB to 16.08GB. Thus, a large amount of memory is needed to hold both the data and the model, and the memory requirement increases linearly with k.
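These model sizes follow directly from the factor-matrix dimensions: W is (#rows × k) and H is (#cols × k). Assuming 4-byte floating-point entries (our assumption; the element type is not stated here), a quick check roughly reproduces the reported numbers:

```python
rows, cols = 50_082_603, 39_780        # HugeWiki dimensions from the table above
for k in (20, 80):                     # rank range used in the experiment
    gb = (rows + cols) * k * 4 / 1e9   # |W| + |H| entries, 4 bytes each
    print(f"k={k}: ~{gb:.2f} GB")      # ~4.01 GB and ~16.04 GB
```

This closely matches the reported 4.02GB to 16.08GB range.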
D-MC2 successfully decomposes a matrix even if the size of the model exceeds the size of memory. Specifically, once the model grows beyond 6.5GB, D-MC2 starts spilling it to disk. In contrast, GraphChi and MLGF-MF halt with out-of-memory errors when the model grows larger than 7.5GB. FPSGD cannot perform matrix completion at all under our experimental setting, even though it utilizes the disk for better scalability.
3. Window-based scheduler optimizes the model learning cost
The window-based scheduler significantly reduces the model learning cost. The elapsed time for model learning of D-MCnaive increases rapidly as the memory budget shrinks, whereas that of D-MC2 is hardly affected by the memory budget. In addition, the figure in Exp. 2 shows that D-MC2 takes almost the same time as when it does not spill the model. Specifically, D-MC2 is up to 10.7x faster than D-MCnaive on HugeWiki, and the gap widens as the model grows larger.
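To illustrate the idea, the sketch below contrasts a naive scheduler, which shuffles all blocks of the block-partitioned matrix globally, with a window-based one that shuffles only within a sliding window of row blocks. This is a hypothetical simplification of D-MC2's scheduler, and all names are our own:

```python
import random

def naive_schedule(n_blocks):
    """Shuffle all n_blocks x n_blocks blocks globally: consecutive blocks
    rarely share factor rows/cols, so cached model pages are wasted."""
    blocks = [(r, c) for r in range(n_blocks) for c in range(n_blocks)]
    random.shuffle(blocks)
    return blocks

def window_schedule(n_blocks, window):
    """Shuffle only within a sliding window of row blocks: the same slice
    of W is reused while it is still in memory (better temporal locality)."""
    order = []
    for start in range(0, n_blocks, window):       # slide the window forward
        rows = range(start, min(start + window, n_blocks))
        group = [(r, c) for r in rows for c in range(n_blocks)]
        random.shuffle(group)                      # randomness kept, but local
        order.extend(group)
    return order
```

With a window of w row blocks, the working set of W shrinks from the whole factor matrix to roughly w/n_blocks of it, which is consistent with the learning time staying flat as the memory budget shrinks.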
4. Two-layered index simultaneously optimizes both model learning and data update costs
We evaluate the efficiency of D-MC2 in terms of model learning. D-MC2 shows nearly the best performance: it achieves as low a learning cost as D-MCM with a 1024KB page size even though D-MC2 uses 16KB physical pages in the Yahoo! Music dataset ((c) and (d)), and it even outperforms D-MCM in the Netflix dataset ((a) and (b)). In contrast, D-MCD with a 16KB page size shows a far higher learning cost than D-MC2.
This performance enhancement is mainly due to the reduction of cache misses. D-MC2 shows almost the same I/O cost as D-MCM and D-MCD, and their I/O costs increase as the memory budget decreases ((b) and (d)). However, the elapsed time for model learning is not affected by the memory budget ((a) and (c)). These results strongly indicate that the performance difference among D-MCM, D-MCD, and D-MC2 is caused not by the disk I/O cost but by cache misses.
We evaluate the efficiency of D-MC2 in terms of data update. Overall, D-MC2 shows the best performance. D-MC2 is as efficient as D-MCD with a 16KB page size ((a) and (c)); the two methods take almost identical time to insert one million records. However, D-MCM with a 1024KB page size is far less efficient than D-MC2 and D-MCD, mainly because it incurs a higher I/O cost ((b) and (d)).
We plot the performance of D-MC2 in terms of both model learning and data update. For the efficiency of model learning, we use the result at a 50% memory budget. D-MC2 simultaneously optimizes both the model learning and data update costs, whereas D-MCM and D-MCD each optimize only one of the two. For example, D-MCM is very efficient in terms of model learning but terribly inefficient in terms of data update; conversely, D-MCD is very inefficient in terms of model learning but efficient in terms of data update. D-MC2 matches the best of the others in model learning cost while also showing the best performance in data update.
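As a rough illustration of why separating the two units helps, the sketch below stores ratings in small physical pages (the unit of data update) grouped under large logical blocks (the unit of model learning). It is a loose, hypothetical analogy, not the MLGF-based structure described in the paper; all sizes and names are assumptions:

```python
class TwoLayeredIndexSketch:
    """Small pages for cheap updates, large logical blocks for cheap scans."""

    ENTRIES_PER_PAGE = 16 * 1024 // 12   # ~16KB page, ~12B per (i, j, r) entry
    PAGES_PER_BLOCK = 64                 # 64 x 16KB = 1024KB logical block

    def __init__(self):
        self.blocks = [[[]]]             # blocks -> pages -> rating tuples

    def insert(self, entry):
        """Data update: dirties a single small page, so one insertion costs
        one 16KB page write instead of a 1024KB block write."""
        pages = self.blocks[-1]
        if len(pages[-1]) >= self.ENTRIES_PER_PAGE:
            if len(pages) >= self.PAGES_PER_BLOCK:
                self.blocks.append([[]])   # logical block full: open a new one
                pages = self.blocks[-1]
            else:
                pages.append([])           # just open a new page in the block
        pages[-1].append(entry)

    def scan_block(self, b):
        """Model learning: streams one large logical block sequentially,
        amortizing seek cost over 1024KB of contiguous data."""
        for page in self.blocks[b]:
            yield from page
```

Updates touch only one 16KB page while learning streams 1024KB logical blocks, which is how a single structure can serve both cost profiles at once.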
--update-index
: Build (or incrementally update) a two-layered index and a db file from a plain dataset.

./dmcc --update-index <input_datafile> <target_db_file>

--train
: Train a model (i.e., the factor matrices W and H), and output the factor matrices to a model file.

./dmcc --train <input_dbfile> <output_model_file>

--predict
: Predict the scores of the ratings in the test dataset, and report the validation RMSE.

./dmcc --predict <input_modelfile> <input_testfile>
--thread-size
: The number of concurrent threads

--dbufsize
: The size of the "data" memory buffer (MB)

--mbufsize
: The size of the "model" memory buffer (MB)

--validate-iteratively
: Enable reporting the validation RMSE for each iteration

--iteration-size
: The number of iterations

--row-lambda
: The regularizer for the factor matrix W

--column-lambda
: The regularizer for the factor matrix H

--stepsize
: The step size η

--latent-dim-size
: The size of the latent dimensions (the k value in the paper)

./dmcc --update-index ./toy_train ./toy_db
./dmcc --update-index ./toy_insert ./toy_db
./dmcc --train --validate ./toy_test --validate-iteratively --iteration-size 10 --latent-dim-size 20 ./toy_db ./toy_model