
Data Sources

jcanny edited this page May 7, 2014 · 34 revisions

Matrix Data Sources

A matrix datasource is a class that implements a block iterator over a set of input matrices. The iterator supports next() and hasNext() methods. The input matrices can be dense or sparse. Each call to next() on the datasource returns an array of Mat instances (generic matrices). The length of this array is the same as the number of matrices in the source. For unsupervised learning, the datasource usually holds a single matrix, and each call to next() returns an array of length 1 containing a single matrix (that matrix still holds many instances, one per column). For supervised learning, there is usually a data matrix and a matrix of labels. They should have the same number of columns, since the column index references each input instance. You construct a matrix datasource like this:

> val dopts = new MatDS.Options
> val m = new MatDS(Array(datamat, labelsmat), dopts)

You can adjust the options even after creating the datasource. They are not used until you call init on the datasource.

The options to MatDS are:

   var batchSize = 10000
   var sizeMargin = 3f
   var sample = 1f
   var addConstFeat:Boolean = false
   var featType:Int = 1                 // 0 = binary features, 1 = linear features
   var putBack = -1

batchSize is the size of the minibatch, i.e. the number of instances (columns) returned by each call to next().
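The block-iterator protocol itself is simple. Here is a minimal self-contained sketch (the MiniMatDS name is made up, and plain Array[Array[Float]] stands in for Mat), not the actual BIDMach implementation:

```scala
// Sketch of the MatDS block-iterator protocol: each column is one
// instance, and next() returns the next batchSize columns of every
// matrix in the source.
class MiniMatDS(mats: Array[Array[Array[Float]]], batchSize: Int) {
  private val ncols = mats(0)(0).length     // instances = columns
  private var here = 0
  def hasNext: Boolean = here < ncols
  def next(): Array[Array[Array[Float]]] = {
    val stop = math.min(here + batchSize, ncols)
    val out = mats.map(_.map(row => row.slice(here, stop)))
    here = stop
    out
  }
}
```

The last block may be smaller than batchSize when the column count is not a multiple of the batch size.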

sizeMargin is a parameter that supports caching of matrix containers. Since the minibatch size stays the same during a training run, every matrix returned by next() has the same rows x columns size. You can save a lot of allocation effort by caching these matrices. But sparse matrices have a variable number of non-zeros. The caching system re-uses a sparse matrix container until it is found to be too small, and then grows it by this factor. This heuristic works quite well at avoiding future allocations.
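The grow-on-demand heuristic can be sketched in a few lines (hypothetical names, not the BIDMach implementation):

```scala
// Sketch of the sizeMargin caching heuristic: reuse a cached buffer
// for sparse non-zeros until it is too small, then reallocate it
// sizeMargin times larger than currently needed.
class NnzCache(sizeMargin: Float) {
  private var buf = new Array[Float](0)
  var allocations = 0                       // reallocation count, for illustration
  def get(nnzNeeded: Int): Array[Float] = {
    if (buf.length < nnzNeeded) {           // too small: grow with margin
      buf = new Array[Float]((nnzNeeded * sizeMargin).toInt)
      allocations += 1
    }
    buf
  }
}
```

Because the buffer is oversized by the margin, moderate fluctuations in the non-zero count are absorbed without further allocations.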

sample is a parameter that determines whether every input instance is used or only a random subset; in the latter case it specifies the fraction of instances to keep. Sampling is done without replacement.
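Sampling without replacement amounts to keeping a random subset of the column indices. A minimal sketch (the helper name is made up):

```scala
import scala.util.Random

// Sketch of column sampling without replacement: keep a random
// `sample` fraction of the column indices, each index at most once.
def sampleCols(ncols: Int, sample: Float, rng: Random): Array[Int] = {
  val keep = math.round(ncols * sample)
  rng.shuffle((0 until ncols).toVector).take(keep).sorted.toArray
}
```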

When addConstFeat is true, an extra row of all 1's is appended to each block of the first output matrix returned by next(). This adds a constant feature that is useful in many algorithms (regression, collaborative filtering).

featType can be either 0 (binary features) or 1 (linear features). When this flag is 0, all non-zero output values are set to 1. When it is 1, the output values are exactly the same as in the input matrices.
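Applied to a block of values, the flag behaves like this sketch (a hypothetical helper, not BIDMach's code):

```scala
// Sketch of the featType flag applied to a block of feature values:
// 0 => binarize non-zeros to 1, 1 => pass values through unchanged.
def applyFeatType(vals: Array[Float], featType: Int): Array[Float] =
  if (featType == 0) vals.map(v => if (v != 0f) 1f else 0f)
  else vals
```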

putBack Option

putBack is an integer which specifies the index of the matrix returned by next() that is "put back" into the datasource on the next call to next(). That is, putBack turns the datasource into a combined data source/sink. For instance:

> dopts.putBack = 1                // matrix number 1 is the putback matrix
> val m = new MatDS(Array(datamat, predsmat), dopts)
> m.init
> val x = m.next                   // x is an array of two matrix blocks
> val dblock = x(0)
> val pblock = x(1)                // Contents of this matrix will be put back in the DS
> ...                              // Modify pblock
> val y = m.next                   // pblock contents are saved into "predsmat"

putBack is a scalable approach to prediction and to factor models. The size of a matrix of predictions grows with the size of the data source that provides the data. For factor models that approximate a data matrix M as X * Y, the number of columns of Y matches the number of columns of M, which can be very large. It's natural to support an update method that pushes those values back into the datasource from which data instances are being pulled. For matrix data sources, this simply means that the predictions can be taken from the second matrix argument to the MatDS constructor. For file data sources, it means predictions will be saved to the filesystem.
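The source/sink round trip can be modeled in a few lines of plain Scala (the MiniSink name and flat arrays are illustrative stand-ins, not BIDMach's types):

```scala
// Sketch of putBack semantics: the prediction block handed out by
// next() is written back into its backing array on the *following*
// call to next().
class MiniSink(data: Array[Float], preds: Array[Float], batchSize: Int) {
  private var here = 0
  private var pending: Option[(Int, Array[Float])] = None
  def next(): Array[Array[Float]] = {
    pending.foreach { case (start, blk) =>          // put back the modified block
      Array.copy(blk, 0, preds, start, blk.length)
    }
    val stop = math.min(here + batchSize, data.length)
    val dblock = data.slice(here, stop)
    val pblock = preds.slice(here, stop)
    pending = Some((here, pblock))
    here = stop
    Array(dblock, pblock)
  }
}
```

Mutating the returned prediction block between calls is what makes the write-back meaningful, mirroring the "Modify pblock" step in the example above.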

Files Data Source

Files data sources support much larger datasets, up to the capacity of the file system. A files datasource is implemented with a collection of files on disk, with each file holding a matrix of source data or e.g. class labels. A basic data source, FilesDS, fronts a collection of files, each storing an individual dense or sparse matrix. FilesDS abstracts the location and size of these files. The size of the matrix block returned by next() can be smaller or larger than the size of these matrices on disk.

To create a FilesDS, you need to define fnames: List[(Int)=>String], a list of functions that map integers to file names — one function per matrix in the source. This determines the sequence of files that will be retrieved from disk. Other parameters (defaults in parentheses) are:

lookahead(8): the number of files to prefetch, and the number of prefetch threads to use.
sampleFiles(1f): randomly sample this fraction of the columns.
nstart(0): the starting file index.
nend(0): the ending file index.
dorows(false): output blocks of rows (if true) or blocks of columns (if false).
order(1): randomize the file read order if 1, preserve it otherwise.
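An fnames entry is just a function from a file index to a path. A minimal sketch, where the directory layout and file names are made-up examples rather than anything FilesDS requires:

```scala
// Sketch of fnames for a two-matrix source: one function per matrix,
// each mapping a file index to a path (paths here are hypothetical).
val dataFn: Int => String = i => "/data/day%03d/docs.smat".format(i)
val catFn:  Int => String = i => "/data/day%03d/cats.fmat".format(i)
val fnames = List(dataFn, catFn)
```

With nstart and nend set, the datasource would call each function on every index in that range to locate the files to read.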

Files data sources work very well with non-RAID disk arrays, and generally give higher throughput. We recorded a throughput of 1.5 GB/s on an array of 16 commodity 2TB disks.
