
Sampling - Guide - Apache DataFu Pig
Simple Random Sampling produces samples of a specific size, where each item has the same probability of being chosen. DataFu has scalable implementations of this that will generate samples …
datafu.pig.sampling (DataFu 1.2.0)
Sampling UDFs, including weighted sample, reservoir sampling, sampling by key, etc.
SimpleRandomSample (datafu-pig 1.3.3 API)
It takes a bag of n items and a sampling probability p as inputs and, with probability at least 99.99%, outputs a bag containing a simple random sample of exactly ceil(p*n) items.
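To make the size guarantee concrete, here is a minimal in-memory sketch in Python. It is not DataFu's scalable algorithm, only an illustration of what "a simple random sample of size ceil(p*n)" means; the function name `simple_random_sample` and the `seed` parameter are hypothetical.

```python
import math
import random


def simple_random_sample(items, p, seed=None):
    """Return a simple random sample (without replacement) of ceil(p * n) items.

    Illustrative only: this holds the whole input in memory, whereas the
    DataFu UDF is described as a scalable implementation.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    items = list(items)
    rng = random.Random(seed)
    # Floating-point rounding of p * n can shift the ceiling at exact boundaries.
    k = math.ceil(p * len(items))
    # random.sample draws k distinct positions, each subset equally likely.
    return rng.sample(items, k)
```

Because the sample is drawn without replacement, every item has the same chance of being chosen and no item appears twice.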
datafu.pig.sampling (datafu-pig 1.3.3 API)
Sampling UDFs, including weighted sample, reservoir sampling, sampling by key, etc.
SimpleRandomSample (DataFu 1.1.0)
It takes a sampling probability p as input and outputs a simple random sample of size exactly ceil(p*n) with probability at least 99.99%, where n is the size of the population.
SimpleRandomSampleWithReplacementVote (datafu-pig 1.3.3 API)
We can simply draw a number from this distribution, determine the positions by sampling without replacement, and then generate random scores for those positions.
SampleByKey (DataFu 1.2.0)
The method of sampling is to convert the key to a hash, derive a double value from this, and then test this against a supplied probability. The double value derived from a key is uniformly distributed …
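The hash-then-threshold idea described for SampleByKey can be sketched in Python as follows. This is an assumption-laden illustration, not DataFu's code: the choice of MD5 and the helper names `key_to_unit_interval` and `keep_key` are hypothetical, and the source does not say which hash function the UDF uses.

```python
import hashlib


def key_to_unit_interval(key):
    """Map a string key to a deterministic double in [0, 1).

    Assumption: MD5 is used here only because it is in the standard
    library; the real UDF may use a different hash.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Keep the top 53 bits so the value is exactly representable as a double.
    top_bits = int.from_bytes(digest[:8], "big") >> 11
    return top_bits / float(1 << 53)


def keep_key(key, p):
    """Return True if every record with this key should be kept."""
    return key_to_unit_interval(key) < p
```

Because the decision depends only on the key, all records sharing a key are kept or dropped together, and roughly a fraction p of the distinct keys survives.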
ReservoirSample (datafu-pig 1.3.3 API)
Guide - Apache DataFu Pig
Set Operations: set intersection, union, difference; Sessions: sessionize streams of data; Sampling: simple random sample with/without replacement, weighted sample, sample by keys; Hashing: SHA …
Overview (DataFu 1.2.0)
datafu.pig.sampling: Sampling UDFs, including weighted sample, reservoir sampling, sampling by key, etc.; datafu.pig.sessions