The sampling method shown earlier is a simple random sampling technique that has an
"equal probability of selection" (EPS) design.
EPS samples are considered useful because the variance of the sample attributes is similar
to the variance of the original data set. Bear in mind, though, that this property matters
only if you are interested in variances.
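As a reminder of what an equal-probability sample looks like in SQL, here is a minimal sketch, assuming a hypothetical table named mytable (this is one common way to draw such a sample, not necessarily the exact method shown earlier):

-- Each row is included with the same 1% probability,
-- so every row has an equal chance of selection (EPS).
SELECT * FROM mytable WHERE random() < 0.01;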
Simple random sampling can bias the eventual sample towards more frequently occurring
data. For example, if you take a 1% sample of a data set in which some kinds of data occur
only 0.001% of the time, you may end up with a sample that contains none of that outlying
data at all. To see why, consider a table of one million rows: data occurring 0.001% of the
time appears in only about ten rows, so a 1% random sample would be expected to pick up
just 0.1 of them.
What you might wish to do instead is pre-cluster your data and take different samples from
each group, to ensure that the sampled data set includes many more of the outlying
attributes (see the sketch after the following note). A simple method might be to:
- Include 1% of all normal data
- Include 25% of outlying data
Note that if you do this, then it is no longer an "EPS" sample design.
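As a minimal sketch of this stratified approach, assuming a hypothetical table measurements with a boolean column is_outlier that marks the pre-clustered outlying rows (both names are illustrative):

-- Keep 1% of normal rows and 25% of outlying rows;
-- random() decides inclusion independently for each row.
SELECT *
FROM measurements
WHERE (NOT is_outlier AND random() < 0.01)
   OR (is_outlier AND random() < 0.25);

Because the two strata are sampled at different rates, counts taken from such a sample must be re-weighted per stratum before being extrapolated back to the full table.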
See also
There are no doubt statisticians who will be in apoplexy after reading this. You're welcome to
use the facilities of the SQL language to create a more accurate sample. Please, just make
sure that you know what you're doing and/or check out some good statistical literature,
websites, or textbooks.
Loading data from a spreadsheet
Spreadsheets are the most obvious starting place for most data stores. Studies within a
range of businesses consistently show that more than 50% of smaller data stores are held in
spreadsheets or small desktop databases. Loading data from these sources is a frequent and
important task for many DBAs.
Getting ready
Spreadsheets combine data, presentation, and programs all in one file. That's perfect for power
users wanting to work quickly. Like other relational databases, PostgreSQL is mainly concerned
with the lowest level of data, so extracting just the data can present some challenges.
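One common way around this, sketched below with hypothetical names, is to export just the data from the spreadsheet as a CSV file and load it into a matching table; the table definition, file path, and the presence of a header row are all assumptions about the exported sheet:

-- Hypothetical target table matching the exported sheet's columns.
CREATE TABLE items (
    id    integer,
    name  text,
    price numeric
);
-- Server-side load of the exported CSV; the path is illustrative.
COPY items FROM '/path/to/items.csv' WITH (FORMAT csv, HEADER true);

From psql, the client-side \copy variant of the same command reads the file with the client's permissions rather than the server's, which is often more convenient.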
 