From: Fred Marshall on
Perhaps I should post this elsewhere but we speak the same language
here. I may have asked a similar question some time ago but now I have
a new perspective and want to investigate.

I have a wastewater process that's being sampled periodically (uniform
sampling for what it's worth).
The sample rate is way too low to avoid aliasing but the samples are
real enough and the data is continuously available and very likely not
amenable to being sampled more often (economics).

It's a bit like sampling a random series except that I "know" there is
an underlying pattern that repeats each day with variable amplitude no
doubt. That, plus transients, would be the highest frequency content
and seasonal things are the lowest frequency content which I'm not too
worried about. And, while I'd like to know when transients happen and
how big they are, I'm afraid that's out of the question.

In fact, what's of value here is to estimate how much plant capacity is
being "used up". By my reckoning, 6 months of data during our peak
months is a good averaging period - as it's the peak months that
determine our capacity "use" for regulatory purposes.
In the shorter term, the numbers are used for determining charges for
overly high concentrations, shared use, etc.

To make things a bit more complicated, the regulatory agency has us
report the weekly data on a monthly basis (actually here there are 2
samples per week) and average it for the month.
If there are 3 contiguous months with these averages exceeding our
"capacity" or some large fraction of it, then we are put on notice that
planning for future capacity must begin. So, this is one "measure"
that's in concrete. But, I digress a bit .....
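
A minimal sketch of that reporting rule, for illustration only. The function names, month labels, sample values and the limit are all hypothetical, and it assumes the samples arrive in chronological order and that "exceeding" means strictly greater than the limit.

```python
from statistics import mean

def monthly_averages(samples):
    """Group (month_label, value) pairs by month and average each group.

    Assumes samples are in chronological order, so the returned list of
    averages is in month order too.
    """
    by_month = {}
    for month, value in samples:
        by_month.setdefault(month, []).append(value)
    return [mean(values) for values in by_month.values()]

def on_notice(averages, limit, run_length=3):
    """Return True if `run_length` contiguous monthly averages exceed `limit`."""
    run = 0
    for avg in averages:
        run = run + 1 if avg > limit else 0
        if run >= run_length:
            return True
    return False

# Hypothetical data: two samples per month, just to keep the example short.
samples = [
    ("2010-05", 80), ("2010-05", 90),
    ("2010-06", 95), ("2010-06", 105),
    ("2010-07", 110), ("2010-07", 100),
    ("2010-08", 120), ("2010-08", 100),
]
averages = monthly_averages(samples)
flagged = on_notice(averages, limit=90)
```

With only a handful of samples per month, each average carries a good deal of uncertainty, so a run of three can be triggered or missed by chance.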

Here is my question:

Instead of worrying about aliasing which is where I go to first of
course, is there a statistical measure that might help me better
understand the "quality" of our numbers or how much variation is
"expected" given those numbers?
For example, given 4 to 8 weeks of data (4 to 8 samples), what can be
said about the data set in a statistical sense? How might one best put the
answer to use in a case like this?

Where should I be looking?

Fred
From: jim on
I didn't see where you reveal what is being sampled. Is it how full a tank
is? How much fluid is flowing in a pipe?

Fred Marshall wrote:

> Perhaps I should post this elsewhere but we speak the same language
> here. I may have asked a similar question some time ago but now I have
> a new perspective and want to investigate.
>
> I have a wastewater process that's being sampled periodically (uniform
> sampling for what it's worth).
> The sample rate is way too low to avoid aliasing but the samples are
> real enough and the data is continuously available and very likely not
> amenable to being sampled more often (economics).
>
> It's a bit like sampling a random series except that I "know" there is
> an underlying pattern that repeats each day with variable amplitude no
> doubt. That, plus transients, would be the highest frequency content
> and seasonal things are the lowest frequency content which I'm not too
> worried about. And, while I'd like to know when transients happen and
> how big they are, I'm afraid that's out of the question.
>
> In fact, what's of value here is to estimate how much plant capacity is
> being "used up". By my reckoning, 6 months of data during our peak
> months is a good averaging period - as it's the peak months that
> determine our capacity "use" for regulatory purposes.
> In the shorter term, the numbers are used for determining charges for
> overly high concentrations, shared use, etc.
>
> To make things a bit more complicated, the regulatory agency has us
> report the weekly data on a monthly basis (actually here there are 2
> samples per week) and average it for the month.

Depending on what is being sampled that could be a complete accounting or
an incomplete accounting of usage. If each sample records how much was used
since the last sample was taken, then when you add them together you have
complete accounting of the usage for the month. If all that the sample is
measuring is the instantaneous usage at the instant the sample is taken
then you have a very incomplete accounting of usage and could make it mean
just about anything you want it to.
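
To illustrate the difference, here is a rough sketch with an invented flow pattern: a mean of 100 units with a daily swing of 50. The pattern, the sampling hour and the twice-weekly schedule are assumptions, not anything taken from the plant. Grab samples always taken at the same clock time see the same phase of the daily cycle, while a totalized reading recovers the true average.

```python
from math import pi, sin
from statistics import mean

def flow_rate(t_hours):
    """Hypothetical flow: mean of 100 with a +/-50 daily sinusoidal swing."""
    return 100.0 + 50.0 * sin(2.0 * pi * t_hours / 24.0)

DAYS = 28
SAMPLE_HOUR = 6  # assumed: grab samples always taken at the same time of day

# Twice-weekly instantaneous samples (days 0 and 3 of each week).
sample_times = [24 * d + SAMPLE_HOUR for d in range(DAYS) if d % 7 in (0, 3)]
inst_mean = mean([flow_rate(t) for t in sample_times])

# "Totalized" reading, approximated here by averaging hourly values.
true_mean = mean([flow_rate(t) for t in range(24 * DAYS)])
```

Here `inst_mean` sits at the peak of the daily cycle rather than at `true_mean`. Changing `SAMPLE_HOUR` moves it anywhere within the daily swing, which is the "make it mean just about anything" problem.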

-jim



>
> If there are 3 contiguous months with these averages exceeding our
> "capacity" or some large fraction of it, then we are put on notice that
> planning for future capacity must begin. So, this is one "measure"
> that's in concrete. But, I digress a bit .....
>
> Here is my question:
>
> Instead of worrying about aliasing which is where I go to first of
> course, is there a statistical measure that might help me better
> understand the "quality" of our numbers or how much variation is
> "expected" given those numbers?
> For example, given 4 to 8 weeks of data (4 to 8 samples), what can be
> said about the data set in a statistical sense? How might one best put the
> answer to use in a case like this?
>
> Where should I be looking?
>
> Fred

From: Steve Pope on
Fred Marshall <fmarshallx(a)remove_the_xacm.org> wrote:

>Instead of worrying about aliasing which is where I go to first of
>course, is there a statistical measure that might help me better
>understand the "quality" of our numbers or how much variation is
>"expected" given those numbers?
>For example, given 4 to 8 weeks of data (4 to 8 samples), what can be
>said about the data set in a statistical sense? How might one best put the
>answer to use in a case like this?
>
>Where should I be looking?

Something like a Student's t-test can tell you if a sample
or group of samples is out-of-line.
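
As a rough sketch of the idea, a t-based 95% confidence interval for the mean of 4 to 8 samples could look like this. It assumes the samples are independent and roughly normal, which may not hold if they are all taken at the same phase of a daily cycle. The function name and the small lookup table are for illustration only; the table holds standard two-sided 95% critical values.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_95 = {3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447, 7: 2.365}

def mean_ci_95(values):
    """Return (low, high), a 95% confidence interval for the mean.

    Only covers 4 to 8 samples, matching the lookup table above.
    """
    dof = len(values) - 1
    if dof not in T_95:
        raise ValueError("this sketch only covers 4 to 8 samples")
    centre = mean(values)
    half_width = T_95[dof] * stdev(values) / sqrt(len(values))
    return centre - half_width, centre + half_width

# Hypothetical readings, four weekly samples.
low, high = mean_ci_95([10, 12, 14, 16])
```

With four samples the interval is wide, and a new reading or monthly average that falls well outside it would be a candidate for "out-of-line".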

(I think I may have said the same thing, the last time you
asked a similar question.)

Steve
From: Jerry Avins on
On 7/13/2010 12:31 PM, Fred Marshall wrote:
> Perhaps I should post this elsewhere but we speak the same language
> here. I may have asked a similar question some time ago but now I have
> a new perspective and want to investigate.
>
> I have a wastewater process that's being sampled periodically (uniform
> sampling for what it's worth).
> The sample rate is way too low to avoid aliasing but the samples are
> real enough and the data is continuously available and very likely not
> amenable to being sampled more often (economics).
>
> It's a bit like sampling a random series except that I "know" there is
> an underlying pattern that repeats each day with variable amplitude no
> doubt. That, plus transients, would be the highest frequency content and
> seasonal things are the lowest frequency content which I'm not too
> worried about. And, while I'd like to know when transients happen and
> how big they are, I'm afraid that's out of the question.
>
> In fact, what's of value here is to estimate how much plant capacity is
> being "used up". By my reckoning, 6 months of data during our peak
> months is a good averaging period - as it's the peak months that
> determine our capacity "use" for regulatory purposes.
> In the shorter term, the numbers are used for determining charges for
> overly high concentrations, shared use, etc.
>
> To make things a bit more complicated, the regulatory agency has us
> report the weekly data on a monthly basis (actually here there are 2
> samples per week) and average it for the month.
> If there are 3 contiguous months with these averages exceeding our
> "capacity" or some large fraction of it, then we are put on notice that
> planning for future capacity must begin. So, this is one "measure"
> that's in concrete. But, I digress a bit .....
>
> Here is my question:
>
> Instead of worrying about aliasing which is where I go to first of
> course, is there a statistical measure that might help me better
> understand the "quality" of our numbers or how much variation is
> "expected" given those numbers?
> For example, given 4 to 8 weeks of data (4 to 8 samples), what can be
> said about the data set in a statistical sense? How might one best put the
> answer to use in a case like this?
>
> Where should I be looking?

Other things being equal, clustering should follow a Poisson
distribution. If you measure flow -- a quantity that can be heavily
influenced by rainfall -- only twice a week, how do you bill equitably?

Jerry
--
Engineering is the art of making what you want from things you can get.
From: Fred Marshall on
Jerry Avins wrote:

>
> Other things being equal, clustering should follow a Poisson
> distribution. If you measure flow -- a quantity that can be heavily
> influenced by rainfall -- only twice a week, how do you bill equitably?
>
> Jerry

Jerry,

I don't imagine that we bill entirely "equitably" - more like "agreeably".

We measure flow continuously to get the volume, and we sample the
concentration once or twice a week.

The concentration is assumed to apply for the entire measured volume
between concentration samples. So, one may say that we sample loading
in that fashion.
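
In code, that loading calculation is just volume times concentration for each interval. A minimal sketch, where the units (cubic metres and mg/L), the function name and the numbers are assumptions for illustration, and each concentration is paired with the volume totalized for its interval:

```python
def loading_kg(volumes_m3, concentrations_mg_per_l):
    """Loading per interval in kilograms.

    1 mg/L is 1 g/m^3, so volume (m^3) times concentration (mg/L) gives
    grams; dividing by 1000 gives kilograms.
    """
    return [v * c / 1000.0 for v, c in zip(volumes_m3, concentrations_mg_per_l)]

# Hypothetical: two intervals with totalized volumes and one sample each.
loads = loading_kg([1000, 2000], [250, 300])
total_load = sum(loads)
```

The volume term is a complete accounting because it is totalized. The uncertainty comes from letting one concentration sample stand in for the whole interval.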

I think I answered my own question to the point where I can deal with it:

We have the weekly or twice-weekly samples and have computed monthly
averages - as the latter have some regulatory importance.
You might consider these monthly averages to be lowpassed versions of
the samples.
Then, one can compute the distribution of outcomes and infer(?) the
amount of loading.

My "backwards" sort of reasoning goes like this:
We take a set of samples.
We determine the distribution of those sample values over a suitably
long time such that daily and even annual variations are included in the
distribution.
The caution here is that trends get wiped out - so a suitable time frame
or set of them needs to be selected that has some meaning where gross
trends are concerned.
If we assume that the distribution represents a reasonable estimate of
ground truth, then we can infer in quantitative terms what's happening -
such as over-loading (i.e. loading that's above some determined threshold).
It's surely not "perfect" but it's better than nothing ... I think.
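
A small sketch of that reasoning, treating the collected samples as an empirical distribution. The function names, the data and the threshold are hypothetical, and the quantile uses the simple nearest-rank rule.

```python
from math import ceil

def exceedance_fraction(values, threshold):
    """Fraction of samples strictly above the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)

def empirical_quantile(values, q):
    """Nearest-rank quantile: smallest value with at least q of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, ceil(q * len(ordered)))
    return ordered[rank - 1]

# Hypothetical loading samples collected over a suitably long period.
history = [70, 85, 90, 95, 100, 105, 110, 120, 80, 75]
frac_over = exceedance_fraction(history, threshold=100)
median = empirical_quantile(history, 0.5)
```

Pooling all the samples like this discards their order, so any trend is invisible. Repeating the calculation over separate time windows is one way to keep gross trends in view.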

Fred