[gentoo-user] Pipe Lines - A really basic question

public inbox for gentoo-user@lists.gentoo.org

* [gentoo-user] Pipe Lines - A really basic question
@ 2010-09-09 17:24 Matt Neimeyer
  2010-09-09 18:03 ` Etaoin Shrdlu
                   ` (3 more replies)
  0 siblings, 4 replies; 11+ messages in thread
From: Matt Neimeyer @ 2010-09-09 17:24 UTC
  To: gentoo-user

My generic question is: When I'm using a pipe line series of commands
do I use up more/less space than doing things in sequence?

For example, I have a development Gentoo VM that has a hard drive that
is too small... I wanted to move a database off of that onto another
machine but when I tried the following I filled my partition and 'evil
things' happened...

mysqldump blah...
gzip blah...

In this specific case I added another virtual drive, mounted that and
went on with life but I'm curious if I could have gotten away with the
pipe line instead. Will doing something like this still use "twice"
the space?

mysqldump | gzip > file.sql.gz

OR going back to my generic question if I pipe line like "type | sort
| unique > output" does that only use 1x or 3x the disk space?

Thanks in advance!

Matt

P.S. If the answer is "it depends", how do I know what it depends on?


* Re: [gentoo-user] Pipe Lines - A really basic question
  2010-09-09 17:24 [gentoo-user] Pipe Lines - A really basic question Matt Neimeyer
@ 2010-09-09 18:03 ` Etaoin Shrdlu
  2010-09-09 18:25 ` Andrea Conti
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 11+ messages in thread
From: Etaoin Shrdlu @ 2010-09-09 18:03 UTC
  To: gentoo-user

On Thu, 9 Sep 2010 13:24:16 -0400 Matt Neimeyer <matt@neimeyer.org> wrote:

> My generic question is: When I'm using a pipe line series of commands
> do I use up more/less space than doing things in sequence?
> 
> For example, I have a development Gentoo VM that has a hard drive that
> is too small... I wanted to move a database off of that onto another
> machine but when I tried the following I filled my partition and 'evil
> things' happened...
> 
> mysqldump blah...
> gzip blah...
> 
> In this specific case I added another virtual drive, mounted that and
> went on with life but I'm curious if I could have gotten away with the
> pipe line instead. Will doing something like this still use "twice"
> the space?
> 
> mysqldump | gzip > file.sql.gz
> 
> OR going back to my generic question if I pipe line like "type | sort
> | unique > output" does that only use 1x or 3x the disk space?
> 
> Thanks in advance!
> 
> Matt
> 
> P.S. If the answer is "it depends", how do I know what it depends on?

Pipes live in memory and do not take any disk space. Doing the same
operations one after another instead of using pipes usually needs
temporary files, which *do* take disk space.
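
A minimal sketch of the difference, with `seq` standing in for
`mysqldump` (the file names and the throwaway directory are made up for
illustration). The sequential variant keeps the uncompressed dump on
disk next to the archive; the pipeline only ever writes the archive:

```shell
# Work in a throwaway directory so the sizes are easy to compare.
workdir=$(mktemp -d)

# Sequential: the uncompressed dump sits on disk while gzip writes the archive.
seq 1 200000 > "$workdir/dump.sql"
gzip -c "$workdir/dump.sql" > "$workdir/sequential.sql.gz"

# Pipeline: only the compressed result is ever written to disk.
seq 1 200000 | gzip -c > "$workdir/piped.sql.gz"

# Peak usage is dump + archive in the first case, archive only in the second.
raw_bytes=$(wc -c < "$workdir/dump.sql" | tr -d ' ')
piped_bytes=$(wc -c < "$workdir/piped.sql.gz" | tr -d ' ')
echo "uncompressed: $raw_bytes bytes, piped archive: $piped_bytes bytes"

# Both archives decompress to the same data.
same_content=no
if [ "$(gzip -dc "$workdir/sequential.sql.gz" | cksum)" = "$(gzip -dc "$workdir/piped.sql.gz" | cksum)" ]; then
    same_content=yes
fi
rm -rf "$workdir"
```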




* Re: [gentoo-user] Pipe Lines - A really basic question
  2010-09-09 17:24 [gentoo-user] Pipe Lines - A really basic question Matt Neimeyer
  2010-09-09 18:03 ` Etaoin Shrdlu
@ 2010-09-09 18:25 ` Andrea Conti
  2010-09-09 19:19   ` Florian Philipp
  2010-09-09 19:09 ` [gentoo-user] " Florian Philipp
  2010-09-09 20:46 ` Daniel Troeder
  3 siblings, 1 reply; 11+ messages in thread
From: Andrea Conti @ 2010-09-09 18:25 UTC
  To: gentoo-user

> My generic question is: When I'm using a pipe line series of commands
> do I use up more/less space than doing things in sequence?

When you use a pipe you don't need the space to store intermediate
results between the two programs. The pipe is backed by a small
kernel-allocated RAM buffer (AFAIK 4k on older Linux kernels, 64k by
default since 2.6.11) and program execution is controlled according to
the amount of data in the buffer: the writer blocks while the buffer is
full and the reader blocks while it is empty.
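
A small sketch of that flow control, using `yes` and `head` purely as
stand-ins for a fast producer and a consumer that stops early:

```shell
# On its own, `yes` prints "y" forever. In a pipeline it can only run
# ahead of `head` by one pipe buffer; after that its writes block.
# Once `head` has read three lines and exits, the pipe is closed and
# `yes` is stopped as well (normally by SIGPIPE on its next write).
first_lines=$(yes | head -n 3)
echo "$first_lines"
```

The pipeline terminates even though the producer is endless, and nothing
is written to disk along the way.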

Not having to save intermediate results generally means that you will
need less disk space: this is especially true in the mysqldump-gzip
example as the uncompressed dump will not be written to the disk at any
stage.

Note however (this is the "it depends" part :) that piping does not
affect whatever the programs might allocate or save internally: in your
second example (which does not involve any disk writing in either case)
"sort" needs to see the complete input before producing any output, so
it will allocate enough memory to store it whether it is invoked alone
or as part of a pipeline (in which case it will also stall the
downstream pipeline section until the upstream pipe is closed).

HTH,
andrea


* Re: [gentoo-user] Pipe Lines - A really basic question
  2010-09-09 17:24 [gentoo-user] Pipe Lines - A really basic question Matt Neimeyer
  2010-09-09 18:03 ` Etaoin Shrdlu
  2010-09-09 18:25 ` Andrea Conti
@ 2010-09-09 19:09 ` Florian Philipp
  2010-09-09 20:46 ` Daniel Troeder
  3 siblings, 0 replies; 11+ messages in thread
From: Florian Philipp @ 2010-09-09 19:09 UTC
  To: gentoo-user


On 09.09.2010 19:24, Matt Neimeyer wrote:
> My generic question is: When I'm using a pipe line series of commands
> do I use up more/less space than doing things in sequence?
> 
[...]
> OR going back to my generic question if I pipe line like "type | sort
> | unique > output" does that only use 1x or 3x the disk space?
> 
> Thanks in advance!
> 
> Matt
> 
> P.S. If the answer is "it depends", how do I know what it depends on?
> 

It depends on whether you use MS-DOS or a better OS ;)
DOS is the last operating system I know of that used temporary
files for pipes. Every other system uses in-memory FIFOs
(first-in-first-out).

BTW: your last example "type | sort | uniq" can be shortened to "type |
sort -u"
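
A quick check of that equivalence; `printf` stands in for the `type`
step and the sample words are made up:

```shell
# The same input with duplicates, deduplicated both ways.
with_uniq=$(printf 'pear\napple\npear\nfig\napple\n' | sort | uniq)
with_flag=$(printf 'pear\napple\npear\nfig\napple\n' | sort -u)

# Both variables now hold the same three lines.
echo "$with_flag"
```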

Hope this helps,
Florian Philipp



* Re: [gentoo-user] Pipe Lines - A really basic question
  2010-09-09 18:25 ` Andrea Conti
@ 2010-09-09 19:19   ` Florian Philipp
  2010-09-09 20:28     ` [gentoo-user] " Grant Edwards
  0 siblings, 1 reply; 11+ messages in thread
From: Florian Philipp @ 2010-09-09 19:19 UTC
  To: gentoo-user


On 09.09.2010 20:25, Andrea Conti wrote:
> Note however (this is the "it depends" part :) that piping does not
> affect whatever the programs might allocate or save internally: in your
> second example (which does not involve any disk writing in either case)
> "sort" needs to see the complete input before producing any output, so
> it will allocate enough memory to store it whether it is invoked alone
> or as part of a pipeline (in which case it will also stall the
> downstream pipeline section until the upstream pipe is closed).
> 

When you look closer at `sort`, it is actually quite an impressive tool.
It sorts in memory for small amounts of data and switches to temporary
files for larger inputs. It can even compress those files to save disk
space.
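
A sketch of how this can be steered, assuming GNU coreutils `sort`: `-S`
caps the in-memory buffer, `-T` chooses where temporary files go, and
`--compress-program` compresses them. The 1M limit and the throwaway
directory are arbitrary values for illustration; the limit is kept small
so that an input of roughly 2 MB should no longer fit in memory:

```shell
workdir=$(mktemp -d)

# Sort about 2 MB of numbers in descending order with a 1 MB buffer,
# placing any temporary files in $workdir and compressing them with gzip.
seq 1 300000 | sort -r -n -S 1M -T "$workdir" --compress-program=gzip > "$workdir/sorted.txt"

first_line=$(head -n 1 "$workdir/sorted.txt")
last_line=$(tail -n 1 "$workdir/sorted.txt")
echo "first: $first_line, last: $last_line"
rm -rf "$workdir"
```

Pointing `-T` at a filesystem with free space is the relevant knob when
the default temporary directory sits on a partition that is nearly full.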

And it is still faster than most "business-grade" software for importing
data into data warehouses.

Throw `cut`, `paste`, `join` and `grep` into the mix and you can build
your own relational database system based on shell scripts ;)



* [gentoo-user] Re: Pipe Lines - A really basic question
  2010-09-09 19:19   ` Florian Philipp
@ 2010-09-09 20:28     ` Grant Edwards
  2010-09-10 16:34       ` Florian Philipp
  0 siblings, 1 reply; 11+ messages in thread
From: Grant Edwards @ 2010-09-09 20:28 UTC
  To: gentoo-user

On 2010-09-09, Florian Philipp <lists@f_philipp.fastmail.net> wrote:

> When you look closer at `sort`, it is actually a quite impressive
> tool. It sorts in-memory for small amounts of data and switches to
> temporary files for larger. It can even compress those files to save
> disk space.
>
> And it is still faster than most "business-grade" software for
> importing data into data warehouses.
>
> Throw `cut`, `paste`, `join` and `grep` into the mix and you can
> build your own relational database system based on shell scripts ;)

Sort of like /rdb:   http://www.rdb.com/

-- 
Grant Edwards               grant.b.edwards        Yow! Am I in Milwaukee?
                                  at               
                              gmail.com            





* Re: [gentoo-user] Pipe Lines - A really basic question
  2010-09-09 17:24 [gentoo-user] Pipe Lines - A really basic question Matt Neimeyer
                   ` (2 preceding siblings ...)
  2010-09-09 19:09 ` [gentoo-user] " Florian Philipp
@ 2010-09-09 20:46 ` Daniel Troeder
  2010-09-10 15:10   ` Paul Hartman
  2010-09-10 15:22   ` Matt Neimeyer
  3 siblings, 2 replies; 11+ messages in thread
From: Daniel Troeder @ 2010-09-09 20:46 UTC
  To: gentoo-user


On 09/09/2010 07:24 PM, Matt Neimeyer wrote:
> My generic question is: When I'm using a pipe line series of commands
> do I use up more/less space than doing things in sequence?
> 
> For example, I have a development Gentoo VM that has a hard drive that
> is too small... I wanted to move a database off of that onto another
> machine but when I tried the following I filled my partition and 'evil
> things' happened...
> 
> mysqldump blah...
> gzip blah...
> 
> In this specific case I added another virtual drive, mounted that and
> went on with life but I'm curious if I could have gotten away with the
> pipe line instead. Will doing something like this still use "twice"
> the space?
> 
> mysqldump | gzip > file.sql.gz
> 
> OR going back to my generic question if I pipe line like "type | sort
> | unique > output" does that only use 1x or 3x the disk space?
> 
> Thanks in advance!
> 
> Matt
> 
> P.S. If the answer is "it depends", how do I know what it depends on?
> 
Everyone already answered the disk space question. I just want to add
this: it also saves you lots of I/O bandwidth, because only the
compressed data gets written to disk. As I/O is the most common
bottleneck, it is often imperative to do as much as possible in a pipe.
If you're lucky it can also mean that multiple programs run at the same
time, resulting in higher throughput. You are lucky when consumer and
producer (right and left of the pipe) can work simultaneously because
the buffer is big enough. You can see this every time you (un)pack a
tar.gz.
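
Spelled out as an explicit pipeline, the tar.gz case looks roughly like
this (the directory and file names are invented for the sketch):

```shell
workdir=$(mktemp -d)
mkdir "$workdir/src" "$workdir/dest"
echo "hello from the archive" > "$workdir/src/note.txt"

# Pack: tar produces the archive stream while gzip compresses it
# concurrently; no uncompressed .tar file is written.
tar -cf - -C "$workdir" src | gzip -c > "$workdir/backup.tar.gz"

# Unpack: gzip decompresses while tar extracts from the stream.
gzip -dc "$workdir/backup.tar.gz" | tar -xf - -C "$workdir/dest"

restored=$(cat "$workdir/dest/src/note.txt")
echo "$restored"
rm -rf "$workdir"
```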

Bye,
Daniel


-- 
PGP key @ http://pgpkeys.pca.dfn.de/pks/lookup?search=0xBB9D4887&op=get
# gpg --recv-keys --keyserver hkp://subkeys.pgp.net 0xBB9D4887



* Re: [gentoo-user] Pipe Lines - A really basic question
  2010-09-09 20:46 ` Daniel Troeder
@ 2010-09-10 15:10   ` Paul Hartman
  2010-09-10 15:22   ` Matt Neimeyer
  1 sibling, 0 replies; 11+ messages in thread
From: Paul Hartman @ 2010-09-10 15:10 UTC
  To: gentoo-user

On Thu, Sep 9, 2010 at 3:46 PM, Daniel Troeder <daniel@admin-box.com> wrote:
> On 09/09/2010 07:24 PM, Matt Neimeyer wrote:
>> My generic question is: When I'm using a pipe line series of commands
>> do I use up more/less space than doing things in sequence?
>>
>> For example, I have a development Gentoo VM that has a hard drive that
>> is too small... I wanted to move a database off of that onto another
>> machine but when I tried the following I filled my partition and 'evil
>> things' happened...
>>
>> mysqldump blah...
>> gzip blah...
>>
>> In this specific case I added another virtual drive, mounted that and
>> went on with life but I'm curious if I could have gotten away with the
>> pipe line instead. Will doing something like this still use "twice"
>> the space?
>>
>> mysqldump | gzip > file.sql.gz
>>
>> OR going back to my generic question if I pipe line like "type | sort
>> | unique > output" does that only use 1x or 3x the disk space?
>>
>> Thanks in advance!
>>
>> Matt
>>
>> P.S. If the answer is "it depends", how do I know what it depends on?
>>
> Everyone already answered the disk space question. I want to add just
> this: It also saves you lots of i/o-bandwidth: only the compressed data
> gets written to disk. As i/o is the most common bottleneck, it is often
> an imperative to do as much as possible in a pipe. If you're lucky it
> can also mean, that multiple programs run at the same time, resulting in
> higher throughput. Lucky is, when consumer and producer (right and left
> of pipe) can work simultaneously because the buffer is big enough. You
> can see this every time you (un)pack a tar.gz.

And if you have a huge amount of data where compression causes CPU to
become the bottleneck you can use something like pbzip2 which uses all
CPUs/cores in parallel to speed up [de]compression. :)
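
A sketch of dropping it into a pipeline, with `seq` standing in for a
large dump. The fallback to `bzip2` and then `gzip` is only there so the
example runs on machines without pbzip2; all three accept `-c` (write to
stdout) and `-d` (decompress):

```shell
# Prefer pbzip2 when it is installed; otherwise fall back.
if command -v pbzip2 >/dev/null 2>&1; then
    compressor=pbzip2
elif command -v bzip2 >/dev/null 2>&1; then
    compressor=bzip2
else
    compressor=gzip
fi

workdir=$(mktemp -d)
# Compress the stream, then decompress it again and count the lines.
seq 1 100000 | "$compressor" -c > "$workdir/data.out"
restored_lines=$("$compressor" -dc "$workdir/data.out" | wc -l | tr -d ' ')
echo "$compressor restored $restored_lines lines"
rm -rf "$workdir"
```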




* Re: [gentoo-user] Pipe Lines - A really basic question
  2010-09-09 20:46 ` Daniel Troeder
  2010-09-10 15:10   ` Paul Hartman
@ 2010-09-10 15:22   ` Matt Neimeyer
  1 sibling, 0 replies; 11+ messages in thread
From: Matt Neimeyer @ 2010-09-10 15:22 UTC
  To: gentoo-user

Thanks all for your help! I knew it was something simple I "should" have known.

Matt

On Thu, Sep 9, 2010 at 4:46 PM, Daniel Troeder <daniel@admin-box.com> wrote:
> On 09/09/2010 07:24 PM, Matt Neimeyer wrote:
>> My generic question is: When I'm using a pipe line series of commands
>> do I use up more/less space than doing things in sequence?




* Re: [gentoo-user] Re: Pipe Lines - A really basic question
  2010-09-09 20:28     ` [gentoo-user] " Grant Edwards
@ 2010-09-10 16:34       ` Florian Philipp
  2010-09-10 18:33         ` Grant Edwards
  0 siblings, 1 reply; 11+ messages in thread
From: Florian Philipp @ 2010-09-10 16:34 UTC
  To: gentoo-user


On 09.09.2010 22:28, Grant Edwards wrote:
> On 2010-09-09, Florian Philipp <lists@f_philipp.fastmail.net> wrote:
> 
>> When you look closer at `sort`, it is actually a quite impressive
>> tool. It sorts in-memory for small amounts of data and switches to
>> temporary files for larger. It can even compress those files to save
>> disk space.
>>
>> And it is still faster than most "business-grade" software for
>> importing data into data warehouses.
>>
>> Throw `cut`, `paste`, `join` and `grep` into the mix and you can
>> build your own relational database system based on shell scripts ;)
> 
> Sort of like /rdb:   http://www.rdb.com/
> 

Interesting. I've just read the paper they have posted.

You know what I'd really like to do? Build a graphical dataflow-centric
programming language for generating shell scripts. Since dataflows are
the real strength of shells, I figure it would be a neat tool for
improving more complex tasks. Usually I resort to temporary files when
stuff gets more complicated than a simple sequential pipe. That really
hurts performance. A more abstract representation could really help in
those situations.
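
For what it's worth, a named pipe can sometimes stand in for the
temporary file when one stream has to feed two consumers. A rough
sketch with a made-up task (count the lines and also grab the last
one), where `seq` plays the producer and the FIFO path is arbitrary:

```shell
workdir=$(mktemp -d)
mkfifo "$workdir/fifo"

# Consumer 1 reads its copy of the stream from the FIFO in the background.
wc -l < "$workdir/fifo" > "$workdir/count" &

# tee duplicates the stream: one copy into the FIFO, one down the pipe
# to consumer 2. The stream itself is never stored on disk.
last_line=$(seq 1 1000 | tee "$workdir/fifo" | tail -n 1)

# Wait for the background consumer before reading its result.
wait
line_count=$(tr -d ' ' < "$workdir/count")
echo "lines: $line_count, last: $last_line"
rm -rf "$workdir"
```

Only the tiny result of `wc` is written to a file here; the data
flowing between the commands stays in pipe buffers.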

Well, I figure someone has already done this with Eclipse GMF or
something like that and I just don't know it. Well, whatever. Nice to
know such stuff exists, though.

Thanks for the pointer ;)


* [gentoo-user] Re: Pipe Lines - A really basic question
  2010-09-10 16:34       ` Florian Philipp
@ 2010-09-10 18:33         ` Grant Edwards
  0 siblings, 0 replies; 11+ messages in thread
From: Grant Edwards @ 2010-09-10 18:33 UTC
  To: gentoo-user

On 2010-09-10, Florian Philipp <lists@f_philipp.fastmail.net> wrote:
> On 09.09.2010 22:28, Grant Edwards wrote:

>>> Throw `cut`, `paste`, `join` and `grep` into the mix and you can
>>> build your own relational database system based on shell scripts ;)
>> 
>> Sort of like /rdb:   http://www.rdb.com/
>
> Interesting. I've just read the paper they have posted.

About 12 years ago, I bought a copy and used it to track bugs and
failures for a product that was in beta.  The installer and
node-locked licensing were awkward and slightly buggy.  Once installed,
it was pretty easy to use.  I even had a simple web UI using vanilla
cgi shellscripts, and a report generator that generated output via
LaTeX.

-- 
Grant Edwards               grant.b.edwards        Yow! Is it 1974?  What's
                                  at               for SUPPER?  Can I spend
                              gmail.com            my COLLEGE FUND in one
                                                   wild afternoon??




