* [gentoo-user] PacketShader - firewall using GPU
@ 2011-09-23 4:06 Pandu Poluan
2011-09-23 13:49 ` Michael Mol
2011-09-23 16:46 ` James
0 siblings, 2 replies; 7+ messages in thread
From: Pandu Poluan @ 2011-09-23 4:06 UTC (permalink / raw)
To: gentoo-user
Saw this on the pfSense list:
http://shader.kaist.edu/packetshader/
anyone interested in trying?
Rgds,
* Re: [gentoo-user] PacketShader - firewall using GPU
2011-09-23 4:06 [gentoo-user] PacketShader - firewall using GPU Pandu Poluan
@ 2011-09-23 13:49 ` Michael Mol
2011-09-23 15:14 ` [gentoo-user] " James
2011-09-23 15:16 ` [gentoo-user] " Mark Knecht
2011-09-23 16:46 ` James
1 sibling, 2 replies; 7+ messages in thread
From: Michael Mol @ 2011-09-23 13:49 UTC (permalink / raw)
To: gentoo-user
On Fri, Sep 23, 2011 at 12:06 AM, Pandu Poluan <pandu@poluan.info> wrote:
> Saw this on the pfSense list:
>
> http://shader.kaist.edu/packetshader/
>
> anyone interested in trying?
I see a lot of graphs touting high throughput, but what about latency?
That's the kind of stuff that gets in my way when I'm messing with
things like VOIP.
My first thought when I saw they were using a GPU for processing was
concerns about latency:
1) Round trips between a video card and the CPU will increase latency
compared with doing the processing on the CPU. Maybe DMA between the
video card and NICs could help with this, but I don't know. Certainly
newer CPUs with on-die GPUs will have an advantage here.
2) GPGPU coding favors batch processing over small streams. That's
part of its nature, after all. That means that processed packets would
come out of the GPU side of the engine in bursts.
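To make the burstiness concrete, here's a contrived sketch (mine, not
theirs; the names and batch sizes are made up): the host has to sit on
packets until a batch fills before the GPU does any work, so the first
packet in the batch pays the whole queueing delay:

    // Hypothetical sketch of batch-oriented GPU packet processing -- not
    // PacketShader's actual code. Packets queue on the host until a batch
    // fills, then one kernel launch handles them all, so results leave
    // the GPU in bursts rather than per-packet.
    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    #define BATCH   1024   // packets per kernel launch (assumed size)
    #define PKT_LEN 64     // bytes inspected per packet (assumed)

    // Toy "classification": each thread handles one packet independently.
    __global__ void classify(const unsigned char *pkts, int *verdict, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            verdict[i] = pkts[i * PKT_LEN] & 1;  // stand-in for a rule check
    }

    int main()
    {
        unsigned char *h_pkts, *d_pkts;
        int *h_verdict, *d_verdict;
        cudaMallocHost(&h_pkts, BATCH * PKT_LEN);  // pinned, so copies DMA
        cudaMallocHost(&h_verdict, BATCH * sizeof(int));
        cudaMalloc(&d_pkts, BATCH * PKT_LEN);
        cudaMalloc(&d_verdict, BATCH * sizeof(int));
        memset(h_pkts, 0xab, BATCH * PKT_LEN);     // fake a full batch

        // Nothing moves until the whole batch is assembled -- that
        // queueing delay is the latency cost of batching.
        cudaMemcpy(d_pkts, h_pkts, BATCH * PKT_LEN, cudaMemcpyHostToDevice);
        classify<<<(BATCH + 255) / 256, 256>>>(d_pkts, d_verdict, BATCH);
        cudaMemcpy(h_verdict, d_verdict, BATCH * sizeof(int),
                   cudaMemcpyDeviceToHost);

        printf("verdict[0] = %d\n", h_verdict[0]);
        cudaFree(d_pkts); cudaFree(d_verdict);
        cudaFreeHost(h_pkts); cudaFreeHost(h_verdict);
        return 0;
    }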
They also tout a huge preallocated packet buffer, and I'm not sure
that's a good thing, either. It may or may not cause latency problems,
depending on how they use it.
They don't talk about latency at all, except for one sentence:
"Forwarding table lookup is highly memory-intensive, and GPU can
accelerate it with both latency hiding capability and bandwidth."
--
:wq
* [gentoo-user] Re: PacketShader - firewall using GPU
2011-09-23 13:49 ` Michael Mol
@ 2011-09-23 15:14 ` James
2011-09-23 15:16 ` [gentoo-user] " Mark Knecht
1 sibling, 0 replies; 7+ messages in thread
From: James @ 2011-09-23 15:14 UTC (permalink / raw)
To: gentoo-user
Michael Mol <mikemol <at> gmail.com> writes:
> > http://shader.kaist.edu/packetshader/
> > anyone interested in trying?
A firewall router based on a GPU+CPU is a great idea whose stable
implementation is probably a few years away.
Basis: GPUs use very fast memory, often with special features that
depend on the architecture. But the proof is in the pudding, i.e. such
a device being tested in a variety of scenarios. The basic problem is
that the details differ from GPU to GPU, and the necessary hardware
details are often not published in any generally accessible document.
Most chip vendors, when they do get a software firewall router working
on a chipset (GPU + CPU), will most likely want to sell a solution
rather than open-source it. It has huge commercial ramifications.
So for this to be a fruitful effort, I'd suggest waiting until you
have one of those fancy new AMD chips where the GPU and multi-core CPU
are on the same die, plus an open-source project.
Intel has nice (CPU) hardware, but video is a pig (dog-slow) on the
Intel GPU.
Nvidia has some nice software offerings, but no robust multi-core CPU
to work with the GPU on the same die. Also, Nvidia has a weak history
of open-source support. In fact, according to the website, the project
you mention may not even publish other parts of the source code. So why
would you waste your time on that code offering?
When somebody (a GPU team) gets iptables and Gentoo running on a new,
integrated (GPU + CPU) AMD chip, just use the vanilla tools via the
Gentoo organization. Then iptables just has to be modified to take
advantage of the GPU. Maybe GCC will handle this some day; maybe AMD
will open-source some internal knowledge to make it happen; maybe not.
> I see a lot of graphs touting high throughput, but what about latency?
This is a good point. Wait until the GPU and CPU are on the same
die... (think AMD). I just do not see Intel or Nvidia being the first
to make this truly a commodity (there is too much money to be made
selling proprietary solutions). For example, a high-end compiler vendor
could buy an exclusive license from Intel/Nvidia to make this a unique
and expensive offering. Think DoD contractors that limit the solution
to VME-bus-based systems (just a random thought).
> They also tout a huge preallocated packet buffer, and I'm not sure
> that's a good thing, either. It may or may not cause latency problems,
> depending on how they use it.
Traditionally, searching and sorting algorithms smoke on a GPU and
GPU-class RAM; other processes, not so much. That's why you offload to
the GPU the processes it can run much faster than the multi-core CPU
can. Both are needed most of the time.
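As a toy illustration of why lookups fly on a GPU (my own sketch, not
anything from their code; all names invented), give each lookup its own
thread and let the sheer thread count hide the memory latency:

    // One GPU thread per lookup, each doing an independent binary search
    // over a shared sorted table. Purely illustrative.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define TABLE   4096
    #define QUERIES 1024

    __global__ void lookup(const int *table, const int *keys, int *hit, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int lo = 0, hi = TABLE - 1, k = keys[i];
        hit[i] = -1;
        while (lo <= hi) {               // classic binary search, per thread
            int mid = (lo + hi) / 2;
            if (table[mid] == k) { hit[i] = mid; break; }
            if (table[mid] < k) lo = mid + 1; else hi = mid - 1;
        }
    }

    int main()
    {
        static int h_table[TABLE], h_keys[QUERIES], h_hit[QUERIES];
        for (int i = 0; i < TABLE; i++) h_table[i] = i * 2;  // sorted dummy data
        for (int i = 0; i < QUERIES; i++) h_keys[i] = i * 8;
        int *d_table, *d_keys, *d_hit;
        cudaMalloc(&d_table, sizeof h_table);
        cudaMalloc(&d_keys, sizeof h_keys);
        cudaMalloc(&d_hit, sizeof h_hit);
        cudaMemcpy(d_table, h_table, sizeof h_table, cudaMemcpyHostToDevice);
        cudaMemcpy(d_keys, h_keys, sizeof h_keys, cudaMemcpyHostToDevice);
        lookup<<<(QUERIES + 255) / 256, 256>>>(d_table, d_keys, d_hit, QUERIES);
        cudaMemcpy(h_hit, d_hit, sizeof h_hit, cudaMemcpyDeviceToHost);
        printf("key %d found at table index %d\n", h_keys[3], h_hit[3]);
        cudaFree(d_table); cudaFree(d_keys); cudaFree(d_hit);
        return 0;
    }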
> They don't talk about latency at all, except for one sentence:
I think that site is just trying to get folks to do some testing for
them. They do not seem to be 'open-source' minded, imho....
Personally, I would not waste my time. But do watch out for new
offerings from AMD..... Maybe Intel (naw, just kidding; they never
release or support anything open source until they have to....). I bet
those developers had to sign some serious NDAs with some nasty
corporate lawyers....
just my opinion
hth,
James
* Re: [gentoo-user] PacketShader - firewall using GPU
2011-09-23 13:49 ` Michael Mol
2011-09-23 15:14 ` [gentoo-user] " James
@ 2011-09-23 15:16 ` Mark Knecht
2011-09-23 15:33 ` Michael Mol
2011-09-23 15:49 ` [gentoo-user] " James
1 sibling, 2 replies; 7+ messages in thread
From: Mark Knecht @ 2011-09-23 15:16 UTC (permalink / raw)
To: gentoo-user
On Fri, Sep 23, 2011 at 6:49 AM, Michael Mol <mikemol@gmail.com> wrote:
> On Fri, Sep 23, 2011 at 12:06 AM, Pandu Poluan <pandu@poluan.info> wrote:
>> Saw this on the pfSense list:
>>
>> http://shader.kaist.edu/packetshader/
>>
>> anyone interested in trying?
>
> I see a lot of graphs touting high throughput, but what about latency?
> That's the kind of stuff that gets in my way when I'm messing with
> things like VOIP.
>
> My first thought when I saw they were using a GPU for processing was
> concerns about latency:
> 1) Round trips between a video card and the CPU will increase latency
> compared with doing the processing on the CPU. Maybe DMA between the
> video card and NICs could help with this, but I don't know. Certainly
> newer CPUs with on-die GPUs will have an advantage here.
> 2) GPGPU coding favors batch processing over small streams. That's
> part of its nature, after all. That means that processed packets would
> come out of the GPU side of the engine in bursts.
>
> They also tout a huge preallocated packet buffer, and I'm not sure
> that's a good thing, either. It may or may not cause latency problems,
> depending on how they use it.
>
> They don't talk about latency at all, except for one sentence:
> "Forwarding table lookup is highly memory-intensive, and GPU can
> accelerate it with both latency hiding capability and bandwidth."
>
> --
> :wq
While I'm not a programmer at all, I have been playing with some CUDA
programming this year. The couple of comments below are based on that
GPU framework and might differ for others.
1) I don't think the GPU latencies are much different than CPU
latencies. A lot of it can be done with DMA so that the CPU is hardly
involved once the pointers are set up. Of course it depends on the
system but the GPU is pretty close to the action so it should be quite
fast getting started.
2) The big deal with GPUs is that they really pay off when you need to
do a lot of the same calculations on different data in parallel. A
book I read + some online stuff suggested they didn't pay off speed
wise until you were doing at least 100 operations in parallel.
3) You do have to get the data into the GPU, so for things that use
fixed data blocks, like shading graphical elements, that data can be
loaded once and reused over and over. That can be very fast. In my
case it's financial data getting evaluated 1000 ways, so that's
effective. For data like a packet, I don't know how many ways there
are to evaluate it, so I cannot suggest what the value would be.
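For point 3, a stripped-down sketch of the pattern (made-up numbers and
names, not my actual code): pay for the upload once, then launch as
many evaluation passes as you like without touching the bus again:

    // Load-once / reuse-many: the big block is copied to the card a
    // single time, then every pass reuses it in place. Illustrative only.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define N (1 << 20)    // one big, fixed data block
    #define PASSES 1000    // evaluated 1000 different ways

    __global__ void evaluate(const float *data, float weight, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < N) out[i] = data[i] * weight;  // stand-in for a real model
    }

    int main()
    {
        static float h_data[N];
        for (int i = 0; i < N; i++) h_data[i] = (float)i;
        float *d_data, *d_out;
        cudaMalloc(&d_data, N * sizeof(float));
        cudaMalloc(&d_out, N * sizeof(float));
        // one expensive upload...
        cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);
        // ...amortized over many launches that never cross the bus again
        for (int p = 0; p < PASSES; p++)
            evaluate<<<(N + 255) / 256, 256>>>(d_data, 0.01f * p, d_out);
        cudaDeviceSynchronize();
        printf("done: %d passes over %d elements\n", PASSES, N);
        cudaFree(d_data); cudaFree(d_out);
        return 0;
    }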
Nonetheless, it's an interesting idea, and it certainly frees up CPU
cycles that might be better used for other things.
My Nvidia GTX 465 has 352 CUDA cores while the GS8200 has only 8, so
there can be a huge difference based on what GPU you have available.
Just some thoughts,
Mark
* Re: [gentoo-user] PacketShader - firewall using GPU
2011-09-23 15:16 ` [gentoo-user] " Mark Knecht
@ 2011-09-23 15:33 ` Michael Mol
2011-09-23 15:49 ` [gentoo-user] " James
1 sibling, 0 replies; 7+ messages in thread
From: Michael Mol @ 2011-09-23 15:33 UTC (permalink / raw)
To: gentoo-user
On Fri, Sep 23, 2011 at 11:16 AM, Mark Knecht <markknecht@gmail.com> wrote:
> On Fri, Sep 23, 2011 at 6:49 AM, Michael Mol <mikemol@gmail.com> wrote:
> While I'm not a programmer at all, I have been playing with some CUDA
> programming this year. The couple of comments below are based on that
> GPU framework and might differ for others.
>
> 1) I don't think the GPU latencies are much different than CPU
> latencies. A lot of it can be done with DMA so that the CPU is hardly
> involved once the pointers are set up. Of course it depends on the
> system but the GPU is pretty close to the action so it should be quite
> fast getting started.
As long as stuff is done wholly in the GPU, the kind of latency I was
worried about (GPU<->system RAM<->CPU) isn't a problem. The problem is
going to be anything that involves data being passed back and forth,
or decisions needing to be made by the CPU. I concur with James that
CPU+GPU parts will help a great deal in that regard.
> 2) The big deal with GPUs is that they really pay off when you need to
> do a lot of the same calculations on different data in parallel. A
> book I read + some online stuff suggested they didn't pay off speed
> wise until you were doing at least 100 operations in parallel.
>
> 3) You do have to get the data into the GPU, so for things that use
> fixed data blocks, like shading graphical elements, that data can be
> loaded once and reused over and over. That can be very fast. In my
> case it's financial data getting evaluated 1000 ways, so that's
> effective. For data like a packet, I don't know how many ways there
> are to evaluate it, so I cannot suggest what the value would be.
Yeah, that's the problem. Cache loses its utility the less you have to
revisit the same pieces of data. When they're talking about multiple
gigabits per second of throughput, cache won't be much good for more
than prefetches.
>
> Nonetheless, it's an interesting idea, and it certainly frees up CPU
> cycles that might be better used for other things.
Earlier this year, I experimented a little with how one could
implement a Turing-complete language in a branchless fashion, like on
GPGPUs*. I figure it's doable, but you waste cores and memory on
discarded results. (Similar to what happens when CPUs mispredict
branches, but worse.)
* OK, they're not branchless, but branches kill performance; I recall
the CUDA manual indicating that code has to be brought back in step
after a branch before any of the results are available. But that was
about two years ago when I read it.
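A contrived example of the discarded-results idea (not from any real
project): compute both arms for every element and keep one, so the
threads stay in lockstep:

    // Both arms are computed for every element and one result is thrown
    // away, trading wasted work for lockstep execution.
    #include <cstdio>
    #include <cuda_runtime.h>

    #define N 1024

    __global__ void branchless(const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= N) return;
        float a = x[i] * 2.0f;         // "then" arm, always computed
        float b = x[i] * 0.5f;         // "else" arm, always computed
        y[i] = (x[i] > 0.0f) ? a : b;  // select one, discard the other
    }

    int main()
    {
        static float h_x[N], h_y[N];
        for (int i = 0; i < N; i++) h_x[i] = (i % 2) ? 1.0f : -1.0f;
        float *d_x, *d_y;
        cudaMalloc(&d_x, sizeof h_x);
        cudaMalloc(&d_y, sizeof h_y);
        cudaMemcpy(d_x, h_x, sizeof h_x, cudaMemcpyHostToDevice);
        branchless<<<(N + 255) / 256, 256>>>(d_x, d_y);
        cudaMemcpy(h_y, d_y, sizeof h_y, cudaMemcpyDeviceToHost);
        printf("y[0]=%.2f  y[1]=%.2f\n", h_y[0], h_y[1]);
        cudaFree(d_x); cudaFree(d_y);
        return 0;
    }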
>
> My Nvidia GTX 465 has 352 CUDA cores while the GS8200 has only 8, so
> there can be a huge difference based on what GPU you have available.
--
:wq
* [gentoo-user] Re: PacketShader - firewall using GPU
2011-09-23 15:16 ` [gentoo-user] " Mark Knecht
2011-09-23 15:33 ` Michael Mol
@ 2011-09-23 15:49 ` James
1 sibling, 0 replies; 7+ messages in thread
From: James @ 2011-09-23 15:49 UTC (permalink / raw)
To: gentoo-user
Mark Knecht <markknecht <at> gmail.com> writes:
> 1) I don't think the GPU latencies are much different than CPU
> latencies. A lot of it can be done with DMA so that the CPU is hardly
> involved once the pointers are set up. Of course it depends on the
> system but the GPU is pretty close to the action so it should be quite
> fast getting started.
Privately, multi-core chips and GPUs are a license for some folks to
build out massive clusters using the latest FPGAs. These clusters are
very private, and the latency issue alluded to is gone; in fact it
becomes a very positive attribute, if you have large sums of cash....
> 2) The big deal with GPUs is that they really pay off when you need to
> do a lot of the same calculations on different data in parallel. A
> book I read + some online stuff suggested they didn't pay off speed
> wise until you were doing at least 100 operations in parallel.
I always knew you were very sharp, Mark; here are a few websites to
further establish what you are saying:
[1]
http://www.prnewswire.com/news-releases/passware-kit-101-cracks-rar-and-truecrypt-encryption-in-record-time-99539629.html
[2] http://www.tomshardware.com/news/nvidia-gpu-wifi-hack,6483.html
These are just the tip of the iceberg....
> 3) You do have to get the data into the GPU, so for things that use
> fixed data blocks, like shading graphical elements, that data can be
> loaded once and reused over and over. That can be very fast. In my
> case it's financial data getting evaluated 1000 ways, so that's
> effective. For data like a packet, I don't know how many ways there
> are to evaluate it, so I cannot suggest what the value would be.
When you license core technologies, put them on FPGAs or ASICs, and
have lots of money, you can build special-purpose buses that move data
around very fast, and in volume, on that custom hardware. Consumers
and businesses don't want to pay for that sort of thing, but others
are far ahead of the hacker crowd that uses massive numbers of
workstations around the net. Those massive hacker efforts use the
Internet like a bus. Others build custom buses that are faster, and
with more bandwidth, than what vendors put under a 10G Ethernet
interface. A lot of very smart folks are studying the hacker
communities with advanced hardware for analysis, like you cannot
believe.
What the original poster has proposed has been around for a long time.
Ever wonder why not much progress is being made by the related
open-source projects (compared to what's going on behind deep pockets)?
The best hope is for AMD stock to fall to a point where the owners are
truly desperate. Then AMD may be motivated to offer something that
every Linux user (worldwide) wants to go out and purchase...
just my opinion...
hth,
James
* [gentoo-user] Re: PacketShader - firewall using GPU
2011-09-23 4:06 [gentoo-user] PacketShader - firewall using GPU Pandu Poluan
2011-09-23 13:49 ` Michael Mol
@ 2011-09-23 16:46 ` James
1 sibling, 0 replies; 7+ messages in thread
From: James @ 2011-09-23 16:46 UTC (permalink / raw)
To: gentoo-user
Pandu Poluan <pandu <at> poluan.info> writes:
> Saw this on the pfSense list:
> http://shader.kaist.edu/packetshader/
> anyone interested in trying?
To make my rant complete, here are a few links for those proactive
(young and brilliant) minds:
http://netfpga.org/
http://opencores.org/
I suggest these sites because, over the years, some accomplished
programmers move into the programmable-hardware arena (FPGAs etc.) and
develop an extraordinary clarity of problem solving, much more so than
folks with a Ph.D. in hardware. Hardware is not hard to understand,
and pushing the envelope is more about your coding skills than
knowledge of hardware, imho.
James