To: gentoo-user@lists.gentoo.org
From: Grant Edwards
Subject: [gentoo-user] Re: OT Best way to compress files with digits
Date: Fri, 31 Oct 2014 20:25:32 +0000 (UTC)
References: <20141031153659.GA13217@solfire> <5453AE7D.8060505@ramses-pyramidenbau.de> <20141031155917.GB13217@solfire> <20141031185545.GA536@grusum.endjinn.de>
User-Agent: slrn/1.0.1 (Linux)
List-Id: Gentoo Linux mail
Reply-to: gentoo-user@lists.gentoo.org

On 2014-10-31, Rich Freeman wrote:
> On Fri, Oct 31, 2014 at 2:55 PM, David Haller wrote:
>>
>> On Fri, 31 Oct 2014, Rich Freeman wrote:
>>
>>>I can't imagine that any tool will do much better than something like
>>>lzo, gzip, xz, etc. You'll definitely benefit from compression though
>>>- your text files full of digits are encoding 3.3 bits of information
>>>in an 8-bit ascii character and even if the order of digits in pi can
>>>be treated as purely random just about any compression algorithm is
>>>going to get pretty close to that 3.3 bits per digit figure.
>>
>> Good estimate:
>>
>> $ calc '101000/(8/3.3)'
>> 41662.5
>> and I get from (lzip)
>> $ calc 44543*8/101000
>> 3.528... (bits/digit)
>> to zip:
>> $ calc 49696*8/101000
>> ~3.93 (bits/digit)
>
> Actually, I'm surprised how far off of this the various methods are.
> I was expecting SOME overhead, but not this much.
>
> A fairly quick algorithm would be to encode every possible set of 96
> digits into a 40 byte code (that is just a straight decimal-binary
> conversion). Then read a "word" at a time and translate it. This
> will only waste 0.011 bits per digit.

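For reference, the 96-digit/40-byte scheme quoted above can be sketched roughly as follows. This is only an illustration, not anyone's actual implementation: the names pack_digits, unpack_digits and wasted_bits_per_digit are made up here, and the sketch assumes the input length is an exact multiple of 96 digits. It works because 10**96 < 2**320, so any 96-digit number fits in 40 bytes.

```python
import math

DIGITS_PER_WORD = 96   # decimal digits per input "word"
BYTES_PER_WORD = 40    # 320 bits; enough because 10**96 < 2**320


def pack_digits(text: str) -> bytes:
    """Pack ASCII decimal digits, 96 digits per 40-byte big-endian word."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError("input must consist only of the digits 0-9")
    if len(text) % DIGITS_PER_WORD:
        raise ValueError("this sketch needs a multiple of 96 digits")
    out = bytearray()
    for i in range(0, len(text), DIGITS_PER_WORD):
        # Straight decimal-to-binary conversion of one 96-digit word.
        word = int(text[i:i + DIGITS_PER_WORD])
        out += word.to_bytes(BYTES_PER_WORD, "big")
    return bytes(out)


def unpack_digits(data: bytes) -> str:
    """Inverse of pack_digits; restores leading zeros in each word."""
    if len(data) % BYTES_PER_WORD:
        raise ValueError("data length must be a multiple of 40 bytes")
    words = []
    for i in range(0, len(data), BYTES_PER_WORD):
        value = int.from_bytes(data[i:i + BYTES_PER_WORD], "big")
        words.append(str(value).zfill(DIGITS_PER_WORD))
    return "".join(words)


def wasted_bits_per_digit() -> float:
    """Bits spent per digit minus the log2(10) bits a digit carries."""
    return BYTES_PER_WORD * 8 / DIGITS_PER_WORD - math.log2(10)
```

The overhead works out to 320/96 - log2(10), which agrees with the quoted 0.011 bits per digit. Note that pack_digits rejects any byte that is not one of the ten digit characters.
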
You're cheating. The algorithm you tested will compress strings of
arbitrary 8-bit values. The algorithm you proposed will only compress
strings of bytes where each byte can have only one of 10 values.

-- 
Grant Edwards               grant.b.edwards        Yow! I want another
                                  at               RE-WRITE on my CEASAR
                              gmail.com            SALAD!!