Reply by Don Y November 29, 2020
On 11/29/2020 2:16 AM, George Neuner wrote:
> On Sat, 28 Nov 2020 00:47:00 -0700, Don Y <blockedofcourse@foo.invalid> wrote:
>
>> On 11/27/2020 5:40 AM, Wojciech Zabolotny wrote:
>>
>>> Here you are: https://groups.google.com/g/alt.sources/c/YeeAV3fBAVc/m/AZgPoFxS4NYJ
>>> The Python code has completely removed indentation.
>>
>> Indentation and whitespace /tend/ to be insignificant to the operation of the code. Of course, presence in string literals is a different story -- where even replacing tabs with spaces is a hazard.
>
> In Python, indentation is required syntax: in general, it is an error for code in the same scope not to be vertically aligned.
Sorry, I didn't even examine the "content" of the archive; rather, concentrated on the "SHAR wrapper" as it was quite obviously corrupted.
> However, with a nested 'if-else', logic actually depends on the indentation:
>
>     if <expr1>:
>         <statements1>
>         if <expr2>:
>             <statements2>
>     else:
>
> is very different from
>
>     if <expr1>:
>         <statements1>
>         if <expr2>:
>             <statements2>
>         else:
>
> In C the 'else' goes to the nearest 'if' regardless of whitespace. In Python, the 'else' goes to the nearest 'if' with which it is vertically aligned.
Yes. I dislike Python as my naming and coding styles rely on long logical lines. I prefer to let a pretty-printer clean up my code to my own coding standards (indents, braces, function templates, etc.) than to let the language dictate what my code HAS TO look like. [I most often don't write in an IDE so can't rely on the "editor" to "correct" formatting for me if, for example, I prepend an "if" to a block of code or wrap it into some other explicit block]
> Significant whitespace sucks!
There are still places where a space is not a space and you have to deal with it. I frequently find tabs and spaces interchanged for each other when cutting and pasting across systems; the machine sees things that the human doesn't care about. Try CONCLUSIVELY sorting out whether you're looking at " \t", " " or "\t " (or variations thereof) from a paper printout! But, there are also annoyances with things as banal as typefaces that needlessly confound. Or, displays that have opted to use particular glyphs that can't readily be resolved as being rightside up or upside down. Is "529" five hundred and twenty nine? Or, six hundred and twenty five?
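On screen, at least (a paper printout is another matter), the tab-versus-space question can be settled by making each whitespace character visible. A minimal Python sketch; the function name `show_whitespace` and the `<SP>`/`<TAB>` markers are illustrative choices, not anything from this thread:

```python
def show_whitespace(text: str) -> str:
    """Return text with every space and tab replaced by a visible marker."""
    markers = {" ": "<SP>", "\t": "<TAB>"}
    return "".join(markers.get(ch, ch) for ch in text)


# Three strings that can look identical in an editor or on paper.
samples = [" \t", "  ", "\t "]
for sample in samples:
    print(show_whitespace(sample))
```

Each sample prints as a different marker sequence, so the difference the machine sees becomes one the human can see as well.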
Reply by Don Y November 29, 2020
On 11/29/2020 1:57 AM, George Neuner wrote:
> On Sat, 28 Nov 2020 00:53:43 -0700, Don Y <blockedofcourse@foo.invalid> wrote:
>
>> On 11/26/2020 9:14 PM, George Neuner wrote:
>>>
>>> On Thu, 26 Nov 2020 19:01:33 -0700, Don Y <blockedofcourse@foo.invalid> wrote:
>>
>> Hi George!
>>
>> Have not heard from you in a while -- was beginning to think that you may have been coviderated! Hopefully, that's not the case (?)
>
> Nope. I had a viral flu in early 2018 that had eerily similar symptoms to what is claimed for Covid-19: I was really sick with respiratory problems for ~5 weeks, and it was ~14 weeks before I really felt well again. I was never hospitalized, so that virus was never identified, but I'm hoping that was a coronavirus because some studies in Europe found that prior exposure to other coronaviruses *may* give some increased resistance to this one.
>
> In any event, I don't have your current email.
<frown> I was evaluating lawyers and their ilk (good use of that word in that context) a few months back and "consumed" several email addresses in the process -- giving them out "temporarily" and then canceling the accounts once I'd made up my mind to cut off further communication from the "undesirables" (Q: are ANY of them "desirables" :> ) I thought I'd picked accounts that I wasn't actively using. But, may have screwed up. I'll check my mail archive to see what you were using to see if it was affected. In either case, you should have a couple of addresses for me (?)
>>>> Are you sure the "corruption" can't be stripped from the post with a filter (script)?
>>>
>>> The junk is HTML formatting. The worry is that things like C++ source legitimately may contain angle bracket delimited text. You'd need a smart filter that understands HTML tags.
>>
>> Or, scrape the posts manually? E.g., highlight text in browser, copy, paste?
>
> Laborious if someone posted a long program.
Of course. My point was that the "content" isn't really "lost", just less easily accessed! (I had to resort to scans of much of my earliest work to get them back into electronic form)
>> If posted as an "image" of text (to deliberately hinder capture), a screen capture program feeding an OCR... and manual touch-up.
>
> Yuck! On average OCR still makes ~1 mistake per line.
I've not seen that sort of problem with good images. Much worse with scanned stuff (esp if scanned at too low resolution).
In any case, it appears that many of the delimiters that SHAR introduces are arbitrarily removed from those posts. Perhaps Google treats a leading nonspace character as indicating an indent level in quoting? (you can specify which character to use in many MUAs)
>> Though, having seen Wojciech's example, it appears that there is more involved than just eliding HTML tags! I've not actively studied the (apparent) transformation to try to codify the rules that may have been applied...
>
> The problem there is Python. For almost any other language, your idea of scraping it manually would work. For Python, you have to understand the logic to reinstate the required indentation.
>
> I have always been opposed to significant whitespace in a language.
Reply by Paul Rubin November 29, 2020
George Neuner <gneuner2@comcast.net> writes:
> Significant whitespace sucks!
You'll love my new language "Point Blank". Its file extension is a space character. There is also my Haskell dialect for embedded microprocessors. It is called Control-H. Its file extension is a backspace. ;-)
Reply by George Neuner November 29, 2020
On Sat, 28 Nov 2020 00:47:00 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 11/27/2020 5:40 AM, Wojciech Zabolotny wrote:
>
>> Here you are: https://groups.google.com/g/alt.sources/c/YeeAV3fBAVc/m/AZgPoFxS4NYJ
>> The Python code has completely removed indentation.
>
>Indentation and whitespace /tend/ to be insignificant to the operation of the code. Of course, presence in string literals is a different story -- where even replacing tabs with spaces is a hazard.
In Python, indentation is required syntax: in general, it is an error for code in the same scope not to be vertically aligned.

However, with a nested 'if-else', logic actually depends on the indentation:

    if <expr1>:
        <statements1>
        if <expr2>:
            <statements2>
    else:

is very different from

    if <expr1>:
        <statements1>
        if <expr2>:
            <statements2>
        else:

In C the 'else' goes to the nearest 'if' regardless of whitespace. In Python, the 'else' goes to the nearest 'if' with which it is vertically aligned.

Significant whitespace sucks!

George
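The dependence of logic on indentation can be shown with a small runnable sketch. The two functions below are identical except for the column of the `else`; the function names and the result strings are purely illustrative:

```python
def else_on_outer(a: bool, b: bool) -> str:
    result = "nothing ran"
    if a:
        result = "outer if"
        if b:
            result = "inner if"
    else:  # aligned with the outer 'if'
        result = "else"
    return result


def else_on_inner(a: bool, b: bool) -> str:
    result = "nothing ran"
    if a:
        result = "outer if"
        if b:
            result = "inner if"
        else:  # aligned with the inner 'if'
            result = "else"
    return result


# Walk the full truth table and show both results side by side.
for a in (False, True):
    for b in (False, True):
        print(a, b, else_on_outer(a, b), else_on_inner(a, b))
```

The two functions agree only when `a` and `b` are both true. For every other input the results differ, which is why stripped indentation cannot be restored mechanically.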
Reply by George Neuner November 29, 2020
On Sat, 28 Nov 2020 00:53:43 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 11/26/2020 9:14 PM, George Neuner wrote:
>>
>> On Thu, 26 Nov 2020 19:01:33 -0700, Don Y <blockedofcourse@foo.invalid> wrote:
>
>Hi George!
>
>Have not heard from you in a while -- was beginning to think that you may have been coviderated! Hopefully, that's not the case (?)
Nope. I had a viral flu in early 2018 that had eerily similar symptoms to what is claimed for Covid-19: I was really sick with respiratory problems for ~5 weeks, and it was ~14 weeks before I really felt well again. I was never hospitalized, so that virus was never identified, but I'm hoping that was a coronavirus because some studies in Europe found that prior exposure to other coronaviruses *may* give some increased resistance to this one.
In any event, I don't have your current email.
>>> Are you sure the "corruption" can't be stripped from the post with a filter (script)?
>>
>> The junk is HTML formatting. The worry is that things like C++ source legitimately may contain angle bracket delimited text. You'd need a smart filter that understands HTML tags.
>
>Or, scrape the posts manually? E.g., highlight text in browser, copy, paste?
Laborious if someone posted a long program.
>If posted as an "image" of text (to deliberately hinder capture), a screen capture program feeding an OCR... and manual touch-up.
Yuck! On average OCR still makes ~1 mistake per line.
>Though, having seen Wojciech's example, it appears that there is more involved than just eliding HTML tags! I've not actively studied the (apparent) transformation to try to codify the rules that may have been applied...
The problem there is Python. For almost any other language, your idea of scraping it manually would work. For Python, you have to understand the logic to reinstate the required indentation.
I have always been opposed to significant whitespace in a language.
George
Reply by Don Y November 28, 2020
On 11/26/2020 9:14 PM, George Neuner wrote:
> On Thu, 26 Nov 2020 19:01:33 -0700, Don Y <blockedofcourse@foo.invalid> wrote:
Hi George! Have not heard from you in a while -- was beginning to think that you may have been coviderated! Hopefully, that's not the case (?)
>> On 11/26/2020 4:11 PM, Wojciech Zabołotny wrote:
>>> A few Usenet groups allowed users to post their source code as shar archive. The Google Groups website supported access to those groups, viewing the message in the original (raw) format and unpacking the sources. Unfortunately, last update of Google Groups has dropped a possibility to access the original of the Usenet posts. The "formatted" (in fact corrupted) version of the message does not allow to unpack the (now damaged) shar archive.
>>
>> I don't understand what you mean by "corrupted"? Do you have a pointer to an example that I can examine (without a google login)?
>>
>> Are you sure the "corruption" can't be stripped from the post with a filter (script)?
>
> The junk is HTML formatting. The worry is that things like C++ source legitimately may contain angle bracket delimited text. You'd need a smart filter that understands HTML tags.
Or, scrape the posts manually? E.g., highlight text in browser, copy, paste? If posted as an "image" of text (to deliberately hinder capture), a screen capture program feeding an OCR... and manual touch-up. Though, having seen Wojciech's example, it appears that there is more involved than just eliding HTML tags! I've not actively studied the (apparent) transformation to try to codify the rules that may have been applied...
> And there may be a *lot* of it. I've seen usenet messages sent (or forwarded) from Google Groups with ... not kidding! ... ~10,000 lines of deeply nested HTML surrounding ~10 lines of text.
>
>>> Does it mean that all sources that were posted to Usenet are now lost for us forever?
>>> Is there any other way to access the old Usenet messages in their original form?
>
> Since Google has removed the option to see the raw message, the only way to get things unmangled is from some other source.
For a small-ish post, I'd wager you could scrape (as above) and manually edit the resulting text to something that's faithful to the original intent. Tedious and potentially error prone but denies "lost forever".
> Unfortunately few NNTP servers go back further than about 10 years, and ftp.uu.net (the original usenet archive) is no longer operating.
>
> You can try
> https://usenetarchives.com/ or
> https://www.crunchbase.com/organization/the-usenet-archive.
>
> Many (most?) of the historically popular groups are available, and that includes pretty much everything in the comp.* and sci.* hierarchies. But searching is not easy, and if you're looking for something esoteric you may not find it.
Some of the better known sources are also available on FTP servers. E.g., I think Vixie's cron(8) is available like this.
Reply by Don Y November 28, 2020
On 11/27/2020 5:40 AM, Wojciech Zabolotny wrote:
> On Friday, 27 November 2020 at 03:21:39 UTC+1, Don Y wrote:
>> On 11/26/2020 4:11 PM, Wojciech Zabołotny wrote:
>>> A few Usenet groups allowed users to post their source code as shar archive. The Google Groups website supported access to those groups, viewing the message in the original (raw) format and unpacking the sources. Unfortunately, last update of Google Groups has dropped a possibility to access the original of the Usenet posts. The "formatted" (in fact corrupted) version of the message does not allow to unpack the (now damaged) shar archive.
>> I don't understand what you mean by "corrupted"? Do you have a pointer to an example that I can examine (without a google login)?
>
> Here you are: https://groups.google.com/g/alt.sources/c/YeeAV3fBAVc/m/AZgPoFxS4NYJ
> The Python code has completely removed indentation.
Indentation and whitespace /tend/ to be insignificant to the operation of the code. Of course, presence in string literals is a different story -- where even replacing tabs with spaces is a hazard.
>> Are you sure the "corruption" can't be stripped from the post with a filter (script)?
>
> No, the indentation spaces are simply removed. There is no way to recover them.
From a quick look, it seems like the problem goes beyond that. Note that the leading 'X' is stripped from most -- but not all -- lines of "encoded" files.
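For context on why the missing 'X' matters: in the traditional shar convention, each line of an archived file carries a one-character prefix (commonly `X`), and the unpacking script removes it with something like `sed 's/^X//'`. The Python sketch below mimics that step and also reports which lines lack the prefix, as a rough indicator of damage. The function name `unpack_shar_body` and the sample lines are hypothetical; they are not taken from the archive under discussion.

```python
def unpack_shar_body(lines: list[str], prefix: str = "X") -> tuple[list[str], list[int]]:
    """Strip the shar line prefix; return restored lines and the 1-based
    numbers of lines that did not carry the prefix."""
    restored = []
    suspect = []
    for number, line in enumerate(lines, start=1):
        if line.startswith(prefix):
            restored.append(line[len(prefix):])
        else:
            # Same result as sed 's/^X//' on such a line, but flagged.
            restored.append(line)
            suspect.append(number)
    return restored, suspect


body = [
    "Xdef main():",
    "X    print('hello')",
    "print('damaged line')",  # prefix (and indentation) already stripped
]
restored, suspect = unpack_shar_body(body)
print(restored)
print("lines missing the prefix:", suspect)
```

A flagged line is only a hint. If the web view removed the prefix from a line whose real content happens to begin with `X`, the unpacker silently eats that character and nothing is flagged.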
>>> Does it mean that all sources that were posted to Usenet are now lost for us forever?
>>> Is there any other way to access the old Usenet messages in their original form?
As you appear to be the owner of the file (and presumably have another copy stashed away), you might try reposting it as a SHAR but uuencoded, first. I would suspect that would be more robust wrt whatever pretty-printing algorithm google is trying to impose. Or, just keep a copy on some other public archive.
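A rough illustration of why uuencoding first might help: every byte of the file, leading whitespace included, is turned into printable characters, so there is no indentation left for a reformatter to strip. This is only a sketch using Python's standard `binascii` module. The helper names `uu_encode` and `uu_decode`, the file name `payload.txt`, and the mode `644` are illustrative assumptions, and whether Google Groups leaves such text alone is untested here.

```python
import binascii


def uu_encode(data: bytes, name: str = "payload.txt", mode: int = 0o644) -> str:
    """Return data as uuencoded text, 45 input bytes per encoded line."""
    lines = [f"begin {mode:o} {name}"]
    for offset in range(0, len(data), 45):
        chunk = data[offset:offset + 45]
        # backtick=True writes zero values as '`' instead of a space.
        encoded = binascii.b2a_uu(chunk, backtick=True)
        lines.append(encoded.decode("ascii").rstrip("\n"))
    # Zero-length line marking the end of the data.
    lines.append(binascii.b2a_uu(b"", backtick=True).decode("ascii").rstrip("\n"))
    lines.append("end")
    return "\n".join(lines) + "\n"


def uu_decode(text: str) -> bytes:
    """Decode the first begin/end section found in text."""
    out = bytearray()
    in_body = False
    for line in text.splitlines():
        if not in_body:
            in_body = line.startswith("begin ")
            continue
        if line == "end":
            break
        out += binascii.a2b_uu(line)
    return bytes(out)


sample = b"if x:\n\ty = 1\n    z = 2\n"
text = uu_encode(sample)
print(text)
print(uu_decode(text) == sample)
```

With `backtick=True`, the encoded body contains no space or tab characters at all, which is the property that matters if whitespace is what gets mangled.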
Reply by Wojciech Zabolotny November 27, 2020
On Friday, 27 November 2020 at 03:21:39 UTC+1, Don Y wrote:
> On 11/26/2020 4:11 PM, Wojciech Zabołotny wrote:
> > A few Usenet groups allowed users to post their source code as shar archive. The Google Groups website supported access to those groups, viewing the message in the original (raw) format and unpacking the sources. Unfortunately, last update of Google Groups has dropped a possibility to access the original of the Usenet posts. The "formatted" (in fact corrupted) version of the message does not allow to unpack the (now damaged) shar archive.
> I don't understand what you mean by "corrupted"? Do you have a pointer to an example that I can examine (without a google login)?
Here you are: https://groups.google.com/g/alt.sources/c/YeeAV3fBAVc/m/AZgPoFxS4NYJ The Python code has completely removed indentation.
> Are you sure the "corruption" can't be stripped from the post with a filter (script)?
No, the indentation spaces are simply removed. There is no way to recover them.
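The loss can be demonstrated directly. In the Python sketch below, two programs with different meanings collapse to the same text once leading whitespace is removed, and that text no longer compiles. The variable names and the helpers `strip_indentation` and `compiles` are made up for the illustration.

```python
variant_a = (
    "if a:\n"
    "    x = 1\n"
    "    if b:\n"
    "        x = 2\n"
    "else:\n"
    "    x = 3\n"
)

variant_b = (
    "if a:\n"
    "    x = 1\n"
    "    if b:\n"
    "        x = 2\n"
    "    else:\n"
    "        x = 3\n"
)


def strip_indentation(source: str) -> str:
    """Remove leading whitespace from every line, as the mangled post did."""
    return "\n".join(line.lstrip() for line in source.splitlines()) + "\n"


def compiles(source: str) -> bool:
    """Report whether Python accepts the text as a module."""
    try:
        compile(source, "<post>", "exec")
        return True
    except SyntaxError:  # IndentationError is a subclass of SyntaxError
        return False


stripped_a = strip_indentation(variant_a)
stripped_b = strip_indentation(variant_b)
print(stripped_a == stripped_b)
print(compiles(variant_a), compiles(variant_b), compiles(stripped_a))
```

Since both originals map to the same stripped text, no filter can tell which one was posted. Recovery needs a human who understands what the code was meant to do.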
> > Does it mean that all sources that were posted to Usenet are now lost for us forever?
> > Is there any other way to access the old Usenet messages in their original form?
Reply by George Neuner November 27, 2020
On Thu, 26 Nov 2020 19:01:33 -0700, Don Y
<blockedofcourse@foo.invalid> wrote:

>On 11/26/2020 4:11 PM, Wojciech Zabołotny wrote:
>> A few Usenet groups allowed users to post their source code as shar archive. The Google Groups website supported access to those groups, viewing the message in the original (raw) format and unpacking the sources. Unfortunately, last update of Google Groups has dropped a possibility to access the original of the Usenet posts. The "formatted" (in fact corrupted) version of the message does not allow to unpack the (now damaged) shar archive.
>
>I don't understand what you mean by "corrupted"? Do you have a pointer to an example that I can examine (without a google login)?
>
>Are you sure the "corruption" can't be stripped from the post with a filter (script)?
The junk is HTML formatting. The worry is that things like C++ source legitimately may contain angle bracket delimited text. You'd need a smart filter that understands HTML tags.
And there may be a *lot* of it. I've seen usenet messages sent (or forwarded) from Google Groups with ... not kidding! ... ~10,000 lines of deeply nested HTML surrounding ~10 lines of text.
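As a sketch of what a filter that understands HTML tags could look like, the Python below uses the standard library's `html.parser` rather than a regular expression. It assumes the page escapes angle brackets in the message text as `&lt;` and `&gt;`, which well-formed HTML has to do. The class name `TextExtractor`, the function `html_to_text`, and the sample markup are invented for the illustration and do not reflect the actual Google Groups page structure.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, dropping tags and decoding character references."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def html_to_text(markup: str) -> str:
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


page = (
    '<div class="msg"><pre>#include &lt;vector&gt;\n'
    "std::vector&lt;int&gt; v;</pre></div>"
)
print(html_to_text(page))
```

The real tags are discarded while the escaped `<vector>` and `<int>` come back intact. If the page did not escape the source text, a parser could not tell `<vector>` from markup either, which is exactly the worry raised above.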
>> Does it mean that all sources that were posted to Usenet are now lost for us forever?
>> Is there any other way to access the old Usenet messages in their original form?
Since Google has removed the option to see the raw message, the only way to get things unmangled is from some other source.
Unfortunately few NNTP servers go back further than about 10 years, and ftp.uu.net (the original usenet archive) is no longer operating.
You can try https://usenetarchives.com/ or https://www.crunchbase.com/organization/the-usenet-archive.
Many (most?) of the historically popular groups are available, and that includes pretty much everything in the comp.* and sci.* hierarchies. But searching is not easy, and if you're looking for something esoteric you may not find it.
George
Reply by Don Y November 26, 2020
On 11/26/2020 4:11 PM, Wojciech Zabołotny wrote:
> A few Usenet groups allowed users to post their source code as shar archive. The Google Groups website supported access to those groups, viewing the message in the original (raw) format and unpacking the sources. Unfortunately, last update of Google Groups has dropped a possibility to access the original of the Usenet posts. The "formatted" (in fact corrupted) version of the message does not allow to unpack the (now damaged) shar archive.
I don't understand what you mean by "corrupted"? Do you have a pointer to an example that I can examine (without a google login)? Are you sure the "corruption" can't be stripped from the post with a filter (script)?
> Does it mean that all sources that were posted to Usenet are now lost for us forever?
> Is there any other way to access the old Usenet messages in their original form?