As I develop web applications for this project I'm doing, everything is meant to be shared, and there's interesting nuance around navigating the liabilities and responsibilities of open source.
I'm choosing not to officially issue any of the typical open-source software licenses for any of the code that I'm writing at this time.
I am creating these applications, and they live on my server and are free to use at this time. I don't know how or when or why that would change, but for now, they're all free to use.
The source code is visible in terms of the HTML pages that I'm building: you can just right-click and view source, and you'll see all the code that is being sent to your browser from my server. What isn't visible is what happens on the server side. In all my apps up to this point, though, there's nothing happening once the file is served; there's no communication back and forth with the server.
There are applications that will be developed in time where those features will exist, and I'll be excited to share what the logic and considerations are for those situations.
So, with that said, bit of a disclaimer: what I just finished doing right now is noteworthy because it will be integrated into future tools and applications.
What it basically is, for me, is a way to generate a very unique RSS feed of sorts. It's almost like an RSS feed lite, without using the markup language that is the essence of what that protocol actually does.
I really appreciate RSS; I use it every day. I'm using the RSS protocol to get this show out.
I wanna go over the anatomy of this website syndicator. RSS is Really Simple Syndication, so this would be MSS, Micro Simple Syndication, or something like that.
But it's twelve lines of code, and it is a Bash script. Bourne Again Shell is what I understand Bash to stand for.
All of these crazy code-chef snippets that I need to get jobs done, I oftentimes get through searching forums, taking the recipe, and adapting it to exactly what I'm doing, and some of the stuff builds on itself.
A lot of it is not intuitive at all. You say to yourself: there is no way, in a million years, even if I read all these manuals backwards and forwards, which I end up doing a lot (I'll read the help and the manual pages a hundred times), that I would find in there the emergent property of what somebody baked and showed.
What you can do using some of these very alien-language, bizarre-syntax structures is remarkable. But sparing all those details: the point is that I got a job done that emulates the RSS protocol, in twelve lines of code, using Bash, and it's able to pull in all kinds of different tooling.
At this point you could consider that tooling native, because even if something started out as external and unique to one platform, a lot of things get absorbed and bundled into a modern server: however many, probably millions of, lines of third-party, open-source, licensed code.
There are these interesting command-line tools that you can pull into a bash script. The first aha moment I had about working with Linux was this concept, which is obviously not totally unique to Linux: having a command-line interface. Those exist across platforms by different names, but I mean the Bash scripting environment, the terminal environment on Linux operating systems.
It immediately made sense to me how it was a paradigm shift from what I was used to: I have one icon on the desktop, I click on it, and that's an application. If I want a file that I create in that application to work in another application, I have to save that file somewhere on the desktop, open the other application in another window, and use its interface to look for that file.
So take the workflow between, let's say, video editing, where you're adding audio clips, and image editing, where you're using layers from its tool suite, and you end up trying to chain those together into a workflow. You're bouncing back and forth between all these different interfaces, and that's great for what it is. But it clicked for me that you can basically take the command structure and connect the input and output of code from one application to the next.
The word often used is pipe. Connecting one system element's output to the input of another is very permaculture, in a sense: you get richness and complexity and emergent properties that can be just magical when they all snap together and work well. There's so much value and efficiency in that.
Because now what I would have had to sit there and do one app at a time, opening it, closing it, moving a file, finding it again, all those things can happen in a script.
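To make that concrete, here's a minimal sketch of the kind of chain I mean. This particular recipe isn't from any of my apps, and the file name is a placeholder; it's just the classic word-frequency chain:

```bash
# Split a text file into one word per line, lowercase everything,
# then count and rank the five most common words. Each tool's
# output is piped straight into the next tool's input.
tr -cs '[:alpha:]' '\n' < notes.txt \
  | tr '[:upper:]' '[:lower:]' \
  | sort \
  | uniq -c \
  | sort -rn \
  | head -n 5
```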
Think of it like a script for a play or a movie, in the sense that the application interprets the commands and chains them together, and the behaviors that you would have had to do manually all just happen, in most circumstances, depending on what you're doing.
If you're just creating and deleting and sorting and outputting different files, it's really like being a chef. I consider it being a code chef: you take the raw material, you do your code-chef thing, and you end up with this very elegant product, whatever that is meant to be at the endpoint.
So that's a bit of broader-scope background on the whole concept of these frameworks. But to focus on the project of the day, this little micro version of RSS: it's essentially micro RSS in twelve lines of bash scripting.
If I were to tell the story of these twelve lines, the story goes line by line. So here goes: I'm gonna translate and tell the story of this application line by line, enumerating the lines as they go along.
So, line one: hey, computer, this is a bash script, which means the first thing you're gonna do is read it as an application, versus reading it as lines of text that don't do anything or mean anything. This is an executable file, and it is identified on the top line with what they call a shebang, which is a hash sign and an exclamation point. That syntax introduces the file as a Bash script, an application.
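As a sketch, the top of the file looks something like this:

```bash
#!/bin/bash
# The "#!" pair (the shebang) on the very first line tells the
# system to run the rest of this file with bash, instead of
# treating it as inert lines of text.
```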
Line two is where I'm using a tool that's built into most modern systems, one that gives you a suite of HTTP request capabilities so that you can make requests without using a browser. This is where this command-line stuff gets really interesting: in one line of code, my server can ask another server on the Internet, by its IP address or its DNS name, its URL, to serve up a page; it reads it, copies its contents, and builds a file at a certain location on my server with the contents of that file. It's essentially what your browser does when you visit a web page: you give it a URL, or you click on a link, which gives it the URL.
The URL opens in the browser, and you are essentially downloading a temporary copy of that file to be parsed, interpreted, and displayed by your browser.
But back to line two: it does that task of retrieving the file and copying it to where you wanna save it, because you're gonna be working with it further down the story.
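To sketch what a fetch line like that might look like, assuming the tool is something like curl (the URL and save path here are placeholders, not the real site):

```bash
# Ask the remote server for the page and save a working copy
# locally; -s keeps the transfer quiet for unattended runs.
curl -s "https://example.com/news" -o /tmp/news_page.html
```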
So lines three and four are setting variables for what I want to start with, to bracket an extraction method from that file.
Line two was: go get the file and save it to a specific location on the server.
Now we're getting into parsing it, meaning picking it apart and taking out what we want.
So I'm setting a variable, which is the contents of an HTML tag: what's in between the left and right angle brackets. That's the first variable I'm gonna set. Then I'm gonna set a second variable, which is what's within the closing tag.
The reason I had to create two separate variables is this: if you wanted to get all of the inner contents of the body of an HTML document, you could just create the variable as the word body, because the opening tag and the closing tag would have the same contents. However, if there's stylesheet information, or you're giving that element an attribute within the HTML file, an ID, an identifier, then obviously the opening tag is gonna have more, and different, text in it than the closing tag. That's why I had to establish two separate variables, one for the opening and one for the closing.
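A sketch of those two variables, with hypothetical tag contents; the real opening tag on the site I'm reading carries its own attributes:

```bash
# The opening tag may carry an ID, class, or style attributes,
# so it gets its own pattern; the closing tag is always bare.
open_tag='<h2 class="headline">'
close_tag='</h2>'
```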
So line five is where I make use of having set those variables, by running a number of applications together in a chain. I'm gonna have the scripting language read that file I saved, which is the web page I downloaded, and then use another tool. The next tool takes those variables as input to a command that crops out everything from each line of text in that file; it removes everything except the content that exists between those two tags. For me, this is the way to not have to look at a million ads every day.
It's also my own version of the reader mode that a lot of browsers now have.
What it's doing is saying: I don't want any of the other information on this page; I just want the actual text that's between these tags.
So for me, it was a title tag, a header. If you're looking at a web page, you see header sections; normally the biggest one is at the top, like the page title or article title, what have you. Then down throughout the page of the article, there are subsections with typically smaller, but still prominent, headline formatting. Maybe it's bold, maybe it's a larger font size, etc. So say it's H1. Obviously, what I'm trying to get at is that if I'm making my own little RSS feed, I don't need large snippets or excerpts from the site; I really only want a list of headlines from the page. I don't want the ads, and I don't want the descriptions so much. I just want a list of the headlines. If I find anything interesting, then I'll go search it, look it up, or go to the source and find it. Because this is private, I don't need to get anything other than exactly what I want for myself.
So that is just to get the text content between the opening and closing of a header tag, and to strip everything else out, line by line, so that I end up with as many lines as there were header tags, containing only the content within them.
There are a couple of other steps in order to remove the crumbs. To keep with the chef metaphor: there are a couple of crumbs left over from that process, and there are many different tools and recipes for removing them and getting just the pure text you want. So a couple of tools are used here, in a couple of extra steps, to clean it up and trim it up.
Just a quick recap: use the variables for the tags, put them into the cropping tool, and chain it together to clean up the crumbs.
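Here's a minimal sketch of that extract-and-trim chain, assuming grep does the cropping and sed cleans the crumbs; the tag variables are the hypothetical ones from above, and the file paths are placeholders:

```bash
# Crop out only the tag-plus-headline runs, strip the tags off,
# then trim any leftover leading/trailing whitespace per line.
grep -o "${open_tag}[^<]*${close_tag}" /tmp/news_page.html \
  | sed -e "s|${open_tag}||" -e "s|${close_tag}||" \
  | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
  > /tmp/headlines_raw.txt
```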
Now what I've got is a new file: it pipes all of that, ultimately, to a new file that has just those lines.
Okay, but there are issues with that file; it's not ready for me to enjoy. A couple of other things have to happen. There are some items within this list that are not within my scope of interest, in terms of the content, so I'm taking another couple of steps within that new file that was created.
I'm on line six now, and I'm gonna strip out the lines within that list of headlines that contain, let's say, news items that are irrelevant to my interests. If I'm lucky, they will have patterns that repeat, so I can easily identify them, group them, and strip them out just by taking a sample of a repeating pattern. Line six reads the new file, specifies the patterns within the lines that I want to identify, strips those lines out, and sends the result to a new file.
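A sketch of that filter, assuming grep -v does the stripping; the patterns here are stand-ins for whatever actually repeats on the real page:

```bash
# Drop every line matching an unwanted repeating pattern and
# send the survivors to a new file.
grep -v -e 'Sponsored' -e 'Advertisement' \
  /tmp/headlines_raw.txt > /tmp/headlines_filtered.txt
```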
Now we're on line seven.
With that new file, I want to make an even more nuanced filter, which is to eliminate the news items that get caught up in the list but are sort of orphaned. They may refer to something that makes sense in the context of viewing the page, but they get caught up in the original net, and I need to filter them out. The ones that are not actual news headlines, the orphaned material, tend to have only a couple or a few words. So there's a way to strip out all of the lines that don't have the minimum length of a reasonable headline.
There aren't gonna be very many three-word headlines out there, so I set the parameter, the number for my cutoff. I may be losing out on extremely important news whose headline is only a few words, but I'm gonna risk it, because I don't wanna be cluttered with that sort of noise.
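One way to sketch that cutoff, assuming awk and a made-up minimum of four words; NF is awk's count of whitespace-separated fields on a line:

```bash
# Keep only lines with at least four words; orphaned two- and
# three-word fragments fall away.
awk 'NF >= 4' /tmp/headlines_filtered.txt > /tmp/headlines_long.txt
```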
Now we're on line eight.
It's going to remove the empty lines and the empty spaces, so the file isn't full of weird tabs and blank lines and stray spaces; it strips all that out, aligns everything flush left, and sends it to a new file.
These are tiny files. There are ways, if I wanted to go even further, to reduce the number of files being created here. But to keep it readable, and to have checkpoints as you move along, it helps to do a series of tasks that are comprehensible to the human mind, my human mind at least, then stop there and have a file. Then you can go back and audit. I've already iterated through this and eliminated one of these file steps, but I'm comfortable with the number of operations being done per file. It gives me a way to segment, troubleshoot, and do bug hunting.
This line is not only removing empty space but also sorting the lines, and the sorting is what prepares them for the most important feature of this tool for me: it runs on what's called a cron table, a chronological table, so every day this code is executed by the server, and it repeats this whole process.
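A sketch of what such a crontab entry might look like; the hour and the script path are placeholders:

```bash
# minute hour day-of-month month day-of-week  command
# Run the syndicator script every morning at 06:00.
0 6 * * * /home/me/bin/syndicate.sh
```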
I have a couple of considerations there, one of which is to not just pile up megabytes of the same lines over and over. The important thing is the sorting part, which is helpful in what's gonna happen further down the line.
Now I have the version that has everything stripped out except the list of headlines that I want. They're sorted in alphabetical order, and they're also deduplicated, in the sense that, as this process repeats, there's a bit of piling up, which lets me identify and remove the duplicate lines.
Because this website does not offer an RSS feed of its own, which would guarantee me only fresh new items, I have to remove the old items myself. If I did this manually, I would be sitting in front of a computer refreshing a page once a day, and I would be looking at a lot of the same news items, because they don't provide an RSS feed, and they don't provide segmentation by date that would let me read only the new headlines from today and not from yesterday. It's just sort of an aggregation of news items, and some of them continue to get updates.
The reason I had to build this is that I had stopped going to that website, because there was a change of paradigms around being able to have custom RSS feeds from certain social media platforms, shall I say. So I had to look into what it would take for me to replicate the tooling that used to be available, and do it DIY.
The problem to solve here is a website that does not have an RSS feed and whose news page doesn't clearly distinguish between old news and new news.
So in order to do that with my twelve lines of code, one of these lines has got to keep ongoing track: every day when I make a copy of the page, I've got to compare its contents against what came before. It has got to say: oh, this is a news headline that you've already read on a previous day, on a previous cycle, because it's appearing again; we're gonna take it out. And the code is gonna take out not just the duplicate, leaving the original, if you will; it's gonna take out both. Because, logically, if I've already seen it before, it can go away. I don't need to see it again; if I took note of it and was interested, I would have saved it.
I wanna make sure that every day, when this cycle is done, I am only getting headlines that have not appeared before, for however many days I let it pile up before I clear it all out and start fresh. Maybe I'll do that once a week or once a month, which I can also automate.
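My exact mechanism aside, here is one way to sketch that compare-and-remove step: keep a sorted archive of every headline ever seen, let comm keep only today's never-before-seen lines, then fold today's list into the archive for the next cycle. All the paths here are placeholders.

```bash
# Archive of every headline seen on previous cycles.
touch /var/lib/syndicate/seen.txt
# comm needs sorted input; -23 keeps only lines unique to file 1,
# i.e. headlines appearing today that are not in the archive.
sort /tmp/headlines_long.txt > /tmp/today.txt
comm -23 /tmp/today.txt /var/lib/syndicate/seen.txt > /tmp/fresh.txt
# Merge today's headlines into the archive, deduplicated and sorted.
sort -u -o /var/lib/syndicate/seen.txt /var/lib/syndicate/seen.txt /tmp/today.txt
```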
The point is: how do you not rely on big tech to spoon-feed you what you want from the Internet? How do you not have to load a page and go, oh, I don't remember which of these I've read before, when half of what I'm reading I've already read?
It's not infuriating, but it becomes a waste of time and more trouble than it's worth. And yet the feed really is worth it to me: I want this news feed. I want it as part of my personal daily briefing, if you will, my personal dashboard on the world every day.
This is extremely, extremely important, and I had let it lapse in my life because of big tech cutting it off, or cutting off the means of building it into an RSS feed by other routes. So I had to do this myself; it took a while, and then I decided it's time for me not to be left in the dark by the arbitrary decisions of big tech.
I'm gonna do it DIY, so that is the short story long of line number eight.
Now, line number nine.
It creates a subject line for an email.
Then line ten takes the contents of the final product I described, after all the steps leading up through line eight, where the script creates the file that has everything exactly as I want, stripped of everything I don't. Line ten grabs the content of that file and puts it into this email template.
Then line eleven does the sending, using a lightweight email tool: it formats the content into the email protocol in order to authenticate and send the email out.
Line twelve is where it cleans up after itself, removing the files that are no longer necessary from one cycle to the next, so things don't clutter up.
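A hedged sketch of those last four lines, assuming a mailx-style command-line mailer; the subject text, the address, and the paths are all placeholders:

```bash
# Line nine: build a dated subject line.
subject="Daily headlines: $(date +%Y-%m-%d)"
# Lines ten and eleven: send the fresh headlines as the email body.
mail -s "$subject" me@example.com < /tmp/fresh.txt
# Line twelve: clean up intermediate files for the next cycle.
rm -f /tmp/news_page.html /tmp/headlines_*.txt /tmp/today.txt /tmp/fresh.txt
```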
So, for now, that is the description of a custom newsletter application in twelve lines of bash scripting. This is in the spirit of permaculture, the spirit of do-it-yourself, the spirit of design and solving problems.
And then, in terms of the intelligence product: I like the idea I've heard on these espionage shows, where, in their language, they'll say, oh, I'm gonna hire this intelligence shop, meaning a think tank, basically. Then they use words like the Beltway, and there's all this insider lingo. It's just funny to me, but I like the idea that I'm a one-man intelligence shop.
I have some clients. I'm a client to myself, first and foremost, but I was on a client call a few minutes ago where we were talking about all this kind of stuff: privacy and security tooling upgrades for people who have been abused by big tech their whole lives, like me.
I wish I had gotten started earlier with all this, but the reason it's important for me to document this journey and to share it is that it was a journey for me: going from publishing on a popular content management system, a popular blog website system, to saying, I need to push myself to create my own RSS feed, manage it, update it, and do everything literally from the command line, which means giving instructions without a lot of fluffy, user-friendly graphical interfaces.
One of my favorite quotes is that Linux is like driving stick versus automatic. It's good to have that experience, that knowledge. If you understand the analogy, you know there's a lot more control you can have, and that feeds into a lot of different aspects.
So I think it's good for me: a new paradigm for approaching my relationship with technology, which is that I don't want to delegate, and I certainly don't want to over-delegate, the things I can do myself. That's how much more insight I will have into the code that I'm running, the code that runs my life and runs my mind.
If I can write twelve lines of code to take essentially any website in the world, adapt those lines, and make my own feed... this is important, because think about what's happening in Congress right now.
It all comes down to who is legally culpable for the mayhem caused by algorithmic news feeds on social media platforms.
That is as big a deal as anything gets on this planet, from the mega-billionaire big-tech social media prerogatives to all the nuances of people wanting to protect their families online. Congress in the US is having a very difficult time cooperating on getting important safety measures, guardrails, in place.
But me, I don't wanna be surprised by fake news or deepfakes or anything like that. There's plenty of that going on, and I feel that I stay very well informed, but I'm very selective, and most of that comes from RSS. The only reason I'm doing this right now is because of a big tech backlash against the freedom of RSS doing its thing.
Life gives you lemons. Make lemonade.
This was a whole day of work, and I learned a lot, and now I have even more advanced experience. It really expands the mind when you get into the logic of how these tools work, and into the way of being this sort of code chef. And you look at other code chefs who are amazing, and you see their work.
And I'm having a lot of fun with this. This wasn't the way I grew up.
I didn't take a lot of things apart and put them back together.
I didn't write programs when I was a kid.
So this is a second opportunity to understand this kind of fun and make it very useful.
These days they say you'd better learn to code because tech is gonna take your job, whatever. Perhaps that's true. But I like to be a little more positive about it and say: hey, learn to code for the same reasons you would learn to drive stick. You never know when your life may depend on that skill, and you may enjoy it. You may think, hey, this is cool; I think I wanna have a vehicle with a stick shift, because it's its own experience.
So I could go on about it. But I am proud in this moment to say that I tested it thoroughly, over and over, and I hope the website that I'm syndicating for myself privately was not annoyed at all. I'm sure their servers are doing just fine; I did not exhaust their server with my exhaustive testing.
But I did have to test this many times in order to resolve certain formatting issues and get it to display in the email the way I wanted, etc.
So thank you to them for allowing me to do that.
They don't seem to have booted me off their network or blocked my IP.
So thank you very much. And who knows, maybe someday I'll convince them to do the RSS feed.
But for now, I like the idea that I can take these twelve lines of code and adapt them to any website that doesn't currently provide an RSS feed, or customize the feed however I want. I have the skills to do that, and it gives me more control over my intellectual diet.
And I wanna have a healthy and holistic intellectual diet. And I hope you do as well, and I hope you find ways to take the power back in your digital life.