The hidden anatomy of websites (BEHIND THE PAGE)

Introduction

Before getting properly into building a website, it is necessary to understand how websites and the internet work. To understand how something works, we dissect the whole down to its bare essentials. A website, like everything else in the information world, is made of code, which a browser renders into the page you see. Sounds simple, right? That’s because this description leaves out 99% of what is actually happening. In this chapter, we take you behind the scenes, into the inner workings of the Internet.

The Internet Protocol Suite

Consider the scale of the internet. Billions of devices, with exponentially more connections between them, communicating at the same time. How does one device (say, a browser) know which computer is the one it is looking for? How does it let the other computer know what it needs? How do technologically different devices communicate with each other? How do they coordinate the exchange seamlessly every time?

Just as its rules functionally define a game, the Internet is functionally defined by a set of protocols, collectively called the Internet Protocol Suite. Today’s Internet Protocol Suite began as one of the first research projects in computer networking models, funded by DARPA of the United States Department of Defense. That effort led to the creation of the Transmission Control Protocol/Internet Protocol (TCP/IP) by Robert E. Kahn and Vinton Cerf, which was later split into further layers. TCP/IP provides end-to-end data communication, specifying how data should be packetized, addressed, transmitted, routed and received; it tells the Internet how to work, and it is still in use today.

In formulating TCP/IP, Cerf and Kahn intended to enable communication between multiple different networks, each with its own local protocols. They wanted a network model that handled the transfer of information as efficiently as possible, with all other intelligence handled at the nodes. Today’s Internet Protocol Suite is composed of four layers of abstraction between the information on the computer and the information that is physically transmitted. Within the abstraction layers there are many individual protocols for specific purposes, as different from each other as sugar and salt. The central two layers are the Internet Layer and the Transport Layer, of which the Internet Protocol (IPv4, IPv6) and the Transmission Control Protocol (TCP) are the major components. The peripheral layers are the Application Layer on the software side and the Link Layer on the hardware side.
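As a quick reference, the four layers and a few of the protocols that live in each can be jotted down in code; the protocol lists here are representative examples, not an exhaustive catalogue:

```python
# The four TCP/IP abstraction layers, from the software side (top)
# to the hardware side (bottom), each with a few example protocols.
tcp_ip_layers = {
    "Application": ["HTTP", "DNS", "FTP", "SMTP"],
    "Transport":   ["TCP", "UDP"],
    "Internet":    ["IPv4", "IPv6"],
    "Link":        ["Ethernet", "Wi-Fi"],
}

for layer, protocols in tcp_ip_layers.items():
    print(layer, "->", ", ".join(protocols))
```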

What does any of this have to do with websites, you ask? Well, to play a game well, one needs to understand the rules. Similarly, understanding the protocols of the IP Suite will definitely give you a leg-up in your dealings on the Internet. We will discuss the pertinent protocols of the Application Layer and refrain from proceeding further down the funnel of abstraction, as the lower layers don’t pertain specifically to websites.

 

Application Layer – Protocols for functioning websites

The application layer handles the interface between the message data on the host and the message that is communicated. It is not responsible for the communication itself, and has no specifications for how the data is to be communicated (mapping, routing, etc.). The various protocols that we are about to go over briefly have one overarching theme – they specify the standard for how communication is packaged. The transmission itself is left to the lower layers of TCP/IP.

DNS

The Domain Name System is a protocol for mapping resources between the two major namespaces – domain names and IP addresses. IP addresses are the equivalent of physical locations of websites (technically, of the servers that host the websites) on the internet, while domain names are easily recognizable labels that refer to those locations. This mapping is what DNS takes care of automatically and efficiently. It essentially provides a worldwide, distributed, decentralized and scalable directory system that makes it possible for websites to have human-readable URLs.

When you enter a URL into your browser, a DNS name server resolves the domain name you entered to the appropriate IP address, and your request is directed there. The name server itself also resides on a computer connected to the internet, so the client computer must know its IP address; this is usually configured automatically (for example, by your router or ISP), but it can also be entered manually in the connection settings. Some ISPs run their own DNS name servers, or you can always use the servers provided by OpenDNS or Google Public DNS.

For your own website, if you purchase your domain name and hosting from the same place, the DNS settings and registration will be taken care of automatically for you. However, if you bought a domain name and want to use free hosting provided by someone else, you will need to manually point the domain at the name servers of your host.
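The resolution step can be seen from Python’s standard library, which asks the operating system’s resolver (and, through it, the configured DNS name server) to map a hostname to an address. A minimal sketch:

```python
import socket

def resolve(hostname):
    """Ask the system resolver (backed by the configured DNS
    name server) for an IPv4 address for `hostname`."""
    return socket.gethostbyname(hostname)

# "localhost" resolves without any network access:
print(resolve("localhost"))  # 127.0.0.1

# A real domain name works the same way when online, e.g.:
# resolve("example.com")  -> a public IP address
```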

Web pages and hyperlinks (HTTP)

If there was a popularity contest for protocols, the HyperText Transfer Protocol would surely be one of the top contenders.

Consider the process of opening a website: you enter the URL of the website you want to visit, because you only know the domain name. The DNS name server maps the domain name to the corresponding IP address. Your request for the website reaches the web server, and it serves you back the page you requested. Of course, this is one of the simplest scenarios, but almost any interaction between a user and a website is handled according to HTTP. That is why almost all URLs start with “http://” or “https://”. HTTPS is HTTP encrypted using Transport Layer Security (TLS), the successor to the Secure Sockets Layer (SSL).

HTTP was designed as a ‘request-response’ protocol, meaning that it is modelled after how humans communicate in the real world. In general, a client sends a request to a server, which sends back an appropriate response. The response can also contain resources, if requested. The latest version of HTTP is HTTP/2, which was standardized in 2015. Prior to that we had HTTP/1.1, which is very much in use even today, because older technologies take time to phase out. HTTP (along with HTML) was invented by Tim Berners-Lee and his team at CERN, with the original development starting in 1989.
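Both sides of the request-response exchange can be demonstrated with Python’s standard library alone. The sketch below starts a throwaway server on the local machine and sends it a single GET request; the page content is, of course, invented for the example:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from http.client import HTTPConnection

class HelloHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>Hello!</body></html>"
        self.send_response(200)                        # status line
        self.send_header("Content-Type", "text/html")  # header fields
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                         # response body

    def log_message(self, *args):  # silence request logging
        pass

# The server side: listen on a free local port in a background thread.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client side: send a GET request and read the response.
conn = HTTPConnection("127.0.0.1", server.server_port)
conn.request("GET", "/")
response = conn.getresponse()
data = response.read()
print(response.status)  # 200
print(data)             # b'<html><body>Hello!</body></html>'
server.shutdown()
```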

 

HTTP is essentially a format for packaging data, and only hardcore web developers need to know more than the four parts of a request message: the request line, the header fields, an empty line, and an optional message body. The request line contains the request method, which deserves some attention. The first version of HTTP had only one method, GET, which a client sends to request some resource (like a webpage) from the server. Some progress later, HTTP/1.1 defines eight methods, and the second most popular is POST, which is how elements like forms submit user data to the server. There are also the PUT and DELETE methods, which can be used to tell a database engine in the backend what to do (for example, appending details to the database when creating a new user account).
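Those four parts are easy to see when a request message is written out by hand; the path, host and form fields below are invented for illustration:

```python
CRLF = "\r\n"  # HTTP lines end with carriage return + line feed
body = "name=Ada&plan=basic"

request = (
    "POST /signup HTTP/1.1" + CRLF        # 1. request line: method, path, version
    + "Host: www.example.com" + CRLF      # 2. header fields
    + "Content-Type: application/x-www-form-urlencoded" + CRLF
    + "Content-Length: " + str(len(body)) + CRLF
    + CRLF                                # 3. empty line ends the headers
    + body                                # 4. message body (the submitted form data)
)

print(request.split(" ", 1)[0])  # the method: POST
```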

File Transfer Protocol (FTP)

The File Transfer Protocol was first developed by Abhay Bhushan in 1971. It is the predecessor of HTTP, which was originally developed as an enhancement of FTP, optimised for the transfer of information on the scale of websites. FTP is still in use today for transferring web content; many hosting vendors’ interfaces for uploading your website’s files are driven by FTP transfers. There are many dedicated client programs for FTP data transfer with a server, such as the open source FileZilla. Some WYSIWYG web page building software also includes direct FTP upload functionality. However, FTP was not designed to be secure, and has many vulnerabilities, as data is not encrypted. In fact, all transmissions are cleartext, so mere packet sniffing can easily land the username and password in malicious hands. To overcome this, SFTP (a separate file-transfer protocol that runs over Secure Shell, SSH) and FTPS (FTP secured by SSL/TLS) were developed.
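FTP transfers can also be scripted with Python’s standard-library ftplib module. The sketch below uses its FTPS variant; the host, credentials and file names are entirely hypothetical placeholders for whatever your hosting vendor provides:

```python
from ftplib import FTP_TLS  # FTPS: FTP secured by TLS

def upload_file(host, user, password, local_path, remote_name):
    """Upload one local file to an FTPS server.
    All arguments are placeholders -- substitute the details
    your hosting vendor gave you."""
    ftps = FTP_TLS(host)
    ftps.login(user, password)
    ftps.prot_p()  # encrypt the data channel as well as the login
    with open(local_path, "rb") as f:
        ftps.storbinary("STOR " + remote_name, f)
    ftps.quit()

# Hypothetical usage (do not run without real credentials):
# upload_file("ftp.example.com", "me", "secret",
#             "index.html", "public_html/index.html")
```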

SMTP

SMTP stands for Simple Mail Transfer Protocol, which defines how electronic mail is sent and received amongst the various mail servers on the networks of the internet. Mail servers are gateways between the client computer/application and the internet that specialize in sending and receiving email. You, as a user, use an email client (e.g. Microsoft Outlook, Apple Mail, Google Inbox) to read and write your emails. The client sends emails to the mail server using SMTP; however, it retrieves emails from the mail server using POP3 (Post Office Protocol version 3) or IMAP4 (Internet Message Access Protocol version 4).

When you purchase the domain and hosting space for your website, you may also get a few email accounts. This means you can have an email address like [email protected] for official purposes. To use it in your favourite email client, you will need IMAP/POP access enabled with your mail server provider; you then copy the server settings into the email client along with your authorization details.

TLDR

The Internet runs according to a set of rules (protocols) called the IP Suite, which manages the packaging and transmission of data across the entire network. Some of those protocols are DNS, for accessing websites using URLs made of common words; HTTP, which is how websites (servers) and browsers (clients) communicate; FTP, for transferring files between clients and servers; and SMTP, which is how email is sent across the Internet.
