A Deep Dive Into The Dark Web

As the title suggests, the topic of this post is the dark web. It condenses some of my recent work into a blog post, to help me prepare for a lecture I am presenting in the upcoming week. The whole process of discovering and learning about the mechanisms driving the dark web was disturbingly amusing. I do not think it is all that common for someone to enjoy reading about iterative, dynamically created virtual circuits utilizing TLS-encrypted nodes, public-key encryption, and a 128-bit AES cipher. That sentence may not make much sense right now (it captures the core principle of the onion routing protocol), but I aim to make it all clear by the end of this post. Sadly, this does not include a discussion of the Tor browser; instead, it goes in depth on the Tor network. To satisfy any remaining hunger for knowledge, I am also planning a browser overview, as well as a guide on how to build an access point for anonymizing (encrypting) your home network traffic! Cool stuff, now let’s get to today’s topic.

Dark Web, Dark Net, Deep Web…?

Before we go any further with this relatively complex topic, let’s make sure there is no confusion between the following commonly misinterpreted terms.

Surface web (clear web, visible web, world wide web) – all the indexed content that can be reached by search engines. In other words: anything you can “google”.

Deep web (invisible web, hidden web) – all the content (web pages) that cannot be reached by search engines and is therefore not indexed. This part of the Internet is by far the largest in terms of size, and usually cannot be accessed without knowing exactly where and how to find the destination.

Dark net – not interchangeable with the term “dark web”. A dark net is an overlay network that requires specific configurations, software, or permissions to reach. Dark nets are still part of the Internet, but only accessible with those special configurations.

Dark web – exists on dark nets. It entails all the content available on them, which means its websites will not be indexed by the regular search engines operating on the Internet, since they cannot reach them. Perhaps the most famous dark net today is the Tor network.

Let’s give these terms a bit of visual context by associating them with a picture of an iceberg:

[Figure: iceberg diagram of the surface web, deep web, and dark web]

It is important to understand what these terms represent before we go any deeper. They are referenced further on in this post, especially dark net and dark web, since they are the focal points of this research into Tor.

Tor

Chances are you have heard of this thing called “Tor”, but never bothered too much to understand what it is, how it works, or why you would use it. Get ready, because by the end of this post (that is, if you bear with me) you will be familiar with all the dirty details of the Tor network, and with how anonymous you actually are when utilizing dark nets.

So, what exactly is Tor? As described in “Tor: The Second-Generation Onion Router”, the official design document of Tor, it is “a circuit-based low-latency anonymous communication service”[1]. Let’s simplify this and first decipher “anonymous communication service”. The main selling point of Tor is its ability to anonymize a user through encrypted communication and a special routing protocol, developed for the sole purpose of concealing what is being transferred and between whom that data flows. The adjective “low-latency” hints at one of the design goals: create a service with relatively inexpensive computational overhead. In other words, make it work in a way that invites use, instead of repelling users with long connection times. Finally, “circuit-based” means that the service is built on connections called virtual circuits.

Tor Design

Warning: beyond this point you may want to put your scuba gear on; we are about to dive into the deep blue. Seen from the surface, Tor is an overlay network on top of the Internet, but here we discuss its structure in depth.

The network utilizes a principle called onion routing, a fairly old concept that was taken as the basis for implementing Tor. It consists of a set of nodes recognized by the Tor service and deemed trusted. These nodes, also referred to as “onion routers” or ORs, take care of the traffic flow. Every OR maintains a TLS (Transport Layer Security) connection to every other node it connects to, essentially forming virtual circuits. A user trying to connect to the Tor network must first download the required software. That process is self-explanatory, and it enables the use of an “onion proxy” or OP, which instigates the forming of a circuit on its own. The OP’s other roles are to fetch directories from Tor’s directory servers, handle the functions of various user applications, and multiplex incoming TCP streams through circuits.

Onion routers store the values of different types of identifiers (keys) used for encryption and decryption purposes – e.g. signing TLS certificates or authenticating the validity of their own information, such as ID, bandwidth, and exit policy. The communication between ORs/OPs happens via TLS utilizing short-lived keys. A very important thing to note is that at no point will one OR know more than its predecessor and successor: the node which passes it the data, and the node to which it sends the data. This ensures that the path cannot easily be traced back through the network. As we can see, there is a lot of consideration involved in ensuring data integrity and secrecy.

Data Cells, Circuits, and Streams

All the communication among these nodes consists of 512-byte building blocks called cells. There are two types of cells: control and relay. Control cells are always interpreted by the node that receives them, while relay cells are simply passed on to another node. They are differentiated by the cell’s structure:

[Figure: structure of a control cell]

[Figure: structure of a relay cell]

We will not go into depth of what each segment of the data represents.
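That said, for the curious, here is a rough sketch in Python of how such fixed-size cells could be assembled. The field widths and command codes below are illustrative guesses loosely based on the design paper, not Tor’s exact wire format:

```python
import struct

CELL_SIZE = 512  # every Tor cell is a fixed 512 bytes

def make_control_cell(circ_id: int, cmd: int, data: bytes) -> bytes:
    # Cell header: 2-byte circuit ID + 1-byte command; payload padded to 512
    return struct.pack(">HB", circ_id, cmd) + data.ljust(CELL_SIZE - 3, b"\x00")

def make_relay_cell(circ_id: int, stream_id: int, digest: bytes,
                    relay_cmd: int, data: bytes) -> bytes:
    # Relay cells carry an extra relay header inside the payload:
    # stream ID, an integrity digest, the payload length, and a relay command
    RELAY = 3  # hypothetical numeric code for the "relay" cell command
    relay_header = struct.pack(">H4sHB", stream_id, digest, len(data), relay_cmd)
    payload = (relay_header + data).ljust(CELL_SIZE - 3, b"\x00")
    return struct.pack(">HB", circ_id, RELAY) + payload

control = make_control_cell(circ_id=7, cmd=1, data=b"create")
relay = make_relay_cell(circ_id=7, stream_id=42, digest=b"\x00" * 4,
                        relay_cmd=2, data=b"hello")
assert len(control) == CELL_SIZE and len(relay) == CELL_SIZE
```

The fixed size matters: because every cell on the wire looks identical, an observer cannot tell control traffic from relay traffic by length alone.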

Tor enables multiple TCP streams to share a circuit, which drastically reduces the somewhat expensive circuit creation overhead (roughly 0.1 s). New circuits are formed all the time in the background by OPs, so a failed creation does not impact the user’s experience. The process of establishing a circuit, followed by attaching a stream, goes as follows:

[Figure: Alice establishing a two-hop circuit and opening a stream]

The figure above has Alice (the user) connecting to only two nodes (a feature of Tor’s onion routing design, the leaky-pipe topology, enables data to exit at any node in the circuit) before connecting to the final destination, the website. The process of forming a circuit may seem complex, but let’s see what happens by simplifying the figure:

  1. A user (Alice) establishes a circuit with OR 1 (a pair of keys is generated).
  2. The circuit is extended to OR 2 (another pair of keys is generated).
  3. Communication flow and data encryption are now enabled through the virtual circuit.
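The steps above can be sketched as a toy model of this telescoping construction. The key exchange is faked with random bytes, and the names (`Hop`, `build_circuit`) are mine, not Tor’s; the point is that each extension yields one more shared key, and each OR only ever sees its immediate predecessor:

```python
import os

def build_circuit(routers):
    """Telescoping circuit build (toy model). Alice negotiates a key with
    OR 1 directly, then asks OR 1 to relay the 'extend' to OR 2, so OR 2
    only ever observes OR 1 as its upstream peer, never Alice."""
    keys = {}      # the symmetric key Alice shares with each hop
    view = {}      # the only upstream node each hop learns about
    previous = "Alice"
    for name in routers:
        keys[name] = os.urandom(16)  # stand-in for the real key exchange
        view[name] = previous
        previous = name
    return keys, view

keys, view = build_circuit(["OR1", "OR2", "OR3"])
assert view["OR1"] == "Alice"
assert view["OR2"] == "OR1"   # OR2 never learns about Alice
assert view["OR3"] == "OR2"
assert len(keys) == 3         # one independent key per hop
```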

Streams come into play when a user wants a TCP connection to a certain address and port. That process happens in the following manner:

  1. A user (Alice) asks the OP to establish a connection.
  2. The OP selects which circuit to attach the stream to.
  3. The stream opens when the OP sends a specially constructed (relay) cell to the exit node. Once that node connects to the destination host, it notifies the OP and the user application that instigated the process.
  4. Alice is now able to accept data from the TCP stream and construct cells to be sent along the circuit.

Roughly, this is the underlying process of creating and establishing communication between nodes on the Tor network, and out to external destinations. Note that many details, such as the encryption protocols and the hashing between each node, have been left out for brevity.
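To give at least a flavour of the layered encryption that was just glossed over, here is a minimal Python sketch. The real network uses a 128-bit AES cipher in counter mode; the SHA-256-derived keystream below is a stand-in so the example stays dependency-free:

```python
import hashlib, os

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR the data with a keystream derived from the
    key. Applying it twice with the same key recovers the plaintext,
    just like counter-mode AES."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, out))

keys = [os.urandom(16) for _ in range(3)]  # one key per hop: OR1, OR2, exit

# Alice wraps the payload once per hop; the innermost layer is for the exit
cell = b"GET / HTTP/1.1"
for k in reversed(keys):
    cell = keystream_xor(k, cell)

# Each OR peels exactly one layer; only the exit node sees the plaintext
for k in keys:
    cell = keystream_xor(k, cell)

assert cell == b"GET / HTTP/1.1"
```

This is why it is called onion routing: every hop removes one layer of the onion, and no single node holds enough keys to read the payload and know both endpoints.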

Congestion control

Congestion is the networking term for connections becoming saturated by a high amount of traffic. Tor utilizes a technique called bandwidth limiting to help maintain the stability of connections and a more sustainable flow of data. It aims to level out the amount of information coming into and out of circuits (the Tor network).

However, this is not enough to prevent congestion. For that, Tor implements two levels of congestion control: circuit-level throttling and stream-level throttling. As the names suggest, one concerns the way traffic flows through circuits, while the other governs traffic into and out of streams. Both work on the same principle: two dynamic window variables change based on the number of data cells being processed, and depending on their values, incoming traffic is either accepted or held back.
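A minimal sketch of the window mechanism behind circuit-level throttling, assuming the initial window of 1000 cells and the increment of 100 described in the design paper (the class and method names are mine):

```python
class CircuitWindow:
    """Sketch of circuit-level throttling: the package window shrinks
    for every relay data cell sent, and grows back only when the other
    end acknowledges progress with a 'sendme' cell."""
    def __init__(self, size: int = 1000, increment: int = 100):
        self.window = size
        self.increment = increment

    def can_send(self) -> bool:
        return self.window > 0

    def on_cell_sent(self) -> None:
        self.window -= 1          # one data cell packaged onto the circuit

    def on_sendme(self) -> None:
        self.window += self.increment  # acknowledgement reopens capacity

w = CircuitWindow()
for _ in range(1000):
    w.on_cell_sent()
assert not w.can_send()   # window exhausted: traffic pauses
w.on_sendme()
assert w.can_send()       # acknowledgement lets traffic flow again
```

Stream-level throttling works the same way, just with a smaller per-stream window, so one greedy stream cannot starve the whole circuit.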

Other Security and Design Considerations

Security is crucial for Tor’s success and reliability, so it was more than just an afterthought and was incorporated into the design from the beginning. Of course, with Tor being a low-latency solution to anonymization, there has to be a trade-off between how much security can actually be incorporated into the design and the pre-defined set of performance goals.

The first design issue has to do with streams and differences in the DNS (Domain Name System) resolution process. Some applications pass hostnames to the Tor client in alphanumeric notation, while others resolve them to an IP address first and then pass that to the Tor client. The latter case can potentially expose the user and reveal their location, since the query to the remote DNS server travels outside the Tor circuit. If that happens, it diminishes the very reasons one would want to use the Tor network in the first place. Luckily, there is a relatively easy fix: using a service such as Privoxy, which always passes hostnames to the Tor client and eliminates DNS leaks.
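The pitfall is easy to see in a toy model. Nothing below does real networking; the function, the event names, and the placeholder address are all made up for illustration:

```python
def fetch_via_tor(url_host: str, resolve_locally: bool):
    """Toy model of the DNS-leak pitfall. If the application resolves
    the hostname itself, the DNS query escapes the Tor tunnel and is
    visible to the ISP; passing the raw hostname lets the exit node
    resolve it inside the circuit instead."""
    events = []
    if resolve_locally:
        events.append(("local-dns-query", url_host))  # visible to the ISP!
        target = "203.0.113.5"  # pretend resolved address (documentation IP)
    else:
        target = url_host       # hostname travels inside the circuit
    events.append(("tor-connect", target))
    return events

leaky = fetch_via_tor("example.onion", resolve_locally=True)
safe = fetch_via_tor("example.onion", resolve_locally=False)
assert ("local-dns-query", "example.onion") in leaky
assert all(e[0] != "local-dns-query" for e in safe)
```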

Integrity checks are a very important addition to Tor. Their purpose is to ensure that the data entering the virtual circuit remains the same when it exits. Every time a user negotiates a key with an OR, both parties initialize a SHA-1 digest with that key; hashes are commonly used in this manner. The digest is then incrementally updated with the contents of every new relay cell, and the first 4 bytes of the current digest travel with each cell, so any modification along the circuit is detected. This principle defeats a sizeable array of attacks, including data spoofing.
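Here is roughly what that running digest looks like in Python, using `hashlib` (a simplification of the actual relay-cell handling; the class name is mine):

```python
import hashlib

class RelayDigest:
    """Running SHA-1 digest seeded with the negotiated key. Each relay
    cell carries the first 4 bytes of the digest's current state, which
    depends on every cell seen so far."""
    def __init__(self, session_key: bytes):
        self.sha = hashlib.sha1(session_key)

    def tag(self, cell_payload: bytes) -> bytes:
        self.sha.update(cell_payload)      # fold this cell into the history
        return self.sha.digest()[:4]       # 4-byte integrity tag for the cell

key = b"\x01" * 16
sender, receiver = RelayDigest(key), RelayDigest(key)

cell = b"relay data cell"
tag = sender.tag(cell)
assert receiver.tag(cell) == tag           # unmodified cell verifies

tampered = b"relay data cell!"             # spoofed in transit
assert RelayDigest(key).tag(tampered) != tag
```

Because the digest is cumulative, an attacker cannot replay, reorder, or alter cells without the 4-byte tags diverging at the verifying endpoint.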

Another security characteristic is the implementation of so-called exit policies. Every OR has one, and its functionality is defined by the type of exit policy applied to it. Tor differentiates among three types: open, middleman, and private. Open exit nodes can connect anywhere, middleman nodes are used solely for relaying traffic among nodes, and private nodes only allow connections to a local host or local network. These policies are applied by the nodes’ administrators; remember, Tor is a volunteer-based network, so no node can be forced into operating in ways its operator does not want.
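A toy version of how an exit node might consult its policy. Real Tor policies are ordered accept/reject rules over address and port ranges; the three coarse categories (and the address prefixes used for "private") are a simplification for illustration:

```python
def allows(policy: str, host: str, port: int) -> bool:
    """Decide whether an exit node with the given policy type may open
    a connection to host:port (toy model of the three policy types)."""
    if policy == "open":
        return True                        # may exit to anywhere
    if policy == "middleman":
        return False                       # relays only, never exits
    if policy == "private":
        # only local-network destinations (illustrative prefixes)
        return host.startswith(("127.", "10.", "192.168."))
    raise ValueError(f"unknown policy type: {policy}")

assert allows("open", "example.com", 443)
assert not allows("middleman", "example.com", 443)
assert allows("private", "192.168.1.10", 80)
assert not allows("private", "example.com", 80)
```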

The last components discussed here are directory servers. These are a set of well-known and trusted nodes whose role is to keep track of the Tor network’s topology, including node states, keys, and exit policies. Every directory server acts similarly to an HTTP server, enabling clients to obtain data and allowing other ORs to refresh their data and state. Whenever a new node wants to get on the list, it has to be approved by an administrator to ensure its validity and intent; the last thing anyone wants is to see the directory service compromised.

Wrap-up

I tricked myself into thinking that a somewhat broad piece of research could be condensed into a nice and short blog post. I do believe I have covered the important mechanisms giving life to Tor, even though I had to leave out a couple of sections, such as Tor’s hidden services (a very cool component) and the Tor browser. The latter, when used correctly, is hands down a great tool for accessing the dark web and browsing anonymously. I strongly believe it is a great choice for browsing the web without showing the big brothers where your curious fingers take you. The good news is that since Tor is an open-source project, it is continuously being improved as the community strives to better its performance and patch it up as soon as any vulnerabilities are discovered.

Thanks for reading!

Sources:

[1] Dingledine et al., “Tor: The Second-Generation Onion Router.” svn.torproject.org/svn/projects/design-paper/tor-design.pdf. Accessed 4 April 2017.
