DC- Chapter - 1 : Distributed Computing: An Introduction



1.1         Definition and introduction to distributed computing

A Distributed Computing System is a collection of systems which are distributed at different remote sites and connected using a common network like LAN, WAN etc and are capable of collaborating a task.

Earlier computing was performed on a single processor or uni-processor called as monolithic computing, which makes use of a single CPU processor to execute one or more programs for each application.

Computers are considered to be independent if they don’t share memory or program execution space with each other. Such computers are called loosely coupled computers. And opposite to that, we have tightly coupled computers which share memory space with each other.

The computer programs running on independent computers collaborate with each other to perform computing such as network services and web-based applications.

Network services are the services that run on a network provided by a special kind of program called a server. Examples include WWW, e-mail and FTP applications.
Whereas a network application is an application that runs on a network for end users like online shopping carts, electronic auction sites, chatrooms and network games.

1.2 History of distributed computing

In the beginning there were stand alone computers each of which was capable of executing stored programs. They needed to be connected using cables to exchange information with each other.

Then came the Internet Request for Comments (RFC), RFC1, which specifies how participating hosts can exchange information with each other through messages, whereas there may have been individual attempts to create network applications, where the earliest one was e-mail in which the message was sent in 1972 using a four node ARPANET. After that, the automated file transfer mechanism was introduced, which allowed data files to be exchanged between two or more hosts. To this day e-mail and file transfer mechanisms remain two of the most popular network services.

After that, came the best known network service which was the World Wide Web Server (WWW), which could allow the exchange of hypertexts (Web pages) over the network. The WWW has since become the common platform for network applications and services including emails, search engines, e-commerce etc.

The web was originally conceived in the 1980s by scientists at the Swiss research Institute, CERN in GENEVA. The WWW was responsible for an explosion in the scale of the internet. Until 1990 ARPANET was the primary data network used by scientists, researchers and learners. But after 1990, WWW completely took over ARPANET as the primary data network throughout the world.

1.3 Different forms of computing

There are basically 4 forms of computing: Monolithic, Distributed, Parallel and Cooperative computing.

Monolithic or Centralized Computing

It is the simplest form of computing where a single computer such as a PC is used for computing. The computer is not connected to any network and thus it can use only those resources available locally inside that computer.

Here the computing is performed on a single processor or uni-processor, which makes use of a single CPU to execute one or more programs for each application and most of the times used by only one user at a time.

Multiple users can also engage in monolithic computing, where the users share the resources of a single computer. This technique is known as timesharing. The computer that provides the centralized resource is usually called as a mainframe to differentiate it from smaller computers such as minicomputers and microcomputers. The devices called as terminals are used to interact with the central mainframe computers.

Distributed Computing

In contrast we have distributed computing which involves computing performed among multiple network-connected computers, each of which has their own processor and resources. In this method whenever a task is given it is shared among all the computers connected in the network. Here, the user on one computer can also access the resources from some other remote computer connected in the network. The WWW is a good example for this type of computing where on requesting a file or web page from the browser, the file may be fetched from yet another remote server or computer.


Fig 1.1 Centralized vs. Distributed Computing
 
Parallel Computing

Similar to but distinct from distributed computing is a form of computing known as parallel computing or parallel processing, which uses more than one processor simultaneously to execute a single program. Parallel computing is typically performed on a single computer that has multiple CPUs, but it is also possible to perform parallel processing by connecting the computers in a network. However, this type of parallel processing requires very sophisticated software called distributed processing software.

Cooperative Computing

Recently, the term distributed computing has also been applied to cooperative computing projects such as the Search for Extraterrestrial Intelligence (SETI) and distributed.net.  These are projects that parcel out large-scale computing to workstations on Internet hosts, making use of surplus CPU cycles. Here, whenever a task is given to a computer, if the task is very big, it can share it with another computer which is having a small task to complete and hence the name cooperative computing.

1.4 Strengths and Weaknesses of Distributed Computing

There are a number of reasons for the popularity of distributed computing (Strengths):

1) The affordability of computers and availability of network access:

Today's personal computer has computing power superior to that of the mainframe computers of the early days, at a fraction of the size and the cost. Coupled with the fact that connectivity to the Internet has become universally available and generally affordable, the large number of interconnected computers makes for an ideal community for distributed computing.

2) Information Sharing among Distributed Users: Distributed Computing provides efficient person-to-person communication facility by sharing information over great distances. In a distributed computing system, information generated by one of the users can be easily and efficiently shared by the users working at other nodes of the system.
For example, a user can access his bank account details from one branch of a bank, located in his current state from another state’s branch.

3) Resource sharing: Using distributed computing, organizations can pool their resources very effectively. Different resources like information, softwares and hardware resources like printers, antivirus applications etc can be effectively shared in a distributed environment. The Web, for example, is a powerful platform for sharing documents and other resources within and among organizations.

4) Fault tolerance: Compared to monolithic computing, distributed computing provides the opportunity for fault tolerance in that a resource can be replicated (or mirrored) to sustain its availability in the presence of failures. For example, backup copies of a database can be maintained on different systems on the network, so that when one system fails, other copies can be accessed without disrupting the service.
5) Scalability: With monolithic computing, the available resources are limited to the capacity of one computer. By contrast, distributed computing provides scalability in that increasing demand for resources can be addressed effectively with additional resources. For example, more computers providing a service such as email can be added to the network to satisfy an increase in the demand for that service.

6) Shorter response Times and Higher Throughput: Due to multiplicity of processors, distributed computing systems are expected to have better performance and provide faster results than single-processor centralized systems.

In any form of computing, there is always some or the other issues or weaknesses.
Some of the most significant weaknesses in distributed computing are:

1) Multiple points of failure: There are more points of failure in distributed computing. Since multiple computers are involved, all of which depend on the network for communication, the failure of one or more computers, or one or more network links, can spell trouble for a distributed computing system. Thus in a distributed system, "the failure of a computer you didn't even know existed can make your own computer unusable."

2) Security concerns: In a distributed system, there are more opportunities for security breaches and unauthorized attacks, whereas in a centralized system all the computers and resources are typically under the control of a single administration and hence less vulnerable to attacks.

3) Maintenance: As there are many computers connected to each at different remote sites, it is difficult as well as expensive to maintain the distributed system.

1.5 Basics of Operating Systems

Distributed computing involves programs running on multiple computers. Let us look at some of the concepts involved with the execution of programs in modern-day computers.

1.5.1 Computer Programs and Processes

A software program is an artifact constructed by a software developer using some form of programming language. Typically, the language is a high-level one that requires a compiler or an interpreter to translate it into machine language.

When a program is "run" or executed on a computer, it is represented as a process. On modern computers, a process consists of an executing program, its current values, state information, and the resources used by the operating system to manage the execution of the program. In other words, a process is dynamic entity that exists only when a program is run.

Figure 1.2 illustrates the state transitions during the lifetime of a process. A process enters a ready state when a program is at the start of its execution, where it is placed in a queue by the operating system, along with other programs that are to be executed.


   Fig 1.2 A simplified state transition diagram of a process
 When system resources (such as the CPU) are available for its actual execution, the process is dispatched, at which point it enters the running state. It continues to execute until the process must wait for the occurrence of an event (such as the completion of some input/output operation), at which time it enters a blocked state. Once the anticipated event occurs, the process will be placed on the execution queue and await its turn to execute once again. The process repeats the ready-running-blocked cycle for as many times as necessary until the execution of the process is completed, at which time the process is said to be terminated.

1.5.2 Concurrent Programming

Distributed computing involves concurrent programming, which is programming that involves the simultaneous execution of processes.

There are basically three kinds of concurrent programming.

1) Concurrent processes executed on multiple computers

In this method the processes interact with each other by exchanging data over the network, but their execution is otherwise completely independent. When you access a Web page using a browser, a process of the browser program, running on your machine interacts with a process running on the Web server machine.
Concurrent programming involving multiple machines requires programming support. That is, the software for the participating program must contain the logic to support the interaction between processes.

2) Concurrent processes executed on a single computer

Modern computers are supported by multitasking operating systems, which allow multiple task or processes, to be executed concurrently. The concurrency may be real or virtual. True concurrent multitasking on a single computer is feasible only with computers which has multiple CPUs, so that each CPU can execute a separate process. On a computer that has only one CPU, timesharing or time-slicing is used to allow processes to take turns being executed and thus creating an illusion that they are being executed in parallel.

  3) Concurrent programming in a process

In addition to concurrent programming in separate processes, it is often necessary for a single program to initiate tasks that are to be executed concurrently. For example, it may be necessary for a program to perform other tasks while waiting indefinitely for user input in one user interface window. It may also be desirable for a program to execute tasks in parallel, for performance reasons. Concurrent programming within a process is performed using two types of facilities provided by the operating system – Parent-child processes and Threads.

Parent and Child Processes

At run time, a process may spawn subordinate processes, or child processes. Through real or virtual multitasking, the original process, called the parent process, continues to run simultaneously with the child processes. A child process is a complete process, consisting of an executing program, its own current values, and state information, some of which is inherited from the parent process.



Fig 1.3 Concurrent Processing within a process

Threads

Instead of child processes, a process can generate threads, also known as light-weight processes. Threads carry a minimum of state information, but otherwise behave the same as processes. Since they incur less overhead, threads are preferred over child processes.
             The spawning and coordination of child threads requires programming support. The software for the program must be written to contain logic to support the spawning of the threads and to coordinate, or synchronize, the execution of the family of threads spawned by the parent thread.

The concurrent execution of threads may result in a race condition. A race condition occurs when a series of commands in a program are executed in parallel in an arbitrarily interleaved fashion, yielding nondeterministic execution outcome.
Race conditions can be avoided if mutual exclusion is provided to a code segment to ensure that the commands in the segment can only be executed by one thread at a time. Such a code segment is called a critical region.

Programming using threads is called multi-threaded programming. A multi-threaded program that is written to guard against race conditions is said to be thread-safe.

Java Threads

The Java Virtual Machine enables an application to have multiple threads of execution running concurrently. When a Java Virtual machine starts up, there is usually a single thread that typically calls the method named main of some designated class, such as the class of an application that you wrote. Additional threads can be spawned from an active thread, and each thread will run independently and in parallel with other threads until it terminates.
To support threading in a program, Java provides a class named Thread as well as an interface named Runnable interface.

From within a Java program, there are two ways to create a new thread of execution:

     1.  Declare a class to be a subclass of Thread. This subclass should  override the run method of class Thread. When an instance of the subclass is allocated and started, the code in the run method is executed concurrently with the main method

2.   Declare a class that implements the Runnable interface. That class implements the  run method of the interface. When an instance of the class is allocated and started, the code in the run method is executed concurrently with the main thread.


1.6 Network Basics

Having looked at some key concepts in operating systems that are relevant to distributed computing, next we will do the same with network basics.

1.6.1 Protocols

In the context of communications, a protocol is a set of rules that must be observed by participants, while communicating over a network involving computers.

For each protocol, there must be rules that specify the following:
• How is the data exchange encoded?
• How are events (sending, receiving) synchronized or ordered so that the participants can send and receive in a coordinated manner?

For example, Hypertext Transfer Protocol (HTTP) specifies the rules that must be observed between a Web browser process and a Web server process. Any Web server program written in conformance to these rules satisfies the protocol, regardless of what programming language or syntax is employed.

1.6.2 Network Architecture Protocols

The Internet Protocol is mainly used for identifying computers on the network and for routing the data.
At the transport layer, there are two widely used protocols: The Transmission Control Protocol (TCP) provides connection-oriented communication, while the User Datagram Protocol (UDP) supports connectionless communication.
The TCP and protocols of the Internet and transport layers of this architecture are universally employed for data communication over the Internet.
Finally, at the application layer, protocols such as File Transfer Protocol (FTP), Simple Network Mail Protocol (SNMP) and Hypertext Transmission Protocol (HTTP) are specified for network applications.

1.6.3 Connection-Oriented versus Connectionless Communication

In connection-oriented communication, a connection may be physical provided using hardware such as cables, modems, etc or logical using software that emulates a connection and it is established between two parties, the caller and the callee.
In Connection-oriented communication, once a connection is established, data can be sent repeatedly over the connection continuously until the session is over. Once the data reaches the destination, you will get an acknowledgement message, confirming the delivery of data. It also ensures that data packets are delivered safely and in order, along an established connection, at the cost of additional processing overhead.
As the name implies, connectionless communication involves no connection. Instead, data is sent a packet at a time. Each packet must be explicitly addressed by the sender to the receiver. Each packet takes a separate path from source to destination and there is no guarantee of the data delivery. The exchange continues until the session, is over.
Similarly at the transport layer of the TCP/IP suite, the User Datagram Protocol (UDP) is a connectionless protocol, while the Transmission Control Protocol (TCP) is a connection-oriented protocol.
 
Fig 1.5 Connection-Oriented versus Connectionless Communication
           
Connection Oriented
Connectionless
Addressing
Specified at connection time; there is no need to re-specify with each subsequent operation (send or receive).
Addressing is specified with each operation.
Connection overhead
There is an overhead for establishing a connection.
Not applicable.
Addressing overhead
No addressing overhead with each individual operation.
Overhead is incurred with each operation.
Data delivery order
The connection abstraction allows the IPC mechanism to maintain the order of delivering data packets.
The lack of a connection makes it difficult for the IPC facility to maintain delivery order.
Protocols
The mode of communication is appropriate for protocols that require exchange of a large stream of data and/or a large number of rounds of exchange.
The mode of communication is appropriate for protocols that exchange a small amount of data in a limited number of rounds of exchange

Table 1.1 Comparisons of Connection-Oriented and Connection less Interprocess Communication (IPC)
 1.6.4 Network Resources
Network resources refer to the resources that are available to the participants of a distributed computing community. For example, on the Internet the network resources include hardware such as computers (including Internet hosts and routers) and equipment (printers, facsimile machines, cameras, etc and software such as processes, email mailboxes, files, and Web documents. Important class of network resources is network services, such as the World Wide Web (WWW) and file transfer service, which are provided by specific processes running on computers.

1.6.5 Host Identification and Internet Protocol Addresses

Physically, the Internet is a gigantic mesh of network links and computers. Conceptually, the main arteries of the Internet are a set of high bandwidth network links that constitute the "backbone" of the network. Connected to the backbone are individual networks, each of which has a unique identifier. Computers with TCP/IP support, called Internet hosts, are linked individual networks. Through this system of "information highways" data be transmitted from a host H1 on network NI to another host H2 on network N2. To transfer data from within a program, it must be possible to uniquely identify the process that is to receive the data, similar to addressing the recipient of a data delivered by the postal service.

The internet consists of an hierarchy of networks, interconnected via a network backbone. Each network has a unique network address. Computers or hosts are connected to a network.  Each host has a unique ID within its network. Each process running on a host is associated with zero or more ports for data transmission. Fig 1.6 shows the architecture of an internet topology.
  

Fig 1.6 Networks Topology
Each computer is given a unique IP address by the Internet Assigned Numbers Authority (IANA) for uniquely identifying each compute on a network. In IP version 4 (IPV4), each address is 32 bit long. The address space accommodates 232 (4.3 billion) addresses in total. Addresses are divided into 5 classes (A through E)


In this class A, the first octet (8 bits) is used for the network address, and the remaining three octets (24 bits) represent a host address. Therefore, a Class A network can have up to 16,777,216 (256*256*256) devices.  The range of Class A is from 1.0.0.0 to 127.255.255.255.
Class B uses the first two octets for the network address and the last two octets for the host address. Class B networks can have up to a maximum of 65,536 (256*256) hosts. Class B addresses range from 128.0.0.0 to 191.255.255.255.
In Class C, the first, second, and third octets are used to denote the network address in Class C while the fourth octet denotes the host address. As a result, a Class C network can accommodate only 255 hosts. The range of Class C is from 192 to 223.
 Class D addresses are used for multicast and Class E addresses are reserved for experimental purposes.
Suppose the dotted-decimal notation for a particular Internet address is129.65.24.50.   The 32-bit binary expansion of the notation is as follows:
Since the leading bit sequence is 10, the address is a Class B address.  Within the class, the network portion is identified by the remaining bits in the first two bytes, that is, 00000101000001, and the host portion is the values in the last two bytes, or 0001100000110010.  For convenience, the binary prefix for class identification is often included as part of the network portion of the address, so that we would say that this particular address is at network 129.65 and then at host address 24.50 on that network.

1.6.6 Identifying Processes with Protocol Ports
Specifying the correct domain name or its corresponding IP address allows us to locate a computer or host on the Internet. But in network applications, data needs to be delivered to a specific process running on a computer. Thus we need a naming scheme to allow us to uniquely identify such a process. There are any numbers of possible schemes to do so. For example, one possibility is to make use of a unique process identifier (PID) assigned to the process by the operating system.
On the Internet, the protocol for process identification involves the use of a logical entity known as a protocol port, or a port for short. The transport layer in the Internet architecture is responsible for dispatching data to processes, and that two well-known protocols are in use at this layer: TCP and UDP. Each of these protocols uses a separate set of ports on each host for this purpose. A process that wishes to exchange data with another process using either TCP or UDP must be assigned one of the ports.
In the TCP and UDP protocols, protocol numbers 0 through 1023 (210) are reserved for well-known services known as well-known ports, these numbers are assigned by the Internet Assigned Numbers Authority (IANA) and on some systems these numbers can only be used by system processes or by programs executed by privileged users. Each popular network service, such as telnet, FTP, HTTP, or SMTP, is assigned one of these port numbers (23, 21, 80, and 25, respectively) .

1.6.7 Email Addresses
An email address takes the form of username@DomainName. For example, mliu@csc.calpoly.edu identifies the author of a particular book. When you send an email identifying that email address as the recipient, a mailer program on the IP host with the specified domain name delivers the email to the mailbox of the specified user on that system, in this case the author of a book.

1.6.8 URLs

Users of Web browsers are familiar with Uniform Resource Locators (URLs). When you enter a name string such as http://www.csc.calpoly.edu in the browser to visit a particular Web site, you are specifying a URL.
A URL is a naming scheme under the more general scheme known as Uniform Resource Identifiers (URIs). URIs are short strings that identify resources on the Web, including documents, images, downloadable files, services, and electronic mailboxes. The URI scheme provides a uniform way of addressing these resources under a variety of naming schemes used in individual application protocols such as HTTP, FTP, and Internet mail.
The Uniform Resource Name (URN) is a scheme specified by RFC2141 which provides persistent names within a namespace, thus allowing a permanent object to be mirrored over several known sites. Thus, if a site is unavailable, the object could be found or resolved at another site.
The general format of a URL can be written as:
<protocol>/ / <user>:<password>@<host-ID>:<port-number>/<directory path>
where
<protocol> is the exact but case-insensitive name of the application-layer protocol you wish to use to access the resource; for example, HTTP if you are attempting to access a Web browser.
<user>:<password> is for access authorization, if required by the protocol.
<host-ID> is the domain name or dotted-decimal IP address of the host that provides the service allowing you to access the protocol; for example, www.w3schools.com.
<port-number> is the transport-layer protocol port for the process that pro-vides the service on the remote host; for example, 80 (by default) for HTTP or Web servers.
<directory path> specifies where in the file system of the remote host the resource can be located. It can also be called the root of the document directory hierarchy; for example: w3shools/csc102/index.html.

A shortened form of a URL, termed a relative URL, can be used at  times, where if another file exists in that same directory called courses.html, then the URL courses.html can be named in that document instead of the full URL.

1.6.9 Extensible Name Service (XNS)

Extensible Name Service (XNS) is an Internet naming service managed by the XNS Public Trust Organization (XNSORG), an independent, open-forum organization. The service supports a naming scheme that allows a single, universal address to be used by a user to perform communications of all types - email phone, fax, Web pages, instant messaging, even postal mail.
As a naming and addressing service, XNS operates at a higher level than DNS. It is designed to resolve a name into the address of an Internet host computer. XNS is designed to resolve a universal address into any type of address on any type of communications network.
An XNS is a character string. There are three types of XNS names: personal names, business names, and general names, each of which starts with a unique leading character (=, @, and +, respectively) and each of which can contain up to 64 Unicode characters.

1.6.10 Name Resolution

Whenever a symbolic name is used to identify a resource, the name must be translated to the corresponding physical address in order to locate the resource. We have already seen that a domain name such as www.google.con, for an Internet host must be translated to the numerical address, say 129.65.123.7, of that particular computer. The process of the translation is called name resolution, or more simply, name lookup.
To perform name resolution, a database also called a directory or a registry must exist containing the mappings between symbolic names and physical names. If the namespace of a naming scheme is of a limited size, then it is possible to perform name resolution manually. In the case of the DNS or XNS, manual process cannot be done and a network service has to be provided to support online name resolution.
For the DNS, the name lookup service is provided by machines that are called DNS servers. A central authority maintains the name database and sees to that the database is distributed throughout the Internet to the DNS server. When a domain name is specified - whether entered into a browser or coded in a program being executed, the name is submitted to the nearest DNS server for resolution. If the nearest server does not have the mapping, that server forwards the request to another DNS server. The propagation of the request continues until the name is resolved, at which time the mapping is sent back to the process that originated the request.

1.7 Software Engineering Basics

1.7.1 Procedural versus Object-Oriented Programming

In building network applications, there are two main classes of programming languages: Procedural language and Object-oriented language.

Procedural languages, the C language being the primary example, use procedures to break down the complexity of the tasks of an application. For example, an application may be coded using a procedure (also called a function) to perform the input, another procedure to perform the computation, and a third procedure for generating the output.

Object-oriented languages, exemplified by Java, use objects to encapsulate the details. Each object simulates an object in real life, carrying state data as well as behaviors. State data is represented as instance data (in Java) or data members (in C++). Behaviors are represented as method.

1.7.2 The Unified Modeling language (UML)

An important step in software engineering is the production of artifacts, or documents, to record the conceptual design of the application being developed. For readability, these documents should be written using a universal set of notations and languages. The Unified Modeling Language (UML), developed by the Object Management Group (OMG) is such a facility. UML provides a common set of language and notations for specifying, visualizing, constructing and documenting the artifacts of software systems.

OMG-UML provides a rich set of tools for all aspects of software engineering, the coverage of which belongs in software engineering courses. UML makes use of UML class diagrams (and only a subset of them), for documenting the relationships of some of the Java classes that appear in our presentation.

Fig 1.7 A Subset of UML class diagrams

1.7.3 The Architecture of Distributed Applications

The idea of using a multilayer architecture to organize the functionalities of data network can be applied to distributed applications. Figure 1.20 presents an example of such an architecture.

Presentation
Application (Business) Logic
Services

Fig 1.8 The architecture of distributed applications
Using this architecture, the functionalities of a distributed application can be classified in three layers:

• The Presentation layer provides the user interface. For example, if the application is a shopping cart, this layer generates the set of Web pages that are viewable by a shopper using a browser.

The Application logic layer provides the computation for the application. This layer is also called the business logic layer for enterprise applications.
In a shopping cart application, this layer is responsible for tasks like credit verification and computing the dollar amounts of the orders, sales tax, and delivery cost.

The Service layer provides the underlying services needed to support the functionalities of the top two layers. Services may include data access facilities (such as a database management system), directory services for name lookups (such as the Domain Name Service), and interprocess communication.

4 comments:

  1. How do we download. Make it as pdf please

    ReplyDelete
  2. Please make it a pdf to download

    ReplyDelete
  3. Need DC NOTES PLEASE DM ME @MR._.SHEIK
    ON INSTAGRAM

    ReplyDelete
  4. I also want the DC notes.
    Is there anyone,who can provide me pdf's

    ReplyDelete

Whatsapp Button works on Mobile Device only

Start typing and press Enter to search