CaltechAUTHORS: Combined
https://feeds.library.caltech.edu/people/Martin-A-J/combined.rss
A Caltech Library Repository Feed
Last updated: Thu, 20 Jun 2024 19:43:17 -0700

An Axiomatic Definition of Synchronization Primitives
https://resolver.caltech.edu/CaltechCSTR:1982.5046-tr-82
Year: 1981
DOI: 10.1007/BF00261260
The semantics of a pair of synchronization primitives is characterized by three fundamental axioms: boundedness, progress, and fairness. The class of primitives fulfilling the three axioms is semantically defined. Unbuffered communication primitives, the symmetrical P and V operations, and the usual P and V operations are proved to be the three instances of this class. The definitions obtained are used to prove a series of basic
theorems on mutual exclusion, producer-consumer coupling, deadlock, and linear and circular arrangements of communicating buffer-processes. An implementation of P and V operations fulfilling the axioms is proposed.

The torus: an exercise in constructing a processing surface
https://resolver.caltech.edu/CaltechCSTR:1982.5047-tr-82
Year: 1982
DOI: 10.7907/cazcq-6fz54
A "Processing Surface" is defined as a large, dense, and
regular arrangement of processor and storage modules on a two-dimensional surface, e.g. a VLSI chip. A general method is described for distributing parallel recursive computations over such a surface. Scope rules enforcing
the "locality" of variables and procedure parameters are introduced in the programming language. These rules and a particular interconnection of the modules on the surface make it possible to transmit parameter and variable
values between modules without using extraneous communication actions.
The choice of the Processing Surface topology for binary recursive computations is discussed and a torus-like topology is chosen.https://resolver.caltech.edu/CaltechCSTR:1982.5047-tr-82The Design of a Self-timed Circuit for Distributed Mutual Exclusion
https://resolver.caltech.edu/CaltechCSTR:1983.5097-tr-83
Year: 1983
DOI: 10.7907/b2dbm-s0762
No Abstract.

A General Proof Rule for Procedures in Predicate Transformer Semantics
https://resolver.caltech.edu/CaltechCSTR:1983.5075-tr-83
Year: 1983
DOI: 10.7907/desx8-jhv04
Given a general definition of the procedure call based on the substitution rule for assignment, a general proof rule is derived for procedures with unrestricted value, result, and value-result parameters, and global variables in the body of the procedure. It is then extended for recursive procedures. Assuming that it has been proved that the body establishes a certain postcondition I, the "intention," for a certain precondition J, the proof rule that permits one to determine under which conditions a certain procedure call establishes the postcondition E, the "extension," is based on finding an "adaptation" A, as weak as possible, such that A ∧ I ⇒ E′ (E′ is derived from E by some substitution of parameter variables). It is preferable, but not essential, that the body be "transparent" for the value parameters, i.e., that the value parameters are not changed by the body.

A Characterization of Product-Form Queuing Networks
https://resolver.caltech.edu/CaltechAUTHORS:20190111-145202534
Year: 1983
DOI: 10.1145/322374.322378
Queuing network models have proved effective in the design and analysis of computing systems. The class of queuing network models having product-form solutions is amenable to efficient, general solution techniques. The purpose of this
paper is to characterize such queuing systems. With this characterization it will be easy to determine whether the product-form algorithms can be used to analyze a system.

On David Gries's plateau problem
https://resolver.caltech.edu/CaltechAUTHORS:20161130-142130832
Year: 1984
DOI: 10.1145/1005968.1005974
[no abstract]

The Probe: An Addition to Communication Primitives
https://resolver.caltech.edu/CaltechCSTR:1984.5124-tr-84
Year: 1984
DOI: 10.7907/w8azk-3fk36
No Abstract.

Fair Mutual Exclusion with Unfair P and V Operations
https://resolver.caltech.edu/CaltechCSTR:1984.5148-tr-84
Year: 1984
DOI: 10.7907/8nwds-15p23
No Abstract.

Distributed Mutual Exclusion on a Ring of Processes
https://resolver.caltech.edu/CaltechCSTR:1984.5080-tr-83
Year: 1984
DOI: 10.7907/t0t4e-aq296
A set of processes called "masters" share a critical section on a mutual exclusion basis. The servers communicate with each other in a ring. Three solutions to the mutual exclusion problem are presented. They all rely on the presence of a unique privilege in the ring. The notation used extends CSP input and output commands with a Boolean primitive, the "probe", which makes it possible to determine whether a communication action is pending on a channel. A master communicates only with its private "server". In the correctness proofs, the concept of "trace" is introduced, i.e., a total ordering of actions corresponding to a possible interleaving of the atomic actions of a concurrent computation.
[Note: report includes the date April 83/October 83 but published in 1984]

Networks of Machines for Distributed Recursive Computations
https://resolver.caltech.edu/CaltechCSTR:1984.5147-tr-84
Year: 1984
DOI: 10.7907/abxcf-t3r94
Distributed computations may be viewed as a set of communicating processes. If such a computation is to be executed by a multi-processor system, the processes have to be distributed over the processors and the communications have to be distributed over a network. This leads to the questions of load balancing and message routing. In this paper we consider distributed recursive computations and we propose a class of processor networks that admits a homogeneous distribution of processes and trivial routing. Furthermore, we identify a subclass that admits a planar embedding of the network.

Submicron Systems Architecture: Semiannual Technical Report
https://resolver.caltech.edu/CaltechCSTR:1985.5202-tr-85
Year: 1985
DOI: 10.7907/mrh1j-cjp65
No abstract available.

A Delay-insensitive Fair Arbiter
https://resolver.caltech.edu/CaltechCSTR:1985.5193-tr-85
Year: 1985
DOI: 10.7907/cchf5-w1g63
No Abstract. Note: report is dated June 1985, May 1986

A New Generalization of Dekker's Algorithm for Mutual Exclusion
https://resolver.caltech.edu/CaltechCSTR:1985.5195-tr-85
Year: 1985
DOI: 10.7907/0qedb-76g96
No abstract.

The Sync Model: A Parallel Execution Method for Logic Programming
https://resolver.caltech.edu/CaltechCSTR:1986.5221-tr-86
Year: 1986
DOI: 10.7907/brq3w-kj598
The Sync Model, a parallel execution method for logic programming, is proposed. The Sync Model is a multiple-solution data-driven model that realizes AND parallelism and OR parallelism in a logic program, assuming a message-passing multiprocessor system. AND parallelism is implemented by constructing a dynamic data flow graph of the literals in the clause body with an ordering algorithm. OR parallelism is achieved by adding special synchronization signals to the streams of partial solutions and synchronizing the multiple streams with a merge algorithm. The ordering algorithm and the merge algorithm are described. The merge algorithm is proved to be correct and therefore the Sync Model is proved complete, i.e., the execution of a logic program under the Sync Model generates all the solutions.

On Seitz' Arbiter
https://resolver.caltech.edu/CaltechCSTR:1986.5212-tr-86
Year: 1986
DOI: 10.7907/2gjaq-xex23
No Abstract.

Compiling Communicating Processes into Delay-Insensitive VLSI Circuits
https://resolver.caltech.edu/CaltechCSTR:1986.5210-tr-86
Year: 1986
DOI: 10.7907/pnf93-qxd46
No abstract available.

Submicron Systems Architecture: Semiannual Technical Report
https://resolver.caltech.edu/CaltechCSTR:1986.5235-tr-86
Year: 1986
DOI: 10.7907/my65t-e9565
No abstract available.

Self-Timed FIFO: An exercise in Compiling Programs into VLSI Circuits
https://resolver.caltech.edu/CaltechCSTR:1986.5211-tr-86
Year: 1986
DOI: 10.7907/jssn5-rbp39
No Abstract.

Submicron Systems Architecture: Semiannual Technical Report
https://resolver.caltech.edu/CaltechCSTR:1986.5220-tr-86
Year: 1986
DOI: 10.7907/rzewj-csb10
No abstract available.

A Synthesis Method for Self-Timed VLSI Circuits
https://resolver.caltech.edu/CaltechCSTR:1987.5256-tr-87
Year: 1987
DOI: 10.7907/649ae-we761
No abstract available.

Synthesis of Self-Timed Circuits by Program Transformation
https://resolver.caltech.edu/CaltechCSTR:1987.5253-tr-87
Year: 1987
DOI: 10.7907/cgpwa-2j421
No abstract available.

Submicron Systems Architecture: Semiannual Technical Report
https://resolver.caltech.edu/CaltechCSTR:1987.5240-tr-87
Year: 1987
DOI: 10.7907/7cknj-w1w80
No abstract available.

A message-passing model for highly concurrent computation
https://resolver.caltech.edu/CaltechAUTHORS:20161130-143104095
Year: 1988
DOI: 10.1145/62297.62360
[no abstract]

The architecture and programming of the Ametek series 2010 multicomputer
https://resolver.caltech.edu/CaltechAUTHORS:20161215-172443490
Year: 1988
DOI: 10.1145/62297.62302
During the period following the completion of the Cosmic Cube experiment [1], and while commercial descendants of this first-generation multicomputer (message-passing concurrent computer) were spreading through a community that includes many of the attendees of this conference, members of our research group were developing a set of ideas about the physical design and programming for the second generation of medium-grain multicomputers.
Our principal goal was to improve by as much as two orders of magnitude the relationship between message-passing and computing performance, and also to make the topology of the message-passing network practically invisible. Decreasing the communication latency relative to instruction execution times extends the application span of multicomputers from easily partitioned and distributed problems (e.g., matrix computations, PDE solvers, finite element analysis, finite difference methods, distant or local field many-body problems, FFTs, ray tracing, distributed simulation of systems composed of loosely coupled physical processes) to computing problems characterized by "high flux" [2] or relatively fine-grain concurrent formulations [3, 4] (e.g., searching, sorting, concurrent data structures, graph problems, signal processing, image processing, and distributed simulation of systems composed of many tightly coupled physical processes). Such applications place heavy demands on the message-passing network for high bandwidth, low latency, and non-local communication. Decreased message latency also improves the efficiency of the class of applications that have been developed on first-generation systems, and the insensitivity of message latency to process placement simplifies the concurrent formulation of application programs.
Our other goals included a streamlined and easily layered set of message primitives, a node operating system based on a reactive programming model, open interfaces for accelerators and peripheral devices, and node performance improvements that could be achieved economically by using the same technology employed in contemporary workstation computers.
By the autumn of 1986, these ideas had become sufficiently developed, molded together, and tested through simulation to be regarded as a complete architectural design. We were fortunate that the Ametek Computer Research Division was ready and willing to work with us to develop this system as a commercial product. The Ametek Series 2010 multicomputer is the result of this joint effort.

A Message-Passing Model for Highly Concurrent Computation
https://resolver.caltech.edu/CaltechCSTR:1988.cs-tr-88-13
Year: 1988
DOI: 10.7907/3sb8a-cvh96
No abstract available.

Syntax-Directed Translation of Concurrent Programs into Self-Timed Circuits
https://resolver.caltech.edu/CaltechCSTR:1988.cs-tr-88-14
Year: 1988
DOI: 10.7907/585wz-fra78
No abstract available.

Design of Synchronization Algorithms
https://resolver.caltech.edu/CaltechAUTHORS:20201008-131242613
Year: 1989
DOI: 10.1007/978-3-642-74884-4_13
In these notes we discuss the design of concurrent programs that consist of a set of communicating sequential processes. The processes communicate via shared variables and synchronize via semaphores. We present an axiomatic definition of semaphores, and prove properties about them. The split binary semaphore is introduced and it is shown how it can be used in constructing the synchronization part of concurrent processes in order to maintain a given synchronization condition.

Distributed Sorting
https://resolver.caltech.edu/CaltechCSTR:1989.cs-tr-90-06
Year: 1989
DOI: 10.7907/zaevr-tmm71
In this paper we present a distributed sorting algorithm, which is a variation on exchange sort, i.e., neighboring elements that are out of order are exchanged. We derive the algorithm by transforming a sequential algorithm into a distributed one. The transformation is guided by the distribution of the data over processes. First we discuss the case of two processes, and then the general case of one or more processes. Finally we propose a more efficient solution for the general case.

The Design of an Asynchronous Microprocessor
https://resolver.caltech.edu/CaltechCSTR:1989.cs-tr-89-02
Year: 1989
DOI: 10.7907/avec3-s7f02
No abstract available.

Programming in VLSI: From Communicating Processes to Delay-Insensitive Circuits
https://resolver.caltech.edu/CaltechCSTR:1989.cs-tr-89-01
Year: 1989
DOI: 10.7907/zmy86-a1w29
No Abstract.

The First Asynchronous Microprocessor: The Test Results
https://resolver.caltech.edu/CaltechCSTR:1989.cs-tr-89-06
Year: 1989
DOI: 10.7907/bsky8-c6128
No abstract available.

The design of an asynchronous microprocessor
https://resolver.caltech.edu/CaltechAUTHORS:20161130-144153368
Year: 1989
[no abstract]

The first asynchronous microprocessor: the test results
https://resolver.caltech.edu/CaltechAUTHORS:20161130-145428229
Year: 1989
DOI: 10.1145/71317.71324
We have designed the first entirely asynchronous (also called self-timed or delay-insensitive) microprocessor. The design was reported at the Decennial Caltech Conference on VLSI, last March. The conference paper is included here as an appendix. Since the chips had not yet been fabricated at the moment of writing the conference paper, the paper does not include the results of the experiment. The purpose of this note is to publish these results, which are quite remarkable because of the speed reached on this first design, and, as importantly, because of the surprising robustness of the chips to variations in temperature and VDD voltage values.

Asynchronous Circuits for Token-Ring Mutual Exclusion
https://resolver.caltech.edu/CaltechCSTR:1990.cs-tr-90-09
Year: 1990
DOI: 10.7907/47710-bts58
No abstract available.

Testing Delay-Insensitive Circuits
https://resolver.caltech.edu/CaltechCSTR:1990.cs-tr-90-17
Year: 1990
DOI: 10.7907/8274b-29b89
We show that a single stuck-at fault in a non-redundant delay-insensitive circuit results in a transition either not taking place or firing prematurely, or both, during an execution of the circuit. A transition not taking place can be tested easily, as this always prevents a transition on a primary output from taking place. A premature firing can also be tested but the addition of testing points may be required to enforce the premature firing and to propagate the transition to a primary output. Hence all single stuck-at faults are testable. All test sequences can be generated from the high-level specification of the circuit. The circuits are hazard-free in normal operation and during the tests.

Performance Analysis and Optimization of Asynchronous Circuits
https://resolver.caltech.edu/CaltechCSTR:1990.cs-tr-90-18
Year: 1990
DOI: 10.7907/b11q2-j0d17
We present a method for analyzing the time performance of asynchronous circuits, in particular those derived by program transformation from concurrent programs using the synthesis approach developed by the second author. The analysis method produces a performance metric (related to the time needed to perform an operation) in terms of the primitive gate delays of the circuit. Such a metric provides a quantitative means by which to compare competing designs. Because the gate delays are functions of transistor sizes, the performance metric can be optimized with respect to these sizes. For a large class of asynchronous circuits, including those produced by using our synthesis method, these techniques produce the global optimum of the performance metric. A CAD tool has been implemented to perform this optimization.

Limitations to Delay-Insensitivity in Asynchronous Circuits
https://resolver.caltech.edu/CaltechCSTR:1990.cs-tr-90-02
Year: 1990
DOI: 10.7907/gwkvs-p4122
No abstract available.

Distributed sorting
https://resolver.caltech.edu/CaltechAUTHORS:20170830-083320673
Year: 1990
DOI: 10.1016/0167-6423(90)90081-N
In this paper we present a distributed sorting algorithm, which is a variation on exchange sort, i.e., neighboring elements that are out of order are exchanged. We derive the algorithm by transforming a sequential algorithm into a distributed one. The transformation is guided by the distribution of the data over processes. First we discuss the case of two processes, and then the general case of one or more processes. Finally we propose a more efficient solution for the general case.

Asynchronous Datapaths and the Design of an Asynchronous Adder
https://resolver.caltech.edu/CaltechCSTR:1991.cs-tr-91-08
Year: 1991
DOI: 10.7907/j14fv-twh92
This paper presents a general method for designing delay-insensitive datapath circuits. Its emphasis is on the formal derivation of a circuit from its specification. We discuss the properties required in a code that is used to transmit data asynchronously, and we introduce such a code. We introduce a general method (in the form of a theorem) for distributing the evaluation of a function over a number of concurrent cells. This method requires that the code be "distributive." We apply the method to the familiar example of a ripple-carry adder, and we give a CMOS implementation of the adder.

Synthesis of Asynchronous VLSI Circuits
https://resolver.caltech.edu/CaltechCSTR:1991.cs-tr-93-28
Year: 1991
DOI: 10.7907/b9wzv-xrc02
No abstract available.

A 100-MIPS GaAs asynchronous microprocessor
https://resolver.caltech.edu/CaltechAUTHORS:TIEieeedtc94
Year: 1994
DOI: 10.1109/54.282444
The authors describe how they ported an asynchronous microprocessor previously implemented in CMOS to gallium arsenide, using a technology-independent asynchronous design technique. They introduce new circuits including a sense-amplifier, a completion detection circuit, and a general circuit structure for operators specified by production rules. The authors used and tested these circuits in a variety of designs.

An action system specification of the Caltech asynchronous microprocessor
https://resolver.caltech.edu/CaltechAUTHORS:20201124-174613902
Year: 1995
DOI: 10.1007/3-540-60117-1_9
The action system framework for modelling parallel programs is used to formally specify a microprocessor. First the microprocessor is specified as a sequential program. The sequential specification is then decomposed and refined into a concurrent program using correctness-preserving program transformations. Previously this microprocessor has been specified in a semi-formal manner at Caltech, where an asynchronous circuit for the microprocessor was derived from the specification. We propose a specification strategy that is based on the idea of spatial decomposition of the program variable space. Applying this strategy we give a completely formal derivation of a high level specification for the Caltech microprocessor. We also demonstrate the suitability of action systems and the stepwise refinement paradigm for formal VLSI circuit design.

Specifying the Caltech asynchronous microprocessor
https://resolver.caltech.edu/CaltechAUTHORS:20170409-083932724
Year: 1996
DOI: 10.1016/0167-6423(95)00023-2
The action systems framework for modelling parallel programs is used to formally specify a microprocessor. First the microprocessor is specified as a sequential program. The sequential specification is then decomposed and refined into a concurrent program using correctness-preserving program transformations. Previously this microprocessor has been specified at Caltech, where an asynchronous circuit for the microprocessor was derived from the specification. We propose a specification strategy that is based on the idea of spatial decomposition of the program variable space.

Slack elasticity in concurrent computing
https://resolver.caltech.edu/CaltechAUTHORS:20201210-161233167
Year: 1998
DOI: 10.1007/bfb0054295
We present conditions under which we can modify the slack of a channel in a distributed computation without changing its behavior. These results can be used to modify the degree of pipelining in an asynchronous system. The generality of the result shows the wide variety of pipelining alternatives presented to the designer of a concurrent system. We give examples of program transformations which can be used in the design of concurrent systems whose correctness depends on the conditions presented.

Submicron Systems Architecture: Semiannual Technical Report
https://resolver.caltech.edu/CaltechCSTR:1985.5178-tr-85
Year: 2001
DOI: 10.7907/7fbb9-smt37
No abstract available.

Delay-Insensitive Multiply-Accumulate Unit
https://resolver.caltech.edu/CaltechCSTR:1992.cs-tr-92-03
Year: 2001
DOI: 10.7907/Z9MG7MPP
[No abstract]

Submicron Systems Architecture Project: Semiannual Technical Report, 1 July 1992
https://resolver.caltech.edu/CaltechCSTR:1992.cs-tr-92-17
Year: 2001
DOI: 10.7907/Z9WS8RF5
The Mosaic C is an experimental fine-grain multicomputer based on single-chip nodes. The Mosaic C chip includes 64KB of fast dynamic RAM, processor, packet interface, ROM for bootstrap and self-test, and a two-dimensional self-timed router. The chip architecture provides low-overhead and low-latency handling of message packets, and high memory and network bandwidth. Sixty-four Mosaic chips are packaged by tape-automated bonding (TAB) in an 8 x 8 array on circuit boards that can, in turn, be arrayed in two dimensions to build arbitrarily large machines. These 8 x 8 boards are now in prototype production under a subcontract with Hewlett-Packard. We are planning to construct a 16K-node Mosaic C system from 256 of these boards. The suite of Mosaic C hardware also includes host-interface boards and high-speed communication cables. The hardware developments and activities of the past eight months are described in section 2.1.
The programming system that we are developing for the Mosaic C is based on the same message-passing, reactive-process, computational model that we have used with earlier multicomputers, but the model is implemented for the Mosaic in a way that supports fine-grain concurrency. A process executes only in response to receiving a message, and may in execution send messages, create new processes, and modify its persistent variables before it either exits or becomes dormant in preparation for receiving another message. These computations are expressed in an object-oriented programming notation, a derivative of C++ called C+-. The computational model and the C+- programming notation are described in section 2.2. The Mosaic C runtime system, which is written in C+-, provides automatic process placement and highly distributed management of system resources. The Mosaic C runtime system is described in section 2.3.

Tomorrow's Digital Hardware will be Asynchronous and Verified
https://resolver.caltech.edu/CaltechCSTR:1993.cs-tr-93-26
Year: 2001
DOI: 10.7907/Z9125QPR
Encouraged by the results of almost a decade of research and experimentation, we claim that tomorrow's design methods for digital VLSI will be based on a concurrent programming approach to high-level synthesis, asynchronous techniques, and correctness-preserving program transformations.

Submicron Systems Architecture: Semiannual Technical Report
https://resolver.caltech.edu/CaltechCSTR:1993.cs-tr-93-37
Year: 2001
DOI: 10.7907/Z9NS0RX7
[No abstract]

An Asynchronous Microprocessor in Gallium Arsenide
https://resolver.caltech.edu/CaltechCSTR:1993.cs-tr-93-38
Year: 2001
DOI: 10.7907/Z9BC3WJ5
In this paper, several techniques for designing asynchronous circuits in Gallium Arsenide are presented. Several new circuits were designed to implement specific functions necessary to the design of a full microprocessor. A sense-amplifier, a completion tree, and a general circuit structure for operators specified by production rules are introduced. These circuits were used and tested in a variety of designs, including two asynchronous microprocessors and two asynchronous static RAMs. One of the microprocessors runs at over 100 MIPS with a power consumption of 2 Watts.

Submicron Systems Architecture
https://resolver.caltech.edu/CaltechCSTR:1993.cs-tr-93-10
Year: 2001
DOI: 10.7907/4fh9g-yr824
The first attachment to this report, a paper titled "The Design of the Caltech Mosaic C Multicomputer," appeared in the March 1993 proceedings of the University of Washington Symposium on Integrated Systems. This paper describes the architecture, design, and programming of the Mosaic C multicomputer, and the status of the project as of December 1992. The following sections supplement the detailed information in this paper with reports on other and subsequent Mosaic-project activities and results. In addition, research efforts that are using the prototype Mosaic multicomputers for programming experiments are described in sections 3.1, 3.2, and 3.4.

Low-Energy Asynchronous Memory Design
https://resolver.caltech.edu/CaltechCSTR:1994.cs-tr-94-21
Year: 2001
DOI: 10.7907/Z9X9289S
We introduce the concept of energy per operation as a measure of performance of an asynchronous circuit. We show how to model energy consumption based on the high-level language specification. This model is independent of voltage and timing considerations. We apply this model to memory design. We show first how to dimension a memory array, and how to break up this memory array into smaller arrays to minimize the energy per access. We then show how to use cache memory and pre-fetch mechanisms to further reduce energy per access.

Quasi-Delay-Insensitive Circuits are Turing-Complete
https://resolver.caltech.edu/CaltechCSTR:1995.cs-tr-95-11
Year: 2001
DOI: 10.7907/Z9H70CV1
Quasi-delay-insensitive (QDI) circuits are those whose correct operation does not depend on the delays of operators or wires, except for certain wires that form isochronic forks. In this paper we show that quasi-delay-insensitivity, stability and noninterference, and strong confluence are equivalent properties of a computation. In particular, this shows that QDI computations are deterministic. We show that the class of Turing-computable functions have QDI implementations by constructing a QDI Turing machine.

ET^2: A Metric For Time and Energy Efficiency of Computation
https://resolver.caltech.edu/CaltechCSTR:2001.007
Year: 2001
DOI: 10.7907/Z9K935JZ
We investigate an efficiency metric for VLSI computation that includes energy, E, and time, t, in the form Et^2. We apply the metric to CMOS circuits operating outside velocity saturation, where energy and delay can be exchanged by adjusting the supply voltage; we prove that, under these assumptions, optimal Et^2 implies optimal energy and delay. We give experimental and simulation evidence of the range and limits of the assumptions. We derive several results about sequential, parallel, and pipelined computations optimized for Et^2, including a result about the optimal length of a pipeline. We discuss transistor sizing for optimal Et^2 and show that, for fixed, nonzero execution rates, the optimum is achieved when the sum of the transistor-gate capacitances is twice the sum of the parasitic capacitances, not for minimum transistor sizes. We derive an approximation for Et^n (for arbitrary n) of an optimally sized system that can be computed without actually sizing the transistors; we show that this approximation is accurate. We prove that when multiple, adjustable supply voltages are allowed, the optimal Et^2 for the sequential composition of components is achieved when the supply voltages are adjusted so that the components consume equal power. Finally, we give rules for computing the Et^2 of the sequential and parallel compositions of systems, when the Et^2 of the components are known.

Energy-Delay Efficiency of VLSI Computations
https://resolver.caltech.edu/CaltechAUTHORS:20161207-165804629
Year: 2002
DOI: 10.1145/505306.505330
In this paper we introduce an energy-delay efficiency metric that captures any trade-off between the energy and the delay of the computation.
We apply this new concept to the parallel and sequential composition of circuits in general and in particular to circuits optimized through transistor sizing. We bound the delay and energy of the optimized circuit and give necessary and sufficient conditions under which these bounds are reached. We also give necessary and sufficient conditions under which subcomponents of a design can be optimized independently so as to yield a global optimum when recomposed.
We demonstrate the utility of a minimum-energy function to capture high-level compositional properties of circuits. The use of this minimum-energy function yields practical insight into ways of improving the overall energy-delay efficiency of circuits.
https://resolver.caltech.edu/CaltechAUTHORS:20161207-165804629
Transistor Sizing of Energy-Delay-Efficient Circuits
https://resolver.caltech.edu/CaltechCSTR:2002.003
Year: 2002
DOI: 10.7907/Z9ZG6Q7T
This paper studies the problem of transistor sizing of CMOS circuits optimized for energy-delay efficiency, i.e., for optimal Et^n, where E is the energy consumption, t is the delay of the circuit, and n is a fixed positive optimization index that reflects the chosen trade-off between energy and delay. We propose a set of analytical formulas that closely approximate the optimal transistor sizes. We then study an efficient iteration procedure that can further improve the original analytical solution. Based on these results, we introduce a novel transistor sizing algorithm for energy-delay efficiency.
https://resolver.caltech.edu/CaltechCSTR:2002.003
Global and local properties of asynchronous circuits optimized for energy efficiency
https://resolver.caltech.edu/CaltechCSTR:2002.002
Year: 2002
DOI: 10.7907/Z9FJ2DSS
In this paper we explore global and local properties of asynchronous circuits sized for the energy-efficiency metric Et^2. We develop a theory that enables an abstract view of transistor sizing. These results allow us to accurately estimate circuit performance and compare circuit design choices at the logic-gate level without going through the costly sizing process. We estimate that sizing improves energy efficiency by a factor of 2 to 3.5 compared with a design optimized for speed.
https://resolver.caltech.edu/CaltechCSTR:2002.002
Speed and Energy Performance of an Asynchronous MIPS R3000 Microprocessor
https://resolver.caltech.edu/CaltechCSTR:2001.012
Year: 2002
DOI: 10.7907/Z99S1P11
This paper presents the speed and energy figures for an asynchronous implementation of a MIPS R3000 microprocessor. The design is almost entirely QDI and introduces a new fine-grained pipeline. The performance figures show that this design is four times as efficient as equivalent clocked designs and that its cycle time in FO4 units compares to that of high-performance dynamic pipelines.
https://resolver.caltech.edu/CaltechCSTR:2001.012
Transistor sizing of energy-delay-efficient circuits
https://resolver.caltech.edu/CaltechAUTHORS:20161207-170651411
Year: 2002
DOI: 10.1145/589411.589439
This paper studies the problem of transistor sizing of CMOS circuits optimized for energy-delay efficiency, i.e., for optimal Et^n, where E is the energy consumption, t is the delay of the circuit, and n is a fixed positive optimization index that reflects the chosen trade-off between energy and delay.
We propose a set of analytical formulas that closely approximate the optimal transistor sizes. We then study an efficient iteration procedure that can further improve the original analytical solution. Based on these results, we introduce a novel transistor sizing algorithm for energy-delay efficiency.
https://resolver.caltech.edu/CaltechAUTHORS:20161207-170651411
High-level synthesis of asynchronous systems by data-driven decomposition
https://resolver.caltech.edu/CaltechAUTHORS:20170109-145144866
Year: 2003
DOI: 10.1145/775832.775962
We present a method for decomposing a high-level program description of a circuit into a system of concurrent modules that can each be implemented as asynchronous pre-charge half-buffer pipeline stages (the circuits used in the asynchronous MIPS R3000 microprocessor). We apply it to designing the instruction fetch of an asynchronous 8051 microcontroller, with promising results. We discuss new clustering algorithms that will improve the performance figures further.
https://resolver.caltech.edu/CaltechAUTHORS:20170109-145144866
An Architecture for Asynchronous FPGAs
https://resolver.caltech.edu/CaltechCSTR:2003.006
Year: 2003
DOI: 10.7907/Z9X9288B
We present an architecture for a quasi-delay-insensitive asynchronous field-programmable gate array. The logic cell is a complete asynchronous pipeline stage and the interconnects are entirely delay-insensitive, eliminating all timing issues from the place-and-route procedure.
https://resolver.caltech.edu/CaltechCSTR:2003.006
An Architecture for Asynchronous FPGAs
https://resolver.caltech.edu/CaltechCSTR:2003.006a
Year: 2003
We present an architecture for a quasi-delay-insensitive asynchronous field-programmable gate array. The logic cell is a complete asynchronous pipeline stage and the interconnects are entirely delay-insensitive, eliminating all timing issues from the place-and-route procedure.
https://resolver.caltech.edu/CaltechCSTR:2003.006a
Can asynchronous techniques help the SoC designer?
https://resolver.caltech.edu/CaltechAUTHORS:20110722-095429816
Year: 2006
DOI: 10.1109/VLSISOC.2006.313284
As technological advances make it possible to integrate an entire system on a single die, the designer of a system-on-chip (SoC) is confronted with increasing difficulties concerning complexity, reliability, energy and power consumption, and clock distribution. All those issues are aggravated by increasing parameter variability resulting from the same technological advances. This paper argues that, because of the quasi-independence of asynchronous (QDI) circuits from timing, asynchronous logic alleviates the problems posed by parameter variability and eliminates the clock-distribution problem altogether. Furthermore, as some researchers attempt to turn the liability into an asset by exploiting parameter variability for truly probabilistic computation, the flexibility and time-independence of asynchronous logic could be a natural match.
https://resolver.caltech.edu/CaltechAUTHORS:20110722-095429816
Slack Matching Quasi Delay-Insensitive Circuits
https://resolver.caltech.edu/CaltechAUTHORS:20110225-095524705
Year: 2006
DOI: 10.1109/ASYNC.2006.27
Slack matching is an optimization that determines the amount of buffering that must be added to each channel of a slack-elastic asynchronous system in order to reduce its cycle time to a specified target. We present two methods of expressing the slack matching problem as a mixed-integer linear-programming problem. The first method is applicable to systems composed of either full-buffers or half-buffers but not both. The second method is applicable to systems composed of any combination of full-buffers and half-buffers.
https://resolver.caltech.edu/CaltechAUTHORS:20110225-095524705
Asynchronous techniques for system-on-chip design
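The optimization described in the slack matching abstract can be illustrated with a toy exhaustive search (a sketch only, not the paper's MILP formulation; the loop latencies, slack counts, buffer latency, and cycle-time target below are made-up numbers):

```python
from itertools import product

# Toy slack matching: each channel receives an integer number of buffers;
# every added buffer contributes forward latency LAT and one unit of slack.
# A loop meets the target cycle time T when (loop latency) / (loop slack) <= T.
LAT, T = 1.0, 2.0
loops = [  # each loop: (base latency, base slack, channel indices on the loop)
    (6.0, 2, (0, 1)),
    (4.0, 1, (1, 2)),
]

def meets_target(bufs):
    for base_lat, base_slack, chans in loops:
        extra = sum(bufs[c] for c in chans)
        if (base_lat + LAT * extra) / (base_slack + extra) > T:
            return False
    return True

# Minimize the total number of added buffers subject to the cycle-time target.
best = min((b for b in product(range(6), repeat=3) if meets_target(b)),
           key=sum)
```

Here placing both buffers on channel 1, which lies on both loops, satisfies both constraints at once; an MILP solver finds the same structure without enumeration.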
https://resolver.caltech.edu/CaltechAUTHORS:MARprocieee06
Year: 2006
DOI: 10.1109/JPROC.2006.875789
SoC design will require asynchronous techniques, as the large parameter variations across the chip will make it impossible to control delays in clock networks and other global signals efficiently. Initially, SoCs will be globally asynchronous and locally synchronous (GALS). But the complexity of the numerous asynchronous/synchronous interfaces required in a GALS system will eventually lead to entirely asynchronous solutions. This paper introduces the main design principles, methods, and building blocks for asynchronous VLSI systems, with an emphasis on communication and synchronization. Asynchronous circuits whose only delay assumption is that of isochronic forks are called quasi-delay-insensitive (QDI); QDI is used in the paper as the basis for asynchronous logic. The paper discusses asynchronous handshake protocols for communication and the notions of validity/neutrality tests and completion trees. Basic building blocks for sequencing, storage, function evaluation, and buses are described, and two alternative methods for the implementation of an arbitrary computation are explained. Issues of arbitration and synchronization play an important role in complex distributed systems and especially in GALS. The two main asynchronous/synchronous interfaces needed in GALS, one based on a synchronizer and the other on a stoppable clock, are described and analyzed.
https://resolver.caltech.edu/CaltechAUTHORS:MARprocieee06
Asynchronous Nano-Electronics: Preliminary Investigation
https://resolver.caltech.edu/CaltechAUTHORS:20100722-151724013
Year: 2008
DOI: 10.1109/ASYNC.2008.22
This paper is a preliminary investigation into implementing asynchronous QDI logic in molecular nano-electronics, taking into account the restricted geometry, the lack of control over transistor strengths, and the high timing variations. We show that the main building blocks of QDI logic can be successfully implemented; we illustrate the approach with the layout of an adder stage. The proposed techniques to improve the reliability of QDI apply to nano-CMOS as well.
https://resolver.caltech.edu/CaltechAUTHORS:20100722-151724013
A Necessary and Sufficient Timing Assumption for Speed-Independent Circuits
https://resolver.caltech.edu/CaltechAUTHORS:20100506-101621122
Year: 2009
DOI: 10.1109/ASYNC.2009.27
This paper presents a proof that the adversary path timing assumption is both necessary and sufficient for correct speed-independent (SI) circuit operation. This assumption requires that the delay of a wire on one branch of a fork be less than the delay through a gate sequence beginning at another branch of the same fork. Both the definition of the timing assumption and the proof build on a general, formal notion of computation given with respect to production-rule sets. This underlying framework can be used for a variety of proof efforts or as a basis for defining other useful notions involving asynchronous computation.
https://resolver.caltech.edu/CaltechAUTHORS:20100506-101621122
Asynchronous logic for high variability nano-CMOS
https://resolver.caltech.edu/CaltechAUTHORS:20170320-175344479
Year: 2009
DOI: 10.1109/ICECS.2009.5410925
At the nanoscale, parameter variations in fabricated devices cause extreme variability in delay. Delay variations are also the main issue in subthreshold operation. Consequently, asynchronous logic seems an ideal, and probably unavoidable, choice for the design of digital circuits in nano-CMOS or other emerging technologies. This paper examines the robustness of one particular asynchronous logic: quasi-delay-insensitive, or QDI. We identify the three components of this logic that can be affected by extreme variability: staticizers, isochronic forks, and rings. We show that staticizers can be eliminated, and that isochronic forks and rings can be made arbitrarily robust to timing variations.
https://resolver.caltech.edu/CaltechAUTHORS:20170320-175344479
A Distributed Implementation Method for Parallel Programming
https://resolver.caltech.edu/CaltechAUTHORS:20120418-114041991
Year: 2012
DOI: 10.7907/c1h76-gdn90
A method is described for implementing, on a finite network of processing "cells" called the "implementation graph", programs whose potential parallelism is not fixed by the implementation but varies according to the input parameters. First, programming constructs are described that permit a computation, regarded as a dynamic structure called the "computation graph", to diffuse through the implementation graph. Second, the implementation problem of mapping an unbounded number of computation nodes onto a finite number of cells is tackled. Processor allocation and message buffering completely disappear from the programmer's concerns. The mechanism proposed can be considered a generalization of the stack mechanism.
https://resolver.caltech.edu/CaltechAUTHORS:20120418-114041991
25 Years Ago: The First Asynchronous Microprocessor
https://resolver.caltech.edu/CaltechAUTHORS:20140206-111915844
Year: 2014
DOI: 10.7907/Z9QR4V3H
Twenty-five years ago, in December 1988, my research group at Caltech submitted the world's first asynchronous ("clockless") microprocessor design for fabrication to MOSIS. We received the chips in early 1989; testing started in February 1989. The chips were found fully functional on first silicon. The results were presented at the Decennial Caltech VLSI Conference in March of the same year. The first entirely asynchronous microprocessor had been designed and successfully fabricated. As the technology finally reaches industry, and with the benefit of a quarter century of hindsight, here is a recollection of this landmark project.
https://resolver.caltech.edu/CaltechAUTHORS:20140206-111915844
A Compact Transregional Model for Digital CMOS Circuits Operating Near Threshold
https://resolver.caltech.edu/CaltechAUTHORS:20141106-133106801
Year: 2014
DOI: 10.1109/TVLSI.2013.2282316
Power dissipation is currently one of the most important design constraints in digital systems. In order to reduce power and energy demands in the foremost technology, namely CMOS, it is necessary to reduce the supply voltage to near the device threshold voltage. Existing analytical models for MOS devices are either too complex, thus obscuring the basic physical relations between voltages and currents, or they are inaccurate and discontinuous around the region of interest, i.e., near threshold. This paper presents a simple transregional compact model for analyzing digital circuits around the threshold voltage. The model is continuous, physically derived (by way of a simplified inversion-charge approximation), and accurate over a wide operational range: from a few times the thermal voltage to approximately twice the threshold voltage in modern technologies.
https://resolver.caltech.edu/CaltechAUTHORS:20141106-133106801
Quantifying Near-Threshold CMOS Circuit Robustness
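The continuity claimed in the transregional-model abstract can be illustrated with an EKV-style interpolation (an assumption standing in for the paper's own inversion-charge model; the parameter values are illustrative, not from the paper):

```python
import math

# Illustrative parameters (assumptions): thermal voltage, subthreshold
# slope factor, specific current, threshold voltage.
PHI_T, N, I_S, V_T = 0.026, 1.3, 1e-6, 0.4

def drain_current(vgs):
    # One expression valid from subthreshold to strong inversion:
    # log(1+e^x) ~ e^x below threshold (exponential law) and ~ x above
    # it (square law), with a smooth transition near vgs = V_T.
    x = (vgs - V_T) / (2 * N * PHI_T)
    return I_S * math.log(1 + math.exp(x)) ** 2
```

Unlike piecewise models, this form has no discontinuity in the current or its derivative at the region boundary, which is the property the paper needs for analyzing circuits operated near threshold.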
https://resolver.caltech.edu/CaltechAUTHORS:20141125-133400175
Year: 2014
DOI: 10.7907/Z9M043CG
In order to build energy-efficient digital CMOS circuits, the supply voltage must be reduced to near threshold. Problematically, due to random parameter variation, supply scaling reduces circuit robustness to noise. Moreover, the effects of parameter variation worsen as device dimensions diminish, further reducing robustness and making parameter variation one of the most significant hurdles to continued CMOS scaling. This paper presents a new metric to quantify circuit robustness with respect to variation and noise, along with an efficient method of calculation. The method relies on the statistical analysis of standard cells and memories, resulting in an extremely compact representation of robustness data. With this metric and method of calculation, circuit robustness can be included alongside energy, delay, and area during circuit design and optimization.
https://resolver.caltech.edu/CaltechAUTHORS:20141125-133400175
DD1: A QDI, Radiation-Hard-by-Design, Near-Threshold 18uW/MIPS Microcontroller in 40nm Bulk CMOS
https://resolver.caltech.edu/CaltechAUTHORS:20160901-124814686
Year: 2015
DOI: 10.1109/ASYNC.2015.15
This paper describes DD1, an asynchronous radiation-hard 8-bit AVR® microcontroller (MCU) implemented in TSMC 40LP, a low-power bulk 40nm CMOS process. Designed for extreme reliability, DD1 uses quasi-delay-insensitive (QDI) asynchronous logic and contains full-custom radiation-hard memories and logic cells. The chip was found fully functional on first silicon over a range of operating voltages from near-threshold (500mV) to above the nominal V_DD (1.1V). It qualifies as both ultra-low power (<100μW/MHz) and radiation-hard by design. At 550mV the MCU operates at 1MIPS with a power consumption of 18μW/MIPS. At 1.1V it runs at 20MIPS consuming 75μW/MIPS (1.5mW total). After extensive testing, it was found to be total-dose and latch-up immune, with an upset immunity of 2E-6 SEE/device-day (CREME96 geosynchronous near-Earth orbit).
https://resolver.caltech.edu/CaltechAUTHORS:20160901-124814686