Reference counting
|
In computer science, reference counting is a technique of storing the number of references, pointers or handles to a resource such as an object or block of memory.
Contents |
Use in garbage collection
Reference counting is often known as a garbage collection algorithm where each object contains a count of the number of references to it held by other objects. If an object's reference count reaches zero, the object has become inaccessible, and it is destroyed.
Simple reference counts require frequent updates. Whenever a reference is destroyed or overwritten, the reference count of the object it references is decremented, and whenever one is created or copied, the reference count of the object it references is incremented.
Advantages and disadvantages
The main advantage of reference counting over tracing is that objects are reclaimed as soon as they can no longer be referenced, and in an incremental fashion, without long pauses for collection cycles and with clearly defined lifetime of every object. In real-time applications or systems with limited memory, this is important to maintain responsiveness. Reference counting is also among the simplest forms of garbage collection to implement. Weighted reference counts are a good solution for garbage collecting a distributed system.
Reference counting in naïve form has two main disadvantages over the more popular tracing garbage collection, both of which require additional mechanisms to ameliorate.
One is that the frequent updates it involves are a source of inefficency. While tracing garbage collectors can impact efficiency severely via context switching and cache line faults, they collect relatively infrequently, while accessing objects is done continually. Also, less importantly, reference counting requires every memory-managed object to reserve space for a reference count. In tracing garbage collectors, this information is stored implicitly in the references that refer to that object, saving space, although tracing garbage collectors, particularly incremental ones, can require additional space for other purposes.
The other disadvantage is that the naïve algorithm described above can't handle reference cycles, an object which refers directly or indirectly to itself. A mechanism relying purely on reference counts will never consider cyclic chains of objects for deletion, since their reference count is guaranteed to stay nonzero. Methods for repairing exist but can also increase the overhead and complexity of reference counting — on the other hand, these methods need only be applied to data that might form cycles, often a small subset of all data.
Graph interpretation
When dealing with garbage collection schemes, it's often helpful to think of the reference graph, which is a directed graph where the vertices are objects and there is an edge from an object A to an object B if A holds a reference to B. We also have a special vertex or vertices representing the local variables and references held by the runtime system, and no edges ever go to these nodes, although edges can go from them to other nodes.
In this context, the simple reference count of an object is the in degree of its vertex. Deleting a vertex is like collecting an object. It can only be done when the vertex has no incoming edges, so it does not affect the out degree of any other vertices, but it can affect the in degree of other vertices, causing their corresponding objects to be collected as well.
In the graph interpretation, reference cycles are cycles. The connected component containing the special vertex contains the objects that can't be collected, while other connected components of the graph only contain garbage. By the nature of reference counting, each of these garbage components must contain at least one cycle.
Dealing with inefficiency of updates
Incrementing and decrementing reference counts every time a reference is created or destroyed can significantly impede performance. Not only do the operations take time, but they damage cache performance and can lead to pipeline bubbles. Even read-only operations like calculating the length of a list require a large number of reads and writes for reference updates with naïve reference counting.
One simple technique is for the compiler to combine a number of nearby reference updates into one. This is especially effective for references which are created and quickly destroyed. Care must be taken, however, to put the combined update at the right position so that a premature free is avoided.
The Deustch-Bobrow method of reference counting capitalizes on the fact that most reference count updates are in fact generated by references stored in local variables. It ignores these references, only counting references in data structures, but before an object with reference count zero can be deleted, the system must verify with a scan of the stack and registers that no other reference to it still exists.
Another technique devised by Henry Baker involves deferred increments, in which references which are stored in local variables do not immediately increment the corresponding reference count, but instead defer this until it is necessary. If such a reference is destroyed quickly, then there is no need to update the counter. This eliminates a large number of updates associated with short-lived references. However, if such a reference is copied into a data structure, then the deferred increment must be performed at that time. It is also critical to perform the deferred increment before the object's count drops to zero, resulting in a premature free. See the paper (http://citeseer.nj.nec.com/baker94minimizing.html) for more.
A dramatic decrease in the overhead on counter updates was obtained by Levanoni (http://www.cs.technion.ac.il/~levanoni/) and Petrank (http://www.cs.technion.ac.il/~erez/). They showed how to eliminate more than 99% of the counter updates for typical Java benchmarks. Furthermore, they present an enhanced algorithm that may run concurrently with multithreaded applications employing only fine synchronization. See the paper (http://www.cs.technion.ac.il/%7Eerez/Papers/refcount.pdf) for more.
Dealing with reference cycles
There are a variety of ways of handling the problem of collecting reference cycles. One is that a system may explicitly forbid reference cycles. In some systems like filesystems this is a common solution. Cycles are also sometimes ignored in systems with short lives and a small amount of cyclic garbage, particularly when the system was developed using a methodology of avoiding cyclic data structures wherever possible, typically at the expense of efficiency.
Another solution is to periodically use a tracing garbage collector to reclaim cycles. Since cycles typically constitute a relatively small amount of reclaimed space, the collection cycles can be spaced much farther apart than with an ordinary tracing garbage collector.
Bacon describes a cycle-collection algorithm for reference counting systems with some similarities to tracing systems, including the same theoretical time bounds, but that takes advantage of reference count information to run much more quickly and with less cache damage. It's based on the observation that an object cannot appear in a cycle until its reference count is decremented to a nonzero value. All objects which this occurs to are put on a roots list, and then periodically the program searches through the objects reachable from the roots for cycles. It knows it has found a cycle when decrementing all the reference counts on a cycle of references brings them all down to zero. An enhanced version of this algorithm is able to run concurrently with other operations. See the paper (http://www.research.ibm.com/people/d/dfb/papers/Bacon01Concurrent.pdf) for more.
Variants of reference counting
Although it's possible to augment simple reference counts in a variety of ways, often a better solution can be found by performing reference counting in a fundamentally different way. Here we describe some of the variants on reference counting and their benefits and drawbacks.
Weighted reference counting
In weighted reference counting, we assign each reference a weight, and each object tracks not the number of references referring to it, but the total weight of the references referring to it. The initial reference to a newly-created object has a large weight, such as 216. Whenever this reference is copied, half of the weight goes to the new reference, and half of the weight stays with the old reference. Because the total weight does not change, the object's reference count does not need to be updated.
Destroying a reference decrements the total weight by the weight of that reference. When the total reaches zero, all references have been destroyed. If an attempt is made to copy a reference with a weight of 1, we have to "get more weight" by adding to the total weight and then adding this new weight to our reference, and then split it.
The property of not needing to access a reference count when a reference is copied is particularly helpful when the object's reference count is expensive to access, for example because it is in another process, on disk, or even across a network. It can also help increase concurrency by avoiding many threads locking a reference count to increase it. Thus, weighted reference counting is most useful in parallel, multiprocess, database, or distributed applications.
The primary problem with simple weighted reference counting is that destroying a reference still requires accessing the reference count, and if many references are destroyed this can cause the same bottlenecks we seek to avoid. Some adaptations of weighted reference counting seek to avoid this by attempting to give weight back from a dying reference to one which is still active.
Weighted reference counting was independently devised by Bevan, in the paper Distributed garbage collection using reference counting, and Watson, in the paper An efficient garbage collection scheme for parallel computer architectures, both in 1987.
Examples of use
Delphi
A language that uses reference counting for garbage collection is Delphi. Delphi is not a completely garbage collected language, in that user-defined types must still be manually allocated and deallocated. It does provide automatic collection, however, for a few built-in types, such as strings, dynamic arrays, and interfaces, for ease of use and to simplify the generic database functionality.
Some of the reasons reference counting may have been preferred to other forms of garbage collection in Delphi include:
- The general benefits of reference counting, such as prompt collection.
- Cycles either cannot occur or do not occur in practice using only the small set of garbage-collected built-in types.
- The overhead in code size required for reference counting is very small, and no separate thread of control is needed for collection as would be needed for a tracing garbage collector.
- Many instances of the most commonly used garbage-collected type, the string, have a short lifetime, since they are typically intermediate values in string manipulation.
- Because garbage-collection is only done on built-in types, reference counting can be efficiently integrated into the library routines used to manipulate each datatype, keeping the overhead needed for updating of reference counts low.
References
- This article was originally based on material from the Free On-line Dictionary of Computing, which is licensed under the GFDL.
External links
- The Memory Manager Reference: Beginner's Guide: Recycling: Reference Counts (http://www.memorymanagement.org/articles/recycle.html#reference)
- Minimizing Reference Count Updating with Deferred and Anchored Pointers for Functional Data Structures, Henry G. Baker (http://citeseer.nj.nec.com/baker94minimizing.html)
- Concurrent Cycle Collection in Reference Counted Systems, David F. Bacon (http://www.research.ibm.com/people/d/dfb/papers/Bacon01Concurrent.pdf)
- An On-the-Fly Reference-Counting Garbage Collector for Java, Yossi Levanoni and Erez Petrank (http://www.cs.technion.ac.il/%7Eerez/Papers/refcount.pdf)