3rdparty/openpgm-svn-r1085/doc/draft-ietf-rmt-bb-pgmcc-03.txt


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226


Internet Engineering Task Force 				  RMT WG
INTERNET-DRAFT					     Luigi Rizzo/U. Pisa
draft-ietf-rmt-bb-pgmcc-03.txt		       Gianluca Iannaccone/Intel
						  Lorenzo Vicisano/Cisco
							Mark Handley/UCL
							    12 July 2004
						   Expires: January 2005


	    PGMCC single rate multicast congestion control:
			 Protocol Specification


Status of this Document

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups.  Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are valid for a maximum of six months and may be
updated, replaced, or obsoleted by other documents at any time.  It is
inappropriate to use Internet-Drafts as reference material or to cite
them other than as a "work in progress".

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

To view the list Internet-Draft Shadow Directories, see
http://www.ietf.org/shadow.html.

This document is a product of the IETF RMT WG.	Comments should be
addressed to the authors, or the WG's mailing list at rmt@lbl.gov.


				Abstract


     This document describes PGMCC, a single rate multicast
     congestion control scheme which is TCP-friendly and achieves
     scalability, stability and fast response to variations in
     network conditions.  PGMCC is suitable for both non-reliable


Rizzo/Iannaccone/Vicisano/Handley				[Page 1]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


     and reliable data transfers.  It is mainly designed for NAK-
     based multicast protocols, and uses a window-based, TCP-like
     control loop using positive ACKs between one representative of
     the receiver group (the ACKER) and the sender.  The ACKER is
     selected dynamically and may change over time.

     PGMCC is made of two components: a window-based control loop,
     which closely mimics TCP behavior, and a fast and low-overhead
     procedure to select (and track changes of) the ACKER.  The
     scheme is robust to measurement errors, and supports fast
     response to changes in the receiver set and/or network
     conditions.


Rizzo/Iannaccone/Vicisano/Handley				[Page 2]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


			   Table of Contents


     1. Introduction. . . . . . . . . . . . . . . . . . . . . .   4
      1.1. Terminology. . . . . . . . . . . . . . . . . . . . .   4
     2. Protocol Overview . . . . . . . . . . . . . . . . . . .   4
      2.1. Packet Contents. . . . . . . . . . . . . . . . . . .   6
       2.1.1. Data Packets. . . . . . . . . . . . . . . . . . .   6
       2.1.2. Feedback Packets. . . . . . . . . . . . . . . . .   7
       2.1.3. Field sizes and formats . . . . . . . . . . . . .   8
      2.2. Window-based controller. . . . . . . . . . . . . . .   9
      2.3. Acker Selection. . . . . . . . . . . . . . . . . . .  11
       2.3.1. Initial Acker election. . . . . . . . . . . . . .  11
       2.3.2. Acker dropouts. . . . . . . . . . . . . . . . . .  12
      2.4. TCP Throughput Equation. . . . . . . . . . . . . . .  12
      2.5. RTT measurement. . . . . . . . . . . . . . . . . . .  13
       2.5.1. Explicit Timestamp. . . . . . . . . . . . . . . .  13
       2.5.2. Implicit timestamp. . . . . . . . . . . . . . . .  13
       2.5.3. Sequence numbers. . . . . . . . . . . . . . . . .  14
       2.5.4. Recommendations . . . . . . . . . . . . . . . . .  15
      2.6. Loss rate measurement. . . . . . . . . . . . . . . .  15
      2.7. Timeouts . . . . . . . . . . . . . . . . . . . . . .  16
      2.8. Interaction with feedback suppression
      schemes . . . . . . . . . . . . . . . . . . . . . . . . .  16
      2.9. Interaction with ECN . . . . . . . . . . . . . . . .  17
     3. Procedures - Sender . . . . . . . . . . . . . . . . . .  17
     4. Procedures -- Receiver. . . . . . . . . . . . . . . . .  18
     5. Security Considerations . . . . . . . . . . . . . . . .  19
     6. Authors' Addresses. . . . . . . . . . . . . . . . . . .  20
     7. Acknowledgments . . . . . . . . . . . . . . . . . . . .  20
     8. Full Copyright Statement. . . . . . . . . . . . . . . .  21


Rizzo/Iannaccone/Vicisano/Handley				[Page 3]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


1.  Introduction

This document describes PGMCC, a single rate multicast congestion
control scheme which is TCP-friendly and achieves scalability, stability
and fast response to variations in network conditions.

PGMCC is designed for multicast sessions with one sender and one or more
receivers, and is a good match for transport protocols using negative
acknowledgements (NAKs) to collect feedback from the receivers.  The
congestion control scheme implemented by PGMCC closely mimics the
congestion-control behavior of TCP, as it uses a window-based control
loop which is run between the sender and a selected receiver called the
ACKER. The role of the ACKER is to provide timely feedback in the same
way as a TCP receiver; additionally, the ACKER is selected dynamically
among the receivers as the one which would experience the lowest
throughput if separate TCP sessions were run between the sender and each
of the receivers.

Scalability in PGMCC comes from the use of negative acknowledgements
(NAKs) for collecting feedback from receivers other than the ACKER.  As
a consequence, the usual techniques for NAK suppression and aggregation
can be used to reduce the amount of feedback to the source and improve
the scalability of the scheme.

PGMCC is designed to completely decouple congestion control from data
integrity. As a consequence, the scheme can work with both reliable data
transfer and unreliable communication protocols such as those used for
video or audio streaming.

While designed with multicast in mind, PGMCC can be equally used as a
replacement for TCP for unicast sessions which require a lower degree of
reliability than what TCP offers.


1.1.  Terminology

In this document, the key words "MUST", "MUST NOT", "REQUIRED", "SHALL",
"SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and
"OPTIONAL" are to be interpreted as described in RFC 2119 and indicate
requirement levels for compliant PGMCC implementations.


2.  Protocol Overview

PGMCC is based on two separate but complementary mechanisms:

  o A window-based control loop which closely emulates TCP congestion
    control.


Rizzo/Iannaccone/Vicisano/Handley		    Section 2.	[Page 4]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


    The window-based control loop is simply an adaptation of the TCP
    congestion control scheme to transport protocols where missing
    (because of network errors or congestion) data packets are not
    necessarily retransmitted, and so the congestion control scheme
    cannot rely on cumulative acknowledgements.  In PGMCC, the
    ``congestion window'' is simulated using a token-based scheme which
    permits congestion control to be decoupled from retransmission
    state. One of the receivers in the group operates as the ACKER, i.e.
    the node in charge of sending positive acknowledgements back to the
    source and thus controlling the rate of the transfer.


  o A procedure to select the ACKER.
    The purpose of this procedure is to make sure that, in presence of
    multiple receivers, the ACKER is dynamically selected to be the
    receiver which would have the lowest throughput if separate TCP
    sessions were run between the sender and each receiver.
    For the acker selection mechanism, PGMCC uses a throughput equation
    to determine the expected throughput for a given receiver as a
    function of the loss rate and round-trip time. Unlike other schemes
    [2], the TCP throughput equation is not used to determine the actual
    sending rate, which is completely controlled by the window-based
    control loop.


In principle, PGMCC's congestion control mechanism works as follows:


  o Receivers measure the loss rate and feed this information back to
    the sender, either in ACK or NAK messages.


  o The sender also uses these feedback messages to measure the round-
    trip time (RTT) to each receiver.


  o The loss rate and RTT are then fed into PGMCC's throughput equation,
    to determine the expected throughput to that receiver.


  o The sender then selects as the acker the receiver with the lowest
    expected throughput, as computed by the equation.

The dynamics of the acker selection mechanism are sensitive to how the
measurements are performed and applied. In the rest of this document we
suggest specific mechanisms to perform and apply these measurements.
Other mechanisms are possible, but it is important to understand how the
interactions between mechanisms affect the dynamics of PGMCC.


Rizzo/Iannaccone/Vicisano/Handley		    Section 2.	[Page 5]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


2.1.  Packet Contents

Before specifying the sender and receiver functionality, we describe the
information required by PGMCC to perform its tasks. This information is
carried in the data packets sent by the sender, and in the feedback
packets sent by the receiver.  As PGMCC will be used along with some
transport protocol, the actual data and feedback packets will contain
further information for use by the protocol itself. For this reason, we
do not specify packet formats, as these depend on the details of the
transport protocol used.

Note that the requirements of the transport protocol in terms of packet
generation may differ from those of PGMCC. As an example, most NAK-based
reliable multicast protocols do not use positive acknowledgements, but
PGMCC requires ACKs for clocking out data packets; unreliable transport
protocols might have no interest in generating NAKs for data integrity
purposes, yet PGMCC depends on NAKs reaching the data sender in order to
elect the ACKER.


Implementors may decide to insert PGMCC-related information in already
existing protocol packets whenever possible, but in cases such as the
ones described in the previous paragraph, it might be necessary to
define and generate new packets exclusively for congestion control
purposes. As an example, in a prototype implementation of PGMCC on top
of the PGM protocol [7], some of the information used by PGMCC is
already present in the original protocol packets, and PGMCC-specific
information is carried as PGM options in ODATA and NAK packets. However,
a new packet type has been defined for ACKs, which are generated
according to the rules defined in this document.


2.1.1.	Data Packets

Each data packet sent by the data sender contains the following
information:


  o A SEQUENCE NUMBER. This number is incremented by one for each data
    packet transmitted.  The field must be sufficiently large that it
    does not wrap causing two different packets with the same sequence
    number to be in the receiver's recent packet history at the same
    time.


  o A TIMESTAMP (or equivalent information, see Section 2.5) indicating
    when the packet with this sequence number has been sent.  There is
    no requirement for synchronized clocks between the sender and the


Rizzo/Iannaccone/Vicisano/Handley		Section 2.1.1.	[Page 6]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


    receivers.	The timestamp is used to measure network round-trip
    times, so needs sufficient resolution for this task.  A resolution
    of 1ms would be adequate.


  o The ACKER IDENTITY, i.e. the identity of the receiver in charge of
    sending an acknowledgement for this data packet.  The ACKER is
    elected as a result of the process described in Section 2.3.
    A special value is used to indicate that no ACKER is designated for
    this packet -- this can happen at the beginning of a session or when
    the current ACKER leaves the group.  Receivers interpret this value
    as a request to elect a new acker.


2.1.2.	Feedback Packets

There are two types of feedback packets used by PGMCC: ACK packets and
NAK packets.
ACK packets are generated by the current ACKER, and are used to detect
loss or successful delivery of packets, and to regulate the throughput
accordingly. ACK packets also contain information used to determine the
TCP-equivalent throughput for the ACKER.
NAK packets are sent by any receiver who experiences loss.  They contain
information used to determine the TCP-equivalent throughput for that
receiver. In an actual protocol instantiation (such as PGM [7]), NAK
packets might also be used by the protocol to request the retransmission
of specific packets, and indicate the identity of the packet being
requested.

Both ACK and NAK packets are sent by data receivers, and contain the
following information:


  o The TIMESTAMP (or equivalent information) derived from the most
    recently received data packet according to one of the techniques
    described in Section 2.5.
    This value is used by the sender to measure the RTT to the receiver
    who generated this feedback packet.


  o ``p'', the receiver's current estimate of the LOSS RATE.  The loss
    rate is measured by receivers as described in Section 2.6

In addition to the above, ACK packets (sent by the acker designated in
the corresponding data packets) must also contain the following
information:


Rizzo/Iannaccone/Vicisano/Handley		Section 2.1.2.	[Page 7]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


  o RX_MAX, the highest sequence number among received data packets
    (taking care to deal with sequence number wrapping correctly).

  o ACK_BITMAP, a bitmap indicating the receive status of the latest N
    (typically N=32) data packets with sequence numbers RX_MAX-(N-1) to
    RX_MAX.


This information is used by the sender to record which packets have been
received or lost, and manipulate the transmit window accordingly.  Note
that each ACK packet contains information about multiple packets, and
this increases the robustness of the scheme to loss of ACK packets.
This is necessary because ACKs are not sent reliably (unlike TCP's ACKs,
which are cumulative).


2.1.3.	Field sizes and formats

The following sizes and formats are suggested for the various fields
used by PGMCC and transmitted over the network:


  o SEQUENCE NUMBERS
    32 bit, unsigned, network order.


  o TIMESTAMPS
    32 bit, unsigned, network order.  A resolution of 1ms or better is
    desirable.


  o ACKER IDENTITY
    Same size and format of a network layer address (e.g. 32 bit for
    IPv4).  Note though that using an IP address for the Acker Identify
    will cause problems with NAT traversal.  Transport protocol
    designers might examine the SSRC mechanism used by RTP [6] as an
    alternative form of node identifier that could be used as Acker
    Identity.


  o LOSS RATE (``p'')
    16-bit unsigned integer, in network format, with 0 indicating no
    loss and 2^16-1 indicating 100% loss.


  o ACK BITMAP
    32-bit, in network format, with least significant bit indicating
    receive status of packet RX_MAX.


Rizzo/Iannaccone/Vicisano/Handley		Section 2.1.3.	[Page 8]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


2.2.  Window-based controller

In a window-based congestion control scheme such as TCP, the
``congestion window'' represents, among other things, the maximum amount
of packets in flight at any time, which in turn controls the throughput
of the session. The sender keeps track of the actual number of packets
in flight, basing on its transmissions and the reception of
acknowledgements.

The sender may dynamically change the size of the window, according to
the congestion control scheme being used. In TCP, and PGMCC, an
``Additive Increase Multiplicative Decrease'' (AIMD) scheme is used: in
absence of loss, the window is increased by some fixed amount (typically
one packet) per round trip time (RTT), whereas upon loss the window is
reduced to a fraction of its original value (typically halved) in each
RTT in which a loss event is experienced.

In PGMCC the window is managed using a token-based mechanism, controlled
by two variables:

  o A ``Window Size'', W, which describes the current window size in
    packets.

  o A ``Token Count'', T, which indicates the number of packets that can
    be transmitted.  T is bounded above by W.  It is decremented every
    time a packet is transmitted, and incremented every time an ACK is
    received, according to the rules below.

Note that these two variables need to hold non-integer data.  Typically
a fixed point representation with at least 16 bits for both integer and
fractional parts would be acceptable for implementation purposes.

The information contained in each ACK is used to determine how many new
packets are acknowledged by that ACK, and whether there are
unacknowledged packets which were not reported in previous ACKs.  The
sender also schedules a timeout to react in case no ACKs are received.

The sender behaves as follows:


  o INITIALIZATION
    At startup, or after a timeout, both W and T are set to 1.


  o ACK RECEPTION, NO LOSS DETECTED
    If the incoming ACK reports new acknowledged packets, and no loss
    (as defined in the next paragraph) is detected, then the window is
    inflated by one packet per RTT.


Rizzo/Iannaccone/Vicisano/Handley		  Section 2.2.	[Page 9]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


    NOTE: during the slow-start phase, TCP opens the window
    exponentially up to the SSTHRESH value, which is computed by TCP
    according to the dynamics of the session and updated upon losses.

    We do recommend that PGMCC uses a similar strategy, but using a
    fixed, small value for SSTHRESH (e.g. 4 packets).  In fact, due to
    the dynamicity of the ACKER, which might change on every single
    packet, it is hard to compute a reliable estimate of the SSTHRESH
    without keeping state for multiple receivers, and the benefits are
    small in any event.

    In summary, the reaction to ACK reception on no loss modifies T and
    W as follows (here, N is the number of new packets acknowledged by
    the incoming ACK):

      if (W < SSTHRESH) then
	  D = min(N, SSTHRESH - W) // use the first D acks for
    exp.opening
	  N = N - D		       // and the remaining ones for
    linear opening
	  T = T + 2*D
	  W = W + D
      endif
      // do linear window opening with the remaining acks
      T = T + N * ( 1 + 1/W )
      W = W + N/W


  o PACKET TRANSMISSION
    One token is consumed for each packet transmitted:

				 T = T - 1


  o ACK RECEPTION, LOSS DETECTED
    If the incoming ACK reports an unacknowledged data packet which is
    followed by at least 3 acknowledged data packets, then the packet is
    assumed to be lost and PGMCC reacts by halving the window, in the
    same way as TCP after 3 duplicate acknowledgements.  This is
    achieved by modifying T and W as follows:

			   T = T - W/2 , W = W/2

    to simulate the multiplicative decrease.
    Additionally, all window manipulation is suspended for the
    subsequent RTT.  This is achieved by recording the current transmit
    sequence number, and canceling any further manipulation of the
    window until feedback is received for the next transmitted packet,


Rizzo/Iannaccone/Vicisano/Handley		 Section 2.2.  [Page 10]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


    or until a timeout occurs.


2.3.  Acker Selection

The ACKER selection process in PGMCC aims at locating the receiver which
would have the lowest throughput if each receiver were using a separate
TCP connection to transfer data.

Because the steady-state throughput of a TCP connection can be
characterized in a reasonably accurate way in terms of its loss rate and
round trip time [3], the throughput for each receiver can be estimated
by using these two parameters.

Whenever an ACK or NAK packet from any of the receivers reaches it, the
sender is able to compute the expected throughput T_i for that receiver
by using the equation shown in Section 2.4, with the round trip time RTT
and loss rate p and measured as described in Sections 2.5 and 2.6,
respectively.  At any given time, the sender stores the expected
throughput for the current ACKER, T_acker. This value is updated every
time an ACK or NAK from the current ACKER is received (note that, after
a new ACKER is selected, the sender will typically receive ACKs from the
old ACKER for one RTT, and the feedback from different ACKERs might be
interleaved if the paths leading to them have different round trip
times).

Whenever an ACK or NAK is received from another node i (a previous ACKER
or some other receiver), the expected throughput T_i for that node is
computed, and compared with T_acker.  Node i is selected as the new
acker if

			   T_i < C * T_acker

where the constant C between 0 and 1 provides some hysteresis and avoids
too frequent oscillations in the choice of the ACKER.  A suggested value
for C is 0.75.

Note that, from an implementation point of view (see Section 2.4), it is
more convenient to compute T_i ^(-2), so the above equation must be
modified accordingly.


2.3.1.	Initial Acker election

Upon reception of a data packet reporting that no acker is currently
selected, receivers generate a dummy NAK report which is used to elect


Rizzo/Iannaccone/Vicisano/Handley	       Section 2.3.1.  [Page 11]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


the initial acker. The NAK is sent with the usual feedback suppression
mechanism dictated by the transport protocol (possibly with shorter time
constants) to avoid feedback implosion, and the sender will select the
source of the first incoming NAK as the new ACKER.


2.3.2.	Acker dropouts


If the ACKER decides to disconnect from the session, it can cause the
session to stop. To avoid this, it is recommended that an ACKER deciding
to leave the session informs the sender by sending an ACK packet (or a
duplicate) carrying an "ACKER_LEAVING" option.	The reception of this
packet by the sender will in turn trigger an initial acker election
phase.


2.4.  TCP Throughput Equation

Any realistic equation of TCP throughput as a function of loss event
rate and RTT should be suitable for use in PGMCC.  Unlike other schemes
[2] where the throughput equation directly controls the transmit rate,
in PGMCC the equation is used only for acker selection purposes, and the
throughput values are only compared among themselves.  As a consequence,
we can use the following equation, derived from the one presented in [3]
by setting RTO = 4 * RTT (as it is common practice):

	 M = 1/T = RTT_i * sqrt(p) * (1 + 9*p * (1 + 32*(p)^2))


where

    M = 1/T is proportional to the inverse of the throughput for the
    receiver under consideration;

    RTT is the round trip time for the receiver under consideration;

    p is the loss rate for the receiver under consideration, between 0
    and 1.0;

and multiplying constants are omitted.

The above equation is accurate on a wide range of loss rates, and also
covers situations where retransmission timeouts have a significant
impact on the throughput of the protocol.

Note that when p=0, the equation yields 1/T = M = 0. This does not


Rizzo/Iannaccone/Vicisano/Handley		 Section 2.4.  [Page 12]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


constitute a problem as we can still compare the M values computed for
different receivers to determine the acker.  Also note that it is easier
to compute M^2 instead of M, because the former does not require the use
of sqrt().

In future, different throughput equations may be substituted for this
equation.  The requirement is that the throughput equation be a
reasonable approximation of the sending rate of TCP for conformant TCP
congestion control.

The parameters p and RTT need to be measured or calculated by a PGMCC
implementation.  The measurement of RTT is specified in Section 2.5; the
measurement of p is specified in Section 2.6.

2.5.  RTT measurement

In PGMCC, the RTT is measured by the sender making use of the timestamp
(or equivalent information) echoed back by each receiver in feedback
messages. Three procedures are possible to measure the RTT, as follows.
In no case is it required to have clock synchronization between sender
and receivers.


2.5.1.	Explicit Timestamp

This first technique relies on the transmission of a timestamp TS_j with
each data packet j.
The receiver will record the most recently received timestamp, and will
echo it back to the source when generating an ACK or a NAK. If the
feedback is delayed, the time elapsed between the reception of the
timestamp and the generation of the feedback should be added to the
echoed timestamp.
The sender computes the RTT by subtracting the received timestamp from
the current value of the clock.

The resolution of the timestamp value should be good enough for
reasonable precision measurement of typical network round trip times. If
receivers need to apply correction for delayed feedback, it is necessary
that receivers know the resolution of the timestamp clock. A suggested
value is 1ms.


2.5.2.	Implicit timestamp

With this technique, the sender will record a timestamp TS_j for each
transmitted data packet j, but the timestamp will not be transmitted
with the packet itself.
The receiver will record the most recently received sequence number, and


Rizzo/Iannaccone/Vicisano/Handley	       Section 2.5.2.  [Page 13]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


will echo it back to the source when generating an ACK or a NAK.
The sender computes the RTT by looking up the timestamp associated with
the sequence number received in the feedback packet, and subtracting it
from the current clock value.

If the feedback from the receiver is delayed, as it is commonly the case
for NAKs, the receiver can compute, and send back to the source, a
correction term corresponding to the time elapsed between the reception
of the timestamp and the generation of the feedback. The correction term
will then be subtracted by the sender in order to obtain the correct
estimate of the RTT.

This RTT measurement technique is equivalent to the previous one, but it
saves some space in data packets as the timestamp does not need to be
sent explicitly. Feedback packets might become larger if the correction
value is transmitted explicitly; but in many cases, the sequence number
will already be present for other reasons (e.g. ACK packets), and
wherever space is a concern the sequence number and the correction term
can be packed in a single 32-bit word without loss of precision.


2.5.3.	Sequence numbers

This technique is the least precise, but it does not rely on the
presence of a high resolution clock on the nodes.
The sender will not compute any timestamp, and just send data packets
with their sequence number j.
The receiver will record the most recently received sequence number, and
will echo it back to the source when generating an ACK or a NAK.
The sender computes the RTT as the difference between the most recently
sent sequence number and the sequence number received from the ACK or
NAK packet.

Note that in this case the RTT is not measured in seconds, but in
"sequence numbers", which are monotonically, but not uniformly,
increasing with time. The two measurements are equivalent if the sender
transmits at a constant rate. When the data rate changes over time (as
it is normally happens, given that PGMCC controls the actual data rate),
then the "measured" RTT values grow with the actual transmit rate. This
can influence the correctness of the results when comparing two
measurement done over different and only partially overlapping time (and
sequence number) intervals where the transmit rate incurs a significant
change.


Rizzo/Iannaccone/Vicisano/Handley	       Section 2.5.3.  [Page 14]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


2.5.4.	Recommendations

Whenever possible, the measurement of the RTT should be carried out
using either explicit or implicit timestamps, and by keeping track of
the "correction term" (the delay between data reception and feedback
generation).

If the receiver does not have a clock with a suitable resolution, the
correction term might not be present (or be inaccurate). In this case
the timestamps received by the sender on NAK packets might be in error,
in the worst case, by as much as the packet interarrival time.	This
error will normally not be present on ACK packets, which are sent
immediately.  A suitable correction should be applied by the sender in
order to avoid systematic errors.

The measurement based on sequence numbers is less accurate, but also
less sensitive to errors due to the lack of the correction term. In
fact, the measurement error induced by the lack of the correction term
can be at most one unit.  This suggests that, when the correction term
is not available, measurements based on sequence numbers should be
favoured.  Simulations have shown that the acker selection mechanism
performs moderately better when the RTT measurement is based on
timestamps, but performance is reasonably good also with measurements
based on sequence numbers.


2.6.  Loss rate measurement


The loss measurement in PGMCC is entirely performed by receivers.  The
measurement results do not directly influence the transmit rate, but are
only used for comparison purposes. As a consequence, the scheme is
reasonably robust to different measurement techniques, as long as they
are not influenced too strongly by single loss events.

The main method suggested for loss measurement is Exponentially Weighted
Moving Average (EWMA), which is formally equivalent to a single-pole
digital low pass filter applied to a binary signal x_i, where x_i = 1 if
packet i is lost, x_i = 0 if packet i is successfully received.

The loss rate p_i upon reception or detection of loss of packet i is
computed as


		 p_i = c_p * p_{i-1} + (1 - c_p ) * p_i
 where the constant c_p between 0 and 1 is related to the bandpass of
the filter. Experiments have shown good performance with c = 500/65536,


Rizzo/Iannaccone/Vicisano/Handley		 Section 2.6.  [Page 15]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


and computations performed with fixed point arithmetic and 16 fractional
digits.

As an alternative to EWMA, the technique used in TFRC [2] can be
adopted. Simulations have shown a moderate improvement in the acker
selection mechanism by measuring loss using the TFRC loss estimator,
which is however slightly more expensive to compute than the EWMA loss
estimator in presence of packet reordering.


2.7.  Timeouts


When a packet is transmitted, the sender schedules a timeout to prevent
stalls upon loss of ACKs or disconnection of the ACKER.  In TCP, which
has a similar problem, the timeout value is computed by accumulating
statistics (SRTT and RTTVAR) on RTT samples, starting from a default
initial value (3s) when no RTT samples are available.

PGMCC can use a similar scheme to compute the timeouts, remembering that
upon ACKER changes (which may be very frequent), the computation of SRTT
and RTTVAR must be restarted from the beginning, unless the sender
decides to keep state for at least a small number of recent ackers.

Because the ACKER can leave the group without notifying the sender,
after a number of successive timeouts the sender MUST force the election
of a new ACKER.  We recommend this new election to be performed after
two successive timeouts.


2.8.  Interaction with feedback suppression schemes


Several schemes are used by NAK-based multicast protocols to reduce the
amount of feedback directed toward the source and make the protocol
scale with large populations of receivers.  Such schemes typically rely
on randomly delaying NAK generation, and suppressing pending NAKs when
an equivalent NAK or a retransmission is heard; or, intermediate nodes
such as routers can implement some form of feedback aggregation and
filtering.

Such schemes might prevent NAKs from potential ACKER candidates from
reaching the source. This filtering might impact the speed at which
PGMCC selects the correct ACKER, though initial experience from
simulations seem to suggest that PGMCC behavior is not severely affected
by NAK suppression schemes.


Rizzo/Iannaccone/Vicisano/Handley		 Section 2.8.  [Page 16]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


2.9.  Interaction with ECN


PGMCC can use ECN notifications in much the same way as actual losses,
and use such notifications to control the throughput of the session.

At the receiver, ECN-marked data packets can be considered as lost
packets for the purpose of loss rate computation and ACK/NAK generation.
If the ACKER sends an ACK for ECN-marked packets, that ACK MUST report
that the packet being acknowledged that was ECN marked.  Similarly the
ACKER must indicate in the ACK packet's received packets bitmap that the
packet was ECN-marked, or that the packet was lost.

We note that to support use of the ECN nonce, the ACK packet's received
packets bitmap would require two bits per packet being reported.


3.  Procedures - Sender

The following pseudo-code specifies the complete behavior of the sender
in PGMCC.


initialization:
  T = 1 ; W = 1 ;   /* initialize window and number of tokens */
  RETRY = 0 ;	    /* number of consecutive timeouts so far  */
  < initialize p, RTT for acker to default values >
  ACKER = NO_ACKER; /* no acker is known		      */
  < initialize sequence numbers >
  QUEUED = 0;	    /* packets waiting to be transmitted      */

on transmission request:
  send_packet() ;

on timeout expiration :
  T = 1 ; W = 1 ;   /* initialize window and number of tokens */
  if (RETRY < RETRY_MAX)
    RETRY = RETRY + 1
  else
    ACKER = NO_ACKER ; /* old acker is not valid anymore      */
  send_packet() ;


Rizzo/Iannaccone/Vicisano/Handley		   Section 3.  [Page 17]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


on ACK/NAK reception from receiver I :
  < compute p and RTT for source of this ACK, see Sec. 2.5 and 2.6 >
  RETRY = 0 ;
  if (ACKER == NO_ACKER) { /* select current as acker is no other known */
    ACKER = I ;
    T = T + 1 ;
  }
  if (ACKER != I)
    < select acker according to Sec. 2.3 > ;
  else {
    <update acker statistics (p_ACKER, RTT_ACKER) >
    if (packet_type == ACK) {
      < update_window according to Sec.2.2 >
      send_packet ;
      if (ack_pending)
	update_timeout ;
  }
}

send_packet() {
  if (QUEUED > 0 && T >= 1) {
    < transmit one packet >
    T = T - 1 ;
    QUEUED = QUEUED - 1 ;
  }
  if ( <there are unacked packet> )
    <set a timeout, see Sec.2.7 >
}


4.  Procedures -- Receiver

The following pseudo-code specifies the complete behavior of the
receiver in PGMCC.

A receiver only transmits an ACK packet when it receives a data packet
for which the receiver is designated as the ACKER by the data packet
itself.  A receiver can transmit a NAK packet after it has detected that
a data packet is missing and a suitable delay has passed, as dictated by
the feedback suppression rules of the protocol in use.

The data packet contains acknowledgement status about the most recent 32
sequence numbers known to the receiver.


Rizzo/Iannaccone/Vicisano/Handley		   Section 4.  [Page 18]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


on initialization/session setup:
  < initialize state variables and ACK bitmap >

on DATA packet reception:
  < update p measurement according to Sec.2.6 >
  < record timestamp and packet reception time >
  if (ACKER == this_node) {
    < send an immediate ACK >
  }
  if ( <some data packet is missing and unacknowledged > )
    < schedule a timeout for NAK transmission >

on NAK reception:
  < suppress any pending NAK transmission for the sequence
    number indicated in the NAK >

on timeout:
  if ( < there are missing and unacknowledged packets > ) {
    < send a NAK for one or more of the missing packets >
    < mark such packets as acknowledged >
    if ( <some data packet is missing and unacknowledged > )
      < schedule a timeout for NAK transmission >
  }


5.  Security Considerations

PGMCC is not a transport protocol in its own right, but a congestion
control mechanism that is intended to be used in conjunction with a
transport protocol.  Therefore security primarily needs to be considered
in the context of a specific transport protocol and its authentication
mechanisms.

Congestion control mechanisms can potentially be exploited to create
denial of service.  This may occur through spoofed feedback.  Thus any
transport protocol that uses PGMCC should take care to ensure that
feedback is only accepted from the receiver of the data.  The precise
mechanism to achieve this will however depend on the transport protocol
itself.

In addition, congestion control mechanisms may potentially be
manipulated by a greedy receiver that wishes to receive more than its
fair share of network bandwidth.  A receiver might do this by first
reporting inflated loss and RTT samples, in order to get selected as the
ACKER, and then generating ACK at the desired rate (including possibly
claiming to have received packets that in fact were lost due to
congestion).  Possible defenses against such a receiver could be based
on the sender verifying the correctness of the loss and RTT samples


Rizzo/Iannaccone/Vicisano/Handley		   Section 5.  [Page 19]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


supplied by the receiver. A PGMCC sender SHOULD compare the receiver
reports on loss rate and RTT with the information derived directly from
the incoming stream of ACKs.  In case of discrepancy of the reports, a
PGMCC sender SHOULD mark the current acker as ineligible and initiate a
new acker election. The decision on how large that discrepancy should be
before initiating a new acker election is left to the implementation.

Also, the sender MAY include some form of nonce that the receiver must
feed back to the sender to prove receipt. However, the details of such a
nonce would depend on the transport protocol, and in particular on
whether the transport protocol is reliable or unreliable.


6.  Authors' Addresses

   Luigi Rizzo
   luigi@iet.unipi.it
   Dip. Ing. dell'Informazione,
   Univ. di Pisa
   via Diotisalvi 2, 56122 Pisa, Italy

   Gianluca Iannaccone
   gianluca.iannaccone@intel.com
   Intel Research
   15 JJ Thomson Avenue, Cambridge CB3 0FD, UK

   Lorenzo Vicisano
   lorenzo@cisco.com
   cisco Systems, Inc.
   170 West Tasman Dr.,
   San Jose, CA, USA, 95134

   Mark Handley
   m.handley@cs.ucl.ac.uk
   University College London,
   Gower Street, London WC1E 6BT, UK


7.  Acknowledgments

We would like to acknowledge feedback and discussions on equation-based
congestion control with a wide range of people, including members of the
Reliable Multicast Research Group, the Reliable Multicast Transport
Working Group, and the End-to-End Research Group.


Rizzo/Iannaccone/Vicisano/Handley		   Section 7.  [Page 20]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


[1] Bradner, S., Key words for use in RFCs to Indicate Requirement
Levels (IETF RFC 2119) http://www.rfc-editor.org/rfc/rfc2119.txt

[2] Floyd, S., Handley, M., Padhye, J., Widmer, J., "Equation-Based
Congestion Control for Unicast Applications", ACM SIGCOMM 2000,
Stockholm, Aug. 2000

[3] Padhye, J. and  Firoiu, V. and Towsley, D. and Kurose, J., "Modeling
TCP Throughput: A Simple Model and its Empirical Validation", Proc ACM
SIGCOMM 1998.

[4] Mankin, A., Romanow, A., Brander, S., Paxson, V., "IETF Criteria for
Evaluating Reliable Multicast Transport and Application Protocols,"
RFC2357, June 1998.

[5] Rizzo, L., "pgmcc: a TCP-friendly single-rate multicast congestion
control scheme", ACM SIGCOMM 2000, Stockholm, Aug.2000

[6] Schulzrinne, H., Casner, S., Frederick, R., Jacobson, V., "RTP: A
Transport Protocol for Real-Time Applications", RFC 1889, Jan 1996.

[7] Speakman, T., Crowcroft, J., Gemmell, J., Farinacci, D. , Lin, S.,
Leshchiner, D., Luby, M., Montgomery, T. , Rizzo, L., Tweedly, A.,
Bhaskar, N., Edmonstone, R., Sumanasekera, R., Vicisano, L., PGM
Reliable Transport Protocol Specification, RFC 3208, December 2001.
rfc3208.txt also available at ftp://ftp.rfc-editor.org/in-
notes/rfc3208.txt


8.  Full Copyright Statement

Copyright (C) The Internet Society (2000).  All Rights Reserved.

This document and translations of it may be copied and furnished to
others, and derivative works that comment on or otherwise explain it or
assist in its implementation may be prepared, copied, published and
distributed, in whole or in part, without restriction of any kind,
provided that the above copyright notice and this paragraph are included
on all such copies and derivative works. However, this document itself
may not be modified in any way, such as by removing the copyright notice
or references to the Internet Society or other Internet organizations,
except as needed for the purpose of developing Internet standards in
which case the procedures for copyrights defined in the Internet
languages other than English.


Rizzo/Iannaccone/Vicisano/Handley		   Section 8.  [Page 21]

INTERNET-DRAFT		  Expires: January 2005 	       July 2004


The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS
IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK
FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT
LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT
INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR
FITNESS FOR A PARTICULAR PURPOSE."


Rizzo/Iannaccone/Vicisano/Handley		   Section 8.  [Page 22]