summaryrefslogtreecommitdiffstats
path: root/fs/ceph/messenger.c
Commit message (Collapse)AuthorAgeFilesLines
* ceph: close out mds, osd connections before stopping authSage Weil2010-05-291-0/+6
| | | | | | | | | The auth module (part of the mon_client) is needed to free any ceph_authorizer(s) used by the mds and osd connections. Flush the msgr workqueue before stopping monc to ensure that the destroy_authorizer auth op is available when those connections are closed out. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: all allocation functions should get gfp_maskYehuda Sadeh2010-05-181-5/+5
| | | | | | | | | This is essential, as for the rados block device we'll need to run in different contexts that would need flags that are other than GFP_NOFS. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: save peer feature bits in connection structureSage Weil2010-05-181-0/+1
| | | | | | | These are used for adjusting behavior, such as conditionally encoding a newer message format. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: resync headers with userlandSage Weil2010-05-181-12/+0Star
| | | | | | | Notable changes include pool op defines and types, FLOCK feature bit, and new CMPXATTR osd ops. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: drop src address(es) from message header [new protocol feature]Sage Weil2010-05-181-8/+6Star
| | | | | | | | | | The CEPH_FEATURE_NOSRCADDR protocol feature avoids putting the full source address in each message header (twice). This patch switches the client to the new scheme, and _requires_ this feature on the server. The server will support both the old and new schemes. That means an old client will work with a new server, but a new client will not work with an old server. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: close messenger raceSage Weil2010-05-181-7/+7
| | | | | | | Simplify messenger locking, and close race between ceph_con_close() setting the CLOSED bit and con_work() checking the bit, then taking the mutex. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: clean up connection resetSage Weil2010-05-181-0/+2
| | | | | | Reset out_keepalive_pending and peer_global_seq, and drop unused var. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: simplify ceph_msg_newSage Weil2010-05-181-11/+9Star
| | | | | | | We only need to pass in front_len. Callers can attach any other payload pieces (middle, data) as they see fit. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: make ceph_msg_new return NULL on failure; clean up, fix callersSage Weil2010-05-181-13/+7Star
| | | | | | | Returning ERR_PTR(-ENOMEM) is useless extra work. Return NULL on failure instead, and fix up the callers (about half of which were wrong anyway). Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use __page_cache_alloc and add_to_page_cache_lruYehuda Sadeh2010-05-181-1/+1
| | | | | | | | | | | | Following Nick Piggin patches in btrfs, pagecache pages should be allocated with __page_cache_alloc, so they obey pagecache memory policies. Also, using add_to_page_cache_lru instead of using a private pagevec where applicable. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: preserve seq # on requeued messages after transient transport errorsSage Weil2010-05-121-1/+10
| | | | | | | | | | If the tcp connection drops and we reconnect to reestablish a stateful session (with the mds), we need to resend previously sent (and possibly received) messages with the _same_ seq # so that they can be dropped on the other end if needed. Only assign a new seq once after the message is queued. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: zero unused message header, footer fieldsSage Weil2010-05-121-1/+5
| | | | | | | | We shouldn't leak any prior memory contents to other parties. And random data, particularly in the 'version' field, can cause problems down the line. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: discard incoming messages with bad seq #Sage Weil2010-05-031-0/+20
| | | | | | | | We can get old message seq #'s after a tcp reconnect for stateful sessions (i.e., the MDS). If we get a higher seq #, that is an error, and we shouldn't see any bad seq #'s for stateless (mon, osd) connections. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix seq counting for skipped messagesSage Weil2010-05-031-0/+2
| | | | | | Increment in_seq even when the message is skipped for some reason. Signed-off-by: Sage Weil <sage@newdream.net>
* Merge branch 'for-linus' of ↵Linus Torvalds2010-04-151-0/+9
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client: ceph: use separate class for ceph sockets' sk_lock ceph: reserve one more caps space when doing readdir ceph: queue_cap_snap should always queue dirty context ceph: fix dentry reference leak in dcache readdir ceph: decode v5 of osdmap (pool names) [protocol change] ceph: fix ack counter reset on connection reset ceph: fix leaked inode ref due to snap metadata writeback race ceph: fix snap context reference leaks ceph: allow writeback of snapped pages older than 'oldest' snapc ceph: fix dentry rehashing on virtual .snap dir
| * ceph: use separate class for ceph sockets' sk_lockSage Weil2010-04-131-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a separate class for ceph sockets to prevent lockdep confusion. Because ceph sockets only get passed kernel pointers, there is no dependency from sk_lock -> mmap_sem. If we share the same class as other sockets, lockdep detects a circular dependency from mmap_sem (page fault) -> fs mutex -> sk_lock -> mmap_sem because dependencies are noted from both ceph and user contexts. Using a separate class prevents the sk_lock(ceph) -> mmap_sem dependency and makes lockdep happy. Signed-off-by: Sage Weil <sage@newdream.net>
| * ceph: fix ack counter reset on connection resetSage Weil2010-04-031-0/+1
| | | | | | | | | | | | | | | | | | If in_seq_acked isn't reset along with in_seq, we don't ack received messages until we reach the old count, consuming gobs memory on the other end of the connection and introducing a large delay when those messages are eventually deleted. Signed-off-by: Sage Weil <sage@newdream.net>
* | include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo2010-03-301-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* ceph: avoid reopening osd connections when address hasn't changedSage Weil2010-03-231-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | We get a fault callback on _every_ tcp connection fault. Normally, we want to reopen the connection when that happens. If the address we have is bad, however, and connection attempts always result in a connection refused or similar error, explicitly closing and reopening the msgr connection just prevents the messenger's backoff logic from kicking in. The result can be a console full of [ 3974.417106] ceph: osd11 10.3.14.138:6800 connection failed [ 3974.423295] ceph: osd11 10.3.14.138:6800 connection failed [ 3974.429709] ceph: osd11 10.3.14.138:6800 connection failed Instead, if we get a fault, and have outstanding requests, but the osd address hasn't changed and the connection never successfully connected in the first place, do nothing to the osd connection. The messenger layer will back off and retry periodically, because we never connected and thus the lossy bit is not set. Instead, touch each request's r_stamp so that handle_timeout can tell the request is still alive and kicking. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix connection fault con_work reentrancy problemSage Weil2010-03-231-2/+0Star
| | | | | | | | | | | | The messenger fault was clearing the BUSY bit, for reasons unclear. This made it possible for the con->ops->fault function to reopen the connection, and requeue work in the workqueue--even though the current thread was already in con_work. This avoids a problem where the client busy loops with connection failures on an unreachable OSD, but doesn't address the root cause of that problem. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix authenticator timeoutSage Weil2010-03-211-8/+1Star
| | | | | | | | | | | | | | | | | We were failing to reconnect to services due to an old authenticator, even though we had the new ticket, because we weren't properly retrying the connect handshake, because we were calling an old/incorrect helper that left in_base_pos incorrect. The result was a failure to reconnect to the OSD or MDS (with an authentication error) if the MDS restarted after the service had been up a few hours (long enough for the original authenticator to be invalid). This was only a problem if the AUTH_X authentication was enabled. Now that the 'negotiate' and 'connect' stages are fully separated, we should use the prepare_read_connect() helper instead, and remove the obsolete one. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: reset front len on return to msgpool; BUG on mismatched front iovSage Weil2010-03-021-0/+2
| | | | | | | | | | Reset msg front len when a message is returned to the pool: the caller may have changed it. BUG if we try to send a message with a hdr.front_len that doesn't match the front iov. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: reset bits on connection closeSage Weil2010-03-021-0/+3
| | | | | | | | | | | | Clear LOSSYTX bit, so that if/when we reconnect, said reconnect will retry on failure. Clear _PENDING bits too, to avoid polluting subsequent connection state. Drop unused REGISTERED bit. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix connection fault STANDBY checkSage Weil2010-02-251-18/+13Star
| | | | | | | | | | Move any out_sent messages to out_queue _before_ checking if out_queue is empty and going to STANDBY, or else we may drop something that was never acked. And clean up the code a bit (less goto). Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: invalidate_authorizer without con->mutex heldSage Weil2010-02-251-8/+9
| | | | | | | | This fixes lock ABBA inversion, as the ->invalidate_authorizer() op may need to take a lock (or even call back into the messenger). Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix up unexpected message handlingSage Weil2010-02-231-2/+3
| | | | | | | | Fix skipping of unexpected message types from osd, mon. Clean up pr_info and debug output. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: cancel delayed work when closing connectionSage Weil2010-02-171-2/+5
| | | | | | | | | | | | This ensures that if/when we reopen the connection, we can requeue work on the connection immediately, without waiting for an old timer to expire. Queue new delayed work inside con->mutex to avoid any race. This fixes problems with clients failing to reconnect to the MDS due to the client_reconnect message arriving too late (due to waiting for an old delayed work timeout to expire). Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: allow connection to be reopened by fault callbackSage Weil2010-02-171-1/+1
| | | | | | | | | | | Fix the messenger to allow a ceph_con_open() during the fault callback. Previously the work wasn't getting queued on the connection because the fault path avoids requeued work (normally spurious). Loop on reopening by checking for the OPENING state bit. This fixes OSD reconnects when a TCP connection drops. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix msgr to keep sent messages until ackedSage Weil2010-02-141-2/+2
| | | | | | | | | The test was backwards from commit b3d1dbbd: keep the message if the connection _isn't_ lossy. This allows the client to continue when the TCP connection drops for some reason (network glitch) but both ends survive. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: allow renewal of auth credentialsSage Weil2010-02-111-0/+9
| | | | | | | | | Add infrastructure to allow the mon_client to periodically renew its auth credentials. Also add a messenger callback that will force such a renewal if a peer rejects our authenticator. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: include type in ceph_entity_addr, filepathSage Weil2010-01-291-0/+1
| | | | | | | Include a type/version in ceph_entity_addr and filepath. Include extra byte in filepath encoding as necessary. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: keep reserved replies on the request structureYehuda Sadeh2010-01-251-10/+10
| | | | | | | | This includes treating all the data preallocation and revokation at the same place, not having to have a special case for the reserved pages. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
* ceph: alloc message data pages and check if tid existsYehuda Sadeh2010-01-251-31/+2Star
| | | | | | | | | | | Now doing it in the same callback that is also responsible for allocating the 'front' part of the message. If we get a message that we haven't got a corresponding tid for, mark it for skipping. Moving the mutex unlock/lock from the osd alloc_msg callback to the calling function in the messenger. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
* ceph: refactor messages data section allocationYehuda Sadeh2010-01-251-28/+39
| | | | Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
* ceph: allocate middle of message before stating to readYehuda Sadeh2010-01-251-60/+82
| | | | | | | Both front and middle parts of the message are now being allocated at the ceph_alloc_msg(). Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
* ceph: remove unused erank fieldSage Weil2010-01-141-11/+8Star
| | | | | | | The ceph_entity_addr erank field is obsolete; remove it. Get rid of trivial addr comparison helpers while we're at it. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: support ceph_pagelist for message payloadSage Weil2009-12-231-4/+20
| | | | | | | | | | | The ceph_pagelist is a simple list of whole pages, strung together via their lru list_head. It facilitates encoding to a "buffer" of unknown size. Allow its use in place of the ceph_msg page vector. This will be used to fix the huge buffer preallocation woes of MDS reconnection. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: add feature bits to connection handshake (protocol change)Sage Weil2009-12-231-10/+37
| | | | | | | | Define supported and required feature set. Fail connection if the server requires features we do not support (TAG_FEATURES), or if the server does not support features we require. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: control access to page vector for incoming dataSage Weil2009-12-231-0/+29
| | | | | | | | | | | | | | When we issue an OSD read, we specify a vector of pages that the data is to be read into. The request may be sent multiple times, to multiple OSDs, if the osdmap changes, which means we can get more than one reply. Only read data into the page vector if the reply is coming from the OSD we last sent the request to. Keep track of which connection is using the vector by taking a reference. If another connection was already using the vector before and a new reply comes in on the right connection, revoke the pages from the other connection. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use connection mutex to protect read and write stagesSage Weil2009-12-231-18/+31
| | | | | | | | | Use a single mutex (previously out_mutex) to protect both read and write activity from concurrent ceph_con_* calls. Drop the mutex when doing callbacks to avoid nested locking (the callback may need to call something like ceph_con_close). Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: remove unaccessible codeYehuda Sadeh2009-12-221-4/+0Star
| | | | Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
* ceph: plug leak of incoming message during connection fault/closeSage Weil2009-12-221-2/+8
| | | | | | | If we explicitly close a connection, or there is a socket error, we need to drop any partially received message. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: hex dump corrupt server data to KERN_DEBUGSage Weil2009-12-221-0/+20
| | | | | | Also, print fsid using standard format, NOT hex dump. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: don't save sent messages on lossy connectionsSage Weil2009-12-221-3/+7
| | | | | | | For lossy connections we drop all state on socket errors, so there is no reason to keep sent ceph_msg's around. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: detect lossy state of connectionSage Weil2009-12-221-2/+4
| | | | | | | The server indicates whether a connection is lossy; set our LOSSYTX bit appropriately. Do not set lossy bit on outgoing connections. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: plug msg leak in con_faultSage Weil2009-12-221-2/+7
| | | | Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: carry explicit msg reference for currently sending messageSage Weil2009-12-221-4/+18
| | | | | | | Carry a ceph_msg reference for connection->out_msg. This will allow us to make out_sent optional. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use kref for ceph_msgSage Weil2009-12-081-29/+18Star
| | | | Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: simplify ceph_buffer interfaceSage Weil2009-12-071-1/+1
| | | | | | | | | We never allocate the ceph_buffer and buffer separtely, so use a single constructor. Disallow put on NULL buffer; make the caller check. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: reset msgr backoff during open, not after successful handshakeSage Weil2009-11-211-2/+1Star
| | | | | | | | Reset the backoff delay when we reopen the connection, so that the delays for any initial connection problems are reasonable. We were resetting only after a successful handshake, which was of limited utility. Signed-off-by: Sage Weil <sage@newdream.net>