summaryrefslogtreecommitdiffstats
path: root/fs/ceph/mds_client.c
Commit message (Collapse)AuthorAgeFilesLines
* ceph: clean up on forwarded aborted mds requestSage Weil2010-05-291-4/+9
| | | | | | | | | | If an mds request is aborted (timeout, SIGKILL), it is left registered to keep our state in sync with the mds. If we get a forward notification, though, we know the request didn't succeed and we can unregister it safely. We were trying to resend it, but then bailing out (and not unregistering) in __do_request. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: make lease code DN specificSage Weil2010-05-291-2/+2
| | | | | | | | | | | The lease code includes a mask in the CEPH_LOCK_* namespace, but that namespace is changing, and only one mask (formerly _DN == 1) is used, so hard code for that value for now. If we ever extend this code to handle leases over different data types we can extend it accordingly. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: make mds requests killable, not interruptibleSage Weil2010-05-291-2/+2
| | | | | | | | | | The underlying problem is that many mds requests can't be restarted. For example, a restarted create() would return -EEXIST if the original request succeeds. However, we do not want a hung MDS to hang the client too. So, use the _killable wait_for_completion variants to abort on SIGKILL but nothing else. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: Storage class should be before const qualifierTobias Klauser2010-05-221-2/+2
| | | | | | | | | | | The C99 specification states in section 6.11.5: The placement of a storage-class specifier other than at the beginning of the declaration specifiers in a declaration is an obsolescent feature. Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: all allocation functions should get gfp_maskYehuda Sadeh2010-05-181-5/+6
| | | | | | | | | This is essential, as for the rados block device we'll need to run in different contexts that would need flags that are other than GFP_NOFS. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use common helper for aborted dir request invalidationSage Weil2010-05-181-16/+23
| | | | | | | We invalidate I_COMPLETE and dentry leases in two places: on aborted mds request and on request replay. Use common helper to avoid duplicate code. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: cope with out of order (unsafe after safe) mds replySage Weil2010-05-181-0/+6
| | | | Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: throw out dirty caps metadata, data on session teardownSage Weil2010-05-181-3/+41
| | | | | | | | | | | | | | | | | The remove_session_caps() helper is called when an MDS closes out our session (either normally, or as a result of a failed reconnect), and when we tear down state for umount. If we remove the last cap, and there are no cap migrations in progress, then there is little hope of us flushing out that data to the mds (without heroic efforts to reconnect and flush). So, to avoid leaving inodes pinned (due to dirty state) and crashing after umount, throw out dirty caps state and unpin the inodes. Print a warning to the console so we know something was lost. NOTE: Although we drop wrbuffer refs, we don't actually mark pages clean; maybe a truncate should be queued? Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: attempt mds reconnect if mds closes our sessionSage Weil2010-05-181-2/+3
| | | | | | | | | | | | | | | | | Currently, if our session is closed (due to a timeout, or explicit close, or whatever), we just sit there doing nothing unless/until the MDS restarts, at which point we try to reconnect. Change client to attempt an immediate reconnect if our session is closed. Note that currently the MDS doesn't support this, and our attempt will fail. We'll get a session CLOSE, our caps and dirty cap state will be dropped, and the client will be free to attempt to reconnect. That's clearly not as nice as a successful reconnect, but it at least allows us to try to carry on, and in the future the MDS will support a reconnect and we will fare better. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: clean up send_mds_reconnect interfaceSage Weil2010-05-181-31/+16Star
| | | | | | | | | | | Pass a ceph_mds_session, since the caller has it. Remove the dead code for sending empty reconnects. It used to be used when the MDS contacted _us_ to solicit a reconnect, and we could reply saying "go away, I have no session." Now we only send reconnects based on the mds map, and only when we do in fact have an open session. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: wait for mds OPEN reply to indicate reconnect successSage Weil2010-05-181-15/+13Star
| | | | | | | | | | | | | | | We used to infer reconnect success by watching the MDS state, essentially assuming that hearing nothing meant things were ok. That wasn't particularly reliable. Instead, the MDS replies with an explicit OPEN message to indicate success. Strictly speaking, this is a protocol change, but it is a backwards compatible one that does not break new clients + old servers or old clients + new servers. At least not yet. Drop unused @all argument from kick_requests while we're at it. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: only send cap releases when mds is OPEN|HUNGSage Weil2010-05-181-1/+3
| | | | | | | | | On OPENING we shouldn't have any caps (or releases). On CLOSING, we should wait until we succeed (and throw it all out), or don't (and are OPEN again). On RECONNECTING we can wait until we are OPEN. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: dicard cap releases on mds restartSage Weil2010-05-181-0/+41
| | | | | | | | | If the MDS restarts, the expire caps state is no longer shared, and can be thrown out. Caps state will be rebuilt on the MDS during the reconnect process that follows. Zero out any release messages and adjust the release counter accordingly. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: clean up cap release loop vs spinlockSage Weil2010-05-181-4/+3Star
| | | | Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: skip mds sync on forced unmountSage Weil2010-05-181-0/+3
| | | | Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: simplify ceph_msg_newSage Weil2010-05-181-6/+5Star
| | | | | | | We only need to pass in front_len. Callers can attach any other payload pieces (middle, data) as they see fit. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: make ceph_msg_new return NULL on failure; clean up, fix callersSage Weil2010-05-181-23/+17Star
| | | | | | | Returning ERR_PTR(-ENOMEM) is useless extra work. Return NULL on failure instead, and fix up the callers (about half of which were wrong anyway). Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: handle kzalloc() failureCheng Renquan2010-05-181-0/+4
| | | | | Signed-off-by: Cheng Renquan <crquan@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: reduce build_path debug outputSage Weil2010-05-181-6/+4Star
| | | | Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: invalidate affected dentry leases on aborted requestsSage Weil2010-05-171-1/+6
| | | | | | | | | | | | | If we abort a request, we return to caller, but the request may still complete. And if we hold the dir FILE_EXCL bit, we may not release a lease when sending a request. A simple un-tar, control-c, un-tar again will reproduce the bug (manifested as a 'Cannot open: File exists'). Ensure we invalidate affected dentry leases (as well dir I_COMPLETE) so we don't have valid (but incorrect) leases. Do the same, consistently, at other sites where I_COMPLETE is similarly cleared. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix race between aborted requests and fill_traceSage Weil2010-05-171-0/+11
| | | | | | | | | When we abort requests we need to prevent fill_trace et al from doing anything that relies on locks held by the VFS caller. This fixes a race between the reply handler and the abort code, ensuring that continue holding the dir mutex until the reply handler completes. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: clean up mds reply, error handlingSage Weil2010-05-171-57/+54Star
| | | | | | | | | | | | | | | | We would occasionally BUG out in the reply handler because r_reply was nonzero, due to a race with ceph_mdsc_do_request temporarily setting r_reply to an ERR_PTR value. This is unnecessary, messy, and also wrong in the EIO case. Clean up by consistently using r_err for errors and r_reply for messages. Also fix the abort logic to trigger consistently for all errors that return to the caller early (e.g., EIO from timeout case). If an abort races with a reply, use the result from the reply. Also fix locking for r_err, r_reply update in the reply handler. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix cap removal racesSage Weil2010-05-121-2/+3
| | | | | | | | | | | | | | The iterate_session_caps helper traverses the session caps list and tries to grab an inode reference. However, the __ceph_remove_cap was clearing the inode backpointer _before_ removing itself from the session list, causing a null pointer dereference. Clear cap->ci under protection of s_cap_lock to avoid the race, and to tightly couple the list and backpointer state. Use a local flag to indicate whether we are releasing the cap, as cap->session may be modified by a racing thread in iterate_session_caps. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix locking for waking session requests after reconnectSage Weil2010-05-111-13/+16
| | | | | | | | | | The session->s_waiting list is protected by mdsc->mutex, not s_mutex. This was causing (rare) s_waiting list corruption. Fix errors paths too, while we're here. A more thorough cleanup of this function is coming soon. Signed-off-by: Sage Weil <sage@newdream.net>
* include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo2010-03-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
* ceph: fix use after free on mds __unregister_requestSage Weil2010-03-291-1/+2
| | | | | | | | There was a use after free in __unregister_request that would trigger whenever the request map held the last reference. This appears to have triggered an oops during 'umount -f' when requests are being torn down. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix session check on mds replySage Weil2010-03-231-1/+1
| | | | | | | | | Fix a broken check that a reply came back from the same MDS we sent the request to. I don't think a case that actually triggers this would ever come up in practice, but it's clearly wrong and easy to fix. Reported-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: handle kmalloc() failureDan Carpenter2010-03-231-0/+2
| | | | | | | | Return ERR_PTR(-ENOMEM) if kmalloc() fails. We handle allocation failures the same way later in the function. Signed-off-by: Dan Carpenter <error27@gmail.com> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: propagate mds session allocation failures to callerSage Weil2010-03-231-1/+6
| | | | | | Return error to original caller if register_session() fails. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: prevent dup stale messages to console for restarting mdsSage Weil2010-03-231-1/+1
| | | | | | | | Prevent duplicate 'mds0 caps stale' message from spamming the console every few seconds while the MDS restarts. Set s_renew_requested earlier, so that we only print the message once, even if we don't send an actual request. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix mds sync() race with completing requestsSage Weil2010-03-231-7/+20
| | | | | | | | | | | | | | | | | | | | The wait_unsafe_requests() helper dropped the mdsc mutex to wait for each request to complete, and then examined r_node to get the next request after retaking the lock. But the request completion removes the request from the tree, so r_node was always undefined at this point. Since it's a small race, it usually led to a valid request, but not always. The result was an occasional crash in rb_next() while dereferencing node->rb_left. Fix this by clearing the rb_node when removing the request from the request tree, and not walking off into the weeds when we are done waiting for a request. Since the request we waited on will _always_ be out of the request tree, take a ref on the next request, in the hopes that it won't be. But if it is, it's ok: we can start over from the beginning (and traverse over older read requests again). Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: remove bogus mds forward warningSage Weil2010-02-261-5/+1Star
| | | | | | | The must_resend flag is always true, not false. In any case, we can just ignore it anyway. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix client_request_forward decodingSage Weil2010-02-231-8/+2Star
| | | | | | | | | The tid is in the message header, not body. Broken since 6df058c0. No need to look at next mds session; just mark the request and be done. (The old error path was broken too, but now it's gone.) Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: drop messages on unregistered mds sessions; cleanupSage Weil2010-02-231-43/+42Star
| | | | | | | | | | Verify the mds session is currently registered before handling incoming messages. Clean up message handlers to pull mds out of session->s_mds instead of less trustworthy src field. Clean up con_{get,put} debug output. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix iterate_caps removal raceSage Weil2010-02-171-9/+42
| | | | | | | | | | | | | | | | | | | | | We need to be able to iterate over all caps on a session with a possibly slow callback on each cap. To allow this, we used to prevent cap reordering while we were iterating. However, we were not safe from races with removal: removing the 'next' cap would make the next pointer from list_for_each_entry_safe be invalid, and cause a lock up or similar badness. Instead, we keep an iterator pointer in the session pointing to the current cap. As before, we avoid reordering. For removal, if the cap isn't the current cap we are iterating over, we are fine. If it is, we clear cap->ci (to mark the cap as pending removal) but leave it in the session list. In iterate_caps, we can safely finish removal and get the next cap pointer. While we're at it, clean up put_cap to not take a cap reservation context, as it was never used. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use rbtree for snap_realmsSage Weil2010-02-171-11/+5Star
| | | | | | | Switch from radix tree to rbtree for snap realms. This is much more appropriate given that realm keys are few and far between. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use rbtree for mds requestsSage Weil2010-02-171-59/+90
| | | | | | | | | | The rbtree is a more appropriate data structure than a radix_tree. It avoids extra memory usage and simplifies the code. It also fixes a bug where the debugfs 'mdsc' file wasn't including the most recent mds request. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: allow renewal of auth credentialsSage Weil2010-02-111-0/+13
| | | | | | | | | Add infrastructure to allow the mon_client to periodically renew its auth credentials. Also add a messenger callback that will force such a renewal if a peer rejects our authenticator. Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net> Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: include type in ceph_entity_addr, filepathSage Weil2010-01-291-1/+1
| | | | | | | Include a type/version in ceph_entity_addr and filepath. Include extra byte in filepath encoding as necessary. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: allocate middle of message before stating to readYehuda Sadeh2010-01-251-2/+0Star
| | | | | | | Both front and middle parts of the message are now being allocated at the ceph_alloc_msg(). Signed-off-by: Yehuda Sadeh <yehuda@hq.newdream.net>
* ceph: properly handle aborted mds requestsSage Weil2010-01-251-5/+23
| | | | | | | | | | | | | | | | | | Previously, if the MDS request was interrupted, we would unregister the request and ignore any reply. This could cause the caps or other cache state to become out of sync. (For instance, aborting dbench and doing rm -r on clients would complain about a non-empty directory because the client didn't realize it's aborted file create request completed.) Even we don't unregister, we still can't process the reply normally because we are no longer holding the caller's locks (like the dir i_mutex). So, mark aborted operations with r_aborted, and in the reply handler, be sure to process all the caps. Do not process the namespace changes, though, since we no longer will hold the dir i_mutex. The dentry lease state can also be ignored as it's more forgiving. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use ceph_pagelist for mds reconnect message; change encoding (protocol ↵Sage Weil2009-12-231-100/+56Star
| | | | | | | | | | | | change) Use the ceph_pagelist to encode the MDS reconnect message. We change the message encoding (protocol change!) at the same time to make our life easier (we don't know how many snaprealms we have when we start encoding). An empty message implies the session is closed/does not exist. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: include transaction id in ceph_msg_header (protocol change)Sage Weil2009-12-231-2/+3
| | | | | | | | | | Many (most?) message types include a transaction id. By including it in the fixed size header, we always have it available even when we are unable to allocate memory for the (larger, variable sized) message body. This will allow us to error out the appropriate request instead of (silently) dropping the reply. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: do not touch_caps while iterating over caps listSage Weil2009-12-231-4/+10
| | | | | | | | | | Avoid confusing iterate_session_caps(), flag the session while we are iterating so that __touch_cap does not rearrange items on the list. All other modifiers of session->s_caps do so under the protection of s_mutex. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: make mds ops interruptibleSage Weil2009-12-221-6/+9
| | | | Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: hex dump corrupt server data to KERN_DEBUGSage Weil2009-12-221-0/+4
| | | | | | Also, print fsid using standard format, NOT hex dump. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: use kref for struct ceph_mds_requestSage Weil2009-12-071-35/+34Star
| | | | Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: reset requested max_size after mds reconnectSage Weil2009-11-201-4/+17
| | | | | | | | | | | | The max_size increase request to the MDS can get lost during an MDS restart and reconnect. Reset our requested value after the MDS recovers, so that any blocked writes will re-request a larger max_size upon waking. Also, explicit wake session caps after the reconnect. Normally the cap renewal catches this, but not in the cases where the caps didn't go stale in the first place, which would leave writers waiting on max_size asleep. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: fix debugfs entry, simplify fsid checksSage Weil2009-11-201-10/+2Star
| | | | | | | | | | | We may first learn our fsid from any of the mon, osd, or mds maps (whichever the monitor sends first). Consolidate checks in a single helper. Initialize the client debugfs entry then, since we need the fsid (and global_id) for the directory name. Also remove dead mount code. Signed-off-by: Sage Weil <sage@newdream.net>
* ceph: negotiate authentication protocol; implement AUTH_NONE protocolSage Weil2009-11-191-4/+65
| | | | | | | | | | | | | | | | When we open a monitor session, we send an initial AUTH message listing the auth protocols we support, our entity name, and (possibly) a previously assigned global_id. The monitor chooses a protocol and responds with an initial message. Initially implement AUTH_NONE, a dummy protocol that provides no security, but works within the new framework. It generates 'authorizers' that are used when connecting to (mds, osd) services that simply state our entity name and global_id. This is a wire protocol change. Signed-off-by: Sage Weil <sage@newdream.net>