It's time for a major update to the thin provisioning target. This is a chance
to add new features and address deficiencies in the current version.

Features
========

Features that we should consider (some are more realistic than others):

- Performance enhancements for solid state storage, e.g. streaming writes.
  Take the erase block size into consideration.

- Compression.

- Resilience in the face of damaged metadata. Measure potential data loss
  compared to the size of the damage.

- Support zeroed data in the metadata, to avoid storing zeroes on disk.

- Get away from the fixed block size.

  The fixed size is always a compromise between provisioning performance and
  snapshot efficiency.

- Performance improvements for metadata.

  Space maps are too heavy.

- Performance improvements for multicore.

- Reduce metadata size.

- Efficient use of multiple devices.

  Currently thinp is totally unaware of how the data device is built up.

Anti-features
=============

Not considering these at all:

- Dedup.

Metadata
========

Problems with the existing metadata
-----------------------------------

- Btrees are fragile.
  Either use a different data structure, or add enough info that the trees can
  be inferred and rebuilt.

- Metadata is huge.
  Start using ranges; a sketch follows this list.

- Space maps.

  Reference counting ranges will be more tedious, and 'find free' now needs to
  find ranges quickly.

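To make the ranges idea concrete, here is a minimal sketch (in Rust, with
invented names) of a run-length mapping entry; one record stands in for a
whole run of contiguous block mappings:

```rust
/// Hypothetical range-based mapping entry: one record covers a run of
/// contiguous blocks rather than a single block.
#[derive(Debug, Clone, Copy)]
struct RangeMapping {
    thin_begin: u64, // first virtual block of the run
    data_begin: u64, // first data block backing the run
    len: u64,        // length of the run, in blocks
}

/// Translate a virtual block to a data block if it falls inside the run.
fn lookup(m: &RangeMapping, vblock: u64) -> Option<u64> {
    if vblock >= m.thin_begin && vblock < m.thin_begin + m.len {
        Some(m.data_begin + (vblock - m.thin_begin))
    } else {
        None
    }
}

fn main() {
    // One entry standing in for 16 consecutive block mappings.
    let m = RangeMapping { thin_begin: 100, data_begin: 5000, len: 16 };
    assert_eq!(lookup(&m, 107), Some(5007));
    assert_eq!(lookup(&m, 200), None);
}
```

Splitting such an entry when only part of a run is overwritten is where the
reference-counting tedium above comes from.
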
Ideas
-----

- What could we use instead of btrees?
  Skip lists? It's difficult to make these fit the persistent-data scheme; I
  think they're better as an in-core data structure (where spatial locality is
  less important).

- Drop reference counting from space maps completely.

  This would allow them to be implemented with a simpler data structure, like
  a radix tree or a trie. It would be impossible to ascertain which blocks
  were free without a complete walk of the metadata. This is possibly ok if
  the metadata shrinks drastically through the use of ranges.

- Space maps do not need to be 'within' the persistent-data structure system,
  since we never snapshot them.

Blob abstraction
================

A storage abstraction, a bit different from a block device. It presents a
virtual address space:

    (read dev-id begin end data)
    (write dev-id begin end data)
    (erase dev-id begin end)
    (copy src-dev-id src-begin src-end dest-dev-id dest-begin)

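Written out as code, the interface might look something like this trait; the
names mirror the s-expressions above, but the signatures and error handling
are purely assumptions:

```rust
use std::io;

/// The blob operations above, written out as a trait.  Illustrative
/// only: signatures, sync-ness and error handling are all assumptions.
trait Blob {
    fn read(&self, dev_id: u64, begin: u64, end: u64, data: &mut [u8]) -> io::Result<()>;
    fn write(&mut self, dev_id: u64, begin: u64, end: u64, data: &[u8]) -> io::Result<()>;
    fn erase(&mut self, dev_id: u64, begin: u64, end: u64) -> io::Result<()>;
    fn copy(&mut self, src_dev_id: u64, src_begin: u64, src_end: u64,
            dest_dev_id: u64, dest_begin: u64) -> io::Result<()>;
}
```
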
How do we cope with a device being split across different blobs? We need a
data structure to hold this mapping information:

    (map dev-id begin end) -> [(blob begin end)]

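One plausible shape for that structure, again with invented names: an ordered
map per device, keyed by the device-space start of each run, so
(map dev-id begin end) becomes a range query:

```rust
use std::collections::BTreeMap;

/// One run of a thin device's address space, living in a single blob.
/// All of this is illustrative; the field names are invented.
#[derive(Debug, Clone, Copy)]
struct Fragment {
    dev_end: u64,    // device-space end of the run (exclusive)
    blob: u32,       // which blob holds it
    blob_begin: u64, // offset within that blob's physical address space
}

/// Per-device map: device-space begin of each run -> its placement.
type DeviceMap = BTreeMap<u64, Fragment>;

/// Collect the runs overlapping [begin, end) in device space.
fn map_range(dm: &DeviceMap, begin: u64, end: u64) -> Vec<(u64, Fragment)> {
    dm.range(..end)
        .filter(|(_, f)| f.dev_end > begin) // keep only overlapping runs
        .map(|(b, f)| (*b, *f))
        .collect()
}
```
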
Could we use bloom filters in some way? (I can't see how; we'd need to cope
with erasure and false positives.)

Write:

We always want to write into the highest priority blob (i.e. SSD), so we need
to write to the new blob, commit, then erase from the old blobs. A sketch of
this ordering follows the erase path below.

Read:

Look up the blobs, issue the IOs and wait for all of them to complete.

Erase:

Look up the blobs, issue the erases.

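A sketch of that write ordering, with hypothetical types (the blob surface is
trimmed to the two calls used here):

```rust
use std::io;

/// Just enough blob surface for the ordering sketch; see the trait above.
trait Blob {
    fn write(&mut self, dev_id: u64, begin: u64, end: u64, data: &[u8]) -> io::Result<()>;
    fn erase(&mut self, dev_id: u64, begin: u64, end: u64) -> io::Result<()>;
}

struct Core {
    blobs: Vec<Box<dyn Blob>>, // index 0 is the highest priority (e.g. SSD)
}

impl Core {
    /// Persist the updated device -> blob mapping.  Stubbed out here.
    fn commit(&mut self) -> io::Result<()> { Ok(()) }

    /// Which blob currently holds [begin, end) for this device.  Stubbed.
    fn old_placement(&self, _dev_id: u64, _begin: u64, _end: u64) -> Option<usize> { None }

    /// The ordering that matters: write the new copy, commit the mapping,
    /// and only then erase the old copy, so a crash never loses data.
    fn write(&mut self, dev_id: u64, begin: u64, end: u64, data: &[u8]) -> io::Result<()> {
        let old = self.old_placement(dev_id, begin, end);
        self.blobs[0].write(dev_id, begin, end, data)?;
        self.commit()?;
        if let Some(i) = old {
            self.blobs[i].erase(dev_id, begin, end)?;
        }
        Ok(())
    }
}
```

If we crash between the write and the commit, the old mapping still points at
the old copy; if we crash between the commit and the erase, we merely leak the
old copy until the erase is retried.
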
Dealing with atomicity
----------------------

Blobs store their metadata in different ways. Do they individually implement
transactions, or can we enforce transactionality from above? I think the
address space has to be managed for all blobs as one space. So each blob
presents a *physical* address space, and the core maps thin devices to
physical spaces.

Journal blob: records changes as a series; efficient for SSDs, but slow to
start up, since we need to walk the journal to build an in-core map (a replay
sketch follows the blob descriptions below).

Transparent blob: no smarts; physical addresses are translated to the data
device with a linear mapping. This suggests we have to keep pretty much all of
the current thinp metadata in core.

Compression blob: adds an additional layer of remapping to provide
compression.

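A sketch of the journal blob's start-up replay, with a made-up record layout;
this walk over the whole journal is the slow start-up cost:

```rust
use std::collections::BTreeMap;

/// Made-up journal record layout, just to give the replay its shape.
enum Record {
    Map { dev_id: u64, begin: u64, phys: u64, len: u64 },
    Unmap { dev_id: u64, begin: u64 },
}

/// Rebuild the in-core map by walking the whole journal in order;
/// later records supersede earlier ones.
fn replay(journal: &[Record]) -> BTreeMap<(u64, u64), (u64, u64)> {
    let mut map = BTreeMap::new();
    for r in journal {
        match r {
            Record::Map { dev_id, begin, phys, len } => {
                map.insert((*dev_id, *begin), (*phys, *len));
            }
            Record::Unmap { dev_id, begin } => {
                map.remove(&(*dev_id, *begin));
            }
        }
    }
    map
}
```
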
Aging
-----

Data ages from one blob to another. Because the journal blob is held in
temporal order, it's trivial to work out what should be archived. But what
about the transparent one? Perhaps this should be another instance of the
journal blob?

ALL blobs now mix metadata and data. Core metadata needs to go somewhere (a
special dev id within the fast blob?).

Temp btrees
-----------

If we're journalling, we can relax the way we use btrees. There are a couple
of options:

- Treat the btree as totally expendable, and use no shadowing at all.

- The commit period for the btree can be controlled by the journal, avoiding
  commits whenever a REQ_FLUSH comes in (see the sketch after this list).

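A sketch of the second option, all names invented: REQ_FLUSH is satisfied by
syncing the journal alone, and the btree commits on its own period:

```rust
use std::io;

/// Hypothetical core: REQ_FLUSH only forces the journal to disk; the
/// (expendable) btree is committed on its own schedule.
struct MetadataCore {
    journal_entries: u64,  // entries appended since the last btree commit
    commit_threshold: u64, // btree commit period, in journal entries
}

impl MetadataCore {
    fn sync_journal(&mut self) -> io::Result<()> { Ok(()) } // stub
    fn commit_btree(&mut self) -> io::Result<()> { Ok(()) } // stub

    /// A REQ_FLUSH arrives: durability comes from the journal alone,
    /// so no btree commit is needed here.
    fn handle_flush(&mut self) -> io::Result<()> {
        self.sync_journal()
    }

    /// Called on each journalled update; the btree commits only when
    /// enough journal has accumulated.
    fn note_update(&mut self) -> io::Result<()> {
        self.journal_entries += 1;
        if self.journal_entries >= self.commit_threshold {
            self.commit_btree()?;
            self.journal_entries = 0;
        }
        Ok(())
    }
}
```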