I am a software engineer, and have written code for DSLAMs since about 1999. I currently work for Calix and specialize in ADSL/2/2+/VDSL2.
The phone networks were optimized to carry the spoken voice, which has a practical range of about 300 Hz to something less than 4000 Hz, so analog phone equipment was designed to carry analog signals up to 4 KHz. When this is carried digitally, you need to sample at at least twice the highest frequency of the signal (per the Nyquist theorem), so digital TDM voice technologies such as DDS, T1, T3, etc. are built up in a hierarchy starting at 8 KHz x 8 bits = 64 Kbps. But I digress.
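If you want to see that arithmetic spelled out, here is a minimal sketch (the DS0 and T1 figures are the standard ones; nothing here is specific to any vendor's gear):

    # Nyquist: sample at (at least) twice the highest frequency to be carried.
    voice_bandwidth_hz = 4000
    sample_rate_hz = 2 * voice_bandwidth_hz      # 8,000 samples/sec
    bits_per_sample = 8                          # 8-bit PCM (mu-law or A-law)

    ds0_bps = sample_rate_hz * bits_per_sample   # 64,000 bps -- one DS0 voice channel
    t1_bps = 24 * ds0_bps + 8000                 # 24 DS0s + 8 Kbps of framing
    print(ds0_bps, t1_bps)                       # 64000 1544000 (the familiar T1 rate)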
The analog equipment in use by land-line telephone companies was guaranteed to transmit analog signals up to 4 KHz, so old-tyme analog modems (1980s/1990s-era) had to stay below this frequency. As you note, Shannon's theorem predicted that they ought to be able to approach about 35 Kbps, but for a long time, people didn't know how to go that fast. In the mid-to-late 1980s, modems used a symbol (or "baud") rate of 1200 or 2400 symbols per second, and people were able to pack up to 4 bits into each symbol, which got us to 9600 bps. Pushing up the symbol rate and the bits per symbol got us to 14,400 bps, but we still didn't know how to approach the theoretical limit of 35 Kbps. Eventually, an encoding method called trellis coding (invented in the late 1970s/early 1980s) became well-known, and this allowed us to pack nearly 10 bits into each symbol. Now we could approach Shannon's limit, but necessary limitations on transmit power capped the speed at 33.6 Kbps.
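To put a number on Shannon's limit, here is the capacity formula C = B * log2(1 + SNR) evaluated with figures typical of a good voice channel. The exact bandwidth and SNR below are assumptions for illustration; real lines vary:

    import math

    # Shannon capacity: C = B * log2(1 + SNR)
    bandwidth_hz = 3100                  # usable voice bandwidth, ~300-3400 Hz (assumed)
    snr_db = 35                          # SNR of a clean analog line (assumed)
    snr_linear = 10 ** (snr_db / 10)

    capacity_bps = bandwidth_hz * math.log2(1 + snr_linear)
    print(round(capacity_bps))           # ~36,000 bps -- right around the 35 Kbps figure

    # And the bits-per-symbol arithmetic from above:
    print(2400 * 4)                      # 9,600 bps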
And that's it for an analog modem transmitting over traditional analog phone equipment: 33.6 Kbps. It does not go higher. The spate of 56 Kbps modems that came out in the 1990s was, indeed, employing a "trick". They took advantage of the fact that a lot of phone equipment was newer than what the traditional phone companies had guaranteed. I am fuzzy on this part, but the way it was explained to me was that the ISP's side can leave out the final step of converting to traditional analog and feed its samples digitally to the telco's equipment, so the downstream path suffers only one digital-to-analog conversion along the way -- I may be remembering that incorrectly. The bottom line is that it doesn't always work. A 56 Kbps modem will fall back to 33.6 Kbps or slower if the equipment it is connected to does not support the trick.
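For what it's worth, the arithmetic behind the "56K" number works out like this. This is the usual V.90 story, and the 7-usable-bits figure is the commonly quoted one, so treat it as a sketch rather than gospel:

    pcm_sample_rate = 8000    # PCM samples/sec at the telco's line card
    usable_bits = 7           # of the 8 PCM bits, roughly 7 survive robbed-bit
                              # signaling and power constraints (commonly quoted figure)

    downstream_bps = pcm_sample_rate * usable_bits
    print(downstream_bps)     # 56,000 bps -- hence "56K"
    # (In practice, US power regulations capped real links at about 53.3 Kbps.)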
So on to xDSL. As you point out, the discrete multi-tone (DMT) form of xDSL is essentially hundreds or thousands of parallel analog modems, each operating at a different frequency. DMT is not the only form of DSL modulation, though. One early form of DSL was SDSL, which used 2B1Q modulation, not DMT. Another example is SHDSL, which was designed to replace T1 and E1 lines and is also not DMT-based. But the common forms of DSL as defined by the ITU in ITU-T G.992.1 (ADSL), G.992.3 (ADSL2), G.992.5 (ADSL2+), and G.993.2 (VDSL2) are all based on DMT. By the way, I've never heard of a tone being referred to as a "bucket". In my experience, they are called "tones" or "subcarriers".
If you go look at those highly technical specs, and in particular refer to the annexes, you'll see that the frequencies at which the individual modems operate are spaced at 4.3125 KHz (except for VDSL2 profile 30a, which uses a spacing of 8.625 KHz), and for every frequency, the modems are limited to a particular power. In the earlier specs (ADSL/2/2+) the power was pretty much fixed by the PSD limit mask as published in the annexes. Starting with ADSL2+, the ITU began to specify ways of adjusting the power further with techniques referred to as upstream and downstream power backoff. The current state of the art in power adjustment is a technology called "vectoring", which gives control of the power spectrum to an expert system that can adjust power not only across all of the tones used on a particular line, but across all of the lines on a particular DSLAM, and eventually across multiple DSLAMs. The idea is to adjust the power on all of the frequencies so that all of the lines suffer much less far-end crosstalk (FEXT); i.e., if they all play nicely together instead of all trying to blast at the maximum power they're allowed, they can all achieve higher rates.
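To make the tone layout concrete, here is how the subcarrier grids work out for a few flavors. The tone counts and band edges below are the nominal ones; the exact set of usable tones varies by annex and profile:

    TONE_SPACING_HZ = 4312.5     # all the DMT flavors except VDSL2 profile 30a

    profiles = {
        "ADSL (G.992.1)":   256,     # tones 0..255, band edge 1.104 MHz
        "ADSL2+ (G.992.5)": 512,     # tones 0..511, band edge 2.208 MHz
        "VDSL2 17a":        4096,    # tones 0..4095, band edge 17.664 MHz
    }

    for name, n_tones in profiles.items():
        print(f"{name}: {n_tones} tones, "
              f"band edge {n_tones * TONE_SPACING_HZ / 1e6:.3f} MHz")

    # VDSL2 profile 30a doubles the spacing to 8.625 KHz to reach 30 MHz
    # with a manageable number of tones.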
And this (finally) gets to the heart of your question. It is not as simple as saying "each tone can get about 56 Kbps, and we have N tones, so we get N * 56 Kbps". Both sides of an xDSL connection go through an initialization process that begins with a handshake (specified by ITU-T G.994.1), during which they test all of the tones they want to use and agree on which tones are noise-free enough to use and how many bits can be placed into each tone. This depends on the length of the cable over which they are operating, its diameter, whether it has bridged taps, whether water is leaking through the insulation somewhere on the cable, whether there is radio-frequency interference from, say, ham radios, and probably 50 other things of which I am unaware. Then there is configuration information (are we shaping our power to Annex A or Annex B, are we using vectoring, etc.) that further modifies the decisions made by both sides. In the end, the two sides agree on a rate the line can support, and that's what you get.
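Here is a toy version of that per-tone bit-loading decision. The SNR-gap formula is the textbook approximation, and the gap, margin, and synthetic SNR sweep below are assumptions; the per-tone cap of 15 bits matches ADSL2/2+:

    import math

    SNR_GAP_DB = 9.8       # Shannon gap for uncoded QAM at ~1e-7 BER (textbook figure)
    MARGIN_DB = 6.0        # target noise margin (typical configured value)
    MAX_BITS = 15          # per-tone bit cap in ADSL2/2+
    SYMBOL_RATE = 4000     # DMT symbols per second

    def bits_for_tone(snr_db):
        """How many bits one tone can safely carry, given its measured SNR."""
        effective_db = snr_db - SNR_GAP_DB - MARGIN_DB
        if effective_db <= 0:
            return 0                                  # too noisy -- don't use this tone
        bits = math.log2(1 + 10 ** (effective_db / 10))
        return min(int(bits), MAX_BITS)

    # A made-up SNR sweep: strong at low frequencies, fading with attenuation.
    measured_snr_db = [55 - 0.1 * n for n in range(512)]
    bit_table = [bits_for_tone(snr) for snr in measured_snr_db]

    rate_bps = sum(bit_table) * SYMBOL_RATE
    print(f"negotiated rate ~ {rate_bps / 1e6:.1f} Mbps")   # ~10 Mbps for this sweep

Notice that the rate falls out of the measured SNR per tone, not out of a fixed per-tone speed -- which is why two lines with the same modem hardware can train at very different rates.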
By the way, bit swapping is the process of continually monitoring an xDSL line while it is up and moving bits from a tone that has become noisy to one with margin to spare, without retraining the line or changing the overall rate.
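As a cartoon of the bookkeeping (the real mechanism is an in-band message exchange defined in the G.992.x specs; this just shows the idea, and the thresholds are invented for illustration):

    def bit_swap(bit_table, margins_db, low=3.0, high=12.0):
        """Move one bit from the weakest tone to the strongest (toy version).

        'margins_db' holds the current noise margin of each tone, in dB.
        """
        worst = min(range(len(margins_db)), key=lambda n: margins_db[n])
        best = max(range(len(margins_db)), key=lambda n: margins_db[n])
        if (margins_db[worst] < low and margins_db[best] > high
                and bit_table[worst] > 0 and bit_table[best] < 15):
            bit_table[worst] -= 1    # relieve the tone that's in trouble
            bit_table[best] += 1     # total bits -- and the line rate -- stay the same
        return bit_table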
Hope this helps. Even though xDSL technology is destined to eventually be replaced by fiber, it is a highly-sophisticated technology. The wires over which it operates were never intended to support speeds of up to 100 Mbps (as you can get with VDSL2 on very short loops), and are often 50 or more years old. I find xDSL to be really quite impressive.