-
Unknown A
Hi everyone.
-
Unknown B
Today we are continuing our implementation of makemore, our favorite character-level language model. Now, you'll notice that the background behind me is different. That's because I am in Kyoto and it is awesome. So I'm in a hotel room here. Now, over the last few lectures, we've built up to this architecture that is a multi-layer perceptron character-level language model. So we see that it receives three previous characters and tries to predict the fourth character in a sequence using a very simple multi-layer perceptron with one hidden layer of neurons with tanh nonlinearities. So what I'd like to do now in this lecture is I'd like to complexify this architecture. In particular, we would like to take more characters in a sequence as an input, not just three. And in addition to that, we don't just want to feed them all into a single hidden layer because that squashes too much information too quickly.
-
Unknown B
Instead, we would like to make a deeper model that progressively fuses this information to make its guess about the next character in a sequence. And so we'll see that as we make this architecture more complex, we're actually going to arrive at something that looks very much like a WaveNet. So WaveNet is this paper published by DeepMind in 2016. And it is also a language model basically, but it tries to predict audio sequences instead of character-level sequences or word-level sequences. But fundamentally the modeling setup is identical. It is an autoregressive model and it tries to predict the next character in a sequence. And the architecture actually takes this interesting hierarchical sort of approach to predicting the next character in a sequence with this tree-like structure. And this is the architecture and we're going to implement it in the course of this video. So let's get started.
-
Unknown B
So the starter code for part 5 is very similar to where we ended up in part three. Recall that part four was the manual backpropagation exercise. That is kind of an aside. So we are coming back to part three, copy pasting chunks out of it, and that is our starter code for part five. I've changed very few things otherwise. So a lot of this should look familiar to you if you've gone through part three. So in particular, very briefly, we are doing imports. We are reading our data set of words and we are processing the data set of words into individual examples. And none of this data generation code has changed.
-
Unknown A
And basically we have lots and lots of examples.
-
Unknown B
In particular, we have 182,000 examples of three characters trying to predict the fourth one. And we've broken up every one of these words into little problems of given three characters.
-
Unknown A
Predict the fourth one.
-
Unknown B
This is our data set and this is what we're trying to get the neural net to do. Now, in part three, we started to develop our code around these layer modules, for example a class Linear. And we're doing this because we want to think of these modules as building blocks, like Lego bricks that we can stack up into neural networks, and we can feed data between these layers and stack them up into sort of graphs. Now, we also developed these layers to have APIs and signatures very similar to those that are found in PyTorch. So we have torch.nn and it's got all these layer building blocks that.
-
Unknown A
You would use in practice.
-
Unknown B
And we were developing all of these to mimic the APIs of these. So, for example, we have Linear. So there will also be a torch.nn.Linear, and its signature will be very similar to our signature and the functionality will be also quite identical, as.
-
Unknown A
Far as I'm aware.
-
Unknown B
So we have the Linear layer, the BatchNorm1d layer, and the Tanh layer that we developed previously. And Linear just does a matrix multiply in the forward pass of this module. BatchNorm, of course, is this crazy layer that we developed in the previous lecture. And what's crazy about it is, well, there's many things. Number one, it has these running mean and variances that are trained outside of backpropagation. They are trained using an exponential moving average inside this layer when we call the forward pass. In addition to that, there's this training flag, because the behavior of batch norm is different during train time and evaluation time. And so suddenly we have to be very careful that BatchNorm is in its.
-
Unknown A
Correct state, that it's in the evaluation.
-
Unknown B
State or training state. So that's something to now keep track of, something that sometimes introduces bugs because you forget to put it into the right mode. And finally, we saw that batch norm couples the statistics, or the activations, across the examples in the batch. So normally we thought of the batch as just an efficiency thing, but now we are coupling the computation across batch elements, and it's done for the purposes of controlling the activation statistics, as we saw in the previous video. So it's a very weird layer. It leads to a lot of bugs, partly, for.
-
Unknown A
Example, because you have to modulate the.
-
Unknown B
Training and eval phase and so on. In addition, for example, you have to wait for the mean and the variance to settle and to actually reach a steady state. And so you have to be aware that there's state in this layer, and state is usually harmful. Now, I brought out the generator object. Previously we had a generator G and so on inside these layers. I've discarded that in favor of just initializing the torch RNG outside here, just once, globally, just for simplicity. And then here we are starting to build out some of the neural network elements. This should look very familiar. We have our embedding table C and then we have a list of layers: a Linear feeds into a BatchNorm, which feeds into a Tanh, and then a linear output layer whose weights are scaled down so we are not confidently wrong at initialization.
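For reference, a minimal sketch of what that starter setup looks like, assuming the Linear, BatchNorm1d and Tanh classes from part three are already defined; the exact hyperparameters here are approximate, not a verbatim copy of the notebook:

```python
import torch

torch.manual_seed(42)  # one global RNG instead of passing a generator into each layer

vocab_size, block_size = 27, 3
n_embd   = 10   # dimensionality of the character embedding vectors
n_hidden = 200  # neurons in the hidden layer

C = torch.randn((vocab_size, n_embd))
layers = [
    Linear(n_embd * block_size, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
]
with torch.no_grad():
    layers[-1].weight *= 0.1  # scale down the output layer so we're not confidently wrong at init

parameters = [C] + [p for layer in layers for p in layer.parameters()]
for p in parameters:
    p.requires_grad = True
```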
-
Unknown B
We see that this is about 12,000 parameters. We're telling PyTorch that the parameters require gradients. The optimization is, as far as I'm aware, identical and should look very, very familiar. Nothing changed here. The loss plot looks very crazy; we should probably fix this. And that's because 32 batch elements are too few, so you can get very lucky or unlucky in any one of these batches, and it creates a very thick loss plot. So we're going to fix that soon. Now, once we want to evaluate the trained neural network, we need to remember, because of the batch norm layers, to set all the layers to training equals false. So this only matters for the batch norm layer so far. And then we evaluate, and we see that currently we have a validation loss of 2.10, which is fairly good, but there's still a ways to go. But even at 2.10 we see that when we sample from the model, we actually get relatively name-like results that do not exist in the training set.
-
Unknown B
So for example evon, kilo, pros, alaya, etc. So certainly not unreasonable, I would say, but not amazing. And we can still push this validation loss even lower and get much better samples that are even more name-like. So let's improve this model now. Okay, first let's fix this graph, because it is daggers in my eyes and I just can't take it anymore. So lossi, if you recall, is a Python list of floats. So for example, the first 10 elements look like this. Now what we'd like to do basically is we need to average up some of these values to get a more sort of representative value along the way. So one way to do this is the following. In PyTorch, if I create for example a tensor of the first ten numbers, then this is currently a one dimensional array. But recall that I can view this array as two dimensional.
-
Unknown B
So, for example, I can view it as a 2 by 5 array and this is a 2D tensor now, 2 by 5. And you see what PyTorch has done is that the first row of this tensor is the first five elements and the second row is the second five elements. I can also view it as a 5 by 2, as an example. And then recall that I can also use negative one in place of one of these numbers, and PyTorch will calculate what that number must be in order to make the number of elements work out. So this can be like this or like that; both will work. Of course, this would not work. Okay, so this allows us to spread out some of the consecutive values into rows. So that's very helpful because what we can do now is, first of all, we're going to create a torch.tensor out of this list of floats, and then we're going to view it as whatever it is, but we're going to stretch it out into rows of 1000 consecutive elements.
-
Unknown B
So the shape of this now becomes 200 by 1000, and each row is 1000 consecutive elements in this list. So that's very helpful because now we can do a mean along the rows, and the shape of that will just be 200. And so we've taken basically the mean on every row. So plt.plot of that should be something nicer. Much better. So we see that we've basically made a lot of progress. And then here, this is the learning rate decay. So here we see that the learning rate decay subtracted a ton of energy out of the system and allowed us to settle into sort of the local minimum in this optimization. So this is a much nicer plot. Let me come up and delete the monster. And we're going to be using this going forward. Now, next up, what I'm bothered by is that you see our forward pass is a little bit gnarly and takes way too many lines of code.
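The plot smoothing just described looks roughly like this, assuming lossi is the Python list of per-step losses and its length is a multiple of 1000:

```python
import torch
import matplotlib.pyplot as plt

# group consecutive loss values into rows of 1000 and average each row
smoothed = torch.tensor(lossi).view(-1, 1000).mean(1)  # e.g. 200,000 steps -> 200 points
plt.plot(smoothed)
```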
-
Unknown B
So in particular, we see that we've organized some of the layers inside the layers list, but not all of them, for no reason. So in particular, we see that we still have the embedding table special-cased outside of the layers. And in addition to that, the viewing operation here is also outside of our layers. So let's create layers for these, and then we can add those layers to just our list. So in particular, the two things that we need are: here we have this embedding table and we are indexing into it with the integers inside the tensor Xb. So that's an embedding table lookup, just done with indexing. And then here we see that we have this view operation which, if you recall from the previous video, simply rearranges the character embeddings and stretches them out into a row. And effectively what that does is the concatenation operation, basically, except it's free, because viewing is very cheap in PyTorch and no memory is being copied.
-
Unknown B
We're just re-representing how we view that tensor. So let's create modules for both of these operations: the embedding operation and the flattening operation. I actually wrote the code just to save some time. We have a module Embedding and a module Flatten, and both of them simply do the indexing operation in the forward pass and the flattening operation here. And this C now will just become self.weight inside an Embedding module. And I'm calling these layers specifically Embedding and Flatten because it turns out that both of them actually exist in PyTorch. So in PyTorch we have nn.Embedding, and it also takes the number of embeddings and the dimensionality of the embedding, just like we have here. But in addition, PyTorch takes a lot of other keyword arguments that we are not using for our purposes yet. And Flatten also exists in PyTorch, and it also takes additional keyword arguments.
-
Unknown A
That we are not using.
-
Unknown B
So we have a very simple Flatten, but both of them exist in PyTorch; ours are just a bit simpler. And now that we have these, we can simply take out some of these special-cased things. So instead of C we're just going to have an Embedding with vocab_size and n_embd. And then after the embedding we are going to flatten. So let's construct those modules. And now I can take out this C, and here I don't have to special-case it anymore, because now C is the embedding's weight and it's inside layers, so this should just work. And then here our forward pass simplifies substantially, because we don't need to do these outside of the layers explicitly anymore. They're now inside layers, so we can delete those. But now to kick things off, we want this little x, which in the beginning is just Xb, the tensor of integers, specifying the identities of these characters at the input.
-
Unknown B
And so these characters can now directly feed into the first layer and this should just work. So let me come here and insert a break, because I just want to make sure that the first iteration of this runs and that there's no mistake. So that ran properly.
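A sketch of the two modules just described, roughly as they appear in the notebook (details may differ slightly):

```python
import torch

class Embedding:
    def __init__(self, num_embeddings, embedding_dim):
        self.weight = torch.randn((num_embeddings, embedding_dim))
    def __call__(self, IX):
        self.out = self.weight[IX]   # plain indexing does the table lookup
        return self.out
    def parameters(self):
        return [self.weight]

class Flatten:
    def __call__(self, x):
        self.out = x.view(x.shape[0], -1)  # stretch each example out into one long row
        return self.out
    def parameters(self):
        return []
```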
-
Unknown A
And basically we've substantially simplified the forward pass here. Okay, I'm sorry, I changed my microphone, so hopefully the audio is a little bit better now. One more thing that I would like to do in order to pytorchify our code even further is that right now we are maintaining all of our modules in a naked list of layers. And we can also simplify this, because we can introduce the concept of PyTorch containers. So in torch.nn, which we are basically rebuilding from scratch here, there's a concept of containers, and these containers are basically a way of organizing layers into lists or dicts and so on. So in particular, there's a Sequential, which maintains a list of layers and is a Module class in PyTorch. And it basically just passes a given input through all the layers sequentially, exactly as we are doing here. So let's write our own Sequential.
-
Unknown A
I've written the code here, and basically the code for Sequential is quite straightforward. We pass in a list of layers, which we keep here, and then given any input, in the forward pass we just call all the layers sequentially and return the result. In terms of the parameters, it's just all the parameters of the child modules. So we can run this, and we can again simplify substantially, because we don't maintain this naked list of layers. We now have a notion of a model, which is a module and in particular is a Sequential of all the layers. And now parameters are simply just model.parameters. And so that list comprehension now lives here. And then here we are doing all the things we used to do. Now here the code again simplifies substantially, because we don't have to do this forwarding here. Instead we just call the model on the input data.
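A minimal sketch of that Sequential container, matching the description above (the hypothetical usage line at the bottom assumes the other modules from this lecture):

```python
class Sequential:
    def __init__(self, layers):
        self.layers = layers
    def __call__(self, x):
        for layer in self.layers:   # pass the input through every layer in order
            x = layer(x)
        self.out = x
        return self.out
    def parameters(self):
        # collect the parameters of all child modules into one flat list
        return [p for layer in self.layers for p in layer.parameters()]

# usage sketch:
# model = Sequential([Embedding(vocab_size, n_embd), Flatten(), ...])
# logits = model(Xb)
```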
-
Unknown A
And the input data here are the integers inside Xb. So we can simply get the logits, which are the outputs of our model, by calling the model on Xb. And then the cross entropy here takes the logits and the targets. So this simplifies substantially, and then this looks good. So let's just make sure this runs. That looks good. Now here we actually have some work to do still, but I'm going to come back later. For now there's no more layers; there's model.layers, but it's naughty to access attributes of these classes directly, so we'll come back and fix this later. And then here, of course, this simplifies substantially as well, because the logits are the model called on x. And then these logits come here. So we can evaluate the train and validation loss, which currently is terrible because we just reinitialized the neural net.
-
Unknown A
And then we can also sample from the model. And this simplifies dramatically as well, because we just want to call the model on the context, and out come the logits. And then these logits go into a softmax to get the probabilities, etc. So we can sample from this model. What did I screw up? Okay, so I fixed the issue and we now get the result that we expect, which is gibberish, because the model is not trained, because we reinitialized it from scratch. The problem was that when I fixed this cell to be model.layers instead of just layers, I did not actually run the cell. And so our neural net was in a training mode. And what caused the issue here is the BatchNorm layer, as the BatchNorm layer often likes to do, because batch norm is in the training mode. And here we are passing in an input which is a batch of just a single example, made up of the context.
-
Unknown A
And so if you are trying to pass a single example into a batch norm that is in the training mode, you're going to end up estimating the variance using the input. And the variance of a single number is not a number, because it is a measure of a spread. So for example, the variance of just a single number five, you can see, is not a number. And so that's what happened. And batch norm basically caused an issue, and then that polluted all of the further processing. So all that we had to do was make sure that this runs. And again, we didn't actually see the issue with the loss. We could have evaluated the loss, but we got the wrong result because batch norm was in the training mode. And so we still get a result, it's just a wrong result, because it's using the sample statistics of the batch, whereas we want to use the running mean and running variance inside the batch norm.
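A tiny illustration of that point, the (unbiased) variance of a single number is undefined:

```python
import torch

x = torch.tensor([5.0])
print(x.var())  # tensor(nan): dividing by n - 1 = 0, so the variance of one number is not a number
```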
-
Unknown A
And so again, an example of introducing a bug because we did not properly maintain the state of what is training or not. Okay, so I reran everything and here's where we are. As a reminder, we have a training loss of 2.05 and a validation loss of 2.10. Now, because these losses are very similar to each other, we have a sense that we are not overfitting too much on this task, and we can make additional progress in our performance by scaling up the size of the neural network and making everything bigger and deeper. Now, currently we are using this architecture here, where we are taking in some number of characters, going into a single hidden layer, and then going to the prediction of the next character. The problem here is we don't have a naive way of making this bigger in a productive way. We could of course use our layers as sort of building blocks and materials to introduce additional layers here and make the network deeper.
-
Unknown A
But it is still the case that we are crushing all of the characters into a single layer all the way at the beginning. And even if we make this a bigger layer and add neurons, it's still kind of silly to squash all that information so fast in a single step. What we'd like to do instead is we'd like our network to look a lot more like this, as in the WaveNet case. So you see, in the WaveNet, when we are trying to make the prediction for the next character in the sequence, it is a function of the previous characters that feed in. But all of these different characters are not just crushed into a single layer and then into a sandwich; they are crushed slowly. So in particular, we take two characters and we fuse them into sort of like a bigram representation, and we do that for all these characters consecutively.
-
Unknown A
And then we take the bigrams and we fuse those into four-character-level chunks, and then we fuse that again. And so we do that in this tree-like hierarchical manner. So we fuse the information from the previous context slowly into the network as it gets deeper. And so this is the kind of architecture that we want to implement. Now, in WaveNet's case, this is a visualization of a stack of dilated causal convolution layers. And this makes it sound very scary, but actually the idea is very simple. And the fact that it's a dilated causal convolution layer is really just an implementation detail to make everything fast. We're going to see that later, but for now let's just keep the basic idea of it, which is this progressive fusion. So we want to make the network deeper, and at each level we want to fuse only two consecutive elements: two characters, then two bigrams, then two four-grams, and so on.
-
Unknown A
So let's implement this. Okay, so first up, let me scroll to where we build the data set, and let's change the block size from three to eight. So we're going to be taking eight characters of context to predict the ninth character. So the data set now looks like this: we have a lot more context feeding in to predict any next character in a sequence. And these eight characters are going to be processed in this tree-like structure now. Now if we scroll here, everything here should just be able to work. So we should be able to redefine the network. You see that the number of parameters has increased by 10,000, and that's because the block size has grown. So this first linear layer is much, much bigger. Our linear layer now takes eight characters into this middle layer, so there's a lot more parameters there.
-
Unknown A
But this should just run. Let me just break right after the very first iteration. So you see that this runs just fine. It's just that this network doesn't make too much sense; we're crushing way too much information way too fast. So let's now come in and see how we could try to implement the hierarchical scheme. Now, before we dive into the details of the reimplementation here, I was just curious to actually run it and see where we are in terms of the baseline performance of just lazily scaling up the context length. So I let it run. We get a nice loss curve, and then evaluating the loss, we actually see quite a bit of improvement just from increasing the context length. So I started a little bit of a performance log here. And previously where we were is we were getting a performance of 2.10 on the validation loss.
-
Unknown A
And now simply scaling up the context length from 3 to 8 gives us a performance of 2.02. So quite a bit of an improvement here. And also when you sample from the model, you see that the names are definitely improving qualitatively as well. So we could of course spend a lot of time here tuning things and making it even bigger and scaling up the network further, even with this simple sort of setup here. But let's continue and let's implement the hierarchical model, and treat this as just a rough baseline performance. But there's a lot of optimization left on the table in terms of some of the hyperparameters that you're hopefully getting a sense of now. Okay, so let's scroll up now, come back up. And what I've done here is I've created a bit of a scratch space for us to just look at the forward pass of the neural net and inspect the shape of the tensors along the way as the neural net forwards.
-
Unknown A
So here I'm just temporarily, for debugging, creating a batch of just, say, four examples, so four random integers. Then I'm plucking out those rows from our training set and then I'm passing the input Xb into the model. Now, the shape of Xb here, because we have only four examples, is 4 by 8, and this 8 is now the current block size. So inspecting Xb, we just see that we have four examples. Each one of them is a row of Xb, and we have eight characters here. And this integer tensor just contains the identities of those characters. So the first layer of our neural net is the embedding layer. So passing Xb, this integer tensor, through the embedding layer creates an output that is 4 by 8 by 10. So our embedding table has, for each character, a 10 dimensional vector that we are trying to learn.
-
Unknown A
And so what the embedding layer does here is it plucks out the embedding vector for each one of these integers and organizes it all in a 4 by 8 by 10 tensor. So all of these integers are translated into ten dimensional vectors inside this three dimensional tensor. Now, passing that through the flatten layer, as you recall, what this does is it views this tensor as just a 4 by 80 tensor. And what that effectively does is that all these ten dimensional embeddings for all these eight characters just end up being stretched out into a long row. And that looks kind of like a concatenation operation, basically. So by viewing the tensor differently, we now have a 4 by 80, and inside this 80 it's all the 10 dimensional vectors just concatenated next to each other. And then the linear layer, of course, takes 80 and creates 200 channels just via matrix multiplication.
-
Unknown B
So, so far, so good.
-
Unknown A
Now I'd like to show you something surprising. Let's look at the insides of the linear layer and remind ourselves how it works. The linear layer here, in a forward pass, takes the input x, multiplies it with a weight, and then optionally adds a bias. And the weight here is two dimensional, as defined here, and the bias is one dimensional. So effectively, in terms of the shapes involved, what's happening inside this linear layer looks like this right now. And I'm using random numbers here, but I'm just illustrating the shapes and what happens. Basically, a 4 by 80 input comes into the linear layer, gets multiplied by this 80 by 200 weight matrix inside, and there's a 200 dimensional bias. And the shape of the whole thing that comes out of the linear layer is 4 by 200, as we see here. Now notice here, by the way, that this here will create a 4 by 200 tensor and then plus 200; there's a broadcasting happening here, but 4 by 200 broadcasts with 200.
-
Unknown A
So everything works here. So now the surprising thing that I'd like to show you, that you may not expect, is that this input here that is being multiplied doesn't actually have to be two dimensional. This matrix multiply operator in PyTorch is quite powerful, and in fact you can actually pass in higher dimensional arrays or tensors and everything works fine. So for example, this could be 4 by 5 by 80 and the result in that case will become 4 by 5 by 200. You can add as many dimensions as you like on the left here. And so effectively what's happening is that the matrix multiplication only works on the last dimension, and the dimensions before it in the input tensor are left unchanged. So basically these dimensions on the left are all treated as just batch dimensions. So we can have multiple batch dimensions, and then in parallel over all those dimensions we are doing the matrix multiplication on the last dimension.
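A small shape illustration of that point, with random numbers standing in for real data:

```python
import torch

x2d = torch.randn(4, 80)        # the usual 2D input
x3d = torch.randn(4, 5, 80)     # extra leading dimensions are treated as batch dimensions
W   = torch.randn(80, 200)
b   = torch.randn(200)

print((x2d @ W + b).shape)      # torch.Size([4, 200])
print((x3d @ W + b).shape)      # torch.Size([4, 5, 200]); matmul acts on the last dim only
```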
-
Unknown A
So this is quite convenient, because we can use that in our network now. Because remember that we have these eight characters coming in, and we don't want to flatten all of it out into a large 80-dimensional vector, because we don't want to matrix multiply all 80 numbers by a weight matrix immediately. Instead, we want to group these like this: every two consecutive elements, 1 and 2, 3 and 4, 5 and 6, and 7 and 8, all of these should now basically be flattened out and multiplied by a weight matrix. But all of these four groups here we'd like to process in parallel. So it's kind of like a batch dimension that we can introduce, and then we can in parallel basically process all of these bigram groups in the four batch dimensions of an individual example, and also over the actual batch dimension of the, you know, four examples in our example here.
-
Unknown A
So let's see how that works. Effectively, what we have right now is that we take a 4 by 80 and multiply it by an 80 by 200 in the linear layer; this is what happens. But instead what we want is we don't want 80 characters, or 80 numbers, to come in. We only want two characters to come in on the very first layer, and those two characters should be fused. So in other words, we just want 20 to come in, right, 20 numbers would come in. And here we don't want a 4 by 80 to feed into the linear layer; we actually want these groups of two to feed in. So instead of 4 by 80 we want this to be a 4 by 4 by 20. So these are the four groups of two, and each one of those characters is a 10 dimensional vector. So what we want now is we need to change the flatten layer so it doesn't output a 4 by 80, but it outputs a 4 by 4 by 20, where basically every two consecutive characters are packed in on the very last dimension.
-
Unknown A
And then this four is the first batch dimension, and this four is the second batch dimension, referring to the four groups inside every one of these examples. And then this will just multiply like this. So this is what we want to get to. So we're going to have to change the linear layer in terms of how many inputs it expects: it shouldn't expect 80, it should just expect 20 numbers. And we have to change our flatten layer so it doesn't just fully flatten out this entire example; it needs to create a 4 by 4 by 20 instead of a 4 by 80. So let's see how this could be implemented. Basically, right now we have an input that is a 4 by 8 by 10 that feeds into the flatten layer. And currently the flatten layer just stretches it out. So if you remember the implementation of Flatten, it takes its input x and it just views it as whatever the batch dimension is, and then negative one.
-
Unknown A
So effectively what it does right now is e.view(4, -1). And the shape of this of course is 4 by 80. So that's what currently happens, and we instead want this to be a 4 by 4 by 20, where these consecutive 10 dimensional vectors get concatenated. So you know how in Python you can take a list, say range of 10, so we have numbers from 0 to 9, and we can index like this to get all the even parts, and we can also index starting at one and going in steps of two to get all the odd parts. So one way to implement this would be as follows. We can take e and we can index into it for all the batch elements, and then just the even elements in this dimension, so at indexes 0, 2, 4 and 6, and then all the parts from this last dimension, and this gives us the even characters. And then here this gives us all the odd characters.
-
Unknown A
And basically what we want to do is we want to make sure that these get concatenated in PyTorch. So we want to concatenate these two tensors along the second dimension. So this, and the shape of it, would be 4 by 4 by 20. This is definitely the result we want: we are explicitly grabbing the even parts and the odd parts, and we're arranging those two 4 by 4 by 10 tensors right next to each other and concatenating, so this works. But it turns out that what also works is you can simply use view again and just request the right shape. And it just so happens that in this case those vectors will again end up being arranged exactly the way we want. So in particular, if we take e and we just view it as a 4 by 4 by 20, which is what we want, we can check that this is exactly equal to.
-
Unknown A
Let me call this the explicit concatenation, I suppose. So explicit's shape is 4 by 4 by 20. If you just view e as 4 by 4 by 20, you can check that when you compare it to explicit element-wise, all of the values come out True. So basically, long story short, we don't need to make an explicit call to concatenate, etc. We can simply take this input tensor to Flatten and we can just view it in whatever way we want. And in particular, we don't want to stretch things out with negative one; we want to actually create a three dimensional array. And depending on how many consecutive vectors we want to fuse, for example two, we can just simply ask for this last dimension to be 20 and use a negative one here, and PyTorch will figure out how many groups it needs to pack into this additional batch dimension.
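The comparison just described, as a small sketch with a random embedding tensor standing in for the real one:

```python
import torch

e = torch.randn(4, 8, 10)                                    # (batch, characters, embedding dim)
explicit = torch.cat([e[:, ::2, :], e[:, 1::2, :]], dim=2)   # pair up even/odd characters explicitly
viewed   = e.view(4, 4, 20)                                  # same arrangement, no copy

print(explicit.shape, viewed.shape)   # both torch.Size([4, 4, 20])
print((explicit == viewed).all())     # tensor(True)
```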
-
Unknown A
So let's now go into Flatten and implement this. Okay, so I scrolled up here to Flatten, and what we'd like to do is we'd like to change it now. So let me create a constructor and take the number of consecutive elements that we would like to concatenate in the last dimension of the output. So here we're just going to remember self.n = n. And then I want to be careful here, because PyTorch actually has a torch.flatten, and its keyword arguments are different and they kind of function differently. So our Flatten is going to start to depart from PyTorch's Flatten. So let me call it FlattenConsecutive or something like that, just to make it clear that the APIs are not quite the same. So this basically flattens only some n consecutive elements and puts them into the last dimension. Now here the shape of x is B by T by C.
-
Unknown A
So let me pop those out into variables. And recall that in our example down below, B was 4, T was 8 and C was 10. Now, instead of doing x.view(B, -1), which is what we had before, we want this to be B, and in the last dimension we want C times n, that's how many numbers the consecutive elements take up. And for the middle dimension, instead of negative one (I don't super love the use of negative one, because I like to be very explicit so that you get error messages when things don't go according to your expectation), what do we expect here? We expect this to become T divided by n, using integer division. So that's what I expect to happen. And then one more thing I want to do here: remember, previously, all the way in the beginning, n was three, and basically we were concatenating all the three characters that existed there.
-
Unknown A
So we basically concatenated everything, and so sometimes that can create a spurious dimension of one here. So if it is the case that x.shape at 1 is 1, then it's kind of like a spurious dimension. So we don't want to return a three dimensional tensor with a one here; we just want to return a two dimensional tensor exactly as we did before. So in this case, basically, we will just say x equals x.squeeze, which is a PyTorch function. And squeeze takes a dimension: it either squeezes out all the dimensions of a tensor that are one, or you can specify the exact dimension that you want to be squeezed. And again, I like to be as explicit as possible always, so I expect to squeeze out only that dimension at index 1 of this three dimensional tensor. And if this dimension is one, then I just want to return B by C times n.
-
Unknown A
And so self.out will be x, and then we return self.out. So that's the candidate implementation. And of course this should be self.n instead of just n. So let's run, and let's come here now and take it for a spin. So FlattenConsecutive, and in the beginning let's just use eight. So this should recover the previous behavior. So FlattenConsecutive of eight, which is the current block size, we can do this, and that should recover the previous behavior. So we should be able to run the model, and here we can inspect: I have a little code snippet here where I iterate over all the layers, and I print the name of the class and the shape of its output. And so we see the shapes as we expect them after every single layer. So now let's try to restructure it using our FlattenConsecutive and do it hierarchically.
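For reference, a sketch of the FlattenConsecutive module just described (details may differ slightly from the notebook):

```python
class FlattenConsecutive:
    def __init__(self, n):
        self.n = n  # how many consecutive elements to fuse into the last dimension
    def __call__(self, x):
        B, T, C = x.shape
        x = x.view(B, T // self.n, C * self.n)
        if x.shape[1] == 1:
            x = x.squeeze(1)  # drop the spurious middle dimension of size 1
        self.out = x
        return self.out
    def parameters(self):
        return []
```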
-
Unknown A
So in particular, we want to flatten consecutive not the full block size, but just two. And then we want to process this with a Linear. Now, the number of inputs to this Linear will not be n_embd times block_size; it will now only be n_embd times 2, so 20. This goes through the first layer, and now we can in principle just copy paste this. Now the next linear layer should expect n_hidden times 2, and the last piece of it should expect n_hidden times two again. So this is sort of the naive version of it. So running this, we now have a much, much bigger model, and we should be able to basically just forward the model, and now we can inspect the numbers in between. So the 4 by 8 by 10 was flattened consecutively into 4 by 4 by 20. This was projected into 4 by 4 by 200. And then BatchNorm just worked out of the box, and we'll have to verify that BatchNorm does the correct thing, even though it takes a three dimensional input instead of a two dimensional input.
-
Unknown A
Then we have tanh, which is element-wise. Then we crushed it again: we flattened consecutively and ended up with a 4 by 2 by 400. Then a Linear brought it back down to 200, then BatchNorm, tanh. And lastly we get a 4 by 400, and we see that for the last flatten here, FlattenConsecutive squeezed out that dimension of 1, so we only ended up with 4 by 400, and then Linear, BatchNorm, tanh and the last linear layer to get our logits. And so the logits end up in the same shape as they were before. But now we actually have a nice three layer neural net, and it basically corresponds, whoops, sorry, it basically corresponds exactly to this network now, except only this piece here, because we only have three layers, whereas here in this example there are four layers with a total receptive field size of 16 characters instead of just eight characters.
-
Unknown A
So the block size here is 16. So this piece of it is basically what is implemented here. Now we just have to kind of figure out some good channel numbers to use here. Now, in particular, I changed the number of hidden units to be 68 in this architecture, because when I use 68, the number of parameters comes out to be 22,000. So that's exactly the same that we had before, and we have the same amount of capacity in this neural net in terms of the number of parameters. But the question is whether we are utilizing those parameters in a more efficient architecture. So what I did then is I got rid of a lot of the debugging cells here and I reran the optimization, and scrolling down to the result, we see that we get roughly identical performance. So our validation loss now is 2.029, and previously it was 2.027.
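A sketch of the hierarchical model just described, assuming the Sequential, Embedding, FlattenConsecutive, Linear, BatchNorm1d and Tanh classes from this lecture; the channel numbers here follow the description above and are approximate:

```python
n_embd, n_hidden = 10, 68   # 68 hidden units keeps the parameter count at roughly 22,000

model = Sequential([
    Embedding(vocab_size, n_embd),
    FlattenConsecutive(2), Linear(n_embd * 2,   n_hidden), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    FlattenConsecutive(2), Linear(n_hidden * 2, n_hidden), BatchNorm1d(n_hidden), Tanh(),
    Linear(n_hidden, vocab_size),
])
```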
-
Unknown A
So controlling for the number of parameters, changing from the flat to the hierarchical architecture is not giving us anything yet. That said, there are two things to point out. Number one, we didn't really torture the architecture here very much. This is just my first guess, and there's a bunch of hyperparameter search that we could do in terms of how we allocate our budget of parameters to what layers. Number two, we still may have a bug inside the BatchNorm1d layer. So let's take a look at that, because it runs, but does it do the right thing? So I pulled up the layer inspector, sort of, that we have here and printed out the shapes along the way. And currently it looks like the BatchNorm is receiving an input that is 32 by 4 by 68. And here on the right I have the current implementation of BatchNorm that we have right now.
-
Unknown A
Now, this batch norm assumed, in the way we wrote it at the time, that x is two dimensional. So it was N by D, where N was the batch size. So that's why we only reduced the mean and the variance over the zeroth dimension. But now x will basically become three dimensional. So what's happening inside the batch norm layer right now, and how come it's working at all and not giving any errors? The reason for that is basically that everything broadcasts properly, but the batch norm is not doing what we want it to do. So in particular, let's basically think through what's happening inside the batch norm, looking at what's happening here. I have the code here. So we're receiving an input of 32 by 4 by 68, and then we are doing the mean here (here I have e instead of x), but we're taking the mean over dimension zero, and that's actually giving us 1 by 4 by 68.
-
Unknown A
So we're taking the mean only over the very first dimension, and that's giving us a mean and a variance that still maintain this dimension here. So these means are only taken over 32 numbers in the first dimension. And then when we perform this, everything still broadcasts correctly. But basically what ends up happening is, when we also look at the running mean and its shape (so I'm looking at model.layers, at the first batch norm layer, and then looking at whatever the running mean became and its shape), the shape of this running mean now is 1 by 4 by 68, right? Instead of it being just 68, because we have 68 channels, we expect to have 68 means and variances that we're maintaining. But actually we have an array of 4 by 68. And so basically what this is telling us is that this batch norm is currently working in parallel over 4 times 68 instead of just 68 channels.
-
Unknown A
So basically we are maintaining statistics for every one of these four positions, individually and independently. And instead what we want to do is we want to treat this four as a batch dimension, just like the zeroth dimension. So as far as the batch norm is concerned, we don't want to average over just 32 numbers; we want to now average over 32 times 4 numbers for every single one of these 68 channels. And so let me now remove this. It turns out that when you look at the documentation of torch.mean, so let's go to torch.mean, in one of its signatures, when we specify the dimension, we see that the dimension here can be an int, or it can also be a tuple of ints. So we can reduce over multiple dimensions at the same time.
-
Unknown A
So instead of just reducing over 0, we can pass in a tuple (0, 1), and here (0, 1) as well. And then what's going to happen is the output of course is going to be the same, but now, because we reduce over 0 and 1, if we look at the mean's shape, we see that we took the mean over both the zeroth and the first dimension. So we're just getting 68 numbers and a couple of spurious dimensions here. So now this becomes 1 by 1 by 68. The running mean and the running variance will analogously become 1 by 1 by 68. So even though there are these spurious dimensions, the correct thing will happen, in that we are only maintaining means and variances for 68 channels, and we're now calculating the mean and variance across 32 times 4 numbers.
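A tiny shape check of that reduction, with random numbers standing in for the real activations:

```python
import torch

e = torch.randn(32, 4, 68)
emean = e.mean((0, 1), keepdim=True)   # reduce over both batch dimensions
evar  = e.var((0, 1), keepdim=True)
print(emean.shape, evar.shape)         # both torch.Size([1, 1, 68])
```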
-
Unknown A
So that's exactly what we want. And let's change the implementation of BatchNorm1d that we have, so that it can take in two dimensional or three dimensional inputs and perform accordingly. So at the end of the day the fix is relatively straightforward. Basically, the dimension we want to reduce over is either 0 or the tuple (0, 1), depending on the dimensionality of x. So if x.ndim is 2, so it's a two dimensional tensor, then the dimension we want to reduce over is just the integer 0. If x.ndim is 3, so it's a three dimensional tensor, then the dims we're going to reduce over are 0 and 1. And then here we just pass in dim. And if the dimensionality of x is anything else, we'll now get an error, which is good. So that should be the fix. Now, I want to point out one more thing.
-
Unknown A
We're actually departing from the API of PyTorch here a little bit, because when you go to BatchNorm1d in PyTorch, you can scroll down and you can see that the input to this layer can either be N by C, where N is the batch size and C is the number of features or channels, or it actually does accept three dimensional inputs, but it expects it to be N by C by L, where L is, say, the sequence length or something like that. So this is a problem, because you see how C is nested here in the middle. And so when it gets three dimensional inputs, this batch norm layer will reduce over 0 and 2 instead of 0 and 1. So basically, PyTorch's BatchNorm1d layer assumes that C will always be the first dimension after the batch, whereas we assume here that C is the last dimension and there are some number of batch dimensions beforehand.
-
Unknown A
And so it expects N by C or N by C by L; we expect N by C or N by L by C. And so it's a deviation. I think it's okay; I prefer it this way, honestly. So this is the way that we will keep it for our purposes. So I redefined the layers, reinitialized the neural net, and did a single forward pass with a break, just for one step. Looking at the shapes along the way, they're of course identical; all the shapes are the same. But the way we see that things are actually working as we want them to now is that when we look at the batch norm layer, the running mean shape is now 1 by 1 by 68. So we're only maintaining 68 means for every one of our channels, and we're treating both the zeroth and the first dimension as a batch dimension, which is exactly what we want.
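A sketch of the dimension-aware fix described above, based on the BatchNorm1d class built in part three (the exact code in the notebook may differ slightly):

```python
import torch

class BatchNorm1d:
    def __init__(self, dim, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.training = True
        # learned scale and shift
        self.gamma = torch.ones(dim)
        self.beta = torch.zeros(dim)
        # buffers, updated with an exponential moving average (outside backprop)
        self.running_mean = torch.zeros(dim)
        self.running_var = torch.ones(dim)

    def __call__(self, x):
        if self.training:
            if x.ndim == 2:
                dim = 0          # (N, C): reduce over the batch only
            elif x.ndim == 3:
                dim = (0, 1)     # (N, L, C): treat both leading dims as batch dims
            else:
                raise ValueError('expected 2D or 3D input')
            xmean = x.mean(dim, keepdim=True)
            xvar = x.var(dim, keepdim=True)
        else:
            xmean = self.running_mean
            xvar = self.running_var
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        self.out = self.gamma * xhat + self.beta
        if self.training:
            with torch.no_grad():
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * xmean
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * xvar
        return self.out

    def parameters(self):
        return [self.gamma, self.beta]
```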
-
Unknown A
So let me retrain the neural net now. Okay, so I retrained the neural net with the bug fix. We get a nice curve, and when we look at the validation performance, we do actually see a slight improvement. So we went from 2.029 to 2.022. So basically the bug inside the batch norm was holding us back a little bit, it looks like, and we are getting a tiny improvement now, but it's not clear if this is statistically significant. And the reason we slightly expect an improvement is because we're not maintaining so many different means and variances that are only estimated using 32 numbers effectively. Now we are estimating them using 32 times 4 numbers. So you just have a lot more numbers that go into any one estimate of the mean and variance, and it allows things to be a bit more stable and less wiggly inside those estimates of those statistics.
-
Unknown A
So, pretty nice. With this more general architecture in place, we are now set up to push the performance further by increasing the size of the network. So, for example, I bumped up the number of embedding dimensions to 24 instead of 10, and also increased the number of hidden units. But using the exact same architecture, we now have 76,000 parameters. And the training takes a lot longer, but we do get a nice curve. And then when you actually evaluate the performance, we are now getting a validation performance of 1.993. So we've crossed over into the sub-2.0 territory and we're at about 1.99. But we are starting to have to wait quite a bit longer, and we're a little bit in the dark with respect to the correct setting of the hyperparameters here and the learning rates and so on, because the experiments are starting to take longer to train.
-
Unknown A
And so we are missing sort of an experimental harness on which we could run a number of experiments and really tune this architecture very well. So I'd like to conclude now with a few notes. We basically improved our performance from a starting point of 2.1 down to 1.99, but I don't want that to be the focus, because honestly, we're kind of in the dark. We have no experimental harness, we're just guessing and checking, and this whole thing is terrible. We're just looking at the training loss; normally you want to look at both the training and the validation loss together. The whole thing looks different if you're actually trying to squeeze out numbers. That said, we did implement this architecture from the WaveNet paper, but we did not implement this specific forward pass of it, where you have a more complicated linear layer, sort of, that is this gated linear layer kind of thing, and there's residual connections and skip connections and so on.
-
Unknown A
So we did not implement that; we just implemented this structure. I would like to briefly hint at, or preview, how what we've done here relates to convolutional neural networks as used in the WaveNet paper. Basically, the use of convolutions is strictly for efficiency; it doesn't actually change the model we've implemented. So here, for example, let me look at a specific name to work with an example. So there's a name in our training set and it's DeAndre, and it has seven letters. So that is eight independent examples in our model. So all these rows here are independent examples of DeAndre. Now you can forward, of course, any one of these rows independently. So I can take my model and call it on any individual index. Notice, by the way, here I'm being a little bit tricky. The reason for this is that the shape of Xtr at 7 is just a one dimensional array of eight, so you can't actually call the model on it.
-
Unknown A
You're going to get an error because there's no batch dimension. So when you do Xtr at a list containing 7, then the shape of this becomes 1 by 8, so I get an extra batch dimension of one, and then we can forward the model. So that forwards a single example. And you might imagine that you actually may want to forward all of these eight at the same time. So pre-allocating some memory and then doing a for loop eight times and forwarding all of those eight here will give us all the logits in all these different cases. Now, for us, with the model as we've implemented it right now, this is eight independent calls to our model. But what convolutions allow you to do is they allow you to basically slide this model efficiently over the input sequence. And so this for loop can be done not outside in Python, but inside of kernels in CUDA.
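That naive per-position loop looks roughly like this, assuming Xtr is the training input tensor, the eight rows for this name start at index 7, and vocab_size is 27 (all of these names and indices follow the discussion above and are illustrative):

```python
import torch

logits = torch.zeros(8, 27)           # pre-allocate one row of logits per position
for i in range(8):
    out = model(Xtr[[7 + i]])         # [[...]] indexing keeps a batch dimension of 1
    logits[i] = out[0]                # eight independent forward calls through the model
```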
-
Unknown A
And so this for loop gets hidden inside the convolution. So you can think of the convolution basically as a for loop, applying a little linear filter over the space of some input sequence. And in our case the space we're interested in is one dimensional, and we're interested in sliding these filters over the input data. So this diagram actually is fairly good as well. Basically, what they are highlighting in black here is one single sort of tree of this calculation, so just calculating a single output example here. And this is basically what we've implemented: we've implemented this single black structure and calculated a single output, a single example. But what convolutions allow you to do is they allow you to take this black structure and kind of slide it over the input sequence and calculate all of these orange outputs at the same time.
-
Unknown A
Or, here, that corresponds to calculating all of these outputs at all the positions of DeAndre at the same time. And the reason that this is much more efficient is because, number one, as I mentioned, the for loop doing the sliding is inside the CUDA kernels, so that makes it efficient. But number two, notice the variable reuse here. For example, if we look at this circle, this node here is the right child of this node, but it's also the left child of the node here. And so basically this node and its value is used twice. And so right now, in this naive way, we'd have to recalculate it, but here we are allowed to reuse it. So in the convolutional neural network, you think of these linear layers that we have up above as filters. And we take these filters, and they're linear filters, and you slide them over the input sequence, and we calculate the first layer and then the second layer and then the third layer and then the output layer of the sandwich.
-
Unknown A
And it's all done very efficiently using these convolutions. So we're going to cover that in a future video. The second thing I hope you took away from this video is that you've seen me basically implement all of these layer Lego building blocks, or module building blocks, and I'm implementing them over here. And we've implemented a number of layers together, and we've also implemented these containers, and we've overall pytorchified our code quite a bit more. Now, basically what we're doing here is we're reimplementing torch.nn, which is the neural networks library on top of torch.Tensor. And it looks very much like this, except it is much better, because it's in PyTorch instead of being janky in my Jupyter notebook. So I think going forward, I will probably consider us as having unlocked torch.nn: we understand roughly what's in there, how these modules work, how they're nested, and what they're doing on top of torch.Tensor.
-
Unknown A
So hopefully we'll just switch over and start using torch.nn directly. The next thing I hope you got a bit of a sense of is what the development process of building deep neural networks looks like, which I think was relatively representative to some extent. So, number one, we are spending a lot of time in the documentation pages of PyTorch, and we're reading through all the layers, looking at documentation: what are the shapes of the inputs, what can they be, what does the layer do, and so on. Unfortunately, I have to say the PyTorch documentation is not very good. They spend a ton of time on hardcore engineering of all kinds of distributed primitives, etc. But as far as I can tell, no one is maintaining the documentation. It will lie to you, it will be wrong, it will be incomplete, it will be unclear. So unfortunately it is what it is and you just kind of do your best with what they've given us.
-
Unknown A
Number two, the other thing that I hope you got a sense of is that there's a ton of trying to make the shapes work, and there's a lot of gymnastics around these multi-dimensional arrays: are they two dimensional, three dimensional, four dimensional? What layers take what shapes? Is it NCL or NLC? And you're permuting and viewing and it just can get pretty messy. And so that brings me to number three. I very often prototype these layers and implementations in Jupyter notebooks and make sure that all the shapes work out, and I'm spending a lot of time basically babysitting the shapes and making sure everything is correct. And then once I'm satisfied with the functionality in a Jupyter notebook, I will take that code and copy paste it into my repository of actual code that I'm training with. And so then I'm working with VS Code on the side.
-
Unknown A
So I usually have a Jupyter notebook and VS Code. I develop in the Jupyter notebook, I paste into VS Code, and then I kick off experiments from the repo, of course, from the code repository. So that's roughly some notes on the.
-
Unknown B
Development process of working with neural nets.
-
Unknown A
Lastly, I think this lecture unlocks a lot of potential further lectures, because, number one, we have to convert our neural network to actually use these dilated causal convolutional layers, so implementing the ConvNet. Number two, potentially starting to get into what this means: what are residual connections and skip connections, and why are they useful? Number three, as I mentioned, we don't have any experimental harness, so right now I'm just guessing and checking everything. This is not representative of typical deep learning workflows. You have to set up your evaluation harness, you can kick off experiments, you have lots of arguments that your script can take, you're kicking off a lot of experimentation, you're looking at a lot of plots of training and validation losses, and you're looking at what is working and what is not working. And you're working on this at a population level and you're doing all these hyperparameter searches.
-
Unknown A
And so we've done none of that so far. So how to set that up and how to make it good, I think it's a whole other topic. And number four, we should probably cover recurrent neural networks, RNNs, LSTMs, GRUs and of course transformers. So many places to go, and we'll cover all of that in the future. For now, bye. Sorry, I forgot to say that if you are interested, I think it is kind of interesting to try to beat this number, 1.993, because I really haven't tried a lot of experimentation here and there's quite a bit of room potentially to still push this further. So I haven't tried any other ways of allocating these channels in this neural net. Maybe the number of dimensions for the embedding is all wrong. Maybe it's possible to actually take the original network with just one hidden layer and make it big enough and actually beat my fancy hierarchical network.
-
Unknown A
It's not obvious. That would be kind of embarrassing if this did not do better, even once you torture it a little bit. Maybe you can read the WaveNet paper and try to figure out how some of these layers work, and implement them yourself using what we have. And of course you can always tune some of the initialization or some of the optimization and see if you can improve it that way. So I'd be curious if people can come up with some ways to beat this. And yeah, that's it for now. Bye.