CS50 x 2024 Notes Arrays - 04

⑴

We introduce now a few lower-level features of C itself and better understand how we can start solving some of those problems like the readability of text or the encryption of data.

These were our so-called types last week when we introduced at least a subset of them or used them just to store data in a certain format, so to speak. Like in week 0, we said that everything at the end of the day is just 0s and 1s, binary. And I claimed conceptually that how a computer knows if a set of bits is a number versus a letter versus a color or a sound or an image or a video is just context-dependent, like you're using Photoshop or you're using Microsoft Word or something else. But last week, we saw a little more precisely that it's not quite as broad strokes as that. It's more about what the programmer has told the software is being stored in a given variable. Is it an integer? Is it a char, a character? Is it a whole string? Is it a longer integer or the like? So you now have this control.

The catch, though, recall, is that each of these types has only a finite amount of space allocated to it. So for instance, an integer is typically 4 bytes, and 4 bytes is 32 bits because it's 8 times 4. With 32 bits, we claimed, you can represent roughly 4 billion values, but if you want to represent negative and positive numbers, the biggest integer you can store is like 2 billion. Now that's really big for a lot of applications, but years ago, Facebook, for instance, was rumored to be using integers when they had fewer users. But now that they have billions of users, 3-plus billion users, an integer is no longer big enough for the Facebooks, the Googles, the Microsofts and so forth of the world. So we also have longs, which use twice as many bytes, but have an exponentially bigger range of values.

Meanwhile, a bool, interestingly, is a byte, which is kind of bad design in what sense? Why might that be bad design? It should only need 1 bit, because a 0 or 1 should suffice. Turns out, it's just easier to use a whole byte, even though we're wasting seven of those bits, so bools are represented nonetheless with 1 byte. Floats tend to be 4 bytes. Doubles tend to be 8 bytes. Some of this is system-dependent, but nowadays on modern computers this tends to be a useful rule of thumb. The only one I can't commit to here is a string, because a string, recall, is a sequence of text. And maybe it has no characters, one character, two, 10, 100. So it's a variable number of bytes presumably where each byte represents a given character.

⑵

So with that said, how do we get from an actual computer to information being represented therein?

Well, let me remind us that this is what's inside of our Macs, PCs, phones. Even though this isn't to scale and it might not be the same shape, this is memory, random access memory. And on these black chips on the circuit board here are the bytes that we keep talking about.

In fact, let's go ahead and zoom in on one of these chips to fill the screen here. And just for an artist's depiction's sake, let me propose that if you've got, I don't know, a megabyte, a gigabyte, like a lot of bytes packed into this chip nowadays, it stands to reason that no matter how many of them you have, we could just number them from top to bottom.

And we could say that this is byte 1, or you know what? This is byte 0, 1, 2, 3 and this is maybe byte 1 billion or whatever it is. So you can think of memory as having addresses or just locations, numeric indices that identify each of those bytes individually. Why a byte? Individual bits are not that useful, so 8 bits, again, 1 byte, tends to be the de facto standard.

So, for instance, if you're storing just a single character, a char, it might be stored literally in this top-left corner, so to speak, of the chip of memory.

If you're storing maybe an integer, 4 bytes, it might take up that many bytes.

If you're storing a long, it might take up that many bytes instead. Now we don't have to dwell on the particulars of the circuit board and these traces and all the connections.

So let me just abstract this away and claim that what your computer's memory really is is just kind of this canvas, I mean kind of in the Photoshop sense. If you've ever made pictures, it's just a grid of pixels, up, down, left, right; that's really all your memory is. It's this canvas that you can manipulate the bits on to store numbers anywhere you want in the computer's memory. So in fact, let's consider how your computer is actually storing information using just these bytes. And at the end of the day, no matter how sophisticated your Mac, your PC, your phone is, this is all it has access to for storing information. It's a canvas of bytes, and what you do with this now really invites design decisions.

So let's consider this. Here is an excerpt from a program wherein maybe I'm prompting the user for three scores, like three test scores, exam scores, something like that. So we can certainly whip up some code like this.

And in just a moment, let me go ahead and flip over to VS Code here. And I'll write up a new program called scores.c. And in this, let me go ahead and first include stdio.h, int main void at the top. And in here, let me go ahead and assume that it's not been the greatest semester. So my first score, which I'll call score1, was a 72, my second score, score2, was a 73, but my third score, score3, was like a 33. Now you might remember these numbers in another context, they might spell a message, but in this case, it's just integers. It's just numbers because I'm telling the computer to treat these as ints. Now if I want to figure out what my average is, I can do a bit of math.

So let me just print out that my average is, and I don't want to shortchange myself. I'm not going to use %i, because I don't want to lose anything after the decimal point. So we're going to use %f, for a float, instead. And my average, I claim, will be score1 plus score2 plus score3 divided by 3, semicolon. With parentheses, because just like grade school math, like order of operations, I parenthesize the numerator so I can divide the whole thing by 3. But I have screwed up already. I am going to shortchange myself and not give myself as high a grade as I deserve, but this one's subtle. What have I done wrong? Yeah, I might want to cast these scores to floats because if you do integer math, dividing integers by an integer, it's going to be an integer as the result, so it's going to throw away anything after the decimal point. Even if it's something-point-1, something-point-5, something-point-9, that fraction is going to be thrown away. There's a bunch of ways to fix this. I could just use floats or doubles for all of these. I could cast score1, score2 or score3 as you propose.

Frankly, the simplest way is just to change the denominator to 3.0, because so long as I've got one float involved in the math, this will promote the whole arithmetic expression to being floating point math instead of integer math.

So let me go ahead now and do make scores, Enter, so far, so good, ./scores, and my average seems to be not great, but 59.333333. But I would have lost that third if I hadn't used a float in this particular way. Well, let's consider now what's actually going on inside of the computer when I store these three variables.

So back to the grid here, just my canvas of memory. It doesn't really matter where things end up. I might put it here, I might put it here, the computer makes these decisions.

But for the artist's sake, I'm going to put it at the top left-hand corner. So score1 is containing the integer 72. Why is it taking up four squares, though? Because it's an integer. And on this system, an integer is 4 bytes.

So I've drawn it to scale, if you will, score2 is the number 73, it also takes 4 bytes. By coincidence, but also by convention, it will likely end up next to the first integer in memory because I've only got three variables going on anyway, so the computer quite likely will store them back to back to back.

And indeed, by that logic, score3, containing the number 33, is going to fill in this space here. We'll consider down the road what happens if things get fragmented (something's here, something's here, something's here), but for now, we can assume that this is probably contiguous, though not necessarily so.

All right, so that's pretty straightforward, but what's really going on? Well, these are just bytes of memory, that is, groups of 8 bits. And so what's really going on is this pattern of 0s and 1s is being stored to represent 72. This pattern of 0s and 1s is being stored to represent 73, and similarly, 33. But that's very low-level detail that we don't really care about, so we'll generally just think about these as numbers like 72, 73, 33. So if we go back to the actual code here, I wonder if this is the best idea. These three lines of code are correct. I got my 59 and 1/3 for my average, which I claim is correct.

But code-wise, this should maybe rub you the wrong way. Even if you hadn't programmed before CS50, why might this not be the best approach to storing things like scores in a program? How might this get us in trouble? It's not the best because you have to use a whole bunch of different variables, one for each score. They're almost identically named, though, but just imagine, as in almost any question involving the design of your code, what happens as n, the number of things involved, gets larger? Am I really going to start writing code that has score4, score5, score6, score10, score20? I mean, your code is just going to look like this mess of mostly copy-paste except that the number at the end of the variable is changing. Like that should make you cringe a little bit because it's not going to end well eventually. And typographical errors are going to get in the way most likely because we'll make mistakes. So how can we do a little bit better than that? Well, let me propose that we introduce what we're going to now call an array.

An array is a sequence of values back to back to back in memory. So an array is just a chunk of memory storing values back to back to back. So no gaps, no fragmentation. From left to right, top to bottom, just as I already drew. But these arrays in C, at least, are going to give us a slightly new syntax that addresses exactly your concern.

So here instead is, I would propose, how you define one variable (not three, one variable) called scores, plural, each of whose values is going to be an int, and you want three integers tucked away in that variable. So now I can pluralize the name of my variable because by using square brackets and the number 3, I'm telling the compiler, give me enough room for not one, not two, but three integers in total. And the computer is going to do me a favor by storing them back to back to back in the computer's memory.

Now assigning values to these variables is almost the same, but the syntax looks like this. To assign the first value, I do scores, bracket, 0 equals 72, scores, bracket, 1 equals 73, scores, bracket, 2 equals 33. And it's square brackets consistently. And notice, this is a feature, or a downside, of C: we very frequently use the same syntax for slightly different ideas.

This first line tells the computer, give me an array of size 3.

These next three lines mean, go into this array at location 0 and put this value there; location 1, put this value there; location 2, put this value there. So same syntax, but different meaning depending on the context here. But the equal sign indeed means that this is assignment from right to left just like last week. So what does this mean in the computer's memory? Well, in this case here, we now have a slightly different way of doing this.

Let me go back to VS Code here, and let me propose that instead of having these three separate variables,

let me give myself an int scores variable of size 3, and then do scores, bracket, 0 equals 72, scores, bracket, 1 equals 73, scores, bracket, 2 equals 33.

And now I have to change this syntax slightly, but same idea. So a couple of key details: I started counting at 0. Why? That's just the way it is with arrays. You must start counting at 0 unless you want to waste one of those spaces. And what you definitely don't want to do is go into scores, bracket, 3 because I only asked the computer for three integers.

If I blindly do something like this, you're going too far. You're going beyond the end of the chunk of memory and bad things will often happen. So we won't do that just yet.

But for now, 0, 1 and 2 are the first, second and third locations. So if I recompile this code, make scores seems OK, ./scores, and I get the exact same answer there. But let me make it more dynamic because this is a little stupid that I'm compiling a program with my scores hardcoded. What if I have a fourth exam tomorrow or something like that?

So let's make it more dynamic and I think the syntax will start to make a little more sense. Let's go ahead and use get_int and ask the user for a score. Let's go ahead and get_int and ask the user for another score. Let's go ahead and get_int and ask the user for a third score, now storing the return values in each of those variables.

If I now do make scores, oh, darn it, a mistake, similar to one I've made before, but we didn't see the error message last time. What'd I do wrong? So I'm missing the cs50 header file. So how do you know that?

Well, implicit declaration of function get_int. So it just doesn't know what get_int is. Well, who does know what get_int is? The CS50 Library, that should be your first instinct.

All right, let me go to the top here and squeeze in the CS50 Library like this. Now let me clear my terminal, make scores again. We're back in business. And notice, I don't need to do -lcs50 anymore; make is doing that for me for Clang. We don't even see Clang being executed, but it is being executed underneath the hood, so to speak. All right, so ./scores, here we go, 72, 73, 33. Math is still the same, but now the program is more interactive.

Now this, too, hopefully should rub you the wrong way. This is correct, I would claim, but bad design still. Because I'm literally doing the same thing again and again.

And notice, this number is just changing slightly. I would think that a little plus-plus could help here. And get_int score, get_int score, get_int score, that's the exact same thing. So a loop is a perfect solution here.

So let me go over into this code here, and I can still for now declare it to be of size 3, but I think I could do something like this: for int i gets 0, i is less than 3, i++. Inside of the loop now, I can do scores, bracket, i, and now arrays are getting really interesting because you can use and reuse them, but dynamically go to a specific location. Equals get_int, quote-unquote, "Score: ". Now I can type that phrase just once and this loop ultimately will do the same thing, but it's getting better. The code is getting better designed because it's more compact and I'm not repeating myself.

The code here still works. Now, there's one design flaw here that I still don't love; it's a little more subtle. So instead of dividing by 3.0, maybe I should divide by the array size, which at the moment is technically still 3, but I do concur that that is worrisome because they could get out of sync. But there's something else that still isn't quite right. So it does feel like there should be a better solution here. But let me also identify one other issue I really don't like, and this is, indeed, subtle.

I got 3 here,

I've got 3 here,

and I essentially have 3 here, albeit a floating point version. This is just ripe for me making a mistake eventually and changing one of those values, but not the other two. So how might I fix this?

I might at least do something like this: I could say integer, maybe n for scores, and I'll set that equal to 3. I could then use n here, I could use n here, I could use n here, but that's a step backwards, because I don't want an int because I'm going to run into the same math issue as before,

but I could convert it, that is, cast it to a float, and we did that briefly last week. But there's one other thing I could do here that we did introduce last week. This is better because I don't have a magic number floating around in multiple places.

Yeah, if I really want to be proper, I should probably say this should be a constant integer. Why? Because I don't want to accidentally change it myself. I don't want to be collaborating with a colleague and they foolishly change it on me. This just sends a stronger signal to the compiler, do not let the humans change this value. And now just to point out one other feature of C.

Notice, for a number like this, like the number 3, I've deliberately capitalized this variable name, really for the first time. Any time you have a constant, it tends to be a convention to capitalize it just to draw your attention to it. It doesn't mean anything technically; capitalizing a variable does nothing to it, but it draws attention visually to it for the human. So if you declare something as a constant, it's commonplace to capitalize it. Moreover, if you have a constant that you might want to occasionally modify, maybe next semester when there's four exams or five exams instead of three, it actually is OK

sometimes to define what might be called a global variable, a variable that is not inside of curly braces. It's literally at the top of the file outside of main, and despite what I said about scope last week, a global variable like this on line 4 will be in scope to every function in this file. So it's actually a way of sharing a variable across multiple functions, which is generally fine if you're using a constant.

If you intend to change it, there's probably a better way than actually using a global variable, but this is just in contrast to what I previously did, which I would call, by contrast, a local variable. But again, I'm just trying to reduce the probability of making mistakes somewhere in the code.

And I do agree, I don't like that I'm still adding all of these scores manually even though clearly I had a loop a moment ago. But for now, let's at least consider what's been going on inside of the computer's memory.

So with this array, I now have not three variables, score1, score2, score3. I have one variable, an array variable, called scores, plural. And if I want to access the first element, it's scores, bracket, 0. If I want to access the second element, it's scores, bracket, 1. If I want to access the third element, it's scores, bracket, 2. If I were to make a mistake and do scores, bracket, 3, which is the fourth element, I'd end up in no man's land here and worst case, your program could crash or something weird will happen, spinning beach balls, those kinds of things. Just don't make those mistakes. And C makes it easy to make those mistakes, so the onus is really on you programmatically. Is there any way to create an array just by using syntax alone without prompting the human for it?

If you want to have an array of integers called, for instance, array, you could actually do like 13, 42, 50, something like this. This would give you an array of size 3 where the three values by default are 13, 42 and 50. It's not syntax we'll use for now, but there is syntax like that. It's not quite as user-friendly, though, as other languages, if you've indeed programmed before. Oh, is there a way to calculate the length of an array? Short answer, no, and I'm about to show you one demonstration of this. Those of you who have programmed before in Java, in JavaScript, in certain other languages, it's very easy to get the length of an array. You essentially just ask the array, what's its length? C does not give you that capability. The onus is entirely on you and me to remember, as with another variable, like n, how long the array is.