Module 1: Language elements

  • are programming languages real languages?

  • what is the basic syntax of languages like C++?

Programming languages

As we go about learning how to design algorithms and abstractions, we need a language to describe our work. In Statics we’d use the language of mathematics and the visual language of the free-body diagram. In computer programming, we will use a textual language to describe our designs, and this textual language will be converted by a tool called a compiler into the specific instructions that a CPU knows how to execute. Programming languages genuinely are languages, but they are synthetic languages (like mathematics) rather than natural languages.

Programming languages really are languages

All languages have (at least) two central things to worry about:

Syntax

The rules of what makes for valid ("well-formed") language. Natural languages have constructs like subjects ("I", "you", "they"), objects and verbs that can be combined into sentences. This sentence you are currently reading is a well-formed sentence. However, so is this one:

’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

That is a valid, well-formed sentence: you can identify subjects, objects, verbs, adjectives, etc., all of which follow rules such as word order and subject-verb agreement. Nonetheless, the sentence is nonsense. To make sense of language, we need more than syntax.

Semantics

The meaning of language. Semantics tell us what things mean, often including whether or not we agree that the words mean the same thing. "My dog is a pilot" is a syntactically valid sentence, but it is semantically rather meaningless. On the other hand, the phrase "the pilot is high" has more than one possible interpretation… with rather dramatic potential consequences!

This distinction between syntax and semantics is also important for programming languages. The compiler (the tool mentioned above that converts our textual programs into instructions for the computer to execute) will spot syntactic problems with our programs and give up with an error message. For example, if you tried to type the following text in the middle of the C++ code for your next lab:

this sentence is not valid C++!

the compiler would print an error message and stop — it is syntactically invalid. However, if we wrote a program to calculate the average value of some data, but we forgot to divide by the number of elements (i.e., we calculated $\sum X$ rather than $\frac{\sum X}{n}$), it could be syntactically well-formed but semantically incorrect: the program would look like a valid program, but it would output incorrect results.
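As a minimal sketch of that difference, here are two functions that average three numbers. The names averageWrong and averageRight are made up for this illustration. The compiler accepts both functions without complaint, but only the second one means what we intended.

```cpp
// Syntactically valid but semantically wrong:
// this returns the sum, not the average.
double averageWrong(double a, double b, double c)
{
    double sum = a + b + c;
    return sum;          // forgot to divide by the number of elements
}

// Syntactically valid and semantically correct.
double averageRight(double a, double b, double c)
{
    double sum = a + b + c;
    return sum / 3;      // divide by the number of elements
}
```

Calling averageWrong(2.0, 4.0, 6.0) gives 12, while averageRight(2.0, 4.0, 6.0) gives 4. The compiler cannot tell us which one we meant.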

Programming languages are not natural languages

Although programming languages are real languages, we did say they are not natural languages (which evolve naturally over time) — they are synthetic languages, designed by engineers and computer scientists. This has some implications:

Good news

They have a very small, closed vocabulary and very regular rules.

Bad news

They are fussier about grammar than an eighth-grade English teacher!

Our objective is to get you familiar enough with C++ to write meaningful sets of instructions for computers. This will require a dive into the rules of programming language syntax, so that we can figure out how to write well-formed C++ and express the semantics that we want the computer to understand. This is a two-step process:

  1. get the grammar right (syntax), then

  2. make it meaningful (semantics).

So without any further ado… to syntax!

C++ syntax

The syntax of the C++ programming language (or any language, for that matter!) can be viewed at different levels of abstraction. At a very low level, there are characters that come from an alphabet of acceptable characters. These characters are combined into tokens, a higher-level abstraction in programming languages. Once we’ve checked that a token is made of valid alphabet characters, we can start thinking about higher-level questions than "what characters are there": we can start to think about the different semantics of different kinds of tokens.

Tokens can be used to build expressions, and expressions can be incorporated into statements. When we talk about statements in a programming language, we are typically interested in answering questions like, "what does this statement do?", while paying less attention to the sequence of characters contained in the statement. This is another example of abstraction: participating in a conversation about abstract things (statements) while ignoring some of the concrete details (characters, rules about tokens, etc.).

Here is an example of a C++ statement that contains one expression and is constructed from tokens, which are themselves constructed from characters in the C++ alphabet:

[Figure: Constructing statements from expressions]
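As a rough sketch of the same idea (not the original figure), the comments below break one statement into its expression and its tokens. The function statementAnatomy and the variable names total, price and tax are invented for this illustration.

```cpp
// Returns the value that the statement in the middle stores in total.
int statementAnatomy(int price, int tax)
{
    int total = 0;

    // One statement, ending with a semicolon:
    //   statement:   total = price + tax;
    //   expression:  total = price + tax
    //   tokens:      total   =   price   +   tax   ;
    //   characters:  t o t a l = p r i c e + t a x ;
    total = price + tax;

    return total;
}
```

Each level of the breakdown lets us ignore the level beneath it: when we ask what the statement does, we no longer care about individual characters.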

The alphabet

Every language uses an alphabet to represent its words. Most European languages use the Latin alphabet, although there are some variations with accents (e.g., "crêpe", "Straße" or "římský"). Other languages use entirely different alphabets, such as the Cyrillic alphabet ("кириллицы"), the Hangul alphabet ("한글") or Devanagari script ("संस्कृतम्").

Most programming languages, including C++, use a Latin-derived alphabet that includes:

  • uppercase Roman letters (A-Z),

  • lowercase Roman letters (a-z),

  • the digits 0-9,

  • any combination of newline, tab and/or space characters (collectively known as whitespace) and

  • a specific set of symbols available on almost every keyboard: < > { } ( ) [ ] : ; , . ? / \ * & % + - ! ^ ~ # ' " = | _

Tokens

Natural languages combine characters from the alphabet to form words. The most basic unit of a programming language is called a token. Tokens can include simple words or operators (e.g., the less-than sign <). Crucially, tokens do not contain whitespace (the one exception is text enclosed in quotation marks, such as "hello world", which we will meet later as a literal). The following are all C++ tokens:

while window { x George +

There are five tokens in the following expressions, no matter how much whitespace we add:

(x+b)
     (
           x
           +
           b
     )
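To see why the amount of whitespace makes no difference, here is a minimal sketch of a token counter. It uses a deliberately simplified model and is not how a real compiler works: the function name countTokens is made up for this example, and it does not handle multi-character symbols (like ++) or quoted literals.

```cpp
#include <cctype>
#include <string>

// Counts tokens in a simplified model: a run of letters, digits and
// underscores is one token; any other non-whitespace character is a
// token all by itself. Whitespace only separates tokens.
int countTokens(const std::string& text)
{
    int count = 0;
    std::size_t i = 0;

    while (i < text.size()) {
        unsigned char c = text[i];

        if (std::isspace(c)) {
            ++i;                  // skip whitespace: it is never a token
        } else if (std::isalnum(c) || c == '_') {
            ++count;              // start of a word-like token
            while (i < text.size() &&
                   (std::isalnum(static_cast<unsigned char>(text[i])) ||
                    text[i] == '_')) {
                ++i;              // consume the rest of the word
            }
        } else {
            ++count;              // a single-character symbol
            ++i;
        }
    }

    return count;
}
```

Both countTokens("(x+b)") and countTokens("  (\n   x\n   +\n   b\n  )") give 5.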

There are effectively four kinds of tokens:

Identifiers

words that can be used as names for things (e.g., finalGrade)

Keywords

identifiers that have been reserved by the language for pre-defined names (e.g., int)

Symbols

special characters that provide structure or represent operations (e.g., +)

Literals

constant values written literally in the source code (e.g., 3.14159)

Identifiers

Identifiers are the names that we, as programmers, give to things. For example, if we want to create a variable called $x$, we can use an identifier x as a variable name. Identifiers must conform to the naming rules:

  • they can include alphabetic characters (a-z, A-Z)

  • they can include digits (0-9)

  • they can include the underscore character (_)

  • they may not start with a digit

  • they should not start with an underscore

Here are some legal identifiers:

x y i circle arg22 thisIsAVeryLongIdentifier
swap Swap PI DEFAULT_WINDOW _legalButDontUse

C++ is case sensitive, so swap and Swap are different names. It is also possible to choose illegal or poor identifiers:

1x
thisIsAnIncrediblyLongIdentifier1 thisIsAnIncrediblyLongIdentifier2
default-window

1x

starts with a digit — illegal

default-window

includes a - — illegal

thisIsAnIncrediblyLongIdentifier1 thisIsAnIncrediblyLongIdentifier2

legal but too long (in fact, some old compilers may ignore everything past character 31, making them indistinguishable)
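The naming rules above can be sketched as a small checking function. The name isLegalIdentifier is hypothetical, and this sketch only checks the character rules: it does not reject keywords such as int, and it does not warn about a leading underscore or an overly long name.

```cpp
#include <cctype>
#include <string>

// Returns true if name follows the character rules for identifiers:
// only letters, digits and underscores, and no digit in first place.
bool isLegalIdentifier(const std::string& name)
{
    if (name.empty()) {
        return false;                     // a name needs at least one character
    }

    if (std::isdigit(static_cast<unsigned char>(name[0]))) {
        return false;                     // may not start with a digit
    }

    for (char c : name) {
        bool isLetterOrDigit = std::isalnum(static_cast<unsigned char>(c)) != 0;
        if (!isLetterOrDigit && c != '_') {
            return false;                 // e.g., the - in default-window
        }
    }

    return true;
}
```

With this sketch, arg22 and _legalButDontUse are accepted, while 1x and default-window are rejected.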

Name Style Rules

To communicate clearly, it’s helpful if we agree on some common understandings and conventions in our writing. Organizations that do a lot of writing (e.g., newspapers) will make their writers conform to style guides that specify both grammatical rules (e.g., "at this newspaper, we abhor the Oxford comma") and conventions for word usage (e.g., when to call someone an "activist" vs an "instigator" vs a "criminal" vs a "terrorist"). This kind of discipline will also be helpful in our programming: if we use a common style, it will make it easier for me to read your code, for you to read each other’s code, etc. Style can vary from organization to organization, but within a single company, open-source project, etc., it’s helpful if people use the same style. Here are some style rules we will use for names in this course.

  • names for constants should be uppercase, as PI

  • use underscore to separate words in constant names, as DEFAULT_WINDOW

  • other names should start with a lowercase letter, e.g., circle or starshipEnterpriseVoyage (often referred to as camel case because of the capitals in the middle)
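Here is a short sketch of these style rules in use. The value given to DEFAULT_WINDOW and the function circleArea are invented for the example.

```cpp
// Constants: uppercase, with underscores between words.
const double PI = 3.14159;
const int DEFAULT_WINDOW = 10;    // hypothetical value, for illustration only

// Other names: start with a lowercase letter, camel case for later words.
double circleArea(double circleRadius)
{
    return PI * circleRadius * circleRadius;
}
```

A reader who knows the convention can tell at a glance that PI will never change, while circleRadius is an ordinary name.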

Keywords

Keywords are sometimes called reserved words as programmers may not use them as identifiers: they are reserved by the language to have special meanings. For example, you can call a variable currentTemperature, but you can’t call it int: that’s a keyword that C++ uses to indicate that something is an integer. Here's a list of the C++ keywords that we will see in this course:

and
bool
char
const
do
double
else
false
for
if
int
namespace
not
or
return
true
using
void
while
xor

That’s it!
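To give a feel for how a handful of these keywords fit together, here is a small sketch. You do not need to understand every detail yet, and the function name countEvenNotTriple is made up for this example. The keywords it uses are int, bool, while, if, and, not and return.

```cpp
// Counts how many of the numbers 1, 2, ..., limit are even
// and are not multiples of 3.
int countEvenNotTriple(int limit)
{
    int count = 0;
    int n = 1;

    while (n <= limit) {
        bool isEven = (n % 2 == 0);
        bool isTriple = (n % 3 == 0);

        if (isEven and not isTriple) {
            count = count + 1;
        }

        n = n + 1;
    }

    return count;
}
```

For a limit of 10, the numbers that qualify are 2, 4, 8 and 10, so the function gives 4.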

By the way, you may notice that keywords like these are often displayed in different colours. In a C++ file they are just words, but programs that we use to interact with C++ code often add syntax highlighting. Syntax highlighting adds colours to source code based on the rules of syntax, so keywords might have one colour, symbols another, etc. (and even different kinds of keywords might get different colours). The colour isn’t really part of the code: it’s a kind of false colouring that makes the code easier to read.

Symbols

C++ is rich in symbols, but just a few of them will get you most of the way to understanding most well-written C++ code. Some symbols are formed from a single character (e.g., +), but others are more complex (e.g., ++ or +=). Some symbols represent operations like addition (+), multiplication (*) or modular arithmetic (%, an idea we’ll come back to in a later lecture). Others help give structure to our code (e.g., { and //). Here are the symbols we will use in this course:

+ - * / % = == != < > <= >= { } ( ) [ ]
&& || ! ++ -- ; , . += -= *= /= << >> /* */ //
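Several of these symbols appear together in the following sketch, including the multi-character symbols <=, ++, != and +=. The function name sumOfOddNumbers is invented for the example.

```cpp
// Adds up the odd numbers from 1 to limit (inclusive).
int sumOfOddNumbers(int limit)
{
    int total = 0;

    for (int i = 1; i <= limit; ++i) {    // <= compares, ++ adds one to i
        if (i % 2 != 0) {                 // % gives the remainder after division
            total += i;                   // += adds i to total
        }
    }

    return total;
}
```

For a limit of 5 this adds 1 + 3 + 5, giving 9.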

Literals

The final kind of token is a literal, whose value is literally what it says in the text of the source code. We will see four kinds of literals in our programs starting out:

integer literals

standard integers such as 1, 2, -17 or 2056. There are also some special ones such as 0xff, the hexadecimal (base 16) representation of 255.

double literals

real number literals using decimal notation (3.0, 3.14159 or -17.65) or floating point (exponential) notation (1.0e11 or -23.6e-3).

character literals

single characters such as 'a', 'x', 'H', '!' or '\n'.

string literals

a sequence of characters such as "hello world!\n", "programming" or "The quick fox jumped over the lazy dogs."

Note that a single character is always enclosed in single quotes, while a sequence of characters (known as a string) is enclosed in double quotes.
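Here is a short sketch with one literal of each kind. The constant names are made up for this illustration, and the string example assumes the standard std::string type, which we have not formally introduced yet.

```cpp
#include <string>

const int HEX_EXAMPLE = 0xff;                    // integer literal, written in base 16
const double BIG_NUMBER = 1.0e11;                // double literal, exponential notation
const char LETTER = 'a';                         // character literal, single quotes
const std::string GREETING = "hello world!\n";   // string literal, double quotes
```

Note that \n counts as a single character (a newline), so GREETING contains 13 characters, not 14.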

So, that’s our four kinds of tokens: identifiers, keywords, symbols and literals. Now, on to expressions!

Expressions

Just like in mathematics, an expression is something that can be evaluated, i.e., reduced to a single value. Simple values like 4 or the variable x are expressions, since they already are a single value. So are slightly more complex expressions like 4 + 1 or x - 2. So are quite complex expressions like 2 * pow(x, 3) - x + 17/t (the meaning of which you do not need to understand today).

Many expressions in programming languages look a lot like mathematical (or arithmetic) expressions, using some of the same operators:

Symbol   Meaning
+        addition, e.g., x + 2
-        subtraction, e.g., x - 2
*        multiplication, e.g., 2 * x (vs the mathematical $2 \times x$ or $2x$)
/        division, e.g., x / 2 (vs the mathematical $x \div 2$ or $\frac{x}{2}$)
()       evaluate the expression in parentheses first

As in mathematical expressions, the order of operations is significant. We will define this more precisely in a future lecture, but for now, multiplication and division happen before addition or subtraction, and expressions in parentheses are evaluated first.
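A quick sketch of the order of operations, using two made-up functions that differ only in their parentheses:

```cpp
// Multiplication happens before addition: 2 + (3 * 4).
int withoutParentheses()
{
    return 2 + 3 * 4;
}

// Parentheses are evaluated first: (2 + 3) * 4.
int withParentheses()
{
    return (2 + 3) * 4;
}
```

The first function gives 14 and the second gives 20, exactly as the same expressions would in mathematics.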

Statements

The equivalent of a sentence is a statement. A statement effectively tells the computer to do something. Statements are always terminated in C++ by a semicolon, e.g.:

x = y + z - 4;
cout << "Hello world!\n";

In this example, there are two statements, both of which contain some tokens. The whitespace is not significant. In fact, it’s not even necessary to stay on a single line:

x = 24 * gorganzola
    - 14 * emmenthaler
    + 17 * balderson;
cout  <<
	"The quick brown fox jumped over the lazy dogs\n";

are both legal.

Blocks

The C++ equivalent of a paragraph is a block: a set of statements enclosed in a pair of curly braces ({ and }). We can turn the above statements into a block as follows:

{
   x = y + z - 4;
   cout << "Hello world!\n";
}

It is considered good style to indent statements inside a block.

Comments

Comments are for people: the compiler ignores them, but other programmers can read them. This makes them a major mechanism for documenting programs. They come in two flavours:

// A single line comment. Ends at the end of the line
/*
 * Comments enclosed in slash-asterisk ... asterisk-slash can
 * extend over several lines
 */
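Both flavours are shown at work in the sketch below, on a function invented for this example (toFahrenheit). Since the compiler ignores comments, deleting every one of them would not change what the function does.

```cpp
/*
 * Converts a temperature from degrees Celsius
 * to degrees Fahrenheit.
 */
double toFahrenheit(double celsius)
{
    // return 0.0;   (code inside a comment is ignored too)
    return celsius * 9.0 / 5.0 + 32.0;   // scale first, then shift by 32
}
```

For example, toFahrenheit(100.0) gives 212.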
License: CC BY-NC-SA

(c) 2009–2018 Michael Bruce-Lockhart, Theo Norvell, Dennis Peters and Jonathan Anderson. Licensed under a Creative Commons Attribution–Noncommercial–Share-Alike 2.5 Canada License. Permissions beyond the scope of this license may be available at theteachingmachine.org.