56:219:522 data processing (data management for data science)
cross listed with 56:824:718 data management and 56:834:650 special problems
sp25 tue 6-8.50 bsb336
https://theaok.github.io/datManPy most current syllabus (updated continuously)(stata version of this class)

datman@googlegroups.com listserv (everyone in class gets these messages; if you didn't get the welcome email or can't post to the group, email me at adam.okulicz.kozaryn@gmail.com and I will add you!)

prerequisites

You need to be comfortable using a computer. Knowledge of Python (R, Stata, etc) and data management/computer science is helpful but not necessary. We will cover the basics. Note to social science/humanities students: this class is mostly coding/programming/scripting. If you do not like programming, this class is not for you. But you may not yet know whether you like it, and you may start liking it in this class: it has often happened before!

course description and learning objectives/outcomes

  • data (sources, best practices, tips and tricks): this class is as much about Python as about data (you'll use the data you choose, which will serve you well beyond this class!!)
  • hands-on/applied Python/Pandas (there will be some theory, too)
  • You will learn how to manage your data: clean, organize, manipulate and automate: eg: data types, text/math functions, recoding, documentation, merging, reshaping, loops, if/else.
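A minimal sketch of a few of the tasks listed above (data types, a loop, if/else recoding, reshaping). The data and all column names are hypothetical and made up for illustration; they do not come from the class materials.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration only.
df = pd.DataFrame({
    "county": ["A", "B", "C"],
    "pop_2020": ["1000", "2500", "400"],   # numbers stored as text
    "pop_2021": ["1100", "2400", "450"],
})

# Data types + loop: convert the text columns to integers.
for col in ["pop_2020", "pop_2021"]:
    df[col] = df[col].astype(int)

# Recoding with if/else: label each county by its 2021 population.
def size_label(pop):
    if pop >= 1000:
        return "large"
    else:
        return "small"

df["size_cat"] = df["pop_2021"].apply(size_label)

# Reshaping: wide (one column per year) to long (one row per county-year).
long = df.melt(
    id_vars=["county", "size_cat"],
    value_vars=["pop_2020", "pop_2021"],
    var_name="year",
    value_name="pop",
)
long["year"] = long["year"].str.replace("pop_", "").astype(int)

print(long)
```

The long format has one row per county and year, which is usually the easier shape for grouping and plotting.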


    Use AI to write code, but do edit it! Reduce bloat; improve it, play with it, and make sure you understand it!

    required textbooks and materials

    No required textbooks. All required materials (code, readings) will be provided.

    software

    Python >=3.10 (python.org). Free to download for Linux, Windows, and Mac. We will use several libraries, mostly Pandas.
    BUT no need to download or install anything: we will run Python online (in a web browser, in the cloud) using so-called "colab" (2 sections down). But first GitHub.
    GitHub
    We will use GitHub to store Python code (.py) in the form of a notebook (.ipynb), and we will edit (and run) the notebook in colab (next sec).
  • sign up or login at github.com
  • (depending on os/browser) on top left hit "New" or "Create repository" or top right under plus "+" select "New repository"
  • pick some repository name, say "datman" ; keep selected 'Public'; important!: under "Initialize this repository with" check "Add a README file"; and hit at the bottom "Create repository"
  • then hit "Settings" towards the middle-top right; on the left select "Collaborators" tab and hit "Add people" : "theaok", and hit "Add theaok to this repository"
    workflow: my comments, diffs, inline response [lets go over this next week again]
  • i will run it in my Colab, edit, and upload back
  • diff and response to my comments: cleaner and better in colab: File-Revision history; or, more clunky, in GitHub: click my commit message to see the so-called diff, the difference between your version and my version. important! do make sure to fix it up for the next ps; you may even include inline responses to my comments in your next ps (especially if something is complex or if you disagree)
  • don't forget a meaningful commit message; you can keep uploading newer versions as many times as you like
  • note: when you click the file, you can then click 'History' and see how the file evolved over time :)
  • file naming: ps1.ipynb, ps2.ipynb, etc, or ps1, ps2, etc sections in one file; or just one file that you keep updating throughout with new stuff as we go!
    colab
    Just run the Py notebook in Colab and save subsequent versions in GitHub, which will keep track of changes [stick with this for the ps]
  • go to https://github.com/theaok/datManPy/blob/main/pandas.ipynb and hit 'open in colab' OR go to colab.research.google.com and on popup pick GitHub, search for:
    https://github.com/theaok/datManPy/blob/main/pandas.ipynb
    (it should find it and load it into colab, and follow instructions at the top of the file, ie save it in your GitHub etc)
  • best projects:
  • https://colab.research.google.com/github/ewattudo/datamanagement/blob/main/PS5.ipynb
  • https://colab.research.google.com/github/Jonchyk/Datamgmt/blob/main/PS5_Vis_Grouping.ipynb
    data

    The class is a bit like an independent study: you will carry out some very basic research. You do need your own data for this class ASAP: the more data and the more complex, the better. The software will need to load the data straight from online. Some data are easily downloadable, eg https://gss.norc.org/get-the-data/stata. But many are not. Then you have to put the data online yourself [just go over Git<25mb]: https://theaok.github.io/generic/howToPutDataOnline.html
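A minimal sketch of loading data straight from online, assuming a CSV file. `pd.read_csv` accepts a URL as well as a local path or file-like object; the URL in the comment is a hypothetical placeholder, and the helper name `load_csv` is made up here. An in-memory stand-in replaces the remote file so the sketch runs without a network connection. For Stata files, pandas has `pd.read_stata`.

```python
import io
import pandas as pd

def load_csv(source):
    """Read a CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)

# In a notebook you would pass the URL of your own data, for example
# (hypothetical address, replace USER/REPO/data.csv with your own):
# df = load_csv("https://raw.githubusercontent.com/USER/REPO/main/data.csv")

# Offline stand-in for the remote file so the sketch runs anywhere.
fake_remote = io.StringIO("state,income\nNJ,85\nPA,68\n")
df = load_csv(fake_remote)

print(df.head())
```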

    https://www.libraries.rutgers.edu/subject-librarians?keyword=&division=All&unit=All&specialization=351
    icpsr: biggest repository of survey data; check out also var search
    google is great for data search; and it has data search, too
    google cloud/big query has data, too
    kdnuggets listing of sources; kdnuggets great in general for data science; maybe start here, easier to wrap your head around
    another kdnuggets listing
    yet another one: maybe esp FiveThirtyEight and Reddit
    kaggle

    NOAA
    NASA

    datasets on GitHub
    datahub
    pew

    grading

    2 keys to success: start early AND ask many questions, often (and form study groups: get a couple of people on zoom, screenshare notebooks, etc). This is a software class. It is different from typical soc sci classes! You will get stuck often, and whenever you are stuck, email the listserv, ask me, ask your classmates, as opposed to pulling your hair out! And stop by my office, too. Googling (and built-in Gemini) solves most problems, but for many things it's better to talk to me and your classmates; it is also more social/human: talking to a computer all the time is not healthy.
  • 100% (5ps x 20%) problem sets [just Py notebook]; you may cowrite code (up to 2 people), but then the project should be 2 times better than a single-authored one
  • bonus/extra up to 5% engagement, class participation eg answering/asking questions, helping others, listserv discussions
  • bonus/extra up to 5% civic engagement (see bottom of the syllabus)

    calendar

    [*] = bonus (extra/not required)

    sp25: i have family emergency, probably after jan28 2 classes on zoom

    jan21 intro vid old vid

  • ps0.pdf
  • pandas.ipynb
  • see some vids, can see screen with good resolution for coding steps:)
  • intro.pdf
  • replication.pdf
    !!zoom only!! jan28 I/O (Input/Output) and basic descriptive statistics vid old vid

  • ps1.pdf
  • find_data.pdf
  • data.pdf
  • pandas.ipynb
    !!zoom only!! feb4 manipulate data vid old vid

  • ps2.pdf
  • note: added #3 to ps1: 'do some manipulations such as subset/slice on condition, filter vars or obs using regexp, and groupby/agg'
  • lets start with diffs in colab: File-Revision History: uncheck show output: https://colab.research.google.com/github/worldterminator/worldterminator/blob/main/ps0.ipynb and https://colab.research.google.com/github/nhs47/DatPro/blob/main/ps0_Nabiha.ipynb
  • early/bonus/volunteer present/go through ps1 esp des sta and interpretations
  • manipulate.pdf
  • pandas.ipynb manipulate; and dive into merge (1st basic example)
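A minimal sketch of the manipulations named in the ps1 note above (subset on a condition, filter variables or observations with a regular expression, groupby/agg). The data and column names are hypothetical, made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; names are assumptions for illustration only.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "inc_2020": [50, 60, 40, 45],
    "inc_2021": [52, 63, 41, 47],
    "age": [30, 45, 27, 52],
})

# Subset/slice on a condition: keep observations older than 28.
older = df[df["age"] > 28]

# Filter variables with a regular expression: columns starting with "inc_".
inc_only = df.filter(regex=r"^inc_")

# Filter observations with a regular expression: regions starting with "n".
north = df[df["region"].str.contains(r"^n")]

# groupby/agg: mean 2021 income and number of rows per region.
summary = df.groupby("region").agg(
    mean_inc=("inc_2021", "mean"),
    n=("inc_2021", "size"),
)

print(summary)
```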
    feb11 merge vid old vid

  • pandas.ipynb: merge
  • flip a class work on ps2: (I walk around and sit with each of you; Q and A; otherwise I look at your colabs, and then approach you with ideas)
  • if time do real world examples from next class
    feb18 real world examples and plotly vid old vid old vid

  • ps3.pdf
  • Q and A and go over your ps1 and ideally ps2
  • chetan, diff in colab, File-Revision history, uncheck Show output
  • real world data management (eg mapping/recoding urbanicity) example (covid city paper)
  • datasets of the week: usda ers; irs soi county-to-county; nj ag use of force data
  • real world merge example: Eric; and another example, Xiao
  • https://colab.research.google.com/github/theaok/vis/blob/main/plotly.ipynb merge is typically a necessary initial step, but usually the final step is to explore the new relationships
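A minimal sketch of the point above: merge first, investigate what matched, then explore the new relationship. The two data frames and their values are hypothetical. `indicator=True` is a real option of `DataFrame.merge` that adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical toy data; names and values are assumptions for illustration.
left = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "happiness": [7.1, 6.4, 6.9, 7.3, 6.0],
})
right = pd.DataFrame({
    "county": ["B", "C", "D", "E", "F"],
    "income": [55, 48, 61, 42, 50],
})

# Outer merge keeps unmatched rows from both sides so they can be inspected.
merged = left.merge(right, on="county", how="outer", indicator=True)

# Investigate the merge: how many rows matched, how many came from one side only?
counts = merged["_merge"].value_counts()
print(counts)

# Explore the new relationship using matched rows only.
both = merged[merged["_merge"] == "both"]
corr = both["happiness"].corr(both["income"])
print(corr)
```

Rows marked `left_only` or `right_only` are the ones to look at first: they often point to spelling differences or missing units in one of the sources.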

  • [*] pandas fancy stuff and other fancy stuff (also focus on your projects, discuss, brainstorm, flip the class)

    feb25 wrap up pandas and pandas extra topics (slow down and flexible: you choose what to learn) vid old vid old vid old vid

  • ps4.pdf
  • go over ps2, go over merge investigation/interpretation again!, QandA on merge, flip the class work on ps3
  • wrap up all of pandas
  • extra topics
    mar4 ps3 presentations; and profiler, imputations, fuzzywuzzy vid old vid

  • present ps3: 10min sharp + 10min discussion; focus on interesting stuff like research question, data, variables, relationships: descriptive stats and visualizations; skip boring stuff like subsetting and renaming
  • profiler, imputations, fuzzywuzzy
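A minimal sketch of two of the topics above, imputation and fuzzy matching. The data are hypothetical. For fuzzy matching the sketch swaps in `difflib` from the Python standard library instead of fuzzywuzzy, so it runs without extra installs; the idea (match a misspelled name to the closest known name) is the same, but the scoring is not identical.

```python
import difflib
import pandas as pd

# Hypothetical toy data; names and values are assumptions for illustration.
df = pd.DataFrame({
    "city": ["Camden", "Camdn", "Newark", "Trenton"],
    "income": [40.0, None, 55.0, 50.0],
})

# Imputation: flag what was missing, then fill with the median.
df["income_missing"] = df["income"].isna()
df["income_imp"] = df["income"].fillna(df["income"].median())

# Fuzzy matching (difflib stand-in): map each name to the closest known name.
canonical = ["Camden", "Newark", "Trenton"]

def closest(name):
    hits = difflib.get_close_matches(name, canonical, n=1, cutoff=0.8)
    return hits[0] if hits else name  # keep the original if nothing is close

df["city_clean"] = df["city"].apply(closest)

print(df)
```

Keeping the `income_missing` flag makes it possible to check later whether results change when imputed rows are dropped.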
    mar11 theory and flip the class work on ps4 vid old vid

  • theory.pdf
  • sai and chetan present (also see their ind stu, how different they are?); and go over listserv ps3 comments
  • flip the class work on ps4: slow down, focus on your projects, redo/improve/polish
    mar18 no class sp break

    mar25 ps4 presentations vid old vid

  • revisit theory from last class
  • presentation: no need for slides, just the notebook: 15min sharp + 15min discussion/q and a
    apr1 theory vid old vid

  • ps5.pdf
  • factor analysis
  • theory.pdf: 2nd sec: CS stuff
  • chetan presentation

    apr8 review, q and a vid

  • Srija presentation
  • go through code from earlier, focus on what's underused, needs more elaboration etc, like: missing obs/duplicates (eg profiler), groupby agg, recode/map, merge, imputations, apis/fred
  • check out my python notebooks for research; you should use the data you produced in this class to write a paper: publish or perish:
  • pop growth and happiness pdf and colab
  • covid and happiness pdf and colab
  • flip much of the class work on ps5/final project
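A minimal sketch of three of the review topics listed above: missing obs, duplicates, and recode/map. The data and the numeric urbanicity coding are hypothetical, made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; names and values are assumptions for illustration.
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "urban": ["city", "rural", "rural", "suburb", None],
    "score": [7, 5, 5, 6, 8],
})

# Missing obs: count missing values per variable.
n_missing = df.isna().sum()

# Duplicates: count fully duplicated rows, then drop them.
n_dupes = df.duplicated().sum()
dedup = df.drop_duplicates().copy()

# Recode/map: turn text categories into an ordered numeric code.
# Categories not in the dictionary (including missing) become NaN.
urban_map = {"city": 3, "suburb": 2, "rural": 1}
dedup["urban_num"] = dedup["urban"].map(urban_map)

print(n_missing)
print(n_dupes)
print(dedup)
```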
    apr15 vid

  • shell and AI ideas
  • Eric joins via livestream https://rcit.rutgers.edu/av-request/live/08699-1-2025 and via zoom https://pwa.zoom.us/wc?mn=8892839953&pwd=dFhiTE1BZVlnMXdWSWN6d3N3MXI0QT09
  • if time: i will pull up your latest and go over it
    apr22 wrap up, summarize

    vid
  • wrap.pdf
  • ad http://theaok.github.io/swb
  • revisit theory
  • Srija present sentiment analysis
  • pull up instructive chunks of code from: ???
  • i fork couple best repos as example for future classes
    apr29 last class ps5/final presentations !!zoom only!! https://rutgers.zoom.us/j/8892839953?pwd=dFhiTE1BZVlnMXdWSWN6d3N3MXI0QT09

    15min sharp + 15min discussion/q and a
    just to be safe, delete the data you have posted online, you never know: someone may be picky about it

    rules

    do not share or link to class videos! These videocasts and podcasts are the exclusive copyrighted property of Rutgers University and the Professor teaching the course. Rutgers University and the Professor grant you a license only to replay them for your own personal use during the course. Sharing them with others (including other students), reproducing, distributing, or posting any part of them elsewhere -- including but not limited to any internet site -- will be treated as a copyright violation and an offense against the honesty provisions of the Code of Student Conduct. Furthermore, for Law Students, this will be reported by the Law School to the licensing authorities in any jurisdiction in which you may apply to the bar.

    attendance: Attendance is required: if you cannot attend and do not have a documented emergency, you will lose participation credit; either way, do let me know ahead of time; we can put you on zoom so you can participate.

    Be advised that you are responsible for any material covered in the class, whether or not it was in the readings or lecture notes. You are also responsible for any announcements made in class. For most students, attendance is simply essential to learning the material. If you do need to miss a class, be sure to consult with a fellow student to learn what transpired.

    incompletes: Generally speaking, the material in this course is best learned as a single unit. I will grant incompletes only in cases where a substantial change in life circumstances occurs that is beyond the control of the student, and only with appropriate documentation.

    study groups. You are encouraged to form a regular study group. Many students over the years have found the study groups to be very helpful. Study groups are permitted and encouraged to work on the problem sets together. However, each individual student should write up his or her own answer to hand in, based on his or her own understanding of the material. Do not hand in a copy of another person’s problem set, even a member of your own group. Writing up your own answer helps you to internalize the group discussions and is a crucial step in the learning process.

    Academic Integrity. I am very serious about this. Make no mistake--I may appear accommodating and informal--but I am extremely strict about academic integrity. Violations of academic integrity include cheating on tests or handing in assignments that do not reflect your own work and/or the work of a study group in which you actively participated. Handing in your own work that was performed not for this class (e.g. other class, any other project) is cheating, too. I have a policy of zero tolerance for cheating. Violations will be referred to the appropriate university authorities.

    For more information see http://fas.camden.rutgers.edu/student-experience/academic-integrity-policy

    Accommodating Students with Disabilities. Any student with a disability affecting performance in the class should contact the disability office ASAP: https://success.camden.rutgers.edu/success-services/disability-services/

    civic engagement component (opportunity for extra credit!)

    Start early. Start thinking today about how you want to engage civically. Universities and social science should serve society. You are encouraged to engage with the local community.

    The idea is that you engage civically using research methods. There are several ways to do it. Ideally, you will partner with a local organization, obtain data from them, do some analysis, and present the results to them. You may also use government data, say from the Census Bureau, and present relevant information to locals. A local organization can be a Rutgers research institute such as WRI, CURE, LEAP or any other organization such as a school, a soup kitchen, or CamConnect. The Rutgers Office of Civic Engagement may be able to help you contact them. The key idea is partnership: you will use tools from this class to produce output useful to the local community. This is similar to taking the role of an apprentice at a local organization or serving as a consultant.

    Using real world data poses challenges, which is part of the exercise. Presenting your findings to stakeholders outside of a class is also challenging. At the same time, it is fairly easy to contribute locally by using simple tools learned in this class. For instance, a simple comparison of means between two schools in Camden can be revealing and helpful locally.
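A minimal sketch of the kind of comparison of means mentioned above. The schools and scores are entirely made up for illustration; they are not real data about any school.

```python
import pandas as pd

# Entirely hypothetical data: two unnamed schools, X and Y, with made-up scores.
df = pd.DataFrame({
    "school": ["X", "X", "X", "Y", "Y", "Y"],
    "score": [70, 80, 90, 60, 65, 70],
})

# Mean score per school, and the difference between the two means.
means = df.groupby("school")["score"].mean()
diff = means["X"] - means["Y"]

print(means)
print(diff)
```

With real data it would also matter how many students are behind each mean and how much scores vary within each school.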

    An obvious way would be to use data at your workplace or at a workplace of someone you know. However, you need to make sure that it serves society in some way. For instance, it would be straightforward if you work at a hospital or school or fire department; but it would be difficult if you work at Starbucks.